## Web Scraping using Selenium

Selenium is a portable framework for testing web applications. It provides a playback tool for authoring functional tests without the need to learn a test scripting language, and it can also be used to scrape data from web pages. For further details, please refer to https://www.guru99.com/selenium-tutorial.html

**YouTube**:
To get an overall understanding of web scraping, please watch https://youtu.be/nN0OD6HLDJk

In [9]:
from IPython.display import HTML
import warnings
warnings.filterwarnings('ignore')

# Youtube
HTML('<iframe width="980" height="340" src="https://www.youtube.com/embed/nN0OD6HLDJk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Scraping IPL data for most runs across seasons from iplt20.com
***
This notebook focuses on scraping IPL data from iplt20.com using Selenium as the browser driver. We extract the **Most Runs** and **Most Wickets** tables across seasons for all the teams.
***

<img src="https://cloudfront.timesnownews.com/media/Orange_32.jpg" width="1000" />

## Installing selenium package

In [2]:
# Installing selenium
'''
use this pip install to download the packages of selenium 
'''
! pip install selenium



## Importing Libraries

In [5]:
'''
Importing the libraries that we'll need in the later steps
'''

from time import sleep
import sys
import pandas as pd 
import numpy as np
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders

## Chrome Driver path

In [6]:
'''
Passing path of the chrome webdriver
'''
chromedriver_path = r'C:\Users\KUSH\Desktop\WebScraping\chromedriver.exe'

In [4]:
'''
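The driver starts a Chrome WebDriver session, which gives us an automated way to visit a specific page.
Driving a real browser is mostly needed when a page asks you to log in
'''
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object; executable_path is the older Selenium 3 form
driver = webdriver.Chrome(service=Service(chromedriver_path))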
# driver.maximize_window()
# sleep(10)
# driver.get("https://www.iplt20.com")
# sleep(10)
# driver.find_elements_by_xpath("//div[contains(@class, 'main-nav__drop-down')]")[2].click()
# sleep(2)
# driver.find_elements_by_xpath("//a[contains(@class, 'main-nav__link') and contains(text(), 'By Season')]")[0].click()
# sleep(5)


## Defining page_scrape() function that will perform the scraping





**page_scrape()** scrapes the data from the table present on the currently loaded **URL**. It takes two arguments:
- **Season**
- **Dataframe**

Season : iplt20.com has data from 2008 to 2021, so we run the function once for each of those years
***
Dataframe : We extract two different sets of data, "Most Runs" and "Most Wickets", and each set is collected in its own dataframe
***

In [21]:
'''
Defining the main function for scraping data. It reads the text of every table row on the page and appends it to a
dataframe, one row per iteration
'''
from selenium.webdriver.common.by import By

def get_text(webelement):
    # Replace newline characters so that a name reads as one string, e.g. "Shikhar\nDhawan" -> "Shikhar Dhawan"
    return webelement.get_attribute("innerText").replace("\n", " ")

def page_scrape(season, df):
    '''
    Scrape the table on the currently loaded page for a particular year. This takes two arguments:
    1. season -> year that the loaded page belongs to
    2. df -> dataframe to append to (an empty dataframe with the expected columns)
    '''

    table_rows = driver.find_elements(By.XPATH, "//table//tr[contains(@class, 'js-row')]")
    for ele in table_rows:
        row_data = list(map(get_text, ele.find_elements(By.TAG_NAME, "td")))
        # The team code comes from the row's class attribute: third token for the row ranked 1, second token otherwise
        team_index = 2 if row_data[0] == '1' else 1
        row_data.extend([ele.get_attribute('data-nationality'), season, ele.get_attribute('class').split()[team_index]])
        df.loc[len(df)] = row_data
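
To see what the two helpers do without opening a browser, here is a minimal sketch that replaces the Selenium element with a stand-in object. `FakeElement`, `team_from_class` and the class strings passed to it are assumptions made for illustration only; the real rows on iplt20.com may carry different class tokens.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only get_attribute is modelled."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


def get_text(webelement):
    # Same logic as the notebook helper: join a two-line name into one string
    return webelement.get_attribute("innerText").replace("\n", " ")


def team_from_class(position, class_attribute):
    # Mirrors page_scrape: third class token for the row ranked 1, second token otherwise
    index = 2 if position == '1' else 1
    return class_attribute.split()[index]


cell = FakeElement({"innerText": "Shikhar\nDhawan"})
player = get_text(cell)

# Assumed class strings, chosen only to show the indexing
top_team = team_from_class('1', "js-row highlighted DC")
other_team = team_from_class('2', "js-row PBKS")

print(player, top_team, other_team)
```

The top-ranked row is treated differently because, in this scraper's logic, it carries one extra class token before the team code.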


## Scraping data of most runs across years in IPL 

****
This section calls the page_scrape function to scrape the **MOST RUNS** table
****


In [24]:
'''
The IPL website doesn't ask for a login ID or password, so we can extract the data directly from the URL of the
page where the table is present.

This snippet scrapes the most runs scored by players from all the teams across the different seasons of IPL
'''
from selenium.webdriver.common.by import By

season_url = "https://www.iplt20.com/stats/{}/most-runs"
driver.get(season_url.format(2021))

# Read the column names from the table header of the 2021 page
table_heading = driver.find_element(By.XPATH, "//table//tr[contains(@class, 'top-players__header')]")
column_values = table_heading.get_attribute("innerText").split()
column_values.extend(['Nationality', 'Season', 'Team'])

df_runs = pd.DataFrame(columns = column_values)

for season in range(2021, 2007, -1):
    driver.get(season_url.format(season))
    page_scrape(season, df_runs)
    sys.stdout.write('\rData scraping completed for year \033[0;37;40m %d ' %season)
    sys.stdout.flush()

Data scraping completed for year 2008
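
As a quick check of which pages the loop visits, the sketch below builds the season URLs without opening a browser. It reuses the URL template and the `range(2021, 2007, -1)` call from the cell above; the names `seasons` and `urls` are only illustrative.

```python
season_url = "https://www.iplt20.com/stats/{}/most-runs"

# range(2021, 2007, -1) counts down from 2021 and stops before 2007, so 2008 is the last season
seasons = list(range(2021, 2007, -1))
urls = [season_url.format(season) for season in seasons]

print(len(urls))   # number of seasons visited
print(urls[0])     # first page requested
print(urls[-1])    # last page requested
```

Because the stop value of `range` is exclusive, the loop ends with 2008, which matches the last year shown in the progress message.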

In [25]:
# Adding an extra column named 'status' that tells us whether the player was out when he made his highest score of the
# season

df_runs['status'] = np.where(df_runs['HS'].str.contains(r'[*]'), 'Not Out', 'Out') # a '*' in HS marks a not-out innings
df_runs.head() # displaying top 5 rows from the dataframe

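| | POS | PLAYER | Mat | Inns | NO | Runs | HS | Avg | BF | SR | 100 | 50 | 4s | 6s | Nationality | Season | Team | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Shikhar Dhawan | 8 | 8 | 1 | 380 | 92 | 54.28 | 283 | 134.27 | 0 | 3 | 43 | 8 | Indian | 2021 | DC | Out |
| 1 | 2 | KL Rahul | 7 | 7 | 2 | 331 | 91* | 66.2 | 243 | 136.21 | 0 | 4 | 27 | 16 | Indian | 2021 | PBKS | Not Out |
| 2 | 3 | Faf du Plessis | 7 | 7 | 2 | 320 | 95* | 64.0 | 220 | 145.45 | 0 | 4 | 29 | 13 | Overseas | 2021 | CSK | Not Out |
| 3 | 4 | Prithvi Shaw | 8 | 8 | 0 | 308 | 82 | 38.5 | 185 | 166.48 | 0 | 3 | 37 | 12 | Indian | 2021 | DC | Out |
| 4 | 5 | Sanju Samson | 7 | 7 | 1 | 277 | 119 | 46.16 | 190 | 145.78 | 1 | 0 | 26 | 11 | Indian | 2021 | RR | Out |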
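
The 'status' rule can be tried on a small hand-made dataframe. The sketch below uses the five HS values shown in the table above; the `sample` dataframe and the `highest_scores` series are illustrative names that are not part of the notebook, and the numeric conversion is an optional extra step.

```python
import numpy as np
import pandas as pd

# HS values copied from the first five rows of the scraped table
sample = pd.DataFrame({
    "PLAYER": ["Shikhar Dhawan", "KL Rahul", "Faf du Plessis", "Prithvi Shaw", "Sanju Samson"],
    "HS": ["92", "91*", "95*", "82", "119"],
})

# regex=False treats '*' as a literal character, which is equivalent to the r'[*]' pattern
sample["status"] = np.where(sample["HS"].str.contains("*", regex=False), "Not Out", "Out")

# Optional: strip the marker to get the highest score as a number
highest_scores = sample["HS"].str.rstrip("*").astype(int)

print(sample)
print(highest_scores.tolist())
```

Keeping HS as text preserves the not-out marker, while the stripped copy can be used for sorting or arithmetic.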


## Scraping data of most wickets across years in IPL 

***
This section calls the page_scrape function to scrape the **MOST WICKETS** table
***


In [26]:
'''
This snippet scrapes the most wickets taken by players from all the teams across the different seasons of IPL
'''
from selenium.webdriver.common.by import By

season_url = "https://www.iplt20.com/stats/{}/most-wickets"
driver.get(season_url.format(2021))

# Read the column names from the table header of the 2021 page
table_heading = driver.find_element(By.XPATH, "//table//tr[contains(@class, 'top-players__header')]")
column_values = table_heading.get_attribute("innerText").split()
column_values.extend(['Nationality', 'Season', 'Team'])

df_wickets = pd.DataFrame(columns = column_values)

for season in range(2021, 2007, -1):
    driver.get(season_url.format(season))
    page_scrape(season, df_wickets)
    sys.stdout.write('\rData scraping completed for year \033[0;37;40m %d ' %season)
    sys.stdout.flush()

df_wickets.head() # displaying top 5 rows from the dataframe

Data scraping completed for year 2008

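| | POS | PLAYER | Mat | Inns | Ov | Runs | Wkts | BBI | Avg | Econ | SR | 4w | 5w | Nationality | Season | Team |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harshal Patel | 7 | 7 | 28 | 257 | 17 | 5/27 | 15.11 | 9.17 | 9.88 | 0 | 1 | Indian | 2021 | RCB |
| 1 | 2 | Avesh Khan | 8 | 8 | 30 | 231 | 14 | 3/32 | 16.5 | 7.7 | 12.85 | 0 | 0 | Indian | 2021 | DC |
| 2 | 3 | Chris Morris | 7 | 7 | 26 | 224 | 14 | 4/23 | 16.0 | 8.61 | 11.14 | 1 | 0 | Overseas | 2021 | RR |
| 3 | 4 | Rahul Chahar | 7 | 7 | 28 | 202 | 11 | 4/27 | 18.36 | 7.21 | 15.27 | 1 | 0 | Indian | 2021 | MI |
| 4 | 5 | Rashid Khan | 7 | 7 | 28 | 172 | 10 | 3/36 | 17.2 | 6.14 | 16.8 | 0 | 0 | Overseas | 2021 | SRH |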


## Writing dataframes to an Excel sheet 
***
This writes the two dataframes to separate sheets of a single Excel file
***

In [10]:
# The with-block saves and closes the Excel file when it ends (writer.save() is no longer available in pandas 2.x).
with pd.ExcelWriter('ipl_data.xlsx', engine='xlsxwriter') as writer:
    # Write each dataframe to a different worksheet.
    df_runs.to_excel(writer, sheet_name='most runs')
    df_wickets.to_excel(writer, sheet_name='most wickets')

## Sending mail of the table
***
This sends the generated report by mail to the provided address. For the purpose of this example, I've used both of my email IDs
***

In [11]:
'''
Sending mail with attached excel file that we created in the above snippet
'''

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders
import getpass

fromaddr = 'uditpandey97@gmail.com'
toaddr = ['uditpandeyofficial@gmail.com']
password = getpass.getpass(prompt='Please enter your account password: ')

msg = MIMEMultipart()

msg['From'] = fromaddr
msg['To'] = ", ".join(toaddr)
msg['Subject'] = "IPL Data" #Subject of the mail

body = "Hey, please find attached document (Excel) with data for most runs and most wickets"

msg.attach(MIMEText(body, 'plain'))

filename = "IPL_Data.xlsx"  # file name that would show up on the mail

part = MIMEBase('application', 'octet-stream')
# Read the Excel file inside a with-block so that it is closed even if reading fails
with open(r"C:\Users\KUSH\Desktop\WebScrapping\ipl_data.xlsx", "rb") as attachment:
    part.set_payload(attachment.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment', filename=filename)

msg.attach(part)

server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
try:
    server.login(fromaddr, password)
    server.sendmail(fromaddr, toaddr, msg.as_string())
except smtplib.SMTPAuthenticationError:
    print("Your password is wrong")
except smtplib.SMTPException as ex:
    # Any other SMTP failure (refused recipient, dropped connection, ...) is reported as it is
    print("Mail could not be sent:", ex)
else:     # Runs only if every statement inside the try block succeeded without raising an exception
    print("Request successfully submitted")
    print("\nMail has been sent")
finally:
    server.quit()

Please enter your account password: ········
Request successfully submitted

Mail has been sent
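
The message can also be built and inspected without contacting a mail server. The sketch below uses the standard library's `email.message.EmailMessage` class as an alternative to the `MIMEMultipart` and `MIMEBase` calls above; the `build_report_message` helper, the example.com addresses and the placeholder bytes are assumptions for illustration, not part of the original notebook.

```python
from email.message import EmailMessage


def build_report_message(sender, recipients, attachment_bytes, attachment_name):
    """Hypothetical helper: build the report mail without sending it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "IPL Data"
    msg.set_content("Hey, please find attached document (Excel) with data for most runs and most wickets")
    # add_attachment base64-encodes the bytes and sets the Content-Disposition header for us
    msg.add_attachment(
        attachment_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=attachment_name,
    )
    return msg


# Placeholder bytes stand in for the contents of ipl_data.xlsx
message = build_report_message(
    "sender@example.com", ["recipient@example.com"], b"placeholder bytes", "IPL_Data.xlsx"
)
attachments = list(message.iter_attachments())

print(message["To"])
print(attachments[0].get_filename())
```

A message built this way can be passed to `server.send_message(message)` on an `smtplib.SMTP` connection instead of calling `sendmail` with `msg.as_string()`.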
