# Dynamic Web Scraping for Sports Data
In this notebook, we will expand on the concepts from the static web scraping notebook. We will still use the Premier League's official site, this time scraping every match result so far this season. The added complexity is that the results load dynamically, so static scraping alone cannot reach all the data we want.

The site we are working with is shown below. This page uses lazy loading, a principle that ensures that content is only loaded when a user requests it. In this case, "requesting" the data means scrolling down enough so that the end of the currently loaded content is reached. However, we want all the content in our dataset, and we will automate web actions with Selenium to access it.

![](images/pl-results.png)

We will use Selenium, a browser automation library suited to many use cases, from web scraping (our use case) to website testing. The cell below installs Selenium along with BeautifulSoup and pandas, which we use later for parsing.

In [None]:
%pip install selenium
%pip install beautifulsoup4
%pip install pandas

The key difference between Selenium and BeautifulSoup is the static vs. dynamic nature of the scraping. With BeautifulSoup, we retrieve a page's HTML as a snapshot in time and pass it into a BeautifulSoup object to be parsed. After we make the GET request, that HTML will not change on its own; it's static. With Selenium, on the other hand, we use a webdriver to start an instance of a web browser. This API gives us dynamic access to the browser, meaning that we can interact with the site in real time, as a human might. In our use case, we face a simple but crucial problem: not all of our data will load until we dismiss several pop-ups and scroll down the page. Once we do this, we can simply retrieve the HTML of the page and then use BeautifulSoup as we did in the last example.

We will be using the webdriver and Service APIs from Selenium to start. 

In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

We will be using the Premier League's results page to make a simple dataset that has information on every match in the league so far this season.

In [3]:
results_url = "https://www.premierleague.com/results"

The function below creates a webdriver, which gives us a programmatic entrypoint into the web browser. 

The code below uses Chrome, though other browsers can be used (for more information: https://selenium-python.readthedocs.io/installation.html#drivers)

*Note: in order to run the code, your ChromeDriver version must be identical to the version of Chrome installed on your device.*

To download ChromeDriver:
- Chrome version <=114: https://chromedriver.chromium.org/downloads
- Chrome version >114: https://googlechromelabs.github.io/chrome-for-testing/

In the repository, I currently have ChromeDriver version 121.X.XXXX.XX.
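A quick way to catch a mismatch before launching the browser is to compare version strings. The helper below is a hypothetical sketch, not part of Selenium. It assumes versions look like `121.0.6167.85` and, by default, that agreeing on the major version is what matters, which is a rule of thumb rather than a guarantee; pass a larger `parts` value for a stricter check.

```python
def versions_compatible(chrome_version, driver_version, parts=1):
    """Return True when the leading `parts` version components are equal."""
    chrome_prefix = chrome_version.strip().split(".")[:parts]
    driver_prefix = driver_version.strip().split(".")[:parts]
    return chrome_prefix == driver_prefix

# Example version strings (made up for illustration)
same_major = versions_compatible("121.0.6167.85", "121.0.6167.184")
different_major = versions_compatible("121.0.6167.85", "114.0.5735.90")
print(same_major, different_major)
```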

This function returns a webdriver.Chrome instance using the ./chromedriver.exe path; therefore, place your chromedriver.exe file in the base directory of this project. An example chromedriver.exe file is already present, but ensure you use the correct version to avoid errors.

We also have the concept of headless mode below. By default, our webdriver opens a visible, interactive browser window. Headless mode lets the Chrome instance run without a UI. This can cause some differences in behavior at times, so in this case we will keep headless off. However, it is a useful feature for running web automation tasks in the background and saving resources.

In [4]:
def get_driver(headless: bool = False):
    # Path to the chromedriver executable
    chromedriver_path = './chromedriver.exe'

    # Set headless mode if specified
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument('--headless')

    # Start and return the Chrome instance
    return webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)

Below is a simple instance of initializing a driver, making a get request, and then quitting the session.

In [5]:
driver = get_driver()

# Navigate to the url
driver.get(results_url)

# Quit the driver
driver.quit()

### By Class
Using the "By" class, we can access elements by many different methods:
- ID = "id"
- NAME = "name"
- XPATH = "xpath"
- LINK_TEXT = "link text"
- PARTIAL_LINK_TEXT = "partial link text"
- TAG_NAME = "tag name"
- CLASS_NAME = "class name"
- CSS_SELECTOR = "css selector"

This class is used in conjunction with the find_element() and find_elements() APIs to give us different ways of specifying criteria in which to look for elements in the dynamic DOM.
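
Each name in the list above is just a string, and a locator is a `(strategy, value)` pair. As a rough illustration, the sketch below builds three locators that should select the same element by its id. `id_locators` is a hypothetical helper written for this notebook, not part of Selenium.

```python
def id_locators(element_id):
    """Return equivalent (strategy, value) locators for one element id."""
    return {
        "id": ("id", element_id),
        "css selector": ("css selector", f'[id="{element_id}"]'),
        "xpath": ("xpath", f'//*[@id="{element_id}"]'),
    }

# "advertClose" is the advert popup id used in the cells that follow
locators = id_locators("advertClose")
for strategy, value in locators.values():
    print(strategy, value)
```

With a live driver, any of these pairs could be unpacked into a lookup, for example `driver.find_element(*locators["xpath"])`.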

In [6]:
from selenium.webdriver.common.by import By

Below, we will use the find_element() and click() APIs. The click() function does exactly what you might expect - given an element, it will perform a click on it. The code below closes two popups that show up when you enter the site without any cookies.

In [7]:
driver = get_driver()
driver.get(results_url)

accept_cookies_id, close_advert_id = "onetrust-accept-btn-handler", "advertClose"

driver.find_element(By.ID, accept_cookies_id).click()
driver.find_element(By.ID, close_advert_id).click()

driver.quit()

### Explicit and Implicit Waits
With dynamic websites, content loads from a variety of sources, so we can't expect every element we want to be present immediately. We often do know roughly how long an element should take to appear in the DOM, so we can use waits to tell Selenium how long to keep looking before giving up.

- Explicit: wait up to a set amount of time for a specific condition, such as a certain element being present.
- Implicit: a driver-wide setting; every element lookup keeps retrying for up to a set amount of time.

We will use explicit waits. They give us more control over the code and avoid problems from a single wait time being too long or too short for individual cases.
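
Conceptually, an explicit wait is a polling loop with a deadline: check a condition, and if it is not yet met, pause briefly and try again until time runs out. The sketch below illustrates that idea in plain Python. It is not Selenium's implementation; `wait_until`, `timeout`, and `poll_interval` are names chosen for this example.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Stand-in for "the element has appeared": becomes truthy on the third check
attempts = {"count": 0}

def element_appeared():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None

found = wait_until(element_appeared, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

The loop returns as soon as the condition passes, so a generous timeout costs nothing when the element shows up quickly.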

### Expected Conditions

This API can be used in conjunction with waits: we wait either for an expected condition to be true or until the time limit is exceeded.

Examples of Expected Conditions (EC):
- title_is
- title_contains
- presence_of_element_located
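
Beyond the built-in conditions, `.until()` also accepts any callable that takes the driver and returns a truthy value once the wait should end. As a hedged sketch, the hypothetical `at_least_n_elements` below builds such a condition. `FakeDriver` is a stand-in so the example runs without a browser, and the `.match-fixture` selector is borrowed from the parsing code later in this notebook.

```python
def at_least_n_elements(locator, n):
    """Build a condition that passes once `locator` matches at least `n` elements."""
    def condition(driver):
        elements = driver.find_elements(*locator)
        # Falsy until enough elements exist, so a wait keeps polling
        return elements if len(elements) >= n else False
    return condition

class FakeDriver:
    """Stand-in for a webdriver so the example runs without a browser."""
    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return list(self._elements)

condition = at_least_n_elements(("css selector", ".match-fixture"), 3)
too_few = condition(FakeDriver(["a", "b"]))
enough = condition(FakeDriver(["a", "b", "c"]))
print(too_few, enough)
```

With a real driver this could be used as `WebDriverWait(driver, 10).until(at_least_n_elements((By.CSS_SELECTOR, ".match-fixture"), 100))`, where 100 is an arbitrary threshold.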

Below, we will use two waits, building on the last example. We simply wait for the presence of the elements we are trying to find, allowing up to 10 seconds before a TimeoutException is raised. Each expected condition is a callable that we pass to the .until() method of a WebDriverWait object. The wait calls it repeatedly and, as soon as it returns a truthy value (here, the located element), returns that value, on which we can call click().


In [8]:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = get_driver()
driver.get(results_url)

accept_cookies_id, close_advert_id, invalid_element_id = "onetrust-accept-btn-handler", "advertClose", "not-an-element"

try:
    # Wait for each popup to be present in the DOM, then click it
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, accept_cookies_id))
    ).click()

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, close_advert_id))
    ).click()

    # THE CODE BELOW WILL FAIL
    # Uncomment it to see the wait time out (TimeoutException) on an invalid element
    # WebDriverWait(driver, 10).until(
    #     EC.presence_of_element_located((By.ID, invalid_element_id))
    # ).click()
finally:
    driver.quit()

### Using XPaths
We can also navigate the DOM with XPaths, which give us a different but similar way of selecting elements. XPath was designed for XML but generally works well with HTML pages. We will not go into depth on these concepts here, but you can read more about them below. We will use one XPath example in our next code block.

More information: 
- https://www.w3schools.com/xml/xpath_intro.asp
- https://scrapfly.io/blog/parsing-html-with-xpath/
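
To get a feel for the syntax without opening a browser, the sketch below runs two XPath expressions with Python's built-in `xml.etree.ElementTree`. That module supports only a limited subset of XPath (for example, no `contains()`), whereas Selenium hands the expression to the browser's fuller XPath engine. The markup and team names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup loosely modeled on a results page
snippet = """
<div class="fixtures">
  <div class="match" data-home="Team A" data-away="Team B"><span>1-0</span></div>
  <div class="match" data-home="Team C" data-away="Team D"><span>2-2</span></div>
</div>
"""

root = ET.fromstring(snippet)

# Every <div> descendant whose class attribute is exactly "match"
matches = root.findall(".//div[@class='match']")
home_teams = [match.get("data-home") for match in matches]

# Attribute predicate plus a child step: the <span> inside Team A's match
score = root.find(".//div[@data-home='Team A']/span").text

print(home_teams, score)
```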

### Scrolling using ActionChains
We can use Selenium's ActionChains class for various actions in the browser. In this case, we will pass the element returned by the wait to ActionChains in order to scroll to it, in turn loading the rest of the content on the page.

We will use XPath to check for the existence of an element containing the text of the date that we want to scroll down to. Once we confirm this is true, we can scrape the HTML data from the page, close the browser, and parse as we did previously with BeautifulSoup, eventually creating a Pandas dataframe that is printed below.

In [34]:
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup as bs
import pandas as pd

driver = get_driver()
driver.get(results_url)

accept_cookies_id, close_advert_id = "onetrust-accept-btn-handler", "advertClose"

try:
    # Dismiss the cookie banner and the advert popup once each is present
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, accept_cookies_id))
    ).click()

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, close_advert_id))
    ).click()

    # Scroll to footer to activate JavaScript load of data
    ActionChains(driver).scroll_to_element(
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "footer"))
        )
    ).perform()

    # waiting until all the data has loaded onto the page
    date_string_to_find = "Friday 11 August 2023"
    date_xpath = f"//*[contains(text(),'{date_string_to_find}')]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, date_xpath))
    )

    # Once we fetch the HTML, we can use BeautifulSoup to parse the page
    pl_html = driver.find_element(By.TAG_NAME, "html").get_attribute("innerHTML")

    soup = bs(pl_html, "html.parser")

    # Repeat each date once per match listed under it
    dates = []
    for date in soup.select(".fixtures__date-container"):
        match_date = date.find("time").text
        dates.extend([match_date] * len(date.select(".match-fixture")))

    # Team names and venue are stored as data attributes on each match element
    match_list = soup.select(".matchList > .match-fixture")
    home = [match["data-home"] for match in match_list]
    away = [match["data-away"] for match in match_list]
    stadium = [match["data-venue"] for match in match_list]

    # Each score is split on "-" into its home and away parts
    scores = [i.text.split("-") for i in soup.select(".match-fixture__score")]
    home_score = [score[0] for score in scores]
    away_score = [score[1] for score in scores]

    df = pd.DataFrame({
        "date": dates,
        "home": home,
        "away": away,
        "home_score": home_score,
        "away_score": away_score,
        "stadium": stadium,
    })
    print(df)
except Exception as e:
    print("There was an error.")
    print(e)
finally:
    # Always close the browser, even if a step above failed
    driver.quit()

                          date           home            away home_score   
0     Tuesday 20 February 2024       Man City       Brentford          1  \
1      Monday 19 February 2024        Everton  Crystal Palace          1   
2      Sunday 18 February 2024  Sheffield Utd        Brighton          0   
3      Sunday 18 February 2024          Luton         Man Utd          1   
4    Saturday 17 February 2024      Brentford       Liverpool          1   
..                         ...            ...             ...        ...   
244    Saturday 12 August 2023       Brighton           Luton          4   
245    Saturday 12 August 2023        Everton          Fulham          0   
246    Saturday 12 August 2023  Sheffield Utd  Crystal Palace          0   
247    Saturday 12 August 2023      Newcastle     Aston Villa          5   
248      Friday 11 August 2023        Burnley        Man City          0   

    away_score                             stadium  
0            0          Etihad Sta
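
The score columns above come from splitting text, so they hold strings rather than numbers. A minimal sketch of converting a scoreline to integers is shown below. `parse_score` is a hypothetical helper, and it assumes every scoreline has the form `home-away` with whole-number goals.

```python
def parse_score(score_text):
    """Split a scoreline such as '2-1' into integer (home, away) goals."""
    home_goals, away_goals = score_text.split("-")
    return int(home_goals.strip()), int(away_goals.strip())

# Made-up scorelines for illustration
parsed = [parse_score(text) for text in ["2-1", " 0 - 0 ", "4-3"]]
print(parsed)
```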

Selenium is a powerful library, and we only scratched the surface of it today. You can also run JavaScript in the browser, simulate keyboard and mouse input, and interact with a wide variety of dynamic elements on the page. Your specific use case will determine what you need, but the concepts in this notebook, although simple, can get you through most actions needed to retrieve data from a dynamic webpage.
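
As one illustration of running JavaScript through the driver, a common pattern for lazy-loaded pages is to scroll to the bottom until the page height stops growing. The sketch below assumes a driver object with Selenium's `execute_script` method. `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names, and `FakePage` is a stand-in so the example runs without a browser.

```python
import time

def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # number of scrolls performed, including the final check
        last_height = new_height
    return max_rounds

class FakePage:
    """Stand-in driver: the 'page' grows on the first two scrolls, then stops."""
    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            self.index = min(self.index + 1, len(self.heights) - 1)
            return None
        return self.heights[self.index]

rounds_needed = scroll_until_stable(FakePage(), pause=0)
print(rounds_needed)
```

With a real driver, `pause` would need to be long enough for new results to load, which depends on the site and the connection.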

For more information:
- https://selenium-python.readthedocs.io/