# Notebook Summary: Scraping and Analyzing EPL Match Data

This notebook scrapes English Premier League (EPL) match data from understat.com for the 2014 through 2023 seasons. The data is then flattened into a single table, inspected, and exported to CSV using Python libraries such as `requests`, `BeautifulSoup`, `pandas`, and `json`.

## Libraries Used
- `requests`: Used for making HTTP requests to retrieve HTML content.
- `BeautifulSoup`: Utilized for HTML parsing.
- `re`: Regular expressions module; imported, although the cells shown here never call it.
- `json`: For handling JSON data.
- `codecs`: Decodes the escaped (`unicode_escape`) string embedded in the page's script tag.
- `pandas`: A powerful data manipulation library.
- `json_normalize`: Part of pandas, employed to normalize JSON data into tabular form.
- `time`: Included for introducing delays between web requests.

## Key Functions and Steps

1. **`get_season_html(season)`:**
   - Constructs the URL from the league (EPL) and the specified season.
   - Sends an HTTP GET request to that URL.
   - Returns the raw HTML bytes from the response (`response.content`).

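2. **`parse_html_content(html_content)`:**
   - Parses the HTML content with BeautifulSoup.
   - Finds all script tags in the HTML.
   - Takes the script tag at index 2, decodes its escaped contents with `unicode_escape`, slices off the surrounding JavaScript (`[30:-4]`), and loads the remaining JSON.
   - Both the tag index and the slice offsets are hard-coded, so they need adjusting if the page layout changes.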

3. **`normalized_dataframe(teams_data)`:**
   - Builds one DataFrame per team, with one row per match in the team's 'history'.
   - Flattens the 'history' column with `json_normalize`, so nested fields become dotted columns such as `ppda.att`.
   - Concatenates the flattened columns with the remaining team-level columns.
   - Returns a list of DataFrames, one per team.

4. **Seasonal Data Scraping and Normalization:**
   - Defines a list of EPL seasons.
   - Iterates through each season, fetching HTML content, parsing it, and creating normalized DataFrames.
   - Waits 5 seconds between seasons so that requests are not sent in rapid succession.

5. **Concatenating DataFrames:**
   - Combines all the individual DataFrames into a single DataFrame, `final_df`.
   - Displays the shape of the final DataFrame (7,246 rows by 23 columns).

6. **Data Inspection:**
   - Displays the first and last rows of the DataFrame.
   - Shows a sample of 5 random rows.

7. **Data Export:**
   - Exports the final DataFrame to `./data/scraped_match_data.csv`.


In [160]:
import requests
from bs4 import BeautifulSoup
import re
import json
import codecs
import pandas as pd
from pandas import json_normalize
import time  

In [154]:
# Define EPL seasons; each value is the year a season starts (2014 means 2014/15)
seasons = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

In [159]:
def get_season_html(season):
    # Construct the URL based on the league (EPL) and season
    url = f"https://understat.com/league/EPL/{season}"

    # Send an HTTP GET request; the timeout keeps a stalled connection from hanging the notebook
    response = requests.get(url, timeout=30)

    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()

    # Return the raw response body (bytes), which BeautifulSoup can parse directly
    return response.content
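
A single failed request currently stops the whole scrape. The following is a minimal sketch of a retry wrapper, not part of the original notebook: `fetch_with_retries` and its `attempts`, `delay`, and `sleep` parameters are hypothetical names, and the stand-in fetcher only simulates failures so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, *args, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch(*args), retrying with a growing pause whenever it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(delay * attempt)  # 5 s, then 10 s, ... with the defaults


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = []
pauses = []


def flaky_fetch(season):
    calls.append(season)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return b"<html>ok</html>"


# Record the pauses instead of actually sleeping
result = fetch_with_retries(flaky_fetch, 2014, sleep=pauses.append)
print(result, pauses)
```

In the notebook, the same wrapper could plausibly be used as `fetch_with_retries(get_season_html, season)`, since it retries on any exception the wrapped call raises.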


In [157]:
def parse_html_content(html_content):
    # Parse HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all script tags in the HTML
    scripts = soup.find_all('script')

    # Access the script tag at index 2 (change index if needed)
    target_script = scripts[2]

    # Convert the script content to a string
    target_string = str(target_script.contents[0])

    # Decode the string using unicode_escape
    cleaned_string = codecs.decode(target_string, 'unicode_escape')

    # Extract the relevant JSON data from the decoded string
    # (Note: The specific indices [30:-4] may need adjustment based on the data structure)
    teams_data = json.loads(cleaned_string[30:-4])

    # Return the extracted teams_data
    return teams_data
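
The fixed offsets `[30:-4]` break as soon as the surrounding JavaScript changes by a single character. As a hedged alternative, the payload could be located with a regular expression. This sketch assumes the script assigns the data as `var teamsData = JSON.parse('...')`; that layout and the variable name are assumptions about the page, and `extract_embedded_json` is a hypothetical helper, not part of the notebook.

```python
import codecs
import json
import re

# Assumed page layout: var <name> = JSON.parse('<escaped JSON>');
JSON_PARSE_PATTERN = re.compile(r"var\s+(\w+)\s*=\s*JSON\.parse\('(.*?)'\)", re.DOTALL)


def extract_embedded_json(script_text, var_name="teamsData"):
    """Return the decoded JSON assigned to var_name inside script_text."""
    for name, payload in JSON_PARSE_PATTERN.findall(script_text):
        if name == var_name:
            # Same decoding step as parse_html_content: turn \xNN escapes into characters
            return json.loads(codecs.decode(payload, "unicode_escape"))
    raise ValueError(f"no JSON.parse assignment found for {var_name!r}")


# Hand-made script body in the assumed layout (\x7B is "{", \x22 is a double quote)
sample_script = "\n\tvar teamsData = JSON.parse('\\x7B\\x22id\\x22\\x3A\\x2271\\x22\\x7D');\n"
print(extract_embedded_json(sample_script))
```

Because the helper matches on the variable name, it could be applied to the text of each script tag in turn, which would also remove the dependence on index 2.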


In [158]:
def normalized_dataframe(teams_data):
    # Create an empty list to store individual team DataFrames
    teams_normalized_dfs = []

    # Iterate through each team's data
    for team_id, team_data in teams_data.items():
        # Create a DataFrame from the team's data
        team_df = pd.DataFrame(team_data)

        # Normalize the 'history' column using json_normalize and concatenate it with the original DataFrame
        team_normalized_df = pd.concat([team_df.drop(['history'], axis=1), 
                                        json_normalize(team_df['history'])], axis=1)

        # Append the normalized DataFrame to the list
        teams_normalized_dfs.append(team_normalized_df)

    # Return the list of per-team DataFrames
    return teams_normalized_dfs
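
To make the flattening step concrete, here is a small sketch on a hand-made record. The record is hypothetical and only imitates the shape implied by the function above (team-level fields plus a 'history' list of match dicts). It also shows a slightly different route to the same layout, normalizing the list first and then attaching the team fields.

```python
import pandas as pd

# Hypothetical record shaped like one value of teams_data
team_data = {
    "id": "71",
    "title": "Example FC",
    "history": [
        {"h_a": "a", "xG": 0.9, "ppda": {"att": 323, "def": 23}},
        {"h_a": "h", "xG": 0.5, "ppda": {"att": 326, "def": 21}},
    ],
}

# One row per match; the nested 'ppda' dict becomes 'ppda.att' and 'ppda.def'
matches = pd.json_normalize(team_data["history"])

# Attach the team-level fields as leading columns
matches.insert(0, "title", team_data["title"])
matches.insert(0, "id", team_data["id"])

print(matches.columns.tolist())
```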


In [162]:
# Create an empty list to store normalized DataFrames
normalized_dfs = []

# Iterate through each season
for season in seasons:
    # Fetch HTML content for the current season
    season_html_content = get_season_html(season)

    # Parse HTML content to obtain data
    season_parsed_data = parse_html_content(season_html_content)

    # Create normalized per-team DataFrames for the current season
    season_team_dfs = normalized_dataframe(season_parsed_data)

    # Extend the list with the current season's per-team DataFrames
    normalized_dfs.extend(season_team_dfs)

    # Wait 5 seconds before fetching the next season
    time.sleep(5)

# 'normalized_dfs' now holds one DataFrame per team per season
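
The loop does not label rows with the season they came from. If the scraped data carries no such field, one option is to tag each per-team frame before it is collected. This is a sketch under that assumption: `tag_season` is a hypothetical helper, and the two tiny frames stand in for the lists returned by `normalized_dataframe()`.

```python
import pandas as pd


def tag_season(team_frames, season):
    """Return copies of the per-team frames with a 'season' column added."""
    return [frame.assign(season=season) for frame in team_frames]


# Hypothetical stand-ins for two seasons of per-team frames
frames_2014 = [pd.DataFrame({"title": ["Team A"], "pts": [3]})]
frames_2015 = [pd.DataFrame({"title": ["Team A"], "pts": [1]})]

tagged = tag_season(frames_2014, 2014) + tag_season(frames_2015, 2015)
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

`DataFrame.assign` returns a new frame, so the original per-team frames are left unchanged.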


In [163]:
# Create a single DataFrame by concatenating all individual team DataFrames
final_df = pd.concat(normalized_dfs, ignore_index=True)

In [164]:
final_df.shape

(7246, 23)
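
Beyond the shape, a few invariants are cheap to check. The sketch below assumes that `wins`, `draws`, `loses`, and `pts` are per-match integers and that points follow the league rule of 3 for a win and 1 for a draw. `check_results` is a hypothetical helper, and the rows are hand-made, with the last one deliberately inconsistent.

```python
import pandas as pd


def check_results(frame):
    """Return the rows that break basic result invariants (empty if none do)."""
    # Each match should have exactly one outcome flag set
    one_result = frame[["wins", "draws", "loses"]].sum(axis=1).eq(1)
    # Points should equal 3 per win plus 1 per draw
    points_ok = frame["pts"].eq(3 * frame["wins"] + frame["draws"])
    return frame[~(one_result & points_ok)]


# Hand-made rows; the last one claims a win but only 1 point
sample = pd.DataFrame({
    "wins": [1, 0, 0, 1],
    "draws": [0, 1, 0, 0],
    "loses": [0, 0, 1, 0],
    "pts": [3, 1, 0, 1],
})

bad_rows = check_results(sample)
print(bad_rows)
```

Applied to the scraped data, `check_results(final_df)` would be expected to come back empty if those assumptions hold.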

In [165]:
final_df.head()

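|  | id | title | h_a | xG | xGA | npxG | npxGA | deep | deep_allowed | scored | ... | date | wins | draws | loses | pts | npxGD | ppda.att | ppda.def | ppda_allowed.att | ppda_allowed.def |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 71 | Aston Villa | a | 0.909774 | 0.423368 | 0.909774 | 0.423368 | 4 | 3 | 1 | ... | 2014-08-16 15:00:00 | 1 | 0 | 0 | 3 | 0.486406 | 323 | 23 | 132 | 32 |
| 1 | 71 | Aston Villa | h | 0.507525 | 0.699295 | 0.507525 | 0.699295 | 4 | 7 | 0 | ... | 2014-08-23 12:45:00 | 0 | 1 | 0 | 1 | -0.19177 | 326 | 21 | 180 | 21 |
| 2 | 71 | Aston Villa | h | 0.639316 | 0.28888 | 0.639316 | 0.28888 | 6 | 7 | 2 | ... | 2014-08-31 13:30:00 | 1 | 0 | 0 | 3 | 0.350436 | 366 | 13 | 278 | 24 |
| 3 | 71 | Aston Villa | a | 0.701676 | 0.728097 | 0.701676 | 0.728097 | 1 | 5 | 1 | ... | 2014-09-13 17:30:00 | 1 | 0 | 0 | 3 | -0.026421 | 486 | 9 | 91 | 14 |
| 4 | 71 | Aston Villa | h | 0.649013 | 1.36224 | 0.649013 | 1.36224 | 0 | 7 | 0 | ... | 2014-09-20 15:00:00 | 0 | 0 | 1 | 0 | -0.713227 | 531 | 12 | 170 | 22 |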


In [166]:
final_df.tail()

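|  | id | title | h_a | xG | xGA | npxG | npxGA | deep | deep_allowed | scored | ... | date | wins | draws | loses | pts | npxGD | ppda.att | ppda.def | ppda_allowed.att | ppda_allowed.def |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7241 | 256 | Luton | h | 0.689037 | 2.23353 | 0.689037 | 2.23353 | 4 | 9 | 1 | ... | 2023-12-10 14:00:00 | 0 | 0 | 1 | 0 | -1.544493 | 297 | 28 | 153 | 16 |
| 7242 | 256 | Luton | h | 1.81844 | 1.41488 | 1.81844 | 1.41488 | 2 | 16 | 1 | ... | 2023-12-23 15:00:00 | 1 | 0 | 0 | 3 | 0.40356 | 311 | 28 | 117 | 22 |
| 7243 | 256 | Luton | a | 0.715575 | 3.61788 | 0.715575 | 3.61788 | 3 | 7 | 3 | ... | 2023-12-26 15:00:00 | 1 | 0 | 0 | 3 | -2.902305 | 156 | 21 | 187 | 25 |
| 7244 | 256 | Luton | h | 2.64093 | 1.57463 | 2.64093 | 1.57463 | 5 | 4 | 2 | ... | 2023-12-30 12:30:00 | 0 | 0 | 1 | 0 | 1.0663 | 247 | 32 | 222 | 25 |
| 7245 | 256 | Luton | a | 0.965167 | 1.52023 | 0.965167 | 1.52023 | 7 | 5 | 1 | ... | 2024-01-12 19:45:00 | 0 | 1 | 0 | 1 | -0.555063 | 140 | 21 | 305 | 16 |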


In [168]:
final_df.sample(5)

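|  | id | title | h_a | xG | xGA | npxG | npxGA | deep | deep_allowed | scored | ... | date | wins | draws | loses | pts | npxGD | ppda.att | ppda.def | ppda_allowed.att | ppda_allowed.def |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 628 | 89 | Manchester United | h | 1.19228 | 1.15706 | 1.19228 | 1.15706 | 4 | 5 | 0 | ... | 2015-01-11 16:00:00 | 0 | 0 | 1 | 0 | 0.03522 | 192 | 20 | 280 | 14 |
| 435 | 84 | Swansea | h | 0.601468 | 1.41578 | 0.601468 | 1.41578 | 6 | 10 | 1 | ... | 2014-12-26 15:00:00 | 1 | 0 | 0 | 3 | -0.814312 | 376 | 28 | 278 | 19 |
| 6504 | 87 | Liverpool | h | 2.1515 | 2.36567 | 2.1515 | 2.36567 | 13 | 7 | 3 | ... | 2022-10-01 14:00:00 | 0 | 1 | 0 | 1 | -0.21417 | 259 | 20 | 260 | 15 |
| 120 | 75 | Leicester | h | 1.57773 | 0.957718 | 1.57773 | 0.957718 | 1 | 9 | 2 | ... | 2014-10-04 15:00:00 | 0 | 1 | 0 | 1 | 0.620012 | 153 | 20 | 195 | 29 |
| 4958 | 86 | Newcastle United | a | 0.400364 | 2.6026 | 0.400364 | 2.6026 | 2 | 7 | 0 | ... | 2021-01-23 20:00:00 | 0 | 0 | 1 | 0 | -2.202236 | 338 | 21 | 190 | 24 |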


In [169]:
final_df.to_csv('./data/scraped_match_data.csv')
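
The export call fails if `./data/` does not exist. It also writes the row index as an extra unnamed column, which carries no information after `ignore_index=True`. A possible variant, sketched here with a hypothetical `export_csv` helper, creates the directory first and leaves the index out.

```python
from pathlib import Path

import pandas as pd


def export_csv(frame, path):
    """Write frame to path as CSV, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # index=False leaves out the row index column
    frame.to_csv(target, index=False)
    return target
```

In the notebook this might be called as `export_csv(final_df, './data/scraped_match_data.csv')`.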

## Conclusion
This notebook scrapes EPL match data season by season, flattens it, and combines it into a single DataFrame for further analysis. It pauses for 5 seconds between requests to avoid overloading the target website's servers. The resulting CSV file, `scraped_match_data.csv`, can be used for analytical tasks related to EPL match performance.