# Business Understanding

## Title: Machine Learning Powered Soccer Match Prediction 


![Soccer Image](./images/cropped_epl_image.webp)


**Authors**: 
1. Dennis Kobia
2. Jane Mwangi
3. Rose Kyalo
4. Brytone Omare
5. Ivy Ndunge
6. Wayne Korir

## 1. Background
The betting industry in Kenya has undergone a transformative shift, evolving into a multi-billion-dollar industry. According to the Betting Control and Licensing Board of Kenya, the sector has grown exponentially, contributing significantly to the nation's economic development. The latest available data from the board indicates that the industry generated over Ksh 204 billion (approximately USD 2 billion) in revenue in the last fiscal year alone ([Betting Control and Licensing Board - Kenya](https://bclb.co.ke/)).

Within this burgeoning industry, soccer betting has emerged as the undisputed frontrunner, capturing the imagination of millions of Kenyan sports enthusiasts. A survey conducted by GeoPoll, a mobile-based market research platform, revealed that over 60% of mobile gamers in Kenya actively engage in soccer betting, making it the dominant betting category ([GeoPoll](https://www.geopoll.com/blog/the-rise-of-mobile-gaming-in-africa/)).

The allure of soccer betting lies in its dynamic and unpredictable nature, where punters navigate through a plethora of betting options. From predicting the winner of a match to anticipating the total goals scored, the process involves a meticulous analysis of team and player statistics, historical performances, and real-time match dynamics. Kenyan sports enthusiasts, fueled by their passion for the game, seek to translate their knowledge into profitable betting decisions ([Daily Nation](https://www.nation.co.ke/kenya/sports/-/1090/1090/-/h96psf/-/index.html)).

Recognizing the demand for more informed and accurate betting decisions, the intersection of technology and sports betting has witnessed a surge in predictive modeling. The Kenya Gazette, an official government publication, acknowledges the potential of predictive models in providing valuable insights to bettors, helping them make data-driven decisions to enhance their success rate ([Kenya Gazette](https://www.kenyagazette.co.ke/)).

This research project focuses on the development of a simplified soccer match outcome prediction system for the English Premier League. The envisioned product is a user-friendly system where a client inputs a match ID, and the system, powered by machine learning, provides a prediction for the match outcome. Leveraging advanced algorithms, the system aims to offer straightforward and reliable match predictions based on historical data and relevant match features.

Amid the remarkable growth in the Kenyan betting industry, there exists a research gap in the application of machine learning to provide simplified match predictions for soccer enthusiasts. This project aims to address this gap by introducing cutting-edge data science techniques into the realm of Kenyan soccer betting, offering users a practical and accessible tool to enhance their betting experience.

## 2. Problem Statement
The surge in popularity of sports betting, particularly in the vibrant betting industry of Kenya, has prompted an increasing demand for accurate and data-driven predictions, especially in the context of soccer matches. Despite the availability of vast amounts of football-related data, there is a notable gap in providing simplified and user-friendly prediction systems for Kenyan soccer enthusiasts.

### 2.1 Challenges

i. **Lack of Accessibility:**
   Existing machine learning predictive models often lack accessibility for the average user, requiring a level of expertise in data interpretation.

ii. **Limited Utilization of Advanced Techniques:**
    Current betting prediction platforms may not use advanced machine learning techniques to offer precise, tailored predictions.

iii. **Accuracy Limitations:**
     Current predictive models for soccer match outcomes, even with advanced techniques, have encountered challenges in achieving high accuracy. Studies, such as Md. Ashiqur Rahman's research titled "A deep learning framework for football match prediction," have reported a maximum accuracy of approximately 63% using advanced models. Addressing and surpassing this accuracy threshold poses a notable challenge in the development of soccer match prediction systems.

iv. **Complexity in Feature Selection:**
    The vast array of available features in soccer match data introduces complexity in selecting the most influential factors for accurate predictions. Properly navigating through these numerous features and identifying the key variables that significantly impact match outcomes is a substantial challenge. Effective feature selection will play a crucial role in optimizing the prediction model's performance and enhancing accuracy.




### 2.2 Opportunity
The opportunity lies in developing a simplified and user-friendly soccer match outcome prediction system for the English Premier League. This system aims to empower users with accurate match predictions, leveraging machine learning algorithms to process historical data and relevant match features.

## 3. Research Questions

### 3.1 General Research Questions
i. How can machine learning algorithms be effectively employed to predict soccer match outcomes?

ii. What are the key features and historical data points that significantly influence match predictions?

iii. How can a simplified prediction system be designed to cater to the needs of Kenyan soccer betting enthusiasts?

### 3.2 Exploratory Data Analysis (EDA) Research Questions
i. What are the distribution patterns of key performance metrics, such as team ranks, points, and goal differentials, across the English Premier League teams?

ii. Are there any correlations or trends between a team's historical performance metrics and its current rank in the league standings?

iii. How do teams' performance metrics vary when playing at home versus playing away?

iv. Can patterns or trends be identified in match outcomes based on each team's results in its last five games?

v. What role does goal difference play in determining the outcome of matches, and how does it correlate with the teams' overall performance?
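
Question iv can be explored with a rolling "form" feature. The snippet below is a minimal sketch, assuming a per-team match log with `title`, `date`, and `pts` columns (the names mirror the scraped match data used later in this notebook; the sample values are made up). It sums the points each team earned in up to five previous matches, shifting by one so the current match is excluded from its own feature.

```python
import pandas as pd

# Hypothetical match log: one row per team per match (values are illustrative only)
log = pd.DataFrame({
    "title": ["A"] * 6 + ["B"] * 3,
    "date": pd.to_datetime([
        "2023-08-12", "2023-08-19", "2023-08-26", "2023-09-02", "2023-09-16", "2023-09-23",
        "2023-08-12", "2023-08-19", "2023-08-26",
    ]),
    "pts": [3, 1, 0, 3, 3, 1, 0, 0, 3],
})

# Sort chronologically within each team so "previous matches" is well defined
log = log.sort_values(["title", "date"]).reset_index(drop=True)

# Points from up to five previous matches; shift(1) keeps the current match out of the window
log["form_last5"] = (
    log.groupby("title")["pts"]
       .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum())
)

print(log)
```

The first match of each team has no history, so its form value is missing rather than zero; how to treat that case is a modelling choice.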

### 3.3 Additional Machine Learning Model Assessment Questions
i. How sensitive is the prediction model to changes in input features, and which features contribute most significantly to prediction outcomes?

ii. What impact do changes in the training dataset size and composition have on the model's predictive performance?

## 4. Project Objectives

### 4.1 General Objective
**4.1.1** To Develop a Soccer Match Outcome Prediction System: Create a robust and user-friendly machine learning-based system capable of predicting the outcome of English Premier League soccer matches, catering specifically to the preferences and needs of Kenyan enthusiasts and other EPL enthusiasts across the world.

### 4.2 Specific Objectives
**4.2.1** To Implement Advanced Machine Learning Algorithms for Soccer Match Prediction: Integrate sophisticated machine learning algorithms, such as ensemble methods like Random Forests or deep learning models like LSTMs, to significantly enhance the accuracy and predictive power of the soccer match outcome prediction system.

**4.2.2** To Conduct Feature Selection for Enhanced Predictive Insights: Conduct a comprehensive analysis to identify and select key features and historical data points that significantly influence soccer match outcomes, including player stats, team form, historical head-to-head performances, and external factors like weather and injuries.

**4.2.3** To Iteratively Optimize the Prediction Model for Accuracy: Implement an iterative optimization process to continuously refine and improve the prediction model, incorporating feedback from user testing and internal evaluations, and adjusting parameters like weights and hyperparameters to achieve an accuracy target of at least 70% in predicting English Premier League soccer match outcomes.

**4.2.4** To Validate Predictions Against Historical Match Data: Validate the prediction system by comparing its predictions against actual outcomes using a 5-year historical dataset of English Premier League matches, establishing the reliability and effectiveness of the developed system.

**4.2.5** To Deploy the Optimized Model and Design a Simple, User-Friendly Interface for Soccer Match Predictions: Deploy the model using a framework like Flask and design an intuitive interface that allows users to select matches, view relevant statistics, and easily obtain clear predictions. The interface should cater to users with varying technical backgrounds and ensure a seamless experience.

### 4.3 Project Scope

#### 4.3.1 In Scope
- For this project, we will only be predicting the following markets:
  1. Home win, draw, or away win
  2. Total number of goals scored
  3. Cards, specifically the number of yellow cards and red cards

#### 4.3.2 Out of Scope
- In-Depth Player Analysis: Detailed analysis of individual player performance, including statistics such as goals scored by specific players, player injuries, etc.

- Live Match Prediction: Predicting match outcomes in real-time during live matches.

- Betting Odds Calculation: Calculating or providing betting odds for predicted outcomes.

- Player Transfer and Team Management: Analyzing or predicting player transfers, team management decisions, or other off-field aspects.

- Predictions for Other Football Leagues: Extending predictions to other football leagues beyond the English Premier League.


## 5. Success Criteria

### 5.1 Model Performance
#### 5.1.1 Correct Outcome Prediction (Win, Lose, Draw)
- Achieve a baseline prediction accuracy of at least 70% for determining the correct outcome (win, lose, draw) of English Premier League soccer matches.

#### 5.1.2 Consistency
- Demonstrate consistency in predictions across multiple test datasets, indicated by a low variance in prediction accuracy.
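
One way to quantify the accuracy and consistency criteria above is to compute accuracy on each held-out test set and then summarise the spread. The following is only a sketch: the three test sets, their labels, and the `H`/`D`/`A` encoding are assumptions made up for illustration.

```python
from statistics import mean, pstdev

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual outcome."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical (actual, predicted) outcomes: H = home win, D = draw, A = away win
folds = [
    (["H", "D", "A", "H"], ["H", "D", "H", "H"]),
    (["A", "A", "H", "D"], ["A", "H", "H", "D"]),
    (["H", "H", "D", "A"], ["H", "H", "D", "A"]),
]

# One accuracy score per test set
scores = [accuracy(actual, predicted) for actual, predicted in folds]

print("per-set accuracy:", scores)
print("mean:", round(mean(scores), 3), "std:", round(pstdev(scores), 3))
```

A small standard deviation relative to the mean would indicate the consistency described in 5.1.2; what counts as "low" is left to the project team.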

#### 5.1.3 Correct Outcome Prediction (Over and Under Markets - Total Goals)
- Achieve a prediction accuracy of at least 75% for markets related to the total number of goals scored in English Premier League soccer matches.

#### 5.1.4 Benchmark Comparison
- Outperform or match the performance of existing benchmark models or prediction systems in the context of English Premier League match predictions. 

## 6. Data Understanding

#### 6.1 Data Source
The primary data source for this project will be the FootyStats Football Data API ([FootyStats API](https://footystats.org/api/)), providing comprehensive information on English Premier League teams, matches, and performance metrics.

#### 6.2 Data Points and Descriptions
| Data Point                  | Description                                                  |
|-----------------------------|--------------------------------------------------------------|
| Team Rank                   | The current rank of the team within the English Premier League standings.          |
| Team Name and Crest          | The official name and emblem of the team for identification.                       |
| Played Matches              | The total number of matches played by the team in the current season.              |
| Wins                        | The total number of matches won by the team.                                    |
| Losses                      | The total number of matches lost by the team.                                   |
| Draws                       | The total number of matches drawn by the team.                                  |
| Points                      | The total points earned by the team based on wins and draws.                      |
| Last Five Games Results     | The outcomes (win, lose, draw) of the team's last five matches.                  |
| Goal Difference             | The numerical difference between goals scored and goals conceded.                 |
| Differential                | A calculated metric representing the difference between wins and losses.         |
| Goals For                   | The total number of goals scored by the team.                                   |
| Goals Against               | The total number of goals conceded by the team.                               |
| Win Percentage              | The percentage of matches won by the team.                                    |
| Won in Group                | The number of matches won by the team when playing in a group.                  |
| Lost in Group               | The number of matches lost by the team when playing in a group.                  |
| Win Percentage in Groups    | The percentage of matches won by the team when playing in a group.             |
| Won at Home                 | The number of matches won by the team when playing at their home stadium.      |
| Won Away                    | The number of matches won by the team when playing away.                        |
| Lost at Home                | The number of matches lost by the team when playing at their home stadium.     |
| Lost Away                   | The number of matches lost by the team when playing away.                       |
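
Several of the data points above are derived from raw counts. As a hedged illustration, the sketch below computes them from wins, draws, losses, and goals, using the league rule of 3 points per win and 1 per draw. The function name `derive_metrics` and the sample numbers are hypothetical and not taken from the dataset.

```python
def derive_metrics(wins, draws, losses, goals_for, goals_against):
    """Derive table-style metrics from raw counts (3 points per win, 1 per draw)."""
    played = wins + draws + losses
    return {
        "played": played,
        "points": 3 * wins + draws,
        "goal_difference": goals_for - goals_against,
        "differential": wins - losses,
        # Guard against division by zero before any match has been played
        "win_percentage": round(100 * wins / played, 1) if played else 0.0,
    }

# Illustrative numbers only
metrics = derive_metrics(wins=10, draws=4, losses=6, goals_for=31, goals_against=24)
print(metrics)
```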

#### 6.3 Data Retrieval Process
1. Use the FootyStats API to retrieve real-time and historical data for English Premier League teams.
2. Implement requests to obtain specific metrics outlined above for each team, considering relevant time frames for historical performance.



# Retrieving Data using an API

In [1]:
# Import the necessary libraries
import requests  # Library for making HTTP requests
import pandas as pd  # Library for data manipulation and analysis
from dotenv import dotenv_values
from bs4 import BeautifulSoup
import re
import json
import codecs
from pandas import json_normalize
import time  

# importing visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# configure notebook for inline plotting
%matplotlib inline

# config pandas to display more than 20 columns
pd.set_option('display.max_columns',250)

# set grid style 
sns.set_style('darkgrid')

In [2]:
secrets = dotenv_values('.env')
# Define the API key variable to hold the access key for the Footystats API
api_key = secrets['API_KEY']


In [3]:
# Define the URL for accessing the Football Data API's league list endpoint, including the API key
url = "https://api.football-data-api.com/league-list?key=" + api_key

# Make an HTTP GET request to the defined URL using the requests library
response = requests.get(url)


In [4]:
# Convert the response content to JSON format using the .json() method
data = response.json()


In [5]:
# Retrieve the keys of the 'data' dictionary to inspect its structure
data.keys()


dict_keys(['success', 'pager', 'metadata', 'data', 'message'])

In [6]:
# Extract the 'country' values from each dictionary in the 'data' list using list comprehension
countries = [league.get('country') for league in data['data']]

# Print the first 5 elements of the 'countries' list
print(countries[:5])


['USA', 'Scotland', 'Germany', 'Europe', 'Malaysia']


In [7]:
# Find the index of the value "England" in the 'countries' list
england_index = countries.index("England")

england_index

5
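
Note that `list.index` returns only the first match, so if several competitions share the country "England", the index above may not point at the intended league. The sketch below shows a more explicit lookup on a small stand-in for `data['data']`. The `name` field and all sample values are assumptions about the response, not something confirmed by the output above.

```python
# Stand-in for data["data"]; the "name" key and all values are assumed for illustration
leagues = [
    {"name": "USA MLS", "country": "USA", "season": [{"id": 1, "year": 2016}]},
    {"name": "England Premier League", "country": "England",
     "season": [{"id": 9, "year": 20162017}]},
    {"name": "England Championship", "country": "England",
     "season": [{"id": 20, "year": 20162017}]},
]

# Collect every entry for the country instead of stopping at the first one
english = [
    (i, league) for i, league in enumerate(leagues)
    if league.get("country") == "England"
]

# Show each candidate so the right competition can be picked deliberately
for i, league in english:
    print(i, league.get("name"), [s["id"] for s in league["season"]])
```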

In [8]:
# Access the season list for the England entry found above (avoids hard-coding index 5)
seasons_info = data["data"][england_index]["season"]

# Print the 'seasons_info' variable
seasons_info


[{'id': 9, 'year': 20162017},
 {'id': 10, 'year': 20152016},
 {'id': 11, 'year': 20142015},
 {'id': 12, 'year': 20132014},
 {'id': 161, 'year': 20172018},
 {'id': 246, 'year': 20122013},
 {'id': 1625, 'year': 20182019},
 {'id': 2012, 'year': 20192020},
 {'id': 3119, 'year': 20112012},
 {'id': 3121, 'year': 20102011},
 {'id': 3125, 'year': 20092010},
 {'id': 3131, 'year': 20082009},
 {'id': 3137, 'year': 20072008},
 {'id': 4759, 'year': 20202021},
 {'id': 6135, 'year': 20212022},
 {'id': 7704, 'year': 20222023},
 {'id': 9660, 'year': 20232024}]

In [9]:
# Extract the 'id' values from each dictionary in the 'seasons_info' list using list comprehension
season_ids = [season.get("id") for season in seasons_info]

# Print the 'season_ids' list
print(season_ids)


[9, 10, 11, 12, 161, 246, 1625, 2012, 3119, 3121, 3125, 3131, 3137, 4759, 6135, 7704, 9660]
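
The season list is not in chronological order, so picking "the most recent seasons" by position would be misleading. The sketch below reuses the `seasons_info` values printed above and sorts them by year before selecting IDs. Choosing five seasons mirrors objective 4.2.4 and is otherwise an assumption.

```python
seasons_info = [
    {"id": 9, "year": 20162017}, {"id": 10, "year": 20152016},
    {"id": 11, "year": 20142015}, {"id": 12, "year": 20132014},
    {"id": 161, "year": 20172018}, {"id": 246, "year": 20122013},
    {"id": 1625, "year": 20182019}, {"id": 2012, "year": 20192020},
    {"id": 3119, "year": 20112012}, {"id": 3121, "year": 20102011},
    {"id": 3125, "year": 20092010}, {"id": 3131, "year": 20082009},
    {"id": 3137, "year": 20072008}, {"id": 4759, "year": 20202021},
    {"id": 6135, "year": 20212022}, {"id": 7704, "year": 20222023},
    {"id": 9660, "year": 20232024},
]

# Sort chronologically; the 8-digit label (e.g. 20162017) orders correctly as an integer
by_year = sorted(seasons_info, key=lambda season: season["year"])

# Map each season label to its ID for readable lookups
year_to_id = {season["year"]: season["id"] for season in by_year}

# IDs of the five most recent seasons, oldest first
recent_ids = [season["id"] for season in by_year[-5:]]
print(recent_ids)
```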


In [10]:
def get_league_matches(api_key, season_id):
    """
    Retrieve league matches data for a specific season.

    Parameters:
    - api_key (str): API key for accessing the Football Data API.
    - season_id (int): ID of the specific season.

    Returns:
    - response: HTTP response object containing the retrieved data.
    """
    # Construct the URL for accessing league matches data for the specified season
    url = f"https://api.football-data-api.com/league-matches?key={api_key}&season_id={season_id}"
    
    # Make an HTTP GET request to the defined URL using the requests library
    response = requests.get(url)
    
    # Return the response object
    return response


In [11]:
def create_dataframe(api_key, season_ids):
    """
    Create a Pandas DataFrame by fetching and concatenating league matches data for multiple seasons.

    Parameters:
    - api_key (str): API key for accessing the Football Data API.
    - season_ids (list): List of season IDs for which data will be fetched.

    Returns:
    - concatenated_df: Pandas DataFrame containing concatenated league matches data for the specified seasons.
    """
    # Initialize an empty list to store individual DataFrames for each season
    list_of_dfs = []

    # Iterate through each season ID in the provided list
    for season_id in season_ids:
        try:
            # Fetch league matches data for the current season
            response = get_league_matches(api_key, season_id)
            data = response.json()
            
            # Create a DataFrame from the fetched data
            df = pd.DataFrame(data["data"])
            
            # Append the DataFrame to the list
            list_of_dfs.append(df)
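        except (requests.RequestException, ValueError, KeyError) as err:
            # Report which season failed, then re-raise so the failure is not silently swallowed
            # (calling exit() here would shut down the notebook kernel)
            print(f"Failed to fetch season {season_id}: {err}")
            raise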

    # Concatenate the DataFrames in the list to create a single DataFrame
    concatenated_df = pd.concat(list_of_dfs, ignore_index=True)
    
    # Return the concatenated DataFrame
    return concatenated_df


In [12]:
# Create a DataFrame containing league matches data for multiple seasons using the create_dataframe function
matches_df = create_dataframe(api_key, season_ids)

# Display the first few rows of the DataFrame to inspect the data
matches_df.head()


Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,date_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings
0,2155,150,108,2016/2017,complete,19,1,-1,"[45+1, 57]",[47],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,1471087800,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"[45+1, 57]",[47]
1,2156,145,154,2016/2017,complete,19,1,-1,[],[82],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,1471096800,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],[82]
2,2157,143,142,2016/2017,complete,19,1,-1,[],[74],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,1471096800,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],[74]
3,2158,144,92,2016/2017,complete,19,1,-1,[5],[59],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,1471096800,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,[5],[59]
4,2159,147,141,2016/2017,complete,19,1,-1,[11],[67],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,1471096800,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,[11],[67]


In [13]:
# Inspect the resulting dataframe
matches_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6460 entries, 0 to 6459
Columns: 215 entries, id to awayGoals_timings
dtypes: bool(7), float64(80), int64(112), object(16)
memory usage: 10.3+ MB


In [14]:
# Inspect the shape of the DataFrame
matches_df.shape

(6460, 215)
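
The row count can be sanity-checked with simple arithmetic. In a 20-team double round-robin, every club hosts every other club once, which gives 380 fixtures per season. With the 17 season IDs retrieved above, that accounts for every row. The check below is a sketch that assumes each season has the full 380 fixtures recorded.

```python
teams = 20

# Each ordered (home, away) pair of distinct teams plays exactly once per season
matches_per_season = teams * (teams - 1)

# Number of season IDs fetched from the API above
n_seasons = 17

expected_rows = n_seasons * matches_per_season
print(matches_per_season, expected_rows)
```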

# Scraping Data

In [15]:
# define EPL seasons 
seasons = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023]

In [16]:
def get_season_html(season):
    """Fetch the raw HTML of the Understat EPL page for the given season start year."""
    # Construct the URL based on the league (EPL) and season
    url = f"https://understat.com/league/EPL/{season}"

    # Send an HTTP GET request to the constructed URL
    response = requests.get(url)

    # Fail early on HTTP errors (e.g. 404 or 5xx) instead of parsing an error page
    response.raise_for_status()

    # Get the raw bytes of the response body, which hold the page's HTML
    html_content = response.content

    # Return the HTML content
    return html_content


In [17]:
def parse_html_content(html_content):
    # Parse HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all script tags in the HTML
    scripts = soup.find_all('script')

    # Access the script tag at index 2 (change index if needed)
    target_script = scripts[2]

    # Convert the script content to a string
    target_string = str(target_script.contents[0])

    # Decode the string using unicode_escape
    cleaned_string = codecs.decode(target_string, 'unicode_escape')

    # Extract the relevant JSON data from the decoded string
    # (Note: The specific indices [30:-4] may need adjustment based on the data structure)
    teams_data = json.loads(cleaned_string[30:-4])

    # Return the extracted teams_data
    return teams_data
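
The fixed script index and the `[30:-4]` slice make the parser above sensitive to small changes in the page. A hedged alternative is to locate the payload by pattern instead of by position. The sketch below assumes the page embeds the data as `var teamsData = JSON.parse('...')` with hex-escaped characters. That is an assumption about the site's markup, the helper name `extract_json_parse_payload` is hypothetical, and the sample string is synthetic.

```python
import codecs
import json
import re

def extract_json_parse_payload(script_text, var_name="teamsData"):
    """Return the decoded JSON assigned via `var <var_name> = JSON.parse('...')`."""
    pattern = rf"var\s+{re.escape(var_name)}\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, script_text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON.parse payload found for {var_name!r}")
    # Turn escape sequences such as \x7B back into the characters they represent
    decoded = codecs.decode(match.group(1), "unicode_escape")
    return json.loads(decoded)

# Synthetic script content standing in for the real page (not actual site data)
sample_script = (
    r"var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22title\x22"
    r"\x3A\x22Aston Villa\x22\x7D\x7D');"
)

parsed = extract_json_parse_payload(sample_script)
print(parsed)
```

If the variable is missing, the function raises a clear error instead of failing later inside `json.loads`.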


In [18]:
def normalized_dataframe(teams_data):
    # Create an empty list to store individual team DataFrames
    teams_normalized_dfs = []

    # Iterate through each team's data
    for team_id, team_data in teams_data.items():
        # Create a DataFrame from the team's data
        team_df = pd.DataFrame(team_data)

        # Normalize the 'history' column using json_normalize and concatenate it with the original DataFrame
        team_normalized_df = pd.concat([team_df.drop(['history'], axis=1), 
                                        json_normalize(team_df['history'])], axis=1)

        # Append the normalized DataFrame to the list
        teams_normalized_dfs.append(team_normalized_df)

    # Return the list of normalized per-team DataFrames
    return teams_normalized_dfs
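
To make the flattening step concrete, the sketch below runs the same drop-and-concatenate pattern on a tiny, made-up team record. It shows how a nested field such as `ppda` becomes dotted columns like `ppda.att`. Converting the column to a list before `json_normalize` is a defensive choice in this sketch, not something the function above does.

```python
import pandas as pd

# Made-up record shaped like one entry of teams_data (values are illustrative)
team_data = {
    "id": "71",
    "title": "Aston Villa",
    "history": [
        {"h_a": "a", "xG": 0.91, "ppda": {"att": 323, "def": 23}},
        {"h_a": "h", "xG": 0.51, "ppda": {"att": 326, "def": 21}},
    ],
}

# Scalars are broadcast, so this yields one row per history entry
team_df = pd.DataFrame(team_data)

# Flatten each history dict; nested keys are joined with a dot
history_flat = pd.json_normalize(team_df["history"].tolist())

# Replace the nested column with its flattened columns
flat = pd.concat([team_df.drop(columns="history"), history_flat], axis=1)
print(flat)
```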


In [19]:
# Create an empty list to store normalized DataFrames
normalized_dfs = []

# Iterate through each season
for season in seasons:
    # Fetch HTML content for the current season
    season_html_content = get_season_html(season)

    # Parse HTML content to obtain data
    season_parsed_data = parse_html_content(season_html_content)

    # Create normalized DataFrame for the current season
    season_normalized_df = normalized_dataframe(season_parsed_data)

    # Extend the list with the normalized DataFrames for the current season
    normalized_dfs.extend(season_normalized_df)

    # Add a 5-second delay before fetching data for the next season
    time.sleep(5)

# The 'normalized_dfs' list now contains all the normalized DataFrames for each season


In [20]:
# Create a single DataFrame by concatenating all individual team DataFrames
final_df = pd.concat(normalized_dfs, ignore_index=True)

In [21]:
final_df.shape

(7256, 23)
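
The scraped history appears to be stored per team, so each match would be listed twice (once for each side) and a complete 380-match season would contribute 760 rows. Ten complete seasons would then give 7,600 rows. The smaller total is consistent with the latest season being only partly played when the data was scraped. The arithmetic below is a sketch under the assumption that the nine earlier seasons are complete.

```python
rows_total = 7256                 # from final_df.shape
rows_per_full_season = 380 * 2    # every match is listed once per team
complete_seasons = 9              # assumption: seasons 2014 to 2022 are complete

# Rows left over for the most recent season, and the matches they represent
rows_latest = rows_total - complete_seasons * rows_per_full_season
matches_latest = rows_latest // 2
print(rows_latest, matches_latest)
```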

In [22]:
final_df.head()

Unnamed: 0,id,title,h_a,xG,xGA,npxG,npxGA,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD,ppda.att,ppda.def,ppda_allowed.att,ppda_allowed.def
0,71,Aston Villa,a,0.909774,0.423368,0.909774,0.423368,4,3,1,0,1.8322,w,2014-08-16 15:00:00,1,0,0,3,0.486406,323,23,132,32
1,71,Aston Villa,h,0.507525,0.699295,0.507525,0.699295,4,7,0,0,1.1057,d,2014-08-23 12:45:00,0,1,0,1,-0.19177,326,21,180,21
2,71,Aston Villa,h,0.639316,0.28888,0.639316,0.28888,6,7,2,1,1.6075,w,2014-08-31 13:30:00,1,0,0,3,0.350436,366,13,278,24
3,71,Aston Villa,a,0.701676,0.728097,0.701676,0.728097,1,5,1,0,1.3252,w,2014-09-13 17:30:00,1,0,0,3,-0.026421,486,9,91,14
4,71,Aston Villa,h,0.649013,1.36224,0.649013,1.36224,0,7,0,3,0.6912,l,2014-09-20 15:00:00,0,0,1,0,-0.713227,531,12,170,22


In [23]:
final_df.to_csv('./data/scraped_match_data.csv', index=False)

# Data Aggregation

In [24]:
# read data retrieved from API
api_data = pd.read_csv('./data/Matches.csv', header=0)

In [25]:
# inspect api data
api_data.head()

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,date_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings
0,2155,150,108,2016/2017,complete,19,1,-1,"['45+1', '57']",['47'],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685.0,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,1471087800,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"['45+1', '57']",['47']
1,2156,145,154,2016/2017,complete,19,1,-1,[],['82'],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688.0,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,1471096800,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],['82']
2,2157,143,142,2016/2017,complete,19,1,-1,[],['74'],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360.0,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,1471096800,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],['74']
3,2158,144,92,2016/2017,complete,19,1,-1,['5'],['59'],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537.0,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,1471096800,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,['5'],['59']
4,2159,147,141,2016/2017,complete,19,1,-1,['11'],['67'],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693.0,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,1471096800,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,['11'],['67']


In [26]:
# read scraped data 
scraped_data = pd.read_csv('./data/scraped_match_data.csv', header=0)

In [27]:
# inspect scraped data
scraped_data.head()

Unnamed: 0,id,title,h_a,xG,xGA,npxG,npxGA,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD,ppda.att,ppda.def,ppda_allowed.att,ppda_allowed.def
0,71,Aston Villa,a,0.909774,0.423368,0.909774,0.423368,4,3,1,0,1.8322,w,2014-08-16 15:00:00,1,0,0,3,0.486406,323,23,132,32
1,71,Aston Villa,h,0.507525,0.699295,0.507525,0.699295,4,7,0,0,1.1057,d,2014-08-23 12:45:00,0,1,0,1,-0.19177,326,21,180,21
2,71,Aston Villa,h,0.639316,0.28888,0.639316,0.28888,6,7,2,1,1.6075,w,2014-08-31 13:30:00,1,0,0,3,0.350436,366,13,278,24
3,71,Aston Villa,a,0.701676,0.728097,0.701676,0.728097,1,5,1,0,1.3252,w,2014-09-13 17:30:00,1,0,0,3,-0.026421,486,9,91,14
4,71,Aston Villa,h,0.649013,1.36224,0.649013,1.36224,0,7,0,3,0.6912,l,2014-09-20 15:00:00,0,0,1,0,-0.713227,531,12,170,22


Before going further, we drop every season before 2014/2015 from `api_data`. No scraped data is available for those earlier seasons, so keeping them would leave matches with no corresponding scraped statistics and produce an inconsistent dataset for aggregation.

In [28]:
api_data['season'].unique()

array(['2016/2017', '2015/2016', '2014/2015', '2013/2014', '2017/2018',
       '2012/2013', '2018/2019', '2019/2020', '2011/2012', '2010/2011',
       '2009/2010', '2008/2009', '2007/2008', '2020/2021', '2021/2022',
       '2022/2023', '2023/2024'], dtype=object)

In [29]:
# define list of seasons to drop
seasons_to_drop = ['2013/2014', '2012/2013', '2011/2012', '2010/2011', 
                   '2009/2010', '2008/2009', '2007/2008']

In [30]:
# Filter out rows where the "season" column is in the list
api_data = api_data[~api_data['season'].isin(seasons_to_drop)]

In [31]:
# confirm only relevant season data remains
api_data['season'].unique()

array(['2016/2017', '2015/2016', '2014/2015', '2017/2018', '2018/2019',
       '2019/2020', '2020/2021', '2021/2022', '2022/2023', '2023/2024'],
      dtype=object)
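
The list of seasons to drop is hard-coded above. As a hedged alternative, the same list could be derived from the season labels themselves. The sketch below assumes labels always follow the `'YYYY/YYYY'` pattern seen in the output; the helper `seasons_before` and the `demo` frame are illustrative names, not part of the original notebook.

```python
import pandas as pd

def seasons_before(seasons, first_start_year):
    """Return season labels (e.g. '2013/2014') whose start year is earlier than first_start_year."""
    return sorted(s for s in seasons if int(s.split("/")[0]) < first_start_year)

# Small stand-in for api_data (hypothetical demo values)
demo = pd.DataFrame({"season": ["2016/2017", "2013/2014", "2014/2015", "2007/2008"]})

# Seasons that start before 2014 have no scraped counterpart
to_drop = seasons_before(demo["season"].unique(), first_start_year=2014)
filtered = demo[~demo["season"].isin(to_drop)]
print(to_drop)
print(filtered["season"].tolist())
```

Deriving the list this way avoids editing it by hand if older seasons are added to the API pull later.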

## Harmonizing naming conventions

In [32]:
# Group the data by 'homeID' and get unique team names for each group
grouped_data = api_data.groupby('homeID')['home_name'].unique()

# Create a dictionary mapping each team name to its 'homeID'
# - Explode the nested arrays in 'home_name' to one row per team name
# - Pair each name with the 'homeID' carried on the exploded index,
#   so the pairing stays aligned even if an ID has more than one name
exploded_names = grouped_data.explode()
team_id_mapping = dict(zip(exploded_names, exploded_names.index))

In [33]:
# view team mappings
print(team_id_mapping)

{'Arsenal': 59, 'Tottenham Hotspur': 92, 'Manchester City': 93, 'Leicester City': 108, 'Stoke City': 141, 'West Bromwich Albion': 142, 'Crystal Palace': 143, 'Everton': 144, 'Burnley': 145, 'Southampton': 146, 'Middlesbrough': 147, 'AFC Bournemouth': 148, 'Manchester United': 149, 'Hull City': 150, 'Liverpool': 151, 'Chelsea': 152, 'West Ham United': 153, 'Swansea City': 154, 'Watford': 155, 'Sunderland': 156, 'Newcastle United': 157, 'Aston Villa': 158, 'Norwich City': 159, 'Queens Park Rangers': 160, 'Cardiff City': 161, 'Fulham': 162, 'Brighton & Hove Albion': 209, 'Nottingham Forest': 211, 'Huddersfield Town': 217, 'Brentford': 218, 'Leeds United': 222, 'Wolverhampton Wanderers': 223, 'Sheffield United': 251, 'Luton Town': 271}


In [34]:
team_names_api_data = list(team_id_mapping.keys())

print(team_names_api_data)

['Arsenal', 'Tottenham Hotspur', 'Manchester City', 'Leicester City', 'Stoke City', 'West Bromwich Albion', 'Crystal Palace', 'Everton', 'Burnley', 'Southampton', 'Middlesbrough', 'AFC Bournemouth', 'Manchester United', 'Hull City', 'Liverpool', 'Chelsea', 'West Ham United', 'Swansea City', 'Watford', 'Sunderland', 'Newcastle United', 'Aston Villa', 'Norwich City', 'Queens Park Rangers', 'Cardiff City', 'Fulham', 'Brighton & Hove Albion', 'Nottingham Forest', 'Huddersfield Town', 'Brentford', 'Leeds United', 'Wolverhampton Wanderers', 'Sheffield United', 'Luton Town']


In [35]:
len(team_names_api_data)

34

In [36]:
team_names_scraped_data = scraped_data['title'].unique()

In [37]:
print(team_names_scraped_data)

['Aston Villa' 'Everton' 'Southampton' 'Leicester' 'West Bromwich Albion'
 'Sunderland' 'Crystal Palace' 'Chelsea' 'West Ham' 'Tottenham' 'Arsenal'
 'Swansea' 'Stoke' 'Newcastle United' 'Liverpool' 'Manchester City'
 'Manchester United' 'Hull' 'Burnley' 'Queens Park Rangers' 'Bournemouth'
 'Norwich' 'Watford' 'Middlesbrough' 'Huddersfield' 'Brighton' 'Cardiff'
 'Fulham' 'Wolverhampton Wanderers' 'Sheffield United' 'Leeds' 'Brentford'
 'Nottingham Forest' 'Luton']


In [38]:
len(team_names_scraped_data)

34

In [39]:
# Get the teams that are in team_names_scraped_data but not in team_names_api_data
teams_not_in_api_data = [team for team in team_names_scraped_data if team not in team_names_api_data]

In [40]:
print(teams_not_in_api_data)

['Leicester', 'West Ham', 'Tottenham', 'Swansea', 'Stoke', 'Hull', 'Bournemouth', 'Norwich', 'Huddersfield', 'Brighton', 'Cardiff', 'Leeds', 'Luton']


In [41]:
# Define a dictionary that maps original team names to their corresponding new names in the scraped_data 
names = {'Leicester': 'Leicester City', 'West Ham': 'West Ham United', 
         'Tottenham': 'Tottenham Hotspur', 'Swansea':'Swansea City', 
         'Stoke': 'Stoke City', 'Hull': 'Hull City', 'Bournemouth': 'AFC Bournemouth',
         'Norwich':'Norwich City', 'Huddersfield':'Huddersfield Town', 'Brighton':'Brighton & Hove Albion', 
         'Cardiff':'Cardiff City', 'Leeds': 'Leeds United', 'Luton':'Luton Town'}

In [42]:
# Replace names in the 'title' column
# (assign the result back instead of calling replace in place on a column selection)
scraped_data['title'] = scraped_data['title'].replace(names)

In [43]:
# redefine team_names variable for scraped data
team_names_scraped_data = scraped_data['title'].unique()

In [44]:
# confirm renaming worked as planned
teams_not_in_api_data = [team for team in team_names_scraped_data if team not in team_names_api_data]

# there should be no naming mismatches
print(teams_not_in_api_data)

[]
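
The renaming dictionary above was written by hand. As a rough sketch of how candidate pairs could be proposed automatically, the helper below matches a short name to a full name only when it is a substring of exactly one full name, and leaves everything else for manual review. The function `suggest_name_mapping` is a hypothetical helper, and the `'Manchester'` and `'Wolves'` entries are made-up demo values included to show the unresolved cases.

```python
def suggest_name_mapping(short_names, full_names):
    """Suggest short -> full name pairs by unambiguous substring match."""
    suggestions, unresolved = {}, []
    for short in short_names:
        if short in full_names:
            continue  # already identical, nothing to rename
        candidates = [full for full in full_names if short.lower() in full.lower()]
        if len(candidates) == 1:
            suggestions[short] = candidates[0]
        else:
            # zero or several candidates: leave for manual review
            unresolved.append(short)
    return suggestions, unresolved

api_names = ["Leicester City", "AFC Bournemouth", "Brighton & Hove Albion",
             "Manchester City", "Manchester United", "Arsenal"]
scraped_names = ["Leicester", "Bournemouth", "Brighton", "Manchester", "Arsenal", "Wolves"]

suggestions, unresolved = suggest_name_mapping(scraped_names, api_names)
print(suggestions)
print(unresolved)
```

Any suggested mapping should still be checked by eye before it is applied, since substring matching knows nothing about clubs.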


## Harmonizing team ID columns
- Here we adopt the team IDs from `api_data` in the `scraped_data` dataframe, replacing the scraped data's own `id` column.

In [45]:
# Map team names to their corresponding teamID using the team_id_mapping dictionary
scraped_data['teamID'] = scraped_data['title'].map(team_id_mapping)

In [46]:
scraped_data.drop(columns=['id'], inplace=True)

In [47]:
scraped_data.head()

Unnamed: 0,title,h_a,xG,xGA,npxG,npxGA,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD,ppda.att,ppda.def,ppda_allowed.att,ppda_allowed.def,teamID
0,Aston Villa,a,0.909774,0.423368,0.909774,0.423368,4,3,1,0,1.8322,w,2014-08-16 15:00:00,1,0,0,3,0.486406,323,23,132,32,158
1,Aston Villa,h,0.507525,0.699295,0.507525,0.699295,4,7,0,0,1.1057,d,2014-08-23 12:45:00,0,1,0,1,-0.19177,326,21,180,21,158
2,Aston Villa,h,0.639316,0.28888,0.639316,0.28888,6,7,2,1,1.6075,w,2014-08-31 13:30:00,1,0,0,3,0.350436,366,13,278,24,158
3,Aston Villa,a,0.701676,0.728097,0.701676,0.728097,1,5,1,0,1.3252,w,2014-09-13 17:30:00,1,0,0,3,-0.026421,486,9,91,14,158
4,Aston Villa,h,0.649013,1.36224,0.649013,1.36224,0,7,0,3,0.6912,l,2014-09-20 15:00:00,0,0,1,0,-0.713227,531,12,170,22,158
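
`Series.map` silently returns `NaN` for any name missing from the dictionary, so a quick check after mapping can catch leftover naming mismatches. The following is a minimal sketch on toy data; the `mapping` and `demo` objects are illustrative stand-ins for `team_id_mapping` and `scraped_data`, and the unmapped `'Leicester'` row is deliberate.

```python
import pandas as pd

mapping = {"Aston Villa": 158, "Everton": 144}           # stand-in for team_id_mapping
demo = pd.DataFrame({"title": ["Aston Villa", "Everton", "Leicester"]})

demo["teamID"] = demo["title"].map(mapping)

# Names that did not receive an ID (these would need renaming first)
unmapped = demo.loc[demo["teamID"].isna(), "title"].unique().tolist()
print(unmapped)

# A nullable integer dtype keeps IDs as integers even when some are missing
demo["teamID"] = demo["teamID"].astype("Int64")
print(demo.dtypes)
```

In the notebook itself the list of unmapped names is expected to be empty, because the names were harmonized in the previous section.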


In [48]:
# Reorder columns with 'teamID' as the first column
scraped_data = scraped_data[['teamID', 'title', 'h_a', 'xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed',
                             'scored', 'missed', 'xpts', 'result', 'date', 'wins', 'draws', 'loses',
                             'pts', 'npxGD', 'ppda.att', 'ppda.def', 'ppda_allowed.att',
                             'ppda_allowed.def']]

In [49]:
scraped_data.head()

Unnamed: 0,teamID,title,h_a,xG,xGA,npxG,npxGA,deep,deep_allowed,scored,missed,xpts,result,date,wins,draws,loses,pts,npxGD,ppda.att,ppda.def,ppda_allowed.att,ppda_allowed.def
0,158,Aston Villa,a,0.909774,0.423368,0.909774,0.423368,4,3,1,0,1.8322,w,2014-08-16 15:00:00,1,0,0,3,0.486406,323,23,132,32
1,158,Aston Villa,h,0.507525,0.699295,0.507525,0.699295,4,7,0,0,1.1057,d,2014-08-23 12:45:00,0,1,0,1,-0.19177,326,21,180,21
2,158,Aston Villa,h,0.639316,0.28888,0.639316,0.28888,6,7,2,1,1.6075,w,2014-08-31 13:30:00,1,0,0,3,0.350436,366,13,278,24
3,158,Aston Villa,a,0.701676,0.728097,0.701676,0.728097,1,5,1,0,1.3252,w,2014-09-13 17:30:00,1,0,0,3,-0.026421,486,9,91,14
4,158,Aston Villa,h,0.649013,1.36224,0.649013,1.36224,0,7,0,3,0.6912,l,2014-09-20 15:00:00,0,0,1,0,-0.713227,531,12,170,22


## Converting date columns to date objects

In [50]:
scraped_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7256 entries, 0 to 7255
Data columns (total 23 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   teamID            7256 non-null   int64  
 1   title             7256 non-null   object 
 2   h_a               7256 non-null   object 
 3   xG                7256 non-null   float64
 4   xGA               7256 non-null   float64
 5   npxG              7256 non-null   float64
 6   npxGA             7256 non-null   float64
 7   deep              7256 non-null   int64  
 8   deep_allowed      7256 non-null   int64  
 9   scored            7256 non-null   int64  
 10  missed            7256 non-null   int64  
 11  xpts              7256 non-null   float64
 12  result            7256 non-null   object 
 13  date              7256 non-null   object 
 14  wins              7256 non-null   int64  
 15  draws             7256 non-null   int64  
 16  loses             7256 non-null   int64  


In [51]:
# Convert 'date' column to datetime format
scraped_data['date'] = pd.to_datetime(scraped_data['date'])

In [52]:
# Extract the date part from the 'date' column
scraped_data['date'] = scraped_data['date'].dt.date

In [53]:
scraped_data[['date']].head()

Unnamed: 0,date
0,2014-08-16
1,2014-08-23
2,2014-08-31
3,2014-09-13
4,2014-09-20


In [54]:
# Convert the 'date_unix' column to datetime
api_data['date_unix'] = pd.to_datetime(api_data['date_unix'], unit='s')

In [55]:
api_data['date'] = api_data['date_unix'].dt.date

In [56]:
api_data[['date']].head()

Unnamed: 0,date
0,2016-08-13
1,2016-08-13
2,2016-08-13
3,2016-08-13
4,2016-08-13
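
Using `.dt.date` stores plain Python `date` objects in an `object` column. A possible alternative, sketched below, is `.dt.normalize()`, which also drops the time of day but keeps the `datetime64` dtype so that later comparisons and merges stay vectorized. The `demo` frame is illustrative; its two Unix timestamps are taken from the first rows of `api_data` shown earlier.

```python
import pandas as pd

demo = pd.DataFrame({"date_unix": [1471087800, 1471096800]})

# Convert seconds since the epoch to timestamps, then zero out the time of day
timestamps = pd.to_datetime(demo["date_unix"], unit="s")
demo["date"] = timestamps.dt.normalize()

print(demo["date"].dtype)
print(demo["date"].tolist())
```

Whichever representation is chosen, both dataframes need to use the same one, otherwise date comparisons between them will fail or misbehave.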


## Data Aggregation

In [57]:
# Define a list of features from the 'scraped_data' dataframe to include in the merged dataframe
scraped_data_feats = ['xG', 'xGA', 'npxG', 'npxGA', 'deep', 'deep_allowed', 'scored', 'missed', 
                      'xpts', 'npxGD', 'ppda.att', 'ppda.def', 'ppda_allowed.att', 'ppda_allowed.def']

In [58]:
# Sort the DataFrame by the 'date' column in descending order (most recent matches first)
sorted_scraped_data = scraped_data.sort_values(by='date', ascending=False)

In [59]:
# Create empty placeholder columns for the home and away features using assign
api_data = api_data.assign(**{f"{feat}_home": None for feat in scraped_data_feats})
api_data = api_data.assign(**{f"{feat}_away": None for feat in scraped_data_feats})

In [60]:
api_data.head()

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,dat
e_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_all
owed.def_away
0,2155,150,108,2016/2017,complete,19,1,-1,"['45+1', '57']",['47'],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685.0,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,2016-08-13 11:30:00,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"['45+1', '57']",['47'],2016-08-13,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2156,145,154,2016/2017,complete,19,1,-1,[],['82'],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688.0,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],['82'],2016-08-13,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2157,143,142,2016/2017,complete,19,1,-1,[],['74'],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360.0,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],['74'],2016-08-13,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2158,144,92,2016/2017,complete,19,1,-1,['5'],['59'],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537.0,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,['5'],['59'],2016-08-13,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2159,147,141,2016/2017,complete,19,1,-1,['11'],['67'],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693.0,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,['11'],['67'],2016-08-13,,,,,,,,,,,,,,,,,,,,,,,,,,,,
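
Filling the placeholder columns with `None` gives them the `object` dtype. One option, shown here as a small sketch rather than the notebook's approach, is to pre-allocate with `np.nan` so the columns start out as `float64`. The two-feature `feats` list and the `demo` frame are shortened stand-ins for `scraped_data_feats` and `api_data`.

```python
import numpy as np
import pandas as pd

feats = ["xG", "xGA"]                          # shortened stand-in for scraped_data_feats
demo = pd.DataFrame({"homeID": [150, 145]})    # stand-in for api_data

# One float placeholder column per feature and side
new_cols = {f"{feat}_{side}": np.nan for side in ("home", "away") for feat in feats}
demo = demo.assign(**new_cols)

print(demo.dtypes)
```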


In [61]:
# Iterate through each row in api_data
for index, row in api_data.iterrows():
    home_id = row['homeID']
    date = row['date']
    
    # Retrieve the home team's matches played before the current match (most recent first)
    result = sorted_scraped_data[(sorted_scraped_data['teamID'] == home_id) & (sorted_scraped_data['date'] < date)]
    # Check if there are matching records
    if not result.empty:
        # Mean of each feature over the team's last 15 games (fewer if not available)
        result_mean = result.head(15)[scraped_data_feats].mean()
        
        # Update values in the dataframe
        for key, value in result_mean.items():
            api_data.at[index, f'{key}_home'] = value

In [62]:
# Iterate through each row in api_data
for index, row in api_data.iterrows():
    away_id = row['awayID']
    date = row['date']
    
    # Retrieve the away team's matches played before the current match (most recent first)
    result = sorted_scraped_data[(sorted_scraped_data['teamID'] == away_id) & (sorted_scraped_data['date'] < date)]
    # Check if there are matching records
    if not result.empty:
        # Mean of each feature over the team's last 15 games (fewer if not available)
        result_mean = result.head(15)[scraped_data_feats].mean()
        
        # Update values in the dataframe
        for key, value in result_mean.items():
            api_data.at[index, f'{key}_away'] = value
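
The two loops above filter the scraped data once per match, which becomes slow as the number of matches grows. A vectorized alternative, sketched below on toy data, computes a per-team rolling mean and then uses `pd.merge_asof` with `allow_exact_matches=False` to pick, for each match, the team's latest scraped row strictly before the match date. This is not the notebook's method. It assumes the `date` columns are `datetime64` (not Python `date` objects) and that the scraped data has one row per team per match; `rolling_team_means` and `attach_form` are hypothetical helper names, and a window of 2 is used only to keep the demo small.

```python
import pandas as pd

def rolling_team_means(stats, feats, window=15):
    """Per-team rolling mean over the current and preceding games (up to `window` rows)."""
    stats = stats.sort_values(["teamID", "date"], kind="stable").reset_index(drop=True)
    rolled = (
        stats.groupby("teamID")[feats]
        .rolling(window, min_periods=1)
        .mean()
        .reset_index(level=0, drop=True)   # drop the group level, keep the row index
    )
    return stats[["teamID", "date"]].join(rolled)

def attach_form(matches, rolled, team_col, suffix, feats):
    """Attach each team's most recent rolling means from strictly before the match date."""
    right = rolled.rename(columns={"teamID": team_col, **{f: f"{f}{suffix}" for f in feats}})
    return pd.merge_asof(
        matches.sort_values("date", kind="stable"),
        right.sort_values("date", kind="stable"),
        on="date",
        by=team_col,
        allow_exact_matches=False,         # only rows with an earlier date qualify
    )

# Toy stand-ins for scraped_data and api_data (hypothetical values)
stats = pd.DataFrame({
    "teamID": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15",
                            "2020-01-01", "2020-01-08"]),
    "xG": [1.0, 2.0, 3.0, 0.5, 1.5],
})
matches = pd.DataFrame({
    "id": [10, 11],
    "date": pd.to_datetime(["2020-01-08", "2020-01-20"]),
    "homeID": [2, 1],
    "awayID": [1, 2],
})

feats = ["xG"]
rolled = rolling_team_means(stats, feats, window=2)
form = attach_form(matches, rolled, "homeID", "_home", feats)
form = attach_form(form, rolled, "awayID", "_away", feats)
print(form)
```

Matches with no earlier scraped row for a team simply get `NaN` in the corresponding feature columns, which mirrors what the loops leave behind when `result` is empty.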

In [63]:
api_data.head()

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,dat
e_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_all
owed.def_away
0,2155,150,108,2016/2017,complete,19,1,-1,"['45+1', '57']",['47'],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685.0,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,2016-08-13 11:30:00,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"['45+1', '57']",['47'],2016-08-13,1.05377,1.10703,1.05377,1.05629,3.93333,7.2,0.866667,1.2,1.38615,-0.00251373,280.333,20.1333,185.067,20.9333,1.97692,1.12848,1.77394,1.02699,6.33333,8.53333,1.73333,0.666667,1.92343,0.746946,238.133,25.7333,168.267,22.6667
1,2156,145,154,2016/2017,complete,19,1,-1,[],['82'],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688.0,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],['82'],2016-08-13,0.982456,1.5711,0.931711,1.41887,5.73333,6.66667,0.466667,1.0,0.958993,-0.487157,274.467,23.6,179.533,20.2667,1.12589,1.60696,1.12589,1.60696,5.73333,7.4,1.33333,1.4,1.07717,-0.481071,217.933,28.0,256.467,26.3333
2,2157,143,142,2016/2017,complete,19,1,-1,[],['74'],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360.0,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],['74'],2016-08-13,1.11155,1.43062,1.06081,1.27839,3.73333,8.73333,1.0,1.6,1.13741,-0.217585,208.267,23.8667,182.667,24.8667,1.21161,1.33504,1.0582,1.2843,4.46667,7.8,0.8,1.2,1.29981,-0.226096,260.933,24.4667,137.867,21.8667
3,2158,144,92,2016/2017,complete,19,1,-1,['5'],['59'],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537.0,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,['5'],['59'],2016-08-13,1.46906,1.61328,1.21534,1.51179,6.46667,9.66667,1.26667,1.4,1.35011,-0.296455,208.533,21.8,250.667,24.8,1.94859,1.01478,1.8471,0.964038,7.33333,4.73333,1.86667,1.06667,2.02045,0.883062,169.133,26.6,270.6,24.2667
4,2159,147,141,2016/2017,complete,19,1,-1,['11'],['67'],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693.0,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,['11'],['67'],2016-08-13,,,,,,,,,,,,,,,0.959606,1.57779,0.908862,1.42556,5.73333,6.26667,1.13333,2.0,1.06113,-0.516701,221.8,22.8667,241.0,25.0667


In [64]:
api_data.to_csv('./data/api_scraped_data.csv', index=False)

# Data Cleaning & EDA

In [65]:
# read aggregated data into a dataframe
match_data = pd.read_csv('./data/api_scraped_data.csv', header=0)

In [66]:
# get shape of data
match_data.shape

(3800, 244)

In [67]:
# inspect the first few rows of the dataframe
match_data.head()

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,date_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_allowed.def_away
0,2155,150,108,2016/2017,complete,19,1,-1,"['45+1', '57']",['47'],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685.0,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,2016-08-13 11:30:00,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"['45+1', '57']",['47'],2016-08-13,1.053772,1.10703,1.053772,1.056286,3.933333,7.2,0.866667,1.2,1.386147,-0.002514,280.333333,20.133333,185.066667,20.933333,1.976917,1.128482,1.773939,1.026993,6.333333,8.533333,1.733333,0.666667,1.923433,0.746946,238.133333,25.733333,168.266667,22.666667
1,2156,145,154,2016/2017,complete,19,1,-1,[],['82'],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688.0,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],['82'],2016-08-13,0.982456,1.571102,0.931711,1.418868,5.733333,6.666667,0.466667,1.0,0.958993,-0.487157,274.466667,23.6,179.533333,20.266667,1.12589,1.606961,1.12589,1.606961,5.733333,7.4,1.333333,1.4,1.077167,-0.481071,217.933333,28.0,256.466667,26.333333
2,2157,143,142,2016/2017,complete,19,1,-1,[],['74'],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360.0,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],['74'],2016-08-13,1.11155,1.430625,1.060806,1.278391,3.733333,8.733333,1.0,1.6,1.137407,-0.217585,208.266667,23.866667,182.666667,24.866667,1.211611,1.335043,1.058202,1.284298,4.466667,7.8,0.8,1.2,1.299807,-0.226096,260.933333,24.466667,137.866667,21.866667
3,2158,144,92,2016/2017,complete,19,1,-1,['5'],['59'],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537.0,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,['5'],['59'],2016-08-13,1.469059,1.613283,1.215339,1.511794,6.466667,9.666667,1.266667,1.4,1.350113,-0.296455,208.533333,21.8,250.666667,24.8,1.948587,1.014782,1.8471,0.964038,7.333333,4.733333,1.866667,1.066667,2.020447,0.883062,169.133333,26.6,270.6,24.266667
4,2159,147,141,2016/2017,complete,19,1,-1,['11'],['67'],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693.0,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,['11'],['67'],2016-08-13,,,,,,,,,,,,,,,0.959606,1.577794,0.908862,1.425562,5.733333,6.266667,1.133333,2.0,1.061133,-0.516701,221.8,22.866667,241.0,25.066667


The dataset consists of 3,800 observations (one row per match) and 244 columns, some of which are our target variables.
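The column count above includes `Unnamed: 0`, which in the preview simply repeats the row position (0, 1, 2, ...) even though the file was written with `index=False`. A minimal sketch of how such a leftover index column could be detected and dropped is shown below. The toy frame and the helper name `drop_leftover_index` are assumptions made for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# Toy stand-in for the scraped frame (hypothetical values); like the real data,
# it already carries a stray index column before being written out.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [2155, 2156, 2157],
    "homeGoalCount": [2, 0, 0],
})

# Round-trip through CSV the same way the notebook does (index=False).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)


def drop_leftover_index(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'Unnamed: N' columns whose values only repeat the row position."""
    expected = list(range(len(df)))
    leftovers = [
        col for col in df.columns
        if str(col).startswith("Unnamed:") and df[col].tolist() == expected
    ]
    return df.drop(columns=leftovers)


cleaned = drop_leftover_index(reloaded)
print(reloaded.shape, "->", cleaned.shape)
```

Checking the values before dropping keeps the helper from discarding a column that merely happens to be unnamed but holds real data.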

In [69]:
# check types of columns
match_data.dtypes.unique()

array([dtype('int64'), dtype('O'), dtype('float64'), dtype('bool')],
      dtype=object)

The result shows four dtypes in the dataset: object (strings such as team names, URLs and goal-timing lists), boolean, and numeric (`int64` and `float64`). Before proceeding further, let's confirm that each column's dtype matches what the documentation specifies.
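One way to make that comparison easier is to list the column names under each dtype rather than only the unique dtypes. The sketch below does this on a small stand-in frame whose column names are taken from the preview above but whose values are illustrative. It also shows that a list-like column such as `homeGoals` comes back from CSV as plain text and can be parsed with `ast.literal_eval`; whether the project needs that step is an assumption.

```python
import ast

import pandas as pd

# Small stand-in for match_data (illustrative values only).
sample = pd.DataFrame({
    "id": [2155, 2156],
    "season": ["2016/2017", "2016/2017"],
    "homeGoals": ["['45+1', '57']", "[]"],
    "odds_ft_1": [3.41, 2.45],
    "over25": [True, False],
})

# Group column names by dtype so each group can be checked against the docs.
columns_by_dtype = {}
for name, dtype in sample.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)

for dtype, names in columns_by_dtype.items():
    print(f"{dtype}: {names}")

# The goal-timing column is text that looks like a Python list;
# literal_eval turns each cell into a real list of strings.
parsed_goals = sample["homeGoals"].apply(ast.literal_eval)
print(parsed_goals.tolist())
```

On the full dataset the same loop can be run on `match_data.dtypes`, and the printed groups compared against the column definitions once that table is available.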

### To do (Create table of columns and their definitions) - Use datawrapper and insert link 


In [71]:
# inspect the first few rows
match_data.head(10)

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,date_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_allowed.def_away
0,2155,150,108,2016/2017,complete,19,1,-1,"['45+1', '57']",['47'],2,1,3,5,3,8,1,0,2,2,0,0,4,6,7,9,11,15,7,17,50,50,685.0,196.0,160.0,KCOM Stadium (Hull),,2,2,3.41,3.19,2.39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,2016-08-13 11:30:00,150,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,2,2,0,2,2,-1,0,0,0,0,0.0,0.0,0.0,0,1,0,1,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/hull-city-afc-150,teams/england-hull-city-afc.png,Hull City,/clubs/leicester-city-fc-108,teams/england-leicester-city-fc.png,Leicester City,1.47,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/leicester-city-fc-vs-hull-city-afc-h2...,9,38,True,True,True,False,False,False,True,"['45+1', '57']",['47'],2016-08-13,1.053772,1.10703,1.053772,1.056286,3.933333,7.2,0.866667,1.2,1.386147,-0.002514,280.333333,20.133333,185.066667,20.933333,1.976917,1.128482,1.773939,1.026993,6.333333,8.533333,1.733333,0.666667,1.923433,0.746946,238.133333,25.733333,168.266667,22.666667
1,2156,145,154,2016/2017,complete,19,1,-1,[],['82'],0,1,1,7,4,11,2,2,3,2,0,0,6,8,6,7,12,15,10,14,55,45,688.0,197.0,198.0,Turf Moor (Burnley),,3,2,2.45,3.22,3.26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,154,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,0,2,2,1,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/burnley-fc-145,teams/england-burnley-fc.png,Burnley,/clubs/swansea-city-afc-154,teams/wales-swansea-city-afc.png,Swansea City,1.74,0.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/burnley-fc-vs-swansea-city-afc-h2h-st...,9,38,True,False,False,False,False,False,False,[],['82'],2016-08-13,0.982456,1.571102,0.931711,1.418868,5.733333,6.666667,0.466667,1.0,0.958993,-0.487157,274.466667,23.6,179.533333,20.266667,1.12589,1.606961,1.12589,1.606961,5.733333,7.4,1.333333,1.4,1.077167,-0.481071,217.933333,28.0,256.466667,26.333333
2,2157,143,142,2016/2017,complete,19,1,-1,[],['74'],0,1,1,3,6,9,0,2,2,2,0,0,7,6,5,6,12,12,12,14,54,46,360.0,199.0,200.0,Selhurst Park (London),,2,2,2.2,3.25,3.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,0,0,1,1,0,2016-08-13 14:00:00,142,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,2,1,1,3,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/west-bromwich-albion-fc-142,teams/england-west-bromwich-albion-fc.png,West Bromwich Albion,1.05,0.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/west-bromwich-albion-fc-vs-crystal-pa...,9,38,True,False,False,False,False,False,False,[],['74'],2016-08-13,1.11155,1.430625,1.060806,1.278391,3.733333,8.733333,1.0,1.6,1.137407,-0.217585,208.266667,23.866667,182.666667,24.866667,1.211611,1.335043,1.058202,1.284298,4.466667,7.8,0.8,1.2,1.299807,-0.226096,260.933333,24.466667,137.866667,21.866667
3,2158,144,92,2016/2017,complete,19,1,-1,['5'],['59'],1,1,2,5,6,11,4,0,0,0,0,0,7,4,3,2,10,6,10,14,41,59,537.0,201.0,156.0,Goodison Park (Liverpool),,0,0,3.13,3.36,2.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.26,1.74,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/tottenham-hotspur-fc-vs-everton-fc-h2...,9,38,True,True,False,False,False,False,True,['5'],['59'],2016-08-13,1.469059,1.613283,1.215339,1.511794,6.466667,9.666667,1.266667,1.4,1.350113,-0.296455,208.533333,21.8,250.666667,24.8,1.948587,1.014782,1.8471,0.964038,7.333333,4.733333,1.866667,1.066667,2.020447,0.883062,169.133333,26.6,270.6,24.266667
4,2159,147,141,2016/2017,complete,19,1,-1,['11'],['67'],1,1,2,9,6,15,1,3,3,5,0,0,3,2,5,8,8,10,18,13,46,54,693.0,202.0,203.0,Riverside Stadium (Middlesbrough),,3,5,2.49,3.2,3.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,1,0,0,1,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,2,1,3,4,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/middlesbrough-fc-147,teams/england-middlesbrough-fc.png,Middlesbrough,/clubs/stoke-city-fc-141,teams/england-stoke-city-fc.png,Stoke City,0.95,0.89,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/stoke-city-fc-vs-middlesbrough-fc-h2h...,9,38,True,True,False,False,False,False,True,['11'],['67'],2016-08-13,,,,,,,,,,,,,,,0.959606,1.577794,0.908862,1.425562,5.733333,6.266667,1.133333,2.0,1.061133,-0.516701,221.8,22.866667,241.0,25.066667
5,2160,146,155,2016/2017,complete,19,1,-1,['58'],['9'],1,1,2,6,2,8,1,0,1,2,0,1,10,0,8,6,18,6,8,12,66,34,697.0,204.0,205.0,"St. Mary's Stadium (Southampton, Hampshire)",,1,4,1.8,3.64,5.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,2,0,1,1,0,1,1,2016-08-13 14:00:00,-1,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,0,1,3,0,4,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,1,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/southampton-fc-146,teams/england-southampton-fc.png,Southampton,/clubs/watford-fc-155,teams/england-watford-fc.png,Watford,1.26,0.63,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/southampton-fc-vs-watford-fc-h2h-stat...,9,38,True,True,False,False,False,False,True,['58'],['9'],2016-08-13,1.541624,1.269699,1.435341,1.269699,8.6,5.8,1.8,1.133333,1.5882,0.165642,238.533333,25.933333,190.266667,20.066667,1.057258,1.732613,0.905024,1.528459,5.466667,7.266667,0.866667,1.6,0.991227,-0.623435,241.066667,27.266667,188.866667,25.8
6,2161,93,156,2016/2017,complete,19,1,-1,"['4', '87']",['71'],2,1,3,10,5,15,1,2,1,2,0,0,5,3,8,2,13,5,11,11,72,28,293.0,72.0,206.0,Etihad Stadium (Manchester),,1,2,1.21,7.0,9.99,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,1,1,2,1,2016-08-13 16:30:00,93,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,1,1,1,1,2,-1,0,0,0,0,0.0,0.0,0.0,1,0,1,0,0,0,1,1,1,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/manchester-city-fc-93,teams/england-manchester-city-fc.png,Manchester City,/clubs/sunderland-afc-156,teams/england-sunderland-afc.png,Sunderland,2.11,0.53,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/manchester-city-fc-vs-sunderland-afc-...,9,38,True,True,True,False,False,False,True,"['4', '87']",['71'],2016-08-13,1.466291,1.029806,1.263313,0.979063,9.4,3.666667,1.733333,1.2,1.63854,0.284249,216.2,27.466667,263.266667,26.266667,1.171354,1.155271,1.069865,1.104526,5.466667,9.066667,1.333333,1.066667,1.40322,-0.034661,238.466667,23.933333,141.4,21.0
7,2162,148,149,2016/2017,complete,19,1,-1,['69'],"['40', '59', '64']",1,3,4,4,2,6,3,4,0,1,0,0,3,5,2,1,5,6,6,10,53,47,526.0,207.0,208.0,"Vitality Stadium (Bournemouth, Dorset)",,0,1,5.21,3.68,1.78,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,4,0,1,1,2,3,1,2016-08-14 12:30:00,149,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,0,0,0,1,0,1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/afc-bournemouth-148,teams/england-afc-bournemouth.png,AFC Bournemouth,/clubs/manchester-united-fc-149,teams/england-manchester-united-fc.png,Manchester United,1.63,1.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/afc-bournemouth-vs-manchester-united-...,9,38,True,True,True,True,False,False,True,['69'],"['40', '59', '64']",2016-08-14,0.928213,1.239905,0.928213,1.189161,4.866667,6.133333,1.2,1.933333,1.14248,-0.260948,240.666667,25.933333,239.466667,23.466667,1.084222,1.180528,1.084222,1.180528,7.933333,6.0,1.4,0.933333,1.323667,-0.096306,188.333333,25.4,241.6,23.8
8,2163,59,151,2016/2017,complete,19,1,-1,"['31', '64', '75']","['45+1', '49', '56', '63']",3,4,7,5,4,9,4,3,3,3,0,0,2,5,4,5,6,10,13,15,47,53,393.0,145.0,85.0,Emirates Stadium (London),,3,3,2.14,3.44,3.74,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,7,1,1,2,3,5,2,2016-08-14 15:00:00,151,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,1,3,2,0,4,2,-1,0,0,0,0,0.0,0.0,0.0,1,0,0,0,1,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/arsenal-fc-59,teams/england-arsenal-fc.png,Arsenal,/clubs/liverpool-fc-151,teams/england-liverpool-fc.png,Liverpool,2.37,1.84,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/arsenal-fc-vs-liverpool-fc-h2h-stats#...,9,38,True,True,True,True,True,True,True,"['31', '64', '75']","['45+1', '49', '56', '63']",2016-08-14,1.74757,0.852146,1.74757,0.801401,13.4,4.4,1.866667,0.933333,1.93294,0.946168,179.0,25.066667,299.4,28.133333,1.579041,0.961499,1.528296,0.90596,9.6,3.733333,2.2,1.2,1.796947,0.622336,196.333333,26.2,290.266667,26.6
9,2164,152,153,2016/2017,complete,19,1,-1,"['47', '89']",['77'],2,1,3,7,1,8,1,1,5,2,0,0,5,2,7,2,12,4,15,13,57,43,461.0,209.0,210.0,Stamford Bridge (London),,5,2,1.62,4.01,6.35,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,0,0,2,1,3,0,2016-08-15 19:00:00,152,0,0,0,0,0,-1,-1,1,-1,-1,-1,-1,-1,-1,2,1,3,1,3,4,-1,0,0,0,0,0.0,0.0,0.0,1,0,1,0,0,0,1,1,0,0,-1,-1,1,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/chelsea-fc-152,teams/england-chelsea-fc.png,Chelsea,/clubs/west-ham-united-fc-153,teams/england-west-ham-united-fc.png,West Ham United,2.68,1.05,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/chelsea-fc-vs-west-ham-united-fc-h2h-...,9,38,True,True,True,False,False,False,True,"['47', '89']",['77'],2016-08-15,1.418148,1.203183,1.265914,1.152438,10.266667,4.933333,1.8,1.266667,1.54736,0.113476,230.733333,23.0,291.2,30.266667,1.736603,1.331783,1.584369,1.128806,7.133333,5.666667,1.933333,1.533333,1.54868,0.455564,204.133333,26.066667,229.8,24.333333


In [73]:
# inspect the last rows
match_data.tail()

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,dat
e_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_all
owed.def_away
3795,6689328,143,158,2023/2024,incomplete,100543,38,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,Selhurst Park (London),,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-19 15:00:00,-1,0,65,25,35,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,20,35,60,80,90,40,60,90,50,12.2,1.8,4.9,2.9,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,/clubs/aston-villa-fc-158,teams/england-aston-villa-fc.png,Aston Villa,0.9,1.4,0.9,1.4,1.05,2.1,80,65,40,20,10,85,75,55,1.3,1.48,2.78,/england/crystal-palace-fc-vs-aston-villa-fc-h...,9660,20,False,False,False,False,False,False,False,[],[],2024-05-19,1.279368,1.826222,1.127135,1.775478,5.733333,9.866667,1.066667,1.8,1.08794,-0.648343,319.466667,21.666667,209.466667,23.8,1.882485,1.209184,1.780995,1.209184,10.8,5.733333,2.066667,1.133333,1.834047,0.571811,201.733333,19.066667,267.933333,26.466667
3796,6689329,151,223,2023/2024,incomplete,100543,38,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,Anfield (Liverpool),,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-19 15:00:00,-1,0,60,30,30,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,25,35,70,85,95,50,60,90,50,11.0,4.2,4.3,3.25,/clubs/liverpool-fc-151,teams/england-liverpool-fc.png,Liverpool,/clubs/wolverhampton-wanderers-fc-223,teams/england-wolverhampton-wanderers-fc.png,Wolverhampton Wanderers,2.6,1.0,2.6,1.0,2.25,1.4,75,65,30,15,5,70,65,50,2.28,1.27,3.55,/england/liverpool-fc-vs-wolverhampton-wandere...,9660,20,False,False,False,False,False,False,False,[],[],2024-05-19,2.18217,1.205389,1.970823,1.154645,12.733333,6.0,2.133333,0.866667,1.883913,0.816178,175.866667,25.2,321.733333,20.333333,1.42952,1.515928,1.378775,1.312952,5.733333,8.666667,1.6,1.266667,1.313507,0.065823,263.6,20.4,252.933333,25.666667
3797,6689330,271,162,2023/2024,incomplete,100543,38,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,"Kenilworth Road (Luton, Bedfordshire)",,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-19 15:00:00,-1,0,65,15,50,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,20,30,60,80,95,25,65,90,70,9.5,3.7,5.0,3.1,/clubs/luton-town-fc-271,teams/england-luton-town-fc.png,Luton Town,/clubs/fulham-fc-162,teams/england-fulham-fc.png,Fulham,0.8,0.6,0.8,0.6,0.79,1.2,80,70,40,20,5,75,55,50,1.36,1.11,2.47,/england/fulham-fc-vs-luton-town-fc-h2h-stats,9660,19,False,False,False,False,False,False,False,[],[],2024-05-19,1.077711,2.475337,1.077711,2.475337,3.466667,9.866667,1.4,1.8,0.6425,-1.397626,279.733333,20.733333,156.266667,20.933333,1.331388,1.649206,1.229898,1.496973,7.733333,8.266667,1.533333,1.733333,1.235193,-0.267074,303.333333,23.6,287.8,22.533333
3798,6689331,93,153,2023/2024,incomplete,100543,38,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,Etihad Stadium (Manchester),,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-19 15:00:00,-1,0,74,21,42,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,32,47,68,95,100,43,90,89,67,11.71,3.43,4.31,3.64,/clubs/manchester-city-fc-93,teams/england-manchester-city-fc.png,Manchester City,/clubs/west-ham-united-fc-153,teams/england-west-ham-united-fc.png,West Ham United,2.33,1.6,2.33,1.6,2.11,1.7,69,53,32,6,0,79,63,63,1.67,1.21,2.88,/england/manchester-city-fc-vs-west-ham-united...,9660,19,False,False,False,False,False,False,False,[],[],2024-05-19,2.114357,1.069918,1.962123,0.968428,12.066667,4.733333,2.266667,1.333333,2.02972,0.993695,224.733333,19.0,329.2,17.2,1.468236,1.619438,1.366747,1.467204,4.6,9.866667,1.6,1.466667,1.260753,-0.100457,312.2,18.333333,230.533333,21.666667
3799,6689332,251,92,2023/2024,incomplete,100543,38,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,Bramall Lane (Sheffield),,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-19 15:00:00,-1,0,65,30,45,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,25,45,70,85,100,45,75,90,55,8.4,3.6,6.6,3.6,/clubs/sheffield-united-fc-251,teams/england-sheffield-united-fc.png,Sheffield United,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,0.7,1.8,0.7,1.8,0.45,1.95,75,55,30,15,0,80,70,55,1.14,1.47,2.61,/england/tottenham-hotspur-fc-vs-sheffield-uni...,9660,20,False,False,False,False,False,False,False,[],[],2024-05-19,1.016014,2.017237,0.863782,1.915748,4.6,8.866667,0.8,2.266667,0.807747,-1.051965,321.866667,20.866667,158.133333,17.4,1.731368,1.863185,1.680624,1.710952,12.533333,7.133333,1.933333,1.6,1.37046,-0.030328,204.0,24.0,324.0,28.666667


In [75]:
# inspect random sample of rows
match_data.sample(5)

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,dat
e_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_all
owed.def_away
924,3079,146,144,2014/2015,complete,21,17,-1,"['38', '65', '82']",[],3,0,3,4,6,10,3,2,1,1,0,0,2,4,7,7,9,11,12,10,41,59,,201.0,221.0,St. Mary's Stadium,"Entebbe Road, Kitende",1,1,2.14,3.53,3.71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,3,1,0,2,0,2,1,2014-12-20 15:00:00,146,0,50,26,26,0,31475,-1,1,-1,-1,-1,-1,-1,-1,0,0,1,1,0,2,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,7,25,57,82,94,38,63,75,51,10.38,5.63,3.13,2.82,/clubs/southampton-fc-146,teams/england-southampton-fc.png,Southampton,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,1.95,1.0,2.0,1.13,1.63,1.31,94,75,44,19,7,82,50,32,0.0,0.0,0.0,/england/everton-fc-vs-southampton-fc-h2h-stat...,11,38,True,True,True,False,False,False,False,"['38', '65', '82']",[],2014-12-20,1.491747,0.916412,1.441002,0.916412,7.666667,3.533333,1.6,0.733333,1.78334,0.52459,217.4,25.933333,293.933333,18.6,1.293094,1.182744,1.140861,1.081255,6.933333,5.6,1.666667,1.466667,1.4887,0.059606,201.133333,20.733333,320.666667,27.866667
2983,1308591,149,159,2021/2022,complete,72035,33,-1,"['7', '32', '76']","['45+1', '52']",3,2,5,6,7,13,3,0,0,0,0,0,10,5,11,11,21,16,8,6,61,39,690.0,1283.0,404.0,Old Trafford (Manchester),,0,0,1.24,6.5,13.5,1.02,1.17,1.55,2.4,4.2,19.0,5.4,2.56,1.6,1.25,2.05,1.68,1.78,1.93,9.75,1.05,1.04,1.12,4.2,1.61,2.7,8.75,1.44,3.4,8.5,0,0,1.11,1.25,1.53,1.85,2.29,5.3,3.55,2.5,1.98,1.63,1.11,10.75,8.5,1.25,20.5,4.45,1.91,19.0,1.25,2.1,4.6,11.5,3.5,1.63,1.17,1.04,1.14,1.68,3.1,6.5,5.25,2.05,1.33,1.09,0,0,0,0,5,2,1,1,1,2,3,2022-04-16 14:00:00,149,0,39,10,26,0,73381,1,-1,2,4,4,3,6,7,0,0,0,0,0,0,1,82,37,120,81,2.61,1.63,4.24,0,0,0,0,0,0,1,1,1,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,22,29,46,75,91,29,58,81,43,8.58,4.76,3.6,2.71,/clubs/manchester-united-fc-149,teams/england-manchester-united-fc.png,Manchester United,/clubs/norwich-city-fc-159,teams/england-norwich-city-fc.png,Norwich City,1.84,0.53,1.75,0.6,1.65,0.68,78,72,55,26,10,71,62,46,1.88,1.11,2.99,/england/manchester-united-fc-vs-norwich-city-...,6135,38,True,True,True,True,True,False,True,"['7', '32', '76']","['45+1', '52']",2022-04-16,1.676577,1.188066,1.676577,1.137321,8.066667,6.4,1.533333,1.2,1.730867,0.539255,242.266667,19.8,261.666667,23.266667,0.705055,2.3456,0.654311,1.97538,4.333333,11.0,0.8,2.066667,0.47706,-1.321069,268.866667,17.866667,176.933333,22.266667
1953,579054,148,153,2019/2020,complete,50055,7,-1,"['17', '46']","['10', '74']",2,2,4,6,6,12,2,2,3,1,0,0,5,7,8,11,13,18,7,8,49,51,705.0,207.0,225.0,"Vitality Stadium (Bournemouth, Dorset)",,3,1,2.4,3.65,2.9,1.01,1.15,1.5,2.25,3.9,15.75,5.4,2.55,1.65,1.26,1.42,2.6,4.33,1.2,5.0,1.16,1.4,1.28,1.53,2.87,2.4,3.2,2.5,2.75,2.87,0,0,1.19,1.37,1.62,2.0,2.6,4.3,3.0,2.22,1.72,1.47,1.9,8.1,2.36,1.8,15.75,2.0,5.5,6.5,1.26,2.3,5.25,12.0,3.5,1.56,1.13,1.01,1.17,1.73,3.25,6.5,4.5,2.0,1.33,1.1,0,0,0,0,4,1,1,1,1,2,2,2019-09-28 14:00:00,-1,0,84,50,33,0,10729,1,1,3,5,3,1,8,4,1,0,2,1,1,3,1,53,48,115,106,1.56,1.97,3.54,0,0,0,0,0,0,1,1,0,1,0,1,0,0,1,2,2,-1,-1,-1,-1,-1,-1,0,50,50,84,84,50,50,84,67,10.34,5.33,3.0,2.67,/clubs/afc-bournemouth-148,teams/england-afc-bournemouth.png,AFC Bournemouth,/clubs/west-ham-united-fc-153,teams/england-west-ham-united-fc.png,West Ham United,1.11,0.89,1.33,1.67,1.67,1.83,100,50,50,17,17,67,50,50,1.4,1.48,2.88,/england/afc-bournemouth-vs-west-ham-united-fc...,2012,38,True,True,True,True,False,False,True,"['17', '46']","['10', '74']",2019-09-28,1.750837,1.738917,1.649348,1.637428,7.333333,8.133333,1.866667,1.733333,1.377073,0.01192,252.733333,19.933333,216.733333,24.733333,1.376295,1.94081,1.224066,1.78858,6.866667,8.733333,1.533333,1.4,1.088913,-0.564513,239.266667,22.066667,236.733333,24.133333
1151,48978,148,155,2017/2018,complete,243,2,-1,[],"['73', '86']",0,2,2,8,5,13,1,7,1,3,0,0,3,8,1,7,4,15,6,14,55,45,697.0,207.0,214.0,Vitality Stadium,"Dean Court, Kings Park, Bournemouth, Dorset",1,3,1.98,3.66,4.13,1.05,1.25,1.8,2.7,4.75,7.5,3.55,2.0,1.4,1.15,1.67,2.1,2.75,1.4,4.5,1.18,1.18,1.22,1.74,2.51,2.24,4.33,2.23,2.55,3.9,0,0,0.0,0.0,0.0,1.83,0.0,0.0,0.0,0.0,1.83,0.0,0.0,0.0,0.0,1.7,7.5,2.35,3.6,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,2.83,6.5,13.0,2.96,1.43,1.04,1.01,0,0,0,0,2,0,0,0,2,2,0,2017-08-19 14:00:00,155,0,0,0,0,0,10501,1,1,3,1,5,4,4,9,0,1,1,2,1,3,1,74,74,96,84,1.0,2.03,3.04,0,0,0,0,0,0,1,1,0,0,2,0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,/clubs/afc-bournemouth-148,teams/england-afc-bournemouth.png,AFC Bournemouth,/clubs/watford-fc-155,teams/england-watford-fc.png,Watford,1.37,0.74,0.0,0.0,0.0,1.0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,/england/afc-bournemouth-vs-watford-fc-h2h-sta...,161,38,True,True,False,False,False,False,False,[],"['73', '86']",2017-08-19,1.318075,1.711959,1.064351,1.661214,6.066667,8.333333,1.333333,1.4,1.120807,-0.596863,237.6,19.6,229.666667,23.533333,1.015234,1.475414,0.964489,1.42467,4.0,6.933333,0.933333,2.066667,1.0496,-0.46018,232.6,19.933333,228.466667,22.133333
1866,454219,93,92,2018/2019,complete,3547,35,-1,['5'],[],1,0,1,4,4,8,2,2,1,2,0,0,5,5,6,3,11,8,11,11,60,40,393.0,72.0,156.0,Etihad Stadium (Manchester),,1,2,1.25,6.55,11.25,1.01,1.12,1.42,2.05,3.4,18.0,6.25,2.9,1.77,1.33,1.77,2.0,2.1,1.66,13.0,1.04,1.07,1.12,3.75,1.68,3.15,8.44,1.44,3.6,7.5,0,0,1.14,1.3,1.57,1.83,2.55,4.8,3.2,2.25,1.83,1.45,1.12,15.0,5.8,1.2,21.0,4.33,2.25,20.0,1.28,2.25,5.2,15.0,3.6,1.61,1.17,1.04,1.22,2.01,4.33,11.0,4.0,1.89,1.2,1.05,0,0,0,0,1,1,0,0,0,0,1,2019-04-20 11:30:00,93,0,59,27,32,0,54489,1,1,4,2,0,2,6,2,0,0,1,2,0,3,1,72,28,140,71,1.58,1.06,2.64,0,0,0,0,0,0,1,1,1,0,0,1,0,0,1,32,15,1,13,13,1,4,13,24,39,77,91,100,47,80,85,62,13.0,3.64,2.52,3.56,/clubs/manchester-city-fc-93,teams/england-manchester-city-fc.png,Manchester City,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.84,1.74,2.82,1.94,2.52,2.03,77,62,24,9,0,59,59,56,2.28,1.4,3.68,/england/tottenham-hotspur-fc-vs-manchester-ci...,1625,38,True,False,False,False,False,False,False,['5'],[],2019-04-20,2.614689,0.576898,2.462455,0.526153,16.333333,3.6,2.4,0.6,2.508173,1.936302,172.466667,21.866667,363.666667,13.333333,1.621478,1.279277,1.570734,1.177787,8.2,4.8,1.8,1.066667,1.597013,0.392946,176.8,20.066667,354.266667,20.0


In [77]:
# check for null values
match_data.isnull().sum().to_dict()

{'id': 0,
 'homeID': 0,
 'awayID': 0,
 'season': 0,
 'status': 0,
 'roundID': 0,
 'game_week': 0,
 'revised_game_week': 0,
 'homeGoals': 0,
 'awayGoals': 0,
 'homeGoalCount': 0,
 'awayGoalCount': 0,
 'totalGoalCount': 0,
 'team_a_corners': 0,
 'team_b_corners': 0,
 'totalCornerCount': 0,
 'team_a_offsides': 0,
 'team_b_offsides': 0,
 'team_a_yellow_cards': 0,
 'team_b_yellow_cards': 0,
 'team_a_red_cards': 0,
 'team_b_red_cards': 0,
 'team_a_shotsOnTarget': 0,
 'team_b_shotsOnTarget': 0,
 'team_a_shotsOffTarget': 0,
 'team_b_shotsOffTarget': 0,
 'team_a_shots': 0,
 'team_b_shots': 0,
 'team_a_fouls': 0,
 'team_b_fouls': 0,
 'team_a_possession': 0,
 'team_b_possession': 0,
 'refereeID': 1280,
 'coach_a_ID': 11,
 'coach_b_ID': 12,
 'stadium_name': 0,
 'stadium_location': 2148,
 'team_a_cards_num': 0,
 'team_b_cards_num': 0,
 'odds_ft_1': 0,
 'odds_ft_x': 0,
 'odds_ft_2': 0,
 'odds_ft_over05': 0,
 'odds_ft_over15': 0,
 'odds_ft_over25': 0,
 'odds_ft_over35': 0,
 'odds_ft_over45': 0,
 'odd

### Columns of interest:
These are the columns that were found to contain interesting, problematic, or missing data and that warranted further inspection.

#### Null values
Most columns had no null values. The main exceptions were `stadium_location` and `refereeID`, with 2,148 and 1,280 null values respectively. `coach_a_ID` and `coach_b_ID` had a small number of nulls as well (11 and 12).
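A minimal sketch of how the null counts can be narrowed down to only the affected columns, which is easier to read than the full dictionary. It uses a small toy frame standing in for `match_data`; the values in it are made up for illustration.

```python
import pandas as pd

# Toy stand-in for match_data (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "refereeID": [690.0, None, 705.0, None],
    "stadium_location": [None, None, None, "Dean Court"],
})

# Count nulls per column, then keep only the columns that have at least one
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(cols_with_nulls)
```

Applied to the real dataframe, the same two lines would list only the columns that need attention.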

#### Status Column
The first column of interest is the `status` column, which indicates whether a match was completed. Looking at the head, tail and random sample snapshots of the dataframe, the status of a match appears to be either complete or incomplete. Let's inspect the unique values in the column:

In [78]:
# check unique values in the status column
match_data['status'].unique()

array(['complete', 'incomplete', 'suspended'], dtype=object)

Based on the output above, match status can be `complete`, `incomplete`, or `suspended`. The dataframe snapshots, particularly the tail, show that most of the incomplete matches are from the 2023/2024 season. This makes sense because the 2023/2024 season is ongoing and there are bound to be incomplete matches. For our analysis we will drop entries where the match status is `suspended` or `incomplete`, as these rows will not be useful during model training.

In [80]:
# Check list of incomplete matches
match_data[match_data['status'] == 'incomplete' ].sample(5)

Unnamed: 0,id,homeID,awayID,season,status,roundID,game_week,revised_game_week,homeGoals,awayGoals,homeGoalCount,awayGoalCount,totalGoalCount,team_a_corners,team_b_corners,totalCornerCount,team_a_offsides,team_b_offsides,team_a_yellow_cards,team_b_yellow_cards,team_a_red_cards,team_b_red_cards,team_a_shotsOnTarget,team_b_shotsOnTarget,team_a_shotsOffTarget,team_b_shotsOffTarget,team_a_shots,team_b_shots,team_a_fouls,team_b_fouls,team_a_possession,team_b_possession,refereeID,coach_a_ID,coach_b_ID,stadium_name,stadium_location,team_a_cards_num,team_b_cards_num,odds_ft_1,odds_ft_x,odds_ft_2,odds_ft_over05,odds_ft_over15,odds_ft_over25,odds_ft_over35,odds_ft_over45,odds_ft_under05,odds_ft_under15,odds_ft_under25,odds_ft_under35,odds_ft_under45,odds_btts_yes,odds_btts_no,odds_team_a_cs_yes,odds_team_a_cs_no,odds_team_b_cs_yes,odds_team_b_cs_no,odds_doublechance_1x,odds_doublechance_12,odds_doublechance_x2,odds_1st_half_result_1,odds_1st_half_result_x,odds_1st_half_result_2,odds_2nd_half_result_1,odds_2nd_half_result_x,odds_2nd_half_result_2,odds_dnb_1,odds_dnb_2,odds_corners_over_75,odds_corners_over_85,odds_corners_over_95,odds_corners_over_105,odds_corners_over_115,odds_corners_under_75,odds_corners_under_85,odds_corners_under_95,odds_corners_under_105,odds_corners_under_115,odds_corners_1,odds_corners_x,odds_corners_2,odds_team_to_score_first_1,odds_team_to_score_first_x,odds_team_to_score_first_2,odds_win_to_nil_1,odds_win_to_nil_2,odds_1st_half_over05,odds_1st_half_over15,odds_1st_half_over25,odds_1st_half_over35,odds_1st_half_under05,odds_1st_half_under15,odds_1st_half_under25,odds_1st_half_under35,odds_2nd_half_over05,odds_2nd_half_over15,odds_2nd_half_over25,odds_2nd_half_over35,odds_2nd_half_under05,odds_2nd_half_under15,odds_2nd_half_under25,odds_2nd_half_under35,odds_btts_1st_half_yes,odds_btts_1st_half_no,odds_btts_2nd_half_yes,odds_btts_2nd_half_no,overallGoalCount,ht_goals_team_a,ht_goals_team_b,goals_2hg_team_a,goals_2hg_team_b,GoalCount_2hg,HTGoalCount,dat
e_unix,winningTeam,no_home_away,btts_potential,btts_fhg_potential,btts_2hg_potential,goalTimingDisabled,attendance,corner_timings_recorded,card_timings_recorded,team_a_fh_corners,team_b_fh_corners,team_a_2h_corners,team_b_2h_corners,corner_fh_count,corner_2h_count,team_a_fh_cards,team_b_fh_cards,team_a_2h_cards,team_b_2h_cards,total_fh_cards,total_2h_cards,attacks_recorded,team_a_dangerous_attacks,team_b_dangerous_attacks,team_a_attacks,team_b_attacks,team_a_xg,team_b_xg,total_xg,team_a_penalties_won,team_b_penalties_won,team_a_penalty_goals,team_b_penalty_goals,team_a_penalty_missed,team_b_penalty_missed,pens_recorded,goal_timings_recorded,team_a_0_10_min_goals,team_b_0_10_min_goals,team_a_corners_0_10_min,team_b_corners_0_10_min,team_a_cards_0_10_min,team_b_cards_0_10_min,throwins_recorded,team_a_throwins,team_b_throwins,freekicks_recorded,team_a_freekicks,team_b_freekicks,goalkicks_recorded,team_a_goalkicks,team_b_goalkicks,o45_potential,o35_potential,o25_potential,o15_potential,o05_potential,o15HT_potential,o05HT_potential,o05_2H_potential,o15_2H_potential,corners_potential,offsides_potential,cards_potential,avg_potential,home_url,home_image,home_name,away_url,away_image,away_name,home_ppg,away_ppg,pre_match_home_ppg,pre_match_away_ppg,pre_match_teamA_overall_ppg,pre_match_teamB_overall_ppg,u45_potential,u35_potential,u25_potential,u15_potential,u05_potential,corners_o85_potential,corners_o95_potential,corners_o105_potential,team_a_xg_prematch,team_b_xg_prematch,total_xg_prematch,match_url,competition_id,matches_completed_minimum,over05,over15,over25,over35,over45,over55,btts,homeGoals_timings,awayGoals_timings,date,xG_home,xGA_home,npxG_home,npxGA_home,deep_home,deep_allowed_home,scored_home,missed_home,xpts_home,npxGD_home,ppda.att_home,ppda.def_home,ppda_allowed.att_home,ppda_allowed.def_home,xG_away,xGA_away,npxG_away,npxGA_away,deep_away,deep_allowed_away,scored_away,missed_away,xpts_away,npxGD_away,ppda.att_away,ppda.def_away,ppda_allowed.att_away,ppda_all
owed.def_away
3717,6689250,211,143,2023/2024,incomplete,100543,30,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,"The City Ground (Nottingham, Nottinghamshire)","Pavilion Road, Nottingham, Nottinghamshire",-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-03-30 15:00:00,-1,0,65,10,45,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,10,30,50,85,100,15,60,95,45,7.7,3.6,3.7,2.75,/clubs/nottingham-forest-fc-211,teams/england-nottingham-forest-fc.png,Nottingham Forest,/clubs/crystal-palace-fc-143,teams/england-crystal-palace-fc.png,Crystal Palace,1.2,1.2,1.2,1.2,1.0,1.05,90,70,50,15,0,70,55,35,1.27,1.19,2.46,/england/crystal-palace-fc-vs-nottingham-fores...,9660,20,False,False,False,False,False,False,False,[],[],2024-03-30,1.324044,1.678484,1.273299,1.576999,5.533333,9.533333,1.266667,1.933333,1.16372,-0.3037,297.533333,17.466667,211.333333,19.866667,1.279368,1.826222,1.127135,1.775478,5.733333,9.866667,1.066667,1.8,1.08794,-0.648343,319.466667,21.666667,209.466667,23.8
3692,6689225,158,92,2023/2024,incomplete,100543,28,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,Villa Park (Birmingham),,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-03-09 15:00:00,-1,0,70,25,60,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,30,60,70,85,100,45,75,90,70,9.7,2.8,6.3,3.75,/clubs/aston-villa-fc-158,teams/england-aston-villa-fc.png,Aston Villa,/clubs/tottenham-hotspur-fc-92,teams/england-tottenham-hotspur-fc.png,Tottenham Hotspur,2.8,1.8,2.8,1.8,2.1,1.95,70,40,30,15,0,65,60,45,1.63,1.47,3.1,/england/tottenham-hotspur-fc-vs-aston-villa-f...,9660,20,False,False,False,False,False,False,False,[],[],2024-03-09,1.882485,1.209184,1.780995,1.209184,10.8,5.733333,2.066667,1.133333,1.834047,0.571811,201.733333,19.066667,267.933333,26.466667,1.731368,1.863185,1.680624,1.710952,12.533333,7.133333,1.933333,1.6,1.37046,-0.030328,204.0,24.0,324.0,28.666667
3673,6689206,209,144,2023/2024,incomplete,100543,26,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,The American Express Community Stadium (Falmer...,,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-02-24 15:00:00,-1,0,70,30,25,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,15,50,65,90,100,50,85,95,50,12.0,3.6,4.6,3.25,/clubs/brighton-hove-albion-fc-209,teams/england-brighton-hove-albion-fc.png,Brighton & Hove Albion,/clubs/everton-fc-144,teams/england-everton-fc.png,Everton,1.9,1.6,1.9,1.6,1.55,1.3,85,50,35,10,0,70,60,40,2.0,1.32,3.32,/england/everton-fc-vs-brighton-hove-albion-fc...,9660,20,False,False,False,False,False,False,False,[],[],2024-02-24,1.581373,1.478245,1.429144,1.275267,10.666667,9.933333,1.333333,1.666667,1.42216,0.153877,213.4,21.866667,390.933333,21.8,1.616159,1.421092,1.616159,1.218119,6.333333,8.6,1.266667,1.2,1.50286,0.398039,344.333333,26.333333,161.666667,17.866667
3785,6689318,157,209,2023/2024,incomplete,100543,37,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,St. James' Park (Newcastle upon Tyne),"St. James' Street, Newcastle upon Tyne",-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-05-11 14:00:00,-1,0,60,25,25,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,30,45,60,80,95,40,70,95,55,8.1,3.3,5.7,3.25,/clubs/newcastle-united-fc-157,teams/england-newcastle-united-fc.png,Newcastle United,/clubs/brighton-hove-albion-fc-209,teams/england-brighton-hove-albion-fc.png,Brighton & Hove Albion,2.4,1.2,2.4,1.2,1.45,1.55,70,55,40,20,5,55,50,40,1.86,1.48,3.34,/england/newcastle-united-fc-vs-brighton-hove-...,9660,20,False,False,False,False,False,False,False,[],[],2024-05-11,2.07619,2.024867,1.923956,1.864258,9.066667,7.733333,1.666667,1.666667,1.587027,0.059698,212.066667,22.0,273.666667,24.0,1.581373,1.478245,1.429144,1.275267,10.666667,9.933333,1.333333,1.666667,1.42216,0.153877,213.4,21.866667,390.933333,21.8
3670,6689203,148,93,2023/2024,incomplete,100543,26,-1,[],[],0,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,-1.0,-1.0,"Vitality Stadium (Bournemouth, Dorset)",,-1,-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,2024-02-24 17:30:00,-1,0,52,16,26,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,0,0,0,0.0,0.0,0.0,0,0,0,0,0,0,1,1,0,0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,5,26,68,85,95,27,69,95,69,12.29,2.67,3.58,2.93,/clubs/afc-bournemouth-148,teams/england-afc-bournemouth.png,AFC Bournemouth,/clubs/manchester-city-fc-93,teams/england-manchester-city-fc.png,Manchester City,1.33,1.9,1.33,1.9,1.32,2.11,95,74,32,16,6,74,64,64,1.41,2.0,3.41,/england/manchester-city-fc-vs-afc-bournemouth...,9660,19,False,False,False,False,False,False,False,[],[],2024-02-24,1.640816,1.612377,1.590072,1.510889,8.466667,9.733333,1.6,2.066667,1.474413,0.079183,273.066667,26.133333,196.2,23.0,2.114357,1.069918,1.962123,0.968428,12.066667,4.733333,2.266667,1.333333,2.02972,0.993695,224.733333,19.0,329.2,17.2


In [81]:
# Check seasons from which matches are incomplete
match_data[match_data['status'] == 'incomplete' ]['season'].unique()

array(['2023/2024'], dtype=object)

The above inspection confirms that all matches with the `incomplete` status are from the current (2023/2024) season. The `suspended` matches were not covered by this check.
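The check above can be extended to every status at once, followed by the planned filtering step. This is a sketch on a toy frame with made-up rows; only the column names `status` and `season` and the three status values come from the dataset.

```python
import pandas as pd

# Toy stand-in for match_data (made-up rows for illustration only)
toy = pd.DataFrame({
    "season": ["2022/2023", "2023/2024", "2023/2024", "2021/2022"],
    "status": ["complete", "incomplete", "complete", "suspended"],
})

# Seasons in which each status value occurs
seasons_by_status = toy.groupby("status")["season"].unique()
print(seasons_by_status)

# Keep only completed matches for model training
completed = toy[toy["status"] == "complete"].reset_index(drop=True)
print(completed)
```

Running the `groupby` on the real data would show which seasons the `suspended` matches belong to before they are dropped.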

#### game_week and revised_game_week columns
The `revised_game_week` column indicates whether the game week (the week of the season in which the match was played) was revised. In the inspected sections of the dataframe, no matches appear to have revised game weeks.

In [83]:
# check for unique values in the 'revised_game_week' column
match_data['revised_game_week'].unique()

array([-1], dtype=int64)

`-1` is the only unique value in the `revised_game_week` column, so the column carries no useful information and will be dropped.

As mentioned before, the `game_week` column indicates the week of the season in which the match was played. Typically, EPL games run from game week 1 to game week 38 (assuming each team plays a single EPL game every week, which is usually the case).

In [85]:
# check for unique values in the 'game_week' column
match_data['game_week'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38], dtype=int64)

The above output confirms that there are 38 game weeks in a season, which matches the number of match weeks in the current Premier League format.
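The unique values above are pooled across all seasons. A per-season count is one way to see whether every season individually covers the full range of game weeks. The sketch below uses a toy frame with illustrative values; `toy_weeks` and `weeks_per_season` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for match_data; the rows are illustrative only.
toy_weeks = pd.DataFrame({
    "season": ["2021/2022", "2021/2022", "2021/2022", "2022/2023", "2022/2023"],
    "game_week": [1, 2, 2, 1, 2],
})

# Number of distinct game weeks recorded in each season
weeks_per_season = toy_weeks.groupby("season")["game_week"].nunique()
print(weeks_per_season)
```

On the real data, a completed season would be expected to show 38 here.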

#### homeGoals and awayGoals columns
These columns consist of arrays that store goal timing data (when goals were scored during the match). Upon closer scrutiny of the dataset, it was observed that the columns `homeGoals` and `homeGoals_timings` contained identical information, as did `awayGoals` and `awayGoals_timings`. To eliminate redundancy and retain the more descriptive columns, it makes sense to remove the `homeGoals` and `awayGoals` columns.

In [86]:
match_data[['homeGoals', 'awayGoals', 'homeGoals_timings', 'awayGoals_timings' ]].head()

Unnamed: 0,homeGoals,awayGoals,homeGoals_timings,awayGoals_timings
0,"['45+1', '57']",['47'],"['45+1', '57']",['47']
1,[],['82'],[],['82']
2,[],['74'],[],['74']
3,['5'],['59'],['5'],['59']
4,['11'],['67'],['11'],['67']
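The first five rows only suggest that the paired columns match. A sketch of a check over every row follows, assuming the cells are stored as strings as in the snapshot above. The toy frame and the names `toy_goals`, `home_identical` and `away_identical` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the four goal columns; values mirror the snapshot format.
toy_goals = pd.DataFrame({
    "homeGoals": ["['45+1', '57']", "[]"],
    "homeGoals_timings": ["['45+1', '57']", "[]"],
    "awayGoals": ["['47']", "['82']"],
    "awayGoals_timings": ["['47']", "['82']"],
})

# Element-wise comparison across all rows, reduced to a single boolean
home_identical = (toy_goals["homeGoals"] == toy_goals["homeGoals_timings"]).all()
away_identical = (toy_goals["awayGoals"] == toy_goals["awayGoals_timings"]).all()
print(home_identical, away_identical)
```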


#### odds columns
The dataframe snapshots (head, tail, sample) revealed that the values of most of the odds columns, with the exception of the `odds_ft_1`, `odds_ft_x`, and `odds_ft_2` columns, were zeros. Odds reflect the probability of a given match outcome and are therefore not expected to be zero. Given that these zeros essentially represent missing data, we will drop the affected columns from the dataset.

In [105]:
# get list of odds columns 
list_of_columns = list(match_data.columns)

# odds columns
odds_columns = [column for column in list_of_columns if column.startswith('odds')]


In [113]:
# compute the fraction of zero values for each column and store the results in a dictionary
zero_counts = {}
for column in odds_columns:
    zero_counts[column] = (match_data[column] == 0).sum() / len(match_data)

In [114]:
zero_counts

{'odds_ft_1': 0.049736842105263156,
 'odds_ft_x': 0.049736842105263156,
 'odds_ft_2': 0.049736842105263156,
 'odds_ft_over05': 0.3497368421052632,
 'odds_ft_over15': 0.3497368421052632,
 'odds_ft_over25': 0.34789473684210526,
 'odds_ft_over35': 0.3497368421052632,
 'odds_ft_over45': 0.3497368421052632,
 'odds_ft_under05': 0.3497368421052632,
 'odds_ft_under15': 0.3497368421052632,
 'odds_ft_under25': 0.34842105263157896,
 'odds_ft_under35': 0.3494736842105263,
 'odds_ft_under45': 0.3497368421052632,
 'odds_btts_yes': 0.3497368421052632,
 'odds_btts_no': 0.3497368421052632,
 'odds_team_a_cs_yes': 0.35210526315789475,
 'odds_team_a_cs_no': 0.35210526315789475,
 'odds_team_b_cs_yes': 0.37763157894736843,
 'odds_team_b_cs_no': 0.3771052631578947,
 'odds_doublechance_1x': 0.3531578947368421,
 'odds_doublechance_12': 0.3531578947368421,
 'odds_doublechance_x2': 0.3531578947368421,
 'odds_1st_half_result_1': 0.35157894736842105,
 'odds_1st_half_result_x': 0.3518421052631579,
 'odds_1st_half_r

The output above reveals that about 5% of entries in the `odds_ft_1`, `odds_ft_x`, and `odds_ft_2` columns hold invalid odds data (zeros), while each of the other odds columns shown contains at least about 35% zeros. Therefore, to protect model integrity, all odds columns except `odds_ft_1`, `odds_ft_x`, and `odds_ft_2` will be dropped from the dataset, and entries with invalid data in those three columns will also be dropped.
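A minimal sketch of that plan is given here on a toy frame (illustrative values only). The names `toy_odds`, `keep_odds`, `other_odds` and `cleaned_odds` are hypothetical, and the sketch is not the notebook's actual cleaning cell.

```python
import pandas as pd

# Toy stand-in for match_data; a zero marks missing odds.
toy_odds = pd.DataFrame({
    "odds_ft_1": [2.4, 0.0, 1.33],
    "odds_ft_x": [3.3, 0.0, 5.7],
    "odds_ft_2": [2.9, 0.0, 8.1],
    "odds_btts_yes": [0.0, 0.0, 1.8],
})

keep_odds = ["odds_ft_1", "odds_ft_x", "odds_ft_2"]

# Every other column whose name starts with 'odds' is removed
other_odds = [c for c in toy_odds.columns if c.startswith("odds") and c not in keep_odds]
cleaned_odds = toy_odds.drop(columns=other_odds)

# Keep only rows where all three full-time odds are non-zero
valid_rows = (cleaned_odds[keep_odds] != 0).all(axis=1)
cleaned_odds = cleaned_odds[valid_rows].reset_index(drop=True)
print(cleaned_odds)
```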

#### winning_team column
Inspection revealed that this column identifies the winning team by its team ID and encodes draws as `-1`. For EDA, it might be useful to create a column that indicates home wins with `1`, draws with `x`, and away wins with `2`, which is the industry standard.
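One way such a column could be derived is sketched here, assuming `winningTeam` holds either `homeID`, `awayID`, or `-1`. The output column name `result` and the toy frame are hypothetical. Unplayed matches may also carry `-1` in `winningTeam`, so they would need to be removed before this encoding is applied to the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for match_data; IDs are illustrative only.
toy_results = pd.DataFrame({
    "homeID": [157, 148, 93],
    "awayID": [209, 93, 157],
    "winningTeam": [157, -1, 157],
})

conditions = [
    toy_results["winningTeam"] == toy_results["homeID"],  # home win
    toy_results["winningTeam"] == -1,                      # draw
    toy_results["winningTeam"] == toy_results["awayID"],  # away win
]

# 1 = home win, x = draw, 2 = away win; anything else is flagged for review
toy_results["result"] = np.select(conditions, ["1", "x", "2"], default="unknown")
print(toy_results)
```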

#### no_home_away column
We observed that the `no_home_away` column appears to consist mostly of 0s, which makes sense as this would only be 1 if a match was played on neutral ground. This rarely happens in the EPL.

In [116]:
match_data['no_home_away'].unique()

array([0], dtype=int64)

The above output confirms that no games were played on neutral ground and thus this column can be dropped.

#### Attendance column
In this column we noted that zero attendance or missing attendance was encoded with -1. 

In [118]:
match_data[match_data['attendance']==-1]['attendance'].count()

1046

In [119]:
match_data[match_data['attendance']==-1].groupby('season')['attendance'].count()

season
2014/2015     16
2015/2016     16
2016/2017     16
2020/2021     81
2021/2022    263
2022/2023    321
2023/2024    333
Name: attendance, dtype: int64

Further inspection reveals that, among the seasons considered to have sufficient data, very few matches between `2014` and `2019` are missing attendance information. While missing attendance data was expected for the `2020/2021` season due to COVID-19 restrictions, the missing data in `2021/2022` and `2022/2023` is unusual.

Since attendance data may be useful in predicting match outcomes, missing values could be estimated from past attendance figures, or stadium capacity could be used under the assumption that the stadium was full.
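A hedged sketch of the first option follows: each missing value is filled with the median attendance recorded for the same home team. The median is a swapped-in choice for illustration, not something the notebook prescribes. The toy frame and the column name `attendance_imputed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for match_data; -1 marks missing attendance.
toy_attendance = pd.DataFrame({
    "homeID": [157, 157, 157, 148, 148],
    "attendance": [52000, -1, 50000, 11000, -1],
})

# Turn the -1 sentinel into a proper missing value
attendance = toy_attendance["attendance"].replace(-1, np.nan)

# Median attendance per home team, broadcast back to every row
home_median = attendance.groupby(toy_attendance["homeID"]).transform("median")

# Fill only the missing entries with the home-team median
toy_attendance["attendance_imputed"] = attendance.fillna(home_median)
print(toy_attendance)
```

Because the median here uses all of a team's matches, a prediction pipeline might prefer to compute it from earlier matches only, so that no information from future games enters the feature.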

#### goalTimingDisabled, corner_timings_recorded, & card_timings_recorded columns
These columns consisted mostly of 1s or 0s. The API documentation revealed that these columns indicate whether goal timing, corner timing, or card timing data was captured. This information has little relevance and can be safely dropped.

#### First-half and second-half corner and card columns

This group covers the following columns:

- `team_a_fh_corners`
- `team_b_fh_corners`
- `team_a_2h_corners`
- `team_b_2h_corners`
- `corner_fh_count`
- `corner_2h_count`
- `team_a_fh_cards`
- `team_b_fh_cards`
- `team_a_2h_cards`
- `team_b_2h_cards`
- `total_fh_cards`
- `total_2h_cards`

A review of these columns showed some missing data for the `2016/2017` season.

#### attacks_recorded column
The `attacks_recorded` column tracks whether attack data was captured, encoded as `-1` (attack data not recorded) or `1` (attack data recorded). This is a metadata column and can be dropped, as it does not have any predictive relevance.

In [121]:
match_data['attacks_recorded'].unique()

array([-1,  1], dtype=int64)

In [122]:
match_data['attacks_recorded'].value_counts(normalize=True)

 1    0.742368
-1    0.257632
Name: attacks_recorded, dtype: float64

The output above shows that approximately 26% of matches are missing attack data. The affected columns include:
- `team_a_dangerous_attacks`
- `team_b_dangerous_attacks`
- `team_a_attacks`
- `team_b_attacks`

Given that a large fraction of the attack data is missing, and because this data cannot be imputed, dropping the affected columns is a sensible approach.

#### Potential data columns
Potential data columns were noted to have some missing entries (coded as `0` or `-1`).

In [126]:
# potential columns
potential_columns = [column for column in list_of_columns if column.endswith('potential')]

In [129]:
# compute the fraction of 0 or -1 values for each column and store the results in a dictionary
potential_zero_counts = {}
for column in potential_columns:
    potential_zero_counts[column] = ((match_data[column] == 0) | (match_data[column] == -1)).sum() / len(match_data)

In [130]:
potential_zero_counts

{'btts_potential': 0.07078947368421053,
 'btts_fhg_potential': 0.15,
 'btts_2hg_potential': 0.11105263157894738,
 'o45_potential': 0.20394736842105263,
 'o35_potential': 0.09578947368421052,
 'o25_potential': 0.06657894736842106,
 'o15_potential': 0.056842105263157895,
 'o05_potential': 0.0518421052631579,
 'o15HT_potential': 0.08605263157894737,
 'o05HT_potential': 0.057368421052631575,
 'o05_2H_potential': 0.053421052631578946,
 'o15_2H_potential': 0.07394736842105264,
 'corners_potential': 0.05131578947368421,
 'offsides_potential': 0.05473684210526316,
 'cards_potential': 0.052894736842105265,
 'avg_potential': 0.0518421052631579,
 'u45_potential': 0.0518421052631579,
 'u35_potential': 0.05921052631578947,
 'u25_potential': 0.07631578947368421,
 'u15_potential': 0.1281578947368421,
 'u05_potential': 0.38052631578947366,
 'corners_o85_potential': 0.05973684210526316,
 'corners_o95_potential': 0.065,
 'corners_o105_potential': 0.07447368421052632}

Given that most of the potential data is non-zero, these columns can be retained, and the entries with zeros can be dropped later.
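A sketch of how that later clean-up might look: the `0` and `-1` sentinels are converted to missing values in the potential columns only, and the affected rows are then removed. The toy frame and the names `toy_potential` and `cleaned_potential` are hypothetical. Restricting the replacement to the potential columns matters because a zero in a column such as `homeGoalCount` is a legitimate value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for match_data; values are illustrative only.
toy_potential = pd.DataFrame({
    "btts_potential": [60, 0, 52],
    "o25_potential": [45, -1, 26],
    "homeGoalCount": [0, 2, 1],
})

potential_cols = [c for c in toy_potential.columns if c.endswith("potential")]

# Treat 0 and -1 as missing, but only inside the potential columns
toy_potential[potential_cols] = toy_potential[potential_cols].replace([0, -1], np.nan)

# Drop rows that lack any potential value
cleaned_potential = toy_potential.dropna(subset=potential_cols)
print(cleaned_potential)
```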



## Data Inspection Notes Summary

### Data Types in the Dataset

#### Irrelevant columns

The dataset exhibits a diverse range of data types, encompassing `object`, `float64`, `int64`, and `bool`. Upon closer examination of columns with the `object` data type, several columns have been identified as having limited relevance to the project's objectives. Consequently, the following `object` type columns are slated for removal, as they contribute minimally to the project:

- stadium_name
- stadium_location
- home_url
- home_image
- home_name
- away_url
- away_image
- away_name
- match_url
- date_unix


### Uninformative Columns

Three `int64` type columns, namely `pens_recorded`, `no_home_away` and `revised_game_week`, were identified as having uniform values across all entries, rendering them of little relevance. Consequently, these columns will be dropped from the dataset.
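Columns with uniform values can also be found programmatically rather than by inspection. The following is a small sketch on a toy frame (illustrative values; `toy_uniform` and `constant_columns` are hypothetical names), using `nunique` to flag any column with a single distinct value.

```python
import pandas as pd

# Toy stand-in for match_data; values are illustrative only.
toy_uniform = pd.DataFrame({
    "pens_recorded": [1, 1, 1],
    "no_home_away": [0, 0, 0],
    "revised_game_week": [-1, -1, -1],
    "game_week": [1, 2, 3],
})

# A column is uninformative if it holds at most one distinct value
constant_columns = [
    column for column in toy_uniform.columns
    if toy_uniform[column].nunique(dropna=False) <= 1
]
print(constant_columns)
```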

### Redundant Columns

Upon closer scrutiny of the dataset, it was observed that the columns `homeGoals` and `homeGoals_timings` contained identical information, as did `awayGoals` and `awayGoals_timings`. In the interest of eliminating redundancy and retaining more descriptive columns, it has been decided to remove the `homeGoals` and `awayGoals` columns.


### Match Status Column

The match status column reveals three distinct values: `complete`, `incomplete`, and `suspended`. Notably, all instances of incomplete and suspended matches are specific to the current season (2023/2024). Consequently, these particular matches are considered irrelevant to the project as they have not been played and can be safely excluded from the dataset. Furthermore, the status column itself becomes obsolete once 'incomplete' and 'suspended' matches are removed and can be dropped from the dataset.

### Unique Identifier Columns (Non-feature/target columns)

Several columns function as unique identifiers and are unlikely to contribute to predictive power. While these identifiers may be valuable during Exploratory Data Analysis (EDA), they should be dropped during model training. The identified columns are:

- id
- homeID
- awayID
- season
- competition_id
- roundID
- refereeID
- coach_a_ID
- coach_b_ID


### Nested Data

Upon inspection, it was observed that two columns, namely `homeGoals_timings` and `awayGoals_timings`, are structured as arrays. These columns provide information on goal timings for the home and away teams, respectively. The data within these arrays is likely to hold valuable predictive power regarding the number of goals scored in the first and second halves of a match. As part of the data cleaning process, it is imperative to flatten these arrays. Furthermore, since the data within the arrays is currently in string format, a conversion to numerical format is necessary to enhance its utility for analysis and modeling.
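A minimal sketch of that conversion, assuming each cell is a string such as `"['45+1', '57']"` as seen in the earlier snapshot. The helpers `parse_minutes` and `first_half_goals` and the output columns `home_fh_goals` and `home_2h_goals` are hypothetical. Treating any base minute of 45 or less (including `45+N` stoppage time) as first half is also an assumption.

```python
import ast
import pandas as pd

def parse_minutes(raw):
    """Convert "['45+1', '57']" (or an actual list) into [(45, 1), (57, 0)]."""
    items = ast.literal_eval(raw) if isinstance(raw, str) else raw
    parsed = []
    for item in items:
        # Split '45+1' into base minute 45 and stoppage minute 1
        base, _, extra = str(item).partition("+")
        parsed.append((int(base), int(extra) if extra else 0))
    return parsed

def first_half_goals(raw):
    # Assumption: base minute <= 45 means the goal fell in the first half
    return sum(1 for base, _ in parse_minutes(raw) if base <= 45)

# Toy stand-in for the homeGoals_timings column
toy_timings = pd.DataFrame({"homeGoals_timings": ["['45+1', '57']", "[]", "['5']"]})

total_goals = toy_timings["homeGoals_timings"].apply(lambda raw: len(parse_minutes(raw)))
toy_timings["home_fh_goals"] = toy_timings["homeGoals_timings"].apply(first_half_goals)
toy_timings["home_2h_goals"] = total_goals - toy_timings["home_fh_goals"]
print(toy_timings)
```

The same helpers would apply unchanged to `awayGoals_timings`.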

### More irrelevant data
Close inspection of the API documentation revealed that the following columns contain metadata (for example, whether timing data was stored). These columns were found to offer no useful information and can therefore be dropped without affecting the success of the project:

- attendance
- corner_timings_recorded
- card_timings_recorded
- pens_recorded
- goal_timings_recorded
- throwins_recorded
- freekicks_recorded
- goalkicks_recorded
- competition_id
- matches_completed_minimum



In [143]:
columns_to_drop = [
    'odds_ft_1', 'odds_ft_x', 'odds_ft_2', 'odds_ft_over05', 'odds_ft_over15', 'odds_ft_over25',
    'odds_ft_over35', 'odds_ft_over45', 'odds_ft_under05', 'odds_ft_under15', 'odds_ft_under25',
    'odds_ft_under35', 'odds_ft_under45', 'odds_btts_yes', 'odds_btts_no', 'odds_team_a_cs_yes',
    'odds_team_a_cs_no', 'odds_team_b_cs_yes', 'odds_team_b_cs_no', 'odds_doublechance_1x',
    'odds_doublechance_12', 'odds_doublechance_x2', 'odds_1st_half_result_1', 'odds_1st_half_result_x',
    'odds_1st_half_result_2', 'odds_2nd_half_result_1', 'odds_2nd_half_result_x', 'odds_2nd_half_result_2',
    'odds_dnb_1', 'odds_dnb_2', 'odds_corners_over_75', 'odds_corners_over_85', 'odds_corners_over_95',
    'odds_corners_over_105', 'odds_corners_over_115', 'odds_corners_under_75', 'odds_corners_under_85',
    'odds_corners_under_95', 'odds_corners_under_105', 'odds_corners_under_115', 'odds_corners_1',
    'odds_corners_x', 'odds_corners_2', 'odds_team_to_score_first_1', 'odds_team_to_score_first_x',
    'odds_team_to_score_first_2', 'odds_win_to_nil_1', 'odds_win_to_nil_2', 'odds_1st_half_over05',
    'odds_1st_half_over15', 'odds_1st_half_over25', 'odds_1st_half_over35', 'odds_1st_half_under05',
    'odds_1st_half_under15', 'odds_1st_half_under25', 'odds_1st_half_under35', 'odds_2nd_half_over05',
    'odds_2nd_half_over15', 'odds_2nd_half_over25', 'odds_2nd_half_over35', 'odds_2nd_half_under05',
    'odds_2nd_half_under15', 'odds_2nd_half_under25', 'odds_2nd_half_under35', 'odds_btts_1st_half_yes',
    'odds_btts_1st_half_no', 'odds_btts_2nd_half_yes', 'odds_btts_2nd_half_no',
    'attendance', 'corner_timings_recorded', 'card_timings_recorded', 'pens_recorded',
    'goal_timings_recorded', 'throwins_recorded', 'freekicks_recorded', 'goalkicks_recorded',
    'competition_id', 'matches_completed_minimum', 'roundID', 'refereeID', 'coach_a_ID', 'coach_b_ID',
    'stadium_name', 'stadium_location', 'home_url', 'home_image', 'away_url', 'away_image', 'match_url',
    'team_a_fh_corners', 'team_b_fh_corners', 'team_a_2h_corners',
    'team_b_2h_corners', 'corner_fh_count', 'corner_2h_count', 'team_a_fh_cards', 'team_b_fh_cards',
    'team_a_2h_cards', 'team_b_2h_cards', 'total_fh_cards', 'total_2h_cards', 'winningTeam', 'no_home_away',
    'revised_game_week', 'goalTimingDisabled',
]


In [144]:
match_data.drop(columns = columns_to_drop, inplace=True)

KeyError: "['odds_ft_1' 'odds_ft_x' 'odds_ft_2' 'odds_ft_over05' 'odds_ft_over15'\n 'odds_ft_over25' 'odds_ft_over35' 'odds_ft_over45' 'odds_ft_under05'\n 'odds_ft_under15' 'odds_ft_under25' 'odds_ft_under35' 'odds_ft_under45'\n 'odds_btts_yes' 'odds_btts_no' 'odds_team_a_cs_yes' 'odds_team_a_cs_no'\n 'odds_team_b_cs_yes' 'odds_team_b_cs_no' 'odds_doublechance_1x'\n 'odds_doublechance_12' 'odds_doublechance_x2' 'odds_1st_half_result_1'\n 'odds_1st_half_result_x' 'odds_1st_half_result_2'\n 'odds_2nd_half_result_1' 'odds_2nd_half_result_x'\n 'odds_2nd_half_result_2' 'odds_dnb_1' 'odds_dnb_2' 'odds_corners_over_75'\n 'odds_corners_over_85' 'odds_corners_over_95' 'odds_corners_over_105'\n 'odds_corners_over_115' 'odds_corners_under_75' 'odds_corners_under_85'\n 'odds_corners_under_95' 'odds_corners_under_105' 'odds_corners_under_115'\n 'odds_corners_1' 'odds_corners_x' 'odds_corners_2'\n 'odds_team_to_score_first_1' 'odds_team_to_score_first_x'\n 'odds_team_to_score_first_2' 'odds_win_to_nil_1' 'odds_win_to_nil_2'\n 'odds_1st_half_over05' 'odds_1st_half_over15' 'odds_1st_half_over25'\n 'odds_1st_half_over35' 'odds_1st_half_under05' 'odds_1st_half_under15'\n 'odds_1st_half_under25' 'odds_1st_half_under35' 'odds_2nd_half_over05'\n 'odds_2nd_half_over15' 'odds_2nd_half_over25' 'odds_2nd_half_over35'\n 'odds_2nd_half_under05' 'odds_2nd_half_under15' 'odds_2nd_half_under25'\n 'odds_2nd_half_under35' 'odds_btts_1st_half_yes' 'odds_btts_1st_half_no'\n 'odds_btts_2nd_half_yes' 'odds_btts_2nd_half_no' 'attendance'\n 'corner_timings_recorded' 'card_timings_recorded' 'pens_recorded'\n 'goal_timings_recorded' 'throwins_recorded' 'freekicks_recorded'\n 'goalkicks_recorded' 'competition_id' 'matches_completed_minimum'\n 'roundID' 'refereeID' 'coach_a_ID' 'coach_b_ID' 'stadium_name'\n 'stadium_location' 'home_url' 'home_image' 'home_name' 'away_url'\n 'away_image' 'away_name' 'match_url' 'team_a_fh_corners'\n 'team_b_fh_corners' 'team_a_2h_corners' 'team_b_2h_corners'\n 
'corner_fh_count' 'corner_2h_count' 'team_a_fh_cards' 'team_b_fh_cards'\n 'team_a_2h_cards' 'team_b_2h_cards' 'total_fh_cards' 'total_2h_cards'\n 'winningTeam' 'no_home_away'] not found in axis"
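`DataFrame.drop` raises a `KeyError` like the one above when some of the listed labels are not present in the frame. One possible cause, assumed here rather than confirmed, is that the cell was run again after the columns had already been removed. A sketch of a drop that can be re-run safely is given on a toy frame; `toy_frame`, `to_drop` and `missing` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for match_data; values are illustrative only.
toy_frame = pd.DataFrame({"game_week": [1, 2], "no_home_away": [0, 0]})

# 'revised_game_week' is deliberately absent from the toy frame
to_drop = ["no_home_away", "revised_game_week"]

# Report labels that are not in the frame before dropping
missing = [column for column in to_drop if column not in toy_frame.columns]
print("Not found:", missing)

# errors='ignore' skips absent labels, so the call can be repeated safely
toy_frame = toy_frame.drop(columns=to_drop, errors="ignore")
toy_frame = toy_frame.drop(columns=to_drop, errors="ignore")
print(toy_frame.columns.tolist())
```

Printing the missing labels first keeps genuine typos in the list visible, which `errors='ignore'` would otherwise hide.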

In [145]:
match_data.shape

(3800, 139)

In [138]:
# compute the fraction of 0 or -1 values for each column and store the results in a dictionary
data_zero_counts = {}
for column in match_data.columns:
    data_zero_counts[column] = ((match_data[column] == 0) | (match_data[column] == -1)).sum() / len(match_data)

In [140]:

sorted_data_zero_counts = dict(sorted(data_zero_counts.items(), key=lambda item: item[1], reverse=True))

# Print the sorted dictionary
sorted_data_zero_counts


{'revised_game_week': 1.0,
 'goalTimingDisabled': 1.0,
 'team_b_penalty_missed': 0.9765789473684211,
 'team_a_penalty_missed': 0.9726315789473684,
 'team_a_cards_0_10_min': 0.9610526315789474,
 'team_b_cards_0_10_min': 0.9526315789473684,
 'team_a_red_cards': 0.9476315789473684,
 'over55': 0.9457894736842105,
 'team_b_red_cards': 0.935,
 'team_b_penalty_goals': 0.9207894736842105,
 'team_b_0_10_min_goals': 0.9092105263157895,
 'team_b_penalties_won': 0.8981578947368422,
 'team_a_penalty_goals': 0.8928947368421053,
 'team_a_0_10_min_goals': 0.8878947368421053,
 'team_a_penalties_won': 0.8689473684210526,
 'over45': 0.863421052631579,
 'team_b_corners_0_10_min': 0.7639473684210526,
 'team_a_corners_0_10_min': 0.7178947368421053,
 'over35': 0.7110526315789474,
 'team_b_freekicks': 0.6878947368421052,
 'team_a_freekicks': 0.6876315789473684,
 'team_a_goalkicks': 0.6526315789473685,
 'team_b_goalkicks': 0.6518421052631579,
 'team_b_throwins': 0.6157894736842106,
 'team_a_throwins': 0.613421

Since our models will rely on pre-match statistics, columns whose data only becomes available after the match has started cannot be used as features. We can therefore safely drop these columns from the dataset.

In [None]:
match_data_columns_after_start = [
    'revised_game_week', 'goalTimingDisabled', 'team_b_penalty_missed', 'team_a_penalty_missed',
    'team_a_cards_0_10_min', 'team_b_cards_0_10_min', 'team_a_red_cards', 'over55', 'team_b_red_cards',
    'team_b_penalty_goals', 'team_b_0_10_min_goals', 'team_b_penalties_won', 'team_a_penalty_goals',
    'team_a_0_10_min_goals', 'team_a_penalties_won', 'over45', 'team_b_corners_0_10_min',
    'team_a_corners_0_10_min', 'over35', 'team_b_freekicks', 'team_a_freekicks', 'team_a_goalkicks',
    'team_b_goalkicks', 'team_b_throwins', 'team_a_throwins', 'ht_goals_team_b', 'goals_2hg_team_b',
    'ht_goals_team_a', 'btts', 'over25', 'goals_2hg_team_a', 'u05_potential', 'team_b_dangerous_attacks',
    'team_b_attacks', 'team_a_dangerous_attacks', 'team_a_attacks', 'awayGoalCount', 'HTGoalCount',
    'over15', 'homeGoalCount', 'attacks_recorded', 'team_a_xg', 'team_b_xg', 'total_xg',
    'team_a_xg_prematch', 'team_b_xg_prematch', 'total_xg_prematch', 'GoalCount_2hg',
    'team_b_offsides', 'team_a_yellow_cards', 'team_a_cards_num', 'team_a_offsides', 'o45_potential',
    'team_b_yellow_cards', 'team_b_cards_num', 'btts_fhg_potential', 'u15_potential', 'totalGoalCount',
    'overallGoalCount', 'over05', 'btts_2hg_potential', 'team_b_shotsOnTarget', 'o35_potential',
    'pre_match_away_ppg', 'o15HT_potential', 'pre_match_home_ppg', 'team_a_shotsOnTarget',
    'u25_potential', 'corners_o105_potential', 'team_b_corners', 'o15_2H_potential', 'btts_potential',
    'o25_potential', 'team_b_shotsOffTarget', 'corners_o95_potential', 'team_a_corners',
    'corners_o85_potential', 'u35_potential', 'team_a_shotsOffTarget', 'o05HT_potential',
    'o15_potential', 'offsides_potential', 'o05_2H_potential', 'cards_potential', 'o05_potential',
    'avg_potential', 'u45_potential', 'corners_potential', 'team_a_shots', 'team_b_shots',
    'team_b_fouls', 'team_a_fouls', 'team_a_possession', 'team_b_possession', 'totalCornerCount',
    'pre_match_teamA_overall_ppg', 'pre_match_teamB_overall_ppg', 'scored_away', 'missed_home',
    'scored_home', 'missed_away', 'deep_home', 'deep_allowed_away'
]

# Print the list of columns
print(match_data_columns_after_start)
