# Import Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Features to be Used

Below are the preliminary features that will be used to build the model. More features will be added later.

**Fatigue**
- Days since last match

**Home Team Form**
- Goal difference of home team in the last x matches
- Goal difference of home team in the last x home matches
- Average number of points gained by home team in the last x matches
- Number of home matches won by home team in its last x home matches
- Home team win streak
- Whether the home team is newly promoted

**Away Team Form**
- Goal difference of away team in the last x matches
- Goal difference of away team in the last x away matches
- Average number of points gained by away team in the last x matches
- Number of away matches won by away team in its last x away matches
- Away team win streak
- Whether the away team is newly promoted

**Home Team Performance Index**
- Home Defense Performance Index
- Home Midfield Performance Index
- Home Attack Performance Index

**Away Team Performance Index**
- Away Defense Performance Index
- Away Midfield Performance Index
- Away Attack Performance Index

**Betting Odds**
- B365H
- B365D
- B365A
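
As a rough illustration of how the "last x matches" form features above might be derived, the sketch below computes a rolling goal difference and a rolling average of points for a single team. The toy data, the column names `GF`, `GA`, `Points`, `GDLastX`, `AvgPointsLastX`, and the window `x = 3` are assumptions made for this example only. `shift(1)` keeps the current match out of its own window, so each feature uses only earlier results.

```python
import pandas as pd

# Toy match log for a single team, oldest match first (hypothetical values).
matches = pd.DataFrame({
    "GF": [2, 0, 1, 3, 1],   # goals scored
    "GA": [1, 0, 2, 0, 1],   # goals conceded
})

x = 3  # window size: number of previous matches to look back on (assumed)

# Points per match: 3 for a win, 1 for a draw, 0 for a loss.
goal_diff = matches["GF"] - matches["GA"]
matches["Points"] = goal_diff.apply(lambda d: 3 if d > 0 else (1 if d == 0 else 0))

# shift(1) excludes the current match, so each row only sees earlier results.
matches["GDLastX"] = goal_diff.shift(1).rolling(x).sum()
matches["AvgPointsLastX"] = matches["Points"].shift(1).rolling(x).mean()

print(matches)
```

The first `x` rows are NaN because fewer than `x` earlier matches exist; how to treat those rows (drop, fill, or shorten the window) is a modelling choice left open here.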

# Data Preprocessing

There are two main datasets (Dataset 1 and Dataset 2) for each season that will be used to extract the features needed for the model. 

First we create an empty DataFrame. This DataFrame will eventually contain the data integrated from the two datasets.

In [3]:
df = pd.DataFrame()

## Dataset 1

Data Source: www.football-data.co.uk

For Dataset 1, we preprocess each season's file and then concatenate all seasons into a single DataFrame.

In [4]:
# standardize the team names across all datasets

rename_teams = {'Arsenal': 'arsenal', 'Brighton': 'brighton', 'Chelsea': 'chelsea', 'Crystal Palace': 'palace', 'Everton': 'everton', 
                'Southampton': 'southampton', 'Watford': 'watford', 'West Brom': 'west-brom', 'Man United': 'united', 'Newcastle': 'newcastle',
                'Bournemouth': 'bournemouth', 'Burnley': 'burnley', 'Leicester': 'leicester', 'Liverpool': 'liverpool', 'Stoke': 'stoke',
                'Swansea': 'swansea', 'Huddersfield': 'huddersfield', 'Tottenham': 'tottenham', 'Man City': 'city', 'West Ham': 'west-ham',
                'Fulham': 'fulham', 'Wolves': 'wolves', 'Cardiff': 'cardiff', 'Aston Villa': 'aston-villa', 'Norwich': 'norwich',
                'Sheffield United': 'sheffield', 'Leeds': 'leeds', 'Brentford':'brentford'}
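
The dictionary lookup used in the following cell raises a `KeyError` for any team name missing from the mapping. As a hedged sketch, unmapped names could be listed before renaming. The shortened mapping and the raw values below are stand-ins invented for this example; the unmapped name is hypothetical.

```python
import pandas as pd

# Small stand-in for the full mapping defined above.
rename_teams = {"Arsenal": "arsenal", "Man United": "united"}

# Hypothetical raw column containing one name the mapping does not cover.
raw = pd.Series(["Arsenal", "Man United", "Nott'm Forest"])

# Series.map returns NaN for keys missing from the dict instead of raising.
mapped = raw.map(rename_teams)
unmapped = sorted(raw[mapped.isna()].unique())

print(unmapped)
```

An empty list means every name is covered and the rename step can proceed.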

In [5]:
dataset1_df = pd.DataFrame()
seasons = ['2017-2018', '2018-2019', '2019-2020', '2020-2021', '2021-2022']

for season in seasons:
    # read csv file for match statistics
    temp_df = pd.read_csv(f'datasets/{season}/dataset1.csv')

    # convert 'Date' column to datetime object
    temp_df['Date'] = pd.to_datetime(temp_df['Date'], format="%d/%m/%Y")
    
    # rename team names in the 'HomeTeam' and 'AwayTeam' columns for standardized team names
    temp_df['HomeTeam'] = temp_df['HomeTeam'].apply(lambda word : rename_teams[word])
    temp_df['AwayTeam'] = temp_df['AwayTeam'].apply(lambda word : rename_teams[word])
    
    # concatenate temp_df to dataset1_df
    if dataset1_df.empty:
        dataset1_df = temp_df
    else:
        dataset1_df = pd.concat([dataset1_df, temp_df]).reset_index(drop=True)

In [6]:
# Make sure we have 5 seasons x 380 matches = 1900 matches in the DataFrame
dataset1_df.shape

(1900, 21)
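
The shape check above confirms the overall total but not that each season contributes 380 matches. A minimal sketch of a per-season check follows, assuming each frame is tagged with a `Season` column before concatenation (a column the original loop does not add). The toy frames are hypothetical and far smaller than the real files.

```python
import pandas as pd

# Toy stand-ins for two per-season frames (hypothetical rows, not real fixtures).
frames = {
    "2017-2018": pd.DataFrame({"HomeTeam": ["arsenal", "chelsea"],
                               "AwayTeam": ["leicester", "burnley"]}),
    "2018-2019": pd.DataFrame({"HomeTeam": ["united"],
                               "AwayTeam": ["leicester"]}),
}

# Tag each frame with its season, then concatenate once.
tagged = [frame.assign(Season=season) for season, frame in frames.items()]
combined = pd.concat(tagged, ignore_index=True)

# Matches per season; on the real data each entry would be expected to be 380.
matches_per_season = combined.groupby("Season").size()
print(matches_per_season)
```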

In [7]:
dataset1_df.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR,B365H,B365D,B365A
0,2017-08-11,arsenal,leicester,4,3,H,27,6,10,3,9,12,9,4,0,1,0,0,1.53,4.5,6.5
1,2017-08-12,brighton,city,0,2,A,6,14,2,4,6,9,3,10,0,2,0,0,11.0,5.5,1.33
2,2017-08-12,chelsea,burnley,2,3,A,19,10,6,5,16,11,8,5,3,3,2,0,1.25,6.5,15.0
3,2017-08-12,palace,huddersfield,0,3,A,14,8,4,6,7,19,12,9,1,3,0,0,1.83,3.6,5.0
4,2017-08-12,everton,stoke,1,0,H,9,9,4,1,13,10,6,7,1,1,0,0,1.7,3.8,5.75


## Dataset 2

Data Source: www.fbref.com/en

For Dataset 2, we preprocess each season's file and then concatenate all seasons into a single DataFrame.

In [8]:
# standardize the team names across all datasets

rename_teams = {'Leicester City':'leicester', 'Bournemouth':'bournemouth', 'West Brom':'west-brom', 'Brighton': 'brighton', 'Swansea City':'swansea', 
                'Tottenham':'tottenham', 'Huddersfield':'huddersfield', 'Manchester Utd': 'united', 'Newcastle Utd':'newcastle', 'Liverpool':'liverpool', 
                'Chelsea': 'chelsea', 'Crystal Palace': 'palace', 'Everton': 'everton', 'Manchester City':'city', 'Watford': 'watford', 
                'Stoke City':'stoke', 'Southampton': 'southampton', 'West Ham':'west-ham', 'Burnley':'burnley', 'Arsenal': 'arsenal', 
                'Wolves':'wolves', 'Fulham':'fulham', 'Cardiff City':'cardiff', 'Aston Villa':'aston-villa', 'Sheffield Utd':'sheffield',
                'Norwich City':'norwich', 'Leeds United':'leeds', 'Brentford':'brentford'}

While preprocessing Dataset 2, we will also create two new features, 'HDaysLastPlayed' and 'ADaysLastPlayed'.

The 'HDaysLastPlayed' feature indicates the number of days since the home team's last match.

The 'ADaysLastPlayed' feature indicates the number of days since the away team's last match.

Ordinarily, new features would be created in the Feature Engineering step later. However, these two features depend on information that is only present in the raw datasets: the raw files list every match a team played, across all competitions and at both venues, whereas the concatenated DataFrame keeps only Premier League home matches. Therefore, we create these two features before concatenating each season's Dataset 2 into a single DataFrame.
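
To illustrate why the order matters, the sketch below compares rest days computed before and after filtering to league matches. The dates and competition labels are hypothetical toy values, and `Series.diff` is used here in place of the notebook's subtraction of a shifted column.

```python
import pandas as pd

# Toy schedule for one team across competitions (hypothetical dates and labels).
schedule = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-06", "2017-08-11", "2017-08-16", "2017-08-19"]),
    "Comp": ["Community Shield", "Premier League", "EFL Cup", "Premier League"],
})

# Rest days computed on the full schedule, then filtered to league matches.
schedule["DaysLastPlayed"] = schedule["Date"].diff().dt.days
before_filter = schedule[schedule["Comp"] == "Premier League"]["DaysLastPlayed"].tolist()

# Rest days computed after filtering: non-league matches are no longer visible.
league_only = schedule[schedule["Comp"] == "Premier League"]
after_filter = league_only["Date"].diff().dt.days.tolist()

print(before_filter)  # rest days counting every competition
print(after_filter)   # rest days counting league matches only
```

Computing the difference after filtering would overstate the rest a team had whenever a cup match fell between two league fixtures.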

In [9]:
# returns the number of days since home team's last match
def getHDaysLastPlayed(row):
    HDaysLastPlayed = str(row['DaysLastPlayed']).split()[0]
    return HDaysLastPlayed

In [10]:
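# returns the number of days since away team's last match
def getADaysLastPlayed(row):
    # NOTE: reads the global `concatenated_df` of the season currently being processed
    date = row['Date']
    team = row['team']
    opponent = row['Opponent']

    # locate the same fixture from the opponent's point of view
    filter_condition = (concatenated_df['Date'] == date) & (concatenated_df['team'] == opponent) & (concatenated_df['Opponent'] == team)
    # take the matching Timedelta itself rather than parsing the Series' printed form
    days_last_played = concatenated_df.loc[filter_condition, 'DaysLastPlayed'].iloc[0]
    ADaysLastPlayed = str(days_last_played).split()[0]
    return ADaysLastPlayed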

In [11]:
dataset2_df = pd.DataFrame()
seasons = ['2017-2018', '2018-2019', '2019-2020', '2020-2021', '2021-2022']

for season in seasons:
    
    concatenated_df = pd.read_csv(f'datasets/{season}/dataset2.csv')
  
    # convert 'Date' column to datetime object
    concatenated_df['Date'] = pd.to_datetime(concatenated_df['Date'], format="%Y/%m/%d")
    
    # get DaysLastPlayed for all matches
    concatenated_df['DaysLastPlayed'] = concatenated_df['Date'] - concatenated_df['Date'].shift(1)
    
    # filter by Premier League matches only (copy so later column assignments do not trigger SettingWithCopyWarning)
    concatenated_df = concatenated_df[concatenated_df['Comp'] == 'Premier League'].copy()
    
    # rename team names in the 'Opponent' column for standardized team names
    concatenated_df['Opponent'] = concatenated_df['Opponent'].apply(lambda word : rename_teams[word])
    
    # add a new feature: HDaysLastPlayed (number of days since home team's last match)
    concatenated_df['HDaysLastPlayed'] = concatenated_df.apply(lambda row: getHDaysLastPlayed(row), axis=1)
    
    # add a new feature: ADaysLastPlayed (number of days since away team's last match)
    concatenated_df['ADaysLastPlayed'] = concatenated_df.apply(lambda row: getADaysLastPlayed(row), axis=1)
    
    # filter by home matches only
    concatenated_df = concatenated_df[concatenated_df['Venue'] == 'Home'].reset_index(drop=True)
    
    # drop the 'Venue' column (every remaining row is a home match)
    concatenated_df.drop(['Venue'], axis=1, inplace = True)
        
    # rename features
    concatenated_df = concatenated_df.rename(columns={'xG': 'HxG', 'xGA': 'AxG', 'Poss': 'HPoss', 'Opponent': 'AwayTeam', 'team': 'HomeTeam'})
    
    if dataset2_df.empty:
        dataset2_df = concatenated_df
    else:
        dataset2_df = pd.concat([dataset2_df, concatenated_df]).reset_index(drop=True)

In [12]:
# Make sure we have 5 x 380 = 1900 matches in the DataFrame
dataset2_df.shape

(1900, 13)

In [13]:
dataset2_df.head()

Unnamed: 0,Date,Comp,Result,GF,GA,AwayTeam,HxG,AxG,HPoss,HomeTeam,DaysLastPlayed,HDaysLastPlayed,ADaysLastPlayed
0,2017-08-11,Premier League,W,4,3,leicester,2.5,1.5,68.0,arsenal,5 days,5,-275
1,2017-09-09,Premier League,W,3,0,bournemouth,2.2,0.6,58.0,arsenal,13 days,13,14
2,2017-09-25,Premier League,W,2,0,west-brom,2.2,0.9,69.0,arsenal,5 days,5,5
3,2017-10-01,Premier League,W,2,0,brighton,2.4,0.4,64.0,arsenal,3 days,3,7
4,2017-10-28,Premier League,W,2,1,swansea,2.0,0.9,72.0,arsenal,4 days,4,4


# Integration of Data Sources 

In [14]:
# df

In [15]:
# # merge two data sources into one DataFrame

# df = dataset1_df
# df = pd.merge(df, dataset2_df, on=['Date', 'HomeTeam', 'AwayTeam'])
# df.head()
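
The commented-out merge above joins the two sources on `Date`, `HomeTeam`, and `AwayTeam`. As a hedged sketch of how such a merge could be checked, the example below uses one-row stand-ins for the two DataFrames together with the `validate` and `indicator` options of `pd.merge`. The outer join and the names `left`, `right`, and `unmatched` are choices made for this illustration, not part of the original notebook.

```python
import pandas as pd

keys = ["Date", "HomeTeam", "AwayTeam"]

# One-row stand-ins for dataset1_df and dataset2_df.
left = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-11"]),
    "HomeTeam": ["arsenal"], "AwayTeam": ["leicester"], "FTHG": [4],
})
right = pd.DataFrame({
    "Date": pd.to_datetime(["2017-08-11"]),
    "HomeTeam": ["arsenal"], "AwayTeam": ["leicester"], "HxG": [2.5],
})

# validate raises MergeError if a fixture appears more than once on either side;
# indicator adds a "_merge" column showing which source each row came from.
merged = pd.merge(left, right, on=keys, how="outer",
                  validate="one_to_one", indicator=True)

# Fixtures present in only one source would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

An outer join keeps fixtures that fail to match, so mismatched dates or team names surface in `unmatched` instead of silently dropping rows.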