# Predicting English Premier League Matches

By: Yi Xuan Sim

## Initial Data Exploration

Firstly, we will import the 2018/2019 season English Premier League data as a pandas dataframe to obtain a sense of the underlying data structure. 

Data Source: http://www.football-data.co.uk/englandm.php

In [1]:
# The code was removed by Watson Studio for sharing.

In [2]:
#Read the data and parse date in the correct format
dateCols = ['Date']
df_data_1 = pd.read_csv(body, parse_dates=dateCols)

We inspect the first few rows of data to see the underlying structure of the data. 

In [3]:
df_data_1.head()

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,BbAv<2.5,BbAH,BbAHh,BbMxAHH,BbAvAHH,BbMxAHA,BbAvAHA,PSCH,PSCD,PSCA
0,E0,2018-10-08,Man United,Leicester,2,1,H,1,0,H,...,1.79,17,-0.75,1.75,1.7,2.29,2.21,1.55,4.07,7.69
1,E0,2018-11-08,Bournemouth,Cardiff,2,0,H,1,0,H,...,1.83,20,-0.75,2.2,2.13,1.8,1.75,1.88,3.61,4.7
2,E0,2018-11-08,Fulham,Crystal Palace,0,2,A,0,1,A,...,1.87,22,-0.25,2.18,2.11,1.81,1.77,2.62,3.38,2.9
3,E0,2018-11-08,Huddersfield,Chelsea,0,3,A,0,2,A,...,1.84,23,1.0,1.84,1.8,2.13,2.06,7.24,3.95,1.58
4,E0,2018-11-08,Newcastle,Tottenham,1,2,A,1,2,A,...,1.81,20,0.25,2.2,2.12,1.8,1.76,4.74,3.53,1.89


We can see that there are 62 columns. The column we are trying to predict is 'FTR' (Full Time Result). The column has 3 labels, where: <br/>
'H' denotes a win by the home team <br/>
'A' denotes a win by the away team <br/>
'D' denotes a draw <br/>

This is a supervised machine learning problem. The rest of the columns are given as: <br/>
Div = League Division <br/>
Date = Match Date (dd/mm/yy) <br/>
HomeTeam = Home Team <br/>
AwayTeam = Away Team <br/>
FTHG and HG = Full Time Home Team Goals <br/>
FTAG and AG = Full Time Away Team Goals <br/>
HTHG = Half Time Home Team Goals <br/>
HTAG = Half Time Away Team Goals <br/>
HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win) <br/>

Match Statistics (where available) <br/>
Attendance = Crowd Attendance <br/>
Referee = Match Referee <br/>
HS = Home Team Shots <br/>
AS = Away Team Shots <br/>
HST = Home Team Shots on Target <br/>
AST = Away Team Shots on Target <br/>
HC = Home Team Corners <br/>
AC = Away Team Corners <br/>
HF = Home Team Fouls Committed <br/>
AF = Away Team Fouls Committed <br/>
HY = Home Team Yellow Cards <br/>
AY = Away Team Yellow Cards <br/>
HR = Home Team Red Cards <br/>
AR = Away Team Red Cards <br/>

The rest of the columns are simply betting odds taken from different betting agencies, which will be ignored. 

## Identifying Relevant Features

The dataset contains numerous features, however, not all of them are relevant. To identify relevant features, we want to plot a correlation matrix to identify the columns with the highest correlation with FTR (Full Time Result). However, the Full Time Result has categorical labels, so we need to convert it to integer labels. We can do so using label encoding. 

Firstly, we create a new dataframe with columns containing the match statistics. Columns with betting odds are ignored.

In [4]:
df_data_2 = df_data_1[['Date','HomeTeam','AwayTeam','FTHG','FTAG','FTR','HTHG','HTAG','HTR','HS','AS','HST','AST','HC','AC','HF','AF','HY','AY','HR','AR']]


In [5]:
from sklearn.preprocessing import LabelEncoder
full_time = df_data_2['FTR']
half_time = df_data_2['HTR']

# integer encode
label_encoder = LabelEncoder()
FT_encoded = label_encoder.fit_transform(full_time)
HT_encoded = label_encoder.fit_transform(half_time)
df_data_2['FT_encoded'] = FT_encoded
df_data_2['HT_encoded'] = HT_encoded
df_data_2.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HS,...,HC,AC,HF,AF,HY,AY,HR,AR,FT_encoded,HT_encoded
0,2018-10-08,Man United,Leicester,2,1,H,1,0,H,8,...,2,5,11,8,2,1,0,0,2,2
1,2018-11-08,Bournemouth,Cardiff,2,0,H,1,0,H,12,...,7,4,11,9,1,1,0,0,2,2
2,2018-11-08,Fulham,Crystal Palace,0,2,A,0,1,A,15,...,5,5,9,11,1,2,0,0,0,0
3,2018-11-08,Huddersfield,Chelsea,0,3,A,0,2,A,6,...,2,5,9,8,2,1,0,0,0,0
4,2018-11-08,Newcastle,Tottenham,1,2,A,1,2,A,15,...,3,5,11,12,2,2,0,0,0,0
5,2018-11-08,Watford,Brighton,2,0,H,1,0,H,19,...,8,2,10,16,2,2,0,0,2,2
6,2018-11-08,Wolves,Everton,2,2,D,1,1,D,11,...,3,6,8,7,0,1,0,1,1,1
7,2018-12-08,Arsenal,Man City,0,2,A,0,1,A,9,...,2,9,11,14,2,2,0,0,0,0
8,2018-12-08,Liverpool,West Ham,4,0,H,2,0,H,18,...,5,4,14,9,1,2,0,0,2,2
9,2018-12-08,Southampton,Burnley,0,0,D,0,0,D,18,...,8,5,10,9,0,1,0,0,1,1


The encoding is as follows: <br/>
Win by away team = 0 <br/>
Draw = 1 <br/>
Win by home team = 2 <br/>

#### Plotting Correlation Matrix

In [6]:
df_data_2.corr()

Unnamed: 0,FTHG,FTAG,HTHG,HTAG,HS,AS,HST,AST,HC,AC,HF,AF,HY,AY,HR,AR,FT_encoded,HT_encoded
FTHG,1.0,-0.167997,0.739448,-0.056106,0.318988,-0.191386,0.629432,-0.205383,0.036251,-0.168767,-0.013768,-0.026539,-0.132962,0.020738,-0.060999,0.041527,0.703611,0.535038
FTAG,-0.167997,1.0,-0.06497,0.693411,-0.187031,0.281652,-0.128492,0.529567,-0.183598,0.03117,0.07318,-0.042929,0.098773,0.052354,0.121406,-0.008788,-0.629343,-0.492502
HTHG,0.739448,-0.06497,1.0,0.040572,0.170286,-0.040958,0.42107,-0.094557,-0.155622,-0.03151,-0.034619,0.049099,-0.063564,0.050446,0.062311,0.073454,0.474845,0.624791
HTAG,-0.056106,0.693411,0.040572,1.0,-0.059918,0.1476,-0.046595,0.340307,-0.098169,-0.014921,0.075451,-0.005169,0.074249,0.10057,0.098608,0.008354,-0.396468,-0.633326
HS,0.318988,-0.187031,0.170286,-0.059918,1.0,-0.458778,0.675895,-0.312503,0.559914,-0.429854,-0.141764,0.120665,-0.207727,0.121007,-0.201003,-0.00849,0.283275,0.164673
AS,-0.191386,0.281652,-0.040958,0.1476,-0.458778,1.0,-0.2485,0.625478,-0.407822,0.527095,0.166783,-0.120421,0.219603,-0.066762,0.112489,-0.023679,-0.248715,-0.156163
HST,0.629432,-0.128492,0.42107,-0.046595,0.675895,-0.2485,1.0,-0.175144,0.301347,-0.285751,-0.116521,-0.005255,-0.18782,0.008468,-0.173722,-0.031532,0.441531,0.345748
AST,-0.205383,0.529567,-0.094557,0.340307,-0.312503,0.625478,-0.175144,1.0,-0.284909,0.302527,0.110245,0.008228,0.136813,-0.036625,0.144365,0.121537,-0.431561,-0.318183
HC,0.036251,-0.183598,-0.155622,-0.098169,0.559914,-0.407822,0.301347,-0.284909,1.0,-0.339264,-0.051371,0.072326,-0.207308,0.024243,-0.06477,-0.027101,0.060967,-0.04089
AC,-0.168767,0.03117,-0.03151,-0.014921,-0.429854,0.527095,-0.285751,0.302527,-0.339264,1.0,0.007635,-0.139518,0.196349,-0.188978,0.128727,-0.041859,-0.109048,-0.045141


From the table above, we can see that the number of goals scored by the home teams and away teams at full time (FTHG and FTAG) are highly correlated with the full time results (FT_encoded). We can also see that the number of shots on target (HST and AST) are highly correlated with the full time results. This is expected, and we will later use these features to evaluate team performance.

Another interesting observation is that the number of goals scored by home teams and away teams at half time (HTHG and HTAG) are slightly correlated with the full time results (FT_encoded). This is also expected. If the goal difference at half time is very large, then the outcome of the game is mostly decided.

Other features such as the number of fouls, number of corners, number of yellow cards and number of red cards are very weakly correlated with the full time results, thus we can remove those features.

In [7]:
#Removing Irrelevant features
df_data_clean = df_data_2.drop(columns=['HS','AS','HF', 'AF','HC','AC','HY','AY','HR','AR'])

In [8]:
df_data_clean.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HST,AST,FT_encoded,HT_encoded
0,2018-10-08,Man United,Leicester,2,1,H,1,0,H,6,4,2,2
1,2018-11-08,Bournemouth,Cardiff,2,0,H,1,0,H,4,1,2,2
2,2018-11-08,Fulham,Crystal Palace,0,2,A,0,1,A,6,9,0,0
3,2018-11-08,Huddersfield,Chelsea,0,3,A,0,2,A,1,4,0,0
4,2018-11-08,Newcastle,Tottenham,1,2,A,1,2,A,2,5,0,0


We will remove the irrelevant features for the data from other seasons during the extract, transform, load (ETL) phase of data processing. 