# Project Overview

In this project, we'll predict match winners in the English Premier League (EPL) using machine learning.

We'll be working with EPL match data from the 2020-2021 and 2021-2022 seasons. (The data was scraped partway through the 2021-2022 season, so you won't have the complete match history for that season.)



In [1]:
#imports 
import pandas as pd


In [2]:
matches=pd.read_csv('matches.csv',index_col=0)

matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City


In [3]:
matches.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1389 entries, 1 to 42
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          1389 non-null   object 
 1   time          1389 non-null   object 
 2   comp          1389 non-null   object 
 3   round         1389 non-null   object 
 4   day           1389 non-null   object 
 5   venue         1389 non-null   object 
 6   result        1389 non-null   object 
 7   gf            1389 non-null   float64
 8   ga            1389 non-null   float64
 9   opponent      1389 non-null   object 
 10  xg            1389 non-null   float64
 11  xga           1389 non-null   float64
 12  poss          1389 non-null   float64
 13  attendance    693 non-null    float64
 14  captain       1389 non-null   object 
 15  formation     1389 non-null   object 
 16  referee       1389 non-null   object 
 17  match report  1389 non-null   object 
 18  notes         0 non-null      

In [4]:
matches.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
gf,1389.0,1.335493,1.274235,0.0,0.0,1.0,2.0,9.0
ga,1389.0,1.38085,1.291049,0.0,0.0,1.0,2.0,9.0
xg,1389.0,1.304176,0.767268,0.0,0.7,1.2,1.8,4.6
xga,1389.0,1.338445,0.78936,0.0,0.7,1.2,1.8,5.0
poss,1389.0,49.702664,12.401897,18.0,40.0,50.0,59.0,82.0
attendance,693.0,36089.963925,17797.991778,2000.0,24351.0,32061.0,52214.0,73458.0
notes,0.0,,,,,,,
sh,1389.0,12.153348,5.268876,0.0,8.0,12.0,15.0,31.0
sot,1389.0,4.041037,2.403866,0.0,2.0,4.0,5.0,15.0
dist,1388.0,17.011527,2.988364,4.0,15.1,16.9,18.8,34.9


# Investigating Missing Data

As we mentioned earlier, some of the match data is missing. Let's determine exactly what's missing. In the English Premier League, there are 20 teams, and each team plays 38 matches. We have data for two seasons. So we should have 2 * 20 * 38 matches, or 1520.

Three teams are relegated each season to a lower league, and three are promoted. So given the relegations/promotions that happened at the end of the 2020-2021 season, we should have 6 teams with 38 matches and 17 teams with 76 matches. Of course, since the data was scraped partway through the season, this may not be true.
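The expected counts above can be checked with a little arithmetic. This is only an illustrative sketch; the variable names are assumptions made for the example and do not appear elsewhere in the notebook.

```python
# Expected row counts for two complete seasons (one row per team per match).
teams_per_season = 20
matches_per_team = 38
seasons = 2

expected_rows = seasons * teams_per_season * matches_per_team

# Three teams go down and three come up between the seasons, so 17 teams
# appear in both seasons and 6 teams (3 relegated + 3 promoted) in only one.
teams_in_both = teams_per_season - 3
teams_in_one = 3 + 3
rows_by_team = (teams_in_both * seasons * matches_per_team
                + teams_in_one * matches_per_team)

print(expected_rows, rows_by_team)  # 1520 1520
```

Both ways of counting agree, which is a quick sanity check on the "6 teams with 38 matches and 17 teams with 76 matches" breakdown.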

In [5]:
matches['team'].value_counts()

Southampton                 72
Brighton and Hove Albion    72
Manchester United           72
West Ham United             72
Newcastle United            72
Burnley                     71
Leeds United                71
Crystal Palace              71
Manchester City             71
Wolverhampton Wanderers     71
Tottenham Hotspur           71
Arsenal                     71
Leicester City              70
Chelsea                     70
Aston Villa                 70
Everton                     70
Liverpool                   38
Fulham                      38
West Bromwich Albion        38
Sheffield United            38
Brentford                   34
Watford                     33
Norwich City                33
Name: team, dtype: int64

The counts above show seven teams with noticeably fewer matches than the rest. From domain knowledge, we know that the bottom three teams each season are relegated to the lower league and the top three teams from the lower league are promoted to the EPL (English Premier League). That accounts for six of the seven teams. Liverpool, however, was not relegated last season, so we need to find out why it has so few matches.

In [6]:
matches[matches["team"] == "Liverpool"]

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2020-09-12,17:30,Premier League,Matchweek 1,Sat,Home,W,4.0,3.0,Leeds United,...,Match Report,,20.0,4.0,17.0,0.0,2.0,2.0,2021,Liverpool
2,2020-09-20,16:30,Premier League,Matchweek 2,Sun,Away,W,2.0,0.0,Chelsea,...,Match Report,,17.0,5.0,17.7,1.0,0.0,0.0,2021,Liverpool
4,2020-09-28,20:00,Premier League,Matchweek 3,Mon,Home,W,3.0,1.0,Arsenal,...,Match Report,,21.0,9.0,16.8,0.0,0.0,0.0,2021,Liverpool
6,2020-10-04,19:15,Premier League,Matchweek 4,Sun,Away,L,2.0,7.0,Aston Villa,...,Match Report,,14.0,8.0,15.8,1.0,0.0,0.0,2021,Liverpool
7,2020-10-17,12:30,Premier League,Matchweek 5,Sat,Away,D,2.0,2.0,Everton,...,Match Report,,22.0,8.0,15.0,1.0,0.0,0.0,2021,Liverpool
9,2020-10-24,20:00,Premier League,Matchweek 6,Sat,Home,W,2.0,1.0,Sheffield Utd,...,Match Report,,17.0,5.0,18.2,1.0,0.0,0.0,2021,Liverpool
11,2020-10-31,17:30,Premier League,Matchweek 7,Sat,Home,W,2.0,1.0,West Ham,...,Match Report,,8.0,2.0,18.6,1.0,1.0,1.0,2021,Liverpool
13,2020-11-08,16:30,Premier League,Matchweek 8,Sun,Away,D,1.0,1.0,Manchester City,...,Match Report,,9.0,2.0,21.5,0.0,1.0,1.0,2021,Liverpool
14,2020-11-22,19:15,Premier League,Matchweek 9,Sun,Home,W,3.0,0.0,Leicester City,...,Match Report,,24.0,12.0,11.9,0.0,0.0,0.0,2021,Liverpool
16,2020-11-28,12:30,Premier League,Matchweek 10,Sat,Away,D,1.0,1.0,Brighton,...,Match Report,,6.0,2.0,20.9,0.0,0.0,0.0,2021,Liverpool


Based on the output above, we only have Liverpool's data from the 2020-2021 season and nothing from 2021-2022. So now we know that the missing data belongs to Liverpool.
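One way to confirm this kind of gap without reading rows by eye is to count the distinct seasons recorded for each team. The sketch below uses a small made-up frame (`toy`) as a stand-in for `matches`; only the `team` and `season` column names come from the real data.

```python
import pandas as pd

# Toy stand-in for the real `matches` frame (values are made up for illustration).
toy = pd.DataFrame({
    "team": ["Liverpool", "Liverpool", "Arsenal", "Arsenal", "Arsenal"],
    "season": [2021, 2021, 2021, 2022, 2022],
})

# Number of distinct seasons recorded for each team.
seasons_per_team = toy.groupby("team")["season"].nunique()

# Teams that appear in only one season.
single_season = seasons_per_team[seasons_per_team == 1].index.tolist()
print(single_season)  # ['Liverpool']
```

On the real data, the same check would also list the relegated and promoted teams, since they legitimately appear in only one season.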

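Next, we'll look at the round column, which tells us which matchweek each row belongs to. Each of the 20 teams contributes one row per matchweek, so two full seasons would give 40 rows per matchweek. With Liverpool's 2021-2022 rows missing, we should expect at most 39.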

In [7]:
matches["round"].value_counts()

Matchweek 1     39
Matchweek 16    39
Matchweek 34    39
Matchweek 32    39
Matchweek 31    39
Matchweek 29    39
Matchweek 28    39
Matchweek 26    39
Matchweek 25    39
Matchweek 24    39
Matchweek 23    39
Matchweek 2     39
Matchweek 19    39
Matchweek 17    39
Matchweek 20    39
Matchweek 15    39
Matchweek 5     39
Matchweek 3     39
Matchweek 13    39
Matchweek 12    39
Matchweek 4     39
Matchweek 11    39
Matchweek 10    39
Matchweek 9     39
Matchweek 8     39
Matchweek 14    39
Matchweek 7     39
Matchweek 6     39
Matchweek 30    37
Matchweek 27    37
Matchweek 22    37
Matchweek 21    37
Matchweek 18    37
Matchweek 33    32
Matchweek 35    20
Matchweek 36    20
Matchweek 37    20
Matchweek 38    20
Name: round, dtype: int64

As we see above, the vast majority of matchweeks have 39 rows, but a few have fewer. This is because the data was scraped in the middle of the 2021-2022 season, so we do not have complete information for every matchweek.

# Cleaning Data for Machine Learning

You may have noticed that Liverpool was missing a season of data. This is due to an issue with the scraping process. All of the other teams have the rows we'd expect, given that the data was scraped partway through the season.

It's fine to proceed, even with the missing data. Doing this type of investigation to verify the data is very useful before you start on a machine learning project. It ensures that the data is consistent.

Next, we'll clean the data and prepare it for machine learning.

In [8]:
matches.dtypes

date             object
time             object
comp             object
round            object
day              object
venue            object
result           object
gf              float64
ga              float64
opponent         object
xg              float64
xga             float64
poss            float64
attendance      float64
captain          object
formation        object
referee          object
match report     object
notes           float64
sh              float64
sot             float64
dist            float64
fk              float64
pk              float64
pkatt           float64
season            int64
team             object
dtype: object

From the output above, we can see that some columns need to be changed before our model can use them.

First, we need to convert the date column to a datetime format.


In [9]:
matches['date'] = pd.to_datetime(matches.date)


In [10]:
matches.dtypes

date            datetime64[ns]
time                    object
comp                    object
round                   object
day                     object
venue                   object
result                  object
gf                     float64
ga                     float64
opponent                object
xg                     float64
xga                    float64
poss                   float64
attendance             float64
captain                 object
formation               object
referee                 object
match report            object
notes                  float64
sh                     float64
sot                    float64
dist                   float64
fk                     float64
pk                     float64
pkatt                  float64
season                   int64
team                    object
dtype: object

# Creating Predictors for Machine Learning

Now that we've cleaned our data, we'll set up predictors (the columns we'll use to make our prediction), and our target (what we're going to predict). In this case, we want to predict if a team will win, so we'll code a win as a 1 and a loss or a draw as a 0. We'll then predict this target using the predictor columns.

Our initial predictors will be a set of codes that correspond to the venue, the opponent, the hour, and the day.

In [11]:
# Convert wins to 1 and losses/draws to 0
matches['target']=matches['result'].replace(['W', 'L','D'], [1,0,0])

matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,notes,sh,sot,dist,fk,pk,pkatt,season,team,target
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,1
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,1
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,1
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,0
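The same win/not-win coding can also be written as a boolean comparison. This is an alternative sketch, not the approach used in the cell above; `results` is a made-up series standing in for the `result` column.

```python
import pandas as pd

# Made-up results standing in for matches["result"].
results = pd.Series(["W", "L", "D", "W"], name="result")

# The comparison marks wins as True; casting to int gives 1 for a win
# and 0 for a loss or a draw.
target = (results == "W").astype(int)
print(target.tolist())  # [1, 0, 0, 1]
```

This form does not need the list of losing codes spelled out, so any value other than "W" is treated as 0.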


Next, we'll convert the venue column to 1 and 0 as well, since where a match takes place (home or away) can greatly influence its outcome.

We'll code home as 1 and away as 0.

In [12]:
matches['venue_code']=matches['venue'].replace(['Home', 'Away'], [1,0])
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,sh,sot,dist,fk,pk,pkatt,season,team,target,venue_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,1,1
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,1,1
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,1,0
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,0,1


Next, we'll do something similar for the opponent column, because a team can play better or worse depending on who it is facing.

We can't use replace here as we did above, because there are too many opponents to list by hand. Instead, we'll do the following:

- We'll convert the column to the category type with astype and then use .cat.codes to turn each opponent into a numeric code.

In [14]:
# Convert the opponent column to the category type; .cat.codes then returns an integer code for each opponent
matches["opp_code"] = matches["opponent"].astype("category").cat.codes
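To see how the codes are assigned, here is a small sketch on made-up data. The names `opponents`, `as_category` and `code_to_name` are assumptions for the example only.

```python
import pandas as pd

# Made-up opponents standing in for matches["opponent"].
opponents = pd.Series(["Tottenham", "Arsenal", "Norwich City", "Arsenal"])

# Converting to the category dtype sorts the unique values; .cat.codes then
# gives each row the position of its value in that sorted list.
as_category = opponents.astype("category")
codes = as_category.cat.codes

# Lookup table from code back to opponent name.
code_to_name = dict(enumerate(as_category.cat.categories))

print(codes.tolist())  # [2, 0, 1, 0]
print(code_to_name)    # {0: 'Arsenal', 1: 'Norwich City', 2: 'Tottenham'}
```

Because the categories are sorted, the codes follow alphabetical order, which is why Arsenal ends up with code 0. The numbers are only labels and carry no meaning about team strength.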

In [16]:
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,sot,dist,fk,pk,pkatt,season,team,target,venue_code,opp_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,4.0,16.9,1.0,0.0,0.0,2022,Manchester City,0,0,18
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,4.0,17.3,1.0,0.0,0.0,2022,Manchester City,1,1,15
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,1,1,0
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,8.0,14.0,0.0,0.0,0.0,2022,Manchester City,1,0,10
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,15.7,1.0,0.0,0.0,2022,Manchester City,0,1,17


In [17]:
# Confirm that every row with Arsenal as the opponent gets the same opp_code
matches[matches['opponent']=='Arsenal']


Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,sot,dist,fk,pk,pkatt,season,team,target,venue_code,opp_code
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,10.0,14.3,0.0,0.0,0.0,2022,Manchester City,1,1,0
29,2022-01-01,12:30,Premier League,Matchweek 21,Sat,Away,W,2.0,1.0,Arsenal,...,1.0,18.4,0.0,1.0,1.0,2022,Manchester City,1,0,0
2,2021-08-22,16:30,Premier League,Matchweek 2,Sun,Away,W,2.0,0.0,Arsenal,...,5.0,14.6,0.0,0.0,0.0,2022,Chelsea,1,0,0
52,2022-04-20,19:45,Premier League,Matchweek 25,Wed,Home,L,2.0,4.0,Arsenal,...,2.0,16.5,0.0,0.0,0.0,2022,Chelsea,0,1,0
9,2021-09-26,16:30,Premier League,Matchweek 6,Sun,Away,L,1.0,3.0,Arsenal,...,4.0,16.1,0.0,0.0,0.0,2022,Tottenham Hotspur,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37,2021-04-18,13:30,Premier League,Matchweek 32,Sun,Away,D,1.0,1.0,Arsenal,...,0.0,26.5,0.0,1.0,1.0,2021,Fulham,0,0,0
18,2021-01-02,20:00,Premier League,Matchweek 17,Sat,Home,L,0.0,4.0,Arsenal,...,3.0,18.3,0.0,0.0,0.0,2021,West Bromwich Albion,0,1,0
37,2021-05-09,19:00,Premier League,Matchweek 35,Sun,Away,L,1.0,3.0,Arsenal,...,1.0,17.7,0.0,0.0,0.0,2021,West Bromwich Albion,0,0,0
4,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Away,L,1.0,2.0,Arsenal,...,2.0,24.1,0.0,0.0,0.0,2021,Sheffield United,0,0,0


Finally, we'll do the same for the time and day columns. By the same reasoning, a team's performance can vary with the hour and the day of the week on which it plays.

In [18]:
matches['hour'] = pd.to_datetime(matches['time'], format='%H:%M').dt.hour

In [19]:
matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,dist,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,16.9,1.0,0.0,0.0,2022,Manchester City,0,0,18,16
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,17.3,1.0,0.0,0.0,2022,Manchester City,1,1,15,15
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,14.3,0.0,0.0,0.0,2022,Manchester City,1,1,0,12
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,14.0,0.0,0.0,0.0,2022,Manchester City,1,0,10,15
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,15.7,1.0,0.0,0.0,2022,Manchester City,0,1,17,15


In [21]:
matches['day_code']=matches['date'].dt.dayofweek

matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,1,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,1,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,1,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,0,1,17,15,5
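The two conversions above can be checked on a few known dates. The sketch below uses a made-up frame (`toy`) with three date and time pairs taken from the tables in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-15", "2021-08-21", "2020-09-14"]),
    "time": ["16:30", "15:00", "18:00"],
})

# Parse "HH:MM" strings and keep only the hour as an integer.
toy["hour"] = pd.to_datetime(toy["time"], format="%H:%M").dt.hour

# dayofweek numbers the days Monday=0 through Sunday=6.
toy["day_code"] = toy["date"].dt.dayofweek

print(toy["hour"].tolist())      # [16, 15, 18]
print(toy["day_code"].tolist())  # [6, 5, 0]
```

So a Sunday match gets day_code 6, a Saturday match 5, and a Monday match 0, which matches the values shown in the output above.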


# Training an Initial ML Model


Now that we have a target and predictors, we can train our initial model! We'll use a random forest classifier to make our initial predictions and then measure the accuracy of our predictions.

We'll first have to split our data into training and test sets. The training set is what we'll train our model with, and we'll use the test set to evaluate our accuracy. We have to be careful to ensure that the test data all falls after the training data. This is because we have time series data — some matches happened before other matches. We don't want to use future data to predict the past.

We'll use the precision score to measure the effectiveness of our model. A precision score is the number of times when the model said the team would win and the team actually won divided by the total number of times the model said the team would win. You can interpret it as "When the model says a team will win, what % of the time is it correct?"
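To make the definition concrete, the sketch below computes precision by hand on six made-up predictions. The lists `actual` and `predicted` are illustrative only and are not taken from the model in this notebook.

```python
# 1 = win, 0 = loss or draw (made-up labels for illustration).
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]

# True positives: the model predicted a win and the team actually won.
true_pos = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)

# All predicted wins, whether right or wrong.
predicted_pos = sum(predicted)

precision = true_pos / predicted_pos

# Accuracy, for comparison: the share of all predictions that were correct.
accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

print(round(precision, 3), round(accuracy, 3))  # 0.667 0.667
```

Here the model predicted three wins and two of them were correct, so precision is 2/3. Precision ignores the win the model missed, while accuracy counts every prediction.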

In [22]:
from sklearn.ensemble import RandomForestClassifier

In [23]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

In [26]:
# Use matches from 2022 onward as the test set and everything earlier as the training set.
# Because this is time series data, we must not use future matches to predict past ones,
# so we split by date instead of using train_test_split, which shuffles rows by default.

train=matches[matches['date']<'2022-01-01']

test=matches[matches['date']>='2022-01-01']
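A date-based split like this can be sanity-checked on a tiny example. The frame `toy` and the name `cutoff` below are assumptions for the sketch; the cutoff date is the same one used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"]),
    "target": [1, 0, 1],
})

cutoff = pd.Timestamp("2022-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every row lands in exactly one of the two sets.
print(len(toy_train), len(toy_test))  # 1 2

# The latest training match is earlier than the earliest test match.
print(toy_train["date"].max() < toy_test["date"].min())  # True
```

Using `<` on one side and `>=` on the other means a match played exactly on the cutoff date goes to the test set and is not dropped.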

In [42]:
train=train[['venue_code','opp_code','hour','day_code','target']]
test=test[['venue_code','opp_code','hour','day_code','target']]

In [45]:
# Select the predictor columns and fit the model on the training set

predictors=['venue_code','opp_code','hour','day_code']
rf.fit(train[predictors],train['target'])

RandomForestClassifier(min_samples_split=10, n_estimators=50, random_state=1)

In [46]:
pred = rf.predict(test[predictors])

In [47]:
from sklearn.metrics import precision_score

precision_score(test['target'], pred)

0.48333333333333334

In [48]:
from sklearn.metrics import accuracy_score
accuracy_score(test['target'], pred)

0.6134751773049646

# Improving the Model with Rolling Averages

The next thing we can do is improve the accuracy of the model with rolling averages. Rolling averages will compute the average team stats in the last N matches. These rolling averages will give the model information about what happened in the matches prior to the current one. For example, if the team scored fewer goals than opponents in the previous matches, this information can help the model make a better judgement about whether or not the team will win the next match.

To compute these rolling averages, we need to group the data by team. Grouping by team will ensure that we get rolling averages for matches by that team only. We also need to sort by date so that we get the rolling averages in the right order.

We need to be careful not to include the current row in the rolling average. The current row contains stats for the match we're predicting. In the real world, if we're predicting the outcome of a future match, we won't know how many goals the team scored in that match (since it hasn't been played yet!).
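The sketch below shows one way to exclude the current row: a rolling window with `closed="left"`. The series `goals` is made up, and the shifted version is included only as a cross-check.

```python
import pandas as pd

# Made-up goals scored in five consecutive matches.
goals = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# closed="left" drops the current row from each 3-row window,
# so row i averages rows i-3, i-2 and i-1.
rolling_left = goals.rolling(3, closed="left").mean()

# An equivalent formulation: shift by one row, then take a normal rolling mean.
rolling_shifted = goals.shift(1).rolling(3).mean()

print(rolling_left.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

The first three rows are NaN because they do not have three earlier matches to average. The fourth row's value of 2.0 is the mean of the first three matches and does not include the 4.0 scored in the fourth match itself.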

In [51]:
#grouping by teams

grouped_matches=matches.groupby('team')
grouped_matches.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,1,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,1,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,1,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,0,1,17,15,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,2020-09-14,18:00,Premier League,Matchweek 1,Mon,Home,L,0.0,2.0,Wolves,...,0.0,0.0,0.0,2021,Sheffield United,0,1,22,18,0
2,2020-09-21,18:00,Premier League,Matchweek 2,Mon,Away,L,0.0,1.0,Aston Villa,...,0.0,0.0,1.0,2021,Sheffield United,0,0,1,18,0
3,2020-09-27,12:00,Premier League,Matchweek 3,Sun,Home,L,0.0,1.0,Leeds United,...,0.0,0.0,0.0,2021,Sheffield United,0,1,9,12,6
4,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Away,L,1.0,2.0,Arsenal,...,0.0,0.0,0.0,2021,Sheffield United,0,0,0,14,6


In [56]:
group = grouped_matches.get_group("Manchester City")

In [57]:
group

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,fk,pk,pkatt,season,team,target,venue_code,opp_code,hour,day_code
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0.0,1.0,Tottenham,...,1.0,0.0,0.0,2022,Manchester City,0,0,18,16,6
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5.0,0.0,Norwich City,...,1.0,0.0,0.0,2022,Manchester City,1,1,15,15,5
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5.0,0.0,Arsenal,...,0.0,0.0,0.0,2022,Manchester City,1,1,0,12,5
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1.0,0.0,Leicester City,...,0.0,0.0,0.0,2022,Manchester City,1,0,10,15,5
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0.0,0.0,Southampton,...,1.0,0.0,0.0,2022,Manchester City,0,1,17,15,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54,2021-05-01,12:30,Premier League,Matchweek 34,Sat,Away,W,2.0,0.0,Crystal Palace,...,1.0,0.0,0.0,2021,Manchester City,1,0,6,12,5
56,2021-05-08,17:30,Premier League,Matchweek 35,Sat,Home,L,1.0,2.0,Chelsea,...,0.0,0.0,1.0,2021,Manchester City,0,1,5,17,5
57,2021-05-14,20:00,Premier League,Matchweek 36,Fri,Away,W,4.0,3.0,Newcastle Utd,...,1.0,0.0,0.0,2021,Manchester City,1,0,14,20,4
58,2021-05-18,19:00,Premier League,Matchweek 37,Tue,Away,L,2.0,3.0,Brighton,...,1.0,0.0,0.0,2021,Manchester City,0,0,3,19,1


We are going to use each team's previous matches to predict its future ones.

In [52]:
def rolling_avg(group, cols, new_cols):
    # group: the rows for a single team
    # cols: the columns to compute rolling averages for
    # new_cols: the names of the columns that will hold the rolling averages

    # Sort by date (ascending) so each window covers the three matches played before the current one
    group = group.sort_values('date')
    # closed='left' excludes the current row from its own window,
    # so only matches played before it are averaged
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    # Assign the rolling stats back to the group's DataFrame
    group[new_cols] = rolling_stats
    # Drop rows that do not have three previous matches to average
    group = group.dropna(subset=new_cols)
    return group
  

In [54]:
cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]


In [55]:
new_cols

['gf_rolling',
 'ga_rolling',
 'sh_rolling',
 'sot_rolling',
 'dist_rolling',
 'fk_rolling',
 'pk_rolling',
 'pkatt_rolling']

In [59]:
#running our function
rolling_avg(group, cols, new_cols)

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
5,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Home,W,1.0,0.0,Arsenal,...,17,5,2.000000,2.333333,17.333333,4.666667,18.900000,1.333333,0.333333,0.333333
7,2020-10-24,12:30,Premier League,Matchweek 6,Sat,Away,D,1.0,1.0,West Ham,...,12,5,1.333333,2.000000,17.333333,3.666667,17.733333,0.666667,0.000000,0.000000
9,2020-10-31,12:30,Premier League,Matchweek 7,Sat,Away,W,1.0,0.0,Sheffield Utd,...,12,5,1.000000,0.666667,16.666667,4.333333,18.233333,0.666667,0.000000,0.000000
11,2020-11-08,16:30,Premier League,Matchweek 8,Sun,Home,D,1.0,1.0,Liverpool,...,16,6,1.000000,0.333333,14.333333,6.666667,18.466667,1.000000,0.000000,0.000000
12,2020-11-21,17:30,Premier League,Matchweek 9,Sat,Away,L,0.0,2.0,Tottenham,...,17,5,1.000000,0.666667,12.000000,5.666667,19.366667,1.000000,0.000000,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42,2022-03-14,20:00,Premier League,Matchweek 29,Mon,Away,D,0.0,0.0,Crystal Palace,...,20,0,2.333333,1.333333,19.000000,7.000000,15.366667,0.333333,0.333333,0.333333
44,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Away,W,2.0,0.0,Burnley,...,15,5,1.666667,0.333333,18.333333,7.333333,16.000000,0.333333,0.000000,0.000000
46,2022-04-10,16:30,Premier League,Matchweek 32,Sun,Home,D,2.0,2.0,Liverpool,...,16,6,2.000000,0.333333,20.000000,6.666667,16.133333,0.333333,0.000000,0.000000
49,2022-04-20,20:00,Premier League,Matchweek 30,Wed,Home,W,3.0,0.0,Brighton,...,20,2,1.333333,0.666667,15.666667,4.666667,16.700000,0.333333,0.000000,0.000000


Now that we have the rolling averages for Manchester City, we'll apply the same function to every team in the dataset.

In [61]:
# groupby('team').apply calls rolling_avg once for each team's rows
matches_rolling=matches.groupby('team').apply(lambda x: rolling_avg(x, cols, new_cols))
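The key property of the per-team computation is that a window never mixes rows from different teams. The sketch below illustrates this on made-up data using `groupby(...).transform`, which is a different formulation from the `apply`-based helper above; the frame `toy` and its values are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-01", "2021-01-08", "2021-01-08",
        "2021-01-15", "2021-01-15", "2021-01-22", "2021-01-22",
    ]),
    "gf": [1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 9.0, 9.0],
})

# Sort so each team's matches are in date order, then compute the
# rolling mean of the three previous matches within each team.
toy = toy.sort_values(["team", "date"])
toy["gf_rolling"] = toy.groupby("team")["gf"].transform(
    lambda s: s.rolling(3, closed="left").mean()
)

# Only each team's fourth match has three earlier matches to average.
print(toy["gf_rolling"].dropna().tolist())  # [2.0, 0.0]
```

Team A's value is the mean of 1, 2 and 3, and team B's is the mean of three zeros, so neither window picked up goals from the other team.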

In [62]:
matches_rolling

team,Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
Arsenal,6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,14,6,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
Arsenal,7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,17,5,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
Arsenal,9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,19,6,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
Arsenal,11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,16,6,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
Arsenal,13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,19,6,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wolverhampton Wanderers,32,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,14,6,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
Wolverhampton Wanderers,33,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,20,4,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
Wolverhampton Wanderers,34,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,15,5,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
Wolverhampton Wanderers,35,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,20,4,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


In [63]:
# Drop the team level from the index, since the extra index level makes the data harder to work with

matches_rolling = matches_rolling.droplevel('team')

In [64]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
6,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,14,6,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
7,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,17,5,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
9,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,19,6,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
11,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,16,6,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
13,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,19,6,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,14,6,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
33,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,20,4,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
34,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,15,5,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
35,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,20,4,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


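Notice that the index no longer starts from zero and is not unique: each team's rows kept their original index values, so the same value can appear for several teams. We'll replace it with a fresh index.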

In [65]:
# Assign the values 0 to 1315 as the new index

matches_rolling.index = range(matches_rolling.shape[0])
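The effect of renumbering the index can be seen on a tiny made-up frame. The sketch uses `reset_index(drop=True)`, which is an alternative to assigning a `range` as done above; the name `toy` is an assumption for the example.

```python
import pandas as pd

# Two teams whose rows reuse the same index label (6), as happens after droplevel.
toy = pd.DataFrame({"team": ["A", "A", "B"]}, index=[6, 7, 6])
print(toy.index.is_unique)  # False

# drop=True discards the old labels instead of keeping them as a column.
renumbered = toy.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

A unique index matters later, because the predictions are joined back to the match details by index.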

In [66]:
matches_rolling

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,hour,day_code,gf_rolling,ga_rolling,sh_rolling,sot_rolling,dist_rolling,fk_rolling,pk_rolling,pkatt_rolling
0,2020-10-04,14:00,Premier League,Matchweek 4,Sun,Home,W,2.0,1.0,Sheffield Utd,...,14,6,2.000000,1.333333,7.666667,3.666667,14.733333,0.666667,0.000000,0.000000
1,2020-10-17,17:30,Premier League,Matchweek 5,Sat,Away,L,0.0,1.0,Manchester City,...,17,5,1.666667,1.666667,5.333333,3.666667,15.766667,0.000000,0.000000,0.000000
2,2020-10-25,19:15,Premier League,Matchweek 6,Sun,Home,L,0.0,1.0,Leicester City,...,19,6,1.000000,1.666667,7.000000,3.666667,16.733333,0.666667,0.000000,0.000000
3,2020-11-01,16:30,Premier League,Matchweek 7,Sun,Away,W,1.0,0.0,Manchester Utd,...,16,6,0.666667,1.000000,9.666667,4.000000,16.033333,1.000000,0.000000,0.000000
4,2020-11-08,19:15,Premier League,Matchweek 8,Sun,Home,L,0.0,3.0,Aston Villa,...,19,6,0.333333,0.666667,9.666667,2.666667,18.033333,1.000000,0.333333,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1312,2022-03-13,14:00,Premier League,Matchweek 29,Sun,Away,W,1.0,0.0,Everton,...,14,6,1.333333,1.000000,12.333333,3.666667,19.300000,0.000000,0.000000,0.000000
1313,2022-03-18,20:00,Premier League,Matchweek 30,Fri,Home,L,2.0,3.0,Leeds United,...,20,4,1.666667,0.666667,12.333333,4.333333,19.600000,0.000000,0.000000,0.000000
1314,2022-04-02,15:00,Premier League,Matchweek 31,Sat,Home,W,2.0,1.0,Aston Villa,...,15,5,2.333333,1.000000,13.000000,5.333333,19.833333,0.000000,0.000000,0.000000
1315,2022-04-08,20:00,Premier League,Matchweek 32,Fri,Away,L,0.0,1.0,Newcastle Utd,...,20,4,1.666667,1.333333,13.000000,5.000000,18.533333,0.000000,0.000000,0.000000


# Retraining Our Model


Now that we have more predictors, we can retrain our model and measure precision again to see whether the new rolling-average columns improve our predictions.

In [67]:
# wrapping the train/predict/score steps in a function


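def make_predictions(data, predictors):
  # split by date: train on matches before 2022, test on matches after the cutoff
  # (the strict comparisons leave out matches played on 2022-01-01 itself)
  train = data[data["date"] < "2022-01-01"]
  test = data[data["date"] > "2022-01-01"]
  rf.fit(train[predictors], train["target"])
  preds = rf.predict(test[predictors])
  # keep actual and predicted values side by side, indexed like the test set
  combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
  # precision: share of predicted wins that were actual wins
  precision = precision_score(test["target"], preds)
  return combined, precision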
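
The function above uses `<` for training and `>` for testing, so a match played exactly on the cutoff date lands in neither set. The sketch below shows one possible alternative that keeps those rows in the test set. `split_by_date` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def split_by_date(data, cutoff):
    """Hypothetical helper: rows before the cutoff train, the rest test."""
    cutoff = pd.Timestamp(cutoff)
    train = data[data["date"] < cutoff]
    test = data[data["date"] >= cutoff]  # >= keeps matches on the cutoff day
    return train, test

# Made-up rows, only to show the boundary behaviour.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02"]),
    "target": [1, 0, 1],
})
train, test = split_by_date(toy, "2022-01-01")
print(len(train), len(test))
```

Using a split like this inside `make_predictions` would change the test set, so the precision values reported in this notebook would likely differ.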
     


In [68]:

combined, precision = make_predictions(matches_rolling, predictors + new_cols)

In [69]:

precision
     

0.625
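
For reference, precision is the share of predicted wins that were actual wins: true positives divided by all predicted positives. The sketch below computes it by hand on made-up labels. `precision_from_labels` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def precision_from_labels(actual, predicted):
    """Share of predicted wins (1) that were actual wins."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)
    return true_pos / (true_pos + false_pos)

# Made-up labels, not taken from the notebook's test set.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
toy_precision = precision_from_labels(actual, predicted)
print(toy_precision)
```

This sketch assumes at least one predicted win; with none, the division would fail.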

In [70]:
combined

Unnamed: 0,actual,predicted
55,0,0
56,1,0
57,1,0
58,1,1
59,1,1
...,...,...
1312,1,0
1313,0,0
1314,1,0
1315,0,0


In [71]:
# merging the DataFrames to add match details to the predictions
combined = combined.merge(matches_rolling[["date", "team", "opponent", "result"]], left_index=True, right_index=True)


In [72]:
combined

Unnamed: 0,actual,predicted,date,team,opponent,result
55,0,0,2022-01-23,Arsenal,Burnley,D
56,1,0,2022-02-10,Arsenal,Wolves,W
57,1,0,2022-02-19,Arsenal,Brentford,W
58,1,1,2022-02-24,Arsenal,Wolves,W
59,1,1,2022-03-06,Arsenal,Watford,W
...,...,...,...,...,...,...
1312,1,0,2022-03-13,Wolverhampton Wanderers,Everton,W
1313,0,0,2022-03-18,Wolverhampton Wanderers,Leeds United,L
1314,1,0,2022-04-02,Wolverhampton Wanderers,Aston Villa,W
1315,0,0,2022-04-08,Wolverhampton Wanderers,Newcastle Utd,L


# Combining Home and Away Predictions

You may have noticed that we're predicting "both sides" of a match: the home team's result and the away team's result. These predictions don't always line up; sometimes the model predicts that both teams will win.

By keeping only the predictions where the model thinks one team will win and the other will not, we can potentially boost the precision of our predictions.

To do this, we'll need to combine "both sides" of a match into a single row. First, we join the predictions with the other columns in our matches DataFrame, as in the merge above.

Then, we can join the combined DataFrame against itself using the team and opponent columns. Unfortunately, some team names are written differently in the two columns (for example, "Wolverhampton Wanderers" in team but "Wolves" in opponent), so we need to normalize them first.

Finally, we merge the DataFrame against itself and keep only the rows where one team is predicted to win and the other is not.


In [73]:

# make a class that inherits from the dict class
# by default, the pandas map method returns NaN for any value that is missing from the mapping dictionary
# with __missing__ defined, a team name that is not in the dictionary is kept as the original name that was passed in
class MissingDict(dict):
    __missing__ = lambda self, key: key

map_values = {"Brighton and Hove Albion": "Brighton", 
              "Manchester United": "Manchester Utd", 
              "Newcastle United": "Newcastle Utd", 
              "Tottenham Hotspur": "Tottenham", 
              "West Ham United": "West Ham", 
              "Wolverhampton Wanderers": "Wolves"} 
mapping = MissingDict(**map_values)

In [74]:
# confirming that the mapping works
mapping["West Ham United"]

'West Ham'
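
The difference from a plain dictionary shows up with a name that is not a key. The following sketch uses a made-up two-element Series and a one-entry mapping to compare the two behaviours.

```python
import pandas as pd

class MissingDict(dict):
    # Called by dict lookups when the key is absent: return the key itself.
    def __missing__(self, key):
        return key

names = pd.Series(["West Ham United", "Arsenal"])
short = {"West Ham United": "West Ham"}

plain = names.map(short)                # "Arsenal" is not a key
kept = names.map(MissingDict(**short))  # "Arsenal" falls back to itself

print(plain.tolist())
print(kept.tolist())
```

With the plain dictionary, the unmapped name becomes NaN; with `MissingDict`, it is passed through unchanged.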

Now that we know it works, we'll pass this mapping to the pandas map method.

In [75]:
combined["new_team"] = combined["team"].map(mapping)

In [76]:
combined

Unnamed: 0,actual,predicted,date,team,opponent,result,new_team
55,0,0,2022-01-23,Arsenal,Burnley,D,Arsenal
56,1,0,2022-02-10,Arsenal,Wolves,W,Arsenal
57,1,0,2022-02-19,Arsenal,Brentford,W,Arsenal
58,1,1,2022-02-24,Arsenal,Wolves,W,Arsenal
59,1,1,2022-03-06,Arsenal,Watford,W,Arsenal
...,...,...,...,...,...,...,...
1312,1,0,2022-03-13,Wolverhampton Wanderers,Everton,W,Wolves
1313,0,0,2022-03-18,Wolverhampton Wanderers,Leeds United,L,Wolves
1314,1,0,2022-04-02,Wolverhampton Wanderers,Aston Villa,W,Wolves
1315,0,0,2022-04-08,Wolverhampton Wanderers,Newcastle Utd,L,Wolves


In [77]:
# merging the DataFrame with itself
merged = combined.merge(combined, left_on = ["date", "new_team"], right_on=["date", "opponent"])

In [78]:
merged

Unnamed: 0,actual_x,predicted_x,date,team_x,opponent_x,result_x,new_team_x,actual_y,predicted_y,team_y,opponent_y,result_y,new_team_y
0,0,0,2022-01-23,Arsenal,Burnley,D,Arsenal,0,0,Burnley,Arsenal,D,Burnley
1,1,0,2022-02-10,Arsenal,Wolves,W,Arsenal,0,0,Wolverhampton Wanderers,Arsenal,L,Wolves
2,1,0,2022-02-19,Arsenal,Brentford,W,Arsenal,0,0,Brentford,Arsenal,L,Brentford
3,1,1,2022-02-24,Arsenal,Wolves,W,Arsenal,0,0,Wolverhampton Wanderers,Arsenal,L,Wolves
4,1,1,2022-03-06,Arsenal,Watford,W,Arsenal,0,0,Watford,Arsenal,L,Watford
...,...,...,...,...,...,...,...,...,...,...,...,...,...
257,1,0,2022-03-13,Wolverhampton Wanderers,Everton,W,Wolves,0,0,Everton,Wolves,L,Everton
258,0,0,2022-03-18,Wolverhampton Wanderers,Leeds United,L,Wolves,1,0,Leeds United,Wolves,W,Leeds United
259,1,0,2022-04-02,Wolverhampton Wanderers,Aston Villa,W,Wolves,0,0,Aston Villa,Wolves,L,Aston Villa
260,0,0,2022-04-08,Wolverhampton Wanderers,Newcastle Utd,L,Wolves,1,0,Newcastle United,Wolves,W,Newcastle Utd
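
An inner merge silently drops any row that has no partner, for example when a team name is still spelled differently on the two sides. As a hedged sketch, a left merge with `indicator=True` can list such rows. The data below is made up, and only the column names mirror the notebook.

```python
import pandas as pd

# Made-up predictions; the Leeds United row has no partner because the
# other side of that match is missing from the frame.
toy = pd.DataFrame({
    "date": ["2022-03-13", "2022-03-13", "2022-03-18"],
    "new_team": ["Wolves", "Everton", "Leeds United"],
    "opponent": ["Everton", "Wolves", "Wolves"],
    "predicted": [1, 0, 0],
})

checked = toy.merge(
    toy,
    left_on=["date", "new_team"],
    right_on=["date", "opponent"],
    how="left",      # keep left rows even without a partner
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["date", "new_team_x", "opponent_x"]])
```

Running the same check on `combined` would show which predictions, if any, were dropped by the inner merge.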


In [79]:
# when the model predicted that team A would win and team B would not, what actually happened?
merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]["actual_x"].value_counts()


1    27
0    13
Name: actual_x, dtype: int64

In [80]:
27/40


0.675

Of the 40 matches where the model predicted a win for one team and no win for its opponent, that team actually won 27. On this subset, the model's precision is 27/40 = 67.5%, up from 62.5% across all test predictions.

# Conclusion

After initial training and some fine-tuning, our model predicted match winners with a precision of 67.5% on the matches where its two predictions for a match were consistent.

Here are some potential next steps to improve our model:

- Get additional data for more Premier League seasons to improve the model
- Create extra predictors by using more of the columns in the data
- Try making predictions for a different league
- Use data from additional competitions outside of the Premier League
- Look at opponent rolling averages and use them as predictor columns
- Try out a different algorithm