# La Quiniela Machine Learning Analysis

In this notebook we build a model to predict the results of matches in La Liga, the Spanish football league. We train the model on La Liga matches from the 1928-1929 season through 2018-2019 and hold out the 2019-2020 season to measure its accuracy, with the intention of predicting the results of the 2020-2021 season.

The model of choice is a random forest that uses four predictors: the name of the home team, the name of the away team, the current rank of the home team and the current rank of the away team.

First we read the data from the SQLite database where it is stored. This gives us a dataset in which each row represents a match. We then extend it so that each match also carries the ranking of both teams at that matchday of its season, because these two ranking values are predictors for our model.

In [33]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")

# Read the data
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()
# Modify df to add the rank of each team at the matchday of each match.
# Drop rows with missing values and split the "home:away" score into integers.
df.dropna(inplace=True)
df["home_score"] = df["score"].str.split(":").str[0].astype(int)
df["away_score"] = df["score"].str.split(":").str[1].astype(int)
# Winner and loser of each match; the string "NaN" marks a draw.
df["winner"] = df.apply(lambda x: x["home_team"] if x["home_score"] > x["away_score"] else (x["away_team"] if x["home_score"] < x["away_score"] else "NaN"), axis=1)
df["loser"] = df.apply(lambda x: x["home_team"] if x["home_score"] < x["away_score"] else (x["away_team"] if x["home_score"] > x["away_score"] else "NaN"), axis=1)
# Build a long table in which each match appears twice:
# once for the home team and once for the away team.
df_ht = df.copy()
df_ht['team'] = df['home_team']
df_aw = df.copy()
df_aw['team'] = df['away_team']
df_total = pd.concat([df_ht, df_aw])
df_total = df_total.sort_values(by=['season', 'division', 'matchday', 'score'])
# Flag wins (W) and draws (T) for the team of each row.
df_total['W'] = df_total.apply(lambda x: 1 if x['winner'] == x['team'] else 0, axis=1)
df_total['T'] = df_total.apply(lambda x: 1 if x['loser'] == 'NaN' else 0, axis=1)
# Running totals per team within a season and division, current match included.
df_total['W'] = df_total.groupby(['season', 'division', 'team'])['W'].cumsum()
df_total['T'] = df_total.groupby(['season', 'division', 'team'])['T'].cumsum()
# Points: 3 per win, 1 per draw.
df_total['Pts'] = 3 * df_total['W'] + df_total['T']
# Rank by points within each matchday; tied teams share the best rank.
df_total['rank'] = df_total.groupby(['division', 'season', 'matchday'])['Pts'].rank(method='min', ascending=False)
df_total = df_total.sort_index()
# Split the long table back into home and away rows; both keep df's index,
# so the assignments below align match by match.
df_total_ht = df_total.loc[df_total['home_team'] == df_total['team']]
df_total_aw = df_total.loc[df_total['away_team'] == df_total['team']]
df['home_team_rank'] = df_total_ht['rank']
df['away_team_rank'] = df_total_aw['rank']
# Remove the helper columns.
df.drop(columns=['winner', 'loser', 'home_score', 'away_score'], inplace=True)

We can now take a look at what df looks like. It has two new columns, home_team_rank and away_team_rank, which contain the rank of each team at the corresponding matchday.

In [34]:
# Show ten random matches (only the last expression of a cell is displayed)
df.sample(10)

index,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank
44032,2012-2013,2,2,8/25/12,7:00 PM,CE Sabadell,Villarreal,0:0,15.0,3.0
41163,2005-2006,2,35,4/30/06,6:00 PM,Racing Ferrol,CD Numancia,1:3,20.0,7.0
43142,2010-2011,2,5,9/25/10,6:00 PM,FC Cartagena,Córdoba CF,1:2,13.0,15.0
39784,2002-2003,2,35,5/11/03,7:00 PM,Poli Ejido,Real Murcia,0:1,7.0,1.0
40072,2003-2004,2,20,1/17/04,6:30 PM,Elche CF,Córdoba CF,1:1,2.0,5.0
24025,2015-2016,1,25,2/21/16,6:15 PM,Athletic,Real Sociedad,0:1,8.0,9.0
47199,2018-2019,2,38,5/10/19,9:00 PM,Granada CF,CD Tenerife,2:1,2.0,17.0
45335,2014-2015,2,36,5/3/15,5:00 PM,Sporting Gijón,RCD Mallorca,1:0,3.0,16.0
46881,2018-2019,2,9,10/12/18,6:00 PM,Extremadura,Cádiz CF,2:1,17.0,20.0
22622,2011-2012,1,37,5/5/12,9:00 PM,Sevilla FC,Rayo Vallecano,5:2,9.0,17.0
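The cumulative sums in the cell above include the match on the row itself, so home_team_rank and away_team_rank describe the table after the matchday has been played. The following is a minimal sketch on a toy fixture list (the teams, scores and column names such as `pts_before` are made up for illustration) that computes the rank both after and before each matchday, assuming one would want predictors that only use information available before kickoff.

```python
import pandas as pd

# Toy fixture list (hypothetical teams and scores), one row per match.
matches = pd.DataFrame({
    "matchday": [1, 1, 2, 2],
    "home_team": ["A", "C", "A", "B"],
    "away_team": ["B", "D", "C", "D"],
    "home_score": [2, 1, 0, 3],
    "away_score": [0, 1, 1, 0],
})

# Long format: one row per (match, team) with goals for and against.
home = pd.DataFrame({
    "matchday": matches["matchday"],
    "team": matches["home_team"],
    "gf": matches["home_score"],
    "ga": matches["away_score"],
})
away = pd.DataFrame({
    "matchday": matches["matchday"],
    "team": matches["away_team"],
    "gf": matches["away_score"],
    "ga": matches["home_score"],
})
long = pd.concat([home, away], ignore_index=True).sort_values(["matchday", "team"])

# Points earned in the match itself: 3 for a win, 1 for a draw, 0 for a loss.
long["match_pts"] = 3 * (long["gf"] > long["ga"]).astype(int) + (long["gf"] == long["ga"]).astype(int)

# Running total including the current match, and the total before it.
long["pts_after"] = long.groupby("team")["match_pts"].cumsum()
long["pts_before"] = long["pts_after"] - long["match_pts"]

# Rank within each matchday; tied teams share the best rank.
long["rank_after"] = long.groupby("matchday")["pts_after"].rank(method="min", ascending=False)
long["rank_before"] = long.groupby("matchday")["pts_before"].rank(method="min", ascending=False)

print(long[["matchday", "team", "pts_before", "rank_before", "pts_after", "rank_after"]])
```

On matchday 1 every team has `rank_before` equal to 1, because nobody has points yet, while `rank_after` already reflects the results of that matchday.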


We are also going to add a column which represents the winner of each match using the following notation:

- 0: Home team wins
- 1: Draw
- 2: Away team wins

This is the column we will want our model to predict.
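As a small worked example of this notation, the sketch below encodes three made-up score strings. It uses `np.select` as one possible way to express the three cases; the scores are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores in the same "home:away" format as the score column.
scores = pd.Series(["2:0", "1:1", "0:3"])

# Split into home and away goals and take the difference.
goals = scores.str.split(":", expand=True).astype(int)
diff = goals[0] - goals[1]

# 0 = home team wins, 2 = away team wins, 1 = draw (the default case).
winner = np.select([diff > 0, diff < 0], [0, 2], default=1)
print(winner.tolist())
```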

In [35]:
# Drop rows with missing values
df = df.dropna()
# Goal difference from the home team's point of view
goals = df['score'].str.split(':', expand=True).astype(int)
goal_diff = goals[0] - goals[1]
# Encode the result: 0 = home team wins, 1 = draw, 2 = away team wins
df['winner'] = np.where(goal_diff > 0, 0, np.where(goal_diff < 0, 2, 1))
df.sample(10)

index,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner
39101,2001-2002,2,15,11/18/01,5:00 PM,Gimnàstic,Elche CF,1:2,20.0,11.0,2
40811,2005-2006,2,3,9/10/05,8:30 PM,Hércules CF,UD Almería,2:2,17.0,11.0,1
23261,2013-2014,1,25,2/22/14,8:00 PM,Real Sociedad,Barcelona,3:1,5.0,2.0,0
21726,2009-2010,1,23,2/21/10,9:00 PM,Real Madrid,Villarreal,6:2,2.0,10.0,0
42102,2007-2008,2,36,5/4/08,6:00 PM,Poli Ejido,SD Eibar,1:0,22.0,11.0,0
41826,2007-2008,2,11,11/4/07,5:00 PM,Córdoba CF,CD Castellón,2:1,5.0,10.0,0
39641,2002-2003,2,22,2/9/03,6:00 PM,CD Tenerife,SD Compostela,5:1,11.0,17.0,0
21044,2007-2008,1,31,4/6/08,5:00 PM,Racing,Dep. La Coruña,1:3,5.0,12.0,2
46831,2018-2019,2,4,9/9/18,4:00 PM,CD Numancia,Elche CF,1:0,10.0,18.0,0
25281,2018-2019,1,37,5/12/19,6:30 PM,Real Betis,SD Huesca,2:1,10.0,20.0,0


To feed the data into the model we assign a number to each team using a dictionary. We then add two columns to df: one with the home team's number and one with the away team's number.

In [36]:
# Assign a number to each team, in order of first appearance as home team
teams = df['home_team'].unique().tolist()

# Create a dictionary that maps each team name to its number
teams_dict = {team: i for i, team in enumerate(teams)}

# Create a new column with the number of the home team
df['home_team_num'] = df['home_team'].map(teams_dict)
# Create a new column with the number of the away team
df['away_team_num'] = df['away_team'].map(teams_dict)
# Store the label and the team numbers as integers
df['winner'] = df['winner'].astype(int)
df['home_team_num'] = df['home_team_num'].astype(int)
df['away_team_num'] = df['away_team_num'].astype(int)

df.sample(10)

index,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner,home_team_num,away_team_num
20138,2005-2006,1,17,12/20/05,10:00 PM,Barcelona,Celta de Vigo,2:0,1.0,7.0,0,14,3
45373,2014-2015,2,40,5/23/15,4:00 PM,Recr. Huelva,CE Sabadell,3:1,20.0,21.0,0,27,70
43709,2011-2012,2,14,11/19/11,7:30 PM,UD Las Palmas,Xerez CD,0:0,10.0,17.0,1,23,36
23636,2014-2015,1,24,2/22/15,9:00 PM,Elche CF,Real Madrid,0:2,16.0,1.0,2,39,10
19729,2004-2005,1,14,12/4/04,8:00 PM,Barcelona,Málaga CF,4:0,1.0,19.0,0,14,24
21889,2010-2011,1,2,9/11/10,6:00 PM,Valencia,Racing,1:0,1.0,19.0,0,2,9
21355,2008-2009,1,24,2/22/09,5:00 PM,CA Osasuna,CD Numancia,2:0,17.0,20.0,0,22,21
24348,2016-2017,1,20,1/27/17,8:45 PM,CA Osasuna,Málaga CF,1:1,19.0,14.0,1,22,24
47822,2020-2021,2,10,11/1/20,9:00 PM,UD Almería,Girona,0:0,5.0,5.0,1,34,43
22660,2012-2013,1,3,9/1/12,8:00 PM,Dep. La Coruña,Getafe,1:1,7.0,9.0,1,15,31
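The dictionary in the cell above is built from the home_team column only. A team that appeared only as an away team would be mapped to NaN, and the later `astype(int)` would fail. A minimal sketch of a more defensive variant, assuming a toy table with made-up team names, builds the mapping from both columns:

```python
import pandas as pd

# Toy data: team "D" only ever plays away (hypothetical names).
toy = pd.DataFrame({
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "C", "D"],
})

# Collect every team from both columns and number them in sorted order.
all_teams = sorted(set(toy["home_team"]) | set(toy["away_team"]))
team_ids = {name: i for i, name in enumerate(all_teams)}

toy["home_team_num"] = toy["home_team"].map(team_ids)
toy["away_team_num"] = toy["away_team"].map(team_ids)
print(toy)
```

Sorting the names makes the numbering independent of row order, which is a design choice of this sketch and differs from the first-appearance order used in the notebook.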


The data is now ready to train the model. We train it on the seasons before 2019-2020 and check the model's accuracy on the 2019-2020 season.

In [79]:
# We define the training data: every season before 2019-2020
# (season labels such as '2018-2019' compare correctly as strings)
train_data = df[df['season'] < '2019-2020']

# We define the model (a regressor fitted on the 0/1/2 outcome codes)
forest_model = RandomForestRegressor(random_state=100, n_estimators=100)

# We select the predictors and the variable we want to predict
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
Y_train = train_data['winner']

#We train the model
forest_model.fit(X_train, Y_train)
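The model above is a regressor fitted on the class codes 0, 1 and 2. Because winner is a categorical label, one alternative is a classifier. The sketch below is not the method used in this notebook: it assumes scikit-learn's `RandomForestClassifier` and `accuracy_score`, and it runs on synthetic stand-in data in which the better-ranked side always wins, so its accuracy says nothing about La Liga.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in features: home and away rank between 1 and 20.
home_rank = rng.integers(1, 21, size=n)
away_rank = rng.integers(1, 21, size=n)
X = np.column_stack([home_rank, away_rank])

# Synthetic labels: the better-ranked side wins, equal ranks draw.
y = np.where(home_rank < away_rank, 0, np.where(home_rank > away_rank, 2, 1))

# Simple split: first 500 rows for training, last 100 for testing.
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

clf = RandomForestClassifier(n_estimators=100, random_state=100)
clf.fit(X_train, y_train)

# A classifier returns class labels directly, so no integer cast is needed.
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
```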


Now we will use the constructed model to predict the results of season 2019-2020. The accuracy of this prediction will be the model's measure of quality.

In [80]:
# Use the model to predict the winner of each match of the 2019-2020 season
season_19_20 = df[df['season'] == '2019-2020'].copy()
test_19_20 = season_19_20[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
# The regressor returns continuous values; astype(int) truncates them (e.g. 1.9 becomes 1)
season_19_20['prediction_win'] = forest_model.predict(test_19_20).astype(int)
# Accuracy: fraction of matches whose predicted code equals the real one
accuracy_19_20 = (season_19_20['winner'] == season_19_20['prediction_win']).sum() / len(season_19_20)
print('The accuracy of the model for predicting the 2019-2020 season is: ', accuracy_19_20)


The accuracy of the model for predicting the 2019-2020 season is:  0.46318289786223277
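The prediction cell converts the regressor's continuous output to class codes with `astype(int)`, which truncates toward zero instead of rounding. The values below are made up for illustration (they are not model outputs) and show how the two conversions differ:

```python
import numpy as np

# Hypothetical continuous predictions between 0 and 2.
raw = np.array([0.4, 1.6, 1.99, 2.0])

# Truncation: only a value of exactly 2.0 becomes class 2.
truncated = raw.astype(int)

# Rounding to the nearest integer before casting.
rounded = np.rint(raw).astype(int)

print(truncated.tolist())  # [0, 1, 1, 2]
print(rounded.tolist())    # [0, 2, 2, 2]
```

With truncation, a match is labelled as an away win only when the prediction is exactly 2.0, so the choice of conversion affects the reported accuracy.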


In [None]:
#Save the model
import joblib

# Serialize the trained model to disk
joblib.dump(forest_model, 'model.pkl')
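A saved model can be loaded back with `joblib.load`. The sketch below is a self-contained round trip under stated assumptions: it trains a tiny stand-in regressor on made-up rows (not the notebook's data), writes it to a temporary directory and checks that the restored model gives the same predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up rows: home_team_num, away_team_num, home_team_rank, away_team_rank.
X = np.array([[0, 1, 1, 5], [1, 0, 5, 1], [2, 3, 2, 2], [3, 2, 4, 3]])
y = np.array([0, 2, 1, 2])

model = RandomForestRegressor(n_estimators=10, random_state=100).fit(X, y)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```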