# La Quiniela Machine Learning Analysis

In this notebook we build a model to predict the results of La Liga matches (Spain). To train the model we use data from La Liga matches from the 1928-1929 season up to the 2018-2019 season, with the intention of predicting the results of the 2020-2021 season.

The model of choice is a random forest that uses four predictors: the name of the home team, the name of the away team, the current rank of the home team and the current rank of the away team.

First of all we read the data from the SQLite database in which it is stored. This gives us a dataset in which each row represents a match. We then modify this dataset so that each match also carries the ranking of each of the two teams at that matchday of that season. We do this because these two ranking values are used as predictors for our model.
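
The ranking idea can be illustrated on a toy table before looking at the full cell. This is only a minimal sketch under stated assumptions: the table `toy` and its `points` column are hypothetical stand-ins, not part of the database, and only the cumulative-sum-then-rank pattern matches the notebook.

```python
import pandas as pd

# Hypothetical long-format table: one row per (team, matchday) with the points earned that day.
toy = pd.DataFrame({
    "season":   ["2019-2020"] * 6,
    "division": [1] * 6,
    "matchday": [1, 1, 1, 2, 2, 2],
    "team":     ["A", "B", "C", "A", "B", "C"],
    "points":   [3, 0, 1, 0, 3, 1],
})

# Running total of points per team within a season and division.
toy = toy.sort_values(["season", "division", "matchday"])
toy["Pts"] = toy.groupby(["season", "division", "team"])["points"].cumsum()

# Rank within each matchday: more points means a lower (better) rank, ties share the best rank.
toy["rank"] = toy.groupby(["season", "division", "matchday"])["Pts"].rank(
    method="min", ascending=False
)
print(toy[["matchday", "team", "Pts", "rank"]])
```

As in the cell that follows, the running total includes the points earned on the matchday itself, so the rank of a row already reflects the result of that match.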

In [2]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")

# Read the data
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()
# Modify df to add the rank of each team at the current matchday of each match
df.dropna(inplace = True)
# Parse the "home:away" score string into two integer columns
df["home_score"] = df.apply(lambda x: int(x["score"].split(":")[0]), axis = 1)
df["away_score"] = df.apply(lambda x: int(x["score"].split(":")[1]), axis = 1)
# Name of the winning and of the losing team; the string "NaN" marks a draw
df["winner"] = df.apply(lambda x : x["home_team"] if(x["home_score"] > x["away_score"]) else (x["away_team"] if(x["home_score"] < x["away_score"]) else "NaN"), axis = 1)
df["loser"] = df.apply(lambda x : x["home_team"] if(x["home_score"] < x["away_score"]) else (x["away_team"] if(x["home_score"] > x["away_score"]) else "NaN"), axis = 1)
# Build a long table in which every match appears twice, once per team
df_ht = df.copy()
df_ht['team'] = df['home_team']
df_aw = df.copy()
df_aw['team'] = df['away_team']
df_total = pd.concat([df_ht,df_aw])
df_total = df_total.sort_values(by = ['season', 'division', 'matchday', 'score'])
# Win (W) and draw (T) flags per row, then running totals per team within a season and division
df_total['W'] = df_total.apply(lambda x : 1 if x['winner'] == x['team'] else 0, axis = 1)
df_total['T'] = df_total.apply(lambda x : 1 if x['loser'] == 'NaN' else 0, axis = 1)
df_total['W'] = df_total.groupby(['season', 'division', 'team'])['W'].cumsum()
df_total['T'] = df_total.groupby(['season', 'division', 'team'])['T'].cumsum()
# Points as computed here: 3 per win and 1 per draw
df_total['Pts'] = 3 * df_total['W'] + df_total['T']
# Rank the teams by cumulative points within each matchday (ties share the best rank)
df_total['rank'] = df_total.groupby(['division','season','matchday'])['Pts'].rank(method = 'min', ascending=False)
# Split back into home and away rows and align the ranks with df through the index
df_total = df_total.sort_index()
df_total_ht = df_total.loc[df_total['home_team'] == df_total['team']]
df_total_aw = df_total.loc[df_total['away_team'] == df_total['team']]
df['home_team_rank'] = df_total_ht['rank']
df['away_team_rank'] = df_total_aw['rank']
# Remove the helper columns
df.drop(columns = ['winner', 'loser'], inplace = True)
df.drop(columns = ['home_score', 'away_score'], inplace = True)

We can now take a look at what df looks like. As we can see, it has the columns home_team_rank and away_team_rank, which contain the rank of each team at the corresponding matchday.

In [3]:
# Show 10 random rows (only the last expression of a cell is displayed)
df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank
43171,2010-2011,2,7,10/10/10,7:00 PM,Ponferradina,Real Valladolid,1:1,21.0,7.0
22115,2010-2011,1,24,2/20/11,7:00 PM,Sevilla FC,Hércules CF,1:0,7.0,14.0
43170,2010-2011,2,7,10/10/10,5:00 PM,Girona,Recr. Huelva,0:0,16.0,19.0
21632,2009-2010,1,14,12/13/09,5:00 PM,Getafe,CD Tenerife,2:1,8.0,12.0
24194,2016-2017,1,4,9/18/16,4:15 PM,Athletic,Valencia,2:1,9.0,20.0
25767,2020-2021,1,9,11/8/20,9:00 PM,Valencia,Real Madrid,4:1,10.0,3.0
43881,2011-2012,2,30,5/23/12,8:00 PM,CD Alcoyano,Hércules CF,0:5,19.0,4.0
22928,2012-2013,1,30,4/5/13,9:00 PM,Granada CF,Real Betis,1:5,16.0,6.0
40844,2005-2006,2,6,10/2/05,12:00 PM,CD Numancia,Ciudad Murcia,1:1,11.0,4.0
24811,2017-2018,1,28,3/10/18,6:30 PM,Getafe,Levante,0:1,11.0,17.0


We are also going to add a column which represents the winner of each match using the following notation:

- 0: Home team wins
- 1: Draw
- 2: Away team wins

This is the column we will want our model to predict.
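
The 0/1/2 encoding above can be sketched on a few made-up scores. This is a minimal illustration, not the notebook's own cell: the `scores` series is hypothetical and `np.select` is used here as one possible way to express the three cases.

```python
import numpy as np
import pandas as pd

# Hypothetical score strings in the same "home:away" format as the score column.
scores = pd.Series(["2:1", "0:0", "1:3"])

# Split into home and away goals and take the difference.
goals = scores.str.split(":", expand=True).astype(int)
diff = goals[0] - goals[1]

# 0 = home team wins, 2 = away team wins, 1 = draw (the default case).
labels = np.select([diff > 0, diff < 0], [0, 2], default=1)
print(labels.tolist())  # [0, 1, 2]
```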

In [4]:
#Drop the null values
df = df.dropna()
#Add a column with the winner of the match
df['winner'] = df['score'].str.split(':').str[0].astype(int) - df['score'].str.split(':').str[1].astype(int)
df['winner'] = np.where(df['winner'] > 0, 0, np.where(df['winner'] < 0, 2, 1))
df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner
25148,2018-2019,1,24,2/15/19,9:00 PM,SD Eibar,Getafe,2:2,10.0,5.0,1
24568,2017-2018,1,4,9/15/17,9:00 PM,SD Eibar,CD Leganés,1:0,7.0,7.0,0
24539,2017-2018,1,1,8/18/17,10:15 PM,Valencia,UD Las Palmas,1:0,1.0,14.0,0
46550,2017-2018,2,21,1/6/18,4:00 PM,Sporting Gijón,Córdoba CF,3:2,9.0,20.0,0
44735,2013-2014,2,24,2/1/14,6:15 PM,RM Castilla,Hércules CF,4:0,19.0,17.0,0
45893,2016-2017,2,3,9/3/16,6:00 PM,Getafe,Reus Deportiu,1:1,11.0,5.0,1
22070,2010-2011,1,20,1/22/11,10:00 PM,Valencia,Málaga CF,4:3,4.0,18.0,0
25284,2018-2019,1,37,5/12/19,6:30 PM,Valencia,Alavés,3:1,4.0,10.0,0
43698,2011-2012,2,13,11/13/11,4:00 PM,Villarreal CF B,UD Almería,2:1,16.0,2.0,0
18616,2001-2002,1,16,12/9/01,5:00 PM,Dep. La Coruña,Valencia,1:0,2.0,7.0,0


In order to input the data into the model we are going to assign to each team a number, using a dictionary. We will modify df in order to add a column with the home team number and a column with the away team number. 

In [5]:
#Assign a number to each team
#Get the unique home teams as a list
teams = df['home_team'].unique().tolist()

#Create a dictionary with the teams and their number
teams_dict = {}
for i in range(len(teams)):
    teams_dict[teams[i]] = i

#Create a new column with the number of the home team
df['home_team_num'] = df['home_team'].map(teams_dict)
#Create a new column with the number of the away team
df['away_team_num'] = df['away_team'].map(teams_dict)
df['winner'] = df['winner'].astype(int)
df['home_team_num'] = df['home_team_num'].astype(int)
df['away_team_num'] = df['away_team_num'].astype(int)

df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner,home_team_num,away_team_num
46195,2016-2017,2,30,3/19/17,5:00 PM,CD Tenerife,Reus Deportiu,0:1,4.0,10.0,2,17,77
21089,2007-2008,1,36,5/7/08,8:00 PM,Dep. La Coruña,Levante,1:0,8.0,20.0,0,15,30
42568,2008-2009,2,37,5/16/09,6:00 PM,Hércules CF,CD Tenerife,3:1,4.0,2.0,0,37,17
25901,2020-2021,1,23,2/13/21,6:30 PM,SD Eibar,Real Valladolid,1:1,17.0,17.0,1,40,12
41396,2006-2007,2,14,11/26/06,7:00 PM,Hércules CF,UD Las Palmas,1:1,8.0,18.0,1,37,23
43508,2010-2011,2,38,5/12/11,8:00 PM,Villarreal CF B,Alcorcón,1:4,15.0,10.0,2,66,68
24359,2016-2017,1,21,2/4/17,4:15 PM,Barcelona,Athletic,3:0,2.0,7.0,0,14,13
19955,2004-2005,1,36,5/15/05,5:00 PM,Real Sociedad,Málaga CF,1:3,12.0,12.0,2,4,24
21364,2008-2009,1,25,3/1/09,5:00 PM,CD Numancia,Dep. La Coruña,0:1,20.0,5.0,2,21,15
24380,2016-2017,1,23,2/18/17,4:15 PM,Real Madrid,Espanyol,2:0,1.0,10.0,0,10,5
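
The dictionary above is built from the home_team column only. A team that appeared only as an away team would have no entry and would map to a missing value. A minimal sketch of a variant that collects the names from both columns is shown here, on a hypothetical toy table (`toy` and its team names are illustrative, not taken from the data).

```python
import pandas as pd

# Hypothetical matches; team "C" never plays at home in this toy table.
toy = pd.DataFrame({
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "C", "C"],
})

# Collect team names from both columns, in order of first appearance.
all_teams = pd.unique(pd.concat([toy["home_team"], toy["away_team"]]))
teams_dict = {team: i for i, team in enumerate(all_teams)}

toy["home_team_num"] = toy["home_team"].map(teams_dict)
toy["away_team_num"] = toy["away_team"].map(teams_dict)
print(teams_dict)  # {'A': 0, 'B': 1, 'C': 2}
```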


## Model quality evaluation

Now we will use the constructed model to predict the results of season 2019-2020. The accuracy of this prediction will be the model's measure of quality.

We have decided to train a separate model for each division, i.e. we will have two models: one trained with first division data and one trained with second division data. The first model is used for first division predictions and the second model for second division predictions.
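
The one-model-per-division idea can be sketched with a dictionary keyed by division. This is only an illustration under stated assumptions: the data in `toy` is random and synthetic, the names `FEATURES` and `models` are hypothetical, and the small `n_estimators` is chosen just to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["home_team_num", "away_team_num", "home_team_rank", "away_team_rank"]

# Synthetic stand-in for df with random values (no real matches).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "division": rng.integers(1, 3, n),
    "home_team_num": rng.integers(0, 20, n),
    "away_team_num": rng.integers(0, 20, n),
    "home_team_rank": rng.integers(1, 21, n),
    "away_team_rank": rng.integers(1, 21, n),
    "winner": rng.integers(0, 3, n),
})

# Train one model per division and keep them in a dictionary.
models = {}
for division, group in toy.groupby("division"):
    model = RandomForestRegressor(random_state=100, n_estimators=10)
    model.fit(group[FEATURES], group["winner"])
    models[int(division)] = model

# Predict with the model that matches the division of the matches.
first_division = toy[toy["division"] == 1]
predictions = models[1].predict(first_division[FEATURES])
```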

### Predicting the 2019-2020 results for first division

In [9]:
# We define the training data
train_data = df[(df['season'] < '2019-2020') & (df['division'] == 1)] 

# We select the predictors and the variable we want to predict
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
Y_train = train_data['winner']

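#We create and train the model
#(assumption: same settings as the per-division models defined later in this notebook)
forest_model = RandomForestRegressor(random_state=100, n_estimators=100)
forest_model.fit(X_train, Y_train)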

In [10]:
#Use the model to predict the winner of matchdays of the season 2019-2020 for first division
#Copy the slice so that the prediction column is added to an independent dataframe
season_19_20_first_division = df[(df['season'] == '2019-2020') & (df['division'] == 1)].copy()
test_19_20_first_division = season_19_20_first_division[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
season_19_20_first_division['prediction_win'] = forest_model.predict(test_19_20_first_division).astype(int)
accuracy_19_20_first_division = (season_19_20_first_division['winner'] == season_19_20_first_division['prediction_win']).sum() / len(season_19_20_first_division)
print('The accuracy of the model for predicting the first division results for the 2019-2020 season is: ', accuracy_19_20_first_division)


The accuracy of the model for predicting the first division results for the 2019-2020 season is:  0.4868421052631579
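
One detail of the prediction step is worth spelling out. The models defined later in this notebook are RandomForestRegressor instances, so predict returns continuous values, and `.astype(int)` converts them by dropping the fractional part rather than by rounding. The sketch below uses made-up values (`raw` is hypothetical) to show the difference between truncating and rounding to the nearest label.

```python
import numpy as np

# Hypothetical continuous outputs of a regressor trained on the 0/1/2 labels.
raw = np.array([0.2, 0.6, 1.4, 1.9])

# .astype(int) truncates toward zero.
truncated = raw.astype(int)

# Rounding to the nearest integer label is a possible alternative.
rounded = np.rint(raw).astype(int)

print(truncated.tolist())  # [0, 0, 1, 1]
print(rounded.tolist())    # [0, 1, 1, 2]
```

With truncation, a value is mapped to label 2 only when the regressor outputs exactly 2.0 or more.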


### Predicting the 2019-2020 results for second division

In [11]:
# We define the training data
train_data = df[(df['season'] < '2019-2020') & (df['division'] == 2)] 

# We select the predictors and the variable we want to predict
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
Y_train = train_data['winner']

#We train the model
forest_model.fit(X_train, Y_train)

In [12]:
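#Use the model to predict the winner of matchdays of the season 2019-2020 for second division
#Note: the accuracy recorded below came from a run that filtered on division == 1; re-run this cell to update it
season_19_20_second_division = df[(df['season'] == '2019-2020') & (df['division'] == 2)].copy()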
test_19_20_second_division = season_19_20_second_division[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
season_19_20_second_division['prediction_win'] = forest_model.predict(test_19_20_second_division).astype(int)
accuracy_19_20_second_division = (season_19_20_second_division['winner'] == season_19_20_second_division['prediction_win']).sum() / len(season_19_20_second_division)
print('The accuracy of the model for predicting the second division results for the 2019-2020 season is: ', accuracy_19_20_second_division)

The accuracy of the model for predicting the second division results for the 2019-2020 season is:  0.48947368421052634


The total accuracy of the model is the average of the two accuracies:

In [14]:
print(f"Total model accuracy: {(accuracy_19_20_first_division + accuracy_19_20_second_division)/2}")

Total model accuracy: 0.4881578947368421
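
Two simple checks can help put an accuracy figure in context. The sketch below is hedged and uses made-up numbers throughout: `toy_winner`, `pooled_accuracy` and the match counts are hypothetical. The first check is the accuracy of always predicting the most frequent label; the second is an average of two accuracies weighted by the number of matches in each set, which differs from the plain average when the sets have different sizes.

```python
import pandas as pd

# Hypothetical true labels (0 = home win, 1 = draw, 2 = away win).
toy_winner = pd.Series([0, 0, 1, 2, 0, 1, 0, 2])

# Baseline: accuracy obtained by always predicting the most frequent label.
baseline_accuracy = toy_winner.value_counts(normalize=True).max()
print(baseline_accuracy)  # 0.5

def pooled_accuracy(acc_a, n_a, acc_b, n_b):
    """Average of two accuracies weighted by the number of matches in each set."""
    return (acc_a * n_a + acc_b * n_b) / (n_a + n_b)

# Made-up accuracies and match counts, only to show the call.
pooled = pooled_accuracy(0.5, 100, 0.4, 300)
print(pooled)
```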


With the following function we can predict the results for a given division and matchday of the 2019-2020 season and print them in a clean format:

In [28]:
#forest_model_fd will be the model trained with the first division data and forest_model_sd the model trained with second division data.
forest_model_fd = RandomForestRegressor(random_state=100, n_estimators=100)
train_data = df[(df['season'] < '2019-2020') & (df['division'] == 1)] 
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
Y_train = train_data['winner']
forest_model_fd.fit(X_train, Y_train)
forest_model_sd = RandomForestRegressor(random_state=100, n_estimators=100)
train_data = df[(df['season'] < '2019-2020') & (df['division'] == 2)] 
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
Y_train = train_data['winner']
forest_model_sd.fit(X_train, Y_train)

def predict_matchday(division, matchday):
    season = '2019-2020'
    if division not in (1, 2):
        raise ValueError('Prediction is only possible for first and second division')
    # Copy the slice so that adding the prediction column does not touch df
    results = df[(df['season'] == season) & (df['division'] == division) & (df['matchday'] == matchday)].copy()
    input_data = results[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]

    # Pick the model trained on the requested division
    model = forest_model_fd if division == 1 else forest_model_sd
    results['prediction_win'] = model.predict(input_data).astype(int)

    # Print one line per match: home team, away team and predicted label (0, 1 or 2)
    for index, row in results.iterrows():
        print(f"{row['home_team']}  vs  {row['away_team']} ----> {row['prediction_win']} ")

For example, we can predict the results for first division matchday number 1:

In [29]:
predict_matchday(1, 1)

Athletic  vs  Barcelona ----> 0 
Celta de Vigo  vs  Real Madrid ----> 1 
Valencia  vs  Real Sociedad ----> 0 
RCD Mallorca  vs  SD Eibar ----> 0 
CD Leganés  vs  CA Osasuna ----> 1 
Villarreal  vs  Granada CF ----> 1 
Alavés  vs  Levante ----> 0 
Espanyol  vs  Sevilla FC ----> 1 
Real Betis  vs  Real Valladolid ----> 1 
Atlético Madrid  vs  Getafe ----> 0 
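
The printed labels follow the 0/1/2 notation defined earlier. Since La Quiniela coupons use the signs 1, X and 2, a small translation step could be added. This is a minimal sketch; `QUINIELA_SIGN` and `to_quiniela` are hypothetical names that are not part of the notebook.

```python
# Map the notebook's labels to the usual coupon signs:
# 0 (home team wins) -> "1", 1 (draw) -> "X", 2 (away team wins) -> "2".
QUINIELA_SIGN = {0: "1", 1: "X", 2: "2"}

def to_quiniela(prediction):
    """Translate a 0/1/2 prediction into its 1/X/2 sign."""
    return QUINIELA_SIGN[int(prediction)]

signs = [to_quiniela(p) for p in [0, 1, 2]]
print(signs)  # ['1', 'X', '2']
```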


Finally we save the two models: the one trained on first division data and the one trained on second division data.

In [30]:
#Save the models
#sklearn.externals.joblib is no longer available in scikit-learn, so joblib is imported directly
import joblib

joblib.dump(forest_model_fd, 'model_fd.pkl')
joblib.dump(forest_model_sd, 'model_sd.pkl')
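
A saved model can later be restored with joblib.load. The sketch below is a self-contained round trip under stated assumptions: it imports joblib as a standalone package, trains a tiny model on made-up rows instead of the real data, and writes to a temporary directory rather than next to the notebook.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model trained on made-up rows (four features, labels 0/1/2).
model = RandomForestRegressor(random_state=100, n_estimators=5)
model.fit([[0, 1, 1, 2], [1, 0, 2, 1], [2, 3, 5, 4]], [0, 2, 1])

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_fd.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model gives the same predictions as the original one.
sample = [[0, 1, 1, 2]]
same = bool((restored.predict(sample) == model.predict(sample)).all())
print(same)
```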

TODO: discuss the model results here.