# La Quiniela Machine Learning Analysis

In this notebook we analyze the data and then train a model to predict the results of matches in La Liga, from Spain, using the scikit-learn library. The data source is Transfermarkt, and it was scraped using Python’s BeautifulSoup4 library. The data is provided as a SQLite3 database that is included in the repository. The data set contains the following table:

Matches: All the matches played between seasons 1928-1929 and 2021-2022, with the date and score. Columns are season, division, matchday, date, time, home_team, away_team, score. Keep in mind that many matches have no time information, and that the table also contains matches from the current season that have not been played yet.
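
As a minimal sketch of how such a table can be loaded and its score column parsed, the snippet below builds a tiny in-memory stand-in for the Matches table and splits the score into numeric goals, leaving unplayed matches as missing values. The first two rows mirror rows shown later in this notebook; the third row, with its "Team A" and "Team B" names, is a hypothetical unplayed match added for illustration. The real notebook reads `laliga.sqlite` instead.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for the Matches table (illustrative rows only)
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Matches (season TEXT, division INTEGER, matchday INTEGER, "
    "date TEXT, time TEXT, home_team TEXT, away_team TEXT, score TEXT)"
)
con.executemany(
    "INSERT INTO Matches VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2011-2012", 1, 3, "9/11/11", "4:00 PM", "Racing", "Levante", "0:0"),
        ("1957-1958", 1, 30, "5/4/58", None, "Valencia", "Sevilla FC", "2:2"),
        ("2021-2022", 1, 38, "5/22/22", None, "Team A", "Team B", None),
    ],
)
matches = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()

# Split "home:away" into two columns; unplayed matches (NULL score) stay missing
goals = matches["score"].str.split(":", expand=True)
matches["home_score"] = pd.to_numeric(goals[0])
matches["away_score"] = pd.to_numeric(goals[1])

# Keep only the matches that have a score
played = matches.dropna(subset=["score"])
```

Parsing with `str.split(..., expand=True)` avoids a row-by-row `apply` and tolerates missing scores, so the filtering of unplayed matches can happen before or after the split.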


In [31]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")

# Read the data
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()
# Keep only matches that have been played (unplayed matches have no score)
df.dropna(subset = ['score'], inplace = True)
# Split the "home:away" score string into two integer columns
df["home_score"] = df.apply(lambda x: int(x["score"].split(":")[0]), axis = 1)
df["away_score"] = df.apply(lambda x: int(x["score"].split(":")[1]), axis = 1)
# Winner and loser of each match; draws are marked with the string "NaN"
df["winner"] = df.apply(lambda x : x["home_team"] if(x["home_score"] > x["away_score"]) else (x["away_team"] if(x["home_score"] < x["away_score"]) else "NaN"), axis = 1)
df["loser"] = df.apply(lambda x : x["home_team"] if(x["home_score"] < x["away_score"]) else (x["away_team"] if(x["home_score"] > x["away_score"]) else "NaN"), axis = 1)
# Duplicate every match so that each team gets its own row (once as home, once as away)
df_ht = df.copy()
df_ht['team'] = df['home_team']
df_aw = df.copy()
df_aw['team'] = df['away_team']
df_total = pd.concat([df_ht,df_aw])
df_total = df_total.sort_values(by = ['season', 'division', 'matchday', 'score'])
# Per-row flags for a win (W) and a draw (T) from the point of view of 'team'
df_total['W'] = df_total.apply(lambda x : 1 if x['winner'] == x['team'] else 0, axis = 1)
df_total['T'] = df_total.apply(lambda x : 1 if x['loser'] == 'NaN' else 0, axis = 1)
# Running totals per team within a season; cumsum includes the current match
df_total['W'] = df_total.groupby(['season', 'division', 'team'])['W'].cumsum()
df_total['T'] = df_total.groupby(['season', 'division', 'team'])['T'].cumsum()
# 3 points per win and 1 per draw, then the rank of each team within its matchday
df_total['Pts'] = 3 * df_total['W'] + df_total['T']
df_total['rank'] = df_total.groupby(['division','season','matchday'])['Pts'].rank(method = 'min', ascending=False)
# Restore the original order and copy the ranks back to df, aligned on the index
df_total = df_total.sort_index()
df_total_ht = df_total.loc[df_total['home_team'] == df_total['team']]
df_total_aw = df_total.loc[df_total['away_team'] == df_total['team']]
df['home_team_rank'] = df_total_ht['rank']
df['away_team_rank'] = df_total_aw['rank']
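
Because `cumsum` is inclusive, the ranks computed in this cell reflect the standings after each matchday has been played, so they already contain the result of the match they describe. The following is a minimal sketch, not the notebook's method, of one way to obtain pre-match standings instead: subtract the points earned in the current match from the running total. The toy frame, the team names and the columns `points`, `pts_before` and `rank_before` are assumptions for illustration.

```python
import pandas as pd

# Toy long-format data: one row per team per match (hypothetical teams and results)
long_df = pd.DataFrame({
    "season":   ["2020-2021"] * 8,
    "division": [1] * 8,
    "matchday": [1, 1, 1, 1, 2, 2, 2, 2],
    "team":     ["A", "B", "C", "D", "A", "C", "B", "D"],
    "points":   [3, 0, 1, 1, 3, 0, 1, 1],  # points earned in that match
})
long_df = long_df.sort_values(["season", "division", "matchday"])

# Inclusive running total, as produced by cumsum
keys = ["season", "division", "team"]
pts_after = long_df.groupby(keys)["points"].cumsum()

# Points held before kick-off: running total minus the current match
long_df["pts_before"] = pts_after - long_df["points"]

# Rank within each matchday using only information available before the match
long_df["rank_before"] = (
    long_df.groupby(["season", "division", "matchday"])["pts_before"]
    .rank(method="min", ascending=False)
)
```

With this variant every team is tied in first place on matchday 1, since nobody has points yet, which is what a model would actually know before the first kick-off.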

In [32]:
df.head()
df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_score,away_score,winner,loser,home_team_rank,away_team_rank
22283,2011-2012,1,3,9/11/11,4:00 PM,Racing,Levante,0:0,0,0,,,17.0,13.0
4622,1957-1958,1,30,5/4/58,,Valencia,Sevilla FC,2:2,2,2,,,5.0,11.0
44874,2013-2014,2,36,4/27/14,6:00 PM,Real Jaén CF,CD Tenerife,1:0,1,0,Real Jaén CF,CD Tenerife,18.0,4.0
9071,1974-1975,1,19,2/9/75,,Espanyol,Real Betis,1:0,1,0,Espanyol,Real Betis,3.0,3.0
1848,1945-1946,1,8,11/18/45,,Valencia,Celta de Vigo,5:1,5,1,Valencia,Celta de Vigo,7.0,12.0
28656,1959-1960,2,17,1/10/60,,CD Basconia,Ferrol,5:0,5,0,CD Basconia,Ferrol,10.0,16.0
6727,1966-1967,1,23,3/5/67,,Valencia,Espanyol,2:1,2,1,Valencia,Espanyol,4.0,3.0
44050,2012-2013,2,3,9/2/12,12:00 PM,Racing,Sporting Gijón,0:0,0,0,,,19.0,19.0
17606,1998-1999,1,29,4/11/99,,UD Salamanca,Racing,1:2,1,2,Racing,UD Salamanca,20.0,14.0
2460,1948-1949,1,17,1/23/49,,Celta de Vigo,Sevilla FC,3:1,3,1,Celta de Vigo,Sevilla FC,10.0,12.0


To feed the data to the model, we are going to assign a number to each team and add a column, winner, that encodes which team (home or away) wins the match:

- 0: Home team wins
- 1: Draw
- 2: Away team wins

In [33]:
#Drop rows with any null value (this also drops matches with no time information)
df = df.dropna()
#Goal difference from the home team's point of view
goal_diff = df['home_score'] - df['away_score']
#Remove the helper columns that are no longer needed
df = df.drop(columns = ['winner', 'loser', 'home_score', 'away_score'])
#Add a column with the outcome: 0 = home team wins, 1 = draw, 2 = away team wins
df['winner'] = np.where(goal_diff > 0, 0, np.where(goal_diff < 0, 2, 1))

df.sample(10)


Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner
41912,2007-2008,2,19,1/6/08,5:00 PM,Alavés,Racing Ferrol,1:0,14.0,19.0,0
42053,2007-2008,2,32,4/5/08,6:30 PM,Xerez CD,Cádiz CF,2:1,21.0,14.0,0
48567,2021-2022,2,1,8/16/21,8:00 PM,Málaga CF,CD Mirandés,0:0,8.0,8.0,1
45978,2016-2017,2,11,10/22/16,4:00 PM,Elche CF,Córdoba CF,1:1,7.0,3.0,1
48134,2020-2021,2,39,5/14/21,9:00 PM,Espanyol,FC Cartagena,0:2,1.0,16.0,2
21941,2010-2011,1,7,10/17/10,5:00 PM,Racing,UD Almería,1:0,14.0,17.0,0
18629,2001-2002,1,18,12/22/01,8:45 PM,Dep. La Coruña,Real Betis,2:0,1.0,3.0,0
24415,2016-2017,1,26,3/5/17,6:30 PM,UD Las Palmas,CA Osasuna,5:2,12.0,20.0,0
42140,2007-2008,2,40,5/31/08,6:30 PM,Racing Ferrol,Alavés,1:1,17.0,20.0,1
20514,2006-2007,1,16,12/20/06,7:00 PM,Real Madrid,Recr. Huelva,0:3,3.0,7.0,2


In [34]:
#Assign a number to each team, in order of first appearance as home team
teams = df['home_team'].unique().tolist()

#Create a dictionary that maps each team to its number
teams_dict = {team: i for i, team in enumerate(teams)}

#Create a new column with the number of the home team
df['home_team_num'] = df['home_team'].map(teams_dict)
#Create a new column with the number of the away team
df['away_team_num'] = df['away_team'].map(teams_dict)

#Cast the columns winner, home_team_num and away_team_num to integers
df['winner'] = df['winner'].astype(int)
df['home_team_num'] = df['home_team_num'].astype(int)
df['away_team_num'] = df['away_team_num'].astype(int)


df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner,home_team_num,away_team_num
23653,2014-2015,1,26,3/8/15,12:00 PM,Barcelona,Rayo Vallecano,6:1,1.0,12.0,0,14,25
21886,2010-2011,1,1,8/29/10,9:00 PM,RCD Mallorca,Real Madrid,0:0,8.0,8.0,1,20,10
23580,2014-2015,1,19,1/17/15,6:00 PM,Valencia,UD Almería,3:2,5.0,18.0,0,2,34
20742,2007-2008,1,1,8/26/07,7:00 PM,RCD Mallorca,Levante,3:0,1.0,14.0,0,20,30
20108,2005-2006,1,14,12/3/05,7:00 PM,Celta de Vigo,Real Betis,2:1,3.0,18.0,0,3,18
46141,2016-2017,2,25,2/12/17,6:00 PM,CD Numancia,Reus Deportiu,1:0,7.0,10.0,0,21,77
42917,2009-2010,2,26,2/28/10,12:00 PM,FC Cartagena,Real Betis,1:2,3.0,6.0,2,67,18
48122,2020-2021,2,37,5/3/21,9:00 PM,Albacete,Alcorcón,0:1,22.0,16.0,2,28,68
43456,2010-2011,2,33,4/10/11,5:00 PM,CD Numancia,Villarreal CF B,2:1,11.0,13.0,0,21,66
18707,2001-2002,1,25,2/10/02,8:00 PM,Real Betis,Athletic,1:1,6.0,6.0,1,18,13
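
Integer team numbers give the model an arbitrary ordering of the teams: the number only reflects the order in which a team first appears. One common alternative, shown here only as a sketch on made-up rows, is one-hot encoding with `pd.get_dummies`. The `home` and `away` prefixes are illustrative choices, not something the notebook uses.

```python
import pandas as pd

# Made-up sample of matches (team names taken from the tables in this notebook)
sample = pd.DataFrame({
    "home_team": ["Valencia", "Barcelona", "Real Betis"],
    "away_team": ["Barcelona", "Real Betis", "Valencia"],
})

# One indicator column per team and side instead of a single integer code
encoded = pd.get_dummies(
    sample,
    columns=["home_team", "away_team"],
    prefix=["home", "away"],
    dtype=int,
)
```

Each row then has exactly two ones, one marking the home team and one marking the away team. The trade-off is a much wider feature matrix when there are many teams.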


Now we have numerical data to train the model. We are going to use a random forest to predict the outcome of each match, training it on previous seasons. The code below uses scikit-learn's RandomForestRegressor and converts its continuous output to an integer label.

First we select the data to train the model.

We are going to train on the seasons from 1985 to 2015. Note that the filter in the code only sets the upper bound (seasons before 2015-2016). The goal is to build a quiniela model for the seasons from 2016 to 2021.
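
Since the stated goal is to predict later seasons, one option is to hold out whole seasons rather than a random subset of matches. The sketch below assumes season labels in the `YYYY-YYYY` form used in the table, which compare correctly as strings. The toy frame and the cut-off variable `first_test_season` are illustrative assumptions.

```python
import pandas as pd

# Toy data: one row per match, with hypothetical feature and label values
toy = pd.DataFrame({
    "season": ["2013-2014", "2014-2015", "2015-2016", "2016-2017", "2020-2021"],
    "home_team_rank": [1.0, 5.0, 3.0, 8.0, 2.0],
    "away_team_rank": [4.0, 2.0, 9.0, 1.0, 7.0],
    "winner": [0, 2, 0, 1, 0],
})

first_test_season = "2016-2017"  # assumed start of the evaluation period

# Fixed-width labels sort lexicographically in chronological order
train = toy[toy["season"] < "2015-2016"]
holdout = toy[toy["season"] >= first_test_season]
```

Evaluating on `holdout` measures how the model does on seasons it has never seen, which is closer to how a quiniela prediction would be used than a random split of past matches.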

In [35]:
from sklearn.model_selection import train_test_split

#We use the seasons before 2015-2016 for training (season labels compare correctly as strings)
all_train_data = df[df['season'] < '2015-2016']

#Randomly split this data: 80% for training and 20% for testing
train_data, test_data = train_test_split(all_train_data, test_size=0.2, random_state=42)


In [36]:
#Define the model (a regressor, so its raw predictions are continuous values rather than class labels)
forest_model = RandomForestRegressor(random_state=100, n_estimators=100)
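
Since the target is one of three classes, a classifier is a natural alternative to the regressor defined here. As a hedged sketch on synthetic data (the features and labels below are random placeholders, not La Liga data), `RandomForestClassifier` returns class labels directly and also exposes per-class probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: two team ids and two ranks per match
X = np.column_stack([
    rng.integers(0, 20, size=200),   # home_team_num
    rng.integers(0, 20, size=200),   # away_team_num
    rng.integers(1, 21, size=200),   # home_team_rank
    rng.integers(1, 21, size=200),   # away_team_rank
])
y = rng.integers(0, 3, size=200)     # 0 = home win, 1 = draw, 2 = away win

clf = RandomForestClassifier(n_estimators=100, random_state=100)
clf.fit(X, y)

labels = clf.predict(X[:5])          # integer class labels, no conversion needed
probs = clf.predict_proba(X[:5])     # one probability per class for each match
```

The probabilities can be useful for a quiniela, where knowing how uncertain a match is may matter as much as the single most likely outcome.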

In [37]:
#Select the train data variables of the model
X_train = train_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
y_train = train_data['winner']

#Select the test data variables
X_test = test_data[['home_team_num', 'away_team_num', 'home_team_rank', 'away_team_rank']]
y_test = test_data['winner']

In [38]:
#Train the model
forest_model.fit(X_train, y_train)

In [39]:
#Predict the winner of the matches
y_pred = forest_model.predict(X_test)

#Convert the predictions to integers (astype(int) truncates, so a prediction of 1.9 becomes 1)
y_pred = y_pred.astype(int)

#Add the predictions next to the real values from test data
test_data['prediction_win'] = y_pred

test_data.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,home_team_rank,away_team_rank,winner,home_team_num,away_team_num,prediction_win
39141,2001-2002,2,19,12/16/01,5:00 PM,Racing Ferrol,CD Numancia,2:1,7.0,16.0,0,47,21,0
23511,2014-2015,1,12,11/22/14,8:00 PM,Barcelona,Sevilla FC,5:1,2.0,5.0,0,14,26,0
23552,2014-2015,1,16,12/20/14,10:00 PM,Rayo Vallecano,Espanyol,1:3,12.0,8.0,2,25,5,1
23535,2014-2015,1,14,12/7/14,9:00 PM,Granada CF,Valencia,1:1,16.0,5.0,1,38,2,1
42523,2008-2009,2,32,4/12/09,6:00 PM,Alicante CF,UD Salamanca,0:0,21.0,7.0,1,64,6,1
23572,2014-2015,1,18,1/10/15,10:00 PM,SD Eibar,Getafe,2:1,8.0,14.0,0,40,31,0
21830,2009-2010,1,34,4/24/10,10:00 PM,Valencia,Dep. La Coruña,1:0,3.0,9.0,0,2,15,1
20627,2006-2007,1,27,3/18/07,9:00 PM,Sevilla FC,Celta de Vigo,2:0,1.0,17.0,0,26,3,0
22573,2011-2012,1,32,4/8/12,12:00 PM,Levante,Atlético Madrid,2:0,5.0,6.0,0,30,11,0
44024,2012-2013,2,1,8/18/12,7:00 PM,CD Lugo,Hércules CF,1:0,1.0,15.0,0,74,37,0


In [40]:
#Calculate the accuracy of the model
accuracy = (test_data['winner'] == test_data['prediction_win']).sum() / len(test_data)
print('The accuracy of the model is: ', accuracy)

The accuracy of the model is:  0.46827794561933533
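
To put an accuracy figure in context, it helps to compare it with a trivial baseline, such as always predicting a home win, and to look at the confusion matrix. The arrays below are made-up labels for illustration, not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true outcomes and predictions (0 = home win, 1 = draw, 2 = away win)
y_true = np.array([0, 0, 1, 2, 0, 1, 2, 0])
y_hat = np.array([0, 1, 1, 1, 0, 0, 2, 0])

# Fraction of matches predicted correctly
model_acc = accuracy_score(y_true, y_hat)

# Baseline: always predict a home win
baseline_acc = accuracy_score(y_true, np.zeros_like(y_true))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])
```

A model is only informative if `model_acc` clearly exceeds `baseline_acc`, and the rows of `cm` show which outcomes (for example away wins) the model rarely predicts.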


In [41]:
#Save the model
#joblib is imported directly: sklearn.externals.joblib is not available in recent scikit-learn versions
import joblib

joblib.dump(forest_model, 'model.pkl')
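
A saved model can be read back with `joblib.load`. The sketch below, using a small placeholder model and a temporary directory (both assumptions for illustration), checks that the reloaded model gives the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small placeholder model trained on random data
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = rng.integers(0, 3, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=100).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)        # write the fitted model to disk
    restored = joblib.load(path)    # read it back into memory

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
```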

In [None]:
#Use the model to predict the winner of the matchdays of the season 2019-2020

