# La Quiniela Machine Learning Analysis

In this notebook we analyze the data and then train a model to predict the results of matches in La Liga, the Spanish football league, using the scikit-learn library. The data source is Transfermarkt, and it was scraped using Python’s BeautifulSoup4 library. The data is provided as a SQLite3 database inside the repository. The data set contains the following table:

Matches: All the matches played between seasons 1928-1929 and 2021-2022, with the date and score. Columns are season, division, matchday, date, time, home_team, away_team, score. Keep in mind that many matches have no time information, and that the table also contains matches from the current season that have not been played yet.
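Before modeling, it can help to quantify those gaps. The sketch below is only an illustration: it builds a three-row stand-in DataFrame with the same columns as the Matches table (nothing is read from the real database) and assumes that missing fields load as nulls. The names `matches`, `missing_per_column` and `unplayed` are made up for this example.

```python
import pandas as pd

# Stand-in rows with the same columns as the Matches table.
# None marks a missing field (assumption: missing values load as nulls).
matches = pd.DataFrame(
    {
        "season": ["2013-2014", "1985-1986", "2021-2022"],
        "division": [1, 1, 1],
        "matchday": [37, 33, 16],
        "date": ["5/11/14", "4/13/86", "12/5/21"],
        "time": ["7:00 PM", None, None],
        "home_team": ["Elche CF", "Athletic", "Celta de Vigo"],
        "away_team": ["Barcelona", "Real Zaragoza", "Valencia"],
        "score": ["0:0", "1:1", None],
    }
)

# Number of missing values in each column
missing_per_column = matches.isna().sum()
print(missing_per_column)

# Matches that have not been played yet have no score
unplayed = matches[matches["score"].isna()]
print(len(unplayed))
```

Running the same two checks on the real `df` would show how many rows a blanket `dropna()` is going to discard.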


In [1]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")

# Read the data
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()

In [2]:
df.head()
df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score
23387,2013-2014,1,37,5/11/14,7:00 PM,Elche CF,Barcelona,0:0
12564,1985-1986,1,33,4/13/86,,Athletic,Real Zaragoza,1:1
3810,1954-1955,1,19,1/16/55,,Barcelona,Celta de Vigo,5:2
19092,2002-2003,1,26,3/16/03,5:00 PM,Espanyol,Málaga CF,2:1
30039,1967-1968,2,10,11/26/67,,Rayo Vallecano,CF Badalona,1:0
48330,2021-2022,1,16,12/5/21,,Celta de Vigo,Valencia,
33897,1988-1989,2,23,2/26/89,,CD Tenerife,Dep. La Coruña,0:3
2757,1950-1951,1,7,10/22/50,,Real Murcia,Celta de Vigo,1:2
27535,1955-1956,2,22,2/26/56,,Sestao Sport,Baracaldo,0:0
9935,1977-1978,1,13,12/11/77,,Rayo Vallecano,Valencia,3:0


To feed the data to the model, we assign each team a number and add a column, winner, that encodes which team (home or away) won the match:

- 0: Home team wins
- 1: Draw
- 2: Away team wins
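As a worked example of this encoding, here is a minimal pure-Python sketch. The helper `encode_result` is hypothetical and is not part of the notebook; it only shows how a score string maps to the three labels.

```python
def encode_result(score: str) -> int:
    """Map a 'home:away' score string to 0 (home win), 1 (draw) or 2 (away win)."""
    home_goals, away_goals = (int(part) for part in score.split(":"))
    if home_goals > away_goals:
        return 0
    if home_goals < away_goals:
        return 2
    return 1


# A home win, a draw and an away win
for score in ["5:2", "1:1", "0:3"]:
    print(score, "->", encode_result(score))
```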

In [3]:
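#Drop rows with any null value (unplayed matches without a score, and also matches without a time)
df = df.dropna()
#Split the score once into home goals (column 0) and away goals (column 1)
goals = df['score'].str.split(':', expand=True).astype(int)
#Goal difference: positive means the home team won, negative means the away team won
df['winner'] = goals[0] - goals[1]
#Encode the result: 0 = home win, 1 = draw, 2 = away win
df['winner'] = np.where(df['winner'] > 0, 0, np.where(df['winner'] < 0, 2, 1))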

df.sample(10)


Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner
21268,2008-2009,1,16,12/20/08,8:00 PM,Real Madrid,Valencia,1:0,0
39649,2002-2003,2,23,2/16/03,5:00 PM,Badajoz 1905,Albacete,0:3,2
24533,2016-2017,1,38,5/21/17,4:45 PM,Atlético Madrid,Athletic,3:1,0
42338,2008-2009,2,16,12/13/08,6:30 PM,SD Huesca,Real Murcia,1:0,0
22928,2012-2013,1,30,4/5/13,9:00 PM,Granada CF,Real Betis,1:5,2
19817,2004-2005,1,22,2/6/05,9:00 PM,Barcelona,Atlético Madrid,0:2,2
20415,2006-2007,1,6,10/15/06,5:00 PM,Villarreal,Espanyol,0:0,1
23489,2014-2015,1,10,11/1/14,4:00 PM,Granada CF,Real Madrid,0:4,2
25394,2019-2020,1,10,10/27/19,4:00 PM,Levante,Espanyol,0:1,2
45812,2015-2016,2,37,5/8/16,8:00 PM,CD Mirandés,SD Huesca,1:0,0


In [4]:
#Assign a number to each team
teams = df['home_team'].unique().tolist()

#Create a dictionary that maps each team name to its number
teams_dict = {team: i for i, team in enumerate(teams)}

#Create a new column with the number of the home team
df['home_team_num'] = df['home_team'].map(teams_dict)
#Create a new column with the number of the away team
df['away_team_num'] = df['away_team'].map(teams_dict)

#Cast the columns winner, home_team_num and away_team_num to integers
df['winner'] = df['winner'].astype(int)
df['home_team_num'] = df['home_team_num'].astype(int)
df['away_team_num'] = df['away_team_num'].astype(int)


df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner,home_team_num,away_team_num
41270,2006-2007,2,3,9/9/06,6:30 PM,Poli Ejido,Xerez CD,0:2,2,45,36
24194,2016-2017,1,4,9/18/16,4:15 PM,Athletic,Valencia,2:1,0,13,2
25276,2018-2019,1,36,5/5/19,6:30 PM,Real Valladolid,Athletic,1:0,0,12,13
42840,2009-2010,2,19,1/10/10,5:00 PM,Girona,UD Las Palmas,0:2,2,43,23
44418,2012-2013,2,37,5/4/13,6:00 PM,UD Las Palmas,Sporting Gijón,4:2,0,23,35
43392,2010-2011,2,27,3/3/11,8:00 PM,Ponferradina,Real Betis,1:1,1,61,18
43733,2011-2012,2,16,12/4/11,8:30 PM,Real Valladolid,Dep. La Coruña,0:0,1,12,15
43114,2010-2011,2,2,9/5/10,5:00 PM,CD Numancia,Celta de Vigo,1:3,2,21,3
47902,2020-2021,2,17,12/7/20,9:00 PM,RCD Mallorca,CD Castellón,3:1,0,20,58
20789,2007-2008,1,6,9/29/07,10:00 PM,Levante,Barcelona,1:4,2,30,14
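The mapping in this cell is built from home_team only. If a team ever appeared only as an away team, `map` would return NaN for it and the integer cast would fail. A minimal sketch of a more defensive variant, on illustrative fixtures rather than the real data (`fixtures`, `all_teams` and `team_ids` are names invented for this example):

```python
import pandas as pd

# Illustrative fixtures; "CF Badalona" appears only as an away team
fixtures = pd.DataFrame(
    {
        "home_team": ["Barcelona", "Rayo Vallecano", "Valencia"],
        "away_team": ["Valencia", "CF Badalona", "Barcelona"],
    }
)

# Collect every team name from both columns, in order of first appearance
all_teams = pd.unique(pd.concat([fixtures["home_team"], fixtures["away_team"]]))
team_ids = {team: idx for idx, team in enumerate(all_teams)}

# Both columns can now be mapped without producing NaN
fixtures["home_team_num"] = fixtures["home_team"].map(team_ids)
fixtures["away_team_num"] = fixtures["away_team"].map(team_ids)
print(fixtures)
```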


Now we have numerical data to train the model. We are going to use a random forest to predict the winner of each match. Note that the code below instantiates scikit-learn's RandomForestRegressor and truncates its output to an integer label, rather than using a classifier.

We train on the seasons before 2019-2020 that remain after dropping the null values, holding out a random 20% of those matches for testing.

The 2019-2020 season is then used as a final check, to see how the quiniela model performs on a season it has not seen.
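Season labels have the fixed-width form YYYY-YYYY, so plain string comparison orders them chronologically. That is what makes a filter such as `df['season'] < '2019-2020'` work. A small illustration on a hypothetical list of seasons (the names `seasons`, `cutoff`, `train_seasons` and `held_out` are only for this example):

```python
# Hypothetical season labels in the same format as the 'season' column
seasons = ["2017-2018", "2018-2019", "2019-2020", "2020-2021"]
cutoff = "2019-2020"

# Lexicographic comparison matches chronological order for fixed-width labels
train_seasons = [s for s in seasons if s < cutoff]
held_out = [s for s in seasons if s == cutoff]

print(train_seasons)
print(held_out)
```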

In [5]:
from sklearn.model_selection import train_test_split

#Use every season before 2019-2020 (season strings sort chronologically)
all_train_data = df[df['season'] < '2019-2020']

#We split this data for training and testing
train_data, test_data = train_test_split(all_train_data, test_size=0.2, random_state=42)


In [6]:
#Define the model
forest_model = RandomForestRegressor(random_state=100, n_estimators=100)
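Because the target takes three discrete values (0, 1, 2), a classifier is a natural alternative to the regressor defined here. The following is a minimal sketch on synthetic data, not a replacement for the notebook's results: the team ids and outcomes are random, and `X_demo`, `y_demo` and `clf` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: random team ids and random outcomes (not real matches)
X_demo = rng.integers(0, 20, size=(200, 2))
y_demo = rng.integers(0, 3, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=100)
clf.fit(X_demo, y_demo)

# predict returns class labels directly, so no integer cast is needed
pred = clf.predict(X_demo[:5])
# predict_proba returns one probability per class for each row
proba = clf.predict_proba(X_demo[:5])

print(pred)
print(proba.shape)
```

A classifier also exposes per-class probabilities, which may be useful when a quiniela entry allows covering more than one outcome.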

In [7]:
#Select the train data variables of the model
X_train = train_data[['home_team_num', 'away_team_num']]
y_train = train_data['winner']

#Select the test data variables
X_test = test_data[['home_team_num', 'away_team_num']]
y_test = test_data['winner']

In [8]:
#Train the model
forest_model.fit(X_train, y_train)

In [9]:
#Predict the winner of the matches
y_pred = forest_model.predict(X_test)

#Convert the predictions to integers (astype(int) truncates toward zero, so 1.9 becomes 1)
y_pred = y_pred.astype(int)

#Add the predictions next to the real values from test data
test_data['prediction_win'] = y_pred

test_data.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner,home_team_num,away_team_num,prediction_win
41241,2005-2006,2,42,6/18/06,6:00 PM,Hércules CF,Recr. Huelva,0:2,2,37,27,1
44603,2013-2014,2,12,11/2/13,6:00 PM,SD Eibar,Real Zaragoza,3:2,0,40,8,1
22492,2011-2012,1,24,2/19/12,4:00 PM,Athletic,Málaga CF,3:0,0,13,24,0
24729,2017-2018,1,20,1/20/18,1:00 PM,Espanyol,Sevilla FC,0:3,2,5,26,0
25131,2018-2019,1,22,2/2/19,6:30 PM,Barcelona,Valencia,2:2,1,14,2,0
47195,2018-2019,2,37,5/5/19,6:00 PM,Albacete,CD Numancia,0:0,1,28,21,1
18710,2001-2002,1,26,2/16/02,9:30 PM,Barcelona,Dep. La Coruña,3:2,0,14,15,0
39945,2003-2004,2,8,10/19/03,5:00 PM,UD Salamanca,Málaga B,1:1,1,6,53,0
22709,2012-2013,1,8,10/20/12,6:00 PM,Real Madrid,Celta de Vigo,2:0,0,10,3,0
46987,2018-2019,2,18,12/16/18,6:00 PM,UD Las Palmas,CD Tenerife,1:1,1,23,17,0
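The integer cast above truncates rather than rounds, which matters for a regressor whose outputs lie between 0 and 2: a prediction has to be exactly 2.0 to be labelled as an away win. A small sketch with made-up raw predictions (`raw`, `truncated` and `rounded` are illustrative names) shows the difference from rounding to the nearest label:

```python
import numpy as np

# Made-up regressor outputs, for illustration only
raw = np.array([0.2, 0.9, 1.5, 1.99])

# astype(int) drops the fractional part
truncated = raw.astype(int)

# np.rint rounds to the nearest integer (ties go to the even integer)
rounded = np.rint(raw).astype(int)

print(truncated)
print(rounded)
```

Whether rounding would improve the accuracy reported in this notebook is not something this sketch can tell; it only shows how the two conversions differ.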


In [10]:
#Calculate the accuracy of the model
accuracy = (test_data['winner'] == test_data['prediction_win']).sum() / len(test_data)
print('The accuracy of the model is: ', accuracy)

The accuracy of the model is:  0.4095620193915079
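To put an accuracy figure in context, it can be compared with a trivial baseline that always predicts the most frequent outcome. The sketch below uses hypothetical labels and predictions, not the notebook's data, together with scikit-learn's `accuracy_score`; `y_true`, `y_model`, `model_acc` and `baseline_acc` are illustrative names.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions, for illustration only
y_true = np.array([0, 0, 1, 2, 0, 1, 2, 0])
y_model = np.array([0, 1, 1, 0, 0, 0, 2, 0])

# Fraction of matches predicted correctly
model_acc = accuracy_score(y_true, y_model)

# Baseline: always predict the most frequent class
majority_class = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority_class))

print(model_acc, baseline_acc)
```

In practice the majority class would be taken from the training labels (`y_train`) and scored against `y_test`.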


In [66]:
#Save the model
#sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib

joblib.dump(forest_model, 'model.pkl')

In [11]:
#Use the model to predict the winner of the matches of the 2019-2020 season
#(.copy() lets us add a column without modifying a view of df)
season_19_20 = df[df['season'] == '2019-2020'].copy()
test_19_20 = season_19_20[['home_team_num', 'away_team_num']]
season_19_20['prediction_win'] = forest_model.predict(test_19_20).astype(int)
accuracy_19_20 = (season_19_20['winner'] == season_19_20['prediction_win']).sum() / len(season_19_20)
print('The accuracy of the model for predicting the 2019-2020 season is: ', accuracy_19_20)



The accuracy of the model for predicting the 2019-2020 season is:  0.40617577197149646
