# La Quiniela Machine Learning Analysis

In this notebook we are going to analyze the data and then train a model to predict the results of matches in La Liga, the Spanish football league. We are going to do it with the scikit-learn library. The data source is Transfermarkt, and it was scraped using Python's BeautifulSoup4 library. The data is provided as a SQLite3 database that is inside the repository. This data set contains the following table:

Matches: All the matches played between seasons 1928-1929 and 2021-2022 with the date and score. Columns are season, division, matchday, date, time, home_team, away_team, score. Keep in mind that there is no time information for many of them, and that the table also contains matches from the current season that have not been played yet.


In [2]:
import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings("ignore")

# Read the data
con = sqlite3.connect("laliga.sqlite")
df = pd.read_sql_query("SELECT * FROM Matches", con)
con.close()

In [3]:
df.head()
df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score
2659,1949-1950,1,20,2/12/50,,Real Oviedo,Dep. La Coruña,0:2
47837,2020-2021,2,12,11/13/20,9:00 PM,Real Zaragoza,Real Oviedo,1:2
32127,1980-1981,2,2,9/14/80,,Levante,Elche CF,2:1
8765,1973-1974,1,19,1/20/74,,Elche CF,CD Málaga,0:1
2102,1946-1947,1,18,2/2/47,,Valencia,Dep. La Coruña,3:0
27756,1956-1957,2,16,12/30/56,,Avilés Ind.,Indauchu,0:1
41293,2006-2007,2,5,9/23/06,6:30 PM,Hércules CF,Real Murcia,0:1
27489,1955-1956,2,17,1/22/56,,Ferrol,CE Sabadell,1:1
39613,2002-2003,2,20,1/26/03,5:00 PM,SD Compostela,Levante,3:1
44342,2012-2013,2,30,3/16/13,8:00 PM,SD Huesca,Villarreal,0:1
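
Before cleaning, it can help to see how many rows have missing values. The sketch below uses a small hypothetical DataFrame (the values are illustrative, not taken from the database) to count the missing entries per column and to show how many rows a plain `dropna()` would keep.

```python
import pandas as pd

# Hypothetical miniature of the Matches table (values are illustrative only)
toy = pd.DataFrame({
    "season": ["1949-1950", "2020-2021", "2021-2022"],
    "time": [None, "9:00 PM", "6:30 PM"],
    "score": ["0:2", "1:2", None],  # None stands in for a match not played yet
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that survive a plain dropna(): every column must be present
complete_rows = toy.dropna()
print(len(complete_rows), "of", len(toy), "rows are complete")
```

Because `dropna()` removes a row when any column is missing, old matches without a kick-off time are dropped together with the unplayed ones.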


To feed the data to the model, we are going to assign a number to each team and add a column that records which team (home or away) wins the match:

- 0: Home team wins
- 1: Draw
- 2: Away team wins
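
As a minimal sketch of this encoding, the snippet below derives the label from a few illustrative scores. It splits the score string once and uses `np.select`, which is one possible way to write it and an assumption of this example rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Illustrative scores in the same "home:away" format as the score column
toy = pd.DataFrame({"score": ["2:1", "0:0", "1:3"]})

# Split once into two integer columns: home goals and away goals
goals = toy["score"].str.split(":", expand=True).astype(int)
goal_diff = goals[0] - goals[1]

# 0 = home win, 1 = draw, 2 = away win
toy["winner"] = np.select([goal_diff > 0, goal_diff < 0], [0, 2], default=1)
print(toy)
```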

In [4]:
#Drop the null values
df = df.dropna()
#Add a column with the winner of the match
df['winner'] = df['score'].str.split(':').str[0].astype(int) - df['score'].str.split(':').str[1].astype(int)
df['winner'] = np.where(df['winner'] > 0, 0, np.where(df['winner'] < 0, 2, 1))

df.sample(10)


Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner
45061,2014-2015,2,11,11/2/14,5:00 PM,Real Valladolid,Girona,2:1,0
24805,2017-2018,1,27,3/4/18,6:30 PM,Real Sociedad,Alavés,2:1,0
23862,2015-2016,1,9,10/24/15,10:05 PM,Málaga CF,Dep. La Coruña,2:0,0
24525,2016-2017,1,37,5/14/17,8:00 PM,Villarreal,Dep. La Coruña,0:0,1
42227,2008-2009,2,6,10/4/08,5:30 PM,Sevilla Atl.,Real Sociedad,1:0,0
40209,2003-2004,2,32,4/11/04,5:00 PM,Elche CF,SD Eibar,2:0,0
20148,2005-2006,1,18,1/7/06,9:00 PM,Athletic,Dep. La Coruña,1:2,2
19104,2002-2003,1,27,3/23/03,5:00 PM,Málaga CF,Real Betis,0:0,1
43139,2010-2011,2,4,9/19/10,5:00 PM,Recr. Huelva,Albacete,0:0,1
23078,2013-2014,1,7,9/27/13,9:00 PM,Real Valladolid,Málaga CF,2:2,1


In [7]:
#Assign a number to each team
teams = df['home_team'].unique().tolist()

#Create a dictionary with the teams and their number
teams_dict = {team: i for i, team in enumerate(teams)}

#Create a new column with the number of the home team
df['home_team_num'] = df['home_team'].map(teams_dict)
#Create a new column with the number of the away team
df['away_team_num'] = df['away_team'].map(teams_dict)

#Cast the columns winner, home_team_num and away_team_num to integers
df['winner'] = df['winner'].astype(int)
df['home_team_num'] = df['home_team_num'].astype(int)
df['away_team_num'] = df['away_team_num'].astype(int)


df.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner,home_team_num,away_team_num
44299,2012-2013,2,26,2/17/13,12:00 PM,Recr. Huelva,Racing,1:0,0,27,9
45186,2014-2015,2,23,1/31/15,6:00 PM,CE Sabadell,Barcelona B,1:4,2,70,69
20080,2005-2006,1,11,10/26/05,8:00 PM,Real Betis,Villarreal,2:3,2,18,16
25035,2018-2019,1,12,11/11/18,6:30 PM,Rayo Vallecano,Villarreal,2:2,1,25,16
20235,2005-2006,1,26,3/5/06,7:00 PM,Málaga CF,Valencia,0:0,1,24,2
21807,2009-2010,1,31,4/11/10,9:00 PM,RCD Mallorca,Valencia,3:2,0,20,2
24165,2016-2017,1,1,8/21/16,10:15 PM,Atlético Madrid,Alavés,1:1,1,11,1
20847,2007-2008,1,11,11/4/07,9:00 PM,Athletic,Recr. Huelva,2:0,0,13,27
39493,2002-2003,2,9,11/2/02,10:00 PM,UD Las Palmas,Real Zaragoza,0:1,2,23,8
19275,2003-2004,1,6,10/5/03,5:00 PM,Real Murcia,Albacete,1:0,0,29,28
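
The mapping above is built from the home teams only. If a team ever appeared only as an away team, its number would be missing and the integer cast would fail. A possible safeguard, sketched here on illustrative data, is to build the dictionary from both columns. The names `all_teams` and `team_ids` are hypothetical.

```python
import pandas as pd

# Illustrative fixtures; "Levante" only appears as an away team here
toy = pd.DataFrame({
    "home_team": ["Valencia", "Athletic", "Valencia"],
    "away_team": ["Athletic", "Levante", "Levante"],
})

# Build the mapping from every team name seen in either column
all_teams = pd.unique(toy[["home_team", "away_team"]].to_numpy().ravel())
team_ids = {team: i for i, team in enumerate(all_teams)}

toy["home_team_num"] = toy["home_team"].map(team_ids)
toy["away_team_num"] = toy["away_team"].map(team_ids)
print(toy)
```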


Now we have numerical data to train the model. We are going to use a random forest from scikit-learn to predict the winner of each match (the code below uses `RandomForestRegressor` and converts its output to the integer labels above).

We are going to train on the seasons from 1985 to 2015, so the goal is a quiniela model for the seasons from 2016 to 2021. Note that the filter below only applies the upper bound, `season < '2015-2016'`.

In [44]:
from sklearn.model_selection import train_test_split

#We will use the seasons before 2015-2016 for training
all_train_data = df[df['season'] < '2015-2016']

#We split this data for training and testing
train_data, test_data = train_test_split(all_train_data, test_size=0.2, random_state=42)
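
The season filter relies on string comparison. A small sketch on illustrative season labels shows why this works: the labels have a fixed `YYYY-YYYY` width, so they sort in chronological order.

```python
import pandas as pd

# Illustrative season labels in the same format as the season column
toy = pd.DataFrame({"season": ["1999-2000", "2014-2015", "2015-2016", "2020-2021"]})

# Fixed-width "YYYY-YYYY" strings compare in chronological order
before = toy[toy["season"] < "2015-2016"]
later = toy[toy["season"] >= "2015-2016"]

print(before["season"].tolist())
print(later["season"].tolist())
```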


In [47]:
#Define the model
forest_model = RandomForestRegressor(random_state=100, n_estimators=100)
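
Since the target is a category (0, 1 or 2), a classifier is a natural alternative to the regressor. The following is only a sketch on synthetic data, not the model used in this notebook: `RandomForestClassifier` returns class labels directly, and `predict_proba` gives one probability per class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for (home_team_num, away_team_num) and the 0/1/2 label
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 10, size=(200, 2))
y_toy = rng.integers(0, 3, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=100)
clf.fit(X_toy, y_toy)

# predict returns class labels; predict_proba returns per-class probabilities
labels = clf.predict(X_toy[:5])
probs = clf.predict_proba(X_toy[:5])
print(labels)
print(probs.round(2))
```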

In [50]:
#Select the train data variables of the model
X_train = train_data[['home_team_num', 'away_team_num']]
y_train = train_data['winner']

#Select the test data variables
X_test = test_data[['home_team_num', 'away_team_num']]
y_test = test_data['winner']

In [51]:
#Train the model
forest_model.fit(X_train, y_train)

In [61]:
#Predict the winner of the matches
y_pred = forest_model.predict(X_test)

#Convert the predictions to integers (astype(int) truncates, so 1.9 becomes 1)
y_pred = y_pred.astype(int)

#Add the predictions next to the real values from test data
test_data['prediction_win'] = y_pred

test_data.sample(10)

Unnamed: 0,season,division,matchday,date,time,home_team,away_team,score,winner,home_team_num,away_team_num,prediction_win
43718,2011-2012,2,15,11/26/11,6:00 PM,Villarreal CF B,CD Numancia,0:0,1,66,21,0
20717,2006-2007,1,36,5/27/07,9:00 PM,Sevilla FC,Real Zaragoza,3:1,0,26,8,0
18922,2002-2003,1,9,11/9/02,9:30 PM,Valencia,Real Betis,1:1,1,2,18,0
41616,2006-2007,2,34,4/22/07,6:00 PM,Hércules CF,Sporting Gijón,1:1,1,37,35,1
21313,2008-2009,1,20,1/25/09,5:00 PM,Recr. Huelva,Real Betis,1:0,0,27,18,0
40956,2005-2006,2,16,12/11/05,5:00 PM,Sporting Gijón,Lorca Dep.,1:0,0,35,57,1
39607,2002-2003,2,19,1/19/03,5:00 PM,Elche CF,Badajoz 1905,0:1,2,39,46,1
18150,2000-2001,1,8,11/1/00,5:00 PM,Athletic,Real Zaragoza,1:2,2,13,8,0
23234,2013-2014,1,22,2/2/14,5:00 PM,Real Betis,Espanyol,2:0,0,18,5,0
21166,2008-2009,1,5,9/28/08,7:00 PM,Valencia,Dep. La Coruña,4:2,0,2,15,0
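
The integer conversion matters for these predictions. A regressor returns values between 0 and 2, and `astype(int)` drops the fractional part, so an away win (2) is only predicted when the output is exactly 2.0. The sketch below compares truncation with rounding on hypothetical outputs. Rounding is shown as one possible alternative, not as what the notebook does.

```python
import numpy as np

# Hypothetical regressor outputs on the 0-2 label scale
raw = np.array([0.2, 0.9, 1.4, 1.6, 1.99])

truncated = raw.astype(int)          # drops the fractional part
rounded = np.rint(raw).astype(int)   # nearest integer

print(truncated)
print(rounded)
```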


In [63]:
#Calculate the accuracy of the model
accuracy = (test_data['winner'] == test_data['prediction_win']).sum() / len(test_data)
print('The accuracy of the model is: ', accuracy)

The accuracy of the model is:  0.4182132067328442
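
An accuracy figure is easier to judge next to a simple baseline. This sketch uses illustrative labels (not the notebook's data) to compute the same metric with `accuracy_score` and to compare it against always predicting the most frequent class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels: 0 = home win, 1 = draw, 2 = away win
y_true = np.array([0, 0, 1, 2, 0, 1, 0, 2])
y_model = np.array([0, 1, 1, 0, 0, 0, 0, 2])

model_acc = accuracy_score(y_true, y_model)

# Baseline: always predict the most frequent class in y_true
most_common = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, most_common))

print("model:", model_acc)
print("baseline:", baseline_acc)
```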


In [66]:
#Save the model
#sklearn.externals.joblib is no longer available in recent scikit-learn versions, so import joblib directly
import joblib

joblib.dump(forest_model, 'model.pkl')
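
Recent scikit-learn versions no longer ship `sklearn.externals.joblib`, and the standalone `joblib` package provides the same `dump` and `load` functions. As a sketch of the save and load round trip, the example below trains a tiny synthetic model, writes it to a temporary directory (an assumption made so that no file is left behind) and checks that the restored model predicts the same values.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic model, only to demonstrate the round trip
X_toy = np.array([[0, 1], [1, 0], [2, 3], [3, 2]])
y_toy = np.array([0, 2, 1, 0])
model = RandomForestRegressor(n_estimators=10, random_state=100).fit(X_toy, y_toy)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```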

In [None]:
#Use the model to predict the winner of the matchdays of the season 2019-2020
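
This cell is left empty in the notebook. One way it could be filled in is sketched below on illustrative data, reusing the notebook's column names. The toy DataFrame and the small model are assumptions made for the example. In the notebook, `df` and `forest_model` would take their place.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative data with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "season": ["2014-2015"] * 4 + ["2019-2020"] * 2,
    "home_team_num": [0, 1, 2, 0, 1, 2],
    "away_team_num": [1, 2, 0, 2, 0, 1],
    "winner": [0, 2, 1, 0, 2, 1],
})
features = ["home_team_num", "away_team_num"]

# Train on the earlier season only
train = toy[toy["season"] < "2015-2016"]
model = RandomForestRegressor(n_estimators=10, random_state=100)
model.fit(train[features], train["winner"])

# Select the target season and predict, using the notebook's integer conversion
season = toy[toy["season"] == "2019-2020"].copy()
season["prediction_win"] = model.predict(season[features]).astype(int)
print(season)
```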

