This notebook uses the soccer matches dataset from FiveThirtyEight. The Matches dataset contains match-by-match SPI ratings and forecasts going back to 2016.

Before working with this dataset, I had previously attempted to build regression models on the NBA dataset from FiveThirtyEight. However, my prediction accuracy there was consistently below 10%, even after multiple attempts at feature engineering. Hopefully this dataset gives better results.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('../data/raw/spi_matches.csv')
df.head()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
0,2016-08-12,1843,French Ligue 1,Bastia,Paris Saint-Germain,51.16,85.68,0.0463,0.838,0.1157,...,32.4,67.7,0.0,1.0,0.97,0.63,0.43,0.45,0.0,1.05
1,2016-08-12,1843,French Ligue 1,AS Monaco,Guingamp,68.85,56.48,0.5714,0.1669,0.2617,...,53.7,22.9,2.0,2.0,2.45,0.77,1.75,0.42,2.1,2.1
2,2016-08-13,2411,Barclays Premier League,Hull City,Leicester City,53.57,66.81,0.3459,0.3621,0.2921,...,38.1,22.2,2.0,1.0,0.85,2.77,0.17,1.25,2.1,1.05
3,2016-08-13,2411,Barclays Premier League,Burnley,Swansea City,58.98,59.74,0.4482,0.2663,0.2854,...,36.5,29.1,0.0,1.0,1.24,1.84,1.71,1.56,0.0,1.05
4,2016-08-13,2411,Barclays Premier League,Middlesbrough,Stoke City,56.32,60.35,0.438,0.2692,0.2927,...,33.9,32.5,1.0,1.0,1.4,0.55,1.13,1.06,1.05,1.05


In [5]:
df.tail()

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
20886,2019-06-09,1871,Spanish Segunda Division,Osasuna,Real Oviedo,31.47,28.58,0.4935,0.2097,0.2968,...,,,,,,,,,,
20887,2019-06-09,1871,Spanish Segunda Division,Gimnástic Tarragona,Lugo,20.93,24.81,0.3965,0.2842,0.3193,...,,,,,,,,,,
20888,2019-06-09,1871,Spanish Segunda Division,Rayo Majadahonda,Reus Deportiu,24.48,24.29,0.4446,0.2459,0.3095,...,,,,,,,,,,
20889,2019-06-09,1871,Spanish Segunda Division,Almeria,Albacete,33.09,31.97,0.4693,0.228,0.3028,...,,,,,,,,,,
20890,2019-06-09,1871,Spanish Segunda Division,Tenerife,Real Zaragoza,27.87,32.5,0.4076,0.3056,0.2868,...,,,,,,,,,,


In [6]:
df.columns

Index(['date', 'league_id', 'league', 'team1', 'team2', 'spi1', 'spi2',
       'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2',
       'importance1', 'importance2', 'score1', 'score2', 'xg1', 'xg2', 'nsxg1',
       'nsxg2', 'adj_score1', 'adj_score2'],
      dtype='object')

First I'm going to fit a baseline model on this data just to see what the accuracy turns out to be. The first bit of feature engineering is label encoding the team1 and team2 columns so that the team names become integers.

In [7]:
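from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# The encoder is refit for each column, so the same team may get
# different integer codes in team1 and team2.
le.fit(df['team1'])
df['team1'] = le.transform(df['team1'])
le.fit(df['team2'])
df['team2'] = le.transform(df['team2'])
# Only the last expression in a cell is displayed, so head(3) shows nothing here.
df.head(3)
df.tail(3)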

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
20888,2019-06-09,1871,Spanish Segunda Division,504,519,24.48,24.29,0.4446,0.2459,0.3095,...,,,,,,,,,,
20889,2019-06-09,1871,Spanish Segunda Division,31,29,33.09,31.97,0.4693,0.228,0.3028,...,,,,,,,,,,
20890,2019-06-09,1871,Spanish Segunda Division,619,514,27.87,32.5,0.4076,0.3056,0.2868,...,,,,,,,,,,
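
One thing to watch in the encoding cell above is that the encoder is fit separately on each column, so a team's code in team1 is not guaranteed to match its code in team2. A minimal sketch of one alternative, assuming a single encoder fit on both columns together, is shown here. The `toy` frame, the `shared_le` name, and the `_id` columns are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the matches table (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "team1": ["Bastia", "AS Monaco", "Guingamp"],
    "team2": ["Paris Saint-Germain", "Guingamp", "Bastia"],
})

# Fit one encoder on the union of both columns so each team has a single code.
shared_le = LabelEncoder()
shared_le.fit(pd.concat([toy["team1"], toy["team2"]]))

# Reuse the same fitted encoder for both columns.
toy["team1_id"] = shared_le.transform(toy["team1"])
toy["team2_id"] = shared_le.transform(toy["team2"])
print(toy)
```

With this approach Bastia gets the same integer whether it appears as team1 or team2. Integer codes still impose an arbitrary ordering on teams, so this is only a consistency fix, not a claim that label encoding is the best representation.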


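With some simple feature engineering, I think it is possible to build a model that predicts the score. The first one I will try is a random forest. Note that the code below uses scikit-learn's random forest classifier rather than a regressor, so each goal count in score1 is treated as a separate class.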

In [12]:
# Check for and handle null values before fitting the model.
df.fillna(0)  # returns a new DataFrame that is not assigned, so df is unchanged
df = df.dropna()  # drop every row that still contains a NaN
df.tail(3)

Unnamed: 0,date,league_id,league,team1,team2,spi1,spi2,prob1,prob2,probtie,...,importance1,importance2,score1,score2,xg1,xg2,nsxg1,nsxg2,adj_score1,adj_score2
14717,2018-10-23,1818,UEFA Champions League,566,394,74.96,94.08,0.2034,0.5574,0.2392,...,72.8,37.2,0.0,3.0,1.23,2.66,1.12,2.11,0.0,3.15
14718,2018-10-23,1818,UEFA Champions League,508,669,91.28,61.05,0.8568,0.0362,0.107,...,52.7,32.2,2.0,1.0,2.95,1.52,4.14,0.57,2.1,1.05
14719,2018-10-23,1818,UEFA Champions League,18,124,77.42,73.56,0.5331,0.1977,0.2692,...,100.0,100.0,3.0,0.0,2.16,0.73,2.27,1.31,3.15,0.0


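At first I wasn't sure whether fillna had actually replaced the NaN values with zeros. It had not: fillna returns a new DataFrame by default, and since I never assigned the result back to df, the original was left unchanged. Instead I used dropna to remove every row that contains a NaN.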
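
The difference is easy to see on a small example. This is a minimal sketch using a made-up `demo` frame, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with two missing values.
demo = pd.DataFrame({"score1": [1.0, np.nan, 2.0],
                     "score2": [0.0, 3.0, np.nan]})

# The result is discarded, so demo itself still contains the NaN values.
demo.fillna(0)
remaining_nan = int(demo.isna().sum().sum())
print("NaN cells after unassigned fillna:", remaining_nan)

# Assigning the result keeps every row and replaces NaN with 0.
filled = demo.fillna(0)

# dropna removes any row with at least one NaN.
dropped = demo.dropna()
print(len(filled), "rows after fillna,", len(dropped), "rows after dropna")
```

Which of the two is appropriate depends on what a missing value means. Filling a missing score with 0 would make the row look like a real match in which a team scored no goals, which is one reason to prefer dropping here.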

In [19]:
# Split the data into train and test sets using the 80/20 split discussed in class.
from sklearn.model_selection import train_test_split
labels = df['score1']
# I'm not sure whether score2 can be predicted as well as score1.
# Both can be tested by swapping the label column.
features = df[['team1', 'team2', 'spi1', 'spi2', 'prob1', 'prob2', 'importance1', 'importance2', 'probtie', 'xg1', 'xg2', 'nsxg1', 'nsxg2']]
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    labels, 
                                                    test_size=0.20, 
                                                    random_state=42)

At first I checked the prediction accuracy with only a small subset of features, but I'm not sure which features matter most in this case, so for now I'm including all of the candidate columns in the feature set.

In [20]:
from sklearn.ensemble import RandomForestClassifier
# Create a random forest classifier with default settings
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Predict score1 for the held-out test set
prediction = model.predict(X_test)
print(prediction)

[0. 0. 1. ... 1. 1. 3.]


In [21]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))

0.3398328690807799
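
To judge whether an accuracy like this is any good, it helps to compare it against a trivial baseline that always predicts the most common score. The sketch below uses scikit-learn's `DummyClassifier` on synthetic goal counts. The Poisson-distributed labels and all of the `*_demo` names are assumptions for illustration, so the number it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical goal counts standing in for score1 (not the real data).
y_train_demo = rng.poisson(1.5, size=800)
y_test_demo = rng.poisson(1.5, size=200)

# The baseline ignores the features, so placeholders are enough.
X_train_demo = np.zeros((800, 1))
X_test_demo = np.zeros((200, 1))

# Always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_acc = accuracy_score(y_test_demo, baseline_pred)
print(f"most-frequent baseline accuracy: {baseline_acc:.3f}")
```

Running the same baseline on the notebook's own X_train, y_train, X_test, and y_test would show how much the random forest adds over simply guessing the most common goal count.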


One thing that could improve the accuracy is figuring out which features carry more weight in determining the score. Aside from that, the dataset seems to be pretty complete.
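
One way to get a first look at feature weight is the `feature_importances_` attribute that a fitted random forest exposes. Below is a minimal sketch on synthetic data. The `demo_X` columns borrow two names from the dataset plus a made-up `noise` column, and the label is generated from `xg1` alone, so none of the values reflect the real matches.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features (hypothetical values, not the real data).
demo_X = pd.DataFrame({
    "xg1": rng.gamma(2.0, 0.75, size=n),
    "spi1": rng.uniform(20, 95, size=n),
    "noise": rng.normal(size=n),
})

# Synthetic goal counts that depend only on xg1.
demo_y = rng.poisson(demo_X["xg1"].to_numpy())

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(demo_X, demo_y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=demo_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same two lines that build `importances` should work on the notebook's `model` and `features` objects. Impurity-based importances can be misleading for some kinds of features, so `sklearn.inspection.permutation_importance` on the test set is worth trying as a cross-check.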