### Data Loading

In [None]:
import pandas as pd
from helpers import *

First, let's create our DataFrame from the source data. We'll also transform the `Score_home` and `Score_away` columns into our target variable such that:
$$
y = \begin{cases}
-1 & \text{if the home team loses} \\
0 & \text{if the match is a draw} \\
1 & \text{if the home team wins}
\end{cases}
$$
using the helper function `score_to_win()`.
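
The `helpers` module is not shown in this notebook, so the following is only a sketch of what `score_to_win()` might look like. It assumes the function receives one row holding `Score_home` and `Score_away` and returns the label from the home team's perspective. The toy scores are made up for illustration and are not taken from `masterdata.csv`.

```python
import pandas as pd

def score_to_win(row):
    """Hypothetical stand-in for the helper: map a match's scores to -1, 0 or 1."""
    home, away = row['Score_home'], row['Score_away']
    # 1 when the home team scores more, -1 when it scores fewer, 0 when level
    return int(home > away) - int(home < away)

# Toy rows (made-up scores): a draw, a home loss, a home win
toy = pd.DataFrame({'Score_home': [3, 0, 2], 'Score_away': [3, 1, 0]})
toy['target'] = toy[['Score_home', 'Score_away']].apply(score_to_win, axis=1)
print(toy)
```

Because only the two score columns are needed, a vectorised alternative such as `numpy.sign(df['Score_home'] - df['Score_away'])` would give the same labels without a row-wise `apply`.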

In [2]:
DATA_SRC = '../Data/PL_site_2006_2018/masterdata.csv'
df = pd.read_csv(DATA_SRC)

# create win/lose label
df['target'] = df[['Score_home', 'Score_away']].apply(score_to_win, axis = 1)
df.head()

Unnamed: 0,MatchID,Home_team,Away_team,Score_home,Score_away,Possession_home,Possession_away,Shots_on_target_home,Shots_on_target_away,Shots_home,...,Corners_away,Offsides_home,Offsides_away,Yellow_cards_home,Yellow_cards_away,Fouls_conceded_home,Fouls_conceded_away,Red_cards_home,Red_cards_away,target
0,5937,Blackburn,Reading,3,3,54.0,46.0,6.0,4.0,15.0,...,10.0,5.0,3.0,2.0,0.0,18.0,7.0,0.0,0.0,0
1,5938,Bolton,Aston Villa,2,2,47.1,52.9,2.0,2.0,11.0,...,6.0,0.0,2.0,2.0,1.0,10.0,11.0,0.0,0.0,0
2,5939,Chelsea,Everton,1,1,59.3,40.7,7.0,6.0,20.0,...,2.0,6.0,2.0,2.0,1.0,13.0,7.0,0.0,0.0,0
3,5940,Liverpool,Charlton,2,2,61.6,38.4,5.0,4.0,23.0,...,2.0,6.0,4.0,0.0,0.0,5.0,13.0,0.0,0.0,0
4,5941,Man Utd,West Ham,0,1,65.3,34.7,7.0,2.0,30.0,...,3.0,0.0,1.0,0.0,2.0,13.0,12.0,0.0,0.0,-1


### Feature Extraction
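Next, we drop the columns that should not be used as model inputs: the match identifier, the team names, the scores (which determine the target directly), and the target itself. The result, `df_wo`, is what we pass to the model. This leaves us with 24 available features.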

In [31]:
df_wo = df.drop(columns = ['target', 'MatchID', 'Home_team', 'Away_team', 'Score_home', 'Score_away'])
list(df_wo)

['Possession_home',
 'Possession_away',
 'Shots_on_target_home',
 'Shots_on_target_away',
 'Shots_home',
 'Shots_away',
 'Touches_home',
 'Touches_away',
 'Passes_home',
 'Passes_away',
 'Tackles_home',
 'Tackles_away',
 'Clearances_home',
 'Clearances_away',
 'Corners_home',
 'Corners_away',
 'Offsides_home',
 'Offsides_away',
 'Yellow_cards_home',
 'Yellow_cards_away',
 'Fouls_conceded_home',
 'Fouls_conceded_away',
 'Red_cards_home',
 'Red_cards_away']

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [33]:
X = df_wo.values
y = df['target'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2)

In [34]:
lr = LogisticRegression(random_state = 42)

lr.fit(X_train, y_train)
lr.score(X_test, y_test)



0.6535087719298246

To gain some insight into the relevance and predictive power of our features, we can inspect the model's coefficients. Each feature has 3 values, one per class. scikit-learn orders them as in `lr.classes_`, which is sorted, so the values correspond to -1 (home loss), 0 (draw) and 1 (home win), in that order.
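
To make the class ordering explicit, the coefficients can be placed in a labelled table. The sketch below uses synthetic data and made-up feature names (`feat_a`, `feat_b`) rather than the match data, and the names `clf` and `coef_table` are illustrative only. It relies on two documented attributes of a fitted `LogisticRegression`: `classes_` and `coef_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 rows, 2 made-up features
X_demo = rng.normal(size=(300, 2))
# Labels in {-1, 0, 1}, driven mostly by the first feature
noisy = X_demo[:, 0] + 0.3 * rng.normal(size=300)
y_demo = np.digitize(noisy, [-0.5, 0.5]) - 1

clf = LogisticRegression().fit(X_demo, y_demo)

# classes_ holds the labels in sorted order; the rows of coef_ follow that order
print(clf.classes_)

# coef_ has shape (n_classes, n_features); transpose so each row is a feature
coef_table = pd.DataFrame(
    clf.coef_.T,
    index=['feat_a', 'feat_b'],
    columns=[f'class_{c}' for c in clf.classes_],
)
print(coef_table)
```

Applied to the notebook's model, the same pattern would use `lr.coef_.T`, `df_wo.columns` and `lr.classes_`. Note that the features are on very different scales (possession percentages versus pass counts), so raw coefficient magnitudes are not directly comparable across features.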

In [36]:
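# lr.coef_ has shape (n_classes, n_features); column i holds feature i's
# coefficient for each class, in the order given by lr.classes_
for i, feature in enumerate(df_wo.columns):
    print(feature, ": ", lr.coef_[:, i])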

Possession_home :  [ 0.13747925  0.02769014 -0.16396916]
Possession_away :  [-0.17997966 -0.03453582  0.17974564]
Shots_on_target_home :  [-0.30363313 -0.1959425   0.42429311]
Shots_on_target_away :  [ 0.41669243 -0.06490135 -0.32400902]
Shots_home :  [ 0.02544738  0.01205473 -0.03173392]
Shots_away :  [-0.02896331 -0.0045982   0.02951583]
Touches_home :  [-0.00551344  0.00455849  0.00075069]
Touches_away :  [ 0.00577394  0.0008052  -0.00444356]
Passes_home :  [-0.01177352 -0.00886142  0.01828333]
Passes_away :  [ 0.01408628  0.00099977 -0.01505982]
Tackles_home :  [-1.19447653e-02  7.35051827e-03  6.00960049e-05]
Tackles_away :  [ 0.02500506 -0.01600747 -0.00443609]
Clearances_home :  [-0.0401921  -0.00961571  0.04117623]
Clearances_away :  [ 0.03673126  0.0076595  -0.0480113 ]
Corners_home :  [-0.02518386  0.00899903  0.02540298]
Corners_away :  [ 0.02786562  0.00852624 -0.03348711]
Offsides_home :  [-0.06468234 -0.02413637  0.07332252]
Offsides_away :  [ 0.00391521  0.01536046 -0.02