In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import pandas as pd

# Classification Model Building

In this notebook, I build a model to predict game results from game metadata.

First we import the data

In [2]:
games = pd.read_pickle(os.path.join('Data','final_games.pkl'))
games.head()

Unnamed: 0,HOME_TEAM_ABBREV,AWAY_TEAM_ABBREV,SEASON,HOME_RECORD,ROAD_RECORD,HOME_TEAM_WINS,OFF_HOME_TEAM,DEF_HOME_TEAM,REB_HOME_TEAM,OFF_AWAY_TEAM,...,REB_AWAY_TEAM,OFF_HOME_RATING,DEF_HOME_RATING,REB_HOME_RATING,OFF_AWAY_RATING,DEF_AWAY_RATING,REB_AWAY_RATING,OFFENSE_DIFFERENCE,DEFENSE_DIFFERENCE,REBOUND_DIFFERENCE
0,CHA,MIL,2019,-10,20,0,"[31.006061923099047, 38.54431945273706, 32.704...","[22.10810810810811, 28.81081081081081, 22.2162...","[23.308831973407106, 31.182094864883947, 35.85...","[35.22876717660929, 65.69235449876932, 37.7094...",...,"[17.28106695654748, 58.916993867334874, 24.020...",33.477084,20.021622,26.02933,39.084703,26.043243,26.156013,-5.607619,-6.021622,-0.126683
1,MIN,DAL,2019,-15,11,0,"[31.63000087955722, 24.95066073256286, 31.9678...","[29.405405405405403, 5.351351351351353, 21.459...","[17.25898881532161, 13.456670998027207, 24.888...","[41.62244146926659, 29.710952103096684, 48.975...",...,"[22.37412606063427, 21.685969263523607, 38.388...",35.712786,20.486486,23.266696,36.429985,20.762162,21.56297,-0.717199,-0.275676,1.703727
2,LAC,PHI,2019,19,-13,1,"[44.03598551522123, 37.12656298881634, 28.6805...","[53.513513513513516, 14.702702702702705, 8.054...","[26.533392671106416, 30.47002411684483, 18.105...","[25.93974282213501, 42.6812218695769, 44.48622...",...,"[9.964396462598044, 30.33662390755899, 42.5476...",40.421789,25.940541,25.653422,34.18872,22.178378,21.569685,6.233069,3.762162,4.083737
3,DEN,TOR,2019,17,10,1,"[43.82752587784179, 32.27739138834381, 54.2482...","[28.108108108108105, 21.513513513513516, 34.81...","[29.068883152536877, 24.15430293073206, 63.988...","[27.52057045795501, 34.03891798056395, 45.9075...",...,"[14.924055357398997, 26.57847134410255, 40.123...",36.647265,22.421622,24.336735,33.382946,19.313514,19.538902,3.264319,3.108108,4.797833
4,SAC,DET,2019,-2,-13,1,"[45.720084398700706, 29.937134564603657, 36.76...","[14.702702702702705, 16.702702702702705, 15.40...","[34.7626909366028, 23.82034569793837, 23.35391...","[27.664451648755787, 32.20682601470979, 37.932...",...,"[10.074796140113614, 23.019953413609556, 41.90...",39.696424,22.264865,25.286163,30.49255,15.410811,17.782184,9.203875,6.854054,7.50398


Then I create the feature table `X` and the target `y` (whether the home team won).

In [3]:
X = games[['HOME_RECORD', 'ROAD_RECORD', 'OFF_HOME_RATING', 'DEF_HOME_RATING', 'REB_HOME_RATING',
           'OFF_AWAY_RATING', 'DEF_AWAY_RATING', 'REB_AWAY_RATING','OFFENSE_DIFFERENCE', 
           'DEFENSE_DIFFERENCE','REBOUND_DIFFERENCE']]
y = games['HOME_TEAM_WINS']

In [4]:
X.dtypes

HOME_RECORD             int64
ROAD_RECORD             int64
OFF_HOME_RATING       float64
DEF_HOME_RATING       float64
REB_HOME_RATING       float64
OFF_AWAY_RATING       float64
DEF_AWAY_RATING       float64
REB_AWAY_RATING       float64
OFFENSE_DIFFERENCE    float64
DEFENSE_DIFFERENCE    float64
REBOUND_DIFFERENCE    float64
dtype: object
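The displayed rows suggest that each `*_DIFFERENCE` column is simply the home rating minus the away rating, which would make those columns linear combinations of other features. The check below is a small sketch that retypes the five offense values from the `games.head()` output above (rounded to six decimals, so the comparison uses a tolerance); the variable names are my own.

```python
import numpy as np

# Values copied from the five rows shown by games.head() (rounded to 6 decimals).
off_home_rating = np.array([33.477084, 35.712786, 40.421789, 36.647265, 39.696424])
off_away_rating = np.array([39.084703, 36.429985, 34.188720, 33.382946, 30.492550])
offense_difference = np.array([-5.607619, -0.717199, 6.233069, 3.264319, 9.203875])

# If the difference column is derived, home minus away should reproduce it
# up to the rounding of the printed values.
is_derived = np.allclose(off_home_rating - off_away_rating, offense_difference, atol=1e-5)
print(is_derived)
```

If the same holds for the full table, the difference columns add no new information for a linear model, and tree-based importances will be split between each difference and its two source ratings.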

## Logistic Regression

The first classifier I use to build the model is logistic regression.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression( solver = 'lbfgs')
model.fit(X_train, y_train)
pred = model.predict(X_test)

The model scores about 73% on this split, which is significantly better than guessing, but I had hoped for a model with a score of at least 80%.

In [7]:
model.score(X_test, y_test)

0.725717776420281
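To put "better than guessing" in numbers, it helps to compare against always predicting a home win. The sketch below rebuilds the test labels from the confusion matrix reported in this notebook (436 + 253 = 689 home losses, 196 + 752 = 948 home wins) and scores scikit-learn's `DummyClassifier` on them. Fitting and scoring on the same labels is a simplification for illustrating the arithmetic, not how a baseline would normally be evaluated.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the test-split confusion matrix in this notebook:
# 436 + 253 = 689 home losses, 196 + 752 = 948 home wins.
y_from_counts = np.array([0] * 689 + [1] * 948)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_from_counts), 1))

# Always predict the most frequent class (a home win).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_from_counts)
baseline_accuracy = baseline.score(X_placeholder, y_from_counts)
print(round(baseline_accuracy, 3))
```

On these counts the baseline comes out near 58%, so the logistic regression's roughly 73% is a gain of about 15 percentage points over always picking the home team.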

To estimate the average accuracy, I will repeat the random train/test split several times and average the scores.

In [8]:
def return_score():
    """ returns score of one trial """
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    m = LogisticRegression( solver = 'lbfgs')
    m.fit(X_train, y_train)
    return m.score(X_test, y_test)

In [9]:
def return_average(n):
    """ returns average score """
    return np.array([return_score() for i in range(n)]).mean()

In [10]:
return_average(10)

0.721930360415394
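Averaging over repeated random splits is close in spirit to k-fold cross-validation, which scikit-learn provides through `cross_val_score`. The sketch below is an alternative, not what this notebook ran: it uses a synthetic stand-in from `make_classification` (11 numeric features, an assumption chosen to match the width of `X`) so that it runs without the games file.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the games table (assumption: 11 numeric features).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=5, random_state=0
)

# Each of the 10 folds is held out once while the model trains on the rest.
clf_demo = LogisticRegression(solver="lbfgs", max_iter=1000)
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=10, scoring="accuracy")

# The spread across folds shows how much a single split can move the score.
print(cv_scores.mean(), cv_scores.std())
```

Unlike repeated random splits, every row is used for testing exactly once, and the standard deviation gives a rough sense of how noisy a single score is.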

The confusion matrix shows that our model struggles more with predicting losses for the home team: the false positive count is too high for comfort. This could be a result of the actual upsets we have been seeing in the past year, or it could be because our model is overfitting. Still, if I were a betting man (I'm not saying I am), an accuracy of around 70% is not comforting.

Looking at our diagnostics, I am concerned that the accuracy of this classifier is too low for comfort. Specifically, the specificity (the share of actual home losses that the model correctly predicts as losses) is too low.

In [11]:
from sklearn import metrics
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred).ravel()
tn, fp, fn, tp

(436, 253, 196, 752)

In [12]:
## precision: share of predicted home wins that were actual home wins
print(metrics.precision_score(y_test,pred))
## recall: share of actual home wins that were predicted correctly
print(metrics.recall_score(y_test,pred))
## accuracy: share of all games whose outcome was predicted correctly
print(metrics.accuracy_score(y_test,pred))
## F1: harmonic mean of precision and recall for home wins
print(metrics.f1_score(y_test,pred))

## specificity: share of actual home losses that were predicted correctly
tn / (tn + fp)

0.7482587064676617
0.7932489451476793
0.725717776420281
0.7700972862263185


0.6328011611030478
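Each of the numbers above can be recomputed by hand from the four confusion-matrix counts, which makes the definitions concrete. The sketch below uses only the counts `(436, 253, 196, 752)` printed earlier; the variable names are my own.

```python
# Counts from the logistic regression confusion matrix above.
tn, fp, fn, tp = 436, 253, 196, 752

precision = tp / (tp + fp)                    # predicted home wins that were real
recall = tp / (tp + fn)                       # real home wins that were found
accuracy = (tp + tn) / (tn + fp + fn + tp)    # all games predicted correctly
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                  # real home losses that were found

for name, value in [("precision", precision), ("recall", recall),
                    ("accuracy", accuracy), ("f1", f1),
                    ("specificity", specificity)]:
    print(f"{name}: {value:.4f}")
```

Of the 689 actual home losses, 253 were predicted as wins, which is why specificity (about 0.63) trails recall (about 0.79).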

## Random Forest Classifier

Next, I'll try a random forest classifier so that I can look at important features and also see what happens if I change the parameters. Currently we again get an accuracy of around 70%, but that is with default parameters.

In [13]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
guess = model.predict(X_test)

In [14]:
model.score(X_test, y_test)

0.7263286499694563

Now let's do a grid search to find the best parameters I could be using for the random forest classifier. As you can see the best parameters are {'max_depth': None, 'min_samples_leaf': 20, 'min_samples_split': 7}

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'max_depth': [2,3,4,5,7,10,13,15,18,None], 
    'min_samples_split':[2,3,5,7,10,15,20],
    'min_samples_leaf':[2,3,5,7,10,15,20,22]
}

clf = GridSearchCV(RandomForestClassifier(), parameters, cv=5, scoring = 'accuracy')
clf.fit(X_train, y_train)

In [None]:
clf.best_params_
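Besides `best_params_`, a fitted `GridSearchCV` exposes `best_score_` and the full `cv_results_` table, which show how close the runner-up settings were. The sketch below uses a deliberately tiny grid and synthetic data from `make_classification` so that it runs quickly; the grid values and sample sizes are illustrative assumptions, not the ones searched in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption) so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=0
)

# A tiny illustrative grid: 2 x 2 = 4 candidate settings.
small_grid = {"max_depth": [3, None], "min_samples_leaf": [5, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    small_grid, cv=3, scoring="accuracy",
)
search.fit(X_demo, y_demo)

# One row per candidate, best first, with mean and spread of the CV accuracy.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(search.best_params_, round(search.best_score_, 3))
print(results)
```

When the top few rows differ by less than their standard deviation, the "best" parameters are likely to change from run to run, especially with an unseeded random forest.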

In [None]:
# use the parameters selected by the grid search rather than hard-coded values
model = RandomForestClassifier(**clf.best_params_)
model.fit(X_train, y_train)
pred = model.predict(X_test)
model.score(X_test, y_test)

This shows that our random forest classifier does a little better than our logistic regression.

In [None]:
metrics.confusion_matrix(y_test, pred).ravel()

Now let's get the average score again with the same parameters.

In [None]:
def return_score():
    """ returns score of one trial """
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    m = RandomForestClassifier( min_samples_leaf = 20, min_samples_split = 7)
    m.fit(X_train, y_train)
    return m.score(X_test, y_test)


In [None]:
def return_average(n):
    """ returns average score """
    return np.array([return_score() for i in range(n)]).mean()

In [None]:
return_average(10)

Now let's look at which features are the most important. It seems that the rebounding categories are currently the least significant.

In [None]:
pd.DataFrame(model.feature_importances_, X.columns.values, columns=['importance']).sort_values('importance', ascending = False)
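The `feature_importances_` attribute is impurity-based and is computed on the training data, so correlated columns can share or hide importance. One common cross-check is `sklearn.inspection.permutation_importance`, which measures how much the held-out score drops when a single column is shuffled. The sketch below runs on synthetic data with placeholder column names (both are assumptions) and is only meant to show the shape of the comparison.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in with placeholder feature names (assumption).
feature_names = [f"feature_{i}" for i in range(8)]
X_arr, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set 5 times and record the score drop.
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

comparison = pd.DataFrame(
    {
        "impurity_importance": forest.feature_importances_,
        "permutation_importance": perm.importances_mean,
    },
    index=feature_names,
).sort_values("permutation_importance", ascending=False)
print(comparison)
```

A feature that ranks low on both measures is a safer candidate for removal than one that ranks low on the impurity-based measure alone.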

Let's get rid of the rebounding parameters first. As you can see, it does improve the model.

In [None]:
X = games[['HOME_RECORD', 'ROAD_RECORD', 'OFF_HOME_RATING', 'DEF_HOME_RATING',
           'OFF_AWAY_RATING', 'DEF_AWAY_RATING','OFFENSE_DIFFERENCE', 
           'DEFENSE_DIFFERENCE']]
y = games['HOME_TEAM_WINS']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
model = RandomForestClassifier( min_samples_leaf = 20, min_samples_split = 7)
model.fit(X_train, y_train)
model.score(X_test, y_test)

Now that I know these parameters and features are the best, let's check the confusion matrix of this model to see if it performs better than the logistic regression.

In [None]:
X = games[['HOME_RECORD', 'ROAD_RECORD', 'OFF_HOME_RATING', 'DEF_HOME_RATING',
           'OFF_AWAY_RATING', 'DEF_AWAY_RATING','OFFENSE_DIFFERENCE', 
           'DEFENSE_DIFFERENCE']]
y = games['HOME_TEAM_WINS']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier( min_samples_leaf = 20, min_samples_split = 7)
model.fit(X_train, y_train)
pred = model.predict(X_test)
model.score(X_test,y_test)

It looks like the confusion matrix for the random forest classifier is slightly better than that of the logistic regression, as the accuracy is slightly higher. However, the specificity is still low. I realized that it's probably because there are a good number of upsets.

In [None]:
from sklearn import metrics
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred).ravel()
tn, fp, fn, tp

In [None]:
## precision: share of predicted home wins that were actual home wins
print(metrics.precision_score(y_test,pred))
## recall: share of actual home wins that were predicted correctly
print(metrics.recall_score(y_test,pred))
## accuracy: share of all games whose outcome was predicted correctly
print(metrics.accuracy_score(y_test,pred))
## F1: harmonic mean of precision and recall for home wins
print(metrics.f1_score(y_test,pred))

## specificity: share of actual home losses that were predicted correctly
tn / (tn + fp)

In [None]:
return_average(10)

## K-Nearest Neighbors Classifier

Now let's try a K-Nearest Neighbors classifier and fit it with a similar training and test set. The baseline model gives me around the same accuracy, but let's do a grid search again.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train, y_train)
guess = model.predict(X_test)

In [None]:
model.score(X_test, y_test)

In [None]:
def KN_return_score():
    """ returns score of one trial """
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    m = KNeighborsClassifier()
    m.fit(X_train, y_train)
    return m.score(X_test, y_test)

def KN_return_average(n):
    """ returns average score """
    return np.array([KN_return_score() for i in range(n)]).mean()

KN_return_average(10)

Here I do a grid search on the `n_neighbors` parameter of my K-Nearest Neighbors classifier and find that the best value is 15 neighbors, so I use it and fit a new model to evaluate against the test set. As you can see, it gives me a model that is similar to the random forest model. So I'm convinced that I probably need more features to make this model better... or there are simply a lot of upsets.

In [None]:
parameters = {
    'n_neighbors': [3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}

clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=5, scoring = 'accuracy')
clf.fit(X_train,y_train)

In [None]:
clf.best_params_

In [None]:
model = KNeighborsClassifier(n_neighbors = 15)
model.fit(X_train, y_train)
pred = model.predict(X_test)
model.score(X_test,y_test)

In [None]:
def KN_return_score():
    """ returns score of one trial """
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    m = KNeighborsClassifier(n_neighbors = 15)
    m.fit(X_train, y_train)
    return m.score(X_test, y_test)
# average the KNN score; return_average would score the random forest instead
KN_return_average(10)
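K-Nearest Neighbors is distance-based, so features on larger numeric scales dominate the neighbor search. In this dataset the record columns are integers while the ratings are floats in a different range, so standardizing before fitting may matter. The sketch below is one way to do that with a `Pipeline`, shown on synthetic data with one column deliberately inflated (an assumption for illustration); the step names `scale` and `knn` are my own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); inflate one column to mimic mixed scales.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_demo[:, 0] *= 100
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled using only
# its own training portion.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
knn_grid = {"knn__n_neighbors": [3, 5, 9, 15]}
knn_search = GridSearchCV(knn_pipeline, knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_tr, y_tr)

knn_test_score = knn_search.score(X_te, y_te)
print(knn_search.best_params_, knn_test_score)
```

Whether scaling helps on the games data is something to measure rather than assume; the pipeline just makes the comparison easy to run.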

Looking at the confusion matrix, I get an accuracy similar to the one I get for the random forest classifier, which suggests that my model is probably about as good as it can be with these features. The specificity score is low, and that tells me that there are quite a lot of upsets.

In [None]:
tn, fp, fn, tp = metrics.confusion_matrix(y_test, pred).ravel()
tn, fp, fn, tp

In [None]:
## precision: share of predicted home wins that were actual home wins
print(metrics.precision_score(y_test,pred))
## recall: share of actual home wins that were predicted correctly
print(metrics.recall_score(y_test,pred))
## accuracy: share of all games whose outcome was predicted correctly
print(metrics.accuracy_score(y_test,pred))
## F1: harmonic mean of precision and recall for home wins
print(metrics.f1_score(y_test,pred))

## specificity: share of actual home losses that were predicted correctly
tn / (tn + fp)
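Since specificity is the recurring weak spot, one thing worth trying is moving the decision threshold instead of changing the model. Classifiers with `predict_proba` default to calling a home win at probability 0.5; raising that cutoff can only keep or reduce the number of false positives, at the cost of recall. The sketch below demonstrates the trade-off on synthetic data (an assumption), with a hypothetical helper `specificity_at`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_win = clf_demo.predict_proba(X_te)[:, 1]  # probability of class 1

def specificity_at(threshold):
    """Hypothetical helper: specificity when class 1 needs proba >= threshold."""
    pred = (proba_win >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)

spec_by_threshold = {t: specificity_at(t) for t in (0.4, 0.5, 0.6)}
for threshold, spec in spec_by_threshold.items():
    print(f"threshold {threshold}: specificity {spec:.3f}")
```

This does not make the model more accurate overall; it only shifts which kind of error it makes, which may or may not be the right trade for predicting upsets.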