# NBA
## Game Winner Prediction

I have always wondered whether, given some key stats from a game (excluding each team's points, of course), we can predict which team won.

So I decided to train a model that predicts the outcome of a game, using those key stats as features.

Before starting, I listed some stats that might be useful as features for the model:
- Assists
- Rebounds
- Steals
- Blocks
- Turnovers
- Field Goal Percentages
- Three Point Field Goal Percentages
- Free Throw Percentages
- Free Throws Made

The model we will use is logistic regression, since the dependent variable, 'HOME_TEAM_WINS', is binary (0/1).
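
As a rough illustration of what the model computes: logistic regression takes a weighted sum of the features, passes it through the sigmoid function to get a probability that the home team won, and predicts 1 when that probability is at least 0.5. The sketch below uses only the standard library. The function names, the two-feature input, and the weights are made up for illustration and are not the fitted model.

```python
import math

def sigmoid(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_home_win(features, weights, intercept, threshold=0.5):
    """Return (probability, label) for one game.

    features and weights are dicts keyed by stat name (hypothetical helper).
    """
    z = intercept + sum(weights[name] * value for name, value in features.items())
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical weights, NOT the fitted coefficients from this notebook
example_weights = {"FG_PCT_home": 20.0, "FG_PCT_away": -20.0}
example_game = {"FG_PCT_home": 0.50, "FG_PCT_away": 0.45}

probability, label = predict_home_win(example_game, example_weights, intercept=0.0)
print(f"P(home win) = {probability:.3f}, predicted label = {label}")
```

In this toy example the home team shot better, so the weighted sum is positive and the predicted label is 1.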

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
import pickle

In [2]:
official_boxscore = pd.read_csv('./output_data/detailed_boxscore.csv')
display(official_boxscore.head())

Unnamed: 0.1,Unnamed: 0,SEASON,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT_home,AST_home,REB_home,STL_home,BLK_home,...,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,STL_away,BLK_away,FTM_away,TO_away,HOME_TEAM_WINS
0,0,2019,85.0,0.354,0.9,0.229,22.0,47.0,7.0,8.0,...,0.402,0.762,0.226,20.0,61.0,7.0,6.0,16.0,16.0,0
1,1,2019,91.0,0.364,0.4,0.31,19.0,57.0,9.0,3.0,...,0.468,0.632,0.275,28.0,56.0,10.0,7.0,12.0,14.0,0
2,2,2019,136.0,0.592,0.805,0.542,25.0,37.0,6.0,6.0,...,0.505,0.65,0.488,27.0,37.0,7.0,4.0,13.0,9.0,1
3,3,2019,133.0,0.566,0.7,0.5,38.0,41.0,6.0,1.0,...,0.461,0.897,0.263,24.0,36.0,9.0,4.0,26.0,11.0,1
4,4,2019,106.0,0.407,0.885,0.257,18.0,51.0,8.0,7.0,...,0.413,0.667,0.429,23.0,42.0,6.0,4.0,16.0,14.0,1


In [3]:
feature_columns = list(official_boxscore.columns[3:12])+list(official_boxscore.columns[13:22])
print(f'Features for the model:{feature_columns}')
x = official_boxscore[feature_columns]
y = official_boxscore['HOME_TEAM_WINS']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.15,random_state=0)

Features for the model:['FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home', 'STL_home', 'BLK_home', 'FTM_home', 'TO_home', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away', 'STL_away', 'BLK_away', 'FTM_away', 'TO_away']


In [4]:
logistic_regression = LogisticRegression(max_iter=2000)
logistic_regression.fit(x_train, y_train)
y_pred = logistic_regression.predict(x_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.9353535353535354


The logistic regression model reaches an accuracy of 93.54% on the held-out test set using the 18 stats.
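
An accuracy figure is easier to judge next to a naive baseline, such as always predicting the more common outcome. The sketch below shows that comparison in plain Python. The helper names are hypothetical, and the labels are just the five `HOME_TEAM_WINS` values from the `head()` output above paired with made-up predictions, so the numbers say nothing about the full dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    _, count = Counter(y_true).most_common(1)[0]
    return count / len(y_true)

# HOME_TEAM_WINS for the five games shown by head()
y_true_sample = [0, 0, 1, 1, 1]
# Made-up predictions, for illustration only
y_pred_sample = [0, 1, 1, 1, 1]

model_acc = accuracy(y_true_sample, y_pred_sample)
baseline_acc = majority_baseline_accuracy(y_true_sample)
print("model accuracy:   ", model_acc)
print("baseline accuracy:", baseline_acc)
```

On the real data, the same comparison could presumably be made by passing `y_test` and `y_pred` from the cell above to these helpers.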

Next, I would like to explore which stats contribute the most to the model. These can be read as the stats that matter most in deciding who won the game.

In [5]:
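# One coefficient per feature, in the same order as the columns of x
importance = logistic_regression.coef_[0]
columns = list(x.columns)
coefficients = pd.DataFrame(importance, index=columns, columns=['Coefficients'])
# Rank features by coefficient magnitude (the sign is discarded)
coefficients = coefficients.abs().sort_values('Coefficients', ascending=False)
display(coefficients)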

Feature,Coefficients
FG_PCT_home,24.427376
FG_PCT_away,24.317845
FG3_PCT_home,7.327599
FG3_PCT_away,7.300488
FT_PCT_home,1.469755
FT_PCT_away,1.387851
TO_away,0.359184
TO_home,0.347544
REB_away,0.186081
REB_home,0.18362


FG_PCT for each team is the key stat in determining the winner of a game, while steals and blocks barely affect the outcome.

I decided to remove BLK and STL from the features, as their coefficients are less than 0.1.
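
One caveat when ranking raw coefficients: the features were not standardized, so a coefficient's size also reflects the scale of its stat. Shooting percentages lie between 0 and 1, while rebounds are counts in the tens, so a one-unit change means very different things for each. A rough way to compare is to multiply each coefficient by the standard deviation of its feature. The sketch below does this for two features, using the coefficient magnitudes printed above and the five `head()` rows as a stand-in sample. Five games is far too few for a real estimate, so treat the result as an illustration only.

```python
import statistics

# Coefficient magnitudes as printed in the table above
coef_magnitude = {"FG_PCT_home": 24.427376, "REB_home": 0.18362}

# Values from the five rows shown by head(): a tiny stand-in sample,
# not the full dataset
sample = {
    "FG_PCT_home": [0.354, 0.364, 0.592, 0.566, 0.407],
    "REB_home": [47.0, 57.0, 37.0, 41.0, 51.0],
}

# Effect of a one-standard-deviation change = |coefficient| * std of the feature
per_std_effect = {
    name: coef_magnitude[name] * statistics.stdev(values)
    for name, values in sample.items()
}

raw_ratio = coef_magnitude["FG_PCT_home"] / coef_magnitude["REB_home"]
scaled_ratio = per_std_effect["FG_PCT_home"] / per_std_effect["REB_home"]
print(f"raw coefficient ratio (FG_PCT / REB): {raw_ratio:.1f}")
print(f"per-std effect ratio (FG_PCT / REB):  {scaled_ratio:.1f}")
```

On this small sample the gap between the two features shrinks considerably once scale is accounted for. A common alternative is to standardize the features before fitting (for example with scikit-learn's `StandardScaler`), so that the coefficients are directly comparable.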

In [6]:
# Keep the 14 highest-ranked features, which drops the STL and BLK columns
top_features = coefficients.index[:14]
print(top_features)

Index(['FG_PCT_home', 'FG_PCT_away', 'FG3_PCT_home', 'FG3_PCT_away',
       'FT_PCT_home', 'FT_PCT_away', 'TO_away', 'TO_home', 'REB_away',
       'REB_home', 'FTM_away', 'FTM_home', 'AST_home', 'AST_away'],
      dtype='object')


In [7]:
x2 = official_boxscore[top_features]
y2 = official_boxscore['HOME_TEAM_WINS']
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=0.15, random_state=0)
logistic_regression2 = LogisticRegression(max_iter=2000)
logistic_regression2.fit(x_train2, y_train2)
y_pred2 = logistic_regression2.predict(x_test2)
print('Accuracy: ', metrics.accuracy_score(y_test2, y_pred2))

Accuracy:  0.9365079365079365


Dropping steals and blocks gives a slightly higher test accuracy: 93.65%, up from 93.54%.
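
Whether a gain of about 0.1 percentage points is meaningful depends on the size of the test set. A quick check is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The notebook does not print the test-set size, so `n_test` below is a placeholder assumption, and `accuracy_standard_error` is a hypothetical helper.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples games."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

acc_all_features = 0.9353535353535354  # printed by In [4]
acc_top_features = 0.9365079365079365  # printed by In [7]
n_test = 1000  # ASSUMPTION: placeholder, replace with len(y_test)

difference = acc_top_features - acc_all_features
standard_error = accuracy_standard_error(acc_all_features, n_test)
print(f"difference:     {difference:.4f}")
print(f"standard error: {standard_error:.4f} (for n_test = {n_test})")
```

If the difference is small relative to the standard error, the two models are hard to tell apart on this particular split.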

In [8]:
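# Use a context manager so the file is closed after the model is written
with open('./model/game_predictor.pkl', 'wb') as model_file:
    pickle.dump(logistic_regression2, model_file)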