Pick a dataset of your choice with a binary outcome and the potential for at least 15 features.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn import neighbors
from sklearn import linear_model
from scipy.stats import normaltest, boxcox
from statsmodels.stats import diagnostic
from scipy import stats
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.decomposition import PCA

import mdst # my data science tools

%matplotlib inline

The dataset used in this challenge can be found here:
https://www.kaggle.com/dansbecker/nba-shot-logs

#### Dataset description: 
NBA data on shots taken during the 2014-2015 season: who took the shot, where on the floor it was taken from, who the nearest defender was, how far away the nearest defender was, how much time was left on the shot clock, and much more.

In [2]:
nba = pd.read_csv('nba-shot-logs/shot_logs.csv')
nba.head()

Unnamed: 0,GAME_ID,MATCHUP,LOCATION,W,FINAL_MARGIN,SHOT_NUMBER,PERIOD,GAME_CLOCK,SHOT_CLOCK,DRIBBLES,...,SHOT_DIST,PTS_TYPE,SHOT_RESULT,CLOSEST_DEFENDER,CLOSEST_DEFENDER_PLAYER_ID,CLOSE_DEF_DIST,FGM,PTS,player_name,player_id
0,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,1,1,1:09,10.8,2,...,7.7,2,made,"Anderson, Alan",101187,1.3,1,2,brian roberts,203148
1,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,2,1,0:14,3.4,0,...,28.2,3,missed,"Bogdanovic, Bojan",202711,6.1,0,0,brian roberts,203148
2,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,3,1,0:00,,3,...,10.1,2,missed,"Bogdanovic, Bojan",202711,0.9,0,0,brian roberts,203148
3,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,4,2,11:47,10.3,2,...,17.2,2,missed,"Brown, Markel",203900,3.4,0,0,brian roberts,203148
4,21400899,"MAR 04, 2015 - CHA @ BKN",A,W,24,5,2,10:34,10.9,2,...,3.7,2,missed,"Young, Thaddeus",201152,1.1,0,0,brian roberts,203148


In [3]:
def null_summary(df):
    row_count = len(df)
    
    #columns
    null_counts = len(df) - df.count()
    pct_nulls = round(100*null_counts/row_count,4)
    dtype = df.dtypes
    columns = ['null_counts', 'pct_nulls', 'dtype']
    
    null_df = pd.DataFrame([null_counts, pct_nulls, dtype]).T
    null_df.columns = columns
    
    print('total rows: ', row_count)
    return null_df
    
    

In [4]:
def look_at_nulls(df):
    null_rows = df[df.isnull().any(axis=1)]
    print('rows: ', len(null_rows))
    return null_rows

In [5]:
null_summary(nba)

total rows:  128069


Unnamed: 0,null_counts,pct_nulls,dtype
GAME_ID,0,0.0,int64
MATCHUP,0,0.0,object
LOCATION,0,0.0,object
W,0,0.0,object
FINAL_MARGIN,0,0.0,int64
SHOT_NUMBER,0,0.0,int64
PERIOD,0,0.0,int64
GAME_CLOCK,0,0.0,object
SHOT_CLOCK,5567,4.3469,float64
DRIBBLES,0,0.0,int64


In [6]:
null_df = look_at_nulls(nba)
null_df[['SHOT_CLOCK', 'GAME_CLOCK']].head()

rows:  5567


Unnamed: 0,SHOT_CLOCK,GAME_CLOCK
2,,0:00
24,,0:04
54,,0:01
76,,0:01
129,,0:02


Our only null values are in the shot clock column. This is because the shot clock is turned off when there isn't enough game clock left in the quarter. We will replace these null values with the number of seconds left in the quarter.

In [7]:
# turn game clock into a number using its last 2 characters (the seconds part of 'M:SS')
nba['game_clock_num'] = pd.to_numeric(nba['GAME_CLOCK'].str.slice(-2))

# fill missing shot clock values with the seconds left on the game clock;
# assigning the result back avoids an in-place fillna on a column selection
nba['SHOT_CLOCK'] = nba['SHOT_CLOCK'].fillna(nba['game_clock_num'])
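
The cell above keeps only the seconds part of the game clock, which is enough when the shot clock is off with under a minute left. As a hedged alternative, here is a minimal sketch that converts the whole clock string into seconds. It assumes the `M:SS` format visible in the sample rows, and `clock_to_seconds` is a hypothetical helper name, not part of the original notebook.

```python
def clock_to_seconds(clock: str) -> int:
    """Convert a game-clock string such as '1:09' into total seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


# A few clock values taken from the sample rows shown earlier.
examples = ["0:00", "0:14", "1:09", "11:47"]
print([clock_to_seconds(c) for c in examples])
```

On the real data this could be applied with `nba['GAME_CLOCK'].map(clock_to_seconds)` before filling the missing shot clock values.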


In [8]:
# choose features for models; .copy() gives an independent DataFrame,
# so the column assignments below do not raise SettingWithCopyWarning
base_columns = ['DRIBBLES', 'SHOT_CLOCK', 'CLOSE_DEF_DIST', 'SHOT_DIST', 'SHOT_NUMBER', 'PERIOD']
features = nba[base_columns].copy()

# capture potential nonlinear relationships to the log-odds
for feature in base_columns:
    features[feature+'_root'] = features[feature]**.5
    features[feature+'_squared'] = features[feature]**2
    features[feature+'_cubed'] = features[feature]**3

features['HOME'] = np.where(nba['LOCATION']=='H',1,0)


In [9]:
y = nba['FGM']
X = features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=333)

In [10]:

# a very large C makes the default L2 penalty negligible (effectively unregularized)
lr = LogisticRegression(C=1e9)
lr.fit(X_train, y_train)

print(lr.score(X_train,y_train))

cvs = cross_val_score(lr, X_train, y_train, cv=10)

print('cv mean: {}   cv std: {}\n'.format(np.mean(cvs), np.std(cvs)))
print(cvs)

0.616807378849
cv mean: 0.6162413274128203   cv std: 0.005502516327664719

[ 0.61516689  0.61018934  0.62356041  0.60589498  0.62131564  0.61829006
  0.62332845  0.61288433  0.61337238  0.61841078]


In [12]:
y_pred = lr.predict(X_test)
bcm = mdst.Binary_confusion_matrix(y_test, y_pred) #binary confusion matrix
bcm.display_metrics() # displays binary confusion matrix stats
bcm.df # shows confusion matrix as a pandas df

Accuracy = 0.6160302959319123
Sensitivity = 0.41560673771055345
Specificity = 0.7828730862784375
Precision = 0.614407318002795
negative predictive value = 0.6167502677112101



Unnamed: 0,pred_true,pred_false
actual_true,4836,6800
actual_false,3035,10943
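
Since `mdst` is a personal module, here is a minimal sketch of how the reported metrics follow from the four cells of the confusion matrix above. The function name `binary_metrics` and its argument names are assumptions made for illustration; they are not the `mdst` API.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),      # share of made shots predicted as made
        "specificity": tn / (tn + fp),      # share of missed shots predicted as missed
        "precision": tp / (tp + fp),        # share of predicted makes that were made
        "negative_predictive_value": tn / (tn + fn),
    }


# Counts read from the confusion matrix above
# (rows: actual_true, actual_false; columns: pred_true, pred_false).
metrics = binary_metrics(tp=4836, fn=6800, fp=3035, tn=10943)
for name, value in metrics.items():
    print(f"{name} = {value:.4f}")
```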


In [13]:
nba['FGM'].value_counts()/len(nba)

0    0.547861
1    0.452139
Name: FGM, dtype: float64

In [14]:
def display_equation(linear_model, X, Y):
    # the linear combination of features is the log-odds of the outcome, not the odds
    equation_parts = ['{} = '.format(Y.name + ' log-odds')]
    for i, coef in enumerate(linear_model.coef_.reshape(-1)):
        equation_parts.append('({}*{}) + '.format(round(coef, 4), X.columns[i]))
    equation_parts.append('{}'.format(round(linear_model.intercept_[0], 4)))
    print(''.join(equation_parts))
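
The linear combination that `display_equation` prints is on the log-odds scale, so it has to pass through the logistic function to become a probability. A minimal sketch of that conversion follows. The input values are arbitrary illustrations rather than outputs of the fitted model, and `log_odds_to_probability` is a hypothetical helper name.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Map a log-odds value to a probability with the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Arbitrary example values: negative log-odds give p < 0.5, positive give p > 0.5.
for z in (-1.0, 0.0, 1.0):
    p = log_odds_to_probability(z)
    odds = math.exp(z)  # odds = p / (1 - p)
    print(f"log-odds={z:+.1f}  odds={odds:.4f}  probability={p:.4f}")
```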

In [15]:

def display_coef(linear_model, X):
    for i, coef in enumerate(linear_model.coef_.reshape(-1)):
        print( '{}*{}'.format(round(coef,4),X.columns[i]))
    print(round(linear_model.intercept_[0],4))


In [16]:

def display_coef(linear_model, X):
    coeffs = pd.DataFrame()
    coeffs['coeff'] = linear_model.coef_.reshape(-1)
    
    coeffs['feature'] = X.columns
    coeffs['abs_coeff'] = abs(coeffs['coeff'])
    
    return coeffs.sort_values(by='abs_coeff', ascending=False)
        

In [17]:
display_coef(lr,X)

Unnamed: 0,coeff,feature,abs_coeff
2,0.219778,CLOSE_DEF_DIST,0.219778
3,-0.193949,SHOT_DIST,0.193949
12,0.091037,CLOSE_DEF_DIST_root,0.091037
0,-0.061592,DRIBBLES,0.061592
15,-0.05532,SHOT_DIST_root,0.05532
1,0.054806,SHOT_CLOCK,0.054806
6,-0.053267,DRIBBLES_root,0.053267
9,0.028329,SHOT_CLOCK_root,0.028329
4,0.019501,SHOT_NUMBER,0.019501
24,0.017165,HOME,0.017165


# Ridge Logistic Regression

In [18]:
rlr = LogisticRegression(penalty='l2', C=.2)
rlr.fit(X_train, y_train)
print(rlr.score(X_train, y_train))
rlr.score(X_test, y_test)

0.617090429945


0.61645974857499808

In [19]:
y_pred = rlr.predict(X_test)
bcm = mdst.Binary_confusion_matrix(y_test, y_pred)
bcm.display_metrics()
bcm.df

Accuracy = 0.6164597485749981
Sensitivity = 0.41457545548298386
Specificity = 0.7845185291171841
Precision = 0.6156202143950995
negative predictive value = 0.6168297896276297



Unnamed: 0,pred_true,pred_false
actual_true,4824,6812
actual_false,3012,10966


In [20]:
display_coef(rlr,X)

Unnamed: 0,coeff,feature,abs_coeff
2,0.230456,CLOSE_DEF_DIST,0.230456
3,-0.195436,SHOT_DIST,0.195436
12,0.095615,CLOSE_DEF_DIST_root,0.095615
15,-0.05439,SHOT_DIST_root,0.05439
0,-0.052545,DRIBBLES,0.052545
1,0.048827,SHOT_CLOCK,0.048827
6,-0.048409,DRIBBLES_root,0.048409
9,0.028832,SHOT_CLOCK_root,0.028832
4,0.019842,SHOT_NUMBER,0.019842
24,0.019396,HOME,0.019396


# Lasso Logistic Regression

In [21]:
# the l1 penalty needs a solver that supports it, such as liblinear
llr = LogisticRegression(penalty='l1', solver='liblinear', C=.3)
llr.fit(X_train,y_train)
print(llr.score(X_train, y_train))
llr.score(X_test, y_test)

0.616836659997


0.6159912547825408

In [22]:
display_coef(llr,X)

Unnamed: 0,coeff,feature,abs_coeff
2,0.537857,CLOSE_DEF_DIST,0.537857
9,0.488922,SHOT_CLOCK_root,0.488922
12,-0.364706,CLOSE_DEF_DIST_root,0.364706
3,-0.192514,SHOT_DIST,0.192514
21,-0.188647,PERIOD_root,0.188647
6,-0.122665,DRIBBLES_root,0.122665
1,-0.086345,SHOT_CLOCK,0.086345
15,-0.075612,SHOT_DIST_root,0.075612
18,0.068876,SHOT_NUMBER_root,0.068876
13,-0.040855,CLOSE_DEF_DIST_squared,0.040855


In [23]:
y_pred = llr.predict(X_test)
bcm = mdst.Binary_confusion_matrix(y_test, y_pred)
bcm.display_metrics()
bcm.df

Accuracy = 0.6159912547825408
Sensitivity = 0.4098487452732898
Specificity = 0.7875947918157103
Precision = 0.6163091238046007
negative predictive value = 0.6158536585365854



Unnamed: 0,pred_true,pred_false
actual_true,4769,6867
actual_false,2969,11009


### One Feature Logistic Regression

In [27]:
y = nba['FGM']
X = features[['SHOT_DIST']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=333)
lr = LogisticRegression(C=1e9)
lr.fit(X_train, y_train)

print(lr.score(X_train,y_train))

cvs = cross_val_score(lr, X_train, y_train, cv=10)

print('cv mean: {}   cv std: {}\n'.format(np.mean(cvs), np.std(cvs)))
print(cvs)

0.592718754575
cv mean: 0.5929431925602167   cv std: 0.005297878649089816

[ 0.5937927   0.58852235  0.60023424  0.5834472   0.59779426  0.59467109
  0.59960957  0.59472914  0.58809175  0.58853963]


In [28]:
y_pred = lr.predict(X_test)
bcm = mdst.Binary_confusion_matrix(y_test, y_pred) #binary confusion matrix
bcm.display_metrics() # displays binary confusion matrix stats
bcm.df # shows confusion matrix as a pandas df

Accuracy = 0.5921761536659639
Sensitivity = 0.5136644895152973
Specificity = 0.6575332665617398
Precision = 0.5552768487551096
negative predictive value = 0.6189225589225589



Unnamed: 0,pred_true,pred_false
actual_true,5977,5659
actual_false,4787,9191


### In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

With the dominant class (missed FGs) occurring about 55% of the time, I was hoping these models would have more predictive power than they did (a little under 62% accuracy for each model).
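
To put that gap in numbers, here is a small sketch comparing the plain logistic regression with a baseline that always predicts a miss. It uses only the test-set confusion matrix counts reported for the plain model; the variable names are illustrative.

```python
# Test-set counts from the plain logistic regression confusion matrix.
tp, fn = 4836, 6800     # actual made shots: predicted made / predicted missed
fp, tn = 3035, 10943    # actual missed shots: predicted made / predicted missed

n_test = tp + fn + fp + tn
actual_made = tp + fn
actual_missed = fp + tn

# Always predicting the majority class is right whenever that class occurs.
baseline_accuracy = max(actual_made, actual_missed) / n_test
model_accuracy = (tp + tn) / n_test
lift = model_accuracy - baseline_accuracy

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_accuracy:.4f}")
print(f"lift over baseline: {lift:.4f}")
```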

The selected features are all things that are either known or suspected to have an impact on FG% in the game of basketball. Interestingly, a logistic regression model using only shot distance as a feature did fairly well on its own, achieving 59.2% test accuracy.

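The ridge and lasso models did not meaningfully improve on the plain logistic regression model: test accuracy was 61.60% (plain), 61.65% (ridge), and 61.60% (lasso). This seems plausible, since the plain model showed no signs of overfitting: its training accuracy (61.68%), 10-fold cross-validation mean (61.62%), and test accuracy (61.60%) are nearly identical.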

The regularization parameters for the ridge and lasso models (C=0.2 and C=0.3, respectively) were chosen through trial and error.
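
A more systematic alternative to trial and error would be a cross-validated grid search over `C`. The sketch below is not what was done in this notebook: it runs on synthetic data from `make_classification`, uses scikit-learn's default L2 penalty, and the candidate grid is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass X_train and y_train instead.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# Candidate inverse regularization strengths (smaller C = stronger penalty).
param_grid = {"C": [0.01, 0.1, 0.2, 0.3, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # default penalty is L2 (ridge)
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("best mean cv accuracy:", round(search.best_score_, 4))
```

The same pattern should carry over to the lasso model by passing an estimator configured with the L1 penalty and a solver that supports it.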

Plain logistic regression seems fine for this task, although lasso regression is a nice tool for identifying and confirming the most and least important features. Ridge and lasso regression may still be needed if more features are ever added to this model.