# Lab4: Statistical Learning in Python

Outline:
1. Data Pre-processing: pandas
2. statsmodels and sklearn

# Problem Statement

Estimate the **probability of Stephen Curry scoring a triple shot** in any given game as a function of other predictors such as period and position.


# 1. Loading data

In [None]:
import numpy as np

In [None]:
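# Import the helper module and reload it so local edits are picked up.
# `imp` is deprecated (removed in Python 3.12); `importlib` is its replacement.
import helper_basketball as h
import importlib
importlib.reload(h);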

In [None]:
params = {'PlayerID':'201939',
          'PlayerPosition':'',
          'Season':'2016-17',
          'ContextMeasure':'FGA',
          'DateFrom':'',
          'DateTo':'',
          'GameID':'',
          'GameSegment':'',
          'LastNGames':'0',
          'LeagueID':'00',
          'Location':'',
          'Month':'0',
          'OpponentTeamID':'0',
          'Outcome':'',
          'Period':'0',
          'Position':'',
          'RookieYear':'',
          'SeasonSegment':'',
          'SeasonType':'Regular Season',
          'TeamID':'0',
          'VsConference':'',
          'VsDivision':''}

shotdata = h.get_nba_data('shotchartdetail', params)
shotdata.head()

# 2. Data Pre-processing

Our first task is to obtain the total number of attempted and scored shots in each game.

In [None]:
# See dtype of each column
shotdata.dtypes

In [None]:
# Unique values of column of interest
shotdata["EVENT_TYPE"].unique()

In [None]:
shotdata["SHOT_ZONE_AREA"].unique()

In [None]:
shotdata["SHOT_TYPE"].unique()

In [None]:
shotdata["SHOT_ZONE_RANGE"].unique()

In [None]:
shotdata["GAME_DATE"].unique()

In [None]:
shotdata["SHOT_ATTEMPTED_FLAG"].unique()

In [None]:
shotdata["SHOT_MADE_FLAG"].unique()

In [None]:
train_data = shotdata.query('SHOT_TYPE=="3PT Field Goal"') # Keep only 3-point attempts (made and missed)
train_data
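
The per-game totals mentioned at the start of this section can be obtained with a `groupby`. The following is a minimal sketch, assuming that `GAME_DATE` identifies a game and that the two flag columns are 0/1 integers; the `toy` frame is invented for illustration and is not real shot data.

```python
import pandas as pd

# Toy stand-in for the three-point rows of `shotdata` (values are invented).
toy = pd.DataFrame({
    "GAME_DATE": ["20161025", "20161025", "20161025", "20161028", "20161028"],
    "SHOT_ATTEMPTED_FLAG": [1, 1, 1, 1, 1],
    "SHOT_MADE_FLAG": [1, 0, 1, 0, 0],
})

# Sum the flags within each game to get attempts and makes per game.
per_game = (toy.groupby("GAME_DATE")[["SHOT_ATTEMPTED_FLAG", "SHOT_MADE_FLAG"]]
               .sum()
               .rename(columns={"SHOT_ATTEMPTED_FLAG": "attempted",
                                "SHOT_MADE_FLAG": "made"}))

# Empirical success rate per game.
per_game["pct"] = per_game["made"] / per_game["attempted"]
print(per_game)
```

The same chain should apply to `train_data` directly, provided the column names match those listed by `shotdata.dtypes`.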

# 3. Logistic regression

We assume that the total number of scored shots is the realized value of a Binomial experiment where:


- no. of trials: the total number of triple shots attempted.

- no. of successes: total number of triple shots scored.

- $p_{i}$ is the probability of scoring a triple in any given game (which is our parameter of interest).
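
One way to write this model down, assuming the logit link that logistic regression uses, is the following. The symbols $n_i$ (attempts), $y_i$ (scored shots), $x_i$ (predictor values) and $\beta_0, \beta$ (coefficients) are notation introduced here for illustration.

$$
y_i \sim \mathrm{Binomial}(n_i,\, p_i), \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + x_i^{\top}\beta
$$

Solving the second expression for $p_i$ gives $p_i = 1 / \left(1 + e^{-(\beta_0 + x_i^{\top}\beta)}\right)$. Fitting the model one shot per row corresponds to the special case $n_i = 1$ for every row, so each response is a single 0/1 outcome.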


## 3.1 `statsmodels` package

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
sm.GLM?

In [None]:
# Fitting models using R-style formulas:
# See: http://www.statsmodels.org/dev/example_formulas.html
fitted_model1 = smf.glm(formula = 'SHOT_MADE_FLAG ~ LOC_X + LOC_Y + C(PERIOD) + C(SHOT_ZONE_AREA)',
                        data=train_data, 
                        family=sm.families.Binomial()).fit()

In [None]:
# See results
print(fitted_model1.summary())

## 3.2 `sklearn` package

In [None]:
from patsy import dmatrices # For constructing design matrices from R-style formulas
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### 3.2.1 Prepare data for logistic regression

In [None]:
# create dataframes with an intercept column and dummy variables 
y, X = dmatrices('SHOT_MADE_FLAG ~ LOC_X + LOC_Y + C(PERIOD) + C(SHOT_ZONE_AREA)',
                  train_data, return_type="dataframe")

In [None]:
# flatten y into a 1-D array
y = np.ravel(y)

In [None]:
y

### 3.2.2 Train and test data 

In [None]:
from sklearn.model_selection import train_test_split 
train_test_split?

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, # Predictors
                                                    y, # response
                                                    test_size=0.3, # fraction of the data held out for testing
                                                    random_state=123) # seed for random sampling

### 3.2.3 Model fitting

In [None]:
LogisticRegression?

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

https://stackoverflow.com/questions/26319259/sci-kit-and-regression-summary

> There exists no `R` type regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning and the evaluation criteria are based on performance on previously unseen data (such as predictive r^2 for regression).
For a more classic statistical approach, take a look at statsmodels.

### 3.2.4 Predicting the test set results and calculating the accuracy

In [None]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))

### 3.2.5 Cross Validation

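Cross-validation estimates how well a model generalizes by repeatedly holding out part of the data: each observation is used for validation exactly once and for training in the remaining folds. This guards against an overly optimistic accuracy estimate from an overfitted model. Here we use 10-fold cross-validation on the training set to estimate the accuracy of our logistic regression model.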
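
To make the fold structure concrete, here is a small sketch of how `KFold` partitions row indices. `X_demo` is a made-up array with 10 rows, and 5 folds are used only to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 invented observations, 2 features

# shuffle=True randomizes row order before splitting; random_state fixes the shuffle.
kf = KFold(n_splits=5, shuffle=True, random_state=7)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    held_out.extend(test_idx.tolist())
    print(f"fold {fold}: train on {len(train_idx)} rows, "
          f"validate on rows {sorted(test_idx.tolist())}")

# Every row index is held out exactly once across the folds.
print(sorted(held_out))
```

`cross_val_score` runs this loop internally, fitting a fresh copy of the model on each training part and scoring it on the held-out part.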

In [None]:
from sklearn.model_selection import KFold, cross_val_score
# random_state only has an effect (and is only accepted by recent sklearn versions) when shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7) # 10 fold CV
modelCV = LogisticRegression()
scoring = 'accuracy'
results = cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

### 3.2.6 Confusion Matrix

In [None]:
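from sklearn.metrics import confusion_matrix
# Rows are true classes, columns are predicted classes.
# Store the result under a new name so the imported function is not shadowed.
cm = confusion_matrix(y_test, y_pred)
print(cm)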

### 3.2.7 Compute precision, recall, F-measure and support

To quote from the scikit-learn documentation:

- The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.

- The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

- The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

- The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.

- The support is the number of occurrences of each class in y_test.
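
The definitions above can be checked by hand. The helper below is a hypothetical illustration (it is not part of scikit-learn), the counts are invented, and returning 0.0 when a denominator is empty is a choice made for this sketch.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # Weighted harmonic mean of precision and recall.
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Invented counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_fbeta(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these counts, precision is 30/40 and recall is 30/50, and the F1 score falls between the two, closer to the smaller value, as expected of a harmonic mean.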


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

### 3.2.8 ROC curve

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
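# Compute the AUC from predicted probabilities, not hard 0/1 labels, so it matches the plotted curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])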
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
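
The area under the ROC curve can be read as the probability that a randomly chosen made shot receives a higher score than a randomly chosen miss. The sketch below uses invented labels and scores, and a hypothetical helper `pairwise_auc`, to illustrate why the area should be computed from predicted probabilities rather than from thresholded 0/1 predictions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
probs = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]   # invented; all below 0.5
labels = [1 if p >= 0.5 else 0 for p in probs]  # hard predictions at the 0.5 cutoff

auc_probs = pairwise_auc(y_true, probs)    # ranking information is kept
auc_labels = pairwise_auc(y_true, labels)  # ranking information is lost
print(auc_probs, auc_labels)
```

In this example the probabilities rank every made shot above every miss, yet all hard predictions are 0, so the label-based area collapses to chance level.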

## Other ML methods

  - `KNeighborsClassifier`
  - `DecisionTreeClassifier(max_depth=5)`
  - `RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)`
  - `QuadraticDiscriminantAnalysis()`

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

### Example: Random Forest

In [None]:
clf = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1).fit(X_train, y_train)
score = clf.score(X_test, y_test)
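# As above, use predicted probabilities for the AUC (the variable name is kept from the logistic example)
logit_roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])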
fpr, tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

# 4. Conclusions

In [None]:
Pr_pred = clf.predict_proba(X_test)
Pr_pred = Pr_pred[:,1] # Probability of scoring a 3pt
Pr_pred 

In [None]:
plt.figure(figsize=(12,11))
plt.scatter(X_test.LOC_X, X_test.LOC_Y,c=Pr_pred)
h.draw_court(outer_lines=True)
plt.colorbar()
plt.xlim(300,-300)
plt.ylim(-100,500)
plt.show()