# Allie Surina
## Capstone Project: Predicting Disabled List Placements 

### Project Goal: 
This project uses game log data for MLB games from 2000 to 2016, along with whether a player was placed on the Disabled List after a particular game (i.e. he was injured during his previous game), to predict whether a Disabled List placement will happen to a starting player after a given game.

### Data:
* `games_dls.csv`: merged data from `games.csv` and `injury.csv`, which is the basis of the model.

### Methods:
* I plan to use a machine learning classification model to predict whether a Disabled List placement will happen after a particular game.

## Import Packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Import Data

In [38]:
# Import the merged game and disabled list feature data used for the model
file = './datasets/games_features_for_model.csv'
df = pd.read_csv(file, index_col=0)

## Import Sklearn Packages for Model Building

In [105]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score,\
                        precision_score, classification_report

In [39]:
df.drop('date', axis=1,inplace=True)

### The injury column is multi-class, so I need to create a binary column:

In [41]:
df.injury = df.injury.map(lambda x: 1 if x>0 else 0)
df.injury.value_counts()

0    36399
1     4893
Name: injury, dtype: int64

In [42]:
cols = [col for col in df.columns if 'injury' not in col]
X = df.loc[:,cols]
y = df.loc[:,'injury']

In [43]:
X.index = pd.to_datetime(X.index)

In [44]:
X.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41292 entries, 2000-03-29 to 2016-10-02
Data columns (total 180 columns):
num_game                int64
v_team_game_num         int64
h_team_game_num         int64
v_team_score            int64
h_team_score            int64
game_length_outs        int64
attendance              float64
time_game_min           int64
v_at_bats               int64
v_hits                  int64
v_doubles               int64
v_triples               int64
v_homeruns              int64
v_RBI                   int64
v_sac_hits              int64
v_sac_files             int64
v_hit_pitch             int64
v_walks                 int64
v_int_walks             int64
v_strikeouts            int64
v_stol_base             int64
v_caught_steal          int64
v_grnd_dbl_plays        int64
v_awd_fst_catch_intf    int64
v_left_on_base          int64
v_pitchers              int64
v_ind_earn_runs         int64
v_team_earn_runs        int64
v_wild_pitch            int64
v

In [45]:
y.value_counts()

0    36399
1     4893
Name: injury, dtype: int64

## Establish the Baseline

In [91]:
# Calculate the baseline: the accuracy of always predicting the majority class (no injury)
baseline = 1 - y.mean()
baseline

0.8815024702121477
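
The same baseline can be reproduced with a majority-class classifier, which is handy later when comparing models on the same scoring interface. This is a minimal sketch, not part of the original notebook: the label array is rebuilt from the class counts reported above, and `X_placeholder` is a hypothetical stand-in for the feature matrix, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the binary target from the value_counts() output above
# (36399 games without a DL placement, 4893 with one).
y_counts = np.array([0] * 36399 + [1] * 4893)

# Hypothetical placeholder features: DummyClassifier never looks at them.
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class (0, no DL placement).
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_placeholder, y_counts)

baseline_accuracy = dummy.score(X_placeholder, y_counts)
print(baseline_accuracy)
```

Any model whose accuracy does not clearly exceed this value is doing no better than predicting "no injury" for every game.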

## Train Test Split

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(27665, 180) (13627, 180) (27665,) (13627,)
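
With a target this imbalanced, a purely random split can leave the train and test sets with slightly different injury rates. One option, sketched here on synthetic data as an assumption rather than what the notebook did, is to pass `stratify` to `train_test_split` so both splits keep roughly the same class proportions. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly the same 88/12 class imbalance.
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 880 + [1] * 120)

# stratify=y_demo keeps the positive rate nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```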


## Scaling Training and Testing Data

In [47]:
ss = StandardScaler()
ss.fit(X_train)
X_train_s = ss.transform(X_train)
X_test_s = ss.transform(X_test)
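
Fitting the scaler on the training data only, as above, avoids leaking test-set statistics into the model. When cross-validation is involved, the same idea can be applied per fold by wrapping the scaler and the model in a pipeline. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `pipe` are hypothetical names, and KNN is used only as an example estimator.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an imbalanced binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (rng.uniform(size=300) < 0.2).astype(int)

# The scaler is refit on the training portion of every fold,
# so the held-out fold never influences the scaling.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(scores.mean())
```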

## Logistic Regression Classification

In [52]:
logreg = LogisticRegressionCV(n_jobs=-1,verbose=2)
logreg.fit(X_train_s, y_train)
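# LogisticRegressionCV scores folds with accuracy by default, so this value is the
# best mean cross-validated accuracy, not an AUC-ROC (despite the printed label)
print('Max auc_roc:', logreg.scores_[1].mean(axis=0).max())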

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   32.5s finished


Max auc_roc: 0.879956623579
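
`LogisticRegressionCV` scores each fold with accuracy unless a `scoring` argument is passed, so a cross-validated AUC-ROC has to be requested explicitly. The sketch below shows one way to do that with `cross_val_score` and `scoring='roc_auc'` on synthetic data. The data, the names `X_demo` and `y_demo`, and the use of `class_weight='balanced'` are assumptions for illustration, not choices made in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data: the first feature carries some signal.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 5))
logits = -2.0 + 1.5 * X_demo[:, 0]
y_demo = (rng.uniform(size=600) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# class_weight='balanced' reweights the rare positive class;
# scoring='roc_auc' makes each fold report AUC-ROC instead of accuracy.
model = LogisticRegression(class_weight='balanced', max_iter=1000)
auc_scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring='roc_auc')

print('Mean AUC-ROC:', auc_scores.mean())
```

Unlike accuracy, AUC-ROC is not inflated by the majority class, which makes it a more informative score for a target where about 88% of rows are negative.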


In [84]:
logreg.score(X_test_s, y_test)
y_preds = logreg.predict(X_test_s)

In [77]:
print(len(y_test), len(y_preds))

13627 13627


In [78]:
y_test.describe()

count    13627.000000
mean         0.115359
std          0.319467
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: injury, dtype: float64

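### Logistic Regression Confusion Matrix Shows That I Misclassified All 1572 of the Real Positives
* True Positives were 0!
* False Negatives were 1572
* The model predicts "no injury" for every game, so its accuracy simply equals the share of negatives in the test set.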

In [90]:
confusion_matrix(y_pred=y_preds, y_true=y_test)

array([[12055,     0],
       [ 1572,     0]])

In [95]:
accuracy_score(y_pred=y_preds, y_true=y_test)

0.88464078667351587
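
The accuracy above hides the fact that no positives were found. A short sketch, using the four counts from the logistic regression confusion matrix reported above, makes the arithmetic explicit. The variable names are hypothetical.

```python
import numpy as np

# Logistic regression confusion matrix from above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
conmat_logreg = np.array([[12055, 0],
                          [1572, 0]])
tn, fp, fn, tp = conmat_logreg.ravel()

# Accuracy counts every correct prediction, including the easy negatives.
accuracy = (tp + tn) / conmat_logreg.sum()

# Recall is the share of real positives that were found.
recall = tp / (tp + fn)

# Precision is undefined when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')

print(accuracy, recall, precision)
```

Accuracy matches the reported score, while recall is 0 and precision is undefined because the model never predicts a Disabled List placement.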

## K-Neighbors Classifier: Finds a Few True Positives, with Some False Positives Thrown In

In [59]:
knn = KNeighborsClassifier()

In [92]:
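knn.fit(X_train_s, y_train)
# Note: this cross-validates on the test split, refitting KNN within each fold of it;
# cross-validating on X_train_s and y_train would keep the test set untouched
cross_val_score(knn, X_test_s, y_test, cv=10)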

array([ 0.87609971,  0.87096774,  0.87527513,  0.8760088 ,  0.86647102,
        0.86857562,  0.87591777,  0.8773862 ,  0.87518355,  0.87812041])

In [97]:
y_preds_knn = knn.predict(X_test_s)

In [108]:
conmat = confusion_matrix(y_pred=y_preds_knn, y_true=y_test)
print(conmat)

[[11905   150]
 [ 1544    28]]


In [99]:
accuracy_score(y_pred=y_preds_knn, y_true=y_test)

0.87568797240772001

In [102]:
recall_score(y_test,y_preds_knn)

0.017811704834605598

In [107]:
print(classification_report(y_test,y_preds_knn))

             precision    recall  f1-score   support

          0       0.89      0.99      0.93     12055
          1       0.16      0.02      0.03      1572

avg / total       0.80      0.88      0.83     13627



### Change the Threshold

In [109]:
y_pp = knn.predict_proba(X_test_s)
confusion = pd.DataFrame(conmat, 
                         index=['had_no_DL_after','had_a_DL_after'],
                        columns=['pred_no_DL','pred_a_DL'])
confusion

| | pred_no_DL | pred_a_DL |
|---|---|---|
| had_no_DL_after | 11905 | 150 |
| had_a_DL_after | 1544 | 28 |


In [110]:
y_pp = pd.DataFrame(y_pp,
                    columns=['class_0_pp','class_1_pp'])
y_pp.head()

| | class_0_pp | class_1_pp |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 0.8 | 0.2 |
| 4 | 0.8 | 0.2 |
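
The probabilities above come in steps of 0.2 because `KNeighborsClassifier` defaults to 5 neighbors with uniform weights, so each predicted probability is simply the fraction of the 5 nearest neighbors in that class. A minimal sketch on synthetic data (hypothetical names `X_demo`, `y_demo`, `knn_demo`) illustrates this, and shows why thresholds that fall between the same two steps select the same rows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with a binary target.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (rng.uniform(size=200) < 0.3).astype(int)

# Defaults: n_neighbors=5, weights='uniform'.
knn_demo = KNeighborsClassifier()
knn_demo.fit(X_demo, y_demo)

# Probability of class 1 = (number of class 1 neighbors) / 5.
class_1_pp = knn_demo.predict_proba(X_demo)[:, 1]
print(np.unique(class_1_pp))

# Thresholds of 0.1 and 0.2 both mean "at least 1 of 5 neighbors is positive".
same_rows = np.array_equal(class_1_pp >= 0.1, class_1_pp >= 0.2)
print(same_rows)
```

A larger `n_neighbors` would give a finer grid of probabilities and therefore more distinct thresholds to choose from.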


### Lowering the Threshold to Reduce False Negatives
To reduce the number of false negatives, we can move the decision threshold below the default 50%, for example to 10%. Add a new column to the DataFrame we just created that holds the predicted class at a 10% threshold.

In [112]:
y_pp['pred_class_thresh10'] = [1 if x >= .1 else 0 for x in y_pp.class_1_pp.values]
# Walk down the class 1 predicted probabilities and return 1 wherever the
# probability is >= 0.1, otherwise 0
y_pp.head()

| | class_0_pp | class_1_pp | pred_class_thresh10 |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0 |
| 1 | 1.0 | 0.0 | 0 |
| 2 | 1.0 | 0.0 | 0 |
| 3 | 0.8 | 0.2 | 1 |
| 4 | 0.8 | 0.2 | 1 |


In [113]:
# confusion_matrix was imported directly from sklearn.metrics above
conmat_update = confusion_matrix(y_test, y_pp.pred_class_thresh10)
confusion = pd.DataFrame(conmat_update, 
                         index=['had_no_DL_after','had_a_DL_after'],
                         columns=['pred_no_DL','pred_a_DL'])
confusion

| | pred_no_DL | pred_a_DL |
|---|---|---|
| had_no_DL_after | 6826 | 5229 |
| had_a_DL_after | 816 | 756 |


In [114]:
y_pp['pred_class_thresh30'] = [1 if x >= .3 else 0 for x in y_pp.class_1_pp.values]
# Walk down the class 1 predicted probabilities and return 1 wherever the
# probability is >= 0.3, otherwise 0
y_pp.head()

| | class_0_pp | class_1_pp | pred_class_thresh10 | pred_class_thresh30 |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0 | 0 |
| 1 | 1.0 | 0.0 | 0 | 0 |
| 2 | 1.0 | 0.0 | 0 | 0 |
| 3 | 0.8 | 0.2 | 1 | 0 |
| 4 | 0.8 | 0.2 | 1 | 0 |


In [115]:
# confusion_matrix was imported directly from sklearn.metrics above
conmat_update = confusion_matrix(y_test, y_pp.pred_class_thresh30)
confusion = pd.DataFrame(conmat_update, 
                         index=['had_no_DL_after','had_a_DL_after'],
                         columns=['pred_no_DL','pred_a_DL'])
confusion

| | pred_no_DL | pred_a_DL |
|---|---|---|
| had_no_DL_after | 10867 | 1188 |
| had_a_DL_after | 1381 | 191 |


In [116]:
y_pp['pred_class_thresh20'] = [1 if x >= .2 else 0 for x in y_pp.class_1_pp.values]
# Walk down the class 1 predicted probabilities and return 1 wherever the
# probability is >= 0.2, otherwise 0. With the default 5 neighbors the
# probabilities are multiples of 0.2, so this matches the 0.1 threshold.
conmat_update = confusion_matrix(y_test, y_pp.pred_class_thresh20)
confusion = pd.DataFrame(conmat_update, 
                         index=['had_no_DL_after','had_a_DL_after'],
                         columns=['pred_no_DL','pred_a_DL'])
confusion

| | pred_no_DL | pred_a_DL |
|---|---|---|
| had_no_DL_after | 6826 | 5229 |
| had_a_DL_after | 816 | 756 |
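
The threshold cells above repeat the same steps for each cutoff. As a sketch of how that could be folded into one reusable helper, the hypothetical function `confusion_at_threshold` below applies a cutoff to the class 1 probabilities and returns the confusion matrix. The toy arrays are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_at_threshold(y_true, class_1_pp, threshold):
    """Predict class 1 wherever its probability is >= threshold,
    then return the 2x2 confusion matrix (rows: true, columns: predicted)."""
    y_pred = (np.asarray(class_1_pp) >= threshold).astype(int)
    return confusion_matrix(y_true, y_pred, labels=[0, 1])

# Toy labels and KNN-style probabilities (multiples of 0.2).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1])
pp_demo = np.array([0.0, 0.2, 0.4, 0.0, 0.2, 0.6, 0.0, 0.0])

# Lower thresholds trade false negatives for false positives.
for threshold in (0.1, 0.3, 0.5):
    tn, fp, fn, tp = confusion_at_threshold(y_true_demo, pp_demo, threshold).ravel()
    print(threshold, 'TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)
```

In the notebook this would be called with `y_test` and `y_pp.class_1_pp`, which makes it easy to compare several thresholds side by side before settling on one.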
