## Day 25 Lecture 1 Assignment

In this assignment, we will evaluate the performance of the model we built yesterday on the Chicago traffic crash data. We will also perform hyperparameter tuning and evaluate a final model using additional metrics (e.g., AUC-ROC, precision, and recall).

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import seaborn as sns



Since we will be building on the model from the previous assignment, we will need to redo all of the data preparation steps up to the point of model building. These steps include creating the response, imputing missing values, and one-hot encoding our selected categorical variables. The quickest way to get going is to open the previous assignment, make a copy, and build on it from there.
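The three preparation steps can be sketched on a toy frame before applying them to the full data set. This is a minimal illustration only: the rows below are made up, and the column names simply mirror the crash data for readability.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (not rows from the real crash data).
toy = pd.DataFrame({
    'DAMAGE': ['OVER $1,500', '$500 OR LESS', 'OVER $1,500', '$501 - $1,500'],
    'INJURIES_TOTAL': [0.0, np.nan, 2.0, 1.0],
    'WEATHER_CONDITION': ['CLEAR', 'RAIN', 'CLEAR', 'SNOW'],
})

# 1. Create the binary response and store it as an integer column.
toy['DAMAGE'] = (toy['DAMAGE'] == 'OVER $1,500').astype(int)

# 2. Median-impute the numeric column, assigning the result back.
toy['INJURIES_TOTAL'] = toy['INJURIES_TOTAL'].fillna(toy['INJURIES_TOTAL'].median())

# 3. One-hot encode, dropping CLEAR so it acts as the reference level.
dummies = pd.get_dummies(toy[['WEATHER_CONDITION']]).drop('WEATHER_CONDITION_CLEAR', axis=1)
prepared = pd.concat([toy[['DAMAGE', 'INJURIES_TOTAL']], dummies], axis=1)

print(prepared)
```

Dropping one level per categorical variable avoids perfectly collinear dummy columns, which matters more for the statsmodels fit than for a regularized scikit-learn fit.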

Statsmodels' implementation of logistic regression has certain advantages over scikit-learn's, such as clean, easy-to-read model summary output and statistical inference values (e.g., p-values). However, scikit-learn is preferable for model evaluation, so we will switch to the scikit-learn implementation for this exercise.

Run logistic regression on the training set and use the resulting model to make predictions on the test set. Calculate the train and test error using logarithmic loss. How do they compare to each other?
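As a reminder of what logarithmic loss measures, the sketch below computes it by hand and compares the result with scikit-learn's `log_loss`. The labels and probabilities are made-up illustrative values, not predictions from the crash model.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1 (illustrative only).
y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.3, 0.1])

# Log loss is the mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# scikit-learn accepts a 1-D array of positive-class probabilities for binary labels.
sk = log_loss(y_true, p_hat)

print(manual, sk)
```

Lower values are better, and a confident wrong prediction (such as the 0.3 assigned to a true 1 above) is penalized more heavily than an uncertain one.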

In [0]:
crash_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/traffic_crashes_chicago.csv')

In [0]:
def missingness_summary(df, print_log=False, sort='none'):
    summary = df.apply(lambda x: x.isna().sum() / x.shape[0])
    
    if print_log == True:
        if sort == 'none':
            print(summary)
        elif sort == 'ascending':
            print(summary.sort_values())
        elif sort == 'descending':
            print(summary.sort_values(ascending=False))
        else:
            print('Invalid value for sort parameter.')
        
    return summary

In [0]:
# answer goes here

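# Binary response: 1 if damage is over $1,500, otherwise 0.
# Casting to int avoids leaving DAMAGE as an object column, which scikit-learn rejects as a label.
crash_data['DAMAGE'] = (crash_data['DAMAGE'] == 'OVER $1,500').astype(int)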

missing = missingness_summary(crash_data)

drop = []
for index in missing.index:
    if missing[index] > 0.05:
        drop.append(index)
        
crash_data.drop(drop, axis=1, inplace=True)
crash_data.head()

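# Median-impute the remaining numeric columns (all except STREET_NO).
# The result is assigned back because fillna(inplace=True) on the copy
# returned by drop() would leave crash_data unchanged.
num_cols = crash_data.select_dtypes(include='number').columns.drop('STREET_NO', errors='ignore')
crash_data[num_cols] = crash_data[num_cols].fillna(crash_data[num_cols].median())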

crash_data.dropna(inplace=True)

crash_data.head()

Unnamed: 0,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,ALIGNMENT,ROADWAY_SURFACE_COND,ROAD_DEFECT,REPORT_TYPE,CRASH_TYPE,DAMAGE,DATE_POLICE_NOTIFIED,PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,STREET_NO,STREET_DIRECTION,STREET_NAME,BEAT_OF_OCCURRENCE,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN
6,JC413474,8/30/2019 14:20,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,REAR END,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,1,8/30/2019 14:25,FAILING TO REDUCE SPEED TO AVOID CRASH,UNABLE TO DETERMINE,5335,S,WESTERN AVE,923.0,2.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0
7,JC414382,8/31/2019 4:35,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",FIXED OBJECT,NOT DIVIDED,"CURVE, LEVEL",DRY,NO DEFECTS,ON SCENE,INJURY AND / OR TOW DUE TO CRASH,1,8/31/2019 4:35,PHYSICAL CONDITION OF DRIVER,NOT APPLICABLE,1501,N,HUMBOLDT DR,1423.0,1.0,NONINCAPACITATING INJURY,1.0,0.0,0.0,1.0,0.0,0.0,0.0
8,JC413930,8/30/2019 18:30,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,NOT DIVIDED,STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,NOT ON SCENE (DESK REPORT),NO INJURY / DRIVE AWAY,0,8/30/2019 19:57,FOLLOWING TOO CLOSELY,FOLLOWING TOO CLOSELY,5900,N,SHERIDAN RD,2022.0,2.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,5.0,0.0
9,JC415166,8/31/2019 18:50,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,DRY,NO DEFECTS,ON SCENE,NO INJURY / DRIVE AWAY,1,8/31/2019 19:02,DRIVING SKILLS/KNOWLEDGE/EXPERIENCE,UNABLE TO DETERMINE,5555,N,CLARK ST,2013.0,2.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,2.0,0.0
10,JC415064,8/31/2019 13:15,30,TRAFFIC SIGNAL,UNKNOWN,CLOUDY/OVERCAST,DAYLIGHT,SIDESWIPE SAME DIRECTION,DIVIDED - W/MEDIAN (NOT RAISED),STRAIGHT AND LEVEL,UNKNOWN,UNKNOWN,NOT ON SCENE (DESK REPORT),INJURY AND / OR TOW DUE TO CRASH,1,8/31/2019 16:49,FAILING TO REDUCE SPEED TO AVOID CRASH,DISTRACTION - FROM INSIDE VEHICLE,7500,S,COTTAGE GROVE AVE,624.0,2.0,NONINCAPACITATING INJURY,1.0,0.0,0.0,1.0,0.0,2.0,0.0


In [0]:
# One-hot encode the two categorical predictors, dropping one reference level from each.
dummies = pd.get_dummies(crash_data[['WEATHER_CONDITION', 'FIRST_CRASH_TYPE']]).drop(['WEATHER_CONDITION_CLEAR', 'FIRST_CRASH_TYPE_REAR END'], axis=1)

crash_data_onehot = pd.concat([crash_data.loc[:, ['DAMAGE', 'POSTED_SPEED_LIMIT', 'INJURIES_TOTAL']], dummies], axis=1)
crash_data_onehot.head()

Unnamed: 0,DAMAGE,POSTED_SPEED_LIMIT,INJURIES_TOTAL,WEATHER_CONDITION_BLOWING SNOW,WEATHER_CONDITION_CLOUDY/OVERCAST,WEATHER_CONDITION_FOG/SMOKE/HAZE,WEATHER_CONDITION_FREEZING RAIN/DRIZZLE,WEATHER_CONDITION_OTHER,WEATHER_CONDITION_RAIN,WEATHER_CONDITION_SEVERE CROSS WIND GATE,WEATHER_CONDITION_SLEET/HAIL,WEATHER_CONDITION_SNOW,WEATHER_CONDITION_UNKNOWN,FIRST_CRASH_TYPE_ANGLE,FIRST_CRASH_TYPE_ANIMAL,FIRST_CRASH_TYPE_FIXED OBJECT,FIRST_CRASH_TYPE_HEAD ON,FIRST_CRASH_TYPE_OTHER NONCOLLISION,FIRST_CRASH_TYPE_OTHER OBJECT,FIRST_CRASH_TYPE_OVERTURNED,FIRST_CRASH_TYPE_PARKED MOTOR VEHICLE,FIRST_CRASH_TYPE_PEDALCYCLIST,FIRST_CRASH_TYPE_PEDESTRIAN,FIRST_CRASH_TYPE_REAR TO FRONT,FIRST_CRASH_TYPE_REAR TO REAR,FIRST_CRASH_TYPE_REAR TO SIDE,FIRST_CRASH_TYPE_SIDESWIPE OPPOSITE DIRECTION,FIRST_CRASH_TYPE_SIDESWIPE SAME DIRECTION,FIRST_CRASH_TYPE_TRAIN,FIRST_CRASH_TYPE_TURNING
6,1,30,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,1,30,1.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,30,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
9,1,30,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
10,1,30,1.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [0]:
X = crash_data_onehot.drop(['DAMAGE'], axis = 1)
Y = crash_data_onehot['DAMAGE']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [0]:
logit = LogisticRegression(solver = 'lbfgs', penalty = 'l2', max_iter = 5000)
print(type(X_train))
print(type(Y_train))

# DAMAGE was filled in with .loc on a string column, so the labels likely still
# have object dtype, which scikit-learn rejects with a ValueError; cast to int.
Y_train, Y_test = Y_train.astype(int), Y_test.astype(int)

logit.fit(X_train, Y_train)
train_probs = logit.predict_proba(X_train)
test_probs = logit.predict_proba(X_test)

# Accuracy, then logarithmic loss, on the training and test sets.
print(logit.score(X_train, Y_train))
print(logit.score(X_test, Y_test))
print(log_loss(Y_train, train_probs))
print(log_loss(Y_test, test_probs))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Next, evaluate the performance of the same model using 10-fold CV. Use the training data and labels, and print out the mean log loss for each of the 10 CV folds, as well as the overall CV-estimated test error. How do the estimates from the individual folds compare to the result from our previous single holdout set? How much variability in the estimated test error do you see across the 10 folds?

Note: scikit-learn's *cross_val_score* function provides a simple, one-line method for doing this. However, be careful - the default score returned by this function may not be log loss!
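To illustrate the note above, here is a minimal sketch of 10-fold CV scored by log loss. It runs on synthetic data from `make_classification`, which stands in for `X_train` and `Y_train`; the names `X_demo`, `y_demo`, and `fold_log_loss` are placeholders introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=1000)

# The default score for a classifier is accuracy. 'neg_log_loss' scores the
# predicted probabilities and returns the negated loss, so flip the sign back.
fold_log_loss = -cross_val_score(model, X_demo, y_demo, scoring='neg_log_loss', cv=10)

print(fold_log_loss)                              # one log loss per fold
print(fold_log_loss.mean(), fold_log_loss.std())  # CV estimate and its spread
```

The scorer is negated because scikit-learn treats every score as something to maximize.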

In [0]:
# answer goes here

# Sweep the regularization strength C and record accuracy and log loss
# on the training set and on the single holdout set.
# Cast the labels to int first; this is a no-op if DAMAGE is already numeric.
Y_train, Y_test = Y_train.astype(int), Y_test.astype(int)

c_values = []
train_scores=[]
test_scores=[]
train_loss = []
test_loss = []

c_low = -7
c_high = 1

for c in np.logspace(c_low, c_high, num=c_high-c_low+1):
    c_values.append(c)
    
    logit = LogisticRegression(C=c, penalty='l2', solver='lbfgs', max_iter=5000)

    logit.fit(X_train, Y_train)
    train_probs = logit.predict_proba(X_train)
    test_probs = logit.predict_proba(X_test)
    
    train_scores.append(logit.score(X_train, Y_train))
    test_scores.append(logit.score(X_test, Y_test))
    train_loss.append(log_loss(Y_train, train_probs))
    test_loss.append(log_loss(Y_test, test_probs))
score_list = list(zip(train_scores, test_scores, train_loss, test_loss))

score_data = pd.DataFrame(score_list, index=c_values, 
                          columns=['Train Score', 'Test Score', 'Train Log Loss', 'Test Log Loss'])
score_data

In [0]:
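from sklearn.model_selection import cross_val_score

# make_scorer(log_loss) would score hard class predictions rather than probabilities.
# 'neg_log_loss' uses predict_proba and returns the negated loss, so flip the sign.
cv_log_loss = -cross_val_score(logit, X_train, Y_train.astype(int), scoring='neg_log_loss', cv=10)
print(cv_log_loss)
print(cv_log_loss.mean(), cv_log_loss.std())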

In [0]:
ax = sns.lineplot(data=score_data)
ax.set(xscale='log')  # the C values are log-spaced, so use a log axis
plt.axvline(score_data['Test Score'].idxmax(), color='k')
plt.axvline(score_data['Test Log Loss'].idxmin(), color='r')
plt.xlabel('C')
plt.ylabel('Scores')
plt.text(0.0005, 0.68, 'max test\naccuracy')
plt.text(0.2, 0.6, 'min test\nlog loss', color='r')
plt.title('Scores vs C Hyperparameter')
plt.show()

Scikit-learn's *LogisticRegression* class has a built-in regularization parameter, C, which is the inverse of the regularization strength (the larger the value of C, the weaker the regularization). Use cross-validation and grid search to find an optimal value for the parameter C.
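One way to set this up is with `GridSearchCV`, sketched below on synthetic data. The grid of C values, the `max_iter` setting, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen for illustration rather than recommended settings for the crash data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Candidate values of C on a log scale (illustrative range).
param_grid = {'C': np.logspace(-4, 2, num=7)}

search = GridSearchCV(
    LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000),
    param_grid,
    scoring='neg_log_loss',  # select C by cross-validated log loss
    cv=10,
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # the value of C with the lowest CV log loss
print(-search.best_score_)   # the corresponding mean CV log loss
```

By default `GridSearchCV` refits the best model on all of the data passed to `fit`, so `search.best_estimator_` is ready to use for prediction afterwards.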

In [0]:
# answer goes here





Re-train a logistic regression model using the best value of C identified by 10-fold CV on the training data and labels. Afterwards, do the following:

- Determine the precision, recall, and F1-score of our model using a cutoff/threshold of 0.5 (hint: scikit-learn's *classification_report* function may be helpful)
- Plot or otherwise generate a confusion matrix
- Plot the ROC curve for our logistic regression model

Note: the performance of our simple logistic regression model with just four features will not be very good, but this is not entirely unexpected. There are many other features that can be incorporated into the model to improve its performance; feel free to experiment!
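The three evaluation steps listed above could look roughly like the following sketch. It again uses synthetic data, and `C=1.0` is only a placeholder for whatever value the grid search selects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the crash data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# C=1.0 is a placeholder for the best value found by cross-validation.
model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=1000).fit(X_tr, y_tr)

# Probability of class 1, then hard predictions at a 0.5 cutoff.
probs = model.predict_proba(X_te)[:, 1]
preds = (probs >= 0.5).astype(int)

# Precision, recall, and F1-score per class.
print(classification_report(y_te, preds))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, preds)
print(cm)

# Points on the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(y_te, probs)
auc = roc_auc_score(y_te, probs)
print(auc)
```

Passing `fpr` and `tpr` to `plt.plot` draws the ROC curve, and `sns.heatmap(cm, annot=True)` is one option for displaying the confusion matrix.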

In [0]:
# answer goes here



