## Day 24 Lecture 2 Assignment

In this assignment, we will build our a more complex logistic regression model, this time on both numeric and categorical data. We will use the Chicago traffic crashes dataset loaded below and analyze the model generated for this dataset.

In [25]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

  import pandas.util.testing as tm


In [4]:
def missingness_summary(df, print_log=False, sort='none'):
    summary = df.apply(lambda x: x.isna().sum() / x.shape[0])
    
    if print_log == True:
        if sort == 'none':
            print(summary)
        elif sort == 'ascending':
            print(summary.sort_values())
        elif sort == 'descending':
            print(summary.sort_values(ascending=False))
        else:
            print('Invalid value for sort parameter.')
        
    return summary

In [5]:
crash_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/traffic_crashes_chicago.csv')

In [None]:
crash_data.head()

Unnamed: 0,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,LANE_CNT,...,WORKERS_PRESENT_I,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN
0,JC334993,7/4/2019 22:33,45,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,DIVIDED - W/MEDIAN BARRIER,,...,,,,,,,,,,
1,JC370822,7/30/2019 10:22,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN (NOT RAISED),,...,,,,,,,,,,
2,JC387098,8/10/2019 17:00,25,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,,...,,1.0,,,,,,,,
3,JC395195,8/16/2019 16:53,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,JC396604,8/17/2019 16:04,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,PARKING LOT,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0


First, create a binary response column by modifying the "DAMAGE" column. Consider "OVER \$1500" to be the positive class, and under \$1500 to be the negative class.

In [6]:
# answer goes here

crash_data.loc[crash_data['DAMAGE']  !=  "OVER $1,500", 'bDamaged'] = 0
crash_data.loc[crash_data['DAMAGE']  ==  "OVER $1,500", 'bDamaged'] = 1

crash_data['bDamaged'].describe()



count    372585.000000
mean          0.563418
std           0.495963
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: bDamaged, dtype: float64

Using the code from Day 21, Lecture 1 as a starting point, devise an appropriate way to address missing values. You have a lot of freedom here; we will proceed by taking the following steps:

- Dropping all columns with more than 5% missing data
- Imputing the median for numeric columns with less than 5% missing data (except for STREET_NO; imputing it in this manner would not make any sense)
- Dropping rows with missing data for categorical columns that have less than 5% missing data

In [7]:
# answer goes here

def missingness_summary(new_dataframe, print_log = False, sort='None'):
  missing_val = (new_dataframe.isnull().sum())/len(new_dataframe)

  if print_log == True:
    print(f'---SUMMARY---\n {missing_val.to_string()}% \n')
  
  if sort == 'asc':
    missing_val.sort_values(ascending=True, inplace=True)
  else:
    missing_val.sort_values(ascending=False, inplace=True)


  return missing_val



In [8]:

summary = missingness_summary(crash_data)

In [9]:
summary = summary[summary < 0.05]
summary.index

Index(['REPORT_TYPE', 'MOST_SEVERE_INJURY', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'INJURIES_TOTAL', 'INJURIES_REPORTED_NOT_EVIDENT',
       'INJURIES_NON_INCAPACITATING', 'INJURIES_INCAPACITATING',
       'INJURIES_FATAL', 'NUM_UNITS', 'BEAT_OF_OCCURRENCE', 'STREET_DIRECTION',
       'STREET_NAME', 'FIRST_CRASH_TYPE', 'LIGHTING_CONDITION',
       'WEATHER_CONDITION', 'bDamaged', 'DEVICE_CONDITION',
       'TRAFFIC_CONTROL_DEVICE', 'POSTED_SPEED_LIMIT', 'CRASH_DATE',
       'TRAFFICWAY_TYPE', 'PRIM_CONTRIBUTORY_CAUSE', 'ALIGNMENT',
       'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'CRASH_TYPE', 'DAMAGE',
       'DATE_POLICE_NOTIFIED', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO', 'RD_NO'],
      dtype='object')

In [10]:
crash_data = crash_data[summary.index]

In [11]:
crash_data.shape

(372585, 32)

In [12]:
t = crash_data._get_numeric_data()

In [14]:
crash= t.iloc[:, :-1].fillna(t.median())
crash['STREET_NO'] = t['STREET_NO']
crash['WEATHER_CONDITION'] = crash_data['WEATHER_CONDITION']  
crash['FIRST_CRASH_TYPE'] = crash_data['FIRST_CRASH_TYPE']

In [15]:
crash.shape

(372585, 14)

Finally, choose a few numeric and categorical features (2-3 of each) to include in the model. (You can definitely include more than this, but too many features, especially categorical ones, will most likely lead to convergence issues). One hot encode the chosen categorical features, being sure to omit one of the categories (which will serve as a "reference" level) to avoid perfect multicollinearity.

Again, you have a lot of freedom here; we will proceed with the following features, dropping the most commonly occurring category for the two categorical variables ("CLEAR" for weather, "REAR END" for first crash type):
POSTED_SPEED_LIMIT, WEATHER_CONDITION, INJURIES_TOTAL, FIRST_CRASH_TYPE

In [16]:
# answer goes here

n_t = pd.get_dummies(crash[['WEATHER_CONDITION', 'FIRST_CRASH_TYPE']], drop_first = True)



In [17]:
crash = pd.concat([crash, n_t], axis= 1)

In [18]:
crash.head()

Unnamed: 0,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,INJURIES_TOTAL,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NON_INCAPACITATING,INJURIES_INCAPACITATING,INJURIES_FATAL,NUM_UNITS,BEAT_OF_OCCURRENCE,bDamaged,POSTED_SPEED_LIMIT,STREET_NO,WEATHER_CONDITION,FIRST_CRASH_TYPE,WEATHER_CONDITION_CLEAR,WEATHER_CONDITION_CLOUDY/OVERCAST,WEATHER_CONDITION_FOG/SMOKE/HAZE,WEATHER_CONDITION_FREEZING RAIN/DRIZZLE,WEATHER_CONDITION_OTHER,WEATHER_CONDITION_RAIN,WEATHER_CONDITION_SEVERE CROSS WIND GATE,WEATHER_CONDITION_SLEET/HAIL,WEATHER_CONDITION_SNOW,WEATHER_CONDITION_UNKNOWN,FIRST_CRASH_TYPE_ANIMAL,FIRST_CRASH_TYPE_FIXED OBJECT,FIRST_CRASH_TYPE_HEAD ON,FIRST_CRASH_TYPE_OTHER NONCOLLISION,FIRST_CRASH_TYPE_OTHER OBJECT,FIRST_CRASH_TYPE_OVERTURNED,FIRST_CRASH_TYPE_PARKED MOTOR VEHICLE,FIRST_CRASH_TYPE_PEDALCYCLIST,FIRST_CRASH_TYPE_PEDESTRIAN,FIRST_CRASH_TYPE_REAR END,FIRST_CRASH_TYPE_REAR TO FRONT,FIRST_CRASH_TYPE_REAR TO REAR,FIRST_CRASH_TYPE_REAR TO SIDE,FIRST_CRASH_TYPE_SIDESWIPE OPPOSITE DIRECTION,FIRST_CRASH_TYPE_SIDESWIPE SAME DIRECTION,FIRST_CRASH_TYPE_TRAIN,FIRST_CRASH_TYPE_TURNING
0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,114.0,1.0,45,300,CLEAR,REAR END,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,631.0,1.0,30,8201,CLEAR,TURNING,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,332.0,0.0,25,6747,CLEAR,PARKED MOTOR VEHICLE,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1831.0,0.0,30,554,CLEAR,PARKED MOTOR VEHICLE,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1921.0,0.0,30,3700,CLEAR,PARKED MOTOR VEHICLE,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


Split the data into train and test, with 80% training and 20% testing. By default, the LR output from statsmodels does not include an intercept terms; add a constant column to the training data so that an intercept term is calculated for the LR model (hint: sm.add_constant() is a useful function to accomplish this).

In [24]:
crash.drop(columns=['WEATHER_CONDITION','FIRST_CRASH_TYPE'])

Unnamed: 0,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,INJURIES_TOTAL,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NON_INCAPACITATING,INJURIES_INCAPACITATING,INJURIES_FATAL,NUM_UNITS,BEAT_OF_OCCURRENCE,bDamaged,POSTED_SPEED_LIMIT,STREET_NO,WEATHER_CONDITION_CLEAR,WEATHER_CONDITION_CLOUDY/OVERCAST,WEATHER_CONDITION_FOG/SMOKE/HAZE,WEATHER_CONDITION_FREEZING RAIN/DRIZZLE,WEATHER_CONDITION_OTHER,WEATHER_CONDITION_RAIN,WEATHER_CONDITION_SEVERE CROSS WIND GATE,WEATHER_CONDITION_SLEET/HAIL,WEATHER_CONDITION_SNOW,WEATHER_CONDITION_UNKNOWN,FIRST_CRASH_TYPE_ANIMAL,FIRST_CRASH_TYPE_FIXED OBJECT,FIRST_CRASH_TYPE_HEAD ON,FIRST_CRASH_TYPE_OTHER NONCOLLISION,FIRST_CRASH_TYPE_OTHER OBJECT,FIRST_CRASH_TYPE_OVERTURNED,FIRST_CRASH_TYPE_PARKED MOTOR VEHICLE,FIRST_CRASH_TYPE_PEDALCYCLIST,FIRST_CRASH_TYPE_PEDESTRIAN,FIRST_CRASH_TYPE_REAR END,FIRST_CRASH_TYPE_REAR TO FRONT,FIRST_CRASH_TYPE_REAR TO REAR,FIRST_CRASH_TYPE_REAR TO SIDE,FIRST_CRASH_TYPE_SIDESWIPE OPPOSITE DIRECTION,FIRST_CRASH_TYPE_SIDESWIPE SAME DIRECTION,FIRST_CRASH_TYPE_TRAIN,FIRST_CRASH_TYPE_TURNING
0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,114.0,1.0,45,300,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,631.0,1.0,30,8201,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,332.0,0.0,25,6747,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1831.0,0.0,30,554,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1921.0,0.0,30,3700,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
372580,1.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,815.0,0.0,30,4520,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
372581,3.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1622.0,1.0,30,5958,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
372582,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,512.0,1.0,35,10400,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
372583,3.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1733.0,1.0,30,3806,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [26]:
# answer goes here

crash_X = crash[['INJURIES_TOTAL', 'WEATHER_CONDITION_CLEAR', 'POSTED_SPEED_LIMIT', 'FIRST_CRASH_TYPE_FIXED OBJECT']].values
crash_Y = crash.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(crash_X, crash_Y, test_size=0.2)


In [27]:
X_train_const = sm.add_constant(X_train)

Fit the logistic regression model using the statsmodels package and print out the coefficient summary. Which variables (in particular, which categories of our categorical variables) appear to be the most important, and what effect do they have on the probability of a crash resulting in $1500 or more in damages?

In [28]:
# answer goes here

sm_model = sm.Logit(y_train, X_train_const).fit()
print(sm_model.summary())



  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


         Current function value: inf
         Iterations: 35




                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:               298068
Model:                          Logit   Df Residuals:                   298063
Method:                           MLE   Df Model:                            4
Date:                Thu, 15 Oct 2020   Pseudo R-squ.:                     inf
Time:                        21:48:16   Log-Likelihood:                   -inf
converged:                      False   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.9827      0.031    -94.828      0.000      -3.044      -2.921
x1             0.1211      0.009     13.333      0.000       0.103       0.139
x2             0.1161      0.013      8.633      0.0



Create a LogisticRegression model with sklearn. Use the .predict() method (using X_test) to get a y_pred. Create a confusion matrix comparing your actual y_test to your prediction. What do you notice about your type of error?

In [29]:
# answer goes here

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [30]:

damage_prob = logit.predict(X_test)

In [31]:
from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, damage_prob)
matrix

array([[64091,    18],
       [10406,     2]])