# Final Project Submission

- Student name: Vinayak Modgil 
- Student pace: self paced / part time / full time: Full Time
- Scheduled project review date/time:
- Instructor name: Yish Lim
- Blog post URL:
- Video of 5-min Non-Technical Presentation:

# Table of Contents
- [Introduction](#Introduction)
- [Data Collection](#Data-Collection)
- [Data Cleaning](#Data-Cleaning)
- [Data Exploration](#Data-Exploration)
- [Data Modeling](#Data-Modeling)
- [Data Interpretation](#Data-Interpretation)
- [Recommendations and Conclusions](#Recommendations-and-Conclusions)

# Introduction

Crash data shows information about each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of Chicago Police Department (CPD). Data are shown as is from the electronic crash reporting system (E-Crash) at CPD, excluding any personally identifiable information. A dataset housing this information can be found [here](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if).

## Business Statement
It is crucial for the Vehicle Safety Board to determine the cause of an accident. This analysis uses the City of Chicago crash dataset to study the accidents occurring in the city.

## Analysis Methodology

The dataset has information on about 520,000 car crashes in the city of Chicago, of which about 60% have a known contributory cause. Information on these crashes includes many important factors that led to the crashes and their aftermath. I will clean and explore the data for use with a classification machine learning model that predicts the primary contributory cause.

More specifically, I will explore and tune the models so that the contributory cause can be identified as accurately as possible. From there, I will make predictions and draw conclusions about the most prevalent cause of accidents occurring in the city of Chicago.

# Data Collection

## Importing necessary packages

In [None]:
#data wrangling and visualization packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import statsmodels.api as sm
import scipy.stats as stats

#feature engineering packages
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

#feature selection packages
from feature_engine.selection import DropDuplicateFeatures

#modeling packages
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

#modeling evaluation packages
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import plot_roc_curve, roc_curve, auc
from sklearn.metrics import get_scorer

#optimization packages
from sklearn.model_selection import GridSearchCV

In [None]:
#notebook settings
pd.set_option("display.max_columns", 40)
pd.options.display.float_format = '{:,}'.format

import warnings
warnings.filterwarnings('ignore')

## Global Functions

In [None]:
from sklearn.impute import SimpleImputer

impute_mean = SimpleImputer(strategy = "mean")
impute_median = SimpleImputer(strategy = "median")
impute_mode = SimpleImputer(strategy = "most_frequent")
impute_cont_const = SimpleImputer(strategy = "constant", fill_value = 0)
impute_cat_const = SimpleImputer(strategy = "constant", fill_value= "missing")


def clean_df(df):
    '''
    Takes dataset df as input and returns a clean dataset 
    with null values taken care of.
    '''
    # Dividing dataset into continuous and categorical variables
    cont_features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64]]
    cat_features = [col for col in df.columns if df[col].dtype == object]
    
    
    #filling injuries continuous variables with mean
    injuries = ["INJURIES_TOTAL", "INJURIES_FATAL", "INJURIES_INCAPACITATING", "INJURIES_NON_INCAPACITATING"
               , "INJURIES_REPORTED_NOT_EVIDENT", "INJURIES_NO_INDICATION", "INJURIES_UNKNOWN"]
    
    df[injuries] = impute_mean.fit_transform(df[injuries])
    
    # filling latitude and longitude continuous variables with 0
    lat_long = ["LATITUDE", "LONGITUDE"]
    
    df[lat_long] = impute_cont_const.fit_transform(df[lat_long])
    
    #filling beat of occurrence continuous variable with median
    beat_of_occ = ["BEAT_OF_OCCURRENCE"]
    df[beat_of_occ] = impute_median.fit_transform(df[beat_of_occ])
    
    # Filling null categorical values with "missing"
    cat_vars = ["RD_NO", "CRASH_DATE_EST_I", "LANE_CNT", "REPORT_TYPE", "INTERSECTION_RELATED_I",
               "NOT_RIGHT_OF_WAY_I", "HIT_AND_RUN_I", "PHOTOS_TAKEN_I", "STATEMENTS_TAKEN_I", "DOORING_I", "WORK_ZONE_I",
               "WORK_ZONE_TYPE", "WORKERS_PRESENT_I", "MOST_SEVERE_INJURY", "LOCATION", "STREET_DIRECTION", "STREET_NAME"]
    
    df[cat_vars] = impute_cat_const.fit_transform(df[cat_vars])    
    
    
    return df
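
The imputers above replace missing values column by column. As a minimal sketch of what the `mean` and `constant` strategies do, the example below runs them on a made-up toy frame. The toy values and the names `toy`, `filled_num`, and `filled_cat` are assumptions for illustration and are not taken from the crash data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for two crash columns (values are made up)
toy = pd.DataFrame({
    "INJURIES_TOTAL": [0.0, 2.0, np.nan, 1.0],
    "REPORT_TYPE": ["ON SCENE", np.nan, "NOT ON SCENE (DESK REPORT)", np.nan],
})

mean_imputer = SimpleImputer(strategy="mean")
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# Numeric gap is filled with the mean of the observed values (0, 2, 1)
filled_num = mean_imputer.fit_transform(toy[["INJURIES_TOTAL"]]).ravel()
# Categorical gaps are filled with the literal string "missing"
filled_cat = const_imputer.fit_transform(toy[["REPORT_TYPE"]]).ravel()

print(filled_num)
print(filled_cat)
```

Mean imputation keeps the column average unchanged, while the constant strategy turns missingness into its own category.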

In [None]:
def rows_to_drop(df, y=None):
    '''
    Drops rows whose value in column y is "UNABLE TO DETERMINE"
    or "NOT APPLICABLE". Returns df unchanged when y is None.
    '''
    if y is None:
        return df
    # A boolean mask avoids a KeyError when a label is absent from the column
    keep = ~df[y].isin(["UNABLE TO DETERMINE", "NOT APPLICABLE"])
    return df.loc[keep].reset_index(drop=True)

In [None]:
def rows_to_drop_unknown(df, y=None):
    '''
    Drops rows whose value in column y is "UNKNOWN".
    Returns df unchanged when y is None.
    '''
    if y is None:
        return df
    # A boolean mask avoids a KeyError when the label is absent from the column
    keep = df[y] != "UNKNOWN"
    return df.loc[keep].reset_index(drop=True)

In [None]:
def drop_quasi_const(df):
    '''
    Function taken from Feature Engineering course on Udemy to drop all
    the constant and quasi-constant features.
    - df: A dataframe
    '''
    #Create an empty list
    quasi_const_feat = []
    
    #Iterate over every feature
    for feature in df.columns:
        
        #Find the predominant value, the value that is 
        # shared by most observations
        predominant = (df[feature].value_counts() /
                       float(len(df))).sort_values(ascending=False).values[0]
        
        #Evaluate the predominant feature: do more than 99.8% of the observations
        #show 1 value?
        if predominant > 0.998:
            
            #if yes, append it to the empty list
            quasi_const_feat.append(feature)
            
    df.drop(labels=quasi_const_feat, axis=1, inplace=True)
    return df
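
To make the quasi-constant check concrete, here is a small sketch on made-up data. It uses `value_counts(normalize=True, dropna=False)`, which is a slight variation on the function above (it also counts missing values as a candidate for the predominant value). The helper `predominant_share` and the toy frame are hypothetical.

```python
import pandas as pd

# Toy frame: one nearly constant column and one well-spread column (made up)
toy = pd.DataFrame({
    "DOORING_I": ["missing"] * 999 + ["Y"],
    "CRASH_HOUR": list(range(24)) * 41 + list(range(16)),
})

def predominant_share(series):
    # Fraction of rows taken by the single most frequent value
    return series.value_counts(normalize=True, dropna=False).iloc[0]

shares = {col: predominant_share(toy[col]) for col in toy.columns}

# Same 0.998 cut-off as drop_quasi_const
quasi_constant = [col for col, share in shares.items() if share > 0.998]
print(shares)
print(quasi_constant)
```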
            
    

In [None]:
def col_summary(df, num_col=None, cat_cols=None, y_col = "PRIM_CONTRIBUTORY_CAUSE", label_count = 25, thresh = 0.025):
    '''
    this function gives a brief summary of a single col 
    in the dataset df. Also, it shows the essential plots
    required for the column w.r.t the dependent variable.
    
    arguments:
    df - given dataset
    num_col - numerical column in the dataset
    cat_cols - categorical columns in the dataset
    y_col - dependent variable
    label_count - number of labels to draw in bar graph
    '''
    if num_col is not None:
        #print the column name
        print(f'Column Name: {num_col}') 
        #print the number of unique values
        print(f'Number of unique values: {df[num_col].nunique()}') 
        #print the number of duplicate values
        print(f'There are {df[num_col].duplicated().sum()} duplicates')
        #print the number of null values
        print(f'There are {df[num_col].isna().sum()} null values')
        #print the number of values equal to 0
        print(f'There are {(df[num_col] == 0).sum()} zeros')
        print('\n')
        #print the value counts percentage
        print('Value Counts Percentage', '\n', 
              df[num_col].value_counts(normalize=True, dropna=False).round(2)*100)
        print('\n')
        #print descriptive statistics
        print('Descriptive Metrics:','\n',
              df[num_col].describe())
        #plot boxplot, histogram         
        fig, ax = plt.subplots(nrows=4, figsize=(15,80))
        
        histogram = df[num_col].hist(ax=ax[0])
        ax[0].set_title(f'Distribution of {num_col}');
        
        scatter = df.plot(kind='scatter', x=num_col, y=y_col,ax=ax[1]);
        ax[1].set_title(f'{y_col} vs {num_col}');

        boxplot = df.boxplot(column=num_col, ax=ax[2]);
        ax[2].set_title(f'Boxplot of {num_col}');

        sm.graphics.qqplot(df[num_col], dist=stats.norm, line='45', fit=True, ax=ax[3])
        ax[3].set_title(f'QQ plot of {num_col}');
        plt.tight_layout()

        plt.show()
        return
    
    else:
        
        for col in cat_cols:
            print('============================')
            #print the column name
            print(f'Column Name: {col}')
            print('\n')
            #print the number of unique values
            print(f'Number of unique values: {df[col].nunique()}')
            print('\n')
            #print the number of duplicate values
            print(f'There are {df[col].duplicated().sum()} duplicates')
            print('\n')
            #print the number of null values
            print(f'There are {df[col].isna().sum()} null values')
            print('\n')
            #print the number of values equal to '0'
            print(f'There are {(df[col] == "0").sum()} zeros')
            print('\n')
            #print the value counts percentage
            print('Value Counts Percentage', '\n', 
                  df[col].value_counts(normalize=True, dropna=False).round(2)*100)
            print('\n')

            #plot barplot, histogram     
            fig, ax = plt.subplots(figsize=(15,10))
                        
            bar_graph = df[col].value_counts(normalize=True, 
                                             dropna=False)[:label_count].plot.bar(label=f'{col} Percentage')
            ax.axhline(y=thresh, color='red', linestyle='--', 
                        label=f'{thresh*100}% Threshold')
            ax.set_title(f'{col} Value Counts')
            ax.set_xlabel(f'{col} Labels')
            ax.set_ylabel('Percentage')
            ax.legend()

            plt.tight_layout()

            plt.show()

        return    

In [None]:
def plot_confusion(y_true, y_pred):
    #Create an instance of confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    #Plot it on a heatmap, showing each cell as an integer count
    sns.heatmap(cm, annot=True, fmt="d", cmap = sns.color_palette("Blues"))
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
    
    

In [None]:
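def model_evaluation(model,X_train, X_test, y_train, y_test, prev_model=None):
    # No body is defined for this function yet; fail loudly if it is called
    raise NotImplementedError("model_evaluation has no implementation yet")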

## Import Data

In [None]:
#df_cars = pd.read_csv("data/Traffic_Crashes_-_Vehicles.csv")
df_crashes = pd.read_csv("data/Traffic_Crashes_-_Crashes.csv")
df_crashes.head()

## Data Schema

**Taken From:** [Chicago car crash website](https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if)
- `CRASH_RECORD_ID`
This number can be used to link to the same crash in the Vehicles and People datasets. This number also serves as a unique ID in this dataset.

- `RD_NO`
Chicago Police Department report number. For privacy reasons, this column is blank for recent crashes.

- `CRASH_DATE_EST_I`	
Crash date estimated by desk officer or reporting party (only used in cases where crash is reported at police station days after the crash)

- `CRASH_DATE`	
Date and time of crash as entered by the reporting officer

- `POSTED_SPEED_LIMIT`	
Posted speed limit, as determined by reporting officer

- `TRAFFIC_CONTROL_DEVICE`	
Traffic control device present at crash location, as determined by reporting officer

- `DEVICE_CONDITION`	
Condition of traffic control device, as determined by reporting officer

- `WEATHER_CONDITION`	
Weather condition at time of crash, as determined by reporting officer

- `LIGHTING_CONDITION`	
Light condition at time of crash, as determined by reporting officer

- `FIRST_CRASH_TYPE`	
Type of first collision in crash

- `TRAFFICWAY_TYPE`	
Trafficway type, as determined by reporting officer

- `LANE_CNT`	
Total number of through lanes in either direction, excluding turn lanes, as determined by reporting officer (0 = intersection)

- `ALIGNMENT`	
Street alignment at crash location, as determined by reporting officer

- `ROADWAY_SURFACE_COND`	
Road surface condition, as determined by reporting officer

- `ROAD_DEFECT`	
Road defects, as determined by reporting officer

- `REPORT_TYPE`	
Administrative report type (at scene, at desk, amended)

- `CRASH_TYPE`	
A general severity classification for the crash. Can be either Injury and/or Tow Due to Crash or No Injury / Drive Away

- `INTERSECTION_RELATED_I`	
A field observation by the police officer whether an intersection played a role in the crash. Does not represent whether or not the crash occurred within the intersection.

- `NOT_RIGHT_OF_WAY_I`	
Whether the crash began or first contact was made outside of the public right-of-way.

- `HIT_AND_RUN_I`	
Crash did/did not involve a driver who caused the crash and fled the scene without exchanging information and/or rendering aid

- `DAMAGE`	
A field observation of estimated damage.

- `DATE_POLICE_NOTIFIED`	
Calendar date on which police were notified of the crash

- `PRIM_CONTRIBUTORY_CAUSE`	
The factor which was most significant in causing the crash, as determined by officer judgment

- `SEC_CONTRIBUTORY_CAUSE`	
The factor which was second most significant in causing the crash, as determined by officer judgment

- `STREET_NO`	
Street address number of crash location, as determined by reporting officer

- `STREET_DIRECTION`	
Street address direction (N,E,S,W) of crash location, as determined by reporting officer

- `STREET_NAME`	
Street address name of crash location, as determined by reporting officer

- `BEAT_OF_OCCURRENCE`	
Chicago Police Department Beat ID. Boundaries available at https://data.cityofchicago.org/d/aerh-rz74

- `PHOTOS_TAKEN_I`	
Whether the Chicago Police Department took photos at the location of the crash

- `STATEMENTS_TAKEN_I`	
Whether statements were taken from unit(s) involved in crash

- `DOORING_I`	
Whether crash involved a motor vehicle occupant opening a door into the travel path of a bicyclist, causing a crash

- `WORK_ZONE_I`	
Whether the crash occurred in an active work zone

- `WORK_ZONE_TYPE`	
The type of work zone, if any

- `WORKERS_PRESENT_I`	
Whether construction workers were present in an active work zone at crash location

- `NUM_UNITS`	
Number of units involved in the crash. A unit can be a motor vehicle, a pedestrian, a bicyclist, or another non-passenger roadway user. Each unit represents a mode of traffic with an independent trajectory.

- `MOST_SEVERE_INJURY`	
Most severe injury sustained by any person involved in the crash

- `INJURIES_TOTAL`	
Total persons sustaining fatal, incapacitating, non-incapacitating, and possible injuries as determined by the reporting officer

- `INJURIES_FATAL`	
Total persons sustaining fatal injuries in the crash

- `INJURIES_INCAPACITATING`	
Total persons sustaining incapacitating/serious injuries in the crash as determined by the reporting officer. Any injury other than fatal injury, which prevents the injured person from walking, driving, or normally continuing the activities they were capable of performing before the injury occurred. Includes severe lacerations, broken limbs, skull or chest injuries, and abdominal injuries.

- `INJURIES_NON_INCAPACITATING`	
Total persons sustaining non-incapacitating injuries in the crash as determined by the reporting officer. Any injury, other than fatal or incapacitating injury, which is evident to observers at the scene of the crash. Includes lump on head, abrasions, bruises, and minor lacerations.

- `INJURIES_REPORTED_NOT_EVIDENT`	
Total persons sustaining possible injuries in the crash as determined by the reporting officer. Includes momentary unconsciousness, claims of injuries not evident, limping, complaint of pain, nausea, and hysteria.

- `INJURIES_NO_INDICATION`	
Total persons sustaining no injuries in the crash as determined by the reporting officer

- `INJURIES_UNKNOWN`	
Total persons for whom injuries sustained, if any, are unknown

- `CRASH_HOUR`	
The hour of the day component of CRASH_DATE.

- `CRASH_DAY_OF_WEEK`	
The day of the week component of CRASH_DATE. Sunday=1

- `CRASH_MONTH`	
The month component of CRASH_DATE.

- `LATITUDE`	
The latitude of the crash location, as determined by reporting officer, as derived from the reported address of crash

- `LONGITUDE`	
The longitude of the crash location, as determined by reporting officer, as derived from the reported address of crash

- `LOCATION`	
The crash location, as determined by reporting officer, as derived from the reported address of crash, in a column type that allows for mapping and other geographic analysis in the data portal software

## Investigate Data

In [None]:
df_crashes.info()

> **Observations**
> - Many columns to explore for null value imputation
> - Column names are already standardized
> - Data types will require further evaluation during engineering
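
Since many columns need null value imputation, it helps to rank them by how much is missing. The following is a minimal sketch on a made-up toy frame; `toy` and `null_share` are illustrative names, and the values are not from the crash data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_crashes (values are made up)
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 30, 25, 35],
    "LANE_CNT": [np.nan, 2.0, np.nan, np.nan],
    "WEATHER_CONDITION": ["CLEAR", None, "RAIN", "CLEAR"],
})

# Share of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

Columns near the top of this ranking are the ones where the choice of imputation strategy matters most.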

In [None]:
#evaluate numerical data descriptive statistics
df_crashes.describe()

> **Observations**
> - A few of these numerical features should be transformed into categorical features.
> - `INJURIES_TOTAL`, `INJURIES_FATAL`, `INJURIES_INCAPACITATING`, `INJURIES_NON_INCAPACITATING`, `INJURIES_REPORTED_NOT_EVIDENT`, `INJURIES_NO_INDICATION`, `INJURIES_UNKNOWN`, `CRASH_HOUR` have a minimum of 0, which may be a placeholder for unknown.
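
One way to judge whether 0 acts as a placeholder is to look at how often it occurs in each column. The sketch below does this on made-up values; `toy` and `zero_share` are illustrative names. Note that 0 can also be a legitimate value: a crash can have zero injuries, and hour 0 is midnight.

```python
import pandas as pd

# Toy frame standing in for two numeric crash columns (values are made up)
toy = pd.DataFrame({
    "INJURIES_TOTAL": [0, 0, 1, 0, 2],
    "CRASH_HOUR": [0, 13, 17, 8, 23],
})

# Fraction of rows equal to 0 in each column
zero_share = (toy == 0).mean()
print(zero_share)
```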

# Data Cleaning

In [None]:
df_crashes_clean = df_crashes.copy()

In [None]:
df_crashes_clean.isnull().sum()

In [None]:
# Using the global function clean_df to impute null values
df_crashes_clean = clean_df(df_crashes_clean)
df_crashes_clean.isnull().sum()

## Feature Evaluation

In [None]:
#Create a list of all columns
num_cols = [col for col in df_crashes_clean.columns if df_crashes_clean[col].dtype in [np.float64, np.int64]]
cat_cols= [col for col in df_crashes_clean.columns if df_crashes_clean[col].dtype == object]
print(f"There are {len(num_cols)} numerical columns : \n {num_cols}")
print("\n")
print(f"There are {len(cat_cols)} categorical columns : \n {cat_cols}")

In [None]:
# Display the first 5 rows of the cleaned data
df_crashes_clean.head()

In [None]:
# posted speed limit summary
col_summary(df_crashes_clean, num_col="POSTED_SPEED_LIMIT")

> **Observations**
> - Does not seem to have extreme outliers

> **Actions**
> - Keep all the values in the column

In [None]:
# Street no summary
col_summary(df_crashes_clean, num_col="STREET_NO")

> **Observations**
> - `STREET_NO` should be changed to a categorical variable, as it is an address identifier rather than a quantity.

> **Actions**
> - Recast `STREET_NO` as categorical
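
Recasting an identifier column means turning its numbers into labels. A minimal sketch on made-up values follows; the toy frame is an assumption for illustration. It also shows one detail worth knowing: a float column cast directly to string keeps the trailing `.0`, so casting to integer first gives cleaner labels.

```python
import pandas as pd

# Toy frame with identifier-like numeric columns (values are made up)
toy = pd.DataFrame({
    "STREET_NO": [1600, 1600, 35, 7100],
    "BEAT_OF_OCCURRENCE": [1834.0, 1834.0, 225.0, 812.0],
})

recast = toy.astype({"STREET_NO": str})

# Direct float -> str keeps the decimal part in the label
direct_label = toy["BEAT_OF_OCCURRENCE"].astype(str).iloc[0]

# Going through int first yields a cleaner label
recast["BEAT_OF_OCCURRENCE"] = toy["BEAT_OF_OCCURRENCE"].astype(int).astype(str)

print(direct_label)
print(recast)
```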

In [None]:
#Summary of BEAT_OF_OCCURRENCE
col_summary(df_crashes_clean, num_col="BEAT_OF_OCCURRENCE")

> **Observations**
> - Needs to be changed to a categorical variable as it is an identifier.

> **Actions**
> - Recast `BEAT_OF_OCCURRENCE` as a categorical variable.

In [None]:
#Summary of NUM_UNITS
col_summary(df_crashes_clean, num_col = "NUM_UNITS")

> **Observations**
> - There are outliers present here.

> **Actions**
> - Remove outliers from `NUM_UNITS`
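
There are several ways to decide which values count as outliers. As one possible approach (an assumption, not necessarily the rule used in this notebook), the sketch below applies the common 1.5 x IQR upper fence to made-up `NUM_UNITS` values.

```python
import pandas as pd

# Toy NUM_UNITS values with one extreme crash (values are made up)
toy = pd.DataFrame({"NUM_UNITS": [1, 2, 2, 2, 2, 2, 3, 3, 4, 18]})

# Upper fence = third quartile + 1.5 * interquartile range
q1, q3 = toy["NUM_UNITS"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Keep only the rows at or below the fence
trimmed = toy[toy["NUM_UNITS"] <= upper_fence]
print(upper_fence, len(trimmed))
```

A fixed cut-off chosen from the histogram is an equally valid alternative when the distribution is as discrete as a unit count.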

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_TOTAL")

> **Observations**
> - May be useful for modeling after feature engineering

> **Actions**
> - Keep the column `INJURIES_TOTAL`

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_FATAL")

> **Observations**
> - Seems to be useful for modeling

> **Actions**
> - Keep the column `INJURIES_FATAL`

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_INCAPACITATING")

> **Observations**
> - `INJURIES_INCAPACITATING` has many zeros and needs to be evaluated for outliers.

> **Actions**
> - Check for outliers in the column.
> - Keep the column.

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_NON_INCAPACITATING")

> **Observations**
> - Doesn't seem useful for modeling

> **Actions**
> - Drop the column `INJURIES_NON_INCAPACITATING`

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_REPORTED_NOT_EVIDENT")

> **Observations**
> - About 95% of the values are 0, and the data schema does not clearly state what the column means, so I will drop it from the analysis.

> **Actions**
> - Drop `INJURIES_REPORTED_NOT_EVIDENT`

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_NO_INDICATION")

> **Observations**
> - Column seems to be useful for classification

> **Actions**
> - Keep the column

In [None]:
col_summary(df_crashes_clean, num_col="INJURIES_UNKNOWN")

> **Observations**
> - Doesn't seem to be a useful column

> **Actions**
> - Drop `INJURIES_UNKNOWN`

In [None]:
col_summary(df_crashes_clean, num_col="CRASH_HOUR")

> **Observations**
> - Seems useful for modeling

> **Actions**
> - Keep the column

In [None]:
col_summary(df_crashes_clean, num_col="CRASH_DAY_OF_WEEK")

> **Observations**
> - Doesn't seem to be useful

> **Actions**
> - Drop `CRASH_DAY_OF_WEEK`

In [None]:
col_summary(df_crashes_clean, num_col="CRASH_MONTH")

> **Observations**
> - Doesn't seem to be useful

> **Actions**
> - Drop `CRASH_MONTH`

In [None]:
col_summary(df_crashes_clean, num_col="LATITUDE")

> **Observations**
> - `LATITUDE` should be categorical, as it is an identifier

> **Actions**
> - Recast `LATITUDE` as a categorical feature

In [None]:
col_summary(df_crashes_clean, num_col="LONGITUDE")

> **Observations**
> - `LONGITUDE` seems to be a categorical variable as it is an identifier.

> **Actions**
> - Recast `LONGITUDE` as a categorical variable.

In [None]:
col_summary(df_crashes_clean, cat_cols = ["CRASH_RECORD_ID"])

In [None]:
col_summary(df_crashes_clean, cat_cols = ["RD_NO"])

In [None]:
col_summary(df_crashes_clean, cat_cols = ["CRASH_DATE", "CRASH_DATE_EST_I"])

In [None]:
col_summary(df_crashes_clean, cat_cols=["TRAFFIC_CONTROL_DEVICE"])

In [None]:
col_summary(df_crashes_clean, cat_cols=["DEVICE_CONDITION", "WEATHER_CONDITION", "LIGHTING_CONDITION"])

In [None]:
col_summary(df_crashes_clean, cat_cols=['FIRST_CRASH_TYPE',
 'TRAFFICWAY_TYPE','REPORT_TYPE', 'CRASH_TYPE'])

In [None]:
 col_summary(df_crashes_clean, cat_cols=['LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
 'ROAD_DEFECT'])

In [None]:
col_summary(df_crashes_clean, cat_cols=['INTERSECTION_RELATED_I',
 'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I'])

In [None]:
col_summary(df_crashes_clean, cat_cols=['DAMAGE', 'DATE_POLICE_NOTIFIED'])

In [None]:
col_summary(df_crashes_clean, cat_cols=["PRIM_CONTRIBUTORY_CAUSE", "SEC_CONTRIBUTORY_CAUSE"])

In [None]:
col_summary(df_crashes_clean, cat_cols=['STREET_DIRECTION', 'STREET_NAME'])

In [None]:
col_summary(df_crashes_clean, cat_cols=['PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I', 
 'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I'])

In [None]:
col_summary(df_crashes_clean, cat_cols=["MOST_SEVERE_INJURY"])

In [None]:
col_summary(df_crashes_clean, cat_cols=["LOCATION"])

-----------------------------------------------------
**Feature evaluation done**



> **Summary of actions to take**
>- recast `STREET_NO` as a string
>- recast `BEAT_OF_OCCURRENCE` as a string
>- recast `LATITUDE` as a string
>- recast `LONGITUDE` as a string
>- drop `INJURIES_REPORTED_NOT_EVIDENT` column
>- drop `INJURIES_UNKNOWN` column
>- drop `CRASH_DAY_OF_WEEK` column
>- drop `CRASH_MONTH` column
>- drop `CRASH_DATE` column
>- drop `CRASH_RECORD_ID` column
>- drop `INTERSECTION_RELATED_I` column
>- drop `STREET_DIRECTION` column
>- drop `STREET_NAME` column
>- drop `PHOTOS_TAKEN_I` column
>- drop `STATEMENTS_TAKEN_I` column
>- drop `WORK_ZONE_I` column
>- drop `WORK_ZONE_TYPE` column
>- drop `WORKERS_PRESENT_I` column
>- drop `LANE_CNT` column
>- drop `ALIGNMENT` column
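
The actions listed above can be kept in two lists and applied in one pass. This is a minimal sketch on a made-up toy frame; `to_string`, `to_drop`, and `cleaned` are illustrative names. Passing `errors="ignore"` lets the drop skip any column that is not present.

```python
import pandas as pd

# Toy frame with a few of the crash columns (values are made up)
toy = pd.DataFrame({
    "STREET_NO": [1600, 35],
    "CRASH_MONTH": [1, 7],
    "STREET_NAME": ["WESTERN AVE", "STATE ST"],
    "DAMAGE": ["OVER $1,500", "$500 OR LESS"],
})

to_string = ["STREET_NO"]
# WORK_ZONE_I is deliberately absent from the toy frame
to_drop = ["CRASH_MONTH", "STREET_NAME", "WORK_ZONE_I"]

cleaned = (
    toy.astype({col: str for col in to_string})
       .drop(columns=to_drop, errors="ignore")
)
print(cleaned)
```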

## Data type Recasting

In [None]:
df_crashes_clean.dtypes

In [None]:
#convert STREET_NO to categorical
df_crashes_clean["STREET_NO"] = df_crashes_clean["STREET_NO"].astype(str)

In [None]:
#Convert BEAT_OF_OCCURRENCE to categorical
df_crashes_clean["BEAT_OF_OCCURRENCE"] = df_crashes_clean["BEAT_OF_OCCURRENCE"].astype(str)

In [None]:
# Convert LATITUDE to categorical
df_crashes_clean["LATITUDE"] = df_crashes_clean["LATITUDE"].astype(str)

In [None]:
df_crashes_clean["LONGITUDE"] = df_crashes_clean["LONGITUDE"].astype(str)

In [1]:
df_crashes_clean["CRASH_DATE_YR"] = pd.to_datetime(df_crashes_clean["CRASH_DATE"]).dt.year


In [None]:
df_crashes_clean

## Feature/Row Drop

In [None]:
cols_to_drop = ["RD_NO", "CRASH_DAY_OF_WEEK", "CRASH_MONTH", "CRASH_RECORD_ID", "INTERSECTION_RELATED_I",
                "STREET_DIRECTION", "STREET_NAME", "PHOTOS_TAKEN_I", "STATEMENTS_TAKEN_I",
                "WORK_ZONE_I", "WORK_ZONE_TYPE", "WORKERS_PRESENT_I", "LANE_CNT", "ALIGNMENT",
                "CRASH_DATE_EST_I", "CRASH_DATE", "NOT_RIGHT_OF_WAY_I", "HIT_AND_RUN_I",
                "DATE_POLICE_NOTIFIED", "DOORING_I", "INJURIES_UNKNOWN", "LOCATION"]
df_crashes_clean = df_crashes_clean.drop(columns=cols_to_drop)
df_crashes_clean

In [None]:
df_crashes_clean = rows_to_drop(df_crashes_clean,
                                y="PRIM_CONTRIBUTORY_CAUSE")

In [None]:
df_crashes_clean = rows_to_drop(df_crashes_clean,
                                y="SEC_CONTRIBUTORY_CAUSE")

In [None]:
df_crashes_clean = rows_to_drop_unknown(df_crashes_clean, y="LIGHTING_CONDITION")

In [None]:
df_crashes_clean = drop_quasi_const(df_crashes_clean)
df_crashes_clean

## Outlier Removal

In [None]:
df_crashes_clean = df_crashes_clean.loc[df_crashes_clean["NUM_UNITS"] <= 5]  # keep crashes with at most 5 units
df_crashes_clean[["NUM_UNITS"]].hist(bins=2)

## Train-test Split

In [None]:
# Create train-test split
X = df_crashes_clean.drop(columns="LIGHTING_CONDITION")
y = df_crashes_clean["LIGHTING_CONDITION"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
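
When the target classes are imbalanced, a stratified split keeps the class shares the same in the training and test sets. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels. It is an optional variation and not the split used above; all `toy_*` names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy target (values are made up)
toy_X = pd.DataFrame({"CRASH_HOUR": range(100)})
toy_y = pd.Series(["DAYLIGHT"] * 70 + ["DARKNESS"] * 30, name="LIGHTING_CONDITION")

# stratify=toy_y preserves the 70/30 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1, stratify=toy_y
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```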

In [None]:
X_train_tf = X_train.copy()
X_test_tf = X_test.copy()

## Feature Engineering

In this section I will create new features which will improve the ability to gain insights into the data and help modeling.

### `SEVERELY_INJURED`

In [None]:
X_train_tf["SEVERELY_INJURED"] = X_train_tf["INJURIES_TOTAL"] >= 5
X_train_tf["SEVERELY_INJURED"].value_counts()

X_test_tf["SEVERELY_INJURED"] = X_test_tf["INJURIES_TOTAL"] >= 5
X_test_tf["SEVERELY_INJURED"].value_counts()

# Data Exploration


In [None]:
# create data exploration df
df_explore = pd.concat([X_train_tf, X_test_tf], axis=0)
df_explore["LIGHTING_CONDITION"] = pd.concat([y_train, y_test])

In [None]:
df_explore

# Data Modeling

I will take 2 major steps in preprocessing the data for modeling:
1. Scale numerical data
2. Encode categorical data
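
The two steps above can also be bundled into a single transformer. As a minimal sketch (an alternative to the manual steps in this notebook, not the approach it uses), the example below combines `StandardScaler` and `OneHotEncoder` with scikit-learn's `ColumnTransformer` on made-up data. Fitting on the training rows only and then transforming the test rows avoids leaking test information into the scaling.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test frames (values are made up)
train = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [20.0, 30.0, 40.0],
    "WEATHER_CONDITION": ["CLEAR", "RAIN", "CLEAR"],
})
test = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30.0],
    "WEATHER_CONDITION": ["RAIN"],
})

# sparse_threshold=0 forces a dense NumPy array as output
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["POSTED_SPEED_LIMIT"]),
        ("cat", OneHotEncoder(drop="first"), ["WEATHER_CONDITION"]),
    ],
    sparse_threshold=0,
)

train_arr = preprocess.fit_transform(train)  # learn mean/std and categories
test_arr = preprocess.transform(test)        # reuse what was learned on train
print(train_arr)
print(test_arr)
```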

## Model Preprocessing

In [None]:
#training columns
X_train_tf.columns

In [None]:
X_train_tf.drop(columns=["LATITUDE", "LONGITUDE"], axis=1, inplace=True)
X_test_tf.drop(columns=["LATITUDE", "LONGITUDE"], axis=1, inplace=True)

In [None]:
X_test_tf.drop("STREET_NO", axis=1, inplace=True)
X_train_tf.drop("STREET_NO", axis=1, inplace=True)

In [None]:
cat_cols = X_train_tf.select_dtypes(include="object").columns
num_cols = X_train_tf.select_dtypes(exclude="object").columns
num_cols

In [None]:
cat_cols

In [None]:
ohe = OneHotEncoder(sparse=False, drop="first")
ohe.fit(X_train_tf[cat_cols])
train_ohe_df = pd.DataFrame(ohe.transform(X_train_tf[cat_cols]),
                           columns=ohe.get_feature_names(cat_cols),
                            index=X_train_tf.index)

test_ohe_df = pd.DataFrame(ohe.transform(X_test_tf[cat_cols]),
                           columns=ohe.get_feature_names(cat_cols),
                           index=X_test_tf.index)


In [None]:
test_ohe_df

In [None]:
scaler = StandardScaler()
scaler.fit(X_train_tf[num_cols])

train_scale_df = pd.DataFrame(scaler.transform(X_train_tf[num_cols]),
                             columns=num_cols, index=X_train_tf.index)

test_scale_df = pd.DataFrame(scaler.transform(X_test_tf[num_cols]),
                             columns=num_cols, index=X_test_tf.index)

In [None]:
test_scale_df

In [None]:
X_train_tf = pd.concat([train_ohe_df, train_scale_df], axis=1)
X_train_tf

In [None]:
X_test_tf = pd.concat([test_ohe_df, test_scale_df], axis=1)
X_test_tf

## Logistic Regression
First I will create a logistic regression model and check its scores.

### Linearity with Target

> **Observation**
> - The features seem to have a linear relationship with target

### Multicollinearity

In [None]:
sns.heatmap(df_explore.select_dtypes(include=["number", "bool"]).corr().abs().round(2), annot=True, cmap="Blues")

> **Observation**
> - High correlation between `INJURIES_TOTAL` and `INJURIES_NON_INCAPACITATING` is observed.

> **Action** 
>- drop `INJURIES_NON_INCAPACITATING` column.
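
Reading highly correlated pairs off a heatmap gets harder as the number of features grows. The sketch below lists them programmatically on made-up data. The 0.75 cut-off and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric features (values are made up)
toy = pd.DataFrame({
    "INJURIES_TOTAL": [0, 1, 2, 3, 4, 5],
    "INJURIES_NON_INCAPACITATING": [0, 1, 2, 2, 4, 5],
    "CRASH_HOUR": [5, 23, 1, 14, 9, 12],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# Pairs whose absolute correlation exceeds the assumed cut-off
high_pairs = pairs[pairs > 0.75]
print(high_pairs)
```

Dropping one column from each listed pair is a simple way to reduce multicollinearity before fitting a linear model.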

### Model 1

In [None]:
X_train_tf.drop("INJURIES_NON_INCAPACITATING", axis=1, inplace=True)
X_test_tf.drop("INJURIES_NON_INCAPACITATING", axis=1, inplace=True)

In [None]:
X_train_lr = X_train_tf.copy()
X_test_lr = X_test_tf.copy()

In [None]:
lr1 = LogisticRegression()

In [None]:
lr1.fit(X_train_lr, y_train)

In [None]:
lr1.score(X_train_lr, y_train), lr1.score(X_test_lr, y_test)

In [None]:
y_train.value_counts(normalize=True)

In [None]:
y_test_pred = lr1.predict(X_test_lr)

In [None]:
plot_confusion(y_test, y_test_pred)
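
A baseline helps put the Model 1 scores in context. The sketch below fits a `DummyClassifier` that always predicts the most frequent class and reports its accuracy and macro F1 on made-up labels. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels standing in for y_train / y_test (values are made up)
y_tr = pd.Series(["DAYLIGHT"] * 8 + ["DARKNESS"] * 2)
y_te = pd.Series(["DAYLIGHT"] * 3 + ["DARKNESS"] * 1)
X_tr = pd.DataFrame({"x": range(10)})
X_te = pd.DataFrame({"x": range(4)})

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_hat = baseline.predict(X_te)

baseline_acc = accuracy_score(y_te, y_hat)
# Macro F1 averages the per-class F1 scores, so the ignored class pulls it down
baseline_macro_f1 = f1_score(y_te, y_hat, average="macro", zero_division=0)
print(baseline_acc, baseline_macro_f1)
```

Accuracy alone can look strong on imbalanced classes, so a model is only useful if it beats this baseline on a class-aware metric such as macro F1.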

### Feature Selection

In [None]:
dup = DropDuplicateFeatures(missing_values="raise")
dup.fit(X_train_lr)

In [None]:
dup.duplicated_feature_sets_

In [2]:
X_train_lr = dup.transform(X_train_lr)
X_test_lr = dup.transform(X_test_lr)


In [None]:
X_train_lr.shape, X_test_lr.shape
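
To illustrate what dropping duplicate features means, here is a plain pandas sketch on made-up data. It is not the `feature_engine` implementation, only an equivalent idea: a column counts as a duplicate when it matches an earlier column row for row.

```python
import pandas as pd

# Toy encoded features where two columns carry identical values (made up)
toy = pd.DataFrame({
    "WEATHER_CONDITION_RAIN": [0, 1, 0, 1],
    "ROADWAY_SURFACE_COND_WET": [0, 1, 0, 1],
    "CRASH_HOUR": [8, 17, 23, 2],
})

# Transpose so columns become rows, then flag rows that repeat an earlier one
is_duplicate = toy.T.duplicated()

# Keep only the first occurrence of each distinct column
deduplicated = toy.loc[:, ~is_duplicate]
print(list(deduplicated.columns))
```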

### Model 2

In [None]:
lr_2 = LogisticRegression()
lr_2.fit(X_train_lr, y_train)

lr