# Final Report

## **Imports** 

In [7]:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# importing .py modules 
import wrangle as w 


# pandas settings 
pd.set_option('display.max_columns', None)


## Introduction

This report summarizes the analysis of the Tex Wrex dataset. The analysis had two goals: to understand the factors that contribute to motorcycle crashes, and to build a model that predicts the severity of the injuries sustained in them.

## **Our Starting Data** 

In [3]:
df = w.acquire_motocycle_data()

In [8]:
df.head()

Unnamed: 0,crash_id,person_age,charge,person_ethnicity,crash_date,day_of_week,person_gender,person_helmet,driver_license_class,has_motocycle_endorsment,driver_license_state,driver_license_type,person_injury_severity_x,license_plate_state_x,vehicle_body_style_x,vehicle_color_x,vehicle_defect_1_x,vehicle_make_x,vehicle_model_name_x,vehicle_model_year_x,license_plate_state_y,vehicle_body_style_y,vehicle_color_y,vehicle_defect_1_y,vehicle_make_y,vehicle_model_name_y,vehicle_model_year_y,person_injury_severity_y
0,16189632.0,37.0,operate unregistered motor vehicle,w - white,2018-01-01,monday,1 - male,1 - not worn,c - class c,0,tx - texas,1 - driver license,a - suspected serious injury,tx - texas,mc - motorcycle,blu - blue,,other (explain in narrative),other (explain in narrative) (other (explain i...,,TX - TEXAS,MC - MOTORCYCLE,BLU - BLUE,no data,OTHER (EXPLAIN IN NARRATIVE),OTHER (EXPLAIN IN NARRATIVE) (OTHER (EXPLAIN I...,no data,A - SUSPECTED SERIOUS INJURY
1,16203470.0,30.0,"no class ""m"" license",h - hispanic,2018-01-04,thursday,1 - male,"3 - worn, not damaged",c - class c,0,tx - texas,1 - driver license,b - suspected minor injury,tx - texas,mc - motorcycle,gry - gray,,suzuki,gsx-r600 (suzuki),2004.0,TX - TEXAS,MC - MOTORCYCLE,GRY - GRAY,no data,SUZUKI,GSX-R600 (SUZUKI),2004,B - SUSPECTED MINOR INJURY
2,16192023.0,21.0,no charges,w - white,2018-01-05,friday,1 - male,"2 - worn, damaged",c - class c,0,tx - texas,1 - driver license,a - suspected serious injury,tx - texas,mc - motorcycle,blu - blue,,yamaha,yzfr6 (yamaha),2017.0,TX - TEXAS,MC - MOTORCYCLE,BLU - BLUE,no data,YAMAHA,YZFR6 (YAMAHA),2017,A - SUSPECTED SERIOUS INJURY
3,16196720.0,18.0,no driver license no insurance,h - hispanic,2018-01-05,friday,1 - male,1 - not worn,5 - unlicensed,0,tx - texas,4 - id card,b - suspected minor injury,tx - texas,mc - motorcycle,blu - blue,,yamaha,rz500 (yamaha),2002.0,TX - TEXAS,MC - MOTORCYCLE,BLU - BLUE,no data,YAMAHA,RZ500 (YAMAHA),2002,B - SUSPECTED MINOR INJURY
4,16189103.0,28.0,no charges,w - white,2018-01-06,saturday,1 - male,"3 - worn, not damaged",cm - class c and m,1,tx - texas,1 - driver license,b - suspected minor injury,tx - texas,mc - motorcycle,blk - black,,harley-davidson,fxdf (harley-davidson),2009.0,TX - TEXAS,MC - MOTORCYCLE,BLK - BLACK,no data,HARLEY-DAVIDSON,FXDF (HARLEY-DAVIDSON),2009,B - SUSPECTED MINOR INJURY



### **Key takeaways from original data:**
* Unnecessary data: Remove unnecessary columns to replicate the structure of the target CSV files.
* Column names: Clean and replace column names to match the desired structure.
* Missing values: Drop rows with missing values in the 'crash_id' column to ensure all crash identifiers are present.
* Data types: Convert the 'crash_id' and 'person_age' columns to integers for consistent data types.
* Saving filtered dataset: Save the filtered dataset to corresponding CSV files ('master_modeling.csv', 'master_modeling_updated.csv', 'master_modeling_updated1.csv').
* Further filtering: Remove the 'vehicle_defect_1' column as it is deemed unnecessary.
* Data cleaning: Perform data cleaning operations on the 'vehicle_make' and 'vehicle_model_name' columns.

## **Data Preprocessing**

The data preprocessing stage involved several steps to prepare the data for analysis and modeling. These steps included:

- Reading and combining multiple CSV files into a single DataFrame.
- Standardizing column names and text within the DataFrame.
- Filtering the data to include only single-motorcycle crash incidents.
- Combining certain values in the 'person_injury_severity' column for a more accurate target variable.
- Encoding categorical variables using one-hot encoding.
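
As a rough illustration of the standardizing, target-combining, and one-hot encoding steps listed above, here is a minimal sketch in pandas. It is not the code in `wrangle.py`: the tiny `raw` frame, the `standardize` helper, the "2 - FEMALE" and "N - NOT INJURED" values, and the rule that maps every injured category to 1 are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the raw data; values are illustrative only.
raw = pd.DataFrame({
    "Crash ID": [1, 2, 3],
    "Person Injury Severity": [
        "A - SUSPECTED SERIOUS INJURY",
        "B - SUSPECTED MINOR INJURY",
        "N - NOT INJURED",
    ],
    "Person Gender": ["1 - MALE", "2 - FEMALE", "1 - MALE"],
})

# Assumed list of text columns to normalize.
TEXT_COLUMNS = ["person_injury_severity", "person_gender"]


def standardize(df):
    """Return a copy with snake_case column names and lower-cased text."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    for col in TEXT_COLUMNS:
        out[col] = out[col].str.strip().str.lower()
    return out


tidy = standardize(raw)

# Assumed grouping: any recorded injury becomes 1, "not injured" becomes 0.
tidy["injury_binary"] = (
    tidy["person_injury_severity"] != "n - not injured"
).astype(int)

# One-hot encode a categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(tidy, columns=["person_gender"], dtype=int)
print(encoded.columns.tolist())
```

Standardizing the text before building the target keeps comparisons such as the `"n - not injured"` check from silently failing on mixed-case values.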

In [9]:
df_cleaned = w.prepare_third_filtered_dataset_version()

In [11]:
df_cleaned.head()

Unnamed: 0,crash_id,person_age,person_ethnicity,person_gender,has_motocycle_endorsment,person_injury_severity,vehicle_body_style,vehicle_color,vehicle_make,vehicle_model,vehicle_model_year,vehicle_make_country,injury_binary
0,16189632,37,w - white,1 - male,0,a - suspected serious injury,mc - motorcycle,blu - blue,harley-davidson,fld,2007,USA,1
1,16203470,30,h - hispanic,1 - male,0,b - suspected minor injury,mc - motorcycle,gry - gray,suzuki,gsx-r600,2004,Japan,1
2,16192023,21,w - white,1 - male,0,a - suspected serious injury,mc - motorcycle,blu - blue,yamaha,yzfr6,2017,Japan,1
3,16196720,18,h - hispanic,1 - male,0,b - suspected minor injury,mc - motorcycle,blu - blue,yamaha,rz500,2002,Japan,1
4,16189103,28,w - white,1 - male,1,b - suspected minor injury,mc - motorcycle,blk - black,harley-davidson,fxdf,2009,USA,1


* The code filters and manipulates datasets to replicate the structure of specific target CSV files.
* Unnecessary columns are removed from the dataset.
* Column names are cleaned and replaced.
* Rows with missing values in the 'crash_id' column are dropped.
* Certain columns are selected to create a filtered dataset.
* Missing values in the remaining columns are filled with the mode.
* The 'crash_id' and 'person_age' columns are converted to integers.
* The filtered datasets are saved to corresponding CSV files.
* Additional data cleaning operations are performed on the datasets.
* Specific values in certain columns are replaced with the most frequent values.
* New columns are added based on existing columns.
* Missing values in certain columns are filled with predefined values.
* The updated datasets are saved to corresponding CSV files.
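
Several of the steps above (dropping rows without a `crash_id`, filling the remaining gaps with the mode, and casting to integers) can be sketched as follows. This is a hedged illustration, not the project's implementation: the `sample` frame and the `clean` helper are hypothetical, and the real pipeline may select columns or handle mode ties differently.

```python
import pandas as pd

# Hypothetical sample with gaps; the real data comes from the wrangle module.
sample = pd.DataFrame({
    "crash_id": [16189632.0, 16203470.0, None, 16196720.0],
    "person_age": [37.0, 37.0, 25.0, None],
    "vehicle_color": ["blu - blue", None, "gry - gray", "blu - blue"],
})


def clean(df):
    """Drop rows lacking crash_id, mode-fill other gaps, cast to integers."""
    out = df.dropna(subset=["crash_id"]).copy()
    for col in out.columns.drop("crash_id"):
        # mode() can return several values on a tie; take the first one.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out.astype({"crash_id": "int64", "person_age": "int64"})


cleaned = clean(sample)
print(cleaned)
```

The integer cast comes last because a column that still contains missing values cannot be converted to a plain integer dtype.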

## **Exploratory Data Analysis**

The exploratory data analysis (EDA) stage involved examining the data to understand its structure, distribution, and relationships among variables. Key findings from the EDA include:

- The majority of crashes involved male drivers.
- Drivers aged 20 to 29 were the age group most often involved in crashes.
- Most crashes occurred during clear weather conditions.
- The most common type of injury was 'non-incapacitating'.
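
Findings like the age-group and gender breakdowns above are typically derived by binning and counting. The sketch below shows one way to do that with `pd.cut` and `value_counts`. It uses only the five ages and genders visible in the `head()` output, so it illustrates the mechanics and says nothing about the full distribution; the bin edges and labels are assumptions.

```python
import pandas as pd

# The five rows shown in df_cleaned.head(); far too few to draw conclusions.
people = pd.DataFrame({
    "person_age": [37, 30, 21, 18, 28],
    "person_gender": ["1 - male"] * 5,
})

# Assumed ten-year bins; right=False makes each bin include its lower edge.
bins = [0, 20, 30, 40, 50, 120]
labels = ["under 20", "20-29", "30-39", "40-49", "50+"]
people["age_group"] = pd.cut(
    people["person_age"], bins=bins, labels=labels, right=False
)

# Share of rows per category.
age_share = people["age_group"].value_counts(normalize=True)
gender_share = people["person_gender"].value_counts(normalize=True)
print(age_share)
print(gender_share)
```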

## Modeling

The modeling stage involved building and evaluating several machine learning models to predict the severity of injuries in motorcycle crashes. The models used included:

- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting

Each model was evaluated using cross-validation and the area under the receiver operating characteristic (ROC) curve (AUC-ROC). The model with the highest AUC-ROC was selected as the final model.
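
The selection procedure described above could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification` rather than the crash data, and the fold count, hyperparameters, and random seeds are assumptions chosen for illustration, not the settings used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the prepared features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The four model families named in the report, with assumed settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean AUC-ROC over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in models.items()
}

# Keep the model with the highest mean AUC-ROC.
best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Selected:", best_name)
```

Averaging the score across folds gives a steadier basis for comparison than a single train/test split, which matters when the candidate models score close to one another.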

## Conclusion

The analysis of the Tex Wrex dataset provided insight into the factors that contribute to motorcycle crashes and to the severity of the resulting injuries. The predictive model built as part of this analysis can estimate injury severity in future crashes, which can support emergency-response planning and the design of interventions that reduce injury severity.