<a href="https://colab.research.google.com/github/vanepsm/cs5262-cyclist-crashes-nyc/blob/main/new_york_city_cyclist_accidents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Investigating cyclist accidents in New York City

## Background

### Introduction
**The objective of this project is to identify factors contributing to cyclist injuries and fatalities.**

With the increasing popularity of bicycles and e-bikes, it's imperative for drivers, cyclists, pedestrians, and policymakers to recognize the risks associated with cyclists sharing our public roadways and to minimize cyclist injuries and fatalities.

### Literature review

The [National Safety Council](https://injuryfacts.nsc.org/home-and-community/safety-topics/bicycle-deaths/) reports a 37% increase in preventable cycling deaths over the past decade; these deaths make up 2% of all motor vehicle fatalities. Understanding the factors contributing to these incidents is crucial for reducing cyclist deaths and injuries.

Additionally, e-bikes are gaining popularity, with sales surpassing those of electric cars. The [New York Times](https://www.nytimes.com/2021/11/08/business/e-bikes-urban-transit.html), [PBS](https://www.pbs.org/newshour/show/e-bike-popularity-is-surging-creating-regulatory-challenges-on-u-s-roads), and many other publications have highlighted the rise of e-bikes and the accompanying safety challenges. This new wave of cyclists introduces distinct dynamics on the roads.

The rising number of accidents involving motor vehicles and cyclists is a growing problem worldwide. As transportation solutions evolve, it's essential to gain insights into these trends to effectively address the emerging challenges.


### Challenges
Motor vehicle accidents are complicated events, and accidents between motor vehicles and bicycles are even more complex. Many data points that we would like to know about cyclist fatalities are not recorded. Was the cyclist riding on the road, sidewalk or bike lane? Was the cyclist wearing a helmet? Were adverse weather conditions present? None of these data points are reported in the New York City vehicle collision dataset.

Other data elements that do exist, like contributing factors to the accident or vehicle types, have dozens (maybe hundreds) of possible values. How can we analyze the data in a way that separates the signal from the noise? I'm hoping this project can help do just that.

## Project Description

### Topic
This project focuses on motor vehicle crashes that involve cyclist injuries or deaths, and attempts to determine which factors are correlated with those injuries and fatalities. By training models on this dataset I hope to learn patterns, relationships and structures from the data. I also hope to be able to enhance decision making for cyclists, motorists and policymakers.

This is also my first machine learning project, so I hope to actually learn how to do all those cool things I just said.



### Data
This project seeks to gain insights into vehicle accidents involving cyclists, using records extracted from the New York City vehicle collisions dataset.
> https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

A subset of the data pertaining only to cyclist accidents has been extracted and is available here:
> https://drive.google.com/file/d/1CFaRXe3Y6PWHpYOoGD7qih6-1oWzUi0e/view

#### Data sub-selection criteria
The original dataset contains over two million rows and covers a broad spectrum of vehicle collisions. To narrow it down to a size appropriate for CS5262, I did the following:
- Import the original CSV data to a relational database for refinement.
- Identify a subset of rows:
  - Include only the rows with cyclist injuries or fatalities
  - Remove all rows with more than 3 vehicles involved in the crash
  - Remove all rows with a null or empty zip code
- Identify a subset of columns:
  - Exclude all columns with details about a 4th or 5th vehicle, as few crashes involve that many vehicles and these columns would add too many columns to our dataset
  - Exclude duplicate latitude/longitude columns
- Export the subset data to a new CSV for use in this project.
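
The criteria above were applied in a relational database. As a rough equivalent, the same filters can be sketched in pandas. This is a minimal sketch, not the query actually used: the `raw` frame holds made-up rows, it assumes the raw columns have already been renamed to this project's hyphenated names, it assumes a hypothetical `location` column is the duplicate of latitude/longitude, and it assumes that "more than 3 vehicles" can be detected by a non-null `vehicle-type-code-4` or `vehicle-type-code-5`.

```python
import pandas as pd

# Made-up stand-in for the raw collisions table (values are illustrative only).
raw = pd.DataFrame({
    "collision-id": [1, 2, 3, 4],
    "zip-code": ["11217", None, "10019", "10022"],
    "cyclists-injured": [1, 1, 0, 1],
    "cyclists-killed": [0, 0, 0, 0],
    "location": ["(40.68, -73.97)"] * 4,  # hypothetical duplicate of lat/long
    "vehicle-type-code-1": ["Sedan", "Sedan", "Taxi", "Bus"],
    "vehicle-type-code-4": [None, None, None, "Sedan"],
    "vehicle-type-code-5": [None, None, None, None],
})


def select_cyclist_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row and column criteria listed above (assumed interpretation)."""
    # Rows: at least one cyclist injured or killed.
    cyclist_hurt = (df["cyclists-injured"] > 0) | (df["cyclists-killed"] > 0)
    # Rows: no 4th or 5th vehicle recorded (assumption about how to count vehicles).
    at_most_three = df[["vehicle-type-code-4", "vehicle-type-code-5"]].isna().all(axis=1)
    # Rows: zip code is neither null nor an empty string.
    has_zip = df["zip-code"].fillna("").astype(str).str.strip() != ""

    rows = df.loc[cyclist_hurt & at_most_three & has_zip]

    # Columns: drop everything about a 4th or 5th vehicle, plus the duplicate location.
    extra_cols = [c for c in df.columns if c.endswith(("-4", "-5"))] + ["location"]
    return rows.drop(columns=extra_cols, errors="ignore").reset_index(drop=True)


subset = select_cyclist_subset(raw)
print(subset)
```

Only the first made-up row survives all three row filters; the others fail on zip code, cyclist involvement, or vehicle count respectively.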

## Performance Metrics


### Binary Classification
- **Injured vs. Killed.** I would like to classify observations into two classes: "Cyclist Injured" and "Cyclist Killed".
- **E-bike or not?** I would like to classify e-bikes vs. traditional bikes, as the effects of the recent rise in e-bike popularity may be largely unexplored.
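
Both tasks need a binary target column. The following is a minimal sketch of how such targets might be derived. It uses made-up rows, the new column names (`cyclist-killed-flag`, `is-ebike`) are hypothetical, and it assumes e-bikes can be recognized by a vehicle type containing "e-bike" or "ebike"; the actual spellings in the dataset should be checked before relying on this pattern.

```python
import pandas as pd

# Made-up rows using the column names from this project's CSV.
crashes = pd.DataFrame({
    "cyclists-injured": [1, 0, 2],
    "cyclists-killed": [0, 1, 0],
    "vehicle-type-code-1": ["Sedan", "E-Bike", "Taxi"],
    "vehicle-type-code-2": ["Bike", None, "e-bike"],
    "vehicle-type-code-3": [None, None, None],
})

# Target 1: 1 when at least one cyclist was killed, 0 when cyclists were only injured.
crashes["cyclist-killed-flag"] = (crashes["cyclists-killed"] > 0).astype(int)

# Target 2: 1 when any of the three vehicle types looks like an e-bike.
# ASSUMPTION: this pattern is a guess at how e-bikes are spelled in the data.
EBIKE_PATTERN = r"e-?bike"
vehicle_cols = [f"vehicle-type-code-{i}" for i in (1, 2, 3)]
looks_like_ebike = crashes[vehicle_cols].apply(
    lambda col: col.fillna("").astype(str).str.contains(EBIKE_PATTERN, case=False)
)
crashes["is-ebike"] = looks_like_ebike.any(axis=1).astype(int)

print(crashes[["cyclist-killed-flag", "is-ebike"]])
```

If fatal crashes turn out to be far rarer than injury crashes, accuracy alone would be misleading, and the per-class precision and recall from `classification_report` would likely be more informative.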

### Clustering
I am hoping to extract meaningful patterns from these data, some of which may include:
- **Physical location.** Intuition tells us that some locations, such as intersections, curves, and congested spaces, are more dangerous for cyclists than others. What role does location play in injuries and fatalities?
- **Time.** It seems likely that certain times will see more accidents than others. Around what times do fatalities tend to cluster?
- **Types of vehicles involved.** It's possible that certain vehicle types are involved in more accidents overall, or that their accidents result in death more often than injury. Which vehicle types are most often involved in cyclist injuries? Which in cyclist fatalities?
- **Types of bikes.** Do newer alternative types of bicycles like the e-bike pose a greater risk for death? What is the breakdown of e-bike deaths and injuries versus traditional bikes?
- **Contributing factors for the accident.** What types of contributing factors are more likely to result in injury or death: distracted driving, improper use of passing lanes, alcohol intoxication, or something else?
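
For the physical-location question, one possible starting point is k-means clustering on the coordinates. The sketch below is illustrative only: it uses the coordinates of the five preview rows of the project's CSV plus one made-up row with missing coordinates, the choice of k-means and of two clusters is an assumption rather than the method this project settles on, and the new column name `location-cluster` is hypothetical. Treating latitude and longitude as plane coordinates is also only a rough approximation, though a tolerable one over an area the size of a city.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Coordinates of the five preview rows, plus one made-up row without coordinates
# (some rows in the dataset have no latitude/longitude).
points = pd.DataFrame({
    "latitude": [40.687534, 40.767242, 40.640835, 40.76175, 40.736614, None],
    "longitude": [-73.9775, -73.986206, -73.98967, -73.96899, -73.9951, None],
})

# k-means cannot handle missing values, so drop rows without coordinates first.
located = points.dropna(subset=["latitude", "longitude"]).copy()

# ASSUMPTION: two clusters, chosen only to keep the example small.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
located["location-cluster"] = model.fit_predict(located[["latitude", "longitude"]])

# Number of crashes assigned to each cluster.
print(located.groupby("location-cluster").size())
```

On the full dataset the number of clusters would need to be chosen more carefully, for example by comparing inertia across several values of `n_clusters`.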

## Data Glossary

It's often helpful to have a reference for data fields and enumerations when doing data analysis. I'm adding some of that information here.

### Data Fields

| Field | Description |
|-------|-------------|
|collision-id| A unique identifier for the collision record. |
|crash-date| The day (yyyy-mm-dd) of the accident.|
|crash-time| The time (24 hour) of the accident.|
|zip-code| The ZIP code of the location where the accident occurred. |
|latitude| Latitude coordinate of the accident. |
|longitude| Longitude coordinate of the accident. |
|on-street-name| The name of the street where the accident took place. |
|cross-street-name| If the accident occurred at an intersection, the cross street will appear here. |
|off-street-name| If the accident occurred along a street rather than at an intersection, the street address will appear in this field. |
|persons-injured| The total number of people injured in the accident. |
|persons-killed| The total number of people killed in the accident. |
|pedestrians-injured| The number of pedestrians injured in the accident. |
|pedestrians-killed| The number of pedestrians killed in the accident. |
|cyclists-injured| The number of cyclists injured in the accident. |
|cyclists-killed| The number of cyclists killed in the accident. |
|motorists-injured| The number of motorists injured in the accident. |
|motorists-killed| The number of motorists killed in the accident. |
|contributing-factor-vehicle-1| Contributing factor description for the first vehicle involved in the crash. (See Enumerations) |
|contributing-factor-vehicle-2| Contributing factor description for the second vehicle involved in the crash. (See Enumerations) |
|contributing-factor-vehicle-3| Contributing factor description for the third vehicle involved in the crash. (See Enumerations) |
|vehicle-type-code-1| The type of the first vehicle involved in the crash. |
|vehicle-type-code-2| The type of the second vehicle involved in the crash. |
|vehicle-type-code-3| The type of the third vehicle involved in the crash. |
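
Because `crash-date` and `crash-time` are separate fields, time-based analysis is easier after combining them into one timestamp. A small sketch, using dates and times taken from the dataset preview; the new column names (`crash-timestamp`, `crash-hour`, `crash-weekday`) are hypothetical:

```python
import pandas as pd

# Dates and times from three preview rows of the dataset.
sample = pd.DataFrame({
    "crash-date": ["2021-12-14", "2022-04-24", "2021-12-09"],
    "crash-time": ["12:54", "15:35", "23:15"],
})

# Combine the yyyy-mm-dd date and the 24-hour time into a single timestamp.
sample["crash-timestamp"] = pd.to_datetime(
    sample["crash-date"] + " " + sample["crash-time"],
    format="%Y-%m-%d %H:%M",
)

# Derived features that could feed the time-related questions.
sample["crash-hour"] = sample["crash-timestamp"].dt.hour
sample["crash-weekday"] = sample["crash-timestamp"].dt.day_name()

print(sample)
```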

### Enumerations

|Contributing Factors|
|-|
|NULL|
|"1"|
|"80"|
|"Accelerator Defective"|
|"Aggressive Driving/Road Rage"|
|"Alcohol Involvement"|
|"Animals Action"|
|"Backing Unsafely"|
|"Brakes Defective"|
|"Cell Phone (hand-held)"|
|"Cell Phone (hand-Held)"|
|"Cell Phone (hands-free)"|
|"Driver Inattention/Distraction"|
|"Driver Inexperience"|
|"Driverless/Runaway Vehicle"|
|"Drugs (illegal)"|
|"Drugs (Illegal)"|
|"Eating or Drinking"|
|"Failure to Keep Right"|
|"Failure to Yield Right-of-Way"|
|"Fatigued/Drowsy"|
|"Fell Asleep"|
|"Following Too Closely"|
|"Glare"|
|"Headlights Defective"|
|"Illnes"|
|"Illness"|
|"Lane Marking Improper/Inadequate"|
|"Listening/Using Headphones"|
|"Lost Consciousness"|
|"Obstruction/Debris"|
|"Other Electronic Device"|
|"Other Lighting Defects"|
|"Other Vehicular"|
|"Outside Car Distraction"|
|"Oversized Vehicle"|
|"Passenger Distraction"|
|"Passing or Lane Usage Improper"|
|"Passing Too Closely"|
|"Pavement Defective"|
|"Pavement Slippery"|
|"Pedestrian/Bicyclist/Other Pedestrian Error/Confusion"|
|"Physical Disability"|
|"Prescription Medication"|
|"Reaction to Other Uninvolved Vehicle"|
|"Reaction to Uninvolved Vehicle"|
|"Shoulders Defective/Improper"|
|"Steering Failure"|
|"Texting"|
|"Tinted Windows"|
|"Tire Failure/Inadequate"|
|"Tow Hitch Defective"|
|"Traffic Control Device Improper/Non-Working"|
|"Traffic Control Disregarded"|
|"Turning Improperly"|
|"Unsafe Lane Changing"|
|"Unsafe Speed"|
|"Unspecified"|
|"Using On Board Navigation Device"|
|"Vehicle Vandalism"|
|"View Obstructed/Limited"|
|"Windshield Inadequate"|
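
Several values in the list above differ only by letter case or spelling (for example "Illnes" and "Illness"), and a few look like placeholders ("1", "80"). One way to collapse them before modeling is sketched below. The mapping is an assumption, not something confirmed by the data source: it treats "Reaction to Other Uninvolved Vehicle" as the same factor as "Reaction to Uninvolved Vehicle", and treats NULL, "1", and "80" as unspecified. The helper `normalize_factor` is hypothetical.

```python
import pandas as pd

# A handful of the raw values from the enumeration above.
factors = pd.Series([
    "Cell Phone (hand-held)", "Cell Phone (hand-Held)",
    "Drugs (illegal)", "Drugs (Illegal)",
    "Illnes", "Illness",
    "Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle",
    "1", "80", None,
])

# ASSUMPTION: these pairs describe the same factor.
CORRECTIONS = {
    "illnes": "illness",
    "reaction to other uninvolved vehicle": "reaction to uninvolved vehicle",
}
# ASSUMPTION: these values carry no information about the cause.
UNKNOWN = {"1", "80", "unspecified"}


def normalize_factor(value):
    """Lower-case, trim, and merge known variants of a contributing factor."""
    if pd.isna(value):
        return "unspecified"
    key = str(value).strip().lower()
    key = CORRECTIONS.get(key, key)
    return "unspecified" if key in UNKNOWN else key


cleaned = factors.map(normalize_factor)
print(cleaned.value_counts())
```

Lower-casing alone merges the case variants; the explicit `CORRECTIONS` table is only needed for misspellings and near-duplicates.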

In [None]:
#tables and visualizations
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Load data
Here we load the data into Python with pandas and read it into a DataFrame.

In [None]:
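# Load data from Google Drive. The sharing link cannot be read directly,
# so build a direct-download URL from the file ID embedded in the link.
share_url = 'https://drive.google.com/file/d/1CFaRXe3Y6PWHpYOoGD7qih6-1oWzUi0e/view?usp=sharing'
file_id = share_url.split('/')[-2]  # second-to-last path segment is the file ID
url = 'https://drive.google.com/uc?id=' + file_id
spreadsheet = pd.read_csv(url)
display(spreadsheet.head())  # preview the first five rows
spreadsheet.info()           # column names, non-null counts, and dtypes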


|  | crash-date | crash-time | zip-code | latitude | longitude | on-street-name | cross-street-name | off-street-name | persons-injured | persons-killed | ... | cyclists-killed | motorists-injured | motorists-killed | contributing-factor-vehicle-1 | contributing-factor-vehicle-2 | contributing-factor-vehicle-3 | collision-id | vehicle-type-code-1 | vehicle-type-code-2 | vehicle-type-code-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-12-14 | 12:54 | 11217 | 40.687534 | -73.9775 | FULTON STREET | SAINT FELIX STREET |  | 1 | 0 | ... | 0 | 0 | 0 | Unspecified | Unspecified |  | 4487052 | Sedan | Bike |  |
| 1 | 2022-04-24 | 15:35 | 10019 | 40.767242 | -73.986206 | WEST 56 STREET | 9 AVENUE |  | 1 | 0 | ... | 0 | 0 | 0 | View Obstructed/Limited | Unspecified |  | 4521853 | Station Wagon/Sport Utility Vehicle | Bike |  |
| 2 | 2021-12-09 | 23:15 | 11218 | 40.640835 | -73.98967 | 12 AVENUE | 41 STREET |  | 1 | 0 | ... | 0 | 0 | 0 | Driver Inattention/Distraction | Driver Inattention/Distraction |  | 4485355 | Sedan | Bike |  |
| 3 | 2021-12-08 | 19:30 | 10022 | 40.76175 | -73.96899 |  |  | 127 EAST 58 STREET | 1 | 0 | ... | 0 | 0 | 0 | Following Too Closely | Reaction to Uninvolved Vehicle |  | 4484852 | Station Wagon/Sport Utility Vehicle | Bike |  |
| 4 | 2021-12-08 | 12:00 | 10011 | 40.736614 | -73.9951 |  |  | 44 WEST 14 STREET | 1 | 0 | ... | 0 | 0 | 0 | Passing or Lane Usage Improper | Unspecified |  | 4485542 | Box Truck | Bike |  |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38469 entries, 0 to 38468
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash-date                     38469 non-null  object 
 1   crash-time                     38469 non-null  object 
 2   zip-code                       38469 non-null  int64  
 3   latitude                       37610 non-null  float64
 4   longitude                      37610 non-null  float64
 5   on-street-name                 31904 non-null  object 
 6   cross-street-name              31903 non-null  object 
 7   off-street-name                6565 non-null   object 
 8   persons-injured                38469 non-null  int64  
 9   persons-killed                 38469 non-null  int64  
 10  pedestrians-injured            38469 non-null  int64  
 11  pedestrians-killed             38469 non-null  int64  
 12  cyclists-injured               38469 non-null 