# Introduction to the dataset

A summary of the columns and the data they represent:

`DATE`: Date of the collision.  
`TIME`: Time of the collision.  
`BOROUGH`: The borough, or area of New York City, where the collision occurred.  
`ZIP CODE`: ZIP code of the area where the collision occurred.  
`LATITUDE`: Latitude of the place where the collision occurred.  
`LONGITUDE`: Longitude of the place where the collision occurred.  
`LOCATION`: Latitude and longitude coordinates for the collision.  
`ON STREET NAME`, `CROSS STREET NAME`, `OFF STREET NAME`: Details of the street or intersection where the collision occurred.  
`NUMBER OF PERSONS INJURED`: Total number of people injured.   
`NUMBER OF PERSONS KILLED`: Total number of people killed.  
`NUMBER OF PEDESTRIANS INJURED`: Number of pedestrians who were injured.   
`NUMBER OF PEDESTRIANS KILLED`: Number of pedestrians who were killed.  
`NUMBER OF CYCLIST INJURED`: Number of people on a bike who were injured.   
`NUMBER OF CYCLIST KILLED`:  Number of people on a bike who were killed.  
`NUMBER OF MOTORIST INJURED`: Number of people in a vehicle who were injured.  
`NUMBER OF MOTORIST KILLED`: Number of people in a vehicle who were killed.  
`CONTRIBUTING FACTOR VEHICLE 1` through `CONTRIBUTING FACTOR VEHICLE 5`: Contributing factor for each vehicle in the accident.  
`UNIQUE KEY`: A unique identifier for each collision.  
`VEHICLE TYPE CODE 1` through `VEHICLE TYPE CODE 5`: Type of each vehicle involved in the accident.

The dataset is stored in a CSV file called `nyc_vehicle_collisions.csv`. Let's read it into a pandas DataFrame and inspect the first few rows:

In [1]:
import pandas as pd

nyc_2018 = pd.read_csv("data/nyc_vehicle_collisions.csv")
pd.options.display.max_columns = 30  # to avoid truncated output
nyc_2018.head(5)

Unnamed: 0,DATE,TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,UNIQUE KEY,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,01/01/2018,9:49,BROOKLYN,11206.0,40.701878,-73.92986,"(40.701878, -73.92986)",,,257 MELROSE STREET,0.0,0.0,0,0,0,0,0,0,Unspecified,,,,,3820878,PASSENGER VEHICLE,,,,
1,01/01/2018,9:45,QUEENS,11354.0,40.76242,-73.82752,"(40.76242, -73.82752)",37 AVENUE,UNION STREET,,0.0,0.0,0,0,0,0,0,0,Failure to Yield Right-of-Way,Passing Too Closely,,,,3820398,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,
2,01/01/2018,9:45,QUEENS,11004.0,40.74351,-73.717415,"(40.74351, -73.717415)",80 AVENUE,LITTLE NECK PARKWAY,,0.0,0.0,0,0,0,0,0,0,Traffic Control Disregarded,Unspecified,,,,3818788,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,
3,01/01/2018,9:30,QUEENS,11367.0,40.72992,-73.83308,"(40.72992, -73.83308)",PARK DRIVE EAST,136 STREET,,1.0,0.0,0,0,0,0,1,0,Pavement Slippery,,,,,3823401,PASSENGER VEHICLE,,,,
4,01/01/2018,9:30,,,40.583412,-73.9713,"(40.583412, -73.9713)",BELT PARKWAY,,,0.0,0.0,0,0,0,0,0,0,Aggressive Driving/Road Rage,,,,,3818946,PASSENGER VEHICLE,,,,


# Sampling from the dataset

In [4]:
nyc_2018.shape

(231496, 29)

Because our dataset is relatively large, with over 230,000 rows, we will use [Simple Random Sampling]() to reduce its size for analysis. Since this project is meant to showcase skills and won't be used to make any meaningful decisions, we won't take the structure of the data into account (as we would with, for example, [Stratified Sampling]() or [Cluster Sampling]()).
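As a rough illustration of simple random sampling in pandas, here is a minimal sketch using `DataFrame.sample`. The `toy` frame is a made-up stand-in for `nyc_2018` so the snippet runs on its own, and the sample size and `random_state` are arbitrary choices for illustration, not values used in this project.

```python
import pandas as pd

# Toy stand-in for nyc_2018 (hypothetical data, only so the snippet is self-contained).
toy = pd.DataFrame({
    "UNIQUE KEY": range(1000),
    "BOROUGH": ["QUEENS", "BROOKLYN", None, "BRONX"] * 250,
})

# Simple random sampling: every row has the same chance of being selected,
# and rows are drawn without replacement by default.
# n=100 and random_state=1 are illustrative assumptions.
sample = toy.sample(n=100, random_state=1)

print(sample.shape)  # (100, 2)
```

Fixing `random_state` makes the draw reproducible, which helps when the notebook is rerun.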

In [2]:
# Number of rows with a borough label (count() ignores missing values)
boroughs = nyc_2018["BOROUGH"].count()
# Fraction of rows with no borough label
null_boroughs = nyc_2018["BOROUGH"].isnull().sum() / nyc_2018.shape[0]
print("Labeled boroughs and % of unlabeled boroughs: ", boroughs, null_boroughs*100)

Labeled boroughs and % of unlabeled boroughs:  149181 35.557849811659814
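
To make the numbers above concrete, the arithmetic can be checked by hand. This small sketch only reuses the row count from `nyc_2018.shape` and the labeled count printed above; the variable names are illustrative.

```python
total_rows = 231496   # from nyc_2018.shape
labeled = 149181      # rows with a BOROUGH value, from the output above

# Rows without a borough label, and their share of the whole dataset
unlabeled = total_rows - labeled
pct_unlabeled = round(unlabeled / total_rows * 100, 2)

print(unlabeled, pct_unlabeled)  # 82315 35.56
```

In other words, a bit more than a third of the collisions have no borough recorded.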


In [16]:
nyc_2018.shape

(231496, 29)