# Introduction to the dataset

A summary of the columns and the data they represent:

`DATE`: Date of the collision.  
`TIME`: Time of the collision.  
`BOROUGH`: The borough, or area of New York City, where the collision occurred.  
`ZIP CODE`: Zip code of the area where the collision occurred.  
`LATITUDE`: Latitude of the place where the collision occurred.  
`LONGITUDE`: Longitude of the place where the collision occurred.  
`LOCATION`: Latitude and longitude coordinates for the collision.  
`ON STREET NAME`, `CROSS STREET NAME`, `OFF STREET NAME`: Details of the street or intersection where the collision occurred.  
`NUMBER OF PERSONS INJURED`: Total number of people injured.   
`NUMBER OF PERSONS KILLED`: Total number of people killed.  
`NUMBER OF PEDESTRIANS INJURED`: Number of pedestrians who were injured.   
`NUMBER OF PEDESTRIANS KILLED`: Number of pedestrians who were killed.  
`NUMBER OF CYCLIST INJURED`: Number of people on a bike who were injured.   
`NUMBER OF CYCLIST KILLED`:  Number of people on a bike who were killed.  
`NUMBER OF MOTORIST INJURED`: Number of people in a vehicle who were injured.  
`NUMBER OF MOTORIST KILLED`: Number of people in a vehicle who were killed.  
`CONTRIBUTING FACTOR VEHICLE 1` through `CONTRIBUTING FACTOR VEHICLE 5`: Contributing factor for each vehicle in the accident.  
`UNIQUE KEY`: A unique identifier for each collision.  
`VEHICLE TYPE CODE 1` through `VEHICLE TYPE CODE 5`: Type of each vehicle involved in the accident.

Imports and constants

In [1]:
import pandas as pd

data_path = "data/nyc_vehicle_collisions.csv"
sample_path = "data/nyc_sample.csv"

The dataset is in a CSV file called `nyc_vehicle_collisions.csv`. Let's read the data into a pandas DataFrame and inspect its first few rows as well as its size:

In [2]:
nyc_2018 = pd.read_csv(data_path)
nyc_2018.head(5)

Unnamed: 0,DATE,TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,UNIQUE KEY,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,01/01/2018,9:49,BROOKLYN,11206.0,40.701878,-73.92986,"(40.701878, -73.92986)",,,257 MELROSE STREET,...,,,,,3820878,PASSENGER VEHICLE,,,,
1,01/01/2018,9:45,QUEENS,11354.0,40.76242,-73.82752,"(40.76242, -73.82752)",37 AVENUE,UNION STREET,,...,Passing Too Closely,,,,3820398,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,
2,01/01/2018,9:45,QUEENS,11004.0,40.74351,-73.717415,"(40.74351, -73.717415)",80 AVENUE,LITTLE NECK PARKWAY,,...,Unspecified,,,,3818788,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,
3,01/01/2018,9:30,QUEENS,11367.0,40.72992,-73.83308,"(40.72992, -73.83308)",PARK DRIVE EAST,136 STREET,,...,,,,,3823401,PASSENGER VEHICLE,,,,
4,01/01/2018,9:30,,,40.583412,-73.9713,"(40.583412, -73.9713)",BELT PARKWAY,,,...,,,,,3818946,PASSENGER VEHICLE,,,,


In [3]:
nyc_2018.shape

(231496, 29)

The dataset consists of 231496 rows and 29 columns.

# Sampling from the dataset

Because our dataset is relatively big for a Jupyter notebook, with over 230,000 rows, we will use [Simple Random Sampling](https://en.wikipedia.org/wiki/Simple_random_sample) to reduce its size for the purpose of analysis. I will reduce the dataset to 1/5 of its original size, which still leaves a large sample. A large sample decreases the variability of the sampling process, which decreases the chance that the sample will be unrepresentative.

Since this is meant to be a project showcasing skills, and it won't be used to make any meaningful decisions, I won't take the structure of the data into consideration (as we would when using, for example, [Stratified Sampling](https://en.wikipedia.org/wiki/Stratified_sampling) or [Cluster Sampling](https://en.wikipedia.org/wiki/Cluster_sampling)), and I'll perform the analysis ignoring potential misrepresentations. What is more, we don't have a single category that interests us, which makes it hard to define good strata.
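
For comparison, here is a minimal sketch of what a stratified draw could look like if `BOROUGH` were chosen as the stratum. It is not used in this project. The toy frame and its values are made up for illustration, the `"UNKNOWN"` label for missing boroughs is an assumption, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for the collisions data (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN"] * 50 + ["QUEENS"] * 30 + [None] * 20,
    "UNIQUE KEY": range(100),
})

# Treat missing boroughs as their own stratum by giving them an explicit label.
strata = toy["BOROUGH"].fillna("UNKNOWN")

# Draw 20% from every stratum separately, so each keeps its exact share.
stratified = toy.groupby(strata).sample(frac=0.2, random_state=1)

# Number of sampled rows per stratum.
stratum_counts = strata.loc[stratified.index].value_counts()
print(stratum_counts)
```

Each stratum contributes exactly one fifth of its rows, whereas simple random sampling only matches the original proportions approximately.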

In [4]:
sample_size = int(nyc_2018.shape[0] * (1/5))

We will use pandas' `sample` method with the `random_state` parameter to make the results reproducible.

In [5]:
nyc_sample = nyc_2018.sample(sample_size, random_state=1, axis=0)
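
As a quick illustration of what `random_state` buys us, the sketch below samples a small toy frame twice with the same seed and checks that both draws are identical. The toy frame is hypothetical and only stands in for the real data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"UNIQUE KEY": range(1000)})

# Two independent draws with the same seed.
first = toy.sample(200, random_state=1)
second = toy.sample(200, random_state=1)

# The same seed yields the same rows in the same order.
same_rows = first.index.equals(second.index)
print(same_rows)
```

By default `sample` draws without replacement, so no row appears twice in a draw.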

Let's check the sample size.

In [6]:
nyc_sample.shape

(46299, 29)

Let's check `BOROUGH`'s frequency distribution to see whether the sampling error is big or not. We will include null values and present the result as percentages.

In [12]:
original = nyc_2018["BOROUGH"].value_counts(normalize=True, dropna=False) * 100
print(original)
sampled = nyc_sample["BOROUGH"].value_counts(normalize=True, dropna=False) * 100
print(sampled)

NaN              35.557850
BROOKLYN         20.429295
QUEENS           17.827090
MANHATTAN        13.565245
BRONX             9.955248
STATEN ISLAND     2.665273
Name: BOROUGH, dtype: float64
NaN              35.653038
BROOKLYN         20.622476
QUEENS           17.775762
MANHATTAN        13.468973
BRONX             9.823106
STATEN ISLAND     2.656645
Name: BOROUGH, dtype: float64
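
To put a number on the gap between the two distributions, one option is to take the absolute difference per category. The sketch below copies the percentages from the printed output above into two Series; the variable names are illustrative.

```python
import pandas as pd

labels = ["NaN", "BROOKLYN", "QUEENS", "MANHATTAN", "BRONX", "STATEN ISLAND"]

# Percentages copied from the printed output above.
original_pct = pd.Series(
    [35.557850, 20.429295, 17.827090, 13.565245, 9.955248, 2.665273], index=labels
)
sampled_pct = pd.Series(
    [35.653038, 20.622476, 17.775762, 13.468973, 9.823106, 2.656645], index=labels
)

# Absolute difference in percentage points for each category.
diff = (sampled_pct - original_pct).abs()
print(diff.round(3))

# Category with the largest gap and the size of that gap.
print(diff.idxmax(), round(diff.max(), 3))
```

The largest gap is about 0.19 percentage points, for `BROOKLYN`.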


As we can see, the percentages for the `BOROUGH` column in the sample and in the original dataset are about the same. The high percentage of missing values may be concerning; that's definitely something we'll have to look at when cleaning the data. The last thing to do is to save our sample to a new CSV file for further analysis.

In [None]:
nyc_sample.to_csv(sample_path, index=False, header=True)
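
The `index` argument of `to_csv` decides whether the row labels are written to the file. The sketch below shows the difference on a hypothetical toy frame, writing to an in-memory buffer instead of disk. A CSV written with an unnamed index gains an `Unnamed: 0` column when it is read back, which is probably how the column of that name in the `head()` output above came about, although the source file's history is not known here.

```python
import io
import pandas as pd

# Toy frame with a non-default index, as a sampled frame would have (hypothetical values).
toy = pd.DataFrame({"BOROUGH": ["QUEENS", "BRONX"]}, index=[7, 3])

# Write with the index (the default) and read the result back.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# Write without the index and read the result back.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'BOROUGH']
print(cols_without)  # ['BOROUGH']
```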

# Conclusions

We have successfully sampled the dataset using Simple Random Sampling without any large misrepresentation in the `BOROUGH` distribution. The next part of the project, cleaning the data and initial analysis, can be found [here](nyc-collisions-data-cleaning.ipynb).