<a id='Top'></a>
# Analytics Programming: Module 12
## Statistical Tests
#### Alan Leidner Nov 17, 2019

In this notebook I will analyze car crash data from NYC's OpenData portal using statistical tests.


1. [Importing](#Importing)
2. [Analysis & Cleaning](#Clenaing)
3. [Statistical Analysis](#Viz2)
4. [Closing Thoughts](#conclusion)
    

Data source: https://data.cityofnewyork.us/resource/h9gi-nx95



Ingest some or all of your final project data into a pandas data frame.

Do any data cleaning or combining you need to do to make your data tidier, more accurate, or more useful.

Select some data on which you're going to perform a statistical test, and explain why you're choosing that particular data and why the test you're choosing makes sense. You might want to do a correlation, a two-sample t test, a chi-square test, an ANOVA, or another statistical test.

Interpret the result of your test - are you surprised? Is there a statistically significant finding?

### Importing the data<a id='Importing'></a>
First I will import the data from the NYC OpenData website using the Socrata API.
I will request all available rows as CSV and read them into a pandas dataframe.

In [1]:
import pandas as pd
import datetime as dt
pd.set_option('display.max_colwidth', None) # Shows full cell text instead of truncated data in cells (older pandas versions used -1)
pd.set_option('display.max_columns', None) # Lets me see all columns when calling a dataframe
crashes = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=2000000")

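When pandas has to infer column types for a file this large, columns such as `zip_code` can end up with mixed types. A minimal sketch, assuming a tiny inline sample adapted from the preview rows rather than the real download, of passing explicit `dtype` and `parse_dates` arguments to `pd.read_csv`:

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real download (adapted from the preview rows;
# the last row has borough and zip_code blanked out for illustration).
sample_csv = io.StringIO(
    "accident_date,accident_time,borough,zip_code,number_of_persons_killed\n"
    "2012-07-23T00:00:00.000,16:30,QUEENS,11368,0.0\n"
    "2012-09-11T00:00:00.000,9:26,BROOKLYN,11211,0.0\n"
    "2012-09-05T00:00:00.000,17:10,,,0.0\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"zip_code": str},        # keep ZIP codes as text instead of letting pandas guess
    parse_dates=["accident_date"],  # parse the date column while reading
)
print(sample.dtypes)
```

The same two arguments could be passed to the real `pd.read_csv` call; whether that is worth it depends on how the columns are used later.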


We will take a quick look at the first and last rows of our data before we get started.

In [2]:
pd.concat([crashes.head(3), crashes.tail(3)], axis=0)

Unnamed: 0,accident_date,accident_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2012-07-23T00:00:00.000,16:30,QUEENS,11368,40.751545,-73.870843,POINT (-73.8708432 40.7515446),37 AVENUE,JUNCTION BOULEVARD,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,279832,PASSENGER VEHICLE,UNKNOWN,,,
1,2012-08-14T00:00:00.000,18:00,QUEENS,11101,40.744962,-73.935415,POINT (-73.9354154 40.7449621),31 STREET,THOMSON AVENUE,,0.0,0.0,0,0,0,0,0,0,Unspecified,Unspecified,,,,239102,PASSENGER VEHICLE,PICK-UP TRUCK,,,
2,2012-08-02T00:00:00.000,14:40,QUEENS,11104,40.746355,-73.91616,POINT (-73.9161602 40.746355),48 STREET,SKILLMAN AVENUE,,1.0,0.0,0,0,0,0,1,0,Unspecified,,,,,238966,PASSENGER VEHICLE,,,,
1608162,2012-09-02T00:00:00.000,14:30,QUEENS,11358,40.757961,-73.785662,POINT (-73.7856623 40.757961),NORTHERN BOULEVARD,196 STREET,,0.0,0.0,0,0,0,0,0,0,Prescription Medication,Unspecified,,,,259840,PASSENGER VEHICLE,BUS,,,
1608163,2012-09-11T00:00:00.000,9:26,BROOKLYN,11211,40.717516,-73.943169,POINT (-73.943169 40.7175158),WITHERS STREET,HUMBOLDT STREET,,0.0,0.0,0,0,0,0,0,0,Oversized Vehicle,Unspecified,,,,198502,OTHER,PASSENGER VEHICLE,,,
1608164,2012-09-05T00:00:00.000,17:10,QUEENS,11415,40.712001,-73.825456,POINT (-73.8254559 40.7120013),QUEENS BOULEVARD,83 AVENUE,,1.0,0.0,1,0,0,0,0,0,Passenger Distraction,,,,,204233,PASSENGER VEHICLE,,,,


Before I start slicing the data to examine it, I will quickly convert the `accident_date` column to a datetime type, making the date easier to work with later.

In [5]:
crashes['accident_date'] = pd.to_datetime(crashes['accident_date'])
crashes.tail(3)

Unnamed: 0,accident_date,accident_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
1608162,2012-09-02,14:30,QUEENS,11358,40.757961,-73.785662,POINT (-73.7856623 40.757961),NORTHERN BOULEVARD,196 STREET,,0.0,0.0,0,0,0,0,0,0,Prescription Medication,Unspecified,,,,259840,PASSENGER VEHICLE,BUS,,,
1608163,2012-09-11,9:26,BROOKLYN,11211,40.717516,-73.943169,POINT (-73.943169 40.7175158),WITHERS STREET,HUMBOLDT STREET,,0.0,0.0,0,0,0,0,0,0,Oversized Vehicle,Unspecified,,,,198502,OTHER,PASSENGER VEHICLE,,,
1608164,2012-09-05,17:10,QUEENS,11415,40.712001,-73.825456,POINT (-73.8254559 40.7120013),QUEENS BOULEVARD,83 AVENUE,,1.0,0.0,1,0,0,0,0,0,Passenger Distraction,,,,,204233,PASSENGER VEHICLE,,,,
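Once the date is a datetime, it can also be combined with `accident_time` into a single timestamp. The following is only a sketch on toy rows; the column name `accident_datetime` is hypothetical and not part of the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "accident_date": pd.to_datetime(["2012-09-02", "2012-09-11", "2012-09-05"]),
    "accident_time": ["14:30", "9:26", "17:10"],
})

# Zero-pad times such as "9:26" to "09:26" so every value matches %H:%M.
padded_time = toy["accident_time"].str.zfill(5)

# "accident_datetime" is a hypothetical column name used for illustration.
toy["accident_datetime"] = pd.to_datetime(
    toy["accident_date"].dt.strftime("%Y-%m-%d") + " " + padded_time,
    format="%Y-%m-%d %H:%M",
)

# Datetime accessors make it easy to slice by hour or weekday later.
toy["hour"] = toy["accident_datetime"].dt.hour
toy["weekday"] = toy["accident_datetime"].dt.day_name()
print(toy)
```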


Next, I'll check how much of my data is populated by computing the percentage of missing values in each column.

In [7]:
crashes.isnull().mean().round(4) * 100

accident_date                    0.00 
accident_time                    0.00 
borough                          30.33
zip_code                         30.34
latitude                         12.22
longitude                        12.22
location                         12.22
on_street_name                   19.60
off_street_name                  33.49
cross_street_name                86.13
number_of_persons_injured        0.00 
number_of_persons_killed         0.00 
number_of_pedestrians_injured    0.00 
number_of_pedestrians_killed     0.00 
number_of_cyclist_injured        0.00 
number_of_cyclist_killed         0.00 
number_of_motorist_injured       0.00 
number_of_motorist_killed        0.00 
contributing_factor_vehicle_1    0.26 
contributing_factor_vehicle_2    13.44
contributing_factor_vehicle_3    93.53
contributing_factor_vehicle_4    98.65
contributing_factor_vehicle_5    99.66
collision_id                     0.00 
vehicle_type_code1               0.34 
vehicle_type_code2       

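`vehicle_type_code1` and `contributing_factor_vehicle_1` are far more complete than the later vehicle and contributing-factor columns, so I will drop the sparsely populated columns now. `cross_street_name` is missing in 86% of rows, so it goes too.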

In [11]:
# Keep only the columns in which at least 30% of the values are non-null
clean_crashes = crashes.dropna(thresh=int(0.30 * crashes.shape[0]), axis=1).copy()
clean_crashes.isnull().mean().round(4) * 100

accident_date                    0.00 
accident_time                    0.00 
borough                          30.33
zip_code                         30.34
latitude                         12.22
longitude                        12.22
location                         12.22
on_street_name                   19.60
off_street_name                  33.49
number_of_persons_injured        0.00 
number_of_persons_killed         0.00 
number_of_pedestrians_injured    0.00 
number_of_pedestrians_killed     0.00 
number_of_cyclist_injured        0.00 
number_of_cyclist_killed         0.00 
number_of_motorist_injured       0.00 
number_of_motorist_killed        0.00 
contributing_factor_vehicle_1    0.26 
contributing_factor_vehicle_2    13.44
collision_id                     0.00 
vehicle_type_code1               0.34 
vehicle_type_code2               16.49
dtype: float64
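The `thresh` argument of `dropna` counts non-null values, so `thresh=0.30 * n_rows` keeps a column only if at least 30% of its values are populated. A small sketch on an invented toy frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],                   # 9 of 10 populated
    "half_full": [1, np.nan, 3, np.nan, 5, np.nan, 7, np.nan, 9, np.nan],  # 5 of 10 populated
    "mostly_empty": [1] + [np.nan] * 9,                                    # 1 of 10 populated
})

# Minimum number of non-null values a column needs in order to be kept.
min_non_null = int(0.30 * len(toy))
kept = toy.dropna(thresh=min_non_null, axis=1)
print(kept.columns.tolist())
```

Only the column with a single populated value falls below the threshold and is dropped.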

In [13]:
clean_crashes.drop(columns=["collision_id", "on_street_name", "off_street_name", "contributing_factor_vehicle_2", "vehicle_type_code2"], inplace=True)
clean_crashes.isnull().mean().round(4) * 100

accident_date                    0.00 
accident_time                    0.00 
borough                          30.33
zip_code                         30.34
latitude                         12.22
longitude                        12.22
location                         12.22
number_of_persons_injured        0.00 
number_of_persons_killed         0.00 
number_of_pedestrians_injured    0.00 
number_of_pedestrians_killed     0.00 
number_of_cyclist_injured        0.00 
number_of_cyclist_killed         0.00 
number_of_motorist_injured       0.00 
number_of_motorist_killed        0.00 
contributing_factor_vehicle_1    0.26 
vehicle_type_code1               0.34 
dtype: float64

The remaining columns are populated well enough to work with. Now to analyze!

In [14]:
clean_crashes.describe()

Unnamed: 0,latitude,longitude,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed
count,1411650.0,1411650.0,1608148.0,1608134.0,1608165.0,1608165.0,1608165.0,1608165.0,1608165.0,1608165.0
mean,40.69236,-73.87318,0.2626151,0.001167191,0.05055514,0.0006305323,0.02078891,8.456844e-05,0.1914151,0.0004545553
std,1.136518,2.341375,0.659857,0.03610925,0.2316634,0.02569013,0.1438447,0.009263098,0.6222156,0.0231878
min,0.0,-201.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,40.66881,-73.9772,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,40.72258,-73.92981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,40.76797,-73.86692,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,43.34444,0.0,43.0,8.0,27.0,6.0,4.0,2.0,43.0,5.0


In [17]:
clean_crashes.shape

(1608165, 17)

In [19]:
clean_crashes['vehicle_type_code1'].value_counts().head(40)

PASSENGER VEHICLE                      715236
SPORT UTILITY / STATION WAGON          313500
Sedan                                  156802
Station Wagon/Sport Utility Vehicle    126994
TAXI                                   50670 
VAN                                    26540 
OTHER                                  23982 
PICK-UP TRUCK                          23069 
UNKNOWN                                19929 
Taxi                                   16247 
SMALL COM VEH(4 TIRES)                 14559 
LARGE COM VEH(6 OR MORE TIRES)         14527 
BUS                                    14057 
Pick-up Truck                          10529 
LIVERY VEHICLE                         10481 
Box Truck                              8244  
Bus                                    6777  
MOTORCYCLE                             6536  
BICYCLE                                5568  
Bike                                   3982  
Tractor Truck Diesel                   3606  
Van                               

In [21]:
# Map the many spellings of each vehicle type onto a single label.
# Assigning the result back avoids relying on inplace=True on a column selection.
vehicle_type_map = {
    'SPORT UTILITY / STATION WAGON': 'SUV',
    'Station Wagon/Sport Utility Vehicle': 'SUV',
    'TAXI': 'taxi', 'Taxi': 'taxi', 'CAB': 'taxi', 'Cab': 'taxi',
    'Bike': 'BICYCLE',
    'Motorscooter': 'SCOOTER', 'Moped': 'SCOOTER',
    'MOTORCYCLE': 'Motorcycle', 'Motorbike': 'Motorcycle',
    'VAN': 'Van', 'van': 'Van', 'Refrigerated Van': 'Van',
    'VAN T': 'Van', 'VAN/T': 'Van', 'van t': 'Van', 'VN': 'Van',
    'AMBULANCE': 'Ambulance', 'AMBUL': 'Ambulance', 'Ambul': 'Ambulance',
    'ambul': 'Ambulance', 'AMB': 'Ambulance', 'Ambu': 'Ambulance', 'AM': 'Ambulance',
    'Fire': 'FIRE TRUCK', 'fire': 'FIRE TRUCK', 'FIRE': 'FIRE TRUCK',
    'FIRET': 'FIRE TRUCK', 'FDNY': 'FIRE TRUCK',
    'PICK-UP TRUCK': 'Pick-up Truck',
    'BUS': 'Bus',
    'Box T': 'Box Truck',
    'GARBA': 'Dump', 'Garbage or Refuse': 'Dump',
    'CONV': 'Convertible',
    'OTHER': 'UNKNOWN',
    'Other': 'Unknown',  # note: stays a separate label from 'UNKNOWN', as in the original cell
}
clean_crashes['vehicle_type_code1'] = clean_crashes['vehicle_type_code1'].replace(vehicle_type_map)
clean_crashes['vehicle_type_code1'].value_counts().head(50)

PASSENGER VEHICLE                 715236
SUV                               440500
Sedan                             156802
taxi                              66918 
UNKNOWN                           43911 
Pick-up Truck                     33598 
Van                               30382 
Bus                               20834 
SMALL COM VEH(4 TIRES)            14559 
LARGE COM VEH(6 OR MORE TIRES)    14527 
LIVERY VEHICLE                    10481 
BICYCLE                           9550  
Motorcycle                        8673  
Box Truck                         8247  
Ambulance                         4018  
Tractor Truck Diesel              3606  
TK                                2485  
BU                                2229  
Dump                              1972  
Convertible                       1761  
FIRE TRUCK                        1058  
DS                                1006  
4 dr sedan                        886   
PK                                843   
Flat Bed        
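Many of the remaining duplicates differ only in letter case or stray whitespace. One possible shortcut, sketched here on a toy series and not what the cell above does, is to normalize case first and map synonyms second. Note that this produces upper-case labels, unlike the mixed-case labels used in this notebook:

```python
import pandas as pd

raw = pd.Series(["TAXI", "Taxi", " taxi ", "VAN", "van", "Bike", "BICYCLE", None])

# Step 1: strip whitespace and upper-case so "Taxi", "TAXI" and " taxi " collapse together.
normalized = raw.str.strip().str.upper()

# Step 2: map the remaining synonyms. This small mapping is illustrative only.
synonyms = {"BIKE": "BICYCLE"}
normalized = normalized.replace(synonyms)

print(normalized.value_counts())
```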

There are so many distinct values in this write-in field that we could clean it up all day and still not be confident in the data, but we can still try to find some useful analysis. The two largest categories with a comparable level of granularity are SUV and Sedan. SUVs tend to be larger, and we can see that there are more recorded crashes for them. But are they more dangerous? We will perform an independent two-sample t test on the number of persons killed per crash to see whether SUV crashes are deadlier than sedan crashes.

In [23]:
from scipy import stats
stats.ttest_ind(clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="SUV"], 
                clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="Sedan"])

Ttest_indResult(statistic=array([nan]), pvalue=array([nan]))

The result is NaN. Maybe the number killed is not stored as an integer? Let's check the column types.

In [27]:
clean_crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1608165 entries, 0 to 1608164
Data columns (total 17 columns):
accident_date                    1608165 non-null datetime64[ns]
accident_time                    1608165 non-null object
borough                          1120389 non-null object
zip_code                         1120192 non-null object
latitude                         1411650 non-null float64
longitude                        1411650 non-null float64
location                         1411650 non-null object
number_of_persons_injured        1608148 non-null float64
number_of_persons_killed         1608134 non-null float64
number_of_pedestrians_injured    1608165 non-null int64
number_of_pedestrians_killed     1608165 non-null int64
number_of_cyclist_injured        1608165 non-null int64
number_of_cyclist_killed         1608165 non-null int64
number_of_motorist_injured       1608165 non-null int64
number_of_motorist_killed        1608165 non-null int64
contributing_factor_vehicl

It's a float! Both `number_of_persons_injured` and `number_of_persons_killed` are stored as floats, and the non-null counts show that each has a few missing values. Those missing values are what make the t test return NaN. To fix this, we will first drop the rows with missing values in these two columns and then convert the columns from float to integer.

In [30]:
clean_crashes.dropna(subset = ['number_of_persons_injured'], how='all', inplace=True)
clean_crashes.dropna(subset = ['number_of_persons_killed'], how='all', inplace=True)
clean_crashes['number_of_persons_injured'] = clean_crashes.number_of_persons_injured.astype(int)
clean_crashes['number_of_persons_killed'] = clean_crashes.number_of_persons_killed.astype(int)
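To see why missing values matter here, the sketch below runs `stats.ttest_ind` on two invented toy arrays, one of which contains a NaN. By default SciPy propagates the NaN into the result; passing `nan_policy="omit"` is an alternative to dropping the rows beforehand:

```python
import numpy as np
from scipy import stats

# Toy data invented for illustration; group_a contains one missing value.
group_a = np.array([0.0, 1.0, 0.0, 2.0, np.nan])
group_b = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

# Default behaviour (nan_policy="propagate"): a single NaN makes the result NaN.
propagated = stats.ttest_ind(group_a, group_b)

# nan_policy="omit" ignores the missing value when computing the statistic.
omitted = stats.ttest_ind(group_a, group_b, nan_policy="omit")

print(propagated.statistic, propagated.pvalue)
print(omitted.statistic, omitted.pvalue)
```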

In [31]:
stats.ttest_ind(clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="SUV"], 
                clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="Sedan"])

Ttest_indResult(statistic=array([3.5973131]), pvalue=array([0.00032155]))

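The p-value is about 0.0003, two orders of magnitude below 0.05, so the difference is unlikely to be due to chance. The positive t statistic means that SUV crashes have a higher mean number of deaths than sedan crashes. SUVs are more lethal, though not by enough to outlaw them.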
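By default `stats.ttest_ind` assumes both groups have equal variances. Since the SUV and Sedan groups differ a lot in size, Welch's t test (`equal_var=False`) may be a reasonable cross-check. A sketch on invented toy counts, not on the crash data itself:

```python
import numpy as np
from scipy import stats

# Toy per-crash death counts, invented for illustration (not taken from the dataset).
suv_like = np.array([0] * 95 + [1] * 5)      # 100 crashes, 5 with one death
sedan_like = np.array([0] * 198 + [1] * 2)   # 200 crashes, 2 with one death

student = stats.ttest_ind(suv_like, sedan_like)                  # assumes equal variances
welch = stats.ttest_ind(suv_like, sedan_like, equal_var=False)   # Welch's t test

# Reporting the difference in means alongside the p-value shows the size of the effect.
mean_difference = suv_like.mean() - sedan_like.mean()
print("difference in means:", mean_difference)
print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

With more than a million rows even tiny differences can be statistically significant, so the difference in means is worth reading next to the p-value.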

I wonder though: is it the same for pedestrians?

In [32]:
clean_crashes.dropna(subset = ['number_of_pedestrians_killed'], how='all', inplace=True)
clean_crashes['number_of_pedestrians_killed'] = clean_crashes.number_of_pedestrians_killed.astype(int)

In [33]:
stats.ttest_ind(clean_crashes[['number_of_pedestrians_killed']][clean_crashes['vehicle_type_code1']=="SUV"], 
                clean_crashes[['number_of_pedestrians_killed']][clean_crashes['vehicle_type_code1']=="Sedan"])

Ttest_indResult(statistic=array([4.9614584]), pvalue=array([6.99850216e-07]))

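Yes, and even more strongly! The p-value is about 7e-07, far below 0.05, so SUV crashes also have a significantly higher mean number of pedestrian deaths than sedan crashes.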

In [36]:
stats.ttest_ind(clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="SUV"], 
                clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="PASSENGER VEHICLE"])

Ttest_indResult(statistic=array([3.17566475]), pvalue=array([0.00149498]))

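Comparing SUVs to the standard `PASSENGER VEHICLE` category also shows a significant difference (p is about 0.0015). Finally, let's compare SUVs to taxis.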

In [42]:
stats.ttest_ind(clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="SUV"], 
                clean_crashes[['number_of_persons_killed']][clean_crashes['vehicle_type_code1']=="taxi"])

Ttest_indResult(statistic=array([2.24551781]), pvalue=array([0.02473534]))
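Because four t tests were run on the same data, it may be worth applying a multiple-comparison correction before reading each p-value against 0.05. The sketch below applies a simple Bonferroni correction to the p-values reported in this notebook; the choice of Bonferroni is an assumption on my part, and other corrections exist:

```python
# p-values copied from the four t test outputs in this notebook
p_values = {
    "SUV vs Sedan, persons killed": 0.00032155,
    "SUV vs Sedan, pedestrians killed": 6.99850216e-07,
    "SUV vs PASSENGER VEHICLE, persons killed": 0.00149498,
    "SUV vs taxi, persons killed": 0.02473534,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests performed.
corrected_alpha = alpha / len(p_values)

still_significant = {name: p < corrected_alpha for name, p in p_values.items()}
for name, flag in still_significant.items():
    print(f"{name}: p={p_values[name]:.2e}, significant after correction: {flag}")
```

Under this correction the threshold becomes 0.0125, so the SUV versus taxi comparison (p is about 0.025) is significant at the usual 0.05 level but not after correction, while the other three comparisons remain significant.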

### Conclusion <a id='conclusion'></a>
There are many ways to present data available on NYC's OpenData website, and here I have provided a few of the fancier ones. Perhaps a next step could be to animate the mapbox visualization, creating a slider which would display these crashes over time? 

We could certainly repurpose these visualizations to examine the car crash factors, or vehicle types, drawing further insights still.

I think NYC's commitment to tracking this data can provide insight into crash trends, and hopefully provide value in the form of policy changes, and saved lives.
Maybe one of these visualizations will help shift policy one day!
<br><div style="text-align: right">[Beginning of the page](#Top)</div>