<a id='Top'></a>
<h1> <center>Analytics Programming: Module 8</center> </h1>
<p><h2><center>Cleaning up the NYC Vehicle Crash Data</center> 
<center>supported by a <a href="https://github.com/yuleidner/Katz_Data_Analytics/blob/master/M8/README.md">M8 README file </a></center></h2></p>
<center>Alan Leidner Oct 26, 2019</center>

## In this notebook, I will demonstrate a few techniques to clean tabular data. 
* ### I'll first ingest a subset of the available dataset into a pandas DataFrame, using the Socrata API, from the NYC Open Data portal.
* ### Then I will use pandas functions such as `describe`, `info`, and `value_counts` to inspect the data.
* ### Using my best judgement and best practices, I will then clean up some of that data.

### First I'll import 100,000 of the 1.6M available rows. 
#### If you run this notebook again in the future, you may find your data is different than mine, and will require different methods or actions to clean.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)  # Lets me see all columns in head() and tail() output
crashes = pd.read_csv("https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=100000")
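
### The `$limit` parameter above caps the download at 100,000 rows. A minimal sketch of requesting the data in pages instead, assuming the Socrata `$offset` and `$order` parameters and a hypothetical helper named `build_page_urls`; it only builds the URLs and does not touch the network:

```python
from urllib.parse import urlencode

# Base CSV endpoint taken from the cell above.
BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"


def build_page_urls(total_rows, page_size):
    """Return one URL per page, covering total_rows rows in page_size chunks."""
    urls = []
    for offset in range(0, total_rows, page_size):
        query = {
            "$limit": min(page_size, total_rows - offset),
            "$offset": offset,
            "$order": ":id",  # assumed stable ordering so pages do not overlap
        }
        # safe="$:" keeps the parameter names readable instead of percent-encoding them.
        urls.append(BASE_URL + "?" + urlencode(query, safe="$:"))
    return urls


page_urls = build_page_urls(total_rows=100_000, page_size=50_000)
for url in page_urls:
    print(url)
```

### Each URL could then be passed to `pd.read_csv` and the pages joined with `pd.concat`.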



### I'll use the `describe` function, which will "generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." This works on numeric and object series, and may point out any glaring holes in the data.

In [2]:
crashes.describe()

Unnamed: 0,latitude,longitude,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,unique_key
count,88197.0,88197.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,40.719455,-73.91056,0.27828,0.0015,0.05502,0.00084,0.01966,8e-05,0.2036,0.00058,2444278.0
std,0.461516,0.8299,0.663283,0.040716,0.239193,0.028971,0.139691,0.008944,0.62606,0.027197,1933606.0
min,0.0,-74.252876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1506.0
25%,40.668785,-73.975394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,185488.8
50%,40.722708,-73.92812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4177044.0
75%,40.768644,-73.86671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4203252.0
max,40.912223,0.0,16.0,3.0,6.0,1.0,2.0,1.0,16.0,3.0,4229085.0


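### Obviously averages and standard deviations don't tell me a lot about latitude and longitude, but I do wonder why I don't have full counts in those columns. Apart from the 0.0 minimum latitude and 0.0 maximum longitude, which cannot be real New York City coordinates, the rest of the analysis doesn't appear to have obvious problems.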
### Let's take a look at some of those rows using the `isnull` function

In [3]:
crashes[crashes['latitude'].isnull()]

Unnamed: 0,date,time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,unique_key,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
9,2013-03-10T00:00:00.000,16:00,,,,,,,,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,Unspecified,,,2912938,PASSENGER VEHICLE,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,
10,2013-03-08T00:00:00.000,8:40,,,,,,EAST 176 STREET,EAST TREMONT AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,93106,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
20,2013-03-05T00:00:00.000,10:20,,,,,,HAMILTON AVENUE,CLINTON STREET,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,171440,OTHER,OTHER,,,
23,2013-03-18T00:00:00.000,21:00,,,,,,ASTORIA BOULEVARD,49 STREET,,1,0,0,0,0,0,1,0,Unspecified,Unspecified,,,,275377,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
29,2013-03-15T00:00:00.000,10:30,,,,,,BORDEN AVENUE,VANDAM STREET,,0,0,0,0,0,0,0,0,Failure to Yield Right-of-Way,Unspecified,,,,241484,PASSENGER VEHICLE,LARGE COM VEH(6 OR MORE TIRES),,,
43,2013-03-05T00:00:00.000,12:59,,,,,,EAST GUN HILL ROAD,HOLLAND AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,96868,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,,
50,2013-03-19T00:00:00.000,13:45,,,,,,,,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,2864520,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,,
86,2013-03-02T00:00:00.000,19:39,,,,,,CHARLES AVENUE,NICHOLAS AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,287328,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,
93,2013-02-22T00:00:00.000,12:40,,,,,,3 AVENUE,HAMILTON AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,157456,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
95,2013-03-09T00:00:00.000,12:26,,,,,,,,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,2864503,SPORT UTILITY / STATION WAGON,SPORT UTILITY / STATION WAGON,,,


### It seems like those rows have valid data. I wouldn't delete these rows, as the injury/fatality data may be useful, but I would drop them for location purposes.
### If I really had some time, I would write/find a program to reverse map the on_street_name to the other fields.
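
### Dropping rows for location purposes only, as described above, can be sketched as follows. This uses a toy DataFrame with invented values rather than `crashes`, and it also treats 0.0 coordinates as missing, since the `describe` output shows a minimum latitude and a maximum longitude of 0.0:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few rows of `crashes`; the values are invented for illustration.
sample = pd.DataFrame({
    "unique_key": [1, 2, 3, 4],
    "latitude": [40.672294, np.nan, 0.0, 40.781025],
    "longitude": [-73.913117, np.nan, 0.0, -73.98131],
    "number_of_persons_injured": [0, 1, 0, 2],
})

coord_cols = ["latitude", "longitude"]

# Work on a copy so the full table keeps every row for injury/fatality analysis.
located = sample.copy()
# Treat 0.0 coordinates as missing: (0, 0) is not a location in New York City.
located[coord_cols] = located[coord_cols].replace(0.0, np.nan)
# Keep only rows where both coordinates are present.
located = located.dropna(subset=coord_cols)

print(len(sample), "rows in total;", len(located), "usable for mapping")
print(located["unique_key"].tolist())
```

### The original frame is left untouched, so counts of injuries and fatalities still use every row.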
### For now let's look at some of the other data description functions, like `info`, which prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [4]:
crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 29 columns):
date                             100000 non-null object
time                             100000 non-null object
borough                          70721 non-null object
zip_code                         70713 non-null object
latitude                         88197 non-null float64
longitude                        88197 non-null float64
location                         88197 non-null object
on_street_name                   80686 non-null object
off_street_name                  67269 non-null object
cross_street_name                13424 non-null object
number_of_persons_injured        100000 non-null int64
number_of_persons_killed         100000 non-null int64
number_of_pedestrians_injured    100000 non-null int64
number_of_pedestrians_killed     100000 non-null int64
number_of_cyclist_injured        100000 non-null int64
number_of_cyclist_killed         100000 non-null int64
number

### It looks like contributing_factor_vehicle_3-5 and vehicle_type_code_3-5 have very few values compared to the others. Let's take a quick peek.

In [5]:
crashes['contributing_factor_vehicle_1'].value_counts()

Unspecified                                              40764
Driver Inattention/Distraction                           18735
Failure to Yield Right-of-Way                             5253
Following Too Closely                                     4586
Backing Unsafely                                          3393
Other Vehicular                                           2751
Fatigued/Drowsy                                           2571
Passing or Lane Usage Improper                            2249
Turning Improperly                                        2227
Passing Too Closely                                       2105
Unsafe Lane Changing                                      1561
Driver Inexperience                                       1377
Traffic Control Disregarded                               1355
Lost Consciousness                                        1031
Pavement Slippery                                          936
Prescription Medication                                

In [6]:
crashes['contributing_factor_vehicle_3'].value_counts()

Unspecified                                    6336
Other Vehicular                                 102
Driver Inattention/Distraction                   75
Following Too Closely                            71
Fatigued/Drowsy                                  43
Pavement Slippery                                18
Reaction to Uninvolved Vehicle                   11
Outside Car Distraction                           9
Backing Unsafely                                  8
Unsafe Speed                                      8
Traffic Control Disregarded                       6
Passing or Lane Usage Improper                    6
Failure to Yield Right-of-Way                     4
Driver Inexperience                               3
Turning Improperly                                3
Brakes Defective                                  2
Physical Disability                               2
Fell Asleep                                       2
Obstruction/Debris                                2
Lost Conscio
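
### The sparsity seen above can be put into numbers. A minimal sketch on a toy DataFrame with invented values, measuring how often each column is filled at all and how often it holds something other than "Unspecified":

```python
import numpy as np
import pandas as pd

# Toy stand-in for the five contributing-factor columns; values are invented.
factor_cols = [f"contributing_factor_vehicle_{i}" for i in range(1, 6)]
sample = pd.DataFrame({
    factor_cols[0]: ["Unspecified", "Glare", "Illness", "Unspecified", "Glare"],
    factor_cols[1]: ["Unspecified", "Unspecified", np.nan, "Glare", np.nan],
    factor_cols[2]: [np.nan, "Unspecified", np.nan, np.nan, np.nan],
    factor_cols[3]: [np.nan, np.nan, np.nan, "Unspecified", np.nan],
    factor_cols[4]: [np.nan, np.nan, np.nan, np.nan, np.nan],
}, dtype=object)

present = sample[factor_cols].notna()

# Share of rows with any value at all in each column.
fill_rate = present.mean()

# Share of rows with an informative value: present and not "Unspecified".
informative_rate = (present & sample[factor_cols].ne("Unspecified")).mean()

summary = pd.DataFrame({"filled": fill_rate, "informative": informative_rate})
print(summary)
```

### A column that is rarely filled, and mostly "Unspecified" when it is, carries little information.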

### It looks to me like the "contributing_factor" fields might record multiple factors for the incident, with up to five factors recorded. I would feel comfortable dropping the last three columns, considering how few entries they have in comparison.

In [20]:
crashes.drop(columns=['contributing_factor_vehicle_3', 'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5'])
crashes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 29 columns):
date                             100000 non-null datetime64[ns]
time                             100000 non-null object
borough                          70721 non-null object
zip_code                         70713 non-null object
latitude                         88197 non-null float64
longitude                        88197 non-null float64
location                         88197 non-null object
on_street_name                   80686 non-null object
off_street_name                  67269 non-null object
cross_street_name                13424 non-null object
number_of_persons_injured        100000 non-null int64
number_of_persons_killed         100000 non-null int64
number_of_pedestrians_injured    100000 non-null int64
number_of_pedestrians_killed     100000 non-null int64
number_of_cyclist_injured        100000 non-null int64
number_of_cyclist_killed         100000 non-null int6

In [21]:
crashes.head()

Unnamed: 0,date,time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,unique_key,vehicle_type_code_1,vehicle_type_code_2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2013-02-23,8:00,BROOKLYN,11233.0,40.672294,-73.913117,POINT (-73.913117 40.6722935),EASTERN PARKWAY,PROSPECT PLACE,,0,0,0,0,0,0,0,0,Fatigued/Drowsy,Pavement Slippery,,,,161352,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
1,2013-03-17,22:00,,,40.736876,-73.844945,POINT (-73.8449449 40.7368759),,,,0,0,0,0,0,0,0,0,Lost Consciousness,Unspecified,,,,3065403,PASSENGER VEHICLE,UNKNOWN,,,
2,2013-03-05,12:40,MANHATTAN,10023.0,40.781025,-73.98131,POINT (-73.9813103 40.7810254),WEST 75 STREET,BROADWAY,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,53578,VAN,PASSENGER VEHICLE,,,
3,2013-03-20,9:15,QUEENS,11421.0,40.69305,-73.853838,POINT (-73.8538385 40.6930498),91 STREET,JAMAICA AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,205735,PASSENGER VEHICLE,Station Wagon/Sport Utility Vehicle,,,
4,2013-02-23,16:40,BRONX,10466.0,40.888702,-73.839206,POINT (-73.8392055 40.8887024),EAST 233 STREET,MURDOCK AVENUE,,6,0,0,0,0,0,6,0,Unspecified,Unspecified,,,,96792,PASSENGER VEHICLE,PASSENGER VEHICLE,,,


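### `drop` returns a new DataFrame rather than modifying `crashes`, so without assigning the result back (`crashes = crashes.drop(...)`) the columns are still present, as the 29 columns in the output above show. Now for the vehicle types.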
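
### Before moving on, a minimal sketch of making a `drop` persist, using a toy DataFrame with hypothetical columns `a`, `b` and `c` rather than the crash data:

```python
import pandas as pd

# Toy frame with hypothetical columns, for illustration only.
sample = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without assignment the result is discarded and `sample` is unchanged.
sample.drop(columns=["c"])
columns_before = list(sample.columns)
print(columns_before)

# Assigning the result back makes the change stick.
sample = sample.drop(columns=["c"])
columns_after = list(sample.columns)
print(columns_after)
```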

In [8]:
crashes['vehicle_type_code1'].value_counts()

PASSENGER VEHICLE                      28084
Sedan                                  22139
Station Wagon/Sport Utility Vehicle    18839
SPORT UTILITY / STATION WAGON          11169
TAXI                                    2326
Taxi                                    2232
VAN                                     1542
Pick-up Truck                           1521
UNKNOWN                                 1339
OTHER                                   1331
Box Truck                               1046
LARGE COM VEH(6 OR MORE TIRES)           891
BUS                                      860
Bus                                      807
SMALL COM VEH(4 TIRES)                   779
PICK-UP TRUCK                            732
LIVERY VEHICLE                           577
Bike                                     562
Motorcycle                               406
Tractor Truck Diesel                     367
Van                                      300
Convertible                              168
Dump      

In [9]:
crashes['vehicle_type_code_3'].value_counts()

PASSENGER VEHICLE                      1889
Sedan                                  1563
Station Wagon/Sport Utility Vehicle    1317
SPORT UTILITY / STATION WAGON           858
UNKNOWN                                 247
Taxi                                    103
VAN                                      78
Pick-up Truck                            78
TAXI                                     67
OTHER                                    64
PICK-UP TRUCK                            49
LARGE COM VEH(6 OR MORE TIRES)           37
Box Truck                                30
Motorcycle                               27
LIVERY VEHICLE                           25
BUS                                      21
SMALL COM VEH(4 TIRES)                   19
Bike                                     16
Van                                      12
Bus                                      11
Tractor Truck Diesel                     10
Convertible                               8
BICYCLE                         

In [10]:
crashes.head()

Unnamed: 0,date,time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,number_of_motorist_injured,number_of_motorist_killed,contributing_factor_vehicle_1,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,unique_key,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2013-02-23T00:00:00.000,8:00,BROOKLYN,11233.0,40.672294,-73.913117,POINT (-73.913117 40.6722935),EASTERN PARKWAY,PROSPECT PLACE,,0,0,0,0,0,0,0,0,Fatigued/Drowsy,Pavement Slippery,,,,161352,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
1,2013-03-17T00:00:00.000,22:00,,,40.736876,-73.844945,POINT (-73.8449449 40.7368759),,,,0,0,0,0,0,0,0,0,Lost Consciousness,Unspecified,,,,3065403,PASSENGER VEHICLE,UNKNOWN,,,
2,2013-03-05T00:00:00.000,12:40,MANHATTAN,10023.0,40.781025,-73.98131,POINT (-73.9813103 40.7810254),WEST 75 STREET,BROADWAY,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,53578,VAN,PASSENGER VEHICLE,,,
3,2013-03-20T00:00:00.000,9:15,QUEENS,11421.0,40.69305,-73.853838,POINT (-73.8538385 40.6930498),91 STREET,JAMAICA AVENUE,,0,0,0,0,0,0,0,0,Unspecified,Unspecified,,,,205735,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,,
4,2013-02-23T00:00:00.000,16:40,BRONX,10466.0,40.888702,-73.839206,POINT (-73.8392055 40.8887024),EAST 233 STREET,MURDOCK AVENUE,,6,0,0,0,0,0,6,0,Unspecified,Unspecified,,,,96792,PASSENGER VEHICLE,PASSENGER VEHICLE,,,


### It seems to me that the multiple fields for "vehicle_type_code" might indicate different cars in the crash, and that data might be more valuable for analyses of larger crashes. I will leave this data for now, but modify the column names to standardize the dataset using the `rename` function.

In [11]:
crashes.rename(columns={'vehicle_type_code1':'vehicle_type_code_1',
                        'vehicle_type_code2':'vehicle_type_code_2',
                       }, 
               inplace=True)

### When looking at value_counts() of contributing factors, I noted some duplicates. Let's clean those up now, first by looking at the `unique` values.

In [13]:
crashes['contributing_factor_vehicle_1'].unique()

array(['Fatigued/Drowsy', 'Lost Consciousness', 'Unspecified',
       'Prescription Medication', 'Driver Inattention/Distraction',
       'Driver Inexperience', 'Backing Unsafely',
       'Failure to Yield Right-of-Way', 'Alcohol Involvement',
       'Physical Disability', 'Pavement Slippery', 'Glare',
       'Traffic Control Disregarded', 'Passenger Distraction',
       'Other Vehicular', 'Failure to Keep Right', 'Turning Improperly',
       'Outside Car Distraction', 'View Obstructed/Limited',
       'Other Electronic Device', nan, 'Oversized Vehicle',
       'Reaction to Uninvolved Vehicle',
       'Lane Marking Improper/Inadequate', 'Animals Action',
       'Drugs (Illegal)', 'Aggressive Driving/Road Rage',
       'Passing or Lane Usage Improper', 'Illness', 'Obstruction/Debris',
       'Reaction to Other Uninvolved Vehicle', 'Brakes Defective',
       'Pavement Defective', 'Illnes', 'Following Too Closely',
       'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
       'Un

### I see 'Illnes' misspelled alongside 'Illness', as well as both 'Drugs (illegal)' and 'Drugs (Illegal)'. Let's fix those.

In [14]:
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
crashes['contributing_factor_vehicle_1'] = crashes['contributing_factor_vehicle_1'].replace(
    {'Illnes': 'Illness', 'Drugs (illegal)': 'Drugs (Illegal)'}
)
crashes['contributing_factor_vehicle_1'].unique()

array(['Fatigued/Drowsy', 'Lost Consciousness', 'Unspecified',
       'Prescription Medication', 'Driver Inattention/Distraction',
       'Driver Inexperience', 'Backing Unsafely',
       'Failure to Yield Right-of-Way', 'Alcohol Involvement',
       'Physical Disability', 'Pavement Slippery', 'Glare',
       'Traffic Control Disregarded', 'Passenger Distraction',
       'Other Vehicular', 'Failure to Keep Right', 'Turning Improperly',
       'Outside Car Distraction', 'View Obstructed/Limited',
       'Other Electronic Device', nan, 'Oversized Vehicle',
       'Reaction to Uninvolved Vehicle',
       'Lane Marking Improper/Inadequate', 'Animals Action',
       'Drugs (Illegal)', 'Aggressive Driving/Road Rage',
       'Passing or Lane Usage Improper', 'Illness', 'Obstruction/Debris',
       'Reaction to Other Uninvolved Vehicle', 'Brakes Defective',
       'Pavement Defective', 'Following Too Closely',
       'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
       'Unsafe Speed
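
### Spotting such variants by eye gets harder as the list grows. A minimal sketch, on a toy Series built from labels mentioned in this notebook, that flags case variants by grouping on a case-folded key and flags near-misspellings with the standard library's `difflib`. The 0.9 similarity cutoff is an assumption and would need tuning on real data:

```python
import difflib
import pandas as pd

# Toy stand-in for a label column, using labels mentioned in this notebook.
labels = pd.Series([
    "Illness", "Illnes", "Drugs (Illegal)", "Drugs (illegal)",
    "Glare", "Unspecified", "Illness",
])

unique_labels = pd.Series(labels.dropna().unique())

# 1. Case variants: group the distinct labels by their case-folded form.
by_folded = unique_labels.groupby(unique_labels.str.casefold()).agg(list)
case_variants = by_folded[by_folded.map(len) > 1]

# 2. Near-misspellings: for each label, look for other labels that are almost identical.
near_matches = {}
for label in unique_labels:
    others = [o for o in unique_labels if o.casefold() != label.casefold()]
    close = difflib.get_close_matches(label, others, n=3, cutoff=0.9)
    if close:
        near_matches[label] = close

print(case_variants.to_dict())
print(near_matches)
```

### Both results are only candidates for review; deciding which spelling to keep is still a judgement call.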

In [15]:
crashes['contributing_factor_vehicle_2'].unique()

array(['Pavement Slippery', 'Unspecified', nan,
       'Driver Inattention/Distraction', 'Fatigued/Drowsy',
       'Driver Inexperience', 'Traffic Control Disregarded',
       'Alcohol Involvement', 'Outside Car Distraction',
       'Backing Unsafely', 'Other Vehicular', 'Turning Improperly',
       'Failure to Yield Right-of-Way', 'Passenger Distraction',
       'Lost Consciousness', 'Physical Disability',
       'Reaction to Uninvolved Vehicle', 'Illness',
       'Unsafe Lane Changing', 'Oversized Vehicle',
       'Aggressive Driving/Road Rage', 'Other Electronic Device',
       'Obstruction/Debris', 'View Obstructed/Limited',
       'Prescription Medication', 'Fell Asleep',
       'Passing or Lane Usage Improper', 'Steering Failure',
       'Reaction to Other Uninvolved Vehicle', 'Pavement Defective',
       'Following Too Closely', 'Unsafe Speed',
       'Traffic Control Device Improper/Non-Working',
       'Lane Marking Improper/Inadequate', 'Brakes Defective', 'Glare',
       'Fa

### contributing_factor_vehicle_2 had the same problem with 'Illnes' being misspelled, so we will fix it now.

In [16]:
crashes['contributing_factor_vehicle_2'] = crashes['contributing_factor_vehicle_2'].replace('Illnes', 'Illness')

### The crashes vehicle_type_code columns had a few similar problems, which we will clean now.

In [17]:
# Map inconsistent spellings to one label across all five vehicle-type columns
vehicle_type_fixes = {
    'TAXI': 'Taxi',
    'PICK-UP TRUCK': 'Pick-up Truck',
    'AMBULANCE': 'Ambulance',
    'SPORT UTILITY / STATION WAGON': 'Station Wagon/Sport Utility Vehicle',
}
vehicle_type_columns = [f'vehicle_type_code_{i}' for i in range(1, 6)]
crashes[vehicle_type_columns] = crashes[vehicle_type_columns].replace(vehicle_type_fixes)

In [23]:
crashes['vehicle_type_code_1'].value_counts()

Station Wagon/Sport Utility Vehicle    30008
PASSENGER VEHICLE                      28084
Sedan                                  22139
Taxi                                    4558
Pick-up Truck                           2253
VAN                                     1542
UNKNOWN                                 1339
OTHER                                   1331
Box Truck                               1046
LARGE COM VEH(6 OR MORE TIRES)           891
BUS                                      860
Bus                                      807
SMALL COM VEH(4 TIRES)                   779
LIVERY VEHICLE                           577
Bike                                     562
Motorcycle                               406
Tractor Truck Diesel                     367
Van                                      300
Ambulance                                262
Convertible                              168
Dump                                     149
Flat Bed                                 121
MOTORCYCLE
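
### The counts above still split labels such as 'BUS'/'Bus' and 'VAN'/'Van'. A minimal sketch of collapsing case and whitespace variants in one pass, on a toy Series of labels like those above. Keeping the most frequent spelling of each group is an assumption of this sketch, not the approach used in the cells that follow:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `vehicle_type_code_1`, using labels like those in the outputs above.
vehicle_types = pd.Series([
    "Taxi", "TAXI", "Taxi", "Bus", "BUS", "Bus", "Van", "van", "Van",
    "SMALL COM VEH(4 TIRES) ", np.nan,
])

# Comparison key: trim stray whitespace and ignore case.
trimmed = vehicle_types.str.strip()
key = trimmed.str.casefold()

# For each key, keep the spelling that occurs most often (an assumption of this sketch).
canonical = trimmed.groupby(key).agg(lambda s: s.value_counts().idxmax())

# Map every row to its canonical spelling; missing values stay missing.
cleaned = key.map(canonical)
print(cleaned.value_counts())
```

### Truncated or abbreviated entries such as 'AMBUL' do not share a key with their full form, so they still need an explicit mapping.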

### I could spend another half an hour combing through that dataset for misspellings, odd groupings and so on. I will focus just on vehicle_type_code_1 for now, and leave the rest for another time, in case someone is interested in that subset of the data.

In [24]:
# One mapping instead of repeated replace(..., inplace=True) calls
crashes['vehicle_type_code_1'] = crashes['vehicle_type_code_1'].replace({
    'Other': 'Unknown',
    'BUS': 'Bus',
    'Box T': 'Box Truck',
    'GARBA': 'Garbage or Refuse',
    'CAB': 'Cab',
    'fdny': 'FIRE TRUCK',
})
crashes['vehicle_type_code_1'].unique()

array(['PASSENGER VEHICLE', 'VAN', 'Station Wagon/Sport Utility Vehicle',
       'Taxi', 'SMALL COM VEH(4 TIRES) ', 'OTHER',
       'LARGE COM VEH(6 OR MORE TIRES)', 'UNKNOWN', 'Pick-up Truck',
       'Bus', 'LIVERY VEHICLE', 'Sedan', 'MOTORCYCLE', 'Ambulance',
       'FIRE TRUCK', 'Bike', nan, 'PK', 'BICYCLE', 'SCOOTER', 'Box Truck',
       'Flat Bed', '4 dr sedan', 'Motorcycle', 'TRACT', 'RMP V', 'Van',
       'Tractor Truck Diesel', 'Convertible', 'Motorbike', 'ambul',
       'Dump', 'E-Sco', 'LIMO', '2015', 'Tractor Truck Gasoline', 'BOX',
       'Tanker', 'DOT T', 'Carry All', 'Garbage or Refuse', 'AMBUL',
       'Chassis Cab', 'Moped', 'E-Bik', 'Ambu', 'TRAIL', 'Beverage Truck',
       'FIRET', 'FDNY', 'TANKE', 'School Bus', 'Truck', 'Armored Truck',
       'Flat Rack', 'TRUCK', 'Bulk Agriculture', '2 dr sedan', 'FLAT/',
       'Tow Truck / Wrecker', 'MTA B', 'van', 'Scoot', 'road', 'FIRE',
       'Motorscooter', 'DELV', 'Horse', 'TOW T', 'TRC', 'Forkl', 'UTILI',
       'COM', 'S

In [26]:
# Group the variants by the label they should collapse to (duplicate entries removed)
more_fixes = {
    'VAN T': 'Van', 'van': 'Van', 'VAN/T': 'Van', 'van t': 'Van', 'VAN': 'Van',
    'AMBU': 'Ambulance', 'Ambul': 'Ambulance', 'AMB': 'Ambulance',
    'Ambu': 'Ambulance', 'AMBUL': 'Ambulance', 'ambul': 'Ambulance',
    'Fire': 'FIRE TRUCK', 'fire': 'FIRE TRUCK', 'FIRE': 'FIRE TRUCK',
    'FIRET': 'FIRE TRUCK', 'FDNY': 'FIRE TRUCK',
}
crashes['vehicle_type_code_1'] = crashes['vehicle_type_code_1'].replace(more_fixes)
crashes['vehicle_type_code_1'].unique()

array(['PASSENGER VEHICLE', 'Van', 'Station Wagon/Sport Utility Vehicle',
       'Taxi', 'SMALL COM VEH(4 TIRES) ', 'OTHER',
       'LARGE COM VEH(6 OR MORE TIRES)', 'UNKNOWN', 'Pick-up Truck',
       'Bus', 'LIVERY VEHICLE', 'Sedan', 'MOTORCYCLE', 'Ambulance',
       'FIRE TRUCK', 'Bike', nan, 'PK', 'BICYCLE', 'SCOOTER', 'Box Truck',
       'Flat Bed', '4 dr sedan', 'Motorcycle', 'TRACT', 'RMP V',
       'Tractor Truck Diesel', 'Convertible', 'Motorbike', 'Dump',
       'E-Sco', 'LIMO', '2015', 'Tractor Truck Gasoline', 'BOX', 'Tanker',
       'DOT T', 'Carry All', 'Garbage or Refuse', 'Chassis Cab', 'Moped',
       'E-Bik', 'TRAIL', 'Beverage Truck', 'TANKE', 'School Bus', 'Truck',
       'Armored Truck', 'Flat Rack', 'TRUCK', 'Bulk Agriculture',
       '2 dr sedan', 'FLAT/', 'Tow Truck / Wrecker', 'MTA B', 'Scoot',
       'road', 'Motorscooter', 'DELV', 'Horse', 'TOW T', 'TRC', 'Forkl',
       'UTILI', 'COM', 'Stake or Rack', '3-Door', 'Refrigerated Van',
       'FEDEX', 'FUEL', 'sc

### Even in one column you can see how much variation there is. I would suggest that whoever maintains this dataset change this from a free-text field to a select field with a predetermined list of values, to get better data fidelity, and I hope by this point you can see why. Although there is more work to be done, I'll turn to another important part of data cleanup: typecasting.
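
### The predetermined-list idea can be approximated after the fact with a categorical dtype, which is itself a typecast. A minimal sketch, assuming a hypothetical list of allowed vehicle types; a real list would have to come from the data owner:

```python
import pandas as pd

# Hypothetical list of allowed vehicle types, for illustration only.
allowed_types = ["Sedan", "Taxi", "Bus", "Van", "Ambulance"]

# Toy stand-in for free-text entries, using labels seen in the outputs above.
raw = pd.Series(["Sedan", "Taxi", "2015", "Horse", "Van", "FEDEX"])

# Blank out anything outside the allowed list, then cast to a fixed set of categories.
is_allowed = raw.isin(allowed_types)
vehicle_dtype = pd.CategoricalDtype(categories=allowed_types)
typed = raw.where(is_allowed).astype(vehicle_dtype)

# Entries that a select field would have rejected.
rejected = raw[~is_allowed & raw.notna()]

print(typed.value_counts())
print(rejected.tolist())
```

### Reviewing `rejected` shows which free-text entries still need a mapping before they are discarded.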

In [18]:
crashes['date'].dtype

dtype('O')

### Pandas has a lot of datetime functionality that will remain untapped until we convert the `date` column into a proper datetime type, using the `to_datetime` function.

In [19]:
crashes['date'] = pd.to_datetime(crashes['date'])
crashes['date'].dtype

dtype('<M8[ns]')
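
### A short sketch of what the conversion unlocks, on a toy DataFrame using date and time values from the rows shown earlier. The derived column names (`weekday`, `month`, `crash_datetime`, `hour`) are hypothetical:

```python
import pandas as pd

# Toy stand-in for the `date` and `time` columns, using values from the rows above.
sample = pd.DataFrame({
    "date": ["2013-02-23T00:00:00.000", "2013-03-17T00:00:00.000", "2013-03-05T00:00:00.000"],
    "time": ["8:00", "22:00", "12:40"],
})
sample["date"] = pd.to_datetime(sample["date"])

# The .dt accessor exposes calendar fields once the column is datetime64.
sample["weekday"] = sample["date"].dt.day_name()
sample["month"] = sample["date"].dt.month

# Combine the date with the time of day into a single timestamp.
sample["crash_datetime"] = sample["date"] + pd.to_timedelta(sample["time"] + ":00")
sample["hour"] = sample["crash_datetime"].dt.hour

print(sample[["crash_datetime", "weekday", "month", "hour"]])
```

### With a real timestamp, grouping crashes by weekday or hour becomes a one-line `groupby`.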

# <center> <br>[Beginning of the page](#Top)</center>