# Los Angeles Traffic Collision Data Cleaning
---

## Summary:
1. Import Modules and Data
2. Data Exploration
3. Data Cleaning
    1. Drop Useless Columns
    2. Drop Null Values
    3. Rename Columns
    4. Map Ethnicities
    5. Map Genders
    6. Map Time
4. Export Clean Data


Data Source: https://data.lacity.org/A-Safe-City/Traffic-Collision-Data-from-2010-to-Present/d5tf-ez2w


## 1. Import Modules and Data

In [1]:
import pandas as pd

# Read csv file
df = pd.read_csv("Data/raw_data/Traffic_Collision_Data_from_2010_to_Present.csv")
df.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Age,Victim Sex,Victim Descent,Premise Code,Premise Description,Address,Cross Street,Location
0,191513598,07/06/2019,07/06/2019,2355,15,N Hollywood,1591,997,TRAFFIC COLLISION,,99.0,M,O,101.0,STREET,GOODLAND AV,GOODLAND DR,"(34.1371, -118.4062)"
1,191611142,07/06/2019,07/06/2019,500,16,Foothill,1677,997,TRAFFIC COLLISION,,45.0,M,W,101.0,STREET,GLENOAKS BL,NETTLETON ST,"(34.2249, -118.3617)"
2,191011898,07/06/2019,07/06/2019,1130,10,West Valley,1028,997,TRAFFIC COLLISION,,25.0,M,A,101.0,STREET,SHERMAN WY,FORBES AV,"(34.2012, -118.4989)"
3,191113135,07/06/2019,07/06/2019,1415,11,Northeast,1153,997,TRAFFIC COLLISION,,29.0,M,O,101.0,STREET,LOS FELIZ BL,FERN DELL DR,"(34.1081, -118.3078)"
4,190117683,07/06/2019,07/06/2019,1230,1,Central,192,997,TRAFFIC COLLISION,,41.0,M,O,101.0,STREET,GRAND AV,PICO BL,"(34.0384, -118.2646)"


## 2. Data Exploration
---

In [2]:
# Explore data-set to see what columns exist and their respective data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479425 entries, 0 to 479424
Data columns (total 18 columns):
DR Number                 479425 non-null int64
Date Reported             479425 non-null object
Date Occurred             479425 non-null object
Time Occurred             479425 non-null int64
Area ID                   479425 non-null int64
Area Name                 479425 non-null object
Reporting District        479425 non-null int64
Crime Code                479425 non-null int64
Crime Code Description    479425 non-null object
MO Codes                  394276 non-null object
Victim Age                401937 non-null float64
Victim Sex                472440 non-null object
Victim Descent            471720 non-null object
Premise Code              479400 non-null float64
Premise Description       479400 non-null object
Address                   479425 non-null object
Cross Street              458069 non-null object
Location                  479425 non-null object
dtypes: fl

## 3. Data Cleaning

### a) Drop Useless Columns

In [3]:
# Drop columns that will not help during the analysis
drop_columns_df = df.drop(columns=[
    "DR Number", # Not useful for analysis
    "Date Reported", # More interested in Date Occurred
    "Area ID", # Could use Area Name
    "Crime Code", # Uniform data: "997"
    "Crime Code Description", # Uniform data: "TRAFFIC COLLISION"
    "MO Codes", # Too many null values
    "Premise Code", # Could use Premise Description
    "Cross Street", # Could use Address or Location
    "Address", # redundant data
    "Premise Description" # Unnecessary
]) 

drop_columns_df.head()

Unnamed: 0,Date Occurred,Time Occurred,Area Name,Reporting District,Victim Age,Victim Sex,Victim Descent,Location
0,07/06/2019,2355,N Hollywood,1591,99.0,M,O,"(34.1371, -118.4062)"
1,07/06/2019,500,Foothill,1677,45.0,M,W,"(34.2249, -118.3617)"
2,07/06/2019,1130,West Valley,1028,25.0,M,A,"(34.2012, -118.4989)"
3,07/06/2019,1415,Northeast,1153,29.0,M,O,"(34.1081, -118.3078)"
4,07/06/2019,1230,Central,192,41.0,M,O,"(34.0384, -118.2646)"
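
Two of the columns above are dropped because their values are uniform. One way this could be checked before dropping is to count the distinct values per column. This is a minimal sketch on a small made-up frame; `toy_df`, `constant_cols`, and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data
toy_df = pd.DataFrame({
    "Crime Code": [997, 997, 997],
    "Area Name": ["Central", "Foothill", "Central"],
})

# A column with a single distinct value carries no information for the analysis
constant_cols = [
    col for col in toy_df.columns
    if toy_df[col].nunique(dropna=False) == 1
]
print(constant_cols)
```

Run against the real `df`, a check like this would be expected to flag `Crime Code` and `Crime Code Description` if they are uniform, as the comments above state.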


### b) Drop Null Values

In [4]:
# See how many null values exist for each column
pd.isnull(drop_columns_df).sum()

Date Occurred             0
Time Occurred             0
Area Name                 0
Reporting District        0
Victim Age            77488
Victim Sex             6985
Victim Descent         7705
Location                  0
dtype: int64

In [5]:
# Drop all null values
drop_nulls_df = drop_columns_df.dropna()

In [6]:
# Check row counts for each column after dropping nulls
drop_nulls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400726 entries, 0 to 479424
Data columns (total 8 columns):
Date Occurred         400726 non-null object
Time Occurred         400726 non-null int64
Area Name             400726 non-null object
Reporting District    400726 non-null int64
Victim Age            400726 non-null float64
Victim Sex            400726 non-null object
Victim Descent        400726 non-null object
Location              400726 non-null object
dtypes: float64(1), int64(2), object(5)
memory usage: 27.5+ MB
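
Comparing the two `info()` outputs, `dropna()` reduced the data from 479425 to 400726 rows, a loss of 78699 rows (about 16.4%). The same comparison can be made in code. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for drop_columns_df
toy_df = pd.DataFrame({
    "Victim Age": [99.0, np.nan, 25.0, 29.0],
    "Victim Sex": ["M", "M", None, "F"],
    "Victim Descent": ["O", "W", "A", None],
})

rows_before = len(toy_df)
toy_clean = toy_df.dropna()  # drops every row that has at least one null
rows_after = len(toy_clean)

dropped = rows_before - rows_after
print(f"Dropped {dropped} of {rows_before} rows ({dropped / rows_before:.1%})")
```

If losing that many rows matters for a given analysis, `dropna(subset=[...])` restricts the check to selected columns.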


### c) Rename Columns

In [7]:
# Rename columns to short snake_case names and reset the index left fragmented by dropna()
rename_columns_df = drop_nulls_df.rename(columns={
    "Date Occurred" : "date",
    "Time Occurred" : "time",
    "Area Name" : "area",
    "Reporting District" : "district",
    "Victim Age" : "victim_age",
    "Victim Sex" : "victim_gender",
    "Victim Descent" : "victim_race",
    "Location" : "location"}).reset_index(drop=True)

rename_columns_df.head()

Unnamed: 0,date,time,area,district,victim_age,victim_gender,victim_race,location
0,07/06/2019,2355,N Hollywood,1591,99.0,M,O,"(34.1371, -118.4062)"
1,07/06/2019,500,Foothill,1677,45.0,M,W,"(34.2249, -118.3617)"
2,07/06/2019,1130,West Valley,1028,25.0,M,A,"(34.2012, -118.4989)"
3,07/06/2019,1415,Northeast,1153,29.0,M,O,"(34.1081, -118.3078)"
4,07/06/2019,1230,Central,192,41.0,M,O,"(34.0384, -118.2646)"


### d) Map Ethnicities

Ethnicity Map Documentation: https://data.lacity.org/A-Safe-City/Traffic-Collision-Data-from-2010-to-Present/d5tf-ez2w

In [8]:
# Map each ethnicity initial to an ethnicity group (this step is currently commented out, so victim_race keeps its raw codes)
# race_dict = {'H':'Hispanic', 'B':'Black', 'O':'Unknown', 'W':'White', 'X':'Unknown', '-':'Unknown',
#              'A':'Asian', 'K':'Asian', 'C':'Asian', 'F':'Asian', 'U':'Pacific Islander',
#              'J':'Asian', 'P':'Pacific Islander', 'V':'Asian', 'Z':'Asian',
#              'I':'American Indian', 'G':'Pacific Islander', 'S':'Pacific Islander', 'D':'Asian', 'L':'Asian'}

# rename_columns_df["victim_race"] = rename_columns_df["victim_race"].map(race_dict)

# map_ethnicities_df = rename_columns_df
# map_ethnicities_df.head()

### e) Map Genders

In [9]:
# Map each gender initial to a gender group
map_gender_df = rename_columns_df.copy() # Work on a copy so rename_columns_df is left unchanged

gender_dict = {
    'M':'Male',
    'F':'Female',
    'X':'Unknown',
    'H':'Unknown',
    'N':'Unknown'}

map_gender_df["victim_gender"] = map_gender_df["victim_gender"].map(gender_dict)
map_gender_df.head()

Unnamed: 0,date,time,area,district,victim_age,victim_gender,victim_race,location
0,07/06/2019,2355,N Hollywood,1591,99.0,Male,O,"(34.1371, -118.4062)"
1,07/06/2019,500,Foothill,1677,45.0,Male,W,"(34.2249, -118.3617)"
2,07/06/2019,1130,West Valley,1028,25.0,Male,A,"(34.2012, -118.4989)"
3,07/06/2019,1415,Northeast,1153,29.0,Male,O,"(34.1081, -118.3078)"
4,07/06/2019,1230,Central,192,41.0,Male,O,"(34.0384, -118.2646)"
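
One thing to keep in mind with `Series.map` and a dictionary: any code that is not a key in the dictionary becomes `NaN`. The sketch below shows how such codes could be detected and given a fallback. The sample series and the code `"Q"` are made up for illustration, and filling with `"Unknown"` is only one possible choice.

```python
import pandas as pd

gender_dict = {"M": "Male", "F": "Female", "X": "Unknown", "H": "Unknown", "N": "Unknown"}

# Hypothetical sample; "Q" is a made-up code that is absent from gender_dict
codes = pd.Series(["M", "F", "X", "Q"])

mapped = codes.map(gender_dict)            # "Q" has no entry, so it becomes NaN
unmapped = codes[mapped.isna()].unique()   # codes that had no dictionary entry
mapped_filled = mapped.fillna("Unknown")   # one possible fallback for those codes

print(unmapped)
print(mapped_filled.tolist())
```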


### f) Map Time

In [24]:
time_df = map_gender_df.copy()
# Zero-pad the integer time to four digits (e.g. 500 -> "0500") in one vectorized step;
# this avoids the row-by-row chained assignment that triggers SettingWithCopyWarning
time_df['time_str'] = time_df['time'].astype(str).str.zfill(4)


In [27]:
# Combine the date and zero-padded time strings, then parse them into datetime values
time_df['datetime'] = pd.to_datetime(time_df['date'] + ' ' + time_df['time_str'], format='%m/%d/%Y %H%M')
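
Strict parsing stops with an error at the first time value that is not a valid `HHMM`. A hedged alternative is to pass `errors="coerce"` to `pd.to_datetime`, which turns unparseable values into `NaT` so they can be inspected afterwards. The frame below is made up for illustration, and the value `2560` is deliberately invalid; whether the real data contains such values is not known from this notebook.

```python
import pandas as pd

# Hypothetical toy frame; 2560 is a deliberately invalid time (minute 60)
toy = pd.DataFrame({
    "date": ["07/06/2019", "07/06/2019", "07/06/2019"],
    "time": [2355, 500, 2560],
})

# Zero-pad to four digits so that 500 is read as 05:00
time_str = toy["time"].astype(str).str.zfill(4)

toy["datetime"] = pd.to_datetime(
    toy["date"] + " " + time_str,
    format="%m/%d/%Y %H%M",
    errors="coerce",  # invalid date/time combinations become NaT
)

bad_rows = toy[toy["datetime"].isna()]  # rows that failed to parse
print(bad_rows)
```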

In [32]:
# Keep only the combined datetime column; drop() returns a new DataFrame, so assign the result
clean_df = time_df.drop(columns=['time','date','time_str'])
clean_df.head(10)

Unnamed: 0,area,district,victim_age,victim_gender,victim_race,location,datetime
0,N Hollywood,1591,99.0,Male,O,"(34.1371, -118.4062)",2019-07-06 23:55:00
1,Foothill,1677,45.0,Male,W,"(34.2249, -118.3617)",2019-07-06 05:00:00
2,West Valley,1028,25.0,Male,A,"(34.2012, -118.4989)",2019-07-06 11:30:00
3,Northeast,1153,29.0,Male,O,"(34.1081, -118.3078)",2019-07-06 14:15:00
4,Central,192,41.0,Male,O,"(34.0384, -118.2646)",2019-07-06 12:30:00
5,Hollywood,645,46.0,Male,W,"(34.0944, -118.3441)",2019-07-06 01:15:00
6,Newton,1351,41.0,Male,H,"(34.0075, -118.2775)",2019-07-06 08:00:00
7,N Hollywood,1547,30.0,Female,W,"(34.1697, -118.3822)",2019-07-06 08:00:00
8,Topanga,2157,32.0,Male,H,"(34.1938, -118.5884)",2019-07-06 08:30:00
9,Rampart,231,24.0,Female,H,"(34.0719, -118.2822)",2019-07-06 06:30:00


## 4. Export Clean Data

In [33]:
# Export cleaned data set into a csv file
time_df.to_csv("data/clean_data/clean_data_with_time.csv",index=True)
clean_df.to_csv("data/clean_data/clean_data.csv",index=True)
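
CSV files store datetimes as plain text, so the `datetime` column has to be parsed again when the file is read back. The sketch below illustrates the round trip with an in-memory buffer instead of a real file. The toy frame and variable names are assumptions for illustration; `index_col=0` matches the `index=True` setting used in the export above.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for clean_df
toy_clean = pd.DataFrame({
    "area": ["N Hollywood", "Foothill"],
    "datetime": pd.to_datetime(["2019-07-06 23:55:00", "2019-07-06 05:00:00"]),
})

buffer = io.StringIO()                # in-memory stand-in for the CSV file
toy_clean.to_csv(buffer, index=True)  # same index setting as the export above
buffer.seek(0)

# index_col=0 restores the written index; parse_dates restores the datetime dtype
restored = pd.read_csv(buffer, index_col=0, parse_dates=["datetime"])
print(restored.dtypes)
```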