# Crime Data Preprocessing
As of 3/10/2020, the [dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) provided by the City of Chicago on crime (excluding murders) contains over 7 million rows and 22 columns. To facilitate early exploration of the data and focus on more recent, relevant trends, I reduced the dataset to the last 10 years (2010 through 2019), removed unneeded columns, and dropped rows with nulls. The reduced dataset contains just under 3 million rows.  
  
After reducing the size of the dataset, I cleaned up the text columns by manually matching the values of each column to a smaller set of categories in Excel, mapped the Community Area IDs to their names and regions (e.g. Community Area 8 maps to Near North Side in the Central region) based on [this](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) Wikipedia page, and added categorical columns derived from the date of the crime.  

## Table of Contents
- [Dataset Description](#original_desc)  
- [Reading in the Original Dataset](#read_in)  
- [Removing Crimes from Before 10 Years Ago](#remove_10_years)  
- [Removing Unneeded Columns](#unneeded_columns)
- [Cleaning Up Text Columns](#clean_up)
- [Pulling in Community Area Names and Regions](#community_areas)
- [Creating New Datetime Columns based on the Date of the Crime](#date_columns)
- [Dropping Nulls from the Dataset](#nulls)
- [Saving the Cleaned Dataset](#saving)

<a id="original_desc"></a>

## Dataset Description
These are the original column descriptions from the City of Chicago [website](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2).  

| Column Name  | Column Description |  
| :-:    | :-: |  
| ID           | Unique identifier for the record |
| Case Number  | The Chicago Police Department Records Division Number |
| Date         | Date when the incident occurred (sometimes an estimate) |
| Block        | The partially redacted address where the incident occurred, placing it on the same block as the actual address |
| IUCR         | Illinois Uniform Crime Reporting code |
| Primary Type | The primary description of the IUCR code |
| Description  | The secondary description of the IUCR code, a subcategory of the primary description |
| Location Description | Description of the location where the incident occurred |
| Arrest | Indicates whether an arrest was made |
| Domestic | Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence act |
| Beat | Indicates the beat where the incident occurred. A beat is the smallest police geographic area. 3 to 5 beats make up a police sector, and 3 sectors make up a police district |
| District | Indicates the police district where the incident occurred |
| Ward | The ward (City Council district) where the incident occurred |
| Community Area | Indicates the community area where the incident occurred (Chicago has 77 community areas) |
| FBI Code | Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS) |
| X Coordinate | The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Y Coordinate | The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Year | The year the incident occurred |
| Updated On | Date and time the record was last updated |
| Latitude | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Location | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block |

<a id="read_in"></a>

## Reading in the Original Dataset

In [1]:
import pandas as pd
import datetime as dt

In [2]:
crimes = pd.read_csv('crimes_original.csv')
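
Reading all 22 columns of a file this size is memory-hungry, and pandas has to guess each column's type. The following is a minimal sketch, assuming only the columns kept later in this notebook are needed: the standard `usecols` and `dtype` parameters of `pd.read_csv` restrict and type the columns at read time. The inline sample rows (including the case numbers) are made up for illustration and stand in for `crimes_original.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for crimes_original.csv
sample_csv = io.StringIO(
    "ID,Case Number,Date,Primary Type,Arrest,Community Area,Year\n"
    "1,HZ100001,10/08/2017 03:00:00 AM,THEFT,False,73,2017\n"
    "2,HZ100002,03/28/2017 02:00:00 PM,BURGLARY,True,,2017\n"
    "3,HZ100003,02/10/2013 12:00:00 AM,THEFT,False,69,2013\n"
)

# Columns to load, and explicit types for the ones pandas would otherwise infer
keep_cols = ["Date", "Primary Type", "Arrest", "Community Area", "Year"]
col_types = {"Primary Type": "category", "Community Area": "float64", "Year": "int64"}

sample = pd.read_csv(sample_csv, usecols=keep_cols, dtype=col_types)
print(sample.dtypes)
```

`Community Area` is read as a float here because the column contains missing values.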



In [3]:
#Date is still a string here, so min/max compare text rather than actual dates
print("There were {:,d} crimes from {} to {}".format(crimes.shape[0],
                                                     crimes.Date.min(),
                                                     crimes.Date.max()))

There were 7,084,356 crimes from 01/01/2001 01:00:00 AM to 12/31/2019 12:58:00 AM
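
Because `Date` has not been converted to a datetime at this point, `min()` and `max()` compare the values as text. The sketch below uses four hypothetical timestamps in the dataset's format to show how text ordering and chronological ordering can disagree. It uses only the standard library, and the sample values are assumptions, not rows taken from the dataset.

```python
from datetime import datetime

# Hypothetical timestamps in the dataset's MM/DD/YYYY hh:mm:ss AM/PM format
dates = [
    "01/01/2001 12:00:00 AM",  # midnight: the chronologically earliest value
    "01/01/2001 01:00:00 AM",
    "12/31/2019 12:58:00 AM",
    "12/31/2019 11:59:00 PM",  # the chronologically latest value
]
FMT = "%m/%d/%Y %I:%M:%S %p"

# Strings compare character by character, so "01:..." sorts before "12:..."
string_min, string_max = min(dates), max(dates)

# Parsing first compares actual points in time
parsed = [datetime.strptime(d, FMT) for d in dates]
true_min, true_max = min(parsed), max(parsed)

print(string_min, "|", string_max)
print(true_min, "|", true_max)
```

The same effect applies across years, since the month comes first in this format.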


<a id="remove_10_years"></a>

## Removing Crimes from Before 10 Years Ago

In [4]:
#keeping only crimes from 2010 onward (the last 10 years of data)
crimes_cleaned = crimes[crimes.Year >= 2010].copy()

#displaying the results
print("There were {:,d} crimes from {} to {}".format(crimes_cleaned.shape[0],
                                                     crimes_cleaned.Date.min(),
                                                     crimes_cleaned.Date.max()))

There were 3,007,770 crimes from 01/01/2010 01:00:00 AM to 12/31/2019 12:58:00 AM


<a id="unneeded_columns"></a>

## Removing Unneeded Columns

In [5]:
#creating a list of the columns to drop
drop_cols = ['ID','Case Number', 'Block', 'Description', 
             'Beat', 'District', 'Ward', 'IUCR', 'FBI Code', 
             'X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude', 
             'Updated On']

#dropping the columns
crimes_cleaned.drop(labels=drop_cols, axis=1, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Location Description,Arrest,Domestic,Community Area,Year,Location
1,10/08/2017 03:00:00 AM,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,73.0,2017,
2,03/28/2017 02:00:00 PM,BURGLARY,OTHER,False,False,70.0,2017,
3,09/09/2017 08:17:00 PM,THEFT,RESIDENCE,False,False,42.0,2017,
4,08/26/2017 10:00:00 AM,CRIM SEXUAL ASSAULT,HOTEL/MOTEL,False,False,32.0,2017,
5,02/10/2013 12:00:00 AM,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,69.0,2013,


<a id="clean_up"></a>

## Cleaning Up Text Columns

### Primary Type

In [6]:
#original values
crimes_cleaned['Primary Type'].unique()

array(['CRIM SEXUAL ASSAULT', 'BURGLARY', 'THEFT',
       'OFFENSE INVOLVING CHILDREN', 'DECEPTIVE PRACTICE',
       'CRIMINAL DAMAGE', 'OTHER OFFENSE', 'SEX OFFENSE', 'ASSAULT',
       'NARCOTICS', 'ROBBERY', 'CRIMINAL TRESPASS', 'WEAPONS VIOLATION',
       'MOTOR VEHICLE THEFT', 'BATTERY', 'OBSCENITY',
       'LIQUOR LAW VIOLATION', 'PROSTITUTION', 'NON-CRIMINAL',
       'PUBLIC PEACE VIOLATION', 'INTIMIDATION', 'ARSON', 'STALKING',
       'INTERFERENCE WITH PUBLIC OFFICER',
       'CONCEALED CARRY LICENSE VIOLATION', 'KIDNAPPING',
       'HUMAN TRAFFICKING', 'HOMICIDE', 'GAMBLING', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'NON - CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)'], dtype=object)

In [7]:
#list of other values to group under "NON-CRIMINAL"
non_criminal_list = ['NON - CRIMINAL','NON-CRIMINAL (SUBJECT SPECIFIED)']

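#replacing the other values with "NON-CRIMINAL"
#(assigning the result back instead of calling replace in place on a column selection)
crimes_cleaned['Primary Type'] = crimes_cleaned['Primary Type'].replace(
    to_replace=non_criminal_list, value='NON-CRIMINAL', regex=False)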

#Converting values to title case
crimes_cleaned['Primary Type'] = crimes_cleaned['Primary Type'].str.title()

#displaying the results
crimes_cleaned['Primary Type'].unique()

array(['Crim Sexual Assault', 'Burglary', 'Theft',
       'Offense Involving Children', 'Deceptive Practice',
       'Criminal Damage', 'Other Offense', 'Sex Offense', 'Assault',
       'Narcotics', 'Robbery', 'Criminal Trespass', 'Weapons Violation',
       'Motor Vehicle Theft', 'Battery', 'Obscenity',
       'Liquor Law Violation', 'Prostitution', 'Non-Criminal',
       'Public Peace Violation', 'Intimidation', 'Arson', 'Stalking',
       'Interference With Public Officer',
       'Concealed Carry License Violation', 'Kidnapping',
       'Human Trafficking', 'Homicide', 'Gambling', 'Public Indecency',
       'Other Narcotic Violation'], dtype=object)
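
The explicit list above is the conservative way to merge the three "non-criminal" spellings. As an alternative, the sketch below normalizes them by pattern: it strips a trailing parenthetical and collapses spaces around hyphens before title-casing. It assumes no other `Primary Type` value contains a parenthetical or a spaced hyphen that should be kept, and the sample series is made up for illustration.

```python
import pandas as pd

# Hypothetical sample containing the three "non-criminal" spellings seen above
primary = pd.Series([
    "NON-CRIMINAL",
    "NON - CRIMINAL",
    "NON-CRIMINAL (SUBJECT SPECIFIED)",
    "THEFT",
])

normalized = (
    primary
    .str.replace(r"\s*\(.*\)$", "", regex=True)  # drop a trailing parenthetical
    .str.replace(r"\s*-\s*", "-", regex=True)    # collapse spaces around hyphens
    .str.title()                                 # convert to title case
)
print(normalized.unique())
```

A pattern can merge categories unintentionally, so checking `unique()` before and after, as this notebook does, remains worthwhile.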

### Location Description

In [8]:
#original values of Location Description
crimes_cleaned['Location Description'].unique()

array(['RESIDENCE', 'OTHER', 'HOTEL/MOTEL', nan, 'APARTMENT', 'SIDEWALK',
       'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE-GARAGE',
       'HOSPITAL BUILDING/GROUNDS', 'BANK', 'RESTAURANT',
       'SCHOOL, PUBLIC, BUILDING', 'STREET',
       'AIRPORT BUILDING NON-TERMINAL - SECURE AREA',
       'RESIDENCE PORCH/HALLWAY', 'RESIDENTIAL YARD (FRONT/BACK)',
       'BAR OR TAVERN', 'AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA',
       'DEPARTMENT STORE', 'ALLEY', 'VEHICLE NON-COMMERCIAL',
       'GOVERNMENT BUILDING/PROPERTY', 'AUTO / BOAT / RV DEALERSHIP',
       'VACANT LOT/LAND', 'WAREHOUSE', 'POOL ROOM',
       'COMMERCIAL / BUSINESS OFFICE', 'POLICE FACILITY/VEH PARKING LOT',
       'PARK PROPERTY', 'MEDICAL/DENTAL OFFICE', 'BOAT/WATERCRAFT',
       'GROCERY FOOD STORE', 'CTA STATION', 'CONVENIENCE STORE',
       'ATHLETIC CLUB', 'SMALL RETAIL STORE', 'AIRPORT/AIRCRAFT',
       'ANIMAL HOSPITAL', 'ATM (AUTOMATIC TELLER MACHINE)', 'CTA BUS',
       'CURRENCY EXCHANGE', 'DRIVEWAY -

In [9]:
#reading in manual mapping of unique original values to new ones
location = pd.read_csv('Location.csv')
location.head()

Unnamed: 0,Original Value,New Value
0,RESIDENCE,Residence
1,OTHER,Other
2,HOTEL/MOTEL,Hotel/Motel
3,APARTMENT,Residence
4,SIDEWALK,Sidewalk


In [10]:
#merging the crimes_cleaned df with the location df 
crimes_cleaned = crimes_cleaned.merge(location, how='left', 
                                      left_on='Location Description',
                                      right_on='Original Value')

#removing the original value columns
crimes_cleaned.drop(labels=['Location Description','Original Value'],axis=1,
                    inplace=True)

#renaming the new value column as Location Description
crimes_cleaned.rename(columns={'New Value':'Location Description'}, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Community Area,Year,Location,Location Description
0,10/08/2017 03:00:00 AM,Crim Sexual Assault,False,False,73.0,2017,,Residence
1,03/28/2017 02:00:00 PM,Burglary,False,False,70.0,2017,,Other
2,09/09/2017 08:17:00 PM,Theft,False,False,42.0,2017,,Residence
3,08/26/2017 10:00:00 AM,Crim Sexual Assault,False,False,32.0,2017,,Hotel/Motel
4,02/10/2013 12:00:00 AM,Crim Sexual Assault,False,False,69.0,2013,,Residence
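
A left merge against a hand-built mapping can go wrong in two quiet ways: a duplicated key in the mapping multiplies rows, and a value missing from the mapping becomes null. The sketch below shows how the `validate` and `indicator` parameters of `DataFrame.merge` can surface both. The two small frames are hypothetical stand-ins for `crimes_cleaned` and `Location.csv`, and `ROOFTOP` is an invented unmapped value.

```python
import pandas as pd

# Hypothetical miniature versions of crimes_cleaned and Location.csv
crimes_demo = pd.DataFrame({
    "Primary Type": ["Theft", "Burglary", "Theft"],
    "Location Description": ["APARTMENT", "SIDEWALK", "ROOFTOP"],
})
location_demo = pd.DataFrame({
    "Original Value": ["RESIDENCE", "APARTMENT", "SIDEWALK"],
    "New Value": ["Residence", "Residence", "Sidewalk"],
})

merged = crimes_demo.merge(
    location_demo,
    how="left",
    left_on="Location Description",
    right_on="Original Value",
    validate="many_to_one",  # raises MergeError if a key repeats in the mapping
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Values with no match in the mapping end up null after the merge
unmapped = merged.loc[merged["_merge"] == "left_only", "Location Description"]
print(unmapped.tolist())
```

The `_merge` column would be dropped along with the other helper columns once the check is done.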


<a id="community_areas"></a>

## Pulling in Community Area Names and Regions

In [11]:
#reading in the mapping of the Community Area ID to the Name and Region
#values from Wikipedia
community_areas = pd.read_csv('Community_Areas.csv')
community_areas.head()

Unnamed: 0,ID,Name,Region
0,8,Near North Side,Central
1,32,Loop,Central
2,33,Near South Side,Central
3,5,North Center,North Side
4,6,Lake View,North Side


In [12]:
#merging the community area df with the reduced crimes df
crimes_cleaned = crimes_cleaned.merge(community_areas, how='left', 
                                      left_on='Community Area',
                                      right_on='ID')

#dropping the ID columns
crimes_cleaned.drop(labels=['Community Area','ID'],axis=1,inplace=True)

#renaming the Name column as Community Area
crimes_cleaned.rename(columns={'Name':'Community Area'}, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Year,Location,Location Description,Community Area,Region
0,10/08/2017 03:00:00 AM,Crim Sexual Assault,False,False,2017,,Residence,Washington Heights,Far Southwest Side
1,03/28/2017 02:00:00 PM,Burglary,False,False,2017,,Other,Ashburn,Far Southwest Side
2,09/09/2017 08:17:00 PM,Theft,False,False,2017,,Residence,Woodlawn,South Side
3,08/26/2017 10:00:00 AM,Crim Sexual Assault,False,False,2017,,Hotel/Motel,Loop,Central
4,02/10/2013 12:00:00 AM,Crim Sexual Assault,False,False,2013,,Residence,Greater Grand Crossing,South Side
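
Rows whose Community Area ID is missing, or is not present in the mapping, come out of this left merge with a null name and region. The sketch below separates those two cases before merging. The frames are hypothetical, and the ID `0` is an invented example of a value that is absent from the mapping.

```python
import pandas as pd

# Hypothetical miniature inputs; 'Community Area' is a float column because it contains NaN
crimes_demo = pd.DataFrame({"Community Area": [8.0, 32.0, 0.0, None]})
areas_demo = pd.DataFrame({
    "ID": [8, 32, 33],
    "Name": ["Near North Side", "Loop", "Near South Side"],
    "Region": ["Central", "Central", "Central"],
})

# Cast the IDs to float so both sides of the comparison share a dtype
known_ids = areas_demo["ID"].astype("float64")
area = crimes_demo["Community Area"]

missing_id = area.isna()                           # no ID recorded at all
unknown_id = area.notna() & ~area.isin(known_ids)  # ID present but not in the mapping

print(int(missing_id.sum()), "rows without an ID")
print(sorted(area[unknown_id].unique()), "IDs absent from the mapping")
```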


<a id="date_columns"></a>

## Creating New Datetime Columns based on the Date of the Crime

In [13]:
#converting Date column to datetime data type (this takes a while)
crimes_cleaned.Date = pd.to_datetime(crimes_cleaned.Date)
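
Passing an explicit `format` to `pd.to_datetime` removes the need to infer the layout of the strings, which is often faster on large columns and fails loudly if a value does not match. A minimal sketch, assuming every value follows the `MM/DD/YYYY hh:mm:ss AM/PM` layout seen in the outputs above (the two sample values are copied from those outputs):

```python
import pandas as pd

# Two sample values in the dataset's MM/DD/YYYY hh:mm:ss AM/PM format
raw = pd.Series(["10/08/2017 03:00:00 AM", "03/28/2017 02:00:00 PM"])

# An explicit format avoids format inference and rejects non-matching strings
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```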

In [14]:
#adding a column for the month of the crime
crimes_cleaned['Month'] = crimes_cleaned.Date.dt.month

#adding a column for the day of week of the crime
crimes_cleaned['Day of Week'] = crimes_cleaned.Date.dt.day_name()

#creating index vars for time of day (morning, afternoon, evening, and night)
morning_idx   = (crimes_cleaned.Date.dt.time >= dt.time(5))  & (crimes_cleaned.Date.dt.time < dt.time(12)) # 5am to 12pm
afternoon_idx = (crimes_cleaned.Date.dt.time >= dt.time(12)) & (crimes_cleaned.Date.dt.time < dt.time(17)) #12pm to  5pm
evening_idx   = (crimes_cleaned.Date.dt.time >= dt.time(17)) & (crimes_cleaned.Date.dt.time < dt.time(20)) # 5pm to  8pm
night_idx     = (crimes_cleaned.Date.dt.time >= dt.time(20)) | (crimes_cleaned.Date.dt.time < dt.time(5))  # 8pm to  5am

#adding a column for the time of day of the crime
crimes_cleaned['Time of Day'] = ""
crimes_cleaned.loc[morning_idx,   'Time of Day'] = "Morning"
crimes_cleaned.loc[afternoon_idx, 'Time of Day'] = "Afternoon"
crimes_cleaned.loc[evening_idx,   'Time of Day'] = "Evening"
crimes_cleaned.loc[night_idx,     'Time of Day'] = "Night"

In [15]:
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Year,Location,Location Description,Community Area,Region,Month,Day of Week,Time of Day
0,2017-10-08 03:00:00,Crim Sexual Assault,False,False,2017,,Residence,Washington Heights,Far Southwest Side,10,Sunday,Night
1,2017-03-28 14:00:00,Burglary,False,False,2017,,Other,Ashburn,Far Southwest Side,3,Tuesday,Afternoon
2,2017-09-09 20:17:00,Theft,False,False,2017,,Residence,Woodlawn,South Side,9,Saturday,Night
3,2017-08-26 10:00:00,Crim Sexual Assault,False,False,2017,,Hotel/Motel,Loop,Central,8,Saturday,Morning
4,2013-02-10 00:00:00,Crim Sexual Assault,False,False,2013,,Residence,Greater Grand Crossing,South Side,2,Sunday,Night
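
The time-of-day labels can also be assigned in one pass. The sketch below is an alternative to the four boolean masks, not the method used in this notebook: it works on the hour of each timestamp and uses `numpy.select`, following the boundaries stated in the comments (5am, 12pm, 5pm, 8pm). The sample timestamps are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps, one or more per time-of-day bucket
when = pd.Series(pd.to_datetime([
    "2017-10-08 03:00:00", "2017-08-26 09:00:00", "2017-03-28 14:00:00",
    "2017-06-01 17:30:00", "2017-09-09 20:17:00",
]))
hour = when.dt.hour

conditions = [
    (hour >= 5) & (hour < 12),   # 5am to 12pm
    (hour >= 12) & (hour < 17),  # 12pm to 5pm
    (hour >= 17) & (hour < 20),  # 5pm to 8pm
]
labels = ["Morning", "Afternoon", "Evening"]

# Anything not matched above (8pm to 5am) falls through to the default
time_of_day = pd.Series(np.select(conditions, labels, default="Night"),
                        index=when.index)
print(time_of_day.tolist())
```

Because the buckets are defined once and do not overlap, no later assignment can overwrite an earlier one.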


<a id="saving"></a>

<a id="nulls"></a>

## Dropping Nulls from the Dataset
Because relatively few rows contain null values (fewer than 1% of the dataset), I drop every row that has a null in any column.

In [16]:
print("Number of Rows: {:,d}".format(crimes_cleaned.shape[0]))
print()
for col in crimes_cleaned.columns:
    print("{} has {:,d} nulls".format(col,crimes_cleaned[col].isna().sum()))

Number of Rows: 3,007,770

Date has 0 nulls
Primary Type has 0 nulls
Arrest has 0 nulls
Domestic has 0 nulls
Year has 0 nulls
Location has 22,340 nulls
Location Description has 6,199 nulls
Community Area has 431 nulls
Region has 431 nulls
Month has 0 nulls
Day of Week has 0 nulls
Time of Day has 0 nulls


In [19]:
crimes_cleaned.dropna(inplace=True,axis=0)

print("Number of Rows: {:,d}".format(crimes_cleaned.shape[0]))
print()
for col in crimes_cleaned.columns:
    print("{} has {:,d} nulls".format(col,crimes_cleaned[col].isna().sum()))

Number of Rows: 2,980,876

Date has 0 nulls
Primary Type has 0 nulls
Arrest has 0 nulls
Domestic has 0 nulls
Year has 0 nulls
Location has 0 nulls
Location Description has 0 nulls
Community Area has 0 nulls
Region has 0 nulls
Month has 0 nulls
Day of Week has 0 nulls
Time of Day has 0 nulls
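
The per-column loop and the before/after row counts can be condensed. The sketch below, on a hypothetical four-row frame with invented coordinate strings, counts nulls per column with `isna().sum()` and reports the share of rows that `dropna` removes.

```python
import pandas as pd

# Hypothetical miniature frame with a few missing values
demo = pd.DataFrame({
    "Primary Type": ["Theft", "Burglary", "Theft", "Assault"],
    "Location": ["(41.88, -87.63)", None, "(41.75, -87.60)", None],
    "Region": ["Central", "South Side", None, "Central"],
})

null_counts = demo.isna().sum()  # nulls per column, no explicit loop
rows_before = len(demo)
cleaned = demo.dropna(axis=0)    # returns a new frame; demo is unchanged
share_dropped = 1 - len(cleaned) / rows_before

print(null_counts[null_counts > 0])
print("Dropped {:.1%} of rows".format(share_dropped))
```

Applied to the counts above, the same arithmetic gives (3,007,770 - 2,980,876) / 3,007,770, or roughly 0.9% of rows dropped.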


## Saving the Cleaned Dataset

In [20]:
crimes_cleaned.to_csv("crimes_cleaned.csv")
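
By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below round-trips a hypothetical two-row frame through an in-memory buffer to show the difference that `index=False` makes, and uses `parse_dates` to restore the datetime type on reload. Whether to keep the index in `crimes_cleaned.csv` is a judgment call, since the file name and its downstream use are the author's.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for crimes_cleaned
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2017-10-08 03:00:00", "2017-03-28 14:00:00"]),
    "Primary Type": ["Theft", "Burglary"],
})

# With the default index=True, the row index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False omits it; parse_dates restores the datetime dtype on reload
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index, parse_dates=["Date"])
print(reloaded.columns.tolist())
```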