# Crime Data Preprocessing
As of 3/10/2020, the [dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) provided by the City of Chicago on crime (excluding murders) contains over 7 million rows and 22 columns. A separate [dataset](https://data.cityofchicago.org/Public-Safety/Homicides/k9xv-yxzs) contains the homicides, about 10,000, from the last 20 years. To facilitate early exploration of the data and focus on more recent, relevant trends, I removed crimes from before 2010, unneeded columns, and rows with nulls. The reduced dataset contains just under 3 million rows.  
  
After reducing the size of the dataset, I cleaned up the text columns by manually mapping the values of each column to a smaller set of categories in Excel. I then mapped the Community Area IDs to their names and regions (e.g. Community Area 8 maps to Near North Side and Central) based on [this](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) Wikipedia page, and added some categorical columns based on the date of the crime.  

I also added the Community Area population sizes from the [2010 Census](https://www.chicago.gov/content/dam/city/depts/zlup/Zoning_Main_Page/Publications/Census_2010_Community_Area_Profiles/Census_2010_and_2000_CA_Populations.pdf) to allow for an approximate Crimes/Homicides per Capita calculation. Unfortunately, this data isn't provided year over year, and as of 3/30/2020 the 2020 Census isn't available, which is why I needed to use the 2010 population sizes for every year.
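
As a rough illustration of the per-capita idea, the sketch below divides a crime count by the fixed 2010 population and scales it to a rate per 1,000 residents. The two populations are the 2010 Census values used in this notebook; the crime counts and the name of the rate column are made up for the example.

```python
import pandas as pd

# Populations are 2010 Census values for two community areas;
# the crime counts are hypothetical numbers used only for illustration.
toy = pd.DataFrame({
    "Community Area": ["Near North Side", "Loop"],
    "Number of Crimes": [8000, 6000],
    "2010 Population": [80484, 29283],
})

# Approximate rate: crimes per 1,000 residents, using the 2010 population
toy["Crimes per 1,000 Residents"] = (
    toy["Number of Crimes"] / toy["2010 Population"] * 1000
)
print(toy)
```

Because the population is held at its 2010 value, a rate computed this way drifts from the true rate in areas whose population changed a lot over the decade.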

## Table of Contents
- [Dataset Description](#original_desc)  
- [Reading in the Original Datasets and Combining Them](#read_in)  
- [Removing Crimes from Before 2010](#remove_10_years)  
- [Removing Unneeded Columns](#unneeded_columns)
- [Cleaning Up Text Columns](#clean_up)
- [Pulling in Community Area Names and Regions](#community_areas)
- [Creating New Datetime Columns based on the Date of the Crime](#date_columns)
- [Dropping Nulls from the Dataset](#nulls)
- [Saving the Cleaned Dataset](#saving)
- [Creating an Aggregated Dataset for Prediction](#pred_data)

<a id="original_desc">  

## Dataset Description
These are the original column descriptions from the City of Chicago [website](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2).  

| Column Name  | Column Description |  
| :-:    | :-: |  
| ID           | Unique identifier for the record |
| Case Number  | The Chicago Police Department Records Division Number |
| Date         | Date when the incident occurred (sometimes an estimate) |
| Block        | The partially redacted address where the incident occurred, placing it on the same block as the actual address |
| IUCR         | Illinois Uniform Crime Reporting code |
| Primary Type | The primary description of the IUCR code |
| Description  | The secondary description of the IUCR code, a subcategory of the primary description |
| Location Description | Description of the location where the incident occurred |
| Arrest | Indicates whether an arrest was made |
| Domestic | Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act |
| Beat | Indicates the beat where the incident occurred. A beat is the smallest police geographic area. 3 to 5 beats make up a police sector, and 3 sectors make up a police district |
| District | Indicates the police district where the incident occurred |
| Ward | The ward (City Council district) where the incident occurred |
| Community Area | Indicates the community area where the incident occurred (Chicago has 77 community areas) |
| FBI Code | Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS) |
| X Coordinate | The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Y Coordinate | The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Year | The year the incident occurred |
| Updated On | Date and time the record was last updated |
| Latitude | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Location | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block |

<a id="read_in">

## Reading in the Original Datasets and Combining Them

In [1]:
import pandas as pd
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

In [2]:
crimes_general = pd.read_csv("Data/crimes_general.csv", parse_dates=['Date'])
crimes_murders = pd.read_csv("Data/crimes_murders.csv", parse_dates=['Date'])

In [3]:
crimes = pd.concat([crimes_general, crimes_murders], ignore_index=True)

In [4]:
print("There were {:,d} General Crimes from {} to {}".format(crimes_general.shape[0],
                                                           crimes_general.Date.min(),
                                                           crimes_general.Date.max()))
print()
print("There were {:,d} Homicides from {} to {}".format(crimes_murders.shape[0],
                                                           crimes_murders.Date.min(),
                                                           crimes_murders.Date.max()))
print()
print("There were {:,d} Total Crimes from {} to {}".format(crimes.shape[0],
                                                           crimes.Date.min(),
                                                           crimes.Date.max()))

There were 7,084,356 General Crimes from 2001-01-01 00:00:00 to 2020-03-02 23:59:00

There were 10,133 Homicides from 2001-01-01 10:40:00 to 2020-03-10 16:31:00

There were 7,094,489 Total Crimes from 2001-01-01 00:00:00 to 2020-03-10 16:31:00
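
The combining step can be sketched on two tiny, made-up frames. This assumes, as the cell above does, that both files share the same column names; any column present in only one frame would be filled with NaN for the other frame's rows.

```python
import pandas as pd

# Hypothetical stand-ins for the general crimes and homicides files
general = pd.DataFrame({"Primary Type": ["THEFT", "BATTERY"], "Year": [2019, 2020]})
murders = pd.DataFrame({"Primary Type": ["HOMICIDE"], "Year": [2020]})

# ignore_index=True discards the original row labels and builds a fresh 0..n-1 index,
# so rows from the two files never share an index label
combined = pd.concat([general, murders], ignore_index=True)

print(len(general), len(murders), len(combined))
print(combined)
```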


<a id="remove_10_years">

## Removing Crimes from Before 2010

In [5]:
#removing crimes from before 2010
crimes_cleaned = crimes[crimes.Year >= 2010].copy()

#dropping March 2020 crimes as there isn't a full month of data
#(both conditions are built from crimes_cleaned so the mask shares its index)
march_2020_idx = (crimes_cleaned.Date.dt.month == 3) & (crimes_cleaned.Year == 2020)
crimes_cleaned.drop(crimes_cleaned[march_2020_idx].index, axis=0, inplace=True)

#displaying the results
print("There were {:,d} Crimes from {} to {}".format(crimes_cleaned.shape[0],
                                                     crimes_cleaned.Date.min(),
                                                     crimes_cleaned.Date.max()))

There were 3,011,993 Crimes from 2010-01-01 00:00:00 to 2020-02-29 23:59:00

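
An alternative way to express the same window is a single half-open range on the `Date` column. The sketch below uses made-up timestamps and assumes the `Year` column always agrees with the year in `Date`; the names `start` and `end` are illustrative.

```python
import pandas as pd

# Hypothetical timestamps straddling both ends of the window
toy = pd.DataFrame({"Date": pd.to_datetime([
    "2009-12-31 23:59", "2010-01-01 00:00",
    "2020-02-29 23:59", "2020-03-02 08:00",
])})

start = pd.Timestamp("2010-01-01")   # inclusive lower bound
end = pd.Timestamp("2020-03-01")     # exclusive upper bound, drops the partial March 2020

recent = toy[(toy["Date"] >= start) & (toy["Date"] < end)].copy()
print(recent)
```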

<a id="unneeded_columns">

## Removing Unneeded Columns

In [6]:
#creating a list of the columns to drop
drop_cols = ['ID','Case Number', 'Block', 'Description', 
             'Beat', 'District', 'Ward', 'IUCR', 'FBI Code', 
             'X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude', 
             'Updated On']

#dropping the columns
crimes_cleaned.drop(labels=drop_cols, axis=1, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Location Description,Arrest,Domestic,Community Area,Year,Location
1,2017-10-08 03:00:00,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,73.0,2017,
2,2017-03-28 14:00:00,BURGLARY,OTHER,False,False,70.0,2017,
3,2017-09-09 20:17:00,THEFT,RESIDENCE,False,False,42.0,2017,
4,2017-08-26 10:00:00,CRIM SEXUAL ASSAULT,HOTEL/MOTEL,False,False,32.0,2017,
5,2013-02-10 00:00:00,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,69.0,2013,


<a id="clean_up">

## Cleaning Up Text Columns

### Primary Type

In [7]:
#original values
crimes_cleaned['Primary Type'].unique()

array(['CRIM SEXUAL ASSAULT', 'BURGLARY', 'THEFT',
       'OFFENSE INVOLVING CHILDREN', 'DECEPTIVE PRACTICE',
       'CRIMINAL DAMAGE', 'OTHER OFFENSE', 'SEX OFFENSE', 'ASSAULT',
       'NARCOTICS', 'ROBBERY', 'CRIMINAL TRESPASS', 'WEAPONS VIOLATION',
       'MOTOR VEHICLE THEFT', 'BATTERY', 'OBSCENITY',
       'LIQUOR LAW VIOLATION', 'PROSTITUTION', 'NON-CRIMINAL',
       'PUBLIC PEACE VIOLATION', 'INTIMIDATION', 'ARSON', 'STALKING',
       'INTERFERENCE WITH PUBLIC OFFICER',
       'CONCEALED CARRY LICENSE VIOLATION', 'KIDNAPPING',
       'HUMAN TRAFFICKING', 'HOMICIDE', 'GAMBLING', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'NON - CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)'], dtype=object)

In [8]:
#list of other values to group under "NON-CRIMINAL"
non_criminal_list = ['NON - CRIMINAL','NON-CRIMINAL (SUBJECT SPECIFIED)']

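#replacing the other values with "NON-CRIMINAL"
#(assigning the result back rather than calling replace with inplace=True on a column selection)
crimes_cleaned['Primary Type'] = crimes_cleaned['Primary Type'].replace(to_replace=non_criminal_list,
                                                                        value='NON-CRIMINAL',
                                                                        regex=False)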

#Converting values to title case
crimes_cleaned['Primary Type'] = crimes_cleaned['Primary Type'].str.title()

#displaying the results
crimes_cleaned['Primary Type'].unique()

array(['Crim Sexual Assault', 'Burglary', 'Theft',
       'Offense Involving Children', 'Deceptive Practice',
       'Criminal Damage', 'Other Offense', 'Sex Offense', 'Assault',
       'Narcotics', 'Robbery', 'Criminal Trespass', 'Weapons Violation',
       'Motor Vehicle Theft', 'Battery', 'Obscenity',
       'Liquor Law Violation', 'Prostitution', 'Non-Criminal',
       'Public Peace Violation', 'Intimidation', 'Arson', 'Stalking',
       'Interference With Public Officer',
       'Concealed Carry License Violation', 'Kidnapping',
       'Human Trafficking', 'Homicide', 'Gambling', 'Public Indecency',
       'Other Narcotic Violation'], dtype=object)

### Location Description

In [9]:
#original values of Location Description
crimes_cleaned['Location Description'].unique()

array(['RESIDENCE', 'OTHER', 'HOTEL/MOTEL', nan, 'APARTMENT', 'SIDEWALK',
       'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE-GARAGE',
       'HOSPITAL BUILDING/GROUNDS', 'BANK', 'RESTAURANT',
       'SCHOOL, PUBLIC, BUILDING', 'STREET',
       'AIRPORT BUILDING NON-TERMINAL - SECURE AREA',
       'RESIDENCE PORCH/HALLWAY', 'RESIDENTIAL YARD (FRONT/BACK)',
       'BAR OR TAVERN', 'AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA',
       'DEPARTMENT STORE', 'ALLEY', 'VEHICLE NON-COMMERCIAL',
       'GOVERNMENT BUILDING/PROPERTY', 'AUTO / BOAT / RV DEALERSHIP',
       'VACANT LOT/LAND', 'WAREHOUSE', 'POOL ROOM',
       'COMMERCIAL / BUSINESS OFFICE', 'POLICE FACILITY/VEH PARKING LOT',
       'PARK PROPERTY', 'MEDICAL/DENTAL OFFICE', 'BOAT/WATERCRAFT',
       'GROCERY FOOD STORE', 'CTA STATION', 'CONVENIENCE STORE',
       'ATHLETIC CLUB', 'SMALL RETAIL STORE', 'AIRPORT/AIRCRAFT',
       'ANIMAL HOSPITAL', 'ATM (AUTOMATIC TELLER MACHINE)', 'CTA BUS',
       'CURRENCY EXCHANGE', 'DRIVEWAY -

In [10]:
#reading in manual mapping of unique original values to new ones
location = pd.read_csv('Data/Location.csv')
location.head()

Unnamed: 0,Original Value,New Value
0,RESIDENCE,Residence
1,OTHER,Other
2,HOTEL/MOTEL,Hotel/Motel
3,APARTMENT,Residence
4,SIDEWALK,Sidewalk


In [11]:
#merging the crimes_cleaned df with the location df 
crimes_cleaned = crimes_cleaned.merge(location, how='left', 
                                      left_on='Location Description',
                                      right_on='Original Value')

#removing the original value columns
crimes_cleaned.drop(labels=['Location Description','Original Value'],axis=1,
                    inplace=True)

#renaming the new value column as Location Description
crimes_cleaned.rename(columns={'New Value':'Location Description'}, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Community Area,Year,Location,Location Description
0,2017-10-08 03:00:00,Crim Sexual Assault,False,False,73.0,2017,,Residence
1,2017-03-28 14:00:00,Burglary,False,False,70.0,2017,,Other
2,2017-09-09 20:17:00,Theft,False,False,42.0,2017,,Residence
3,2017-08-26 10:00:00,Crim Sexual Assault,False,False,32.0,2017,,Hotel/Motel
4,2013-02-10 00:00:00,Crim Sexual Assault,False,False,69.0,2013,,Residence
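
Because the mapping file was built by hand, it can be worth checking that every original value actually found a match. The sketch below is one way to do that on made-up data: `validate="many_to_one"` raises an error if an original value appears twice in the mapping, and `indicator=True` adds a `_merge` column that flags unmatched rows. The value `SPACESHIP` is invented to force a miss.

```python
import pandas as pd

# Hypothetical crime rows; "SPACESHIP" and the missing value have no mapping entry
crimes_toy = pd.DataFrame(
    {"Location Description": ["RESIDENCE", "APARTMENT", "SPACESHIP", None]}
)
mapping = pd.DataFrame({
    "Original Value": ["RESIDENCE", "APARTMENT"],
    "New Value": ["Residence", "Residence"],
})

merged = crimes_toy.merge(
    mapping, how="left",
    left_on="Location Description", right_on="Original Value",
    validate="many_to_one",   # each original value may appear only once in the mapping
    indicator=True,           # adds a "_merge" column: "both" or "left_only"
)

# Rows that found no mapping end up with a null "New Value"
unmatched = merged.loc[merged["_merge"] == "left_only", "Location Description"]
print(unmatched.value_counts(dropna=False))
```

Rows flagged `left_only` are the ones that would later be removed when nulls are dropped.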


<a id="community_areas">

## Pulling in Community Area Names and Regions

In [12]:
#reading in the mapping of the Community Area ID to the Name and Region
#values from Wikipedia
community_areas = pd.read_csv('Data/Community_Areas.csv')
community_areas.head()

Unnamed: 0,ID,Name,Region,2010 Population
0,8,Near North Side,Central,80484
1,32,Loop,Central,29283
2,33,Near South Side,Central,21390
3,5,North Center,North Side,31867
4,6,Lake View,North Side,94368


In [13]:
#merging the community area df with the reduced crimes df
crimes_cleaned = crimes_cleaned.merge(community_areas, how='left', 
                                      left_on='Community Area',
                                      right_on='ID')

#dropping the ID columns
crimes_cleaned.drop(labels=['Community Area','ID'],axis=1,inplace=True)

#renaming the Name column as Community Area
crimes_cleaned.rename(columns={'Name':'Community Area'}, inplace=True)

#displaying the results
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Year,Location,Location Description,Community Area,Region,2010 Population
0,2017-10-08 03:00:00,Crim Sexual Assault,False,False,2017,,Residence,Washington Heights,Far Southwest Side,26493.0
1,2017-03-28 14:00:00,Burglary,False,False,2017,,Other,Ashburn,Far Southwest Side,41081.0
2,2017-09-09 20:17:00,Theft,False,False,2017,,Residence,Woodlawn,South Side,25983.0
3,2017-08-26 10:00:00,Crim Sexual Assault,False,False,2017,,Hotel/Motel,Loop,Central,29283.0
4,2013-02-10 00:00:00,Crim Sexual Assault,False,False,2013,,Residence,Greater Grand Crossing,South Side,32602.0


<a id="date_columns">

## Creating New Datetime Columns based on the Date of the Crime

In [14]:
#adding a column for the month of the crime
crimes_cleaned['Month'] = crimes_cleaned.Date.dt.month

#adding a column for the day of week of the crime
crimes_cleaned['Day of Week'] = crimes_cleaned.Date.dt.day_name()

#creating index vars for time of day (morning, afternoon, evening, and night)
morning_idx   = (crimes_cleaned.Date.dt.time >= dt.time( 5)) & (crimes_cleaned.Date.dt.time < dt.time(12)) # 5am to 12pm
afternoon_idx = (crimes_cleaned.Date.dt.time >= dt.time(12)) & (crimes_cleaned.Date.dt.time < dt.time(17)) #12pm to  5pm
evening_idx   = (crimes_cleaned.Date.dt.time >= dt.time(17)) & (crimes_cleaned.Date.dt.time < dt.time(20)) # 5pm to  8pm
night_idx     = (crimes_cleaned.Date.dt.time >= dt.time(20)) | (crimes_cleaned.Date.dt.time < dt.time( 5)) # 8pm to  5am

#adding a column for the time of day of the crime
crimes_cleaned['Time of Day'] = ""
crimes_cleaned.loc[morning_idx,   'Time of Day'] = "Morning"
crimes_cleaned.loc[afternoon_idx, 'Time of Day'] = "Afternoon"
crimes_cleaned.loc[evening_idx,   'Time of Day'] = "Evening"
crimes_cleaned.loc[night_idx,     'Time of Day'] = "Night"

#creating index vars for season of year (spring, summer, fall, winter)
spring_idx = (crimes_cleaned.Month >=  3) & (crimes_cleaned.Month <  6)
summer_idx = (crimes_cleaned.Month >=  6) & (crimes_cleaned.Month <  9)
fall_idx   = (crimes_cleaned.Month >=  9) & (crimes_cleaned.Month < 12)
winter_idx = (crimes_cleaned.Month >= 12) | (crimes_cleaned.Month <  3)

#adding a column for the season the crime occurred in
crimes_cleaned['Season'] = ""
crimes_cleaned.loc[spring_idx, 'Season'] = "Spring"
crimes_cleaned.loc[summer_idx, 'Season'] = "Summer"
crimes_cleaned.loc[fall_idx,   'Season'] = "Fall"
crimes_cleaned.loc[winter_idx, 'Season'] = "Winter"

In [15]:
crimes_cleaned.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Year,Location,Location Description,Community Area,Region,2010 Population,Month,Day of Week,Time of Day,Season
0,2017-10-08 03:00:00,Crim Sexual Assault,False,False,2017,,Residence,Washington Heights,Far Southwest Side,26493.0,10,Sunday,Night,Fall
1,2017-03-28 14:00:00,Burglary,False,False,2017,,Other,Ashburn,Far Southwest Side,41081.0,3,Tuesday,Afternoon,Spring
2,2017-09-09 20:17:00,Theft,False,False,2017,,Residence,Woodlawn,South Side,25983.0,9,Saturday,Night,Fall
3,2017-08-26 10:00:00,Crim Sexual Assault,False,False,2017,,Hotel/Motel,Loop,Central,29283.0,8,Saturday,Morning,Summer
4,2013-02-10 00:00:00,Crim Sexual Assault,False,False,2013,,Residence,Greater Grand Crossing,South Side,32602.0,2,Sunday,Night,Winter
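
The same labelling can be written more compactly. The sketch below is an alternative, not the code used above: it takes the ranges stated in the comments (5am to 12pm, 12pm to 5pm, 5pm to 8pm, 8pm to 5am) and the month-based seasons, and the helper names `time_of_day` and `season` are hypothetical.

```python
import numpy as np
import pandas as pd

def time_of_day(dates: pd.Series) -> pd.Series:
    """Label each timestamp Morning/Afternoon/Evening/Night by its hour."""
    hour = dates.dt.hour.to_numpy()
    conditions = [
        (hour >= 5) & (hour < 12),    # 5am to 12pm
        (hour >= 12) & (hour < 17),   # 12pm to 5pm
        (hour >= 17) & (hour < 20),   # 5pm to 8pm
    ]
    labels = ["Morning", "Afternoon", "Evening"]
    # Anything not matched above (8pm to 5am) falls through to "Night"
    return pd.Series(np.select(conditions, labels, default="Night"), index=dates.index)

def season(dates: pd.Series) -> pd.Series:
    """Map the month number to a season (Dec-Feb is Winter, and so on)."""
    month_to_season = {
        12: "Winter", 1: "Winter", 2: "Winter",
        3: "Spring", 4: "Spring", 5: "Spring",
        6: "Summer", 7: "Summer", 8: "Summer",
        9: "Fall", 10: "Fall", 11: "Fall",
    }
    return dates.dt.month.map(month_to_season)

# Four timestamps taken from the rows displayed above
sample = pd.Series(pd.to_datetime([
    "2017-10-08 03:00", "2017-03-28 14:00", "2017-09-09 20:17", "2017-08-26 10:00",
]))
labels = time_of_day(sample)
seasons = season(sample)
print(pd.DataFrame({"Date": sample, "Time of Day": labels, "Season": seasons}))
```

Since the conditions are mutually exclusive and every boundary falls on a whole hour, comparing on the hour gives the same labels as comparing full times against those ranges.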


<a id="nulls">

## Dropping Nulls from the Dataset
Rows with null values make up less than 1% of the dataset (26,848 of 3,011,993 rows), so I dropped every row containing a null.

In [16]:
print("Number of Rows: {:,d}".format(crimes_cleaned.shape[0]))
print()
for col in crimes_cleaned.columns:
    print("{} has {:,d} nulls".format(col,crimes_cleaned[col].isna().sum()))

Number of Rows: 3,011,993

Date has 0 nulls
Primary Type has 0 nulls
Arrest has 0 nulls
Domestic has 0 nulls
Year has 0 nulls
Location has 22,299 nulls
Location Description has 6,194 nulls
Community Area has 431 nulls
Region has 431 nulls
2010 Population has 431 nulls
Month has 0 nulls
Day of Week has 0 nulls
Time of Day has 0 nulls
Season has 0 nulls


In [17]:
crimes_cleaned.dropna(inplace=True,axis=0)

print("Number of Rows: {:,d}".format(crimes_cleaned.shape[0]))
print()
for col in crimes_cleaned.columns:
    print("{} has {:,d} nulls".format(col,crimes_cleaned[col].isna().sum()))

Number of Rows: 2,985,145

Date has 0 nulls
Primary Type has 0 nulls
Arrest has 0 nulls
Domestic has 0 nulls
Year has 0 nulls
Location has 0 nulls
Location Description has 0 nulls
Community Area has 0 nulls
Region has 0 nulls
2010 Population has 0 nulls
Month has 0 nulls
Day of Week has 0 nulls
Time of Day has 0 nulls
Season has 0 nulls


<a id="saving">

## Saving the Cleaned Dataset

In [18]:
crimes_cleaned.to_csv("Data/crimes_cleaned.csv", index=False)
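
CSV files do not store column types, so the `Date` column comes back as plain text unless it is parsed again on load. The sketch below demonstrates this with an in-memory buffer and a one-row, made-up frame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-10-08 03:00:00"]),
    "Primary Type": ["Theft"],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the column is read back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime type is restored
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, parsed["Date"].dtype)
```

Any later notebook that reads `Data/crimes_cleaned.csv` would need the same `parse_dates=['Date']` argument to use the `.dt` accessors.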

<a id="pred_data">

## Creating an Aggregated Dataset for Prediction
My goal is to predict the monthly number of crimes based on the primary type of the crime, the description of the location, and the region and community area where the crime occurred, so I aggregated the cleaned dataset based on these columns and retrieved the count of crimes for each group. 

In [19]:
#aggregating the crimes by features of interest
crimes_agg = crimes_cleaned.groupby(['Year','Month', 'Primary Type', 'Location Description', 'Region', 'Season',
                                     'Community Area']).count().reset_index()

#dropping extra columns, including '2010 Population', which after count() holds
#a row count rather than a population
crimes_agg.drop(labels=['Location', 'Day of Week', 'Time of Day', 'Arrest', 'Domestic',
                        '2010 Population'], 
                axis=1, inplace=True)

#renaming the aggregated column
crimes_agg.rename({'Date':'Number of Crimes'}, axis=1, inplace=True)

#displaying the results
crimes_agg.head()

Unnamed: 0,Year,Month,Primary Type,Location Description,Region,Season,Community Area,Number of Crimes
0,2010,1,Arson,Abandoned Building,South Side,Winter,Greater Grand Crossing,1
1,2010,1,Arson,Abandoned Building,Southwest Side,Winter,West Englewood,1
2,2010,1,Arson,CTA Station/Platform/Stop/Other,Central,Winter,Loop,1
3,2010,1,Arson,Church,Far North Side,Winter,Forest Glen,1
4,2010,1,Arson,Gas Station,Southwest Side,Winter,Englewood,1


In [20]:
print("Number of rows: {:,d}".format(crimes_agg.shape[0]))
print()
print("Min number of monthly crimes per group: {}".format(crimes_agg['Number of Crimes'].min()))
print("Max number of monthly crimes per group: {}".format(crimes_agg['Number of Crimes'].max()))
print("Mean number of monthly crimes per group: {:.2f}".format(crimes_agg['Number of Crimes'].mean()))
print("Standard Deviation of monthly crimes per group: {:.2f}".format(crimes_agg['Number of Crimes'].std()))

Number of rows: 718,406

Min number of monthly crimes per group: 1
Max number of monthly crimes per group: 277
Mean number of monthly crimes per group: 4.16
Standard Deviation of monthly crimes per group: 8.57


In [21]:
crimes_agg.to_csv('Data/crimes_aggregated.csv', index=False)