# Crime Data Preprocessing
As of 3/10/2020, the [dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) provided by the City of Chicago on crime (excluding murders) contains over 7 million rows and 22 columns. To facilitate early exploration of the data and focus on more recent, relevant trends, I will reduce the dataset size before beginning my analysis. I will do this by removing redundant and unneeded columns and keeping only crimes from the last 10 years. 
  
I will also map the Community Area IDs to their names and regions (e.g. Community Area 8 maps to Near North Side in the Central region) based on [this](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) Wikipedia page. 

## Dataset Description
These are the original column descriptions from the City of Chicago 
[website](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2).  

| Column Name  | Column Description |  
| :-:    | :-: |  
| ID           | Unique identifier for the record |
| Case Number  | The Chicago Police Department Records Division Number |
| Date         | Date when the incident occurred (sometimes an estimate) |
| Block        | The partially redacted address where the incident occurred, placing it on the same block as the actual address |
| IUCR         | Illinois Uniform Crime Reporting code |
| Primary Type | The primary description of the IUCR code |
| Description  | The secondary description of the IUCR code, a subcategory of the primary description |
| Location Description | Description of the location where the incident occurred |
| Arrest | Indicates whether an arrest was made |
| Domestic | Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act |
| Beat | Indicates the beat where the incident occurred. A beat is the smallest police geographic area. 3 to 5 beats make up a police sector, and 3 sectors make up a police district |
| District | Indicates the police district where the incident occurred |
| Ward | The ward (City Council district) where the incident occurred |
| Community Area | Indicates the community area where the incident occurred (Chicago has 77 community areas) |
| FBI Code | Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS) |
| X Coordinate | The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Y Coordinate | The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Year | The year the incident occurred |
| Updated On | Date and time the record was last updated |
| Latitude | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Location | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block |

## Reading in the Original Dataset

In [1]:
import pandas as pd

In [8]:
crimes = pd.read_csv('Crimes_Original.csv')

In [9]:
print("Number of Crimes from {} to {}: {:,d}".format(crimes.Date.min(),
                                                     crimes.Date.max(),
                                                     crimes.shape[0]))

Number of Crimes from 01/01/2001 01:00:00 AM to 12/31/2019 12:58:00 AM: 7,084,356
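
Because `Date` is read in as text, `min()` and `max()` compare strings character by character rather than chronologically, so the endpoints printed above may not be the true earliest and latest timestamps. Here is a minimal sketch of the difference on a made-up three-row frame (the sample values are illustrative, not taken from the dataset), assuming the `MM/DD/YYYY hh:mm:ss AM/PM` format shown in the output:

```python
import pandas as pd

# Toy rows in the same text format as the Date column (illustrative values)
sample = pd.DataFrame({'Date': ['12/31/2019 12:58:00 AM',
                                '01/01/2001 01:00:00 AM',
                                '12/31/2019 11:30:00 PM']})

# String comparison: '12:58:00 AM' sorts after '11:30:00 PM'
string_max = sample['Date'].max()

# Parsing with an explicit format gives chronological ordering
parsed = pd.to_datetime(sample['Date'], format='%m/%d/%Y %I:%M:%S %p')
true_max = parsed.max()

print(string_max)
print(true_max)
```

On the full dataset, parsing over 7 million rows takes noticeably longer than the string comparison, so this is a trade-off between speed and accuracy.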


## Removing Crimes from More Than 10 Years Ago

In [70]:
#removing crimes from over 10 years ago
crimes_reduced = crimes[crimes.Year >= 2010].copy()

#displaying the results
crimes_reduced.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1,11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,False,False,...,21.0,73.0,2,,,2017,02/11/2018 03:57:41 PM,,,
2,11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,620,BURGLARY,UNLAWFUL ENTRY,OTHER,False,False,...,18.0,70.0,5,,,2017,02/11/2018 03:57:41 PM,,,
3,11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,810,THEFT,OVER $500,RESIDENCE,False,False,...,20.0,42.0,6,,,2017,02/11/2018 03:57:41 PM,,,
4,11227634,JB147599,08/26/2017 10:00:00 AM,001XX W RANDOLPH ST,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,HOTEL/MOTEL,False,False,...,42.0,32.0,2,,,2017,02/11/2018 03:57:41 PM,,,
5,11227517,JB138481,02/10/2013 12:00:00 AM,071XX S LAFAYETTE AVE,266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,False,...,6.0,69.0,2,,,2013,02/11/2018 03:57:41 PM,,,
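
The hard-coded 2010 cutoff yields ten calendar years because the data ends in 2019. Here is a minimal sketch of deriving the cutoff from the data instead, assuming "last 10 years" means the ten most recent calendar years present. The names `toy`, `n_years`, and `cutoff` are illustrative only:

```python
import pandas as pd

# Toy frame standing in for the crimes data (illustrative values)
toy = pd.DataFrame({'Year': [2008, 2009, 2010, 2015, 2019]})

# Keep the n most recent calendar years present in the data
n_years = 10
cutoff = toy['Year'].max() - n_years + 1  # 2019 - 10 + 1 = 2010

recent = toy[toy['Year'] >= cutoff].copy()
print(cutoff, recent['Year'].tolist())
```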


## Removing Unneeded Columns

In [71]:
#creating a list of the columns to drop
drop_cols = ['ID','Case Number', 'Block', 'Description', 
             'Beat', 'District', 'Ward', 'IUCR', 'FBI Code', 
             'X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude', 
             'Updated On']

#dropping the columns
crimes_reduced.drop(labels=drop_cols, axis=1, inplace=True)

#displaying the results
crimes_reduced.head()

Unnamed: 0,Date,Primary Type,Location Description,Arrest,Domestic,Community Area,Year,Location
1,10/08/2017 03:00:00 AM,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,73.0,2017,
2,03/28/2017 02:00:00 PM,BURGLARY,OTHER,False,False,70.0,2017,
3,09/09/2017 08:17:00 PM,THEFT,RESIDENCE,False,False,42.0,2017,
4,08/26/2017 10:00:00 AM,CRIM SEXUAL ASSAULT,HOTEL/MOTEL,False,False,32.0,2017,
5,02/10/2013 12:00:00 AM,CRIM SEXUAL ASSAULT,RESIDENCE,False,False,69.0,2013,
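
An alternative to dropping columns after loading is to pass `usecols` to `pd.read_csv`, so the unneeded columns are never read into memory. This sketch assumes the set of columns to keep is known up front. The CSV text and `keep_cols` below are illustrative stand-ins, not the real file:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for Crimes_Original.csv (illustrative rows)
csv_text = io.StringIO(
    "ID,Date,Primary Type,Beat,Year\n"
    "1,10/08/2017 03:00:00 AM,THEFT,2222,2017\n"
    "2,03/28/2017 02:00:00 PM,BURGLARY,835,2017\n"
)

# Only the listed columns are parsed; ID and Beat are skipped
keep_cols = ['Date', 'Primary Type', 'Year']
subset = pd.read_csv(csv_text, usecols=keep_cols)

print(subset.columns.tolist())
```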


## Cleaning Up Text Columns

### Primary Type

In [72]:
#original values
crimes_reduced['Primary Type'].unique()

array(['CRIM SEXUAL ASSAULT', 'BURGLARY', 'THEFT',
       'OFFENSE INVOLVING CHILDREN', 'DECEPTIVE PRACTICE',
       'CRIMINAL DAMAGE', 'OTHER OFFENSE', 'SEX OFFENSE', 'ASSAULT',
       'NARCOTICS', 'ROBBERY', 'CRIMINAL TRESPASS', 'WEAPONS VIOLATION',
       'MOTOR VEHICLE THEFT', 'BATTERY', 'OBSCENITY',
       'LIQUOR LAW VIOLATION', 'PROSTITUTION', 'NON-CRIMINAL',
       'PUBLIC PEACE VIOLATION', 'INTIMIDATION', 'ARSON', 'STALKING',
       'INTERFERENCE WITH PUBLIC OFFICER',
       'CONCEALED CARRY LICENSE VIOLATION', 'KIDNAPPING',
       'HUMAN TRAFFICKING', 'HOMICIDE', 'GAMBLING', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'NON - CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)'], dtype=object)

In [73]:
#list of other values to group under "NON-CRIMINAL"
non_criminal_list = ['NON - CRIMINAL','NON-CRIMINAL (SUBJECT SPECIFIED)']

#replacing the other values with "NON-CRIMINAL"
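#(assigning the result back instead of using inplace=True on a selected
#column, which newer pandas versions warn about)
crimes_reduced['Primary Type'] = crimes_reduced['Primary Type'].replace(
    to_replace=non_criminal_list, value='NON-CRIMINAL', regex=False)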

#Converting values to title case
crimes_reduced['Primary Type'] = crimes_reduced['Primary Type'].str.title()

#displaying the results
crimes_reduced['Primary Type'].unique()

array(['Crim Sexual Assault', 'Burglary', 'Theft',
       'Offense Involving Children', 'Deceptive Practice',
       'Criminal Damage', 'Other Offense', 'Sex Offense', 'Assault',
       'Narcotics', 'Robbery', 'Criminal Trespass', 'Weapons Violation',
       'Motor Vehicle Theft', 'Battery', 'Obscenity',
       'Liquor Law Violation', 'Prostitution', 'Non-Criminal',
       'Public Peace Violation', 'Intimidation', 'Arson', 'Stalking',
       'Interference With Public Officer',
       'Concealed Carry License Violation', 'Kidnapping',
       'Human Trafficking', 'Homicide', 'Gambling', 'Public Indecency',
       'Other Narcotic Violation'], dtype=object)

### Location Description

In [74]:
#original values of Location Description
crimes_reduced['Location Description'].unique()

array(['RESIDENCE', 'OTHER', 'HOTEL/MOTEL', nan, 'APARTMENT', 'SIDEWALK',
       'PARKING LOT/GARAGE(NON.RESID.)', 'RESIDENCE-GARAGE',
       'HOSPITAL BUILDING/GROUNDS', 'BANK', 'RESTAURANT',
       'SCHOOL, PUBLIC, BUILDING', 'STREET',
       'AIRPORT BUILDING NON-TERMINAL - SECURE AREA',
       'RESIDENCE PORCH/HALLWAY', 'RESIDENTIAL YARD (FRONT/BACK)',
       'BAR OR TAVERN', 'AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA',
       'DEPARTMENT STORE', 'ALLEY', 'VEHICLE NON-COMMERCIAL',
       'GOVERNMENT BUILDING/PROPERTY', 'AUTO / BOAT / RV DEALERSHIP',
       'VACANT LOT/LAND', 'WAREHOUSE', 'POOL ROOM',
       'COMMERCIAL / BUSINESS OFFICE', 'POLICE FACILITY/VEH PARKING LOT',
       'PARK PROPERTY', 'MEDICAL/DENTAL OFFICE', 'BOAT/WATERCRAFT',
       'GROCERY FOOD STORE', 'CTA STATION', 'CONVENIENCE STORE',
       'ATHLETIC CLUB', 'SMALL RETAIL STORE', 'AIRPORT/AIRCRAFT',
       'ANIMAL HOSPITAL', 'ATM (AUTOMATIC TELLER MACHINE)', 'CTA BUS',
       'CURRENCY EXCHANGE', 'DRIVEWAY -

In [75]:
#reading in manual mapping of unique original values to new ones
location = pd.read_csv('Location.csv')
location.head()

Unnamed: 0,Original Value,New Value
0,RESIDENCE,Residence
1,OTHER,Other
2,HOTEL/MOTEL,Hotel/Motel
3,APARTMENT,Residence
4,SIDEWALK,Sidewalk


In [76]:
#merging the crimes_reduced df with the location df 
crimes_reduced = crimes_reduced.merge(location, how='left', 
                                      left_on='Location Description',
                                      right_on='Original Value')

#removing the original value columns
crimes_reduced.drop(labels=['Location Description','Original Value'],axis=1,
                    inplace=True)

#renaming the new value column as Location Description
crimes_reduced.rename(columns={'New Value':'Location Description'}, inplace=True)

#displaying the results
crimes_reduced.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Community Area,Year,Location,Location Description
0,10/08/2017 03:00:00 AM,Crim Sexual Assault,False,False,73.0,2017,,Residence
1,03/28/2017 02:00:00 PM,Burglary,False,False,70.0,2017,,Other
2,09/09/2017 08:17:00 PM,Theft,False,False,42.0,2017,,Residence
3,08/26/2017 10:00:00 AM,Crim Sexual Assault,False,False,32.0,2017,,Hotel/Motel
4,02/10/2013 12:00:00 AM,Crim Sexual Assault,False,False,69.0,2013,,Residence
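
A left merge leaves `New Value` empty for any original value that is missing from the mapping file, including rows where the original value is itself missing. Here is a minimal sketch of checking for this with the `indicator` and `validate` options of `merge`. The toy frames and the `SPACESHIP` value are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for crimes_reduced and the Location.csv mapping
crimes_toy = pd.DataFrame(
    {'Location Description': ['RESIDENCE', 'APARTMENT', 'SPACESHIP', None]})
location_toy = pd.DataFrame({'Original Value': ['RESIDENCE', 'APARTMENT'],
                             'New Value': ['Residence', 'Residence']})

# validate raises if an original value appears twice in the mapping;
# indicator adds a _merge column showing where each row matched
merged = crimes_toy.merge(location_toy, how='left',
                          left_on='Location Description',
                          right_on='Original Value',
                          validate='many_to_one',
                          indicator=True)

# Rows with no match in the mapping
unmapped = merged.loc[merged['_merge'] == 'left_only', 'Location Description']
print(unmapped.tolist())
```

The `validate='many_to_one'` check also guards against the merge silently adding rows when the mapping contains duplicate keys.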


## Pulling in Community Area Names and Regions

In [77]:
#reading in the mapping of the Community Area ID to the Name and Region
#values from Wikipedia
community_areas = pd.read_csv('Community_Areas.csv')
community_areas.head()

Unnamed: 0,ID,Name,Region
0,8,Near North Side,Central
1,32,Loop,Central
2,33,Near South Side,Central
3,5,North Center,North Side
4,6,Lake View,North Side


In [78]:
#merging the community area df with the reduced crimes df
crimes_reduced = crimes_reduced.merge(community_areas, how='left', 
                                      left_on='Community Area',
                                      right_on='ID')

#dropping the ID columns
crimes_reduced.drop(labels=['Community Area','ID'],axis=1,inplace=True)

#renaming the Name column as Community Area
crimes_reduced.rename(columns={'Name':'Community Area'}, inplace=True)

#displaying the results
crimes_reduced.head()

Unnamed: 0,Date,Primary Type,Arrest,Domestic,Year,Location,Location Description,Community Area,Region
0,10/08/2017 03:00:00 AM,Crim Sexual Assault,False,False,2017,,Residence,Washington Heights,Far Southwest Side
1,03/28/2017 02:00:00 PM,Burglary,False,False,2017,,Other,Ashburn,Far Southwest Side
2,09/09/2017 08:17:00 PM,Theft,False,False,2017,,Residence,Woodlawn,South Side
3,08/26/2017 10:00:00 AM,Crim Sexual Assault,False,False,2017,,Hotel/Motel,Loop,Central
4,02/10/2013 12:00:00 AM,Crim Sexual Assault,False,False,2013,,Residence,Greater Grand Crossing,South Side


## Creating New Datetime Columns Based on the Date of the Crime

In [None]:
#adding a column for the month of the crime


#adding a column for the day of week of the crime


#adding a column for the time of day of the crime
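
The cell above is still a placeholder. One possible way to fill it in is sketched below on a toy frame. The column names `Month`, `Day of Week`, and `Time of Day`, and the four six-hour time-of-day bins, are my own assumptions rather than anything fixed by the dataset:

```python
import pandas as pd

# Toy rows in the same text format as the Date column
toy = pd.DataFrame({'Date': ['10/08/2017 03:00:00 AM',
                             '03/28/2017 02:00:00 PM',
                             '09/09/2017 08:17:00 PM']})

# Parse the text dates once, then derive each new column from the result
dates = pd.to_datetime(toy['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Month as a number (1-12) and day of week as a name
toy['Month'] = dates.dt.month
toy['Day of Week'] = dates.dt.day_name()

# Hypothetical bins by hour: [0, 6) Night, [6, 12) Morning,
# [12, 18) Afternoon, [18, 24) Evening
toy['Time of Day'] = pd.cut(dates.dt.hour, bins=[0, 6, 12, 18, 24],
                            right=False,
                            labels=['Night', 'Morning', 'Afternoon', 'Evening'])

print(toy)
```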



## Saving the Reduced Dataset

In [42]:
crimes_reduced.to_csv("Crimes_Reduced.csv")
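
By default, `to_csv` writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Here is a minimal sketch of the difference on a toy frame, using in-memory buffers in place of real files:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Year': [2017, 2013]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

Whether to pass `index=False` here depends on whether later notebooks expect the extra column.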