# Crime Data Preprocessing
As of 3/10/2020, the [dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2) provided by the City of Chicago on crime (excluding murders) contains over 7 million rows and 22 columns. To facilitate early exploration of the data and focus on more recent, relevant trends, I reduce the size of the dataset before beginning my analysis. I do this by removing redundant and unneeded columns and keeping only crimes from the last 10 years (2010 onward).
  
I also mapped the Community Area IDs to their names and groups (e.g., Community Area 8 maps to the name Near North Side and the group Central), based on [this](https://en.wikipedia.org/wiki/Community_areas_in_Chicago) Wikipedia page.
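
As a rough illustration of that mapping, the sketch below joins a small lookup table onto the crime rows. The lookup table and the new column names (`Community Area Name`, `Community Area Group`) are hypothetical; only two of the 77 areas are listed, with area 8 taken from the example above and area 32 (Loop) added for illustration.

```python
import pandas as pd

# Hypothetical lookup table covering just two of the 77 community areas.
# Keys are floats to match how the crime data stores the column (e.g. 8.0).
community_areas = pd.DataFrame({
    "Community Area": [8.0, 32.0],
    "Community Area Name": ["Near North Side", "Loop"],
    "Community Area Group": ["Central", "Central"],
})

# Toy stand-in for the crime data.
sample = pd.DataFrame({"Community Area": [8.0, 32.0, 8.0]})

# A left merge keeps every crime row, even when its area is missing from the lookup.
mapped = sample.merge(community_areas, on="Community Area", how="left")
print(mapped)
```

A left merge is used here so that rows with a missing or unknown community area are kept (with empty name and group) rather than silently dropped.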

## Next Steps:
1. Convert Community Areas/Districts/Wards from numbers to names (whichever variable makes the most sense).
2. Create Time of Day column based on Date column (e.g. Morning, Afternoon, Evening, Night).
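
One possible way to approach the second step is sketched below. The hour cut-offs and the `time_of_day` helper are assumptions made for illustration, not part of the original analysis; the date format string matches the timestamps shown in the dataset (e.g. `10/08/2017 03:00:00 AM`).

```python
import pandas as pd

def time_of_day(hour: int) -> str:
    """Bucket an hour (0-23) into a label. The cut-offs are assumed, not from the source."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"  # 9 PM through 5:59 AM

# Toy stand-in for the Date column, using timestamps in the dataset's format.
sample = pd.DataFrame({"Date": [
    "10/08/2017 03:00:00 AM",
    "03/28/2017 02:00:00 PM",
    "09/09/2017 08:17:00 PM",
    "08/26/2017 10:00:00 AM",
]})

# Parse the text timestamps, then label each row by its hour.
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")
sample["Time of Day"] = parsed.dt.hour.map(time_of_day)
print(sample)
```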

## Dataset Description:
These are the original column descriptions from the City of Chicago [website](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2):

| Column Name  | Column Description |  
| :-:    | :-: |  
| ID           | Unique identifier for the record |
| Case Number  | The Chicago Police Department Records Division Number |
| Date         | Date when the incident occurred (sometimes an estimate) |
| Block        | The partially redacted address where the incident occurred, placing it on the same block as the actual address |
| IUCR         | Illinois Uniform Crime Reporting code |
| Primary Type | The primary description of the IUCR code |
| Description  | The secondary description of the IUCR code, a subcategory of the primary description |
| Location Description | Description of the location where the incident occurred |
| Arrest | Indicates whether an arrest was made |
| Domestic | Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act |
| Beat | Indicates the beat where the incident occurred. A beat is the smallest police geographic area. 3 to 5 beats make up a police sector, and 3 sectors make up a police district |
| District | Indicates the police district where the incident occurred |
| Ward | The ward (City Council district) where the incident occurred |
| Community Area | Indicates the community area where the incident occurred (Chicago has 77 community areas) |
| FBI Code | Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS) |
| X Coordinate | The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Y Coordinate | The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block |
| Year | The year the incident occurred |
| Updated On | Date and time the record was last updated |
| Latitude | The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block |
| Location | The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block |

## Reading in the Original Dataset

In [2]:
import pandas as pd

In [27]:
crimes = pd.read_csv('Crimes_Original.csv')
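
If loading all 7 million rows at once is too heavy, one alternative is to read only the needed columns and filter each chunk as it arrives. This is a sketch under assumptions: the in-memory CSV, its rows, and the names `keep_cols` and `recent` are made up for illustration, and the chunk size would be far larger on the real file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Crimes_Original.csv (rows are invented).
csv_text = io.StringIO(
    "ID,Date,Primary Type,Arrest,Year\n"
    "1,01/05/2009 01:00:00 PM,THEFT,False,2009\n"
    "2,10/08/2017 03:00:00 AM,BURGLARY,True,2017\n"
    "3,06/15/2005 09:30:00 AM,BATTERY,False,2005\n"
    "4,02/10/2013 12:00:00 AM,THEFT,False,2013\n"
)

# Read only the columns of interest, two rows at a time.
keep_cols = ["Date", "Primary Type", "Arrest", "Year"]
chunks = pd.read_csv(csv_text, usecols=keep_cols, chunksize=2)

# Keep the 2010-onward rows from each chunk, then stitch the pieces together.
recent = pd.concat([chunk[chunk["Year"] >= 2010] for chunk in chunks])
print(recent)
```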

In [29]:
# Note: Date is still stored as text here, so min() and max() compare strings, not timestamps
print("Number of Crimes from {} to {}: {:,d}".format(crimes.Date.min(),crimes.Date.max(),crimes.shape[0]))

Number of Crimes from 01/01/2001 01:00:00 AM to 12/31/2019 12:58:00 AM: 7,084,356
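
Because `Date` is read as text, the range printed above is the alphabetical minimum and maximum, which need not be the earliest and latest incidents (for example, `12:58:00 AM` sorts after `11:45:00 PM`). A minimal sketch of getting a chronological range follows; the first two sample values come from the output above, and the other two are hypothetical.

```python
import pandas as pd

sample = pd.Series([
    "12/31/2019 12:58:00 AM",
    "01/01/2001 01:00:00 AM",
    "01/01/2001 12:30:00 AM",  # hypothetical: earlier in the day than 01:00 AM
    "12/31/2019 11:45:00 PM",  # hypothetical: later in the day than 12:58 AM
], name="Date")

# String comparison picks the alphabetical extremes.
text_min, text_max = sample.min(), sample.max()

# Parsing first gives the true earliest and latest timestamps.
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p")
true_min, true_max = parsed.min(), parsed.max()

print("As text:      ", text_min, "to", text_max)
print("As timestamps:", true_min, "to", true_max)
```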


## Removing Crimes from Before 2010

In [44]:
crimes_reduced = crimes[crimes.Year >= 2010].copy()
crimes_reduced.head(10)

| Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 1 | 11227287 | JB147188 | 10/08/2017 03:00:00 AM | 092XX S RACINE AVE | 281 | CRIM SEXUAL ASSAULT | NON-AGGRAVATED | RESIDENCE | False | False | ... | 21.0 | 73.0 | 2 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 2 | 11227583 | JB147595 | 03/28/2017 02:00:00 PM | 026XX W 79TH ST | 620 | BURGLARY | UNLAWFUL ENTRY | OTHER | False | False | ... | 18.0 | 70.0 | 5 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 3 | 11227293 | JB147230 | 09/09/2017 08:17:00 PM | 060XX S EBERHART AVE | 810 | THEFT | OVER $500 | RESIDENCE | False | False | ... | 20.0 | 42.0 | 6 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 4 | 11227634 | JB147599 | 08/26/2017 10:00:00 AM | 001XX W RANDOLPH ST | 281 | CRIM SEXUAL ASSAULT | NON-AGGRAVATED | HOTEL/MOTEL | False | False | ... | 42.0 | 32.0 | 2 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 5 | 11227517 | JB138481 | 02/10/2013 12:00:00 AM | 071XX S LAFAYETTE AVE | 266 | CRIM SEXUAL ASSAULT | PREDATORY | RESIDENCE | False | False | ... | 6.0 | 69.0 | 2 |  |  | 2013 | 02/11/2018 03:57:41 PM |  |  |  |
| 6 | 11227503 | JB146383 | 01/01/2015 12:01:00 AM | 061XX S KILBOURN AVE | 1751 | OFFENSE INVOLVING CHILDREN | CRIM SEX ABUSE BY FAM MEMBER | RESIDENCE | False | True | ... | 13.0 | 65.0 | 17 |  |  | 2015 | 04/12/2019 04:00:15 PM |  |  |  |
| 7 | 11227508 | JB146365 | 01/01/2017 12:01:00 AM | 027XX S WHIPPLE ST | 1754 | OFFENSE INVOLVING CHILDREN | AGG SEX ASSLT OF CHILD FAM MBR | RESIDENCE | False | False | ... | 12.0 | 30.0 | 2 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 8 | 11022695 | JA353568 | 07/17/2017 10:10:00 AM | 021XX W MC LEAN AVE | 810 | THEFT | OVER $500 | RESIDENCE | False | False | ... | 32.0 | 22.0 | 6 |  |  | 2017 | 07/24/2017 03:54:23 PM |  |  |  |
| 9 | 11227633 | JB147500 | 12/28/2017 03:55:00 PM | 011XX S MICHIGAN AVE | 1153 | DECEPTIVE PRACTICE | FINANCIAL IDENTITY THEFT OVER $ 300 |  | False | False | ... | 2.0 | 32.0 | 11 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |
| 10 | 11227586 | JB147613 | 02/10/2017 12:00:00 PM | 089XX S COTTAGE GROVE AVE | 1310 | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | False | False | ... | 8.0 | 44.0 | 14 |  |  | 2017 | 02/11/2018 03:57:41 PM |  |  |  |


## Removing Unneeded Columns

In [45]:
# Identifier, code, coordinate, and bookkeeping columns that are not needed for the analysis
drop_cols = ['ID', 'Case Number', 'IUCR', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Latitude', 'Longitude', 'Updated On']
crimes_reduced = crimes_reduced.drop(columns=drop_cols)
crimes_reduced.head()

| Unnamed: 0 | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | Year | Location |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 1 | 10/08/2017 03:00:00 AM | 092XX S RACINE AVE | CRIM SEXUAL ASSAULT | NON-AGGRAVATED | RESIDENCE | False | False | 2222 | 22.0 | 21.0 | 73.0 | 2017 |  |
| 2 | 03/28/2017 02:00:00 PM | 026XX W 79TH ST | BURGLARY | UNLAWFUL ENTRY | OTHER | False | False | 835 | 8.0 | 18.0 | 70.0 | 2017 |  |
| 3 | 09/09/2017 08:17:00 PM | 060XX S EBERHART AVE | THEFT | OVER $500 | RESIDENCE | False | False | 313 | 3.0 | 20.0 | 42.0 | 2017 |  |
| 4 | 08/26/2017 10:00:00 AM | 001XX W RANDOLPH ST | CRIM SEXUAL ASSAULT | NON-AGGRAVATED | HOTEL/MOTEL | False | False | 122 | 1.0 | 42.0 | 32.0 | 2017 |  |
| 5 | 02/10/2013 12:00:00 AM | 071XX S LAFAYETTE AVE | CRIM SEXUAL ASSAULT | PREDATORY | RESIDENCE | False | False | 731 | 7.0 | 6.0 | 69.0 | 2013 |  |


In [46]:
print("Number of Crimes in Reduced Dataset: {:,d}".format(crimes_reduced.shape[0]))

Number of Crimes in Reduced Dataset: 3,007,770


## Saving the Reduced Dataset

In [42]:
crimes_reduced.to_csv("Crimes_Reduced.csv")
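
By default, `to_csv` also writes the row index as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` is one option. The sketch below uses a toy frame and in-memory buffers in place of `crimes_reduced` and the real file.

```python
import io
import pandas as pd

# Toy stand-in for crimes_reduced, with a non-default index left over from filtering.
df = pd.DataFrame(
    {"Primary Type": ["THEFT", "BURGLARY"], "Year": [2017, 2013]},
    index=[1, 5],
)

# Default behaviour: the index is written as a header-less first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False, only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```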