## City of Chicago Crime Data - Data Wrangling

#### Source of data
[City of Chicago](https://catalog.data.gov/dataset/crimes-2001-to-present-398a4)

[API Documentation](https://dev.socrata.com/foundry/data.cityofchicago.org/ta3m-92yk)

[IUCR Crime codes](https://data.cityofchicago.org/widgets/c7ck-438e)

### Import Packages

In [1]:
import pandas as pd

### Read in Data

In [2]:
# read in data
crimes = pd.read_csv("../data/Crimes_-_2001_to_Present.csv", header=0, parse_dates=[2])

### Reduce Size of DF

In [3]:
crimes.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

Columns are:
- ID - unique identifier
- Case Number - Chicago Police unique case number
- Date - timestamp of crime
- Block - partially redacted address
- IUCR - crime code
- Primary Type - description of crime code
- Description - secondary description of crime code
- Location Description - type of location, e.g. STREET (need to see what values there are)
- Arrest - binary, whether or not an arrest was made
- Domestic - binary, whether or not this is domestic crime
- Beat - police beat where the incident occurred; 3-5 beats make up a sector, and 3 sectors make up a district, of which there are 22
- Ward
- Community Area
- FBI Code
- X Coordinate - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection
- Y Coordinate
- Year - year crime occurred
- Updated On - last time record was updated
- Latitude
- Longitude
- Location

In [4]:
crimes.shape

(7266864, 22)

In [5]:
crimes.head()

|   | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10224738 | HY411648 | 2015-09-05 13:30:00 | 043XX S WOOD ST | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 12.0 | 61.0 | 08B | 1165074.0 | 1875917.0 | 2015 | 02/10/2018 03:50:01 PM | 41.815117 | -87.67 | (41.815117282, -87.669999562) |
| 1 | 10224739 | HY411615 | 2015-09-04 11:30:00 | 008XX N CENTRAL AVE | 870 | THEFT | POCKET-PICKING | CTA BUS | False | False | ... | 29.0 | 25.0 | 06 | 1138875.0 | 1904869.0 | 2015 | 02/10/2018 03:50:01 PM | 41.89508 | -87.7654 | (41.895080471, -87.765400451) |
| 2 | 11646166 | JC213529 | 2018-09-01 00:01:00 | 082XX S INGLESIDE AVE | 810 | THEFT | OVER $500 | RESIDENCE | False | True | ... | 8.0 | 44.0 | 06 | NaN | NaN | 2018 | 04/06/2019 04:04:43 PM | NaN | NaN | NaN |
| 3 | 10224740 | HY411595 | 2015-09-05 12:45:00 | 035XX W BARRY AVE | 2023 | NARCOTICS | POSS: HEROIN(BRN/TAN) | SIDEWALK | True | False | ... | 35.0 | 21.0 | 18 | 1152037.0 | 1920384.0 | 2015 | 02/10/2018 03:50:01 PM | 41.937406 | -87.71665 | (41.937405765, -87.716649687) |
| 4 | 10224741 | HY411610 | 2015-09-05 13:00:00 | 0000X N LARAMIE AVE | 560 | ASSAULT | SIMPLE | APARTMENT | False | True | ... | 28.0 | 25.0 | 08A | 1141706.0 | 1900086.0 | 2015 | 02/10/2018 03:50:01 PM | 41.881903 | -87.755121 | (41.881903443, -87.755121152) |


In [6]:
crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7266864 entries, 0 to 7266863
Data columns (total 22 columns):
 #   Column                Dtype         
---  ------                -----         
 0   ID                    int64         
 1   Case Number           object        
 2   Date                  datetime64[ns]
 3   Block                 object        
 4   IUCR                  object        
 5   Primary Type          object        
 6   Description           object        
 7   Location Description  object        
 8   Arrest                bool          
 9   Domestic              bool          
 10  Beat                  int64         
 11  District              float64       
 12  Ward                  float64       
 13  Community Area        float64       
 14  FBI Code              object        
 15  X Coordinate          float64       
 16  Y Coordinate          float64       
 17  Year                  int64         
 18  Updated On            object        
 19  

In [7]:
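# keep only the columns we need; .copy() avoids SettingWithCopyWarning when adding columns later
crimes = crimes[['ID','Date','Primary Type','Community Area']].copy()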

In [8]:
crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7266864 entries, 0 to 7266863
Data columns (total 4 columns):
 #   Column          Dtype         
---  ------          -----         
 0   ID              int64         
 1   Date            datetime64[ns]
 2   Primary Type    object        
 3   Community Area  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 221.8+ MB
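
Since only four columns are kept, one option is to load just those columns in the first place with the `usecols` argument of `pd.read_csv`. The following is a minimal sketch, not what this notebook ran: the inline sample rows are made up for illustration and stand in for the real file, and only the column names are taken from the listing above.

```python
import io
import pandas as pd

# Tiny stand-in for the full CSV (hypothetical rows, real column names).
sample_csv = io.StringIO(
    "ID,Case Number,Date,Primary Type,Community Area\n"
    "1,AA000001,2015-09-05 13:30:00,BATTERY,61\n"
    "2,AA000002,2015-09-04 11:30:00,THEFT,\n"
)

# Only these columns are parsed; the rest are skipped while reading.
keep = ["ID", "Date", "Primary Type", "Community Area"]
small = pd.read_csv(sample_csv, usecols=keep, parse_dates=["Date"])

print(small.dtypes)
```

With the real file, the `io.StringIO` object would be replaced by the path used in the read cell above.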


In [9]:
crimes.head()

|   | ID | Date | Primary Type | Community Area |
|---|---|---|---|---|
| 0 | 10224738 | 2015-09-05 13:30:00 | BATTERY | 61.0 |
| 1 | 10224739 | 2015-09-04 11:30:00 | THEFT | 25.0 |
| 2 | 11646166 | 2018-09-01 00:01:00 | THEFT | 44.0 |
| 3 | 10224740 | 2015-09-05 12:45:00 | NARCOTICS | 21.0 |
| 4 | 10224741 | 2015-09-05 13:00:00 | ASSAULT | 25.0 |


### Categorize Crime Type

We're going to group the primary crime types into a smaller set of broader categories.

In [10]:
crimes['Primary Type'].unique()

array(['BATTERY', 'THEFT', 'NARCOTICS', 'ASSAULT', 'BURGLARY', 'ROBBERY',
       'DECEPTIVE PRACTICE', 'OTHER OFFENSE', 'CRIMINAL DAMAGE',
       'WEAPONS VIOLATION', 'CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT',
       'SEX OFFENSE', 'INTERFERENCE WITH PUBLIC OFFICER',
       'OFFENSE INVOLVING CHILDREN', 'PUBLIC PEACE VIOLATION',
       'PROSTITUTION', 'GAMBLING', 'CRIM SEXUAL ASSAULT',
       'LIQUOR LAW VIOLATION', 'CRIMINAL SEXUAL ASSAULT', 'ARSON',
       'STALKING', 'KIDNAPPING', 'INTIMIDATION', 'HOMICIDE',
       'CONCEALED CARRY LICENSE VIOLATION', 'NON - CRIMINAL',
       'HUMAN TRAFFICKING', 'OBSCENITY', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'NON-CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)', 'RITUALISM',
       'DOMESTIC VIOLENCE'], dtype=object)

In [11]:
crimes['Primary Type'].isnull().sum()

0

We have no null primary types, so that is good.

In [12]:
# Assign a broad category to each primary crime type
def assign_type(crime):
    """Map a 'Primary Type' value to one of six broad crime categories."""
    if crime in ['ARSON','ASSAULT','BATTERY','CRIM SEXUAL ASSAULT','CRIMINAL SEXUAL ASSAULT','DOMESTIC VIOLENCE',
                 'HOMICIDE','HUMAN TRAFFICKING','INTIMIDATION','KIDNAPPING','OFFENSE INVOLVING CHILDREN','SEX OFFENSE',
                 'STALKING']:
        return 'Violent'
    elif crime in ['THEFT','BURGLARY','CRIMINAL DAMAGE','CRIMINAL TRESPASS','MOTOR VEHICLE THEFT','ROBBERY']:
        return 'Property'
    # 'CRIMINAL TRESPASS' is already matched by the Property branch above, so it is not repeated here
    elif crime in ['CONCEALED CARRY LICENSE VIOLATION','GAMBLING','INTERFERENCE WITH PUBLIC OFFICER',
                   'LIQUOR LAW VIOLATION','OBSCENITY','PROSTITUTION','PUBLIC INDECENCY','PUBLIC PEACE VIOLATION',
                   'RITUALISM','WEAPONS VIOLATION']:
        return 'Public Order / Vice'
    elif crime in ['DECEPTIVE PRACTICE']:
        return 'White Collar'
    elif crime in ['NARCOTICS','OTHER NARCOTIC VIOLATION']:
        return 'Drugs'
    else:
        # everything else, e.g. OTHER OFFENSE and the NON-CRIMINAL variants
        return 'Other'

In [13]:
# apply the function to the single column rather than row by row over the whole frame
crimes['Type'] = crimes['Primary Type'].apply(assign_type)

In [14]:
crimes['Type'].value_counts()

Property               3581213
Violent                1938349
Drugs                   736227
Other                   450947
White Collar            305225
Public Order / Vice     254903
Name: Type, dtype: int64
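
The same categorization can also be written as a dictionary lookup, which avoids calling a Python function once per row. This is a sketch of an alternative, not the code that produced the counts above; the names `CATEGORIES` and `TYPE_LOOKUP` are introduced here for illustration, and the category lists are copied from `assign_type`.

```python
import pandas as pd

# Category lists copied from assign_type; unlisted types fall through to 'Other'.
CATEGORIES = {
    "Violent": ["ARSON", "ASSAULT", "BATTERY", "CRIM SEXUAL ASSAULT",
                "CRIMINAL SEXUAL ASSAULT", "DOMESTIC VIOLENCE", "HOMICIDE",
                "HUMAN TRAFFICKING", "INTIMIDATION", "KIDNAPPING",
                "OFFENSE INVOLVING CHILDREN", "SEX OFFENSE", "STALKING"],
    "Property": ["THEFT", "BURGLARY", "CRIMINAL DAMAGE", "CRIMINAL TRESPASS",
                 "MOTOR VEHICLE THEFT", "ROBBERY"],
    "Public Order / Vice": ["CONCEALED CARRY LICENSE VIOLATION", "GAMBLING",
                            "INTERFERENCE WITH PUBLIC OFFICER",
                            "LIQUOR LAW VIOLATION", "OBSCENITY", "PROSTITUTION",
                            "PUBLIC INDECENCY", "PUBLIC PEACE VIOLATION",
                            "RITUALISM", "WEAPONS VIOLATION"],
    "White Collar": ["DECEPTIVE PRACTICE"],
    "Drugs": ["NARCOTICS", "OTHER NARCOTIC VIOLATION"],
}

# Invert into a flat {primary type: category} dictionary.
TYPE_LOOKUP = {crime: category
               for category, crime_list in CATEGORIES.items()
               for crime in crime_list}

# Small hand-made sample of primary types.
sample = pd.Series(["BATTERY", "THEFT", "NARCOTICS",
                    "DECEPTIVE PRACTICE", "GAMBLING", "OTHER OFFENSE"])

# Unmapped values become NaN after map(), so fill them with 'Other'.
types = sample.map(TYPE_LOOKUP).fillna("Other")
print(types.tolist())
```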

### Add year, month, day of week, hour

In [15]:
crimes['Year']=crimes['Date'].dt.year
crimes['Month']=crimes['Date'].dt.month
crimes['Day of Week']=crimes['Date'].dt.dayofweek
crimes['Hour']=crimes['Date'].dt.hour

In [16]:
crimes.head()

|   | ID | Date | Primary Type | Community Area | Type | Year | Month | Day of Week | Hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10224738 | 2015-09-05 13:30:00 | BATTERY | 61.0 | Violent | 2015 | 9 | 5 | 13 |
| 1 | 10224739 | 2015-09-04 11:30:00 | THEFT | 25.0 | Property | 2015 | 9 | 4 | 11 |
| 2 | 11646166 | 2018-09-01 00:01:00 | THEFT | 44.0 | Property | 2018 | 9 | 5 | 0 |
| 3 | 10224740 | 2015-09-05 12:45:00 | NARCOTICS | 21.0 | Drugs | 2015 | 9 | 5 | 12 |
| 4 | 10224741 | 2015-09-05 13:00:00 | ASSAULT | 25.0 | Violent | 2015 | 9 | 5 | 13 |


In [17]:
crimes.replace({'Day of Week':{0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}}, inplace=True)
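
As an aside, pandas can produce the day names directly from the datetime column with `Series.dt.day_name()`, which skips the intermediate integer codes. A small sketch on two of the timestamps shown above:

```python
import pandas as pd

# Two timestamps taken from the head() output above.
dates = pd.Series(pd.to_datetime(["2015-09-05 13:30:00", "2015-09-04 11:30:00"]))

# day_name() returns English weekday names by default.
day_names = dates.dt.day_name()
print(day_names.tolist())  # ['Saturday', 'Friday']
```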

### Missing and Invalid Values for Community Area

In [18]:
crimes['Community Area'].isnull().sum() / crimes.shape[0]

0.08442224871691557

So about 8.4% of our records have no value for community area. Let's see how the missing values are distributed across years.

In [19]:
for year in range(2001, 2021):
    per_null = crimes[crimes['Year']==year]['Community Area'].isnull().sum() / \
                crimes[crimes['Year']==year].shape[0]
    print("Year: {} Percent Null: {}".format(year, per_null))

Year: 2001 Percent Null: 0.9872805121554581
Year: 2002 Percent Null: 0.27249083233182336
Year: 2003 Percent Null: 0.00010084859893982911
Year: 2004 Percent Null: 0.0001406053272375953
Year: 2005 Percent Null: 0.00011901010043130142
Year: 2006 Percent Null: 0.00010934058700275137
Year: 2007 Percent Null: 0.0003615138736669176
Year: 2008 Percent Null: 0.0005689374215662402
Year: 2009 Percent Null: 0.0005575626112261031
Year: 2010 Percent Null: 0.0005021462703490726
Year: 2011 Percent Null: 0.0005143740550863353
Year: 2012 Percent Null: 7.734527227023248e-05
Year: 2013 Percent Null: 1.301579140892688e-05
Year: 2014 Percent Null: 0.0
Year: 2015 Percent Null: 0.0
Year: 2016 Percent Null: 0.0
Year: 2017 Percent Null: 0.0
Year: 2018 Percent Null: 0.0
Year: 2019 Percent Null: 0.0
Year: 2020 Percent Null: 4.775845682874295e-06


This is clearly a reporting issue: community area was rarely recorded in 2001 and in the early part of 2002. We are interested in comparing crimes across community areas, and we have plenty of data, so we'll simply drop the early records. Let's look at when reporting was fixed during 2002.

In [20]:
for month in range(1,12):  # covers January through November; use range(1,13) to include December
    per_null = crimes[(crimes['Year']==2002) & (crimes['Month']==month)]['Community Area'].isnull().sum() / \
                crimes[(crimes['Year']==2002) & (crimes['Month']==month)].shape[0]
    print("Month: {} Percent Null: {}".format(month, per_null))

Month: 1 Percent Null: 0.9616466177159819
Month: 2 Percent Null: 0.9681205579639623
Month: 3 Percent Null: 0.9537384994168718
Month: 4 Percent Null: 0.6489909081826356
Month: 5 Percent Null: 0.0006757858923869224
Month: 6 Percent Null: 0.0009104704097116844
Month: 7 Percent Null: 0.00010866730418151786
Month: 8 Percent Null: 2.261880527470539e-05
Month: 9 Percent Null: 0.00016513718181603718
Month: 10 Percent Null: 9.271277582050807e-05
Month: 11 Percent Null: 0.00016149870801033592


So we can see that community area started being reported consistently in May of 2002, so we'll drop all the data prior to that month.
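
The per-year and per-month null fractions computed with the loops above can also be obtained with a `groupby`, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002, 2002, 2002],
    "Month": [1, 2, 4, 4, 5, 5],
    "Community Area": [np.nan, np.nan, np.nan, 25.0, 61.0, 44.0],
})

# Fraction of null community areas per year.
null_by_year = toy["Community Area"].isnull().groupby(toy["Year"]).mean()

# Same idea per month, restricted to 2002.
in_2002 = toy[toy["Year"] == 2002]
null_by_month = in_2002["Community Area"].isnull().groupby(in_2002["Month"]).mean()

print(null_by_year)
print(null_by_month)
```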

In [21]:
crimes.dtypes

ID                         int64
Date              datetime64[ns]
Primary Type              object
Community Area           float64
Type                      object
Year                       int64
Month                      int64
Day of Week               object
Hour                       int64
dtype: object

In [22]:
crimes = crimes[crimes['Date']>='2002-05']

In [23]:
crimes.shape

(6630138, 9)
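
The filter above relies on pandas parsing the partial date string `'2002-05'` as midnight on 1 May 2002. A sketch that spells the cutoff out as an explicit `pd.Timestamp`, using a few made-up dates, shows which rows such a comparison keeps:

```python
import pandas as pd

# Hypothetical dates on either side of the cutoff.
dates = pd.Series(pd.to_datetime([
    "2002-04-30 23:59:00",
    "2002-05-01 00:00:00",
    "2015-09-05 13:30:00",
]))

# Explicit cutoff: midnight at the start of May 2002.
cutoff = pd.Timestamp("2002-05-01")

# Rows at or after the cutoff are kept; the April row is dropped.
kept = dates[dates >= cutoff]
print(kept)
```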

Let's check again for null values.

In [24]:
crimes['Community Area'].isnull().sum() / crimes.shape[0]

0.00020059914288360212

There are so few that we'll just drop them. Then we can cast community area to type int.

In [25]:
crimes.dropna(inplace=True)

In [26]:
crimes['Community Area'] = crimes['Community Area'].astype('int')
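
The drop-then-cast step can be made a little more explicit by restricting `dropna` to the one column of interest. The sketch below uses a made-up three-row frame; it also shows pandas' nullable `Int64` dtype as an alternative that keeps missing values instead of dropping them, which is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing community area.
toy = pd.DataFrame({"ID": [1, 2, 3], "Community Area": [61.0, np.nan, 25.0]})

# Drop only rows where Community Area is missing, then cast to int.
cleaned = toy.dropna(subset=["Community Area"]).copy()
cleaned["Community Area"] = cleaned["Community Area"].astype("int")

# Alternative: nullable integer dtype keeps the missing row as <NA>.
nullable = toy["Community Area"].astype("Int64")

print(cleaned)
print(nullable)
```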

In [27]:
crimes.columns

Index(['ID', 'Date', 'Primary Type', 'Community Area', 'Type', 'Year', 'Month',
       'Day of Week', 'Hour'],
      dtype='object')

Community Area should be a number from 1 to 77.  Let's see if it is.  

In [28]:
crimes['Community Area'].value_counts().sort_index()

0         74
1     101659
2      83524
3      96436
4      46814
       ...  
73     78765
74     14854
75     52962
76     37116
77     65344
Name: Community Area, Length: 78, dtype: int64

We have just a few records with a community area of 0, which is not a valid area number, so we'll get rid of those.

In [29]:
crimes = crimes[~(crimes['Community Area']==0)]

In [30]:
crimes.shape

(6628734, 9)
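
A slightly more general check is to keep only values inside the valid range of 1 to 77, which would also catch any out-of-range values above 77. A sketch with made-up values, using `Series.between` (inclusive on both ends by default):

```python
import pandas as pd

# Hypothetical community area codes, including two invalid ones.
areas = pd.Series([0, 1, 25, 77, 78])

# True where the code is a valid community area number.
valid = areas.between(1, 77)

print(areas[valid].tolist())  # [1, 25, 77]
```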

#### Add names for community areas

In [31]:
community_areas = pd.read_csv('../data/Community Area Names.csv',header=None)

In [32]:
community_areas.head()

|   | 0 | 1 |
|---|---|---|
| 0 | 1 | Rogers Park |
| 1 | 2 | West Ridge |
| 2 | 3 | Uptown |
| 3 | 4 | Lincoln Square |
| 4 | 5 | North Center |


In [33]:
community_names = {}
for area in range(1,78):
    community_names[area] = community_areas.iloc[area-1,1]
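
The loop above assumes the rows of the names file are in order, so that row `area-1` holds area `area`. An alternative that does not depend on row order is to pair the two columns directly. A sketch on a three-row stand-in for the names file (column labels `0` and `1` as produced by `header=None`):

```python
import pandas as pd

# Stand-in for the first rows of the community area names file.
toy_areas = pd.DataFrame([[1, "Rogers Park"], [2, "West Ridge"], [3, "Uptown"]])

# Pair the number column with the name column, whatever the row order.
names = dict(zip(toy_areas[0], toy_areas[1]))

print(names[1], names[3])  # Rogers Park Uptown
```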

In [34]:
community_names

{1: 'Rogers Park',
 2: 'West Ridge',
 3: 'Uptown',
 4: 'Lincoln Square',
 5: 'North Center',
 6: 'Lake View',
 7: 'Lincoln Park',
 8: 'Near North Side',
 9: 'Edison Park',
 10: 'Norwood Park',
 11: 'Jefferson Park',
 12: 'Forest Glen',
 13: 'North Park',
 14: 'Albany Park',
 15: 'Portage Park',
 16: 'Irving Park',
 17: 'Dunning',
 18: 'Montclare',
 19: 'Belmont Cragin',
 20: 'Hermosa',
 21: 'Avondale',
 22: 'Logan Square',
 23: 'Humboldt Park',
 24: 'West Town',
 25: 'Austin',
 26: 'West Garfield Park',
 27: 'East Garfield Park',
 28: 'Near West Side',
 29: 'North Lawndale',
 30: 'South Lawndale',
 31: 'Lower West Side',
 32: 'The Loop',
 33: 'Near South Side',
 34: 'Armour Square',
 35: 'Douglas',
 36: 'Oakland',
 37: 'Fuller Park',
 38: 'Grand Boulevard',
 39: 'Kenwood',
 40: 'Washington Park',
 41: 'Hyde Park',
 42: 'Woodlawn',
 43: 'South Shore',
 44: 'Chatham',
 45: 'Avalon Park',
 46: 'South Chicago',
 47: 'Burnside',
 48: 'Calumet Heights',
 49: 'Roseland',
 50: 'Pullman',
 51: 

In [35]:
crimes.replace({'Community Area': community_names}, inplace=True)

In [36]:
crimes.head()

|   | ID | Date | Primary Type | Community Area | Type | Year | Month | Day of Week | Hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10224738 | 2015-09-05 13:30:00 | BATTERY | New City | Violent | 2015 | 9 | Saturday | 13 |
| 1 | 10224739 | 2015-09-04 11:30:00 | THEFT | Austin | Property | 2015 | 9 | Friday | 11 |
| 2 | 11646166 | 2018-09-01 00:01:00 | THEFT | Chatham | Property | 2018 | 9 | Saturday | 0 |
| 3 | 10224740 | 2015-09-05 12:45:00 | NARCOTICS | Avondale | Drugs | 2015 | 9 | Saturday | 12 |
| 4 | 10224741 | 2015-09-05 13:00:00 | ASSAULT | Austin | Violent | 2015 | 9 | Saturday | 13 |
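
For a one-to-one lookup like this, `Series.map` with the dictionary is a common alternative to `replace`. One difference to be aware of: `map` turns codes that are missing from the dictionary into `NaN`, whereas `replace` leaves them unchanged. A sketch with a two-entry excerpt of `community_names` and one deliberately unknown code:

```python
import pandas as pd

# Two entries from community_names; 99 is a made-up code with no name.
community_names = {25: "Austin", 61: "New City"}
toy = pd.Series([61, 25, 99])

# Codes without an entry in the dictionary become NaN.
named = toy.map(community_names)

print(named)
```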


In [37]:
crimes['Community Area'].unique()

array(['New City', 'Austin', 'Chatham', 'Avondale', 'Auburn Gresham',
       'West Town', 'Lower West Side', 'East Garfield Park', 'Gage Park',
       'West Lawn', 'Jefferson Park', 'Roseland', 'Kenwood',
       'South Deering', 'Portage Park', 'East Side', 'Forest Glen',
       'South Shore', 'South Chicago', 'The Loop', 'Englewood',
       'Albany Park', 'North Lawndale', 'West Ridge',
       'Greater Grand Crossing', 'Humboldt Park', 'Douglas',
       'Near North Side', 'Uptown', 'Lake View', 'Garfield Ridge',
       'Near West Side', 'West Pullman', 'Pullman', 'Near South Side',
       'North Park', 'Mount Greenwood', 'Belmont Cragin', 'Beverly Hills',
       'Logan Square', 'Clearing', 'Lincoln Park', 'North Center',
       'West Garfield Park', 'Chicago Lawn', 'McKinley Park', 'Edgewater',
       'Rogers Park', 'Norwood Park', 'West Englewood', 'Grand Boulevard',
       'Hyde Park', 'Bridgeport', 'South Lawndale', 'Riverdale',
       'Brighton Park', 'Calumet Heights', 'Washingto

In [38]:
# Let's write this out now since we have crime type categorized
crimes[['ID', 'Date', 'Community Area', 'Type', 'Year', 'Month',
       'Day of Week', 'Hour']].to_csv('../data/All Crimes Categorized.csv', index=False)
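
When the saved file is read back later, the `Date` column comes back as plain strings unless it is parsed again. A minimal round-trip sketch, using an in-memory buffer in place of the file on disk and a single row taken from the output above:

```python
import io
import pandas as pd

# One row in the same shape as the categorized output (subset of columns).
toy = pd.DataFrame({
    "ID": [10224738],
    "Date": pd.to_datetime(["2015-09-05 13:30:00"]),
    "Community Area": ["New City"],
    "Type": ["Violent"],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on reload.
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```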