#### Source of data
[City of Chicago](https://catalog.data.gov/dataset/crimes-2001-to-present-398a4)

[API Documentation](https://dev.socrata.com/foundry/data.cityofchicago.org/ta3m-92yk)

[IUCR Crime codes](https://data.cityofchicago.org/widgets/c7ck-438e)

In [1]:
import pandas as pd

In [2]:
# read in data
df = pd.read_csv("../data/Crimes_-_2001_to_Present.csv", header=0)

In [3]:
df.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

Columns are:
- ID - unique identifier
- Case Number - Chicago Police unique case number
- Date - timestamp of crime
- Block - partially redacted address
- IUCR - crime code
- Primary Type - description of crime code
- Description - secondary description of crime code
- Location Description - type of place where the incident occurred, e.g. STREET (the full set of values still needs to be checked)
- Arrest - binary, whether or not an arrest was made
- Domestic - binary, whether or not this is domestic crime
- Beat - police beat; 3-5 beats make up a sector, and 3 sectors make up a district, of which there are 22
- District
- Ward
- Community Area
- FBI Code
- X Coordinate - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection
- Y Coordinate
- Year - year the crime occurred
- Updated On - last time record was updated
- Latitude
- Longitude
- Location

In [4]:
df.shape

(7255968, 22)

In [5]:
df.head()

,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,09/05/2015 01:30:00 PM,043XX S WOOD ST,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,...,12.0,61.0,08B,1165074.0,1875917.0,2015,02/10/2018 03:50:01 PM,41.815117,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,09/04/2015 11:30:00 AM,008XX N CENTRAL AVE,870,THEFT,POCKET-PICKING,CTA BUS,False,False,...,29.0,25.0,06,1138875.0,1904869.0,2015,02/10/2018 03:50:01 PM,41.89508,-87.7654,"(41.895080471, -87.765400451)"
2,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,
3,10224740,HY411595,09/05/2015 12:45:00 PM,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,...,35.0,21.0,18,1152037.0,1920384.0,2015,02/10/2018 03:50:01 PM,41.937406,-87.71665,"(41.937405765, -87.716649687)"
4,10224741,HY411610,09/05/2015 01:00:00 PM,0000X N LARAMIE AVE,560,ASSAULT,SIMPLE,APARTMENT,False,True,...,28.0,25.0,08A,1141706.0,1900086.0,2015,02/10/2018 03:50:01 PM,41.881903,-87.755121,"(41.881903443, -87.755121152)"


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7255968 entries, 0 to 7255967
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(10)

In [7]:
# Convert date to datetime object
df['Date'] = pd.to_datetime(df['Date'])
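Without a `format` argument, `pd.to_datetime` has to infer the layout of the strings, which can be slow on several million rows. A minimal sketch of passing the format explicitly, assuming every value follows the `MM/DD/YYYY hh:mm:ss AM/PM` pattern seen in `df.head()` (the two sample strings are copied from that output):

```python
import pandas as pd

# Sample strings copied from the Date column shown in df.head()
raw = pd.Series(["09/05/2015 01:30:00 PM", "09/01/2018 12:01:00 AM"])

# %I is the hour on a 12-hour clock and %p is the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

If some rows deviate from this pattern, the call raises an error instead of guessing, which is usually preferable during data wrangling.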

In [8]:
# Let's write out the columns we are really interested in to a csv file so we have less data to deal with
df[['ID','Date','IUCR','Primary Type','Description','Arrest','Domestic','Latitude','Longitude']].to_csv('../data/Smaller Dataset with Datetime.csv')

In [9]:
# Now let's clear memory from our first dataframe
del df
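An alternative to loading all 22 columns and deleting the dataframe afterwards is to read only the columns of interest. The sketch below uses a tiny in-memory stand-in for the CSV (rows adapted from `df.head()`, with a reduced and illustrative set of columns); `usecols` limits what is parsed and `dtype` keeps the IUCR codes as strings so leading zeros survive.

```python
import io
import pandas as pd

# Tiny stand-in for the full CSV file (illustrative subset of columns)
csv_text = (
    "ID,Case Number,Date,IUCR,Arrest,Ward\n"
    "10224738,HY411648,09/05/2015 01:30:00 PM,0486,False,12\n"
    "10224739,HY411615,09/04/2015 11:30:00 AM,0870,False,29\n"
)

wanted = ["ID", "Date", "IUCR", "Arrest"]

# Only the listed columns are parsed; IUCR stays a string
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype={"IUCR": str})
print(subset.columns.tolist())
```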

In [10]:
smaller_df = pd.read_csv('../data/Smaller Dataset with Datetime.csv')

In [11]:
smaller_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7255968 entries, 0 to 7255967
Data columns (total 10 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Unnamed: 0    int64  
 1   ID            int64  
 2   Date          object 
 3   IUCR          object 
 4   Primary Type  object 
 5   Description   object 
 6   Arrest        bool   
 7   Domestic      bool   
 8   Latitude      float64
 9   Longitude     float64
dtypes: bool(2), float64(2), int64(2), object(4)
memory usage: 456.7+ MB
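The `info()` output above shows two side effects of the CSV round trip: the old index was written out and read back as `Unnamed: 0`, and `Date` is an `object` column again. A minimal sketch of avoiding both, using an in-memory buffer in place of the real file (the two rows are taken from `df.head()`):

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "ID": [10224738, 10224739],
    "Date": pd.to_datetime(["2015-09-05 13:30:00", "2015-09-04 11:30:00"]),
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # do not write the RangeIndex as a column
buffer.seek(0)

# parse_dates converts the named column back to datetime64 on read
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```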


In [12]:
# Convert date to datetime object
smaller_df['Date'] = pd.to_datetime(smaller_df['Date'])

In [13]:
smaller_df.set_index('Date', inplace=True)

In [14]:
smaller_df.drop(columns=['Unnamed: 0'],inplace=True)

In [15]:
smaller_df.head()

Date,ID,IUCR,Primary Type,Description,Arrest,Domestic,Latitude,Longitude
2015-09-05 13:30:00,10224738,486,BATTERY,DOMESTIC BATTERY SIMPLE,False,True,41.815117,-87.67
2015-09-04 11:30:00,10224739,870,THEFT,POCKET-PICKING,False,False,41.89508,-87.7654
2018-09-01 00:01:00,11646166,810,THEFT,OVER $500,False,True,,
2015-09-05 12:45:00,10224740,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),True,False,41.937406,-87.71665
2015-09-05 13:00:00,10224741,560,ASSAULT,SIMPLE,False,True,41.881903,-87.755121
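With `Date` as the index, rows can be selected by partial date strings and counted per period. The sketch below uses a four-row toy frame (timestamps and crime types from the output above); sorting the index first is an assumption made here because the real index is not in chronological order.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Primary Type": ["BATTERY", "THEFT", "THEFT", "NARCOTICS"]},
    index=pd.to_datetime([
        "2015-09-05 13:30:00", "2015-09-04 11:30:00",
        "2018-09-01 00:01:00", "2015-09-05 12:45:00",
    ]),
)
toy.index.name = "Date"
toy = toy.sort_index()  # chronological order

rows_2015 = toy.loc["2015"]                    # every row dated in 2015
per_year = toy.groupby(toy.index.year).size()  # number of rows per year
print(per_year)
```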


In [16]:
smaller_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7255968 entries, 2015-09-05 13:30:00 to 2020-12-27 12:13:00
Data columns (total 8 columns):
 #   Column        Dtype  
---  ------        -----  
 0   ID            int64  
 1   IUCR          object 
 2   Primary Type  object 
 3   Description   object 
 4   Arrest        bool   
 5   Domestic      bool   
 6   Latitude      float64
 7   Longitude     float64
dtypes: bool(2), float64(2), int64(1), object(3)
memory usage: 401.4+ MB
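The string columns (`IUCR`, `Primary Type`, `Description`) hold a small number of distinct values repeated across millions of rows, so converting them to the `category` dtype is one way to reduce the memory footprint. A sketch on a toy column, not a measurement of the real dataset:

```python
import pandas as pd

# 3000 rows but only three distinct labels
toy = pd.DataFrame({"Primary Type": ["BATTERY", "THEFT", "NARCOTICS"] * 1000})

before = toy.memory_usage(deep=True).sum()
toy["Primary Type"] = toy["Primary Type"].astype("category")
after = toy.memory_usage(deep=True).sum()

print(before, after)
```

The saving depends on how many distinct values a column has; columns with mostly unique strings gain little from this conversion.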


In [17]:
smaller_df['IUCR'].unique().shape

(402,)

In [18]:
smaller_df['Domestic'].sum()

974228

In [19]:
smaller_df['Arrest'].sum()

1970069
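The two sums are easier to interpret as shares of all records. The arithmetic below uses only the counts printed in this notebook; on the dataframe itself, `smaller_df['Domestic'].mean()` would give the same proportion directly, since `True` counts as 1.

```python
total_rows = 7255968   # from df.shape
domestic = 974228      # smaller_df['Domestic'].sum()
arrests = 1970069      # smaller_df['Arrest'].sum()

domestic_share = domestic / total_rows
arrest_share = arrests / total_rows

print(f"{domestic_share:.1%} domestic, {arrest_share:.1%} with an arrest")
```

So roughly 13% of records are flagged as domestic and about 27% led to an arrest.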

In [20]:
smaller_df[smaller_df['Domestic']]['IUCR'].unique().shape

(295,)
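Beyond the number of distinct codes, it can help to see which codes dominate the domestic records. A sketch with a toy frame (the codes are real IUCR values from this dataset, but the `Domestic` flags here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "IUCR": ["0486", "0486", "0870", "0810", "0560"],
    "Domestic": [True, True, False, True, True],  # illustrative flags
})

# Boolean mask and column selection in a single .loc call
domestic_codes = toy.loc[toy["Domestic"], "IUCR"]

print(domestic_codes.nunique())       # number of distinct codes
print(domestic_codes.value_counts())  # frequency of each code
```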

In [21]:
# Let's write this data out to a CSV file
smaller_df.to_csv('../data/Smaller Dataset - Data Wrangling.csv')

In [22]:
smaller_df['IUCR'].unique()

array(['0486', '0870', '0810', '2023', '0560', '0610', '0620', '0860',
       '0320', '1153', '0820', '0460', '2820', '0497', '1320', '1310',
       '031A', '0890', '0430', '2825', '5002', '4625', '1811', '141C',
       '5111', '1460', '1122', '2014', '143B', '141A', '1330', '0910',
       '2826', '1130', '2027', '0420', '0850', '2092', '2024', '1570',
       '2017', '051A', '3710', '0880', '1751', '1753', '0313', '1822',
       '2890', '2851', '0340', '502T', '1154', '1120', '2022', '143A',
       '1156', '1150', '4387', '1305', '1110', '1365', '2028', '0470',
       '1350', '0454', '041A', '0930', '1360', '0337', '1506', '0520',
       '5011', '1661', '3730', '2093', '1710', '502R', '0461', '0453',
       '0331', '1563', '0281', '3731', '0291', '0530', '1210', '1755',
       '2230', '0550', '0266', '1340', '0498', '1754', '1821', '2025',
       '1025', '4650', '3800', '5001', '1151', '1752', '4310', '1375',
       '500N', '1345', '0630', '1725', '1121', '0580', '031B', '1790',
      

In [23]:
smaller_df['Primary Type'].unique()

array(['BATTERY', 'THEFT', 'NARCOTICS', 'ASSAULT', 'BURGLARY', 'ROBBERY',
       'DECEPTIVE PRACTICE', 'OTHER OFFENSE', 'CRIMINAL DAMAGE',
       'WEAPONS VIOLATION', 'CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT',
       'SEX OFFENSE', 'INTERFERENCE WITH PUBLIC OFFICER',
       'OFFENSE INVOLVING CHILDREN', 'PUBLIC PEACE VIOLATION',
       'PROSTITUTION', 'GAMBLING', 'CRIM SEXUAL ASSAULT',
       'LIQUOR LAW VIOLATION', 'CRIMINAL SEXUAL ASSAULT', 'ARSON',
       'STALKING', 'KIDNAPPING', 'INTIMIDATION', 'HOMICIDE',
       'CONCEALED CARRY LICENSE VIOLATION', 'NON - CRIMINAL',
       'HUMAN TRAFFICKING', 'OBSCENITY', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'NON-CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)', 'RITUALISM',
       'DOMESTIC VIOLENCE'], dtype=object)
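The list contains what look like spelling variants of the same category: `CRIM SEXUAL ASSAULT` and `CRIMINAL SEXUAL ASSAULT`, and `NON - CRIMINAL` and `NON-CRIMINAL`. Assuming these pairs really do describe the same category (worth confirming against the IUCR code list before applying it), they could be merged with a small mapping:

```python
import pandas as pd

labels = pd.Series([
    "CRIM SEXUAL ASSAULT", "CRIMINAL SEXUAL ASSAULT",
    "NON - CRIMINAL", "NON-CRIMINAL", "THEFT",
])

# Assumed equivalences; verify before using on the real data
merge_map = {
    "CRIM SEXUAL ASSAULT": "CRIMINAL SEXUAL ASSAULT",
    "NON - CRIMINAL": "NON-CRIMINAL",
}

cleaned = labels.replace(merge_map)
print(sorted(cleaned.unique()))
```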

In [24]:
smaller_df[smaller_df['Domestic']]['Primary Type'].unique()

array(['BATTERY', 'THEFT', 'ASSAULT', 'ROBBERY', 'OTHER OFFENSE',
       'CRIMINAL DAMAGE', 'PUBLIC PEACE VIOLATION', 'NARCOTICS',
       'CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT', 'CRIM SEXUAL ASSAULT',
       'CRIMINAL SEXUAL ASSAULT', 'OFFENSE INVOLVING CHILDREN',
       'BURGLARY', 'ARSON', 'DECEPTIVE PRACTICE', 'SEX OFFENSE',
       'STALKING', 'KIDNAPPING', 'OBSCENITY', 'HOMICIDE',
       'WEAPONS VIOLATION', 'INTIMIDATION',
       'INTERFERENCE WITH PUBLIC OFFICER', 'HUMAN TRAFFICKING',
       'PROSTITUTION', 'NON-CRIMINAL (SUBJECT SPECIFIED)',
       'LIQUOR LAW VIOLATION', 'GAMBLING', 'RITUALISM',
       'DOMESTIC VIOLENCE', 'PUBLIC INDECENCY', 'NON-CRIMINAL'],
      dtype=object)
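Comparing the two arrays by eye is error-prone, so a set difference makes it explicit which primary types never carry the domestic flag. The two sets below are copied from the outputs of cells 23 and 24.

```python
all_types = {
    'BATTERY', 'THEFT', 'NARCOTICS', 'ASSAULT', 'BURGLARY', 'ROBBERY',
    'DECEPTIVE PRACTICE', 'OTHER OFFENSE', 'CRIMINAL DAMAGE',
    'WEAPONS VIOLATION', 'CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT',
    'SEX OFFENSE', 'INTERFERENCE WITH PUBLIC OFFICER',
    'OFFENSE INVOLVING CHILDREN', 'PUBLIC PEACE VIOLATION',
    'PROSTITUTION', 'GAMBLING', 'CRIM SEXUAL ASSAULT',
    'LIQUOR LAW VIOLATION', 'CRIMINAL SEXUAL ASSAULT', 'ARSON',
    'STALKING', 'KIDNAPPING', 'INTIMIDATION', 'HOMICIDE',
    'CONCEALED CARRY LICENSE VIOLATION', 'NON - CRIMINAL',
    'HUMAN TRAFFICKING', 'OBSCENITY', 'PUBLIC INDECENCY',
    'OTHER NARCOTIC VIOLATION', 'NON-CRIMINAL',
    'NON-CRIMINAL (SUBJECT SPECIFIED)', 'RITUALISM',
    'DOMESTIC VIOLENCE',
}

domestic_types = {
    'BATTERY', 'THEFT', 'ASSAULT', 'ROBBERY', 'OTHER OFFENSE',
    'CRIMINAL DAMAGE', 'PUBLIC PEACE VIOLATION', 'NARCOTICS',
    'CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT', 'CRIM SEXUAL ASSAULT',
    'CRIMINAL SEXUAL ASSAULT', 'OFFENSE INVOLVING CHILDREN',
    'BURGLARY', 'ARSON', 'DECEPTIVE PRACTICE', 'SEX OFFENSE',
    'STALKING', 'KIDNAPPING', 'OBSCENITY', 'HOMICIDE',
    'WEAPONS VIOLATION', 'INTIMIDATION',
    'INTERFERENCE WITH PUBLIC OFFICER', 'HUMAN TRAFFICKING',
    'PROSTITUTION', 'NON-CRIMINAL (SUBJECT SPECIFIED)',
    'LIQUOR LAW VIOLATION', 'GAMBLING', 'RITUALISM',
    'DOMESTIC VIOLENCE', 'PUBLIC INDECENCY', 'NON-CRIMINAL',
}

# Primary types with no record flagged as domestic
never_domestic = all_types - domestic_types
print(sorted(never_domestic))
```

On the dataframe itself the same comparison could be written as `set(smaller_df['Primary Type'].unique()) - set(smaller_df[smaller_df['Domestic']]['Primary Type'].unique())`.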