<h3>Import Required Packages & Data</h3>

In [1]:
#import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

In [2]:
#import file
file = 'Chicago_Crimes_2016.csv'
df_crime = pd.read_csv(file)

<h3>Review the Data</h3>

In [3]:
#review a summary of the dataframe
df_crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265462 entries, 0 to 265461
Data columns (total 23 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Unnamed: 0            265462 non-null  int64  
 1   ID                    265462 non-null  int64  
 2   Case Number           265462 non-null  object 
 3   Date                  265462 non-null  object 
 4   Block                 265462 non-null  object 
 5   IUCR                  265462 non-null  object 
 6   Primary Type          265462 non-null  object 
 7   Description           265462 non-null  object 
 8   Location Description  264679 non-null  object 
 9   Arrest                265462 non-null  bool   
 10  Domestic              265462 non-null  bool   
 11  Beat                  265462 non-null  int64  
 12  District              265462 non-null  float64
 13  Ward                  265462 non-null  float64
 14  Community Area        265462 non-null  float64
 15  

<b>Observation #1:</b> There is an unnamed column (<i>Unnamed: 0</i>). Depending on what values it contains, we may be able to drop it from our dataset.</br></br>
<b>Observation #2:</b> Several columns have missing data: Location Description, X Coordinate, Y Coordinate, Latitude, Longitude, & Location. We will need to determine whether to drop these rows/columns or to replace the NULL values (for example, with the mean value for numeric columns).</br></br>
<b>Observation #3:</b> Several columns describe the location of the crime (Location Description, District, Community Area, X Coordinate, Y Coordinate, Latitude, Longitude, & Location). We may be able to drop a few of these columns from our dataset.</br></br>
<b>Observation #4:</b> The <i>Date</i> and <i>Updated On</i> columns are currently stored as "object." We should convert them to an appropriate date/time data type. Additionally, the <i>Year</i> column may be redundant, as the year is already contained in the <i>Date</i> field.
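
As a minimal sketch of how Observation #2 could be checked, the snippet below counts missing values per column and shows one way to fill a numeric column with its mean. The `toy` dataframe is a hypothetical three-row stand-in for `df_crime`, and filling coordinates with a mean is shown only to illustrate the mechanics, not as a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_crime with a few deliberately missing values
toy = pd.DataFrame({
    "Unnamed: 0": [3, 89, 197],
    "Location Description": ["APARTMENT", None, "STREET"],
    "Latitude": [41.864073, np.nan, 41.894908],
})

# Number and share of missing values in each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Mean imputation only makes sense for numeric columns
filled_lat = toy["Latitude"].fillna(toy["Latitude"].mean())

print(missing_counts)
```

For a text column such as <i>Location Description</i>, a mean is undefined, so dropping the rows or filling with a placeholder category would be the alternatives to weigh.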

In [4]:
#review a sample of the data within the dataframe
df_crime.head(10)

,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,3,10508693,HZ250496,05/03/2016 11:40:00 PM,013XX S SAWYER AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,True,True,1022,10.0,24.0,29.0,08B,1154907.0,1893681.0,2016,05/10/2016 03:56:50 PM,41.864073,-87.706819,"(41.864073157, -87.706818608)"
1,89,10508695,HZ250409,05/03/2016 09:40:00 PM,061XX S DREXEL AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,313,3.0,20.0,42.0,08B,1183066.0,1864330.0,2016,05/10/2016 03:56:50 PM,41.782922,-87.604363,"(41.782921527, -87.60436317)"
2,197,10508697,HZ250503,05/03/2016 11:31:00 PM,053XX W CHICAGO AVE,0470,PUBLIC PEACE VIOLATION,RECKLESS CONDUCT,STREET,False,False,1524,15.0,37.0,25.0,24,1140789.0,1904819.0,2016,05/10/2016 03:56:50 PM,41.894908,-87.758372,"(41.894908283, -87.758371958)"
3,673,10508698,HZ250424,05/03/2016 10:10:00 PM,049XX W FULTON ST,0460,BATTERY,SIMPLE,SIDEWALK,False,False,1532,15.0,28.0,25.0,08B,1143223.0,1901475.0,2016,05/10/2016 03:56:50 PM,41.885687,-87.749516,"(41.885686845, -87.749515983)"
4,911,10508699,HZ250455,05/03/2016 10:00:00 PM,003XX N LOTUS AVE,0820,THEFT,$500 AND UNDER,RESIDENCE,False,True,1523,15.0,28.0,25.0,06,1139890.0,1901675.0,2016,05/10/2016 03:56:50 PM,41.886297,-87.761751,"(41.886297242, -87.761750709)"
5,1108,10508702,HZ250447,05/03/2016 10:35:00 PM,082XX S MARYLAND AVE,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,False,False,631,6.0,8.0,44.0,04B,1183336.0,1850642.0,2016,05/10/2016 03:56:50 PM,41.745354,-87.603799,"(41.745354023, -87.603798903)"
6,1130,10508703,HZ250489,05/03/2016 10:30:00 PM,027XX S STATE ST,0460,BATTERY,SIMPLE,CHA HALLWAY/STAIRWELL/ELEVATOR,False,False,133,1.0,3.0,35.0,08B,1176730.0,1886544.0,2016,05/10/2016 03:56:50 PM,41.844024,-87.626923,"(41.844023772, -87.626923253)"
7,1801,10508704,HZ250514,05/03/2016 09:30:00 PM,002XX E 46TH ST,0460,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,False,215,2.0,3.0,38.0,08B,1178514.0,1874573.0,2016,05/10/2016 03:56:50 PM,41.811134,-87.620741,"(41.811133958, -87.62074077)"
8,1868,10508709,HZ250523,05/03/2016 04:00:00 PM,014XX W DEVON AVE,0460,BATTERY,SIMPLE,SIDEWALK,False,False,2432,24.0,40.0,1.0,08B,1165696.0,1942616.0,2016,05/10/2016 03:56:50 PM,41.998131,-87.665814,"(41.99813061, -87.665814038)"
9,1891,10508982,HZ250667,05/03/2016 10:30:00 PM,069XX S ASHLAND AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,735,7.0,17.0,67.0,08B,1166876.0,1858796.0,2016,05/10/2016 03:56:50 PM,41.768097,-87.663879,"(41.768096835, -87.663878589)"


<b>Observation #5:</b> The <i>Date</i> and <i>Updated On</i> columns contain both date and time. We need to account for this when converting the data type.</br></br>
<b>Observation #6:</b> The <i>Location</i> column contains a concatenation of <i>Latitude</i> and <i>Longitude</i>. We can consider dropping either just the <i>Location</i> column, or both the <i>Latitude</i> and <i>Longitude</i> columns.</br></br>
<b>Observation #7:</b> The <i>District, Ward, Community Area, X Coordinate,</i> and <i>Y Coordinate</i> columns are stored as "float," but "integer" may be a more appropriate data type.</br></br>
<b>Observation #8:</b> The <i>Block</i> field contains street names along with a partially masked block number. We may be able to create a new column for the street instead of the block, which should help to standardize the data.</br></br>
<b>Observation #9:</b> According to Google, <i>IUCR</i> stands for "Illinois Uniform Crime Reporting." This column may not be relevant to our objective and could be dropped.
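
To illustrate Observation #8, here is a small sketch of pulling the street out of <i>Block</i>. It assumes every value follows the pattern seen in the sample above (a masked block number, then the street), and the `blocks` and `streets` names are hypothetical.

```python
import pandas as pd

# Block values copied from the sample rows above
blocks = pd.Series(["013XX S SAWYER AVE", "002XX E 46TH ST", "014XX W DEVON AVE"])

# Split once on whitespace: the first piece is the masked block number,
# the remainder is the direction plus street name
streets = blocks.str.split(n=1).str[1]

print(streets.tolist())
```

Values that do not follow this pattern would need separate handling, so it would be worth checking the result for missing or odd-looking streets before relying on it.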

In [5]:
#review a summary of the data within the dataframe
df_crime.describe()

,Unnamed: 0,ID,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude
count,265462.0,265462.0,265462.0,265462.0,265462.0,265462.0,251273.0,251273.0,265462.0,251273.0,251273.0
mean,5047712.0,10560100.0,1147.235107,11.243112,23.03828,36.917683,1164503.0,1886326.0,2016.0,41.843676,-87.671841
std,1651671.0,581766.3,690.457744,6.897241,13.943938,21.361191,16202.88,31074.34,0.0,0.085459,0.058986
min,3.0,22245.0,111.0,1.0,1.0,1.0,1094231.0,1813910.0,2016.0,41.644604,-87.928909
25%,4053823.0,10482520.0,613.0,6.0,10.0,23.0,1152767.0,1859326.0,2016.0,41.769418,-87.714549
50%,6090142.0,10590520.0,1031.0,10.0,24.0,32.0,1166176.0,1893403.0,2016.0,41.863197,-87.66571
75%,6157513.0,10700950.0,1712.0,17.0,34.0,55.0,1176352.0,1908878.0,2016.0,41.905721,-87.628111
max,6253474.0,10827870.0,2535.0,31.0,50.0,77.0,1205117.0,1951535.0,2016.0,42.022671,-87.524529


<b>Observation #10:</b> Every row has a <i>Year</i> value of 2016 (the min, max, and mean are all 2016 and the standard deviation is 0), so we can drop this column.
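
The same conclusion can be reached without reading the summary table by eye. The sketch below, run on a hypothetical `toy` dataframe rather than `df_crime`, lists every column that holds a single distinct value.

```python
import pandas as pd

# Hypothetical stand-in for df_crime
toy = pd.DataFrame({
    "ID": [10508693, 10508695, 10508697],
    "Year": [2016, 2016, 2016],
    "Beat": [1022, 313, 1524],
})

# Columns with exactly one distinct value carry no information
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(constant_cols)
```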

<h3>Wrangle the Data</h3>

In [6]:
#remove unnecessary columns
df_crime.drop(
    columns=["Unnamed: 0", "Case Number", "Year", "IUCR", "X Coordinate",
             "Y Coordinate", "Updated On", "Block", "FBI Code"],
    inplace=True,
)

#update datatypes
df_crime[["District","Ward","Community Area"]] = df_crime[["District","Ward","Community Area"]].astype("int")

#remove rows with missing data
df_crime.dropna(subset=["Location","Location Description"], axis=0, inplace=True)
df_crime.reset_index(drop=True, inplace=True)

#review the dataframe info to ensure accuracy of wrangling
df_crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250732 entries, 0 to 250731
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   ID                    250732 non-null  int64  
 1   Date                  250732 non-null  object 
 2   Primary Type          250732 non-null  object 
 3   Description           250732 non-null  object 
 4   Location Description  250732 non-null  object 
 5   Arrest                250732 non-null  bool   
 6   Domestic              250732 non-null  bool   
 7   Beat                  250732 non-null  int64  
 8   District              250732 non-null  int64  
 9   Ward                  250732 non-null  int64  
 10  Community Area        250732 non-null  int64  
 11  Latitude              250732 non-null  float64
 12  Longitude             250732 non-null  float64
 13  Location              250732 non-null  object 
dtypes: bool(2), float64(2), int64(5), object(5)
memory u
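
The `astype("int")` conversion above works because <i>District</i>, <i>Ward</i>, and <i>Community Area</i> have no missing values. As a hedged aside, the sketch below shows what happens when a float column does contain a gap, and how the pandas nullable `Int64` type handles it. The `ward` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical float column with one missing value
ward = pd.Series([24.0, np.nan, 37.0], name="Ward")

# A plain integer cast cannot represent NaN and raises an error
try:
    ward.astype("int")
    plain_cast_failed = False
except (ValueError, TypeError):
    plain_cast_failed = True

# The nullable integer dtype keeps the gap as <NA>
ward_nullable = ward.astype("Int64")

print(plain_cast_failed, ward_nullable.tolist())
```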

In [7]:
#separate date and time from Date field (Cameron)
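
This cell is still a placeholder. One possible approach, sketched on a hypothetical `toy` dataframe, is to parse <i>Date</i> with an explicit format and then derive separate date and time columns. The column names `Crime Date` and `Crime Time` are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Timestamps copied from the sample rows above
toy = pd.DataFrame({"Date": ["05/03/2016 11:40:00 PM", "05/03/2016 04:00:00 PM"]})

# Month/day/year with a 12-hour clock and AM/PM marker
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Hypothetical column names for the separated parts
toy["Crime Date"] = toy["Date"].dt.date
toy["Crime Time"] = toy["Date"].dt.time

print(toy)
```

Passing `format` explicitly avoids any ambiguity between month-first and day-first dates and is usually faster than letting pandas infer the format.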

<h3>Binning Data</h3>

In [8]:
#bin by Season (Cameron)
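
This cell is still a placeholder. A minimal sketch of season binning follows, assuming meteorological seasons (December through February as winter, and so on). The `season_bin` helper and the sample timestamps other than the first are hypothetical.

```python
import pandas as pd

def season_bin(month):
    """Map a month number (1-12) to a season (assumed meteorological definition)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

dates = pd.to_datetime(
    pd.Series([
        "05/03/2016 11:40:00 PM",   # from the sample rows above
        "01/15/2016 08:00:00 AM",   # hypothetical
        "07/04/2016 09:30:00 PM",   # hypothetical
        "10/31/2016 06:45:00 PM",   # hypothetical
    ]),
    format="%m/%d/%Y %I:%M:%S %p",
)

seasons = dates.dt.month.map(season_bin)
print(seasons.tolist())
```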

In [9]:
#bin Time of Day (Cameron)
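
This cell is still a placeholder. One way to bin the time of day is `pd.cut` on the hour, sketched below. The four bins and their boundaries (0, 6, 12, 18, 24) are assumptions chosen for illustration, and two of the timestamps are hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series([
        "05/03/2016 11:40:00 PM",   # from the sample rows above
        "05/03/2016 04:00:00 PM",   # from the sample rows above
        "05/03/2016 02:15:00 AM",   # hypothetical
        "05/03/2016 09:30:00 AM",   # hypothetical
    ]),
    format="%m/%d/%Y %I:%M:%S %p",
)

# right=False makes each bin include its left edge: [0, 6), [6, 12), ...
time_of_day = pd.cut(
    dates.dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["Night", "Morning", "Afternoon", "Evening"],
)

print(time_of_day.tolist())
```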

In [10]:
#bin Weekday/Weekend (Cameron)
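
This cell is still a placeholder. A short sketch of a weekday/weekend flag follows, assuming Saturday and Sunday count as the weekend. The second timestamp is hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series([
        "05/03/2016 11:40:00 PM",   # from the sample rows above
        "05/07/2016 01:00:00 PM",   # hypothetical
    ]),
    format="%m/%d/%Y %I:%M:%S %p",
)

# dayofweek runs from 0 (Monday) to 6 (Sunday)
day_type = (dates.dt.dayofweek >= 5).map({True: "Weekend", False: "Weekday"})

print(day_type.tolist())
```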

In [11]:
#bin Location Description (Private vs. Public) (Ryan)
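
This cell is still a placeholder. The sketch below shows one possible shape for a private/public split. The `PRIVATE_LOCATIONS` set is purely illustrative: it contains only a few values seen in the sample rows, and deciding which of the full set of location descriptions count as private is a judgment call this notebook has not yet made.

```python
import numpy as np
import pandas as pd

# Illustrative only: a real list would need to cover every Location Description value
PRIVATE_LOCATIONS = {"APARTMENT", "RESIDENCE", "RESIDENCE PORCH/HALLWAY"}

locations = pd.Series(["APARTMENT", "STREET", "SIDEWALK", "RESIDENCE"])

# Anything not listed as private falls into the public bin
location_type = pd.Series(
    np.where(locations.isin(PRIVATE_LOCATIONS), "PRIVATE", "PUBLIC"),
    index=locations.index,
)

print(location_type.tolist())
```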

In [12]:
#bin Primary Type/Description by Violent/Non-Violent (Eric)
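
This cell is still a placeholder. As a rough sketch, a violent/non-violent flag could be derived from <i>Primary Type</i> alone. The `VIOLENT_TYPES` list is an assumption made for illustration, not an agreed classification, and it ignores the <i>Description</i> field.

```python
import pandas as pd

# Assumed classification for illustration; the real grouping is still to be decided
VIOLENT_TYPES = ["BATTERY", "ASSAULT", "ROBBERY", "CRIM SEXUAL ASSAULT", "KIDNAPPING"]

primary = pd.Series(["BATTERY", "THEFT", "PUBLIC PEACE VIOLATION", "ROBBERY"])

# Boolean flag: True where the primary type is in the assumed violent list
is_violent = primary.isin(VIOLENT_TYPES)

print(is_violent.tolist())
```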

In [21]:
#bin Primary Type (new column)
def pt_bin(value):
    if value in ["CRIMINAL DAMAGE","ARSON"]:
        return "CRIMINAL DAMAGE/ARSON"
    if value in ["THEFT","BURGLARY","ROBBERY","MOTOR VEHICLE THEFT"]:
        return "THEFT/BURGLARY/ROBBERY"
    if value in ["BATTERY","ASSAULT"]:
        return "ASSAULT/BATTERY"
    if value in ["OTHER OFFENSE","LIQUOR LAW VIOLATION","KIDNAPPING","GAMBLING","NON-CRIMINAL","HUMAN TRAFFICKING","NON - CRIMINAL","NON-CRIMINAL (SUBJECT SPECIFIED)"]:
        return "OTHER/NON-CRIMINAL"
    if value in ["NARCOTICS","OTHER NARCOTIC VIOLATION"]:
        return "NARCOTICS"
    if value in ["WEAPONS VIOLATION","CONCEALED CARRY LICENSE VIOLATION"]:
        return "WEAPONS VIOLATION/CCL VIOLATION"
    if value in ["PUBLIC PEACE VIOLATION","STALKING","INTIMIDATION","OBSCENITY","PUBLIC INDECENCY"]:
        return "PUBLIC PEACE/INDECENCY VIOLATION"
    if value in ["CRIM SEXUAL ASSAULT","PROSTITUTION","SEX OFFENSE"]:
        return "SEX OFFENSE/ASSAULT"
    else:
        return value

df_crime["Primary Type (Binned)"] = df_crime["Primary Type"].map(pt_bin)

#review a sample of the data to confirm the new column
df_crime.head(10)

,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,Latitude,Longitude,Location,Primary Type (Binned)
0,10508693,05/03/2016 11:40:00 PM,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,True,True,1022,10,24,29,41.864073,-87.706819,"(41.864073157, -87.706818608)",ASSAULT/BATTERY
1,10508695,05/03/2016 09:40:00 PM,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,313,3,20,42,41.782922,-87.604363,"(41.782921527, -87.60436317)",ASSAULT/BATTERY
2,10508697,05/03/2016 11:31:00 PM,PUBLIC PEACE VIOLATION,RECKLESS CONDUCT,STREET,False,False,1524,15,37,25,41.894908,-87.758372,"(41.894908283, -87.758371958)",PUBLIC PEACE/INDECENCY VIOLATION
3,10508698,05/03/2016 10:10:00 PM,BATTERY,SIMPLE,SIDEWALK,False,False,1532,15,28,25,41.885687,-87.749516,"(41.885686845, -87.749515983)",ASSAULT/BATTERY
4,10508699,05/03/2016 10:00:00 PM,THEFT,$500 AND UNDER,RESIDENCE,False,True,1523,15,28,25,41.886297,-87.761751,"(41.886297242, -87.761750709)",THEFT/BURGLARY/ROBBERY
5,10508702,05/03/2016 10:35:00 PM,BATTERY,AGGRAVATED: HANDGUN,STREET,False,False,631,6,8,44,41.745354,-87.603799,"(41.745354023, -87.603798903)",ASSAULT/BATTERY
6,10508703,05/03/2016 10:30:00 PM,BATTERY,SIMPLE,CHA HALLWAY/STAIRWELL/ELEVATOR,False,False,133,1,3,35,41.844024,-87.626923,"(41.844023772, -87.626923253)",ASSAULT/BATTERY
7,10508704,05/03/2016 09:30:00 PM,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,False,215,2,3,38,41.811134,-87.620741,"(41.811133958, -87.62074077)",ASSAULT/BATTERY
8,10508709,05/03/2016 04:00:00 PM,BATTERY,SIMPLE,SIDEWALK,False,False,2432,24,40,1,41.998131,-87.665814,"(41.99813061, -87.665814038)",ASSAULT/BATTERY
9,10508982,05/03/2016 10:30:00 PM,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,735,7,17,67,41.768097,-87.663879,"(41.768096835, -87.663878589)",ASSAULT/BATTERY
