# Lab 3 - Data Storytelling

In this lab you will walk through the thought process that data scientists follow in practice: extracting, cleaning, exploring, and finally visualizing data to make inferences about real-world problems. Because this is a guided lab and everything is provided to you, it will be graded strictly.

**Note:** Please make sure you use the environment you created for Homework 2 (the one that contains Seaborn 0.9.0); otherwise it will be extremely difficult to reproduce the plots in the blog post. Open the Homework 2 notebook if you are unsure how to switch to that environment.

Once again, name your notebook `Lab3_rollno.ipynb`.

Here are the links to follow for this lab (you HAVE to follow both):

[Data Storytelling Part One](https://towardsdatascience.com/homicide-in-chicago-data-storytelling-part-one-e6fbd77afc07)

[Data Storytelling Part Two](https://towardsdatascience.com/homicide-in-chicago-data-stroytelling-part-two-e8748602daca)


## Part 1: Data Cleaning
All the plots in the blog post should be reproduced in this part. You are free to break up the cells as you wish, so that you also get practice in that regard.

In [1]:
# import modules
import numpy as np
import pandas as pd
from pandas import *
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import datetime
from scipy import stats
sns.set_style("darkgrid")
import matplotlib.image as mpimg
from IPython.display import IFrame
import folium
from folium import plugins
from folium.plugins import MarkerCluster, FastMarkerCluster, HeatMapWithTime

In [2]:
# read the CSV in chunks of 100,000 rows (read_csv then returns a TextFileReader iterator)
tp = pd.read_csv('Crimes_-_2001_to_present.csv', iterator=True, chunksize=100000)
crime_data = pd.concat(tp, ignore_index=True)

# print data's shape
crime_data.shape

(6749001, 22)
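
The chunked read above can be tried out on a small scale. The sketch below assumes a tiny in-memory CSV (the `csv_text` contents and the chunk size of 2 are made up for illustration) and shows that `pd.read_csv` with `chunksize` yields DataFrames that `pd.concat` stitches back together. Filtering each chunk before concatenating is an optional variation that is not part of the blog post; it is one way to keep memory use down.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for Crimes_-_2001_to_present.csv
csv_text = """ID,Primary Type,Year
1,HOMICIDE,2015
2,THEFT,2015
3,HOMICIDE,2016
4,BATTERY,2016
5,HOMICIDE,2017
"""

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Keep only the rows of interest from each chunk before concatenating
homicide_chunks = [chunk[chunk["Primary Type"] == "HOMICIDE"] for chunk in chunks]
homicides = pd.concat(homicide_chunks, ignore_index=True)

print(homicides)
```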

In [3]:
crime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6749001 entries, 0 to 6749000
Data columns (total 22 columns):
ID                      int64
Case Number             object
Date                    object
Block                   object
IUCR                    object
Primary Type            object
Description             object
Location Description    object
Arrest                  bool
Domestic                bool
Beat                    int64
District                float64
Ward                    float64
Community Area          float64
FBI Code                object
X Coordinate            float64
Y Coordinate            float64
Year                    int64
Updated On              object
Latitude                float64
Longitude               float64
Location                object
dtypes: bool(2), float64(7), int64(3), object(10)
memory usage: 1.0+ GB


In [4]:
crime_data.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000092 | HY189866 | 03/18/2015 07:44:00 PM | 047XX W OHIO ST | 041A | BATTERY | AGGRAVATED: HANDGUN | STREET | False | False | ... | 28.0 | 25.0 | 04B | 1144606.0 | 1903566.0 | 2015 | 02/10/2018 03:50:01 PM | 41.891399 | -87.744385 | (41.891398861, -87.744384567) |
| 1 | 10000094 | HY190059 | 03/18/2015 11:00:00 PM | 066XX S MARSHFIELD AVE | 4625 | OTHER OFFENSE | PAROLE VIOLATION | STREET | True | False | ... | 15.0 | 67.0 | 26 | 1166468.0 | 1860715.0 | 2015 | 02/10/2018 03:50:01 PM | 41.773372 | -87.665319 | (41.773371528, -87.665319468) |
| 2 | 10000095 | HY190052 | 03/18/2015 10:45:00 PM | 044XX S LAKE PARK AVE | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | False | True | ... | 4.0 | 39.0 | 08B | 1185075.0 | 1875622.0 | 2015 | 02/10/2018 03:50:01 PM | 41.813861 | -87.596643 | (41.81386068, -87.596642837) |
| 3 | 10000096 | HY190054 | 03/18/2015 10:30:00 PM | 051XX S MICHIGAN AVE | 0460 | BATTERY | SIMPLE | APARTMENT | False | False | ... | 3.0 | 40.0 | 08B | 1178033.0 | 1870804.0 | 2015 | 02/10/2018 03:50:01 PM | 41.800802 | -87.622619 | (41.800802415, -87.622619343) |
| 4 | 10000097 | HY189976 | 03/18/2015 09:00:00 PM | 047XX W ADAMS ST | 031A | ROBBERY | ARMED: HANDGUN | SIDEWALK | False | False | ... | 28.0 | 25.0 | 03 | 1144920.0 | 1898709.0 | 2015 | 02/10/2018 03:50:01 PM | 41.878065 | -87.743354 | (41.878064761, -87.743354013) |


In [5]:
# preview all crime variables in the "Primary Type" column

crimes = crime_data['Primary Type'].sort_values().unique()
crimes, len(crimes)

(array(['ARSON', 'ASSAULT', 'BATTERY', 'BURGLARY',
        'CONCEALED CARRY LICENSE VIOLATION', 'CRIM SEXUAL ASSAULT',
        'CRIMINAL DAMAGE', 'CRIMINAL TRESPASS', 'DECEPTIVE PRACTICE',
        'DOMESTIC VIOLENCE', 'GAMBLING', 'HOMICIDE', 'HUMAN TRAFFICKING',
        'INTERFERENCE WITH PUBLIC OFFICER', 'INTIMIDATION', 'KIDNAPPING',
        'LIQUOR LAW VIOLATION', 'MOTOR VEHICLE THEFT', 'NARCOTICS',
        'NON - CRIMINAL', 'NON-CRIMINAL',
        'NON-CRIMINAL (SUBJECT SPECIFIED)', 'OBSCENITY',
        'OFFENSE INVOLVING CHILDREN', 'OTHER NARCOTIC VIOLATION',
        'OTHER OFFENSE', 'PROSTITUTION', 'PUBLIC INDECENCY',
        'PUBLIC PEACE VIOLATION', 'RITUALISM', 'ROBBERY', 'SEX OFFENSE',
        'STALKING', 'THEFT', 'WEAPONS VIOLATION'], dtype=object), 35)
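
The array above contains three spellings of the non-criminal category: `'NON - CRIMINAL'`, `'NON-CRIMINAL'` and `'NON-CRIMINAL (SUBJECT SPECIFIED)'`. The sketch below shows one possible way to merge such labels before counting, using a made-up toy Series. This step is not part of the blog post, and whether the three labels should be treated as one category is a judgment call.

```python
import pandas as pd

# Toy stand-in for crime_data['Primary Type']
primary_type = pd.Series([
    "NON - CRIMINAL", "NON-CRIMINAL", "NON-CRIMINAL (SUBJECT SPECIFIED)",
    "HOMICIDE", "HOMICIDE",
])

# Strip spaces, then flag every label that starts with "NON-CRIMINAL"
is_non_criminal = (primary_type.str.replace(" ", "", regex=False)
                               .str.startswith("NON-CRIMINAL"))

# Replace the flagged labels with a single spelling, keep the rest unchanged
cleaned = primary_type.where(~is_non_criminal, "NON-CRIMINAL")

print(cleaned.value_counts())
```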

In [None]:
# create a scatter plot of the X and Y coordinates of all crimes in the dataset, colored by district

# drop rows with rogue (0, 0) coordinates
crime_data = crime_data.loc[(crime_data['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=crime_data[:],
           fit_reg=False, 
           hue="District",
           palette='Dark2',
           height=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10})
ax = plt.gca()
ax.set_title("All Crime Distribution per District")

In [None]:
# create and preview dataframe containing crimes associated with gang violence

col2 = ['Date','Primary Type','Arrest','Domestic','District','X Coordinate','Y Coordinate']
multiple_crimes = crime_data[col2]
multiple_crimes = multiple_crimes[multiple_crimes['Primary Type']\
                  .isin(['HOMICIDE','CONCEALED CARRY LICENSE VIOLATION','NARCOTICS','WEAPONS VIOLATION'])]

# clean some rogue (0,0) coordinates
multiple_crimes = multiple_crimes[multiple_crimes['X Coordinate']!=0]

multiple_crimes.head()

In [None]:
# geographical distribution scatter plots by crime
g = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=multiple_crimes.dropna(), 
               col_wrap=2, height=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})

In [None]:
# create a dataframe with Homicide as the only crime

df_homicideN = crime_data[crime_data['Primary Type']=='HOMICIDE']
df_homicideN.head()

In [None]:
# print some attributes of our new homicide dataframe

df_homicideN.info()

In [None]:
# find null values in our dataframe

df_homicideN.isnull().sum()

In [None]:
# drop null values

df_homicide = df_homicideN.dropna()
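
`dropna()` with no arguments removes a row if any of its columns is null, including columns that are discarded later. The sketch below contrasts it with `dropna(subset=...)`, which only looks at the listed columns. It uses a made-up three-row frame. The lab follows the blog post and uses the plain `dropna()`; the `subset` variant is shown only as an alternative to be aware of.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Primary Type": ["HOMICIDE", "HOMICIDE", "HOMICIDE"],
    "Ward": [28.0, np.nan, 4.0],         # hypothetical column not needed later
    "Latitude": [41.89, 41.77, np.nan],  # needed for the maps
})

dropped_any = toy.dropna()                        # drops rows 1 and 2
dropped_subset = toy.dropna(subset=["Latitude"])  # drops only row 2

print(len(toy), len(dropped_any), len(dropped_subset))
```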

In [None]:
# create a list of columns to keep and update the dataframe with new columns

keep_cols = ['Year','Date','Primary Type','Arrest','Domestic','District','Location Description',
             'FBI Code','X Coordinate','Y Coordinate','Latitude','Longitude','Location']

df_homicide = df_homicide[keep_cols].reset_index()
df_homicide.head()

In [None]:
# change string Date to datetime.datetime format

df_homicide['Date'] = df_homicide['Date'].apply(lambda x: datetime.datetime.strptime(x,"%m/%d/%Y %I:%M:%S %p"))
df_homicide.head()
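
The cell above parses each date string with `datetime.datetime.strptime` inside `apply`. As a sketch of an alternative, `pd.to_datetime` accepts the same format string and parses the whole column in one call. The two sample strings are taken from the preview of `crime_data` earlier in this notebook.

```python
import datetime
import pandas as pd

dates = pd.Series(["03/18/2015 07:44:00 PM", "03/18/2015 11:00:00 PM"])
fmt = "%m/%d/%Y %I:%M:%S %p"

# Row-by-row parsing, as in the cell above
parsed_apply = dates.apply(lambda x: datetime.datetime.strptime(x, fmt))

# Vectorized parsing with the same explicit format
parsed_vectorized = pd.to_datetime(dates, format=fmt)

print(parsed_vectorized)
```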

In [None]:
# create new columns from the Date column -- Year, Month, Day, Weekday, HourOfDay

df_homicide['Year'] = df_homicide['Date'].dt.year
df_homicide['Month'] = df_homicide['Date'].dt.month
df_homicide['Day'] = df_homicide['Date'].dt.day
df_homicide['Weekday'] = df_homicide['Date'].dt.dayofweek
df_homicide['HourOfDay'] = df_homicide['Date'].dt.hour

df_homicide = df_homicide.sort_values('Date')
# print columns list and info

df_homicide.info()
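
When reading the `Weekday` column, keep in mind that pandas numbers weekdays from Monday (`0`) to Sunday (`6`). This matters whenever tick labels are attached to that column. A small check, using three dates from March 2015 chosen for illustration:

```python
import pandas as pd

# 03/18/2015 (the date in the first rows of the preview above) was a Wednesday
dates = pd.Series(pd.to_datetime(["2015-03-16", "2015-03-18", "2015-03-22"]))

print(dates.dt.dayofweek.tolist())   # integer codes, Monday is 0
print(dates.dt.day_name().tolist())  # the matching day names
```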

In [None]:
# save cleaned data to pickle file 
df_homicide.to_pickle('df_homicide.pkl') 
print('pickle size:', os.stat('df_homicide.pkl').st_size)

In [1]:
# load pickled data (after a kernel restart, pandas and the other modules must be imported again)
import pandas as pd
df_homicide = pd.read_pickle('df_homicide.pkl')

## Part 2: Data Exploration and Visualization
Make sure ALL plots are reproduced just as in the blog post. You are of course welcome to add details of your own and carry out some additional exploration.

In [None]:
# plot all homicides in dataset by location per District

df_homicide = df_homicide.loc[(df_homicide['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=df_homicide[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           height=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Homicides (2001-2018) per District")

In [None]:
# plot bar chart of homicide rates for all years

plt.figure(figsize=(12,6))
sns.barplot(x='Year',
            y='HOMICIDE',
            data=df_homicide.groupby(['Year'])['Primary Type'].value_counts().\
                 unstack().reset_index(),
            color='steelblue').\
            set_title("CHICAGO MURDER RATES: 2001 - 2018")
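
The `groupby(...).value_counts().unstack().reset_index()` chain used for the `data` argument above is what turns raw incident rows into a table with a `HOMICIDE` column that `sns.barplot` can plot. The sketch below runs the same chain on a made-up five-row frame so the intermediate shape is easier to see.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2016],
    "Primary Type": ["HOMICIDE"] * 5,
})

# Count rows per (Year, Primary Type) pair
counts = toy.groupby(["Year"])["Primary Type"].value_counts()

# Move the crime type from the row index into columns, then flatten the index
wide = counts.unstack().reset_index()

print(wide)
```

The result has one row per year and one column per crime type, which is why the plots can refer to `y='HOMICIDE'`.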

In [None]:
# plot homicides sorted by month

fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August'\
             ,'September','October','November','December']    
fig = sns.barplot(x='Month',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','Month'])['Primary Type'].\
                  value_counts().unstack().reset_index(),
                  color='#808080')
ax.set_xticklabels(month_nms)
plt.title("CHICAGO MURDER RATES by MONTH -- All Years")

# -------------------------------------------

# plot average monthly temps in Chicago
# source of data:  ncdc.noaa.gov

mntemp = [26.5,31,39.5,49.5,59.5,70,76,75.5,68,56,44,32]
df_temps = pd.DataFrame(list(zip(month_nms,mntemp)),
                       columns=['Month','AVERAGE TEMPERATURE'])
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x='Month', y='AVERAGE TEMPERATURE', data=df_temps,color='steelblue')

In [None]:
# plot homicide rates vs. day of the week

fig, ax = plt.subplots(figsize=(14,6))
# dt.dayofweek numbers Monday as 0 and Sunday as 6, so the labels start at Monday
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fig = sns.barplot(x='Weekday',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','Weekday'])['Primary Type'].\
                       value_counts().unstack().reset_index(),
                  color='steelblue')
ax.set_xticklabels(week_days)
plt.title('HOMICIDE BY DAY OF THE WEEK -- All Years')

In [None]:
# use seaborn barplot to plot homicides vs. hour of the day 

fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='HourOfDay',
                  y='HOMICIDE',
                  data=df_homicide.groupby(['Year','HourOfDay'])['Primary Type'].\
                       value_counts().unstack().reset_index(),
                  color='steelblue',
                  alpha=.75)
plt.title('HOMICIDE BY HOUR OF THE DAY -- All Years')

In [None]:
# plot domestic variable vs. homicide variable

fig, ax = plt.subplots(figsize=(14,6))
df_arrest = df_homicide[['Year','Domestic']]
ax = sns.countplot(x="Year",
                   hue='Domestic',
                   data=df_arrest,
                   palette="Blues_d")
plt.title('HOMICIDE - DOMESTIC STATS BY YEAR')

In [None]:
# visualize the "scene of the crime" vs. number of occurrences at such scene

crime_scene = df_homicide['Primary Type'].\
            groupby(df_homicide['Location Description']).\
            value_counts().\
            unstack().\
            sort_values('HOMICIDE',ascending=False).\
            reset_index()
    
# Top Homicide Crime Scene Locations
crime_scene.head(10)

In [None]:
# create a count plot for all crime scene locations

g = sns.catplot(x='Location Description',
                   y='HOMICIDE',
                   data=crime_scene,
                   kind='bar',
                   height=10,
                   color='steelblue', 
                   saturation=10)

g.fig.set_size_inches(15,5)
g.set_xticklabels(rotation=90)
plt.title('CRIME SCENE BY LOCATION FREQUENCY')

In [None]:
# create a heatmap showing crime by district by year

corr = df_homicide.groupby(['District','Year']).count().Date.unstack()
fig, ax = plt.subplots(figsize=(15,13))
sns.set(font_scale=1.0)
sns.heatmap(corr.dropna(axis=1),
            annot=True,
           linewidths=0.2,
           cmap='Blues',
            robust=True,
           cbar_kws={'label': 'HOMICIDES'})
plt.title('HOMICIDE vs DISTRICT vs YEAR')
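
The matrix passed to `sns.heatmap` above comes from counting rows per `(District, Year)` pair and unstacking the years into columns. The sketch below repeats that on a made-up toy frame. It also shows that `dropna(axis=1)` removes every year in which at least one district has no incidents, which is worth keeping in mind when reading the heatmap.

```python
import pandas as pd

toy = pd.DataFrame({
    "District": [11.0, 11.0, 7.0, 7.0, 7.0],
    "Year":     [2016, 2017, 2016, 2016, 2018],
    "Date":     pd.to_datetime(["2016-01-05", "2017-03-01", "2016-02-10",
                                "2016-07-04", "2018-01-15"]),
})

# Rows become districts, columns become years, cells hold incident counts
matrix = toy.groupby(["District", "Year"]).count().Date.unstack()
print(matrix)

# A year with no incidents in some district holds NaN there and is dropped column-wise
print(matrix.dropna(axis=1))
```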

In [None]:
with sns.plotting_context('notebook',font_scale=1.5):
    sorted_homicides = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count()\
                    .Arrest.reset_index().sort_values('Arrest',ascending=False)
    fig, ax = plt.subplots(figsize=(14,6))
    sns.barplot(x='District',
                y='Arrest',
                data=sorted_homicides,
                color='steelblue',
                order = list(sorted_homicides['District']),
                label='big')
    plt.title('HOMICIDES PER DISTRICT (2016-2017) - Highest to Lowest')

In [None]:
# create seaborn countplots  for whole dataset

fig, ax = plt.subplots(figsize=(14,6))
df_arrest = df_homicide[['Year','Arrest']]
ax = sns.countplot(x="Year",
                   hue='Arrest',
                   data=df_arrest,
                   palette="PuBuGn_d")
plt.title('HOMICIDE - ARRESTS STATS BY YEAR')

In [None]:
# create seaborn countplots for 2016 onward -- high crime rate spike years

fig, ax = plt.subplots(figsize=(14,6))
ax = sns.countplot(x="Month",
                   hue='Arrest',
                   data=df_homicide[df_homicide['Year']>=2016][['Month','Arrest']],
                   palette="PuBuGn_d")
month_nms = ['January','February','March','April','May','June','July',\
             'August','September','October','November','December']    
ax.set_xticklabels(month_nms)
plt.title('HOMICIDE - ARRESTS STATS BY MONTH -- (2016-2018)')

In [None]:
# create seaborn lmplot to compare arrest rates for different districts

dfx = df_homicide[df_homicide['District'].\
                isin(list(sorted_homicides.head(10)['District']))].\
                groupby(['District','Year','Month','Arrest'])['Primary Type'].\
                value_counts().unstack().reset_index()

with sns.plotting_context('notebook',font_scale=1.25):
    sns.set_context("notebook", font_scale=1.15)

    g = sns.lmplot(x='Year', y='HOMICIDE',
                   col='District',
                   col_wrap=5,
                   height=5,
                   aspect=0.5,
                   sharex=False,
                   data=dfx[:],
                   fit_reg=True,
                   hue="Arrest", 
                   palette=sns.color_palette("seismic_r", 2),
                   scatter_kws={"marker": "o",
                            "s": 7},
                   line_kws={"lw":0.7})

In [None]:
# plot choropleth maps 2001 - 2017
def toString(x):
    return str(int(x))

df_homicide_allyears = df_homicide.groupby(['District']).count().Arrest.reset_index()
df_homicide_allyears['District'] = df_homicide_allyears['District'].apply(toString)

# ______________________________________________________#

chicago = [41.85, -87.68]
m = folium.Map(chicago,
               zoom_start=10)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_homicide_allyears,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Choropleth of Homicide per Police District : 2001-2017',
    highlight=True
    )
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map1.html") 
IFrame('map1.html', width=990, height=700)
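
The `toString` helper in the cell above is needed because `District` is stored as `float64` (see the `crime_data.info()` output), while the choropleth joins the data to the GeoJSON through `feature.properties.dist_num`. The code above converts each district to a string such as `"11"`, which suggests the GeoJSON keys are strings without a decimal point; that is an assumption here, since the file itself is not shown. The sketch below does the same conversion without a helper function.

```python
import pandas as pd

# Toy district numbers stored as floats, as in the crime data
districts = pd.Series([1.0, 11.0, 25.0])

# Same effect as applying toString element-wise: 11.0 -> "11"
as_keys = districts.astype(int).astype(str)

print(as_keys.tolist())
```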


In [None]:
# plot 2016-2018 choropleth map

def toString(x):
    return str(int(x))

df_homicide_after_2015 = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count().Arrest.reset_index()
df_homicide_after_2015['District'] = df_homicide_after_2015['District'].apply(toString)

# ______________________________________________________#

chicago = [41.85, -87.68]
m = folium.Map(chicago,
               zoom_start=10)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_homicide_after_2015,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Homicide per Police District : 2016-2017',
    highlight=True
    )
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map2.html") 
IFrame('map2.html', width=990, height=700)

In [None]:
# plot heatmap all districts -- (2016-2018)

after_2015_geo = []
for index, row in df_homicide[df_homicide['Year']>=2016][['Latitude','Longitude','District']].dropna().iterrows():
    after_2015_geo.append([row["Latitude"], row["Longitude"],row['District']])
# ___________________________________________________________________
chicago = [41.85, -87.68]
m = folium.Map(chicago, zoom_start=9.5,control_scale = False)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_homicide_after_2015,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='HeatMap Homicides : 2016-2017',
    highlight=True
    )
m.add_child(plugins.HeatMap(after_2015_geo,
                            name='all_homicides_2016_to_2017',
                            radius=5,
                            max_zoom=1,
                            blur=10, 
                            max_val=3.0))
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map3.html") 
IFrame('map3.html', width=990, height=700)
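
The loop that fills `after_2015_geo` in the cell above goes through the rows one at a time with `iterrows()`. As a sketch of an equivalent and usually faster approach, the selected columns can be converted to a list of lists directly. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Latitude": [41.89, 41.77, None],
    "Longitude": [-87.74, -87.67, -87.60],
    "District": [11.0, 7.0, 2.0],
})

# Loop version, as in the cell above
looped = []
for index, row in toy[["Latitude", "Longitude", "District"]].dropna().iterrows():
    looped.append([row["Latitude"], row["Longitude"], row["District"]])

# Equivalent conversion without an explicit loop
vectorized = toy[["Latitude", "Longitude", "District"]].dropna().values.tolist()

print(vectorized)
```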

In [None]:
# geo locations of homicides crime scenes -- 2016-2017

df_homicide_after_2015 = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count().Arrest.reset_index()
df_homicide_after_2015['District'] = df_homicide_after_2015['District'].apply(toString)

after_2015 = df_homicide[df_homicide['Year']>=2016].dropna()

# _____________________________________________

lats = list(after_2015.Latitude)
longs = list(after_2015.Longitude)
locations = [lats,longs]

m = folium.Map(
    location=[np.mean(lats), np.mean(longs)],
    zoom_start=10.3
)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_homicide_after_2015,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Homicides : 2016-2017',
    highlight=False
    )

# folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map5.html") 
IFrame('map5.html', width=990, height=700)

In [None]:
# geo locations of homicides -- January, February 2018

df_homicide_2018 = df_homicide[df_homicide['Year']==2018].groupby(['District']).count().Arrest.reset_index()
df_homicide_2018['District'] = df_homicide_2018['District'].apply(toString)

only_2018 = df_homicide[df_homicide['Year']==2018].dropna()

# _____________________________________________

lats = list(only_2018.Latitude)
longs = list(only_2018.Longitude)
locations = [lats,longs]

m = folium.Map(
    location=[np.mean(lats), np.mean(longs)],
    zoom_start=10.3
)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_homicide_2018,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Homicides : January, February 2018',
    highlight=False
    )

# folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map6.html") 
IFrame('map6.html', width=990, height=700)

## Part 3: Individual Inferences
Make *any* visualization that supports an inference about the dataset that is not covered in the lab itself. Remember that since this is a guided lab, this part might be the only thing that makes a difference in your marks! You can even go beyond homicide rates and pick any other crime that is not covered in the lab.

In [None]:
# Print the number of kidnapping instances reported
print(len(crime_data[crime_data['Primary Type'] == 'KIDNAPPING']))

In [None]:
# create a dataframe with Kidnapping as the only crime

df_kidnappingN = crime_data[crime_data['Primary Type']=='KIDNAPPING']
df_kidnappingN.head()

In [None]:
# print some attributes of our new kidnapping dataframe

df_kidnappingN.info()

In [None]:
# find null values in our dataframe

df_kidnappingN.isnull().sum()

In [None]:
# drop null values

df_kidnapping = df_kidnappingN.dropna()
# df_kidnapping = df_kidnappingN
print('Size of kidnapping dataframe with nulls: ', len(df_kidnappingN))
print('Size of kidnapping dataframe without nulls: ', len(df_kidnapping))

In [None]:
# create a list of columns to keep and update the dataframe with new columns

keep_cols = ['Year','Date','Primary Type','Arrest','Domestic','District','Location Description',
             'FBI Code','X Coordinate','Y Coordinate','Latitude','Longitude','Location']

df_kidnapping = df_kidnapping[keep_cols].reset_index()
df_kidnappingN = df_kidnappingN[keep_cols].reset_index()
# df_kidnapping.head()

In [None]:
# change string Date to datetime.datetime format

df_kidnapping['Date'] = df_kidnapping['Date'].apply(lambda x: datetime.datetime.strptime(x,"%m/%d/%Y %I:%M:%S %p"))
df_kidnappingN['Date'] = df_kidnappingN['Date'].apply(lambda x: datetime.datetime.strptime(x,"%m/%d/%Y %I:%M:%S %p"))
df_kidnapping.head()

In [None]:
# create new columns from the Date column -- Year, Month, Day, Weekday, HourOfDay

df_kidnapping['Year'] = df_kidnapping['Date'].dt.year
df_kidnapping['Month'] = df_kidnapping['Date'].dt.month
df_kidnapping['Day'] = df_kidnapping['Date'].dt.day
df_kidnapping['Weekday'] = df_kidnapping['Date'].dt.dayofweek
df_kidnapping['HourOfDay'] = df_kidnapping['Date'].dt.hour

df_kidnapping = df_kidnapping.sort_values('Date')
# print columns list and info

df_kidnapping.shape
# print(df_kidnapping)

df_kidnappingN['Year'] = df_kidnappingN['Date'].dt.year
df_kidnappingN['Month'] = df_kidnappingN['Date'].dt.month
df_kidnappingN['Day'] = df_kidnappingN['Date'].dt.day
df_kidnappingN['Weekday'] = df_kidnappingN['Date'].dt.dayofweek
df_kidnappingN['HourOfDay'] = df_kidnappingN['Date'].dt.hour

df_kidnappingN = df_kidnappingN.sort_values('Date')

In [None]:
# save cleaned data to pickle file 
df_kidnapping.to_pickle('df_kidnapping.pkl') 
print('pickle size:', os.stat('df_kidnapping.pkl').st_size)

df_kidnappingN.to_pickle('df_kidnappingN.pkl') 
print('pickle size:', os.stat('df_kidnappingN.pkl').st_size)

# load pickled data 
df_kidnapping = pd.read_pickle('df_kidnapping.pkl')
df_kidnappingN = pd.read_pickle('df_kidnappingN.pkl')

In [None]:
# plot all kidnapping (with nulls) in dataset by location per District

df_kidnappingN = df_kidnappingN.loc[(df_kidnappingN['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=df_kidnappingN[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           height=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Kidnapping (with nulls) (2001-2018) per District")

In [None]:
# plot all kidnapping (non-null) in dataset by location per District

df_kidnapping = df_kidnapping.loc[(df_kidnapping['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=df_kidnapping[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           height=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Kidnapping (2001-2018) per District")

Unlike homicides, kidnappings do not appear to be concentrated in specific regions; they seem to be evenly distributed across the city.

In [None]:
# plot bar chart of kidnapping (with nulls) rates for all years

plt.figure(figsize=(12,6))
sns.barplot(x='Year',
            y='KIDNAPPING',
            data=df_kidnappingN.groupby(['Year'])['Primary Type'].value_counts().\
                 unstack().reset_index(),
            color='steelblue').\
            set_title("CHICAGO KIDNAPPING (with nulls) RATES: 2001 - 2018")

In [None]:
# plot bar chart of kidnapping rates for all years

plt.figure(figsize=(12,6))
sns.barplot(x='Year',
            y='KIDNAPPING',
            data=df_kidnapping.groupby(['Year'])['Primary Type'].value_counts().\
                 unstack().reset_index(),
            color='steelblue').\
            set_title("CHICAGO KIDNAPPING RATES: 2001 - 2018")
# df_kidnapping.loc[df_kidnapping['Year'] == 2001]

Dropping nulls removes a large number of rows, and 2001 appears to be the year most affected. Let's see how many rows were dropped and how many 2001 rows remain.

In [None]:
print(len(df_kidnappingN) - len(df_kidnapping))
print(len(df_kidnapping.loc[df_kidnapping['Year'] == 2001]))

Many of these nulls are actually missing coordinates, so those rows would not add much to our data. 2001 has only one non-null row, so I will drop it and focus on 2002 onward.

In [None]:
# Drop Year 2001
df_2002_kidnapping = df_kidnapping[df_kidnapping.Year != 2001]
# confirm that no 2001 rows remain (the mask must come from the same dataframe it filters)
print(len(df_2002_kidnapping.loc[df_2002_kidnapping['Year'] == 2001]))

In [None]:
# plot bar chart of kidnapping rates for all years 2002 - 2018

plt.figure(figsize=(12,6))
sns.barplot(x='Year',
            y='KIDNAPPING',
            data=df_2002_kidnapping.groupby(['Year'])['Primary Type'].value_counts().\
                 unstack().reset_index(),
            color='steelblue').\
            set_title("CHICAGO KIDNAPPING RATES: 2002 - 2018")

In [None]:
# plot kidnappings sorted by month

fig, ax = plt.subplots(figsize=(14,6))
month_nms = ['January','February','March','April','May','June','July','August'\
             ,'September','October','November','December']    
fig = sns.barplot(x='Month',
                  y='KIDNAPPING',
                  data=df_2002_kidnapping.groupby(['Year','Month'])['Primary Type'].\
                  value_counts().unstack().reset_index(),
                  color='#808080')
ax.set_xticklabels(month_nms)
plt.title("CHICAGO KIDNAPPINGS RATES by MONTH -- 2002 - 2018")

# -------------------------------------------

# plot average monthly temps in Chicago
# source of data:  ncdc.noaa.gov

mntemp = [26.5,31,39.5,49.5,59.5,70,76,75.5,68,56,44,32]
df_temps = pd.DataFrame(list(zip(month_nms,mntemp)),
                       columns=['Month','AVERAGE TEMPERATURE'])
fig, ax = plt.subplots(figsize=(14,6))
sns.barplot(x='Month', y='AVERAGE TEMPERATURE', data=df_temps,color='steelblue')

There appears to be no direct relationship between kidnappings and average temperature.

In [None]:
# plot kidnappings rates vs. day of the week

fig, ax = plt.subplots(figsize=(14,6))
# dt.dayofweek numbers Monday as 0 and Sunday as 6, so the labels start at Monday
week_days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fig = sns.barplot(x='Weekday',
                  y='KIDNAPPING',
                  data=df_2002_kidnapping.groupby(['Year','Weekday'])['Primary Type'].\
                       value_counts().unstack().reset_index(),
                  color='steelblue')
ax.set_xticklabels(week_days)
plt.title('KIDNAPPINGS BY DAY OF THE WEEK -- 2002 - 2018')

So, kidnappings on Fridays (weekday 4 in pandas' Monday-first numbering) are much higher than on the rest of the week. All the other days have a similar number of kidnappings.

In [None]:
# use seaborn barplot to plot kidnappings vs. hour of the day 

fig, ax = plt.subplots(figsize=(14,6))
fig = sns.barplot(x='HourOfDay',
                  y='KIDNAPPING',
                  data=df_2002_kidnapping.groupby(['Year','HourOfDay'])['Primary Type'].\
                       value_counts().unstack().reset_index(),
                  color='steelblue',
                  alpha=.75)
plt.title('KIDNAPPING BY HOUR OF THE DAY -- 2002 - 2018')

Here's something interesting. Unlike homicides, kidnappings are more frequent during the daytime. There are clear spikes at 8am and 3pm, which are typically school start and end times respectively. After 3pm, kidnappings drop gradually until 6am the next morning.

In [None]:
# visualize the "scene of the crime" vs. number of occurrences at such scene

crime_scene_kidnapping = df_2002_kidnapping['Primary Type'].\
            groupby(df_2002_kidnapping['Location Description']).\
            value_counts().\
            unstack().\
            sort_values('KIDNAPPING',ascending=False).\
            reset_index()
    
# Top Kidnapping Crime Scene Locations
crime_scene_kidnapping.head(10)

In [None]:
# create a count plot for all crime scene locations

g = sns.catplot(x='Location Description',
                   y='KIDNAPPING',
                   data=crime_scene_kidnapping,
                   kind='bar',
                   height=10,
                   color='steelblue', 
                   saturation=10)

g.fig.set_size_inches(15,5)
g.set_xticklabels(rotation=90)
plt.title('CRIME SCENE BY LOCATION FREQUENCY')

Streets have the highest number of kidnappings, as expected, but residences ranking second is surprising.

In [None]:
with sns.plotting_context('notebook',font_scale=1.5):
    sorted_kidnappings = df_2002_kidnapping[df_2002_kidnapping['Year']>=2002].groupby(['District']).count()\
                    .Arrest.reset_index().sort_values('Arrest',ascending=False)
    fig, ax = plt.subplots(figsize=(14,6))
    sns.barplot(x='District',
                y='Arrest',
                data=sorted_kidnappings,
                color='steelblue',
                order = list(sorted_kidnappings['District']),
                label='big')
    plt.title('KIDNAPPINGS PER DISTRICT (2002-2018) - Highest to Lowest')

The top 3 districts are not the same as for homicides, so that's a relief.

In [None]:
# create seaborn countplots  for whole dataset

fig, ax = plt.subplots(figsize=(14,6))
df_arrest_kidnapping = df_2002_kidnapping[['Year','Arrest']]
ax = sns.countplot(x="Year",
                   hue='Arrest',
                   data=df_arrest_kidnapping,
                   palette="PuBuGn_d")
plt.title('KIDNAPPINGS - ARRESTS STATS BY YEAR')

The situation for arrests in kidnapping cases does not look very promising either. Arrests have steadily decreased over time. However, the number of kidnappings has also decreased, so this should be kept in mind when reading the trend.

In [None]:
# create seaborn countplots

fig, ax = plt.subplots(figsize=(14,6))
ax = sns.countplot(x="Month",
                   hue='Arrest',
                   data=df_2002_kidnapping[df_2002_kidnapping['Year']>=2002][['Month','Arrest']],
                   palette="PuBuGn_d")
month_nms = ['January','February','March','April','May','June','July',\
             'August','September','October','November','December']    
ax.set_xticklabels(month_nms)
plt.title('KIDNAPPINGS - ARRESTS STATS BY MONTH')

May and August have slightly higher arrest counts, maybe because schools close and open in those months, respectively.

In [None]:
# create seaborn lmplot to compare arrest rates for different districts

dfx = df_2002_kidnapping[df_2002_kidnapping['District'].\
                isin(list(sorted_kidnappings.head(10)['District']))].\
                groupby(['District','Year','Month','Arrest'])['Primary Type'].\
                value_counts().unstack().reset_index()

with sns.plotting_context('notebook',font_scale=1.25):
    sns.set_context("notebook", font_scale=1.15)

    g = sns.lmplot(x='Year', y='KIDNAPPING',
                   col='District',
                   col_wrap=5,
                   height=5,
                   aspect=0.5,
                   sharex=False,
                   data=dfx[:],
                   fit_reg=True,
                   hue="Arrest", 
                   palette=sns.color_palette("seismic_r", 2),
                   scatter_kws={"marker": "o",
                            "s": 7},
                   line_kws={"lw":0.7})

Arrests have been declining in all the districts, most likely because the number of kidnapping incidents has declined as well.
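
The regression lines in the plot are fitted to counts, so they cannot tell a falling arrest rate apart from a falling number of incidents. One possible check, sketched below on made-up data, is to fit a straight line to the yearly arrest rate of each district and look at the sign of the slope. `yearly_rate_trend` is a hypothetical helper; it assumes the `District`, `Year` and `Arrest` columns used in this notebook.

```python
import numpy as np
import pandas as pd

def yearly_rate_trend(df):
    """Least-squares slope of the yearly arrest rate, one value per district."""
    yearly = df.groupby(["District", "Year"])["Arrest"].mean().reset_index()
    slopes = {}
    for district, grp in yearly.groupby("District"):
        # Center the years so the fit is numerically well behaved
        x = grp["Year"] - grp["Year"].mean()
        slope, _intercept = np.polyfit(x, grp["Arrest"], 1)
        slopes[district] = slope
    return pd.Series(slopes, name="slope")

# Toy data, illustrative only:
# district 1 goes from rate 1.0 to 0.5 to 0.0, district 2 stays at 0.5.
toy = pd.DataFrame({
    "District": [1] * 6 + [2] * 6,
    "Year":     [2002, 2002, 2003, 2003, 2004, 2004] * 2,
    "Arrest":   [True, True, True, False, False, False,
                 True, False, True, False, True, False],
})

slopes = yearly_rate_trend(toy)
print(slopes)
```

A negative slope means the arrest rate itself is falling in that district; a slope near zero means arrests are only falling along with the number of cases.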

In [None]:
# plot choropleth maps 2002 - 2018
def toString(x):
    return str(int(x))

df_kidnapping_allyears = df_kidnapping.groupby(['District']).count().Arrest.reset_index()
df_kidnapping_allyears['District'] = df_kidnapping_allyears['District'].apply(toString)

# ______________________________________________________#

chicago = [41.85, -87.68]  # map center as [latitude, longitude]
m = folium.Map(chicago,
               zoom_start=10)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_kidnapping_allyears,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Choropleth of Kidnappings per Police District : 2002-2018',
    highlight=True
    )
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map7.html") 
IFrame('map7.html', width=990, height=700)

In [None]:
# plot heatmap all districts -- (2002-2018)

df_kidnapping_after_2001 = df_kidnapping[df_kidnapping['Year']>=2002].groupby(['District']).count().Arrest.reset_index()
df_kidnapping_after_2001['District'] = df_kidnapping_after_2001['District'].apply(toString)

# rows of [Latitude, Longitude, District] for every kidnapping from 2002 onwards
after_2001_geo = df_kidnapping[df_kidnapping['Year']>=2002][['Latitude','Longitude','District']].dropna().values.tolist()
# ___________________________________________________________________
chicago = [41.85, -87.68]  # map center as [latitude, longitude]
m = folium.Map(chicago, zoom_start=9.5,control_scale = False)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_kidnapping_after_2001,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='HeatMap Kidnapping : 2002 - 2018',
    highlight=True
    )
m.add_child(plugins.HeatMap(after_2001_geo,
                            name='all_kidnappings_2002_to_2018',
                            radius=5,
                            max_zoom=1,
                            blur=10, 
                            max_val=3.0))
folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map8.html") 
IFrame('map8.html', width=990, height=700)

Kidnappings appear to be distributed over most of Chicago. No kidnappings at the airport, though.

In [None]:
# geo locations of kidnapping crime scenes -- 2002 - 2018

# df_homicide_after_2015 = df_homicide[df_homicide['Year']>=2016].groupby(['District']).count().Arrest.reset_index()
# df_homicide_after_2015['District'] = df_homicide_after_2015['District'].apply(toString)

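# only Latitude/Longitude are used below, so drop just the rows missing either one
after_2001 = df_kidnapping[df_kidnapping['Year']>=2002].dropna(subset=['Latitude', 'Longitude'])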

# _____________________________________________

lats = list(after_2001.Latitude)
longs = list(after_2001.Longitude)
locations = [lats,longs]

m = folium.Map(
    location=[np.mean(lats), np.mean(longs)],
    zoom_start=10.3
)

plugins.Fullscreen(
    position='topright',
    title='Expand me',
    title_cancel='Exit me',
    force_separate_button=True).add_to(m)

FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)

m.choropleth(
    geo_data='chicago_police_districts.geojson',
    name='choropleth',
    data=df_kidnapping_after_2001,
    columns=['District', 'Arrest'],
    key_on='feature.properties.dist_num',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Kidnappings : 2002 - 2018',
    highlight=False
    )

# folium.TileLayer('openstreetmap').add_to(m)
folium.TileLayer('cartodbpositron').add_to(m)
folium.LayerControl().add_to(m)
m.save("map9.html") 
IFrame('map9.html', width=990, height=700)

The marker clusters show 3 major regions in which kidnappings occur.