## EDA of Crimes in Chicago 2005-2016 with Python

Analyzing Crimes in Chicago 

#### A Quick introduction


Crime in Chicago is an interesting topic to explore for many reasons. One of them is the availability of large, high-quality, publicly available crime datasets, such as this one, that data scientists can mine and investigate.

In this notebook, I explore crime in Chicago and try to answer a few questions:

- How has crime in Chicago changed over the years? Was 2016 really the bloodiest year in two decades?
- Are some types of crime more likely to happen in specific locations, at specific times of the day, or on specific days of the week than other types?

## Data Cleaning and Formatting

Importing the required data science libraries and modules for the EDA. In this analysis I'll explore the 2005-2016 crimes data.

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
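# The 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')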

Importing the datasets covering 2005-2017 and joining them into a single DataFrame

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0)
Crime1 = pd.read_csv('../input/Chicago_Crimes_2005_to_2007.csv', on_bad_lines='skip')
Crime2 = pd.read_csv('../input/Chicago_Crimes_2008_to_2011.csv', on_bad_lines='skip')
Crime3 = pd.read_csv('../input/Chicago_Crimes_2012_to_2017.csv', on_bad_lines='skip')

Crimes = pd.concat([Crime1, Crime2, Crime3], ignore_index=False, axis=0)

del Crime1
del Crime2
del Crime3

print('Dataset ready..')

print('Dataset Shape before drop_duplicate : ', Crimes.shape)
Crimes.drop_duplicates(subset=['ID', 'Case Number'], inplace=True)
print('Dataset Shape after drop_duplicate: ', Crimes.shape)
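To make the `subset` argument concrete, here is a minimal sketch on a made-up three-column frame (the values are invented for illustration and are not taken from the dataset). Rows count as duplicates when both 'ID' and 'Case Number' match, even if other columns differ, and the first occurrence is the one that is kept.

```python
import pandas as pd

# Made-up records: rows 0 and 1 share ID and Case Number but differ in 'Updated On'
toy = pd.DataFrame({
    'ID': [10, 10, 11, 12],
    'Case Number': ['HA1', 'HA1', 'HA2', 'HA2'],
    'Updated On': ['2016-01-01', '2016-02-01', '2016-01-01', '2016-01-01'],
})

# Only the columns listed in `subset` are compared; keep='first' is the default
deduped = toy.drop_duplicates(subset=['ID', 'Case Number'])

print(toy.shape, '->', deduped.shape)
print(deduped)
```

Rows 2 and 3 both survive because they share a case number but not an ID.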

Dropping the columns that will not be used in the analysis

In [None]:
Crimes.drop(['Unnamed: 0', 'Case Number', 'IUCR','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location'], inplace=True, axis=1)

From the first few rows, we can see that several columns will help us answer our questions. We will use the 'Date' column to explore temporal patterns, and 'Primary Type' and 'Location Description' to investigate their relationship with time (month of the year, day of the week, hour of the day, etc.).

In [None]:
#Let's have a look at the first 5 rows of the dataframe 'Crimes'

Crimes.head(5)

Since we are dealing with dates, we need to convert the 'Date' column into a datetime format that Python (and pandas) can interpret.

In [None]:
# converting dates to pandas datetime format
Crimes.Date = pd.to_datetime(Crimes.Date, format='%m/%d/%Y %I:%M:%S %p')


# setting the index to be the date will help us a lot later on
Crimes.index = pd.DatetimeIndex(Crimes.Date)
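As a small illustration of why the datetime index helps, the sketch below parses a few made-up timestamps written in the same `%m/%d/%Y %I:%M:%S %p` layout and then selects rows by year with a plain string. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up timestamps in the same 12-hour layout as the 'Date' column
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Date': ['01/05/2006 02:30:00 PM',
             '12/31/2005 11:59:00 PM',
             '07/04/2006 12:00:00 AM'],
})

# An explicit format avoids ambiguity between month-first and day-first dates
toy['Date'] = pd.to_datetime(toy['Date'], format='%m/%d/%Y %I:%M:%S %p')
toy.index = pd.DatetimeIndex(toy['Date'])
toy = toy.sort_index()

# Datetime attributes and partial-string selection are available on the index
print(toy.index.hour.tolist())
print(toy.loc['2006'])
```

Once the index is a `DatetimeIndex`, attributes such as `.hour`, `.month` and `.dayofweek` and methods such as `resample` become available without further parsing.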

Checking the number of records (rows) and the number of features per record (columns)

In [None]:
Crimes.shape

Checking the data type of each column in the DataFrame

In [None]:
Crimes.info()

'Location Description', 'Description' and 'Primary Type' are categorical columns (factors in R). For each of them, we will keep the 20 most frequent categories, label everything else 'OTHER', and then cast the column to a categorical data type.

In [None]:
# Values that fall outside the 20 most frequent in each column
loc_to_change  = list(Crimes['Location Description'].value_counts()[20:].index)
desc_to_change = list(Crimes['Description'].value_counts()[20:].index)
type_to_change = list(Crimes['Primary Type'].value_counts()[20:].index)

# Relabel those less frequent values as 'OTHER'
Crimes.loc[Crimes['Location Description'].isin(loc_to_change), 'Location Description'] = 'OTHER'
Crimes.loc[Crimes['Description'].isin(desc_to_change), 'Description'] = 'OTHER'
Crimes.loc[Crimes['Primary Type'].isin(type_to_change), 'Primary Type'] = 'OTHER'

Converting those columns to the 'Categorical' data type

In [None]:
Crimes['Primary Type']         = pd.Categorical(Crimes['Primary Type'])
Crimes['Location Description'] = pd.Categorical(Crimes['Location Description'])
Crimes['Description']          = pd.Categorical(Crimes['Description'])
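The two steps above (collapse rare values, then cast to categorical) can be sketched on a small made-up Series. The crime labels and counts here are invented, and `top_n` is set to 2 instead of 20 only to keep the toy readable.

```python
import pandas as pd

# Made-up labels with an uneven frequency distribution
s = pd.Series(['THEFT'] * 5000 + ['BATTERY'] * 3000 +
              ['ARSON'] * 1000 + ['STALKING'] * 500)

top_n = 2  # the notebook keeps 20; 2 is enough for this toy
rare = s.value_counts().index[top_n:]

# Keep frequent labels, replace everything else with 'OTHER'
collapsed = s.where(~s.isin(rare), 'OTHER')
as_cat = collapsed.astype('category')

print(collapsed.value_counts().to_dict())
print('string bytes:  ', collapsed.memory_usage(deep=True))
print('category bytes:', as_cat.memory_usage(deep=True))
```

A categorical column stores one small integer code per row plus a single copy of each label, which is why it tends to use far less memory than repeated strings on a dataset with millions of rows.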

Now we're ready to go for Data Exploration

## Data Exploration and Visualization

At this point we are done with data preprocessing and cleaning, so it is time to see what we have. In this section, I will use pandas functionality such as resampling by a time frame and pivot_table.

Let's begin with a general query: how many records do we have for each month?

In [None]:
plt.figure(figsize=(11,5))
Crimes.resample('M').size().plot(legend = False)
plt.title('Number of Crimes per month (2005 - 2016)')
plt.xlabel('Months')
plt.ylabel('Number of Crimes')
plt.show()


This chart shows a clear periodic (seasonal) pattern in crime counts that repeats year after year, which suggests that the overall volume of crime is fairly predictable.



Before we go further and explore other features, the first question that arises is: <b>How has crime changed over the years? Is it decreasing?</b>
Let's have a look at what we have from 2005-2016.

In the previous chart, we looked at the number of crime records per month. Although it did not give a clear picture of how crime has changed over the years, it suggested similar numbers for 2015 and 2016. Here, we take a finer scale and look at the rolling sum of crimes over the past year. The idea is that, for each day, we calculate the total number of crimes in the preceding 365 days. If this rolling sum is decreasing, fewer crimes were recorded in the most recent 365 days than in the window ending the day before, so crime is trending down. If the rolling sum stays flat during a given year, we can conclude that crime levels stayed about the same.

In [None]:
plt.figure(figsize=(11,6))
Crimes.resample('D').size().rolling(365).sum().plot()
plt.title('Rolling sum of all Crimes from 2005 - 2016')
plt.ylabel('Number of Crimes')
plt.xlabel('Days')
plt.show()


We see the line decreasing from 2006 until some point around 2016, after which it stays at roughly the same number of crimes. This means that 2016 was really no better than 2015, but both years show a much better crime record (in total) than the previous years.
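To see what `resample('D').size().rolling(...).sum()` computes, here is a minimal sketch on six made-up event timestamps with a 2-day window instead of 365. The Series `events` is hypothetical; only the chain of pandas calls mirrors the cell above.

```python
import pandas as pd

# Six made-up events; note that 2005-01-02 has no events at all
events = pd.Series(1, index=pd.to_datetime([
    '2005-01-01 10:00', '2005-01-01 22:00',
    '2005-01-03 08:00',
    '2005-01-04 09:00', '2005-01-04 10:00', '2005-01-04 23:00',
]))

# One count per calendar day; days without events appear as 0
daily = events.resample('D').size()

# Sum over the current day and the day before (window of 2 rows)
rolling2 = daily.rolling(2).sum()

print(daily)
print(rolling2)
```

The first value of the rolling sum is NaN because a full window is not yet available. By the same logic, `rolling(365)` leaves the first 364 days empty, so the line in the chart only starts about a year after the first record.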

Let's separate the crimes by type and see how the count for each particular type has changed

In [None]:
Crimes_count_date = Crimes.pivot_table('ID', aggfunc=np.size, columns='Primary Type', index=Crimes.index.date, fill_value=0)
Crimes_count_date.index = pd.DatetimeIndex(Crimes_count_date.index)
plo = Crimes_count_date.rolling(365).sum().plot(figsize=(12, 30), subplots=True, layout=(-1, 3), sharex=False, sharey=False)


The overall trend suggested that crime was decreasing, but that is not the case for every type. Some crime types have actually been increasing all along, such as interference with public officer and deceptive practice. Other types started to increase slightly before 2016, such as theft, robbery and stalking (which may be the reason behind the trend we saw earlier).

## A general view of crime records by time, type and location

In this part we'll see how crimes differ between places and times.

The first thing we are going to look at is whether the number of crimes differs between days of the week. Are there more crimes on weekdays or on weekends?

In [None]:
days = ['Monday','Tuesday','Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
Crimes.groupby([Crimes.index.dayofweek]).size().plot(kind='barh')
plt.ylabel('Days of the week')
plt.yticks(np.arange(7), days)
plt.xlabel('Number of Crimes')
plt.title('Number of Crimes by day of the week')
plt.show()
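The tick labels above rely on pandas numbering weekdays from Monday = 0 to Sunday = 6, and on every weekday being present in the grouped result. A minimal sketch on four made-up dates shows the mapping and one way to guard against a missing day; the `reindex` step is an optional addition, not part of the cell above.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Made-up events: one on Monday 2016-01-04, one on Saturday, two on Sunday
idx = pd.DatetimeIndex(['2016-01-04', '2016-01-09', '2016-01-10', '2016-01-10'])
toy = pd.Series(1, index=idx)

# dayofweek runs from 0 (Monday) to 6 (Sunday)
counts = toy.groupby(toy.index.dayofweek).size()

# Fill in weekdays with no events so that positions 0..6 always line up with `days`
counts = counts.reindex(range(7), fill_value=0)
counts.index = days

print(counts)
```

On the full dataset every weekday has records, so the reindex changes nothing, but it keeps the labels correct when the same code is applied to a small filtered subset.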

Now let's look at crimes per month and see if certain months show more crimes than others.

In [None]:
Crimes.groupby([Crimes.index.month]).size().plot(kind='barh')
plt.ylabel('Months of the year')
plt.xlabel('Number of Crimes')
plt.title('Number of Crimes by month of the year')
plt.show()

The number of crimes seems to peak in the summer.

Let's have a look at the distribution of crimes by type. Which crimes are most common among the top 20 most frequent crime types?

In [None]:
plt.figure(figsize=(8,10))
Crimes.groupby([Crimes['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.title('Number of Crimes by type')
plt.ylabel('Crime Type')
plt.xlabel('Number of Crimes')
plt.show()

And similarly for Crime Location

In [None]:
plt.figure(figsize=(8,10))
Crimes.groupby([Crimes['Location Description']]).size().sort_values(ascending=True).plot(kind='barh')
plt.title('Number of Crimes by Location')
plt.ylabel('Crime Location')
plt.xlabel('Number of Crimes')
plt.show()
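The opening question about whether some crime types are more likely at specific times of the day can be approached with the same `pivot_table` tool used earlier. The sketch below is only an illustration on six made-up records: the frame `toy`, the helper column `hour` and all values are hypothetical, and no result from the real dataset is implied.

```python
import pandas as pd

# Made-up records indexed by timestamp, mimicking the shape of `Crimes`
toy = pd.DataFrame(
    {'ID': range(6),
     'Primary Type': ['THEFT', 'THEFT', 'BATTERY', 'THEFT', 'BATTERY', 'BATTERY']},
    index=pd.to_datetime([
        '2016-01-04 14:00', '2016-01-04 14:30', '2016-01-05 23:05',
        '2016-01-06 15:10', '2016-01-06 23:40', '2016-01-07 02:15',
    ]),
)

# Hypothetical helper column: hour of the day taken from the datetime index
toy['hour'] = toy.index.hour

# Rows = hour of day, columns = crime type, cells = number of records
counts = toy.pivot_table(values='ID', index='hour', columns='Primary Type',
                         aggfunc='count', fill_value=0)

# Normalise each column so types with very different totals can be compared
share = counts.div(counts.sum(axis=0), axis=1)

print(counts)
print(share.round(2))
```

Dividing each column by its total turns raw counts into the share of each type that falls in each hour, which is easier to compare than raw counts when one type is far more common than another. The same table could be built with 'Location Description' as the index to compare locations instead of hours.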

    

This is my first ever notebook, created with some help from other sources.

An upvote would be much appreciated.

Thanks for reading!