# A data analysis of the Montgomery County, PA [911 calls](https://www.kaggle.com/mchirico/montcoalert)

* By: Mateus Mendes Ramalho da Silva
  * mateus.mendes.mmr@gmail.com

Importing Python libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


Loading the dataset into a DataFrame:

In [None]:
df=pd.read_csv('../input/montcoalert/911.csv')

Checking the head of the dataset:

In [None]:
df.head()

Checking information about the DataFrame:

In [None]:
df.info()

Top 10 zip codes by number of 911 calls:

In [None]:
df['zip'].value_counts().head(10)


In [None]:
by_zip = df.groupby(['zip']).count()
by_zip.sort_values(by='addr', ascending=False, inplace=True)
by_zip = by_zip.head(10)
plt.figure(figsize=(12,6))
plt.title('Top 10 zip codes on 911 calls:')
sns.barplot(x='zip', y='addr', data=by_zip.reset_index()) 
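
The two cells above count calls per zip code in two different ways. As a minimal sketch on made-up data (the `toy` frame below is an assumption, not part of `911.csv`), `value_counts()` and `groupby().count()` give the same totals once the grouped result is sorted:

```python
import pandas as pd

# Toy data standing in for 911.csv; all values are invented for illustration
toy = pd.DataFrame({
    'zip': [19401.0, 19401.0, 19464.0, None, 19401.0],
    'addr': ['A ST', 'B ST', 'C ST', 'D ST', 'E ST'],
})

# value_counts() ignores missing zip codes and sorts by frequency
counts_vc = toy['zip'].value_counts()

# groupby() also drops missing keys by default; sort to get the same ranking
counts_gb = toy.groupby('zip')['addr'].count().sort_values(ascending=False)

print(counts_vc.to_dict())
print(counts_gb.to_dict())
```

Note that both approaches skip rows whose zip code is missing, so the totals can be lower than the number of rows in the dataset.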


Top 10 townships (the `twp` column) by number of 911 calls:

In [None]:
df['twp'].value_counts().head(10)

In [None]:
by_twp = df.groupby(['twp']).count()
by_twp.sort_values(by='addr', ascending = False, inplace=True)
by_twp = by_twp.head(10)
plt.figure(figsize=(20,6))
plt.title('Top 10 cities on 911 calls')
sns.barplot(x='twp', y='addr', data=by_twp.reset_index())

The title column holds three main kinds of emergency: EMS, Fire and Traffic. Let's create a new column containing just the reason for each occurrence, without the description. This helps keep the code cleaner and makes it easier to get some insights.

Taking the part of each title before the colon and putting it in a new column called 'reason':

In [None]:
reasons = df['title'].apply(lambda reason: reason.split(':')[0])
df['reason'] = reasons
df.head()
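
As a small sketch of the same extraction, the vectorized `.str` accessor can replace `apply` with a lambda. The titles below are invented examples in the same `Reason: description` shape:

```python
import pandas as pd

# Invented titles following the 'Reason: description' pattern
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT',
])

# Approach used in the notebook: split each string in a Python lambda
reason_apply = titles.apply(lambda title: title.split(':')[0])

# Vectorized alternative: split every string, then take the first piece
reason_str = titles.str.split(':').str[0]

print(reason_str.tolist())
```

Both versions produce the same column; the `.str` form also passes missing values through as NaN instead of raising an error.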

Which reason is the most common?

In [None]:
df['reason'].value_counts()

Answer: EMS is the most common reason.

Let's show it graphically:

In [None]:
plt.title("Most common reasons on 911 calls")
sns.countplot(x='reason', data=df) 
c_reasons = df['reason'].value_counts().to_frame() # Converting the series into a dataframe:
c_reasons.plot(kind='pie', subplots=True, figsize=(6,6), title="Most common reasons on 911 calls")
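
The pie chart shows shares rather than raw counts. A minimal sketch of how to read those shares as numbers, using a made-up `reason` column:

```python
import pandas as pd

# Invented reasons, only to illustrate the calculation
reason = pd.Series(['EMS', 'EMS', 'Traffic', 'Fire'])

# normalize=True returns fractions of the total instead of counts
shares = reason.value_counts(normalize=True)

print(shares.to_dict())
```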

Let's now get some information related to time. First we need to convert the strings in the 'timeStamp' column to datetime format:

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp']) # Converting by the to_datetime()

With this done, we can extract some very useful and specific information from the timeStamp, such as:

Hour:

In [None]:
df['timeStamp'].iloc[0].hour

Month:

In [None]:
df['timeStamp'].iloc[0].month

Day of the week:

In [None]:
df['timeStamp'].iloc[0].dayofweek

Year:

In [None]:
df['timeStamp'].iloc[0].year

Let's now add the hour, dayofweek, month and year columns to the dataframe:

In [None]:
df['hour'] = df['timeStamp'].apply(lambda time: time.hour)
df['dayofweek'] = df['timeStamp'].apply(lambda time: time.dayofweek)
df['month'] = df['timeStamp'].apply(lambda time: time.month)
df['year'] = df['timeStamp'].apply(lambda time: time.year)
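
A vectorized alternative to the `apply` calls above is the `.dt` accessor, which exposes the same attributes for a whole datetime column. A minimal sketch on two made-up timestamps:

```python
import pandas as pd

# Two invented timestamps in the same string format as the dataset
stamps = pd.Series(['2015-12-10 17:40:00', '2016-01-03 08:05:30'])
parsed = pd.to_datetime(stamps)

# .dt gives element-wise access to the datetime components
hours = parsed.dt.hour
weekdays = parsed.dt.dayofweek  # Monday is 0, Sunday is 6
months_ = parsed.dt.month
years = parsed.dt.year

print(hours.tolist(), weekdays.tolist(), months_.tolist(), years.tolist())
```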

Checking dataframe:

In [None]:
df.head()

Have you noticed something odd? The day of the week is stored as a number. To make it readable, we create a dictionary with the names of the days of the week and apply it to the column using the map() function:

In [None]:
days = {0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'}

In [None]:
df['dayofweek'] = df['dayofweek'].map(days)

Checking:

In [None]:
df['dayofweek']
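
The mapping step can be checked on a few dates whose weekday is known. The sketch below uses invented dates and also shows `Series.dt.day_name()`, a pandas method that returns full day names without a dictionary:

```python
import pandas as pd

days = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# Invented dates: a Thursday, a Saturday and a Sunday
stamps = pd.to_datetime(pd.Series(['2015-12-10', '2015-12-12', '2015-12-13']))

# Dictionary approach, as in the notebook
mapped = stamps.dt.dayofweek.map(days)

# Alternative: take the full English name and keep its first three letters
named = stamps.dt.day_name().str[:3]

print(mapped.tolist())
print(named.tolist())
```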

Let's now finally get some information from the timeStamp:

911 Calls by day of week:

In [None]:
df['dayofweek'].value_counts()

In [None]:
plt.title("911 calls by days of week:")
sns.countplot(x='dayofweek', data=df)

We can see that the day with the most 911 calls is Friday and the one with the fewest is Sunday.

Dividing also by reason:

In [None]:
plt.title('911 Calls by day of week and reasons')
sns.countplot(x='dayofweek', data=df, hue='reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

911 Calls by month:

In [None]:
sns.countplot(x='month', data=df)
plt.title("911 calls by month")

We can see that calls are most frequent in January and December.

By the way, we can also change the month number into the month name in order to get a better description:

In [None]:
months = {1:'Jan', 2: 'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
df_mn = df.copy() # Just making a copy to preserve temporal order on future insights
df_mn['month'] = df_mn['month'].map(months)

Let's check:

In [None]:
df_mn['month']

Now the visualization:

In [None]:
plt.title("911 calls by month")
sns.countplot(x='month', data=df_mn)
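
One thing to watch with month names: as plain strings they no longer carry calendar order, so a plot or a count may list them in order of first appearance or alphabetically. A minimal sketch of one way to keep the calendar order, assuming an ordered categorical is acceptable here (the month numbers are invented):

```python
import pandas as pd

months = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
          7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}

# Invented month numbers, deliberately out of calendar order
month_numbers = pd.Series([12, 1, 3, 1, 12, 2])
month_names = month_numbers.map(months)

# An ordered categorical remembers that Jan < Feb < ... < Dec
ordered = pd.Categorical(month_names, categories=list(months.values()), ordered=True)

# Count, sort by calendar position, and keep only months that occur
counts = pd.Series(ordered).value_counts().sort_index()
observed = counts[counts > 0]

print(observed.to_dict())
```

Seaborn's `countplot` also accepts an `order` argument, so passing `order=list(months.values())` is another option.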

Now let's divide by reason:

In [None]:
plt.title("911 calls by month")
sns.countplot(x='month', data=df_mn, hue='reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Let's now check by years:

In [None]:
plt.title("911 calls per year")
sns.countplot(x='year', data=df)

Now with the reasons:

In [None]:
plt.title("911 calls per year")
sns.countplot(x='year', data=df, hue='reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

**Question**: Why does 2015 have fewer records than the other years?

Let's check the months to see if the data spreads through all the year:

In [None]:
df_2015 = df[df['year'] == 2015]
df_2015['month'].value_counts()
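
Another way to check coverage is to look at the first and last timestamp of each year. A minimal sketch on invented timestamps (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Invented timestamps: two in December 2015, one in January 2016
stamps = pd.to_datetime(pd.Series([
    '2015-12-10 17:40',
    '2015-12-31 23:59',
    '2016-01-01 00:10',
]))
toy = pd.DataFrame({'timeStamp': stamps})

# Earliest and latest record per calendar year
coverage = toy.groupby(toy['timeStamp'].dt.year)['timeStamp'].agg(['min', 'max'])

print(coverage)
```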

**Answer**: The data from 2015 corresponds only to December.

Let's make some different and interesting plots now:

Let's first group the data by month:

In [None]:
by_month = df.groupby(['month']).count()

In [None]:
by_month.head()

Let's visualize the data

In [None]:
by_month['addr'].plot(title="911 calls through months")

Let's see it year by year:

In [None]:
a_2016 = df[df['year']==2016].groupby(['month']).count()
a_2017 = df[df['year']==2017].groupby(['month']).count()
a_2018 = df[df['year']==2018].groupby(['month']).count()
a_2019 = df[df['year']==2019].groupby(['month']).count()
a_2016['addr'].plot(legend=True)
a_2017['addr'].plot()
a_2018['addr'].plot()
a_2019['addr'].plot()
plt.title("911 Calls through months year by year")
plt.legend(['2016','2017','2018','2019'])

Now let's see it on a linear model plot:

In [None]:
sns.lmplot(x='month', y='addr', data=by_month.reset_index())
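
`lmplot` draws the points together with a fitted regression line but does not report the line's coefficients. As a rough sketch, a straight-line least-squares fit with `numpy.polyfit` gives a slope and intercept to read alongside the plot. The counts below are invented and perfectly linear, only to make the result easy to verify:

```python
import numpy as np

# Invented monthly counts that fall by 10 calls per month
month = np.array([1, 2, 3, 4])
calls = np.array([100, 90, 80, 70])

# Degree-1 polynomial fit: returns [slope, intercept]
slope, intercept = np.polyfit(month, calls, 1)

print(round(slope, 6), round(intercept, 6))
```

A negative slope means the counts decrease as the month number grows.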

Now let's compare the reasons in a line plot:

In [None]:
df[df['reason'] == 'Traffic'].groupby(['month']).count()['addr'].plot()
df[df['reason'] == 'EMS'].groupby(['month']).count()['addr'].plot()
df[df['reason'] == 'Fire'].groupby(['month']).count()['addr'].plot()
plt.title('Reasons of 911 calls line comparison')
plt.legend(['Traffic','EMS', 'Fire'])
plt.tight_layout()

Now let's create some heatmaps using the hour and the day of week:

In [None]:
day_hour = df.groupby(['dayofweek', 'hour']).count()['reason'].unstack()
day_hour.head()
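
The `groupby(...).count()[...].unstack()` pattern turns a two-level count into a matrix with days as rows and hours as columns. Because the day names are strings, the rows come out in alphabetical order. The sketch below reproduces the pattern on invented data and reorders the rows with `reindex`; the `week` list is an assumption about the desired order:

```python
import pandas as pd

# Invented calls, only to show the reshaping
toy = pd.DataFrame({
    'dayofweek': ['Mon', 'Mon', 'Fri', 'Fri', 'Fri', 'Sun'],
    'hour': [8, 8, 17, 17, 9, 8],
    'reason': ['EMS', 'Fire', 'Traffic', 'EMS', 'EMS', 'Fire'],
})

# Same pattern as above: count per (day, hour), then pivot hours into columns
day_hour = toy.groupby(['dayofweek', 'hour']).count()['reason'].unstack()

# Put the rows in weekday order; combinations with no calls become 0
week = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ordered = day_hour.reindex(week).fillna(0).astype(int)

print(ordered)
```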

**Question:** On which days and at which hours do most of the calls happen?

In [None]:
plt.figure(figsize=(12,6))
plt.title("911 Calls on days of week and hour")
sns.heatmap(day_hour, cmap='YlGnBu')

**Answer:** Most of the calls happen between 15:00 and 18:00, mainly at 17:00. Friday is the most common day, and weekends have fewer calls than the other days.

Let's also see a clustermap:

In [None]:
sns.clustermap(day_hour, cmap='YlGnBu')

Let's also relate the days of week and months:

In [None]:
day_month = df.groupby(['dayofweek', 'month']).count()['reason'].unstack()
day_month.head()

**Question:** On which day of the week, in which month, are the most calls concentrated?

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(day_month, cmap="inferno_r")

**Answer**: Most of the calls happen on Fridays in March. There is also a large number on Tuesdays in January.

Let's also see a clustermap:

In [None]:
sns.clustermap(day_month, cmap='inferno_r')