# 911 Calls Capstone Project

**2018/11/23**
___

I will be looking at the Emergency 911 Calls dataset for Montgomery County as one of the capstone projects for the Udemy course "Python for Data Science and Machine Learning Bootcamp".

I will be making visualizations of this data set in order to analyze and extract insights. This is my first kernel using Kaggle.

I'll start by importing the libraries I plan on using, as well as the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#Import the data into a dataframe
df = pd.read_csv('../input/911.csv')

# Dataset
Let's get some info on this dataset.

In [None]:
df.info()

In [None]:
df.head()

First, let's explore the data a little.

In [None]:
#check for any missing data
df.isnull().sum()

We see that there are many missing data points for zip code and some for the township. Therefore, if we look at the top values in zip codes and townships, we have to keep in mind that much of the zip code data and some of the township data is missing.
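To put "many" and "some" into numbers, one option is to look at the share of missing rows per column rather than the raw counts. The snippet below is a minimal sketch on a small, invented stand-in dataframe (`toy` and its values are hypothetical, not taken from the 911 data); the same `isnull().mean()` call should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 911 dataframe (values invented for illustration)
toy = pd.DataFrame({
    'zip': [19401.0, np.nan, 19446.0, np.nan],
    'twp': ['NORRISTOWN', 'LANSDALE', None, 'ABINGTON'],
    'title': ['EMS: FALL', 'Fire: ALARM', 'Traffic: ACCIDENT', 'EMS: FEVER'],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows
missing_share = toy.isnull().mean()
print(missing_share)
```

A share close to zero means the top-5 lists are probably trustworthy; a large share means they should be read with more caution.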

In [None]:
#The top 5 zip codes for 911 calls with the data we have
df['zip'].value_counts().head()

In [None]:
#The top 5 townships for 911 calls with the data we have
df['twp'].value_counts().head()

# Extracting 911 Call Reasons
The title column seems to hold a variety of reasons for, or results of, each 911 call. I assume that this is what is used to quickly describe what kind of incident occurred.

In [None]:
#Let's see how many unique title codes there are.
df['title'].nunique()

It appears that the values in the title column are preceded with a category. Let's split this into another column to make it easier to understand what is going on.

In [None]:
df['Reason'] = df['title'].apply(lambda st: st.split(':')[0])

In [None]:
df.head()

Now that we have this new column, let's take a closer look at what is happening.

In [None]:
df['Reason'].value_counts()

In [None]:
#let's visualize the above result to visually compare these numbers
sns.countplot(x='Reason',data=df)

We can quickly note, from the above graph, that fire calls are the least represented category in our dataset.
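To quantify how each category compares, the counts can be turned into proportions. This is a minimal sketch on a few invented titles (the `titles` series is hypothetical); it also uses the vectorized `.str.split` accessor as one possible alternative to `apply` for extracting the category.

```python
import pandas as pd

# Hypothetical titles in the same "Category: detail" shape as the title column
titles = pd.Series(['EMS: FALL', 'EMS: FEVER', 'Traffic: ACCIDENT', 'Fire: ALARM'])

# Take the text before the first colon as the reason
reason = titles.str.split(':').str[0]

# normalize=True returns shares that sum to 1 instead of raw counts
reason_share = reason.value_counts(normalize=True)
print(reason_share)
```

On the real data, `df['Reason'].value_counts(normalize=True)` should give the share of EMS, Traffic, and Fire calls directly.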

# Extracting Time Data
Let's look into what time frame this dataset covers.

In [None]:
#checking the timestamp column datatype
type(df['timeStamp'].iloc[0])

In [None]:
#convert them to DateTime objects
df['timeStamp'] = pd.to_datetime(df['timeStamp'])

In [None]:
type(df['timeStamp'].iloc[0])

Now that the time stamps have been converted, we can begin adding new columns based on the time information.

In [None]:
df['Hour'] = df['timeStamp'].apply(lambda time: time.hour)

In [None]:
df['Month'] = df['timeStamp'].apply(lambda time: time.month)

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week'] = df['timeStamp'].apply(lambda time: time.dayofweek).map(dmap)
#Timestamp.day_name() (weekday_name in older pandas versions) could have produced a similar result with full day names, but I wanted to practice mapping a dictionary

In [None]:
#check to see our new dataframe
df.head()
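As an aside, the same three columns can likely be built without `apply` by using the `.dt` accessor on a datetime column. This is only a sketch on two invented timestamps (the `ts` series and the `time_cols` name are hypothetical); slicing `day_name()` to three characters is one assumed way to get the same abbreviations as the dictionary mapping.

```python
import pandas as pd

# Two hypothetical timestamps, already converted to datetime
ts = pd.to_datetime(pd.Series(['2018-03-02 14:05:00', '2018-11-15 08:30:00']))

time_cols = pd.DataFrame({
    'Hour': ts.dt.hour,                        # hour of day, 0-23
    'Month': ts.dt.month,                      # month number, 1-12
    'Day of Week': ts.dt.day_name().str[:3],   # 'Friday' -> 'Fri'
})
print(time_cols)
```

The vectorized form tends to be faster on large frames, while the `apply` version makes each step easier to see when learning.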

# Graphing Reason and Time Data
Now, let's look at this time data with our Reason column.

In [None]:
sns.countplot(x='Day of Week', data = df, hue = 'Reason', palette = 'Set2')
plt.legend(loc='lower left',bbox_to_anchor=(1.0,0.5))

In [None]:
sns.countplot(x='Month', data = df, hue = 'Reason', palette = 'Set2')
plt.legend(loc='lower left',bbox_to_anchor=(1.0,0.5))

At a quick glance, these two graphs show that Traffic calls are generally lower on the weekends, and that Fire calls are much lower in number per month than EMS and Traffic calls.
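The numbers behind a grouped count plot can also be read as a table. The following is a minimal sketch using `pd.crosstab` on an invented frame (`toy` and its rows are hypothetical) to count calls for each day and reason pair.

```python
import pandas as pd

# Hypothetical rows with the same two columns used in the count plot
toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Mon', 'Sat', 'Sat', 'Sun'],
    'Reason': ['Traffic', 'Traffic', 'EMS', 'EMS', 'Traffic', 'EMS'],
})

# Rows are days, columns are reasons, cells are call counts (0 when absent)
by_day = pd.crosstab(toy['Day of Week'], toy['Reason'])
print(by_day)
```

Applied to `df`, a table like this would make it possible to check the weekend drop in Traffic calls against exact counts instead of bar heights.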

Let's get a better understanding of the total number of calls per month.

In [None]:
df.groupby('Month').count()

In [None]:
#let's turn this into a graph to better understand calling trends per month
df.groupby('Month').count().plot.line(use_index = True,y = 'title',legend = None)
plt.ylabel('count')

There are a lot of spikes in the above graph, so let's do a linear regression to see the general trendline and understand our data better.

In [None]:
sns.lmplot(x='Month',y = 'title', data = df.groupby('Month').count().reset_index())
plt.ylabel('count')

We see from above that the trendline is slightly negative with large variance towards the beginning and ending months of the data set. 
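To attach a number to "slightly negative", the slope of a least-squares line can be computed directly. This is a sketch with invented monthly counts (the `months` and `counts` arrays are hypothetical), using `np.polyfit` with degree 1 as an assumed stand-in for the line that the regression plot draws.

```python
import numpy as np

# Hypothetical month numbers and call counts (invented, perfectly linear)
months = np.array([1, 2, 3, 4, 5, 6])
counts = np.array([120, 118, 116, 114, 112, 110])

# Degree-1 fit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(months, counts, deg=1)
print(slope, intercept)
```

A negative slope means fewer calls per month as the month number increases; on the real data the inputs would come from `df.groupby('Month').count()['title']`.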

# Graphing Timelines
To continue exploring, let's find out what the call counts look like for each reason by date.

In [None]:
#let's use the timestamp information to create a new column
df['Date'] = df['timeStamp'].apply(lambda ts: ts.date())

In [None]:
df.head()

Now let's plot the total 911 calls by date. 

In [None]:
df.groupby('Date').count().plot.line(use_index = True, y = 'title', figsize= (15,2), legend = None)
plt.ylabel('count')

We notice giant outliers in March of 2018 and in November of 2018.
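Outlier days like these can also be located programmatically instead of by eye. The sketch below uses invented daily counts (the `daily` series is hypothetical) and flags days above three times the median; that threshold is an arbitrary assumption, not a rule from the course or the dataset.

```python
import pandas as pd

# Hypothetical daily call counts, indexed by date (values invented)
daily = pd.Series(
    [400, 410, 395, 2100, 405, 390, 415],
    index=pd.date_range('2018-02-27', periods=7, freq='D'),
)

# The median is robust to the spike itself, unlike the mean
threshold = 3 * daily.median()
outlier_days = daily[daily > threshold]
print(outlier_days)
```

On the real data, `df.groupby('Date').size()` would be one way to build the daily series.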

# Investigating Outliers
Let's investigate these two major outliers.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
df[df['Date'].dt.year >= 2018].groupby('Date').count().plot.line(use_index = True, y = 'title', legend = None)
plt.ylabel('count')

We see the first one in March. Let's track down which day it was.

In [None]:
df.groupby(df[(df['Date'].dt.year>= 2018) & (df['Date'].dt.month==3)]['Date']).count()

Scanning the above table, we see that the 2nd had about 4-5 times the normal number of calls compared with the rest of the month. Let's see if we can understand what might have happened on that day from the data we have.

In [None]:
#Checking the reasons to see if it's distributed according to the entire dataset.
df[df['Date']=='2018-03-02']['Reason'].value_counts()


In [None]:
sns.countplot(x='Reason',data=df[df['Date']=='2018-03-02'])

We see that this count distribution looks very different from our original total count distribution for the entire dataset. We can infer from this that there was most likely an event that caused more traffic calls, maybe a big sports game, a weather issue, or something else.
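One way to make this comparison explicit is to put the overall shares and the single-day shares side by side. This is a minimal sketch on an invented frame (`toy` and its rows are hypothetical), not the actual March 2 figures.

```python
import pandas as pd

# Hypothetical calls across two dates (rows invented for illustration)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2018-03-01', '2018-03-01', '2018-03-02',
                            '2018-03-02', '2018-03-02', '2018-03-02']),
    'Reason': ['EMS', 'EMS', 'Traffic', 'Traffic', 'Traffic', 'EMS'],
})

# Share of each reason over all rows and over one selected date
overall = toy['Reason'].value_counts(normalize=True)
one_day = toy.loc[toy['Date'] == '2018-03-02', 'Reason'].value_counts(normalize=True)

# Align both on the reason label; a reason missing on that day becomes 0
comparison = pd.DataFrame({'overall': overall, 'one_day': one_day}).fillna(0)
print(comparison)
```

A large gap between the two columns for a reason suggests that the day's mix of calls was unusual, independent of how many calls there were in total.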

After some quick research, it does appear that Montgomery County was experiencing extreme weather and power outages on this day. It is very likely that this was the cause of the anomaly on March 2, 2018.
(Source: https://www.pema.pa.gov/about/publicinformation/Daily%20Incident%20Reports/20180303%20Daily%20Report.pdf)

Now, let's check the other anomaly from November of 2018.

In [None]:
#reusing the same code from before
df[(df['Date'].dt.year >= 2018) & (df['Date'].dt.month == 11)].groupby('Date').count()

In [None]:
sns.countplot(x='Reason',data=df[df['Date']=='2018-11-15'])

Again, we see that something happened on November 15th. Because the November 15th graph also shows a high count of traffic calls, we have a sense that it could be similar to the March 2nd incident where the cause was extreme weather. 

After quickly researching this date as well, we see that it was most likely due to extreme weather, just as on March 2nd, 2018. (Source: https://patch.com/pennsylvania/norristown/more-1-200-montgomery-co-peco-customers-without-power)

*It's interesting to note that, of the 3 years of data, both of the anomalies caused by extreme weather were in 2018. Further investigation beyond the scope of this analysis could be done to see if the weather in 2018 for Montgomery County was significantly more severe than in the previous two years. It would be interesting to see what specifically caused the comparatively large increase in 911 calls for those weather events and not, presumably, for any in 2016 and 2017. There could be a number of factors (power outages, awareness of incoming conditions, severity of the weather, etc.) that could be further investigated to try to find the source of the increase in 911 calls. Again, this is not within the scope of this analysis, but it is worth mentioning where this could lead and the potential benefit of knowing the cause(s): insight into how to better prepare for extreme weather conditions in the future.*

Out of curiosity, let's continue exploring the number of calls by date, but, this time, let's break it down by reason.

In [None]:
#unstacking and viewing the groupby table so we can find out how to select our data.
df.groupby(['Date','Reason']).count().unstack()

In [None]:
#Traffic
df.groupby(['Date','Reason']).count()['title'].unstack().plot.line(use_index = True, y = 'Traffic', figsize= (15,2), legend = None)
plt.title('Traffic')
plt.ylabel('count')

In [None]:
#EMS
df.groupby(['Date','Reason']).count()['title'].unstack().plot.line(use_index = True, y = 'EMS', figsize= (15,2), legend = None)
plt.title('EMS')
plt.ylabel('count')

In [None]:
#Fire
df.groupby(['Date','Reason']).count()['title'].unstack().plot.line(use_index = True, y = 'Fire', figsize= (15,2), legend = None)
plt.title('Fire')
plt.ylabel('count')

The first two graphs show roughly what we expected: the traffic graph has the two large outliers that we investigated earlier, and the EMS graph stays close to its average, with the exception of a few data points that fall below it. Because these low points are not zero, which would indicate something wrong with the data, and look like reasonable decreases, we're going to assume that they are the kind of natural outliers you would see in any dataset.

What we didn't notice until now, though, is that along with the high number of traffic calls on March 2, 2018, the number of fire calls was abnormally high as well. With the count plot earlier, we were simply looking for a difference in distribution across the categories, not necessarily in quantity, to flag the day as abnormal. Going back to the count plot for March 2, 2018, however, we see that the fire count is around 600, much larger than the perceived average in the fire calls graph above. This shows the importance of checking each of the major categories in your data, especially ones that could lead you to conclusions, so that you can see more accurately what is going on. In this case, severe weather most likely contributed not only to a higher traffic call count but also to a higher fire call count on March 2, 2018.
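A check like this can be made systematic by comparing each day's count per reason with that reason's own typical level. The sketch below uses invented rows (the `toy` frame is hypothetical) and divides by the per-reason median; using the median as the "typical" level is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical calls over three dates (rows invented for illustration)
toy = pd.DataFrame({
    'Date': ['2018-03-01'] * 3 + ['2018-03-02'] * 6 + ['2018-03-03'] * 3,
    'Reason': ['EMS', 'Traffic', 'Fire',
               'EMS', 'Traffic', 'Traffic', 'Fire', 'Fire', 'Fire',
               'EMS', 'Traffic', 'Fire'],
})

# One row per date, one column per reason, cells are call counts
daily_by_reason = toy.groupby(['Date', 'Reason']).size().unstack(fill_value=0)

# Ratio of each day's count to that reason's median daily count
ratio_to_median = daily_by_reason / daily_by_reason.median()
print(ratio_to_median)
```

A ratio well above 1 in any column flags that category as abnormal for the day, even when another category dominates the raw counts.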

*From here, another question you could explore is: "Why were there more fire calls in the March 2, 2018 weather incident but not the November 15th, 2018 weather incident?" This question is beyond the scope of this analysis, but it's also worth mentioning where this data could lead: studying these two dates and analyzing what occurred and why could potentially yield more information on how to specifically prevent more fire accidents.*

# Heat Maps
Finally, let's move to looking at how the time of day and day of week interact with the number of 911 calls. For this, we'll be creating a heat map using seaborn.

In [None]:
#First need to change the dataframe to a pivot table with days of week and hours in day
dfht = df.groupby(['Day of Week','Hour']).count().unstack()['title']
dfht

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(dfht, cmap='coolwarm',ax = ax)

We see that most of the call density comes during the day and is most prevalent on business days, both of which are expected. Let's look at a cluster map of this data to better understand the similarities.

In [None]:
sns.clustermap(dfht, cmap = 'coolwarm', figsize = (12,10))

This cluster map more clearly shows that the hours and days of the week that have the most density are the weekdays during conventional working hours of 9 am to 6 pm. 
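For the plain heat map, note that grouping on day names sorts the rows alphabetically, which can make weekday patterns harder to read. The following is a sketch on invented rows (the `toy` frame and `day_order` list are hypothetical) of how the table could be reordered from Monday to Sunday before plotting.

```python
import pandas as pd

# Hypothetical calls with the same two columns used for the heat map
toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Sun', 'Fri', 'Fri', 'Fri'],
    'Hour': [9, 17, 9, 9, 9, 17],
})

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Count calls per (day, hour), pivot hours into columns,
# then force calendar order; days with no calls become rows of 0
table = (
    toy.groupby(['Day of Week', 'Hour']).size()
       .unstack(fill_value=0)
       .reindex(day_order, fill_value=0)
)
print(table)
```

The cluster map reorders rows and columns by similarity anyway, so this ordering mainly helps the regular heat map.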

Let's find out what the heatmap of the month and the day of the week looks like.

In [None]:
#Creating the dataframe we'll use
dfmt = df.groupby(['Day of Week','Month']).count().unstack()['title']
dfmt

In [None]:
#Heatmap
fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(dfmt, cmap='coolwarm', ax = ax)

We see that the biggest density is on Friday in March. This is surely influenced by the weather incident that we looked at earlier that fell on Friday, March 2, 2018.
It's important to note that, although we expected to see a heavier density in November due to the other incident, at the time of this analysis, our data stops mid-November, making the months of November and December less valuable to look at due to the lower amount of data in the dataset. 
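One possible way to reduce the effect of partially covered months is to divide each month's calls by the number of distinct dates observed in that month. This is only a sketch on invented dates (the `toy` frame and the `per_month` name are hypothetical).

```python
import pandas as pd

# Hypothetical call dates: three observed days in October, one in November
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2018-10-01', '2018-10-01', '2018-10-02', '2018-10-03',
                            '2018-11-01', '2018-11-01', '2018-11-01']),
})
toy['Month'] = toy['Date'].dt.month

# Total calls and number of distinct observed days per month
per_month = toy.groupby('Month')['Date'].agg(calls='count', days='nunique')

# Average calls per observed day makes months with less coverage comparable
per_month['calls_per_day'] = per_month['calls'] / per_month['days']
print(per_month)
```

With this normalization, a month that is only half covered is no longer penalized for having fewer days in the data.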

In [None]:
#let's make a cluster map of the same information
sns.clustermap(dfmt, cmap = 'coolwarm', figsize = (12,10))

The clearest observation we can make from the cluster map is that Sunday is generally the lowest day of the week for 911 calls.

# Conclusion
In this visual analysis we were able to practice many different visualization techniques while exploring this dataset. We used pandas to create dataframes and sift through our data, reorganizing, extracting, and graphing the important data categories that we wanted to visualize. For the dataset, we found that EMS-related calls represented the most 911 calls, followed by traffic and then fire. We found 2 outliers that occurred on March 2nd, 2018 and November 15th, 2018, both likely due to severe weather conditions, and mentioned how, with more research, you could draw insights from the investigation of these two dates and the data behind them.

I enjoyed learning to use my first Kaggle kernel and getting to practice data visualization with this dataset. I would be grateful for any constructive feedback you may have for me!