# 911 Emergency Calls Analysis

![andrey-kremkov-UtWG73BiwE8-unsplash.jpg](https://i.imgur.com/ruYIZ57.jpg)

Photo by Andrey Kremkov on Unsplash

# 1. Introduction

911 is an emergency telephone number for the North American Numbering Plan (NANP).
Analysing an emergency-calls dataset to discover hidden trends and patterns helps ensure that emergency response teams are better equipped to deal with emergencies.

For road accidents, fire accidents and similar incidents, high numbers in specific areas indicate a high demand for ambulance services in those areas. Road accidents in some areas might be due to road conditions that need to be improved. A high frequency of emergencies due to respiratory problems might be caused by harmful pollutants in the air in that specific area. Techniques such as association rule mining can help in discovering such patterns.

The dataset contains Emergency 911 calls in Montgomery County located in the Commonwealth of Pennsylvania. The attributes chosen include: type of emergency, time stamp, township where the emergency has occurred.


# 2. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Print versions of libraries
print(f"Numpy version : Numpy {np.__version__}")
print(f"Pandas version : Pandas {pd.__version__}")
print(f"Matplotlib version : Matplotlib {matplotlib.__version__}")
print(f"Seaborn version : Seaborn {sns.__version__}")

# Magic Functions for In-Notebook Display
%matplotlib inline

# Setting seaborn style
sns.set(style='darkgrid')

## Loading Data

In [None]:
data = pd.read_csv('../input/montcoalert/911.csv', encoding='latin_1')

In [None]:
data.head(5)

# 3. Data Understanding and Cleaning

In [None]:
data.shape

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* The dataset comprises 631339 observations and 9 fields.

#### It is also good practice to know the columns and their corresponding data types, and to check whether they contain null values.

In [None]:
data.info()

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Data has float, integer, and object type values.

* The data type of timeStamp is object; it needs to be converted to datetime.

#### In order to understand our data, we can look at each variable and try to understand their meaning and relevance to this problem.

In [None]:
data.columns

<p style="font-weight: bold;color:#FF4500">Highlights</p>

#### The data contains the following fields:

* lat: Float variable, Latitude
* lng: Float variable, Longitude
* desc: String variable, Description of the Emergency Call
* zip: Float variable, Zipcode
* title: String variable, Title
* timeStamp: String variable, YYYY-MM-DD HH:MM:SS
* twp: String variable, Township
* addr: String variable, Address
* e: Integer variable, Dummy variable (always 1)

**In order to understand the data, we need to analyse each field carefully:**

* The description field contains three pieces of information, address, township and station code, separated by semicolons. The address and township give the location from which the call was made, and the station code identifies the station that received it. Much of this information is also captured in the fields 'lat' and 'lng', and in 'zip', 'twp' and 'addr'. This information helps the rescue team act quickly.
* timeStamp gives the date and time at which the emergency call was made.
* The title field contains two kinds of information separated by a colon: the reason category (EMS, Fire or Traffic) and the reason details.
* Column 'e' is a dummy variable whose value is always 1, so let's drop it as it provides no useful information.
* The values in the timeStamp column are strings, so we convert this column from strings to DateTime objects.

In [None]:
data.drop('e', axis=1, inplace=True)

In [None]:
type(data['timeStamp'].iloc[0])

In [None]:
data['timeStamp'] = pd.to_datetime(data['timeStamp'])
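
The conversion above lets pandas infer the timestamp format. As a minimal sketch, assuming every value follows the `YYYY-MM-DD HH:MM:SS` layout listed earlier, the format can also be passed explicitly and malformed values coerced to `NaT`. The sample strings below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values in the YYYY-MM-DD HH:MM:SS layout;
# the last entry is deliberately malformed.
raw = pd.Series(["2015-12-10 17:10:52", "2020-04-30 08:05:00", "not a date"])

# An explicit format avoids per-value format inference, and errors="coerce"
# turns unparseable strings into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to check whether any timestamps failed to parse.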

In [None]:
mindate = data["timeStamp"].min()
mindate

In [None]:
maxdate = data["timeStamp"].max()
maxdate

In [None]:
from dateutil import relativedelta

dif = relativedelta.relativedelta(pd.to_datetime(maxdate), pd.to_datetime(mindate))
print("{} years and {} months".format(dif.years, dif.months))

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* In a time span of 4 years and 4 months, from Dec 2015 to April 2020, about 631 thousand calls (631339 records) were made to the emergency number 911 to get help in Montgomery County, PA.

#### Now grab specific attributes from a Datetime object 

In [None]:
data['Hour'] = data['timeStamp'].apply(lambda time: time.hour)
data['Hour'].head()

In [None]:
data['DayOfWeek'] = data['timeStamp'].apply(lambda time: time.dayofweek)
data['DayOfWeek'].head()

In [None]:
data['Month'] = data['timeStamp'].apply(lambda time: time.month)
data['Month'].head()

In [None]:
data['Year'] = data['timeStamp'].apply(lambda time: time.year)
data['Year'].head()

In [None]:
data['Date'] = data['timeStamp'].apply(lambda time:time.date())
data['Date'].head()

**Notice how the Day of Week is an integer 0-6. We use the map() method with this dictionary to map the actual string names to the day of the week:**

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
data['DayOfWeek'] = data['DayOfWeek'].map(dmap)
data['DayOfWeek'].head()
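
As an alternative sketch, the day names can be derived directly from the timestamps with `dt.day_name()`, and an ordered categorical keeps them in calendar order rather than alphabetical order. The three dates below are hypothetical examples.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Saturday and a Monday.
ts = pd.Series(pd.to_datetime(["2015-12-10", "2015-12-12", "2015-12-14"]))

# dt.day_name() returns full English names; slicing keeps the 3-letter form
# used by the dmap dictionary.
short_names = ts.dt.day_name().str[:3]

# An ordered categorical sorts as Mon..Sun instead of alphabetically.
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
day_of_week = pd.Categorical(short_names, categories=order, ordered=True)

print(list(short_names))
print(list(day_of_week.sort_values()))
```

Whether the ordering matters depends on the later plots; with plain strings, group-bys and axis labels come out in alphabetical order.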

In [None]:
data.head(5)

## Missing Data

In [None]:
total = data.isnull().sum().sort_values(ascending=False)
percent = ((data.isnull().sum()/data.isnull().count())*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Zip has 80199 missing values, which is about 12.7% of all records.
* Township has 293 missing values, which is only about 0.05% of all records.
* Apart from Zip and Township, no other field has any missing values.

There is no single ideal way to deal with missing data. However, handling missing values is an essential preprocessing task, and doing it without sufficient care can drastically degrade any model built on the data.

Before handling missing values, it is important to identify them and to know how they are encoded. Possible encodings of missing values include ‘NaN’, ‘NA’, ‘None’, ‘ ’, ‘?’ and others.

Once you know a bit more about the missing data you have to decide whether or not you want to keep entries with missing data. This decision should partially depend on **how random missing values are.**
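
One rough way to judge how random the missing values are is to compare the missing rate across groups. The sketch below uses a hypothetical toy frame with a made-up `category` column; the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "category": ["EMS", "EMS", "Traffic", "Traffic", "Fire", "Fire"],
    "zip": [19401.0, 19446.0, np.nan, np.nan, 19401.0, np.nan],
})

# isnull().mean() gives the missing fraction of a column in one step.
overall = toy["zip"].isnull().mean() * 100

# Missing rate within each group: large differences between groups suggest
# the values are not missing completely at random.
by_group = toy["zip"].isnull().groupby(toy["category"]).mean() * 100

print(f"Overall missing: {overall:.1f}%")
print(by_group)
```

If the rate is similar in every group, dropping or imputing the rows is less likely to bias the analysis; if it differs strongly, the missingness itself carries information.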

## 'desc' field : Description of the Emergency Call

As stated earlier, the 'desc' field contains three pieces of information, address, township and station code, separated by semicolons. So let's separate them out to get the code of the station to which the call was made.

In [None]:
# None shows the full column contents; -1 is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)
data['desc'].head()

In [None]:
data['station_code'] = data['desc'].str.split('Station', expand=True)[1].str.split(';', expand=True)[0]
data['station_code'] = data['station_code'].str.replace(':', '')
data['station_code'].head()

In [None]:
data['station_code'] = data['station_code'].str.strip()
data['station_code'].head()
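
The split, replace and strip steps can also be expressed as a single regular expression. This is a sketch only: it assumes the station code follows the word `Station`, optionally after a colon or space, and ends at the next semicolon. The helper name `extract_station` and the sample descriptions are hypothetical.

```python
import re

# Matches the text after "Station", skipping an optional colon or spaces,
# up to (but not including) the next semicolon.
STATION_RE = re.compile(r"Station[:\s]*([^;]+)")

def extract_station(desc):
    """Return the station code from a description string, or None if absent."""
    match = STATION_RE.search(desc)
    return match.group(1).strip() if match else None

# Hypothetical descriptions written in the assumed layout.
samples = [
    "MAIN ST & 1ST AVE; EXAMPLE TWP; Station 332; 2015-12-10 @ 17:10:52;",
    "MAIN ST; EXAMPLE TWP; 2015-12-10 @ 14:39:21-Station:STA27;",
    "MAIN ST; EXAMPLE TWP; 2015-12-10 @ 14:39:21;",
]
print([extract_station(s) for s in samples])  # ['332', 'STA27', None]
```

Such a helper could be passed to `data['desc'].apply(...)`; descriptions without a station yield `None` rather than raising an error.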

####  In the title column, a reason category (department) is specified before the reason detail. The categories are EMS, Fire, and Traffic. So let's split the title into Reason Category and Reason.

In [None]:
data['reason_category'] = data['title'].apply(lambda title: title.split(':')[0])
data['reason_category'].head()

In [None]:
data['reason'] = data['title'].apply(lambda title: title.split(':')[1])
data['reason'].head()
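
Indexing the result of `split(':')` with `[1]` keeps the space that follows the colon and raises an `IndexError` if a title has no colon. A slightly more defensive sketch uses `str.partition`; the helper name `split_title` and the sample titles are hypothetical.

```python
def split_title(title):
    """Split 'CATEGORY: detail' into (category, detail), both stripped."""
    # partition splits on the first colon only and never raises:
    # if there is no colon, the detail part is an empty string.
    category, _, detail = title.partition(":")
    return category.strip(), detail.strip()

# Hypothetical titles in the 'category: detail' layout.
print(split_title("EMS: BACK PAINS/INJURY"))  # ('EMS', 'BACK PAINS/INJURY')
print(split_title("Fire"))                    # ('Fire', '')
```

Stripping the detail matters for later counts, because `' FIRE ALARM'` and `'FIRE ALARM'` would otherwise be treated as different reasons.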

#### So finally our dataset looks like:

In [None]:
data.head(5)

# 4. Exploratory Data Analysis (EDA)

We perform Exploratory Data Analysis (EDA) on the 911 dataset to summarize its main characteristics with the help of summary statistics and graphical representations. EDA helps us explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.

It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting our hands dirty with it.

### Top 10 townships (twp) for 911 calls

In [None]:
data['twp'].value_counts().head(10)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* The maximum number of calls came from LOWER MERION township. We need to analyse further which kinds of emergency are in high demand in this area.

In [None]:
data[data['twp']=='LOWER MERION']['reason_category'].value_counts()

In [None]:
data[data['twp']=='LOWER MERION']['reason'].value_counts().head(10)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* This area has a high demand for emergency services related to traffic and EMS (Emergency Medical Services). So this township might need to improve its road conditions and medical facilities.

### Most called stations for emergency

In [None]:
dfsc = data['station_code'].value_counts().head(10)
dfsc

In [None]:
data[data['station_code'] == "308A"]['reason_category'].value_counts()

In [None]:
data[data['station_code'] == "308A"]['reason'].value_counts().head(10)

In [None]:
plt.figure(figsize=(12,6))
plt.bar(dfsc.index,dfsc.values,width=0.6)
plt.title("Most Called Stations")
plt.xlabel("Station")
plt.ylabel("Number of calls")
plt.tight_layout()

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Most of the calls come to stations 308A, 329, 313, 381, and 345

###  Top 10 zipcodes for 911 calls

In [None]:
dfzip = data['zip'].value_counts().head(10)
dfzip

In [None]:
data['zip'].nunique()
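
Because the column contains missing values, pandas stores zip codes as floats such as `19401.0`. As a small sketch, the nullable integer dtype `Int64` keeps the missing entries while dropping the decimal part. The values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical zip values; the NaN forces pandas to store the column as float.
zips = pd.Series([19401.0, np.nan, 19446.0])

# The nullable integer dtype keeps missing values and drops the ".0".
zips_int = zips.astype("Int64")

print(zips_int)
```

This is purely cosmetic for the counts that follow, but it makes printed zip codes easier to read.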

In [None]:
data[data['zip']==19401.0]['twp'].head(10)

In [None]:
data[data['zip']==19401.0]['reason_category'].value_counts()

In [None]:
data[data['zip']==19401.0]['reason'].value_counts().head()

In [None]:
data[data['zip']==19401.0].shape[0]

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Emergency calls have been made from a total of 197 zip codes.
* The maximum number of emergency calls (43075) was received from Norristown township, which has zip code 19401.
* About 16.8% (7233/43075) of emergency calls in this area are related to vehicle accidents alone. So this area needs to improve road safety to avoid such cases, which can stem from reckless driving, driving in bad weather conditions, running red lights, etc.
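
The percentage quoted above can be checked with a short calculation using the two counts from the highlights.

```python
# Counts quoted in the highlights.
calls_in_zip = 43075       # all calls from zip code 19401
vehicle_accidents = 7233   # calls in that zip code for vehicle accidents

share = vehicle_accidents / calls_in_zip * 100
print(f"{share:.1f}%")  # 16.8%
```

In pandas, `value_counts(normalize=True)` returns such shares for every reason at once.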

### Top 10 dates with the most calls across all years:

In [None]:
data['Date'].value_counts().head(10)

### Busiest year, with the total number of calls received:

In [None]:
data["Year"].value_counts().head(1)

###  Most common Reason for a 911 call 

In [None]:
data['reason_category'].value_counts().head(5)

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['reason_category'],data=data, palette='bright')
plt.title("Emergency call category")

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* People called for emergency medical services (EMS) more than for any other category.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['DayOfWeek'],data=data,hue=data['reason_category'],palette='bright')
plt.title("Emergency calls day wise groupby category")
plt.legend(loc=2, bbox_to_anchor=(1.05, 1))

* The number of emergency calls is almost equal on all days.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['reason_category'],data=data,hue=data['Year'],palette='bright')
plt.title("Emergency call category groupby year")
plt.legend(loc=2, bbox_to_anchor=(1.05, 1))

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Demand for emergency services was highest for EMS in the year 2019 and lowest for Fire.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['Year'],data=data,hue=data['reason_category'],palette='bright')
plt.title("Emergency calls yearly groupby category")
plt.legend(loc=2, bbox_to_anchor=(1.05, 1))

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* The number of emergency calls is almost the same in every year, except for 2015 and 2020. Our dataset contains only one month of data for 2015 and only four months for 2020, so complete-year data is not available for these two years.

**Now I will do the same for Month:**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['Month'],data=data,hue=data['reason_category'],palette='bright')
plt.title("Emergency call month wise groupby category")
plt.legend(loc=2, bbox_to_anchor=(1.05, 1))

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=data['Hour'],data=data,palette='Set2')
plt.title("Emergency calls by hour")

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* From the above graph, we can observe that the maximum number of emergency calls happened at around 5 PM. We need to find out why calls peak at this time and what kind of emergency service is required.

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x=data['Hour'],data=data,hue=data['reason_category'],palette='winter')
plt.title("Emergency call hour wise groupby category")
#plt.legend(loc=2, bbox_to_anchor=(1.05, 1))

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* From the above graph, it is clear that at this time people require emergency services mostly due to traffic-related problems. This may be because people return home from their workplaces at this time and may encounter traffic jams, accidents due to signal jumping, rash driving, etc.

### Top 10 reasons for emergency calls

In [None]:
dfRes = data['reason'].value_counts().head(10)
dfRes

In [None]:
data['reason'].nunique()

In [None]:
plt.figure(figsize=(12, 6))
x = list(dfRes.index)
y = list(dfRes.values)
x.reverse()
y.reverse()

plt.title("Most emergency reasons of calls")
plt.ylabel("Reason")
plt.xlabel("Number of calls")

plt.barh(x,y)
plt.tight_layout()
plt.show()

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* The highest number of emergency calls is due to vehicle accidents.

### Now group the DataFrame by the Month column and use the count() method for aggregation

In [None]:
byMonth = data.groupby('Month').count().sort_values(by='Month',ascending=True)
byMonth.head(12)
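
Note that `count()` reports the number of non-null values in each column, so a column with missing entries (such as `twp`) slightly undercounts the calls. The sketch below contrasts it with `size()`, which counts rows, on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one January row has no township.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2],
    "twp": ["A", np.nan, "B", "A", "C"],
})

per_column = toy.groupby("Month").count()["twp"]  # non-null twp values only
per_row = toy.groupby("Month").size()             # every row in the group

print(per_column.tolist())  # [2, 2]
print(per_row.tolist())     # [3, 2]
```

With only 293 missing townships the difference is small here, but `size()` or a column without missing values gives the exact number of calls.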

**Now create a simple plot off of the dataframe indicating the count of calls per month.**

In [None]:
byMonth['twp'].plot(figsize=(12, 6))
plt.title('Count of calls per month')

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* Emergency services are required most in the months of Jan, Feb, March and Dec, broadly speaking in the winter season. As we have already seen above, people often require traffic-related emergency services, and in winter bad weather, fog and low visibility may cause more such incidents.

### Linear fit on the number of calls per month

**Now see if you can use seaborn's lmplot() to create a linear fit on the number of calls per month. Keep in mind you may need to reset the index to a column.**

In [None]:
# lmplot creates its own figure, so size it with height/aspect instead of plt.figure()
sns.lmplot(x='Month', y='twp', data=byMonth.reset_index(), height=6, aspect=2)
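
The plot shows the fitted line but not its coefficients. As a sketch, a degree-1 `np.polyfit` gives the slope and intercept of an ordinary least-squares line. The monthly counts below are synthetic (an exact straight line), not values from the dataset.

```python
import numpy as np

# Synthetic monthly counts lying exactly on a line; not taken from the dataset.
months = np.arange(1, 13)
calls = 60000.0 - 1000.0 * months

# Degree-1 polyfit is an ordinary least-squares fit:
# calls ~ slope * month + intercept
slope, intercept = np.polyfit(months, calls, 1)

print(f"slope = {slope:.1f} calls per month, intercept = {intercept:.1f}")
```

On the real data, `months` and `calls` would come from `byMonth.reset_index()`. The slope should be read with caution, since the partial years 2015 and 2020 affect the monthly totals.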

### Now groupby the Date column with the count() aggregate and create a plot of counts of 911 calls.

In [None]:
byDate = data.groupby('Date').count().sort_values(by='Date',ascending=True)
byDate.head()

In [None]:
byDate['twp'].plot(figsize=(12,6))
plt.xticks(rotation=45)
plt.tight_layout()
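
Grouping by a date column skips days without any calls, and the raw daily series is noisy. A possible alternative, sketched here on hypothetical timestamps, is `resample('D')`, which creates a bin for every calendar day, followed by a rolling mean for smoothing.

```python
import pandas as pd

# Hypothetical timestamps; note that 2016-01-02 has no calls at all.
ts = pd.to_datetime([
    "2016-01-01 08:00", "2016-01-01 17:30",
    "2016-01-03 09:15",
])
calls = pd.Series(1, index=ts)

# resample('D') creates one bin per calendar day, including empty days.
daily = calls.resample("D").sum()
print(daily.tolist())  # [2, 0, 1]

# A rolling mean smooths day-to-day noise (window of 2 here; 7 or 30
# would be more typical on the full data).
smooth = daily.rolling(window=2).mean()
print(smooth.tolist())
```

The first value of the rolling mean is `NaN` because the window is not yet full.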

#### Now recreate this plot but create 3 separate plots with each plot representing a Reason for the 911 call

In [None]:
data[data['reason_category']=='Traffic'].groupby('Date').count()['twp'].plot(figsize=(12,6))
plt.title('Traffic')
plt.tight_layout()

In [None]:
data[data['reason_category']=='Fire'].groupby('Date').count()['twp'].plot(figsize=(12,6))
plt.title('Fire')
plt.tight_layout()

In [None]:
data[data['reason_category']=='EMS'].groupby('Date').count()['twp'].plot(figsize=(12,6))
plt.title('EMS')
plt.tight_layout()

### Heatmap

Now let's move on to creating heatmaps with seaborn and our data. We first need to restructure the dataframe so that the columns become the Hours and the index becomes the Day of the Week. I will do this by combining groupby with the [unstack](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) method.

In [None]:
dayHour = data.groupby(['DayOfWeek','Hour']).count()['reason_category'].unstack()
dayHour
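
Because the day names are strings, the grouped index is sorted alphabetically (Fri, Mon, Sat, ...). The sketch below, on a hypothetical toy frame with the same column names, builds the day-by-hour table and then uses `reindex` to restore calendar order.

```python
import pandas as pd

# Hypothetical toy records using the same column names as the notebook.
toy = pd.DataFrame({
    "DayOfWeek": ["Mon", "Fri", "Fri", "Sun", "Mon", "Wed"],
    "Hour": [8, 17, 17, 23, 17, 8],
    "reason_category": ["EMS", "Traffic", "EMS", "Fire", "EMS", "EMS"],
})

# size() counts rows per (day, hour); fill_value=0 replaces missing pairs.
table = toy.groupby(["DayOfWeek", "Hour"]).size().unstack(fill_value=0)

# reindex restores calendar order and adds absent days as rows of zeros.
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
table = table.reindex(order, fill_value=0)

print(table)
```

A table ordered this way makes the heatmap rows read from Monday to Sunday, which is easier to interpret than alphabetical order.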

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(dayHour,cmap='viridis',linewidths=.1)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* The heatmap above shows that fewer emergency services are required at night and in the early morning, while more are required in the evening, and even more on Fridays.

Now let's create a cluster map to check this more precisely.

#### Now create a clustermap using this DataFrame.

In [None]:
# clustermap creates its own figure, so pass figsize directly instead of calling plt.figure()
sns.clustermap(dayHour, cmap='viridis', linewidths=.1, figsize=(12, 6))

#### Now repeat these same plots and operations, for a DataFrame that shows the Month as the column.

In [None]:
dayMonth = data.groupby(by=['DayOfWeek','Month']).count()['reason_category'].unstack()
dayMonth

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(dayMonth,cmap='viridis',linewidths=.1)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

* People called for emergency services most on Fridays in March.

In [None]:
# clustermap creates its own figure, so pass figsize directly instead of calling plt.figure()
sns.clustermap(dayMonth, cmap='viridis', linewidths=.1, figsize=(12, 6))

<p style="font-weight: bold;color:#FF4500">Highlights</p>

This cluster map gives us more precise information.
* Fewer emergency services were required on Wednesdays in July.
* People called for emergency services most on Fridays in March.

<p style="font-weight:bold;color:#1E90FF;font-size:20px">I welcome comments, suggestions, corrections and of course votes also.</p>