<hr/>

# Emergency 911 Calls - Exploratory Data Analysis
**[by Tomas Mantero](https://www.kaggle.com/tomasmantero)**
<hr/>

### Table of Contents
* **1. [Introduction](#ch1)**
* **2. [Dataset](#ch2)**
    * 2.1 Imports
    * 2.2 Get the Data
    * 2.3 Analyze by Describing Data
* **3. [Creating New Features](#ch3)**
* **4. [Exploratory Data Analysis (EDA)](#ch4)**
* **5. [Overall 911 Emergency Calls](#ch5)**
* **6. [Feature Engineering](#ch6)**
* **7. [Heatmap](#ch7)**
* **8. [Clustermap](#ch8)**
* **9. [References](#ch9)**

<a id="ch1"></a>
# 1. Introduction 
<hr/>

One of the objectives of this notebook is to **show step-by-step how to analyze and visualize the dataset to better understand 911 calls and what originates them.** Moreover, we are going to explain most of the concepts used so that you understand why we are using them. 

If you have a question or feedback, do not hesitate to write and if you like this kernel,<b><font color='green'> please upvote! </font></b>

The following questions will be answered throughout the Kernel:
* ***Which features are available in the dataset?***
* ***How many rows and columns does the dataset have?***
* ***Which features are categorical?***
* ***Which features are numerical?***
* ***Which features contain blank, null or empty values?***
* ***What are the data types for various features?***
* ***How many zip codes does the dataset have?***
* ***What are the top 5 zip codes for 911 calls?***
* ***What are the top 5 townships (twp) for 911 calls?***
* ***How many unique emergency title codes are there?***
* ***What is the most common reason for a 911 call?***



<img src="https://images.pexels.com/photos/3584101/pexels-photo-3584101.jpeg?auto=compress&cs=tinysrgb&h=750&w=1260" title="source: www.pexels.com" width="600" height="600"/>
<br>

<a id="ch2"></a>
# 2. Dataset
<hr/>

* This dataset contains emergency calls from Montgomery County, PA. 
* It includes calls from 2015 to 2020. 

***Montgomery County***

Montgomery County, locally also referred to as Montco, is a county located in the Commonwealth of Pennsylvania. As of the 2010 census, the population was 799,874, making it the third-most populous county in Pennsylvania, after Philadelphia and Allegheny Counties. The county seat is Norristown. Montgomery County is very diverse, ranging from farms and open land in Upper Hanover to densely populated rowhouse streets in Cheltenham.

***911 Calls***

Created by Congress in 2004 as the 911 Implementation and Coordination Office (ICO), the National 911 Program is housed within the National Highway Traffic Safety Administration at the U.S. Department of Transportation and is a joint program with the National Telecommunication and Information Administration in the Department of Commerce.

***Feature Columns***
    
* **lat:** Float variable, Latitude
* **lng:** Float variable, Longitude
* **desc:** String variable, Description of the Emergency Call
* **zip:** Float variable, ZIP Code
* **title:** String variable, Title of Emergency
* **timeStamp:** String variable, Date and time of the call, YYYY-MM-DD HH:MM:SS
* **twp:** String variable, Township
* **addr:** String variable, General Address
* **e:** Integer variable, Dummy variable (always 1)

## 2.1 Imports

In [None]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 2.2 Get the Data
The Python pandas package helps us work with our datasets. We start by loading the data into a pandas DataFrame.

In [None]:
df = pd.read_csv('../input/montcoalert/911.csv')

## 2.3 Analyze by Describing Data
Pandas also helps us describe the dataset, answering the following questions early in our project.

***Which features are available in the dataset?***

In [None]:
print(df.columns.values)

***How many rows and columns does the dataset have?***

In [None]:
print('Rows     :',df.shape[0])
print('Columns  :',df.shape[1])

***Which features are categorical?***

* Categorical: desc, zip, title, twp, addr, e.

***Which features are numerical?***

* Continuous: lat, lng.
* Discrete: timeStamp.

In [None]:
# preview the data
df.head()

### Dropping column e

In [None]:
df = df.drop('e',axis=1)

***Which features contain blank, null or empty values?***

We can check for missing values with pandas `isnull()`, which flags each value as missing or not. Summing the flags gives the number of missing values, either in total or per column.
* There are 80492 missing values in the dataset. 
* For the purpose of this notebook it is not necessary to drop or fill the missing values, as we will only do a visual analysis.

In [None]:
# missing values
print('Missing values:',df.isnull().values.sum())

df.isnull().sum()
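
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `toy` name and its values are assumptions for illustration, not rows from the dataset); the same two lines should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 911 data (values are made up for illustration)
toy = pd.DataFrame({
    'zip': [19401.0, np.nan, 19446.0, np.nan],
    'twp': ['NORRISTOWN', 'LANSDALE', None, 'ABINGTON'],
    'lat': [40.12, 40.24, 40.10, 40.11],
})

# Number of missing values per column, and the percentage of rows affected
missing = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({'missing': missing, 'percent': missing_pct})
print(summary.sort_values('missing', ascending=False))
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.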

***What are the data types for various features?***

Three features are floats (lat, lng and zip) and five are objects (strings).

In [None]:
df.info()
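
If we want the column names grouped by type rather than reading them off the `info()` output, `select_dtypes` can do it. This is a sketch on a made-up frame (`toy` and its values are assumptions for illustration); on the real data we would call it on `df`.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns (values are made up)
toy = pd.DataFrame({
    'lat': [40.12, 40.24],
    'lng': [-75.35, -75.28],
    'zip': [19401.0, 19446.0],
    'title': ['EMS: BACK PAINS/INJURY', 'Fire: GAS-ODOR/LEAK'],
    'twp': ['NORRISTOWN', 'LANSDALE'],
})

# Numeric columns versus everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Numeric:', numeric_cols)
print('Other  :', other_cols)
```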

***How many zip codes does the dataset have?***

In [None]:
df['zip'].nunique()

***What are the top 5 zip codes for 911 calls?***

In [None]:
# name the counts column explicitly so the label does not depend on the pandas version
df_zip = df['zip'].value_counts().head(5).rename('Top 5').to_frame()
df_zip.style.background_gradient(cmap='Blues')

***What are the top 5 townships (twp) for 911 calls?***

In [None]:
# name the counts column explicitly so the label does not depend on the pandas version
df_twp = df['twp'].value_counts().head(5).rename('Top 5').to_frame()
df_twp.style.background_gradient(cmap='Greens')
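
A top 5 list says more when each count is shown as a share of all calls. Here is a small sketch using made-up township values (the `twp` series below is an assumption for illustration, not the real column):

```python
import pandas as pd

# Made-up township column for illustration
twp = pd.Series(['LOWER MERION', 'ABINGTON', 'LOWER MERION',
                 'NORRISTOWN', 'LOWER MERION', 'ABINGTON'])

# Top 2 townships by number of calls
top = twp.value_counts().head(2).rename('calls').to_frame()

# Share of all calls, in percent
top['share_pct'] = (top['calls'] / len(twp) * 100).round(1)
print(top)
```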

***How many unique emergency title codes are there?***

In [None]:
df['title'].nunique()

<a id="ch3"></a>
# 3. Creating new features
<hr/>

### Reason feature
In the `title` column there are "Reasons/Departments" specified before the title code. These are **EMS, Fire, and Traffic.** We are going to use `.apply()` with a custom lambda expression to create a new column called `reason` that contains this string value.

For example, if the title column value is *EMS: BACK PAINS/INJURY*, the `reason` column value would be *EMS*.

In [None]:
df['reason'] = df['title'].apply(lambda title: title.split(':')[0])

### Title_code feature
Using the same method from above, we are going to create a column with just the title code. 

In [None]:
# split on the first colon only and strip the space that follows it
df['title_code'] = df['title'].apply(lambda title: title.split(':', 1)[1].strip())
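
The same two columns can also be built in one pass with the vectorized string methods. This is only a sketch of an alternative, run on a few illustrative title strings (the `titles` series is an assumption, not the real column):

```python
import pandas as pd

# A few illustrative titles in the "Reason: code" format
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
])

# Split once on the first colon; expand=True returns one column per part
parts = titles.str.split(':', n=1, expand=True)
reason = parts[0].str.strip()
title_code = parts[1].str.strip()

print(pd.DataFrame({'reason': reason, 'title_code': title_code}))
```

Splitting only once keeps any later colon inside the title code instead of cutting the code short.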

<a id="ch4"></a>
# 4. Exploratory Data Analysis (EDA)
<hr/>

***What is the most common reason for a 911 call, based on the new `reason` column?***
* The number one reason for 911 calls is Emergency Medical Services (EMS).
* Almost half of the calls are for EMS.

In [None]:
df['reason'].value_counts()

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15, 5))

sns.countplot(x='reason', data=df, order=df['reason'].value_counts().index, ax=axes[0])
axes[0].set_title('Common Reasons for 911 Calls', size=13)
axes[0].set(xlabel='Reason', ylabel='Count')

df['reason'].value_counts().plot.pie(autopct='%1.1f%%',ax=axes[1],shadow=True)
axes[1].set(xlabel='', ylabel='')

sns.despine(bottom=False, left=True)

<a id="ch5"></a>
# 5. Overall 911 Emergency Calls
<hr/>

The bar chart shows the **top 10 emergency calls** across all categories.
* Vehicle accidents are the number one reason people call 911. 
* Disabled vehicles and fire alarms are in second and third place. 

In [None]:
fig, axes = plt.subplots(figsize=(10, 5))
# keep only the 10 most frequent titles, most common first
top_titles = df['title'].value_counts().index[:10]
sns.countplot(y='title', data=df, order=top_titles, palette='prism', ax=axes)
sns.despine(bottom=False, left=True)
axes.set_title('Overall 911 Emergency Calls', size=15)
axes.set(xlabel='Number of 911 Calls', ylabel='')
plt.tight_layout()

### Traffic 911 Emergency Calls
* The most common emergency titles are vehicle accident, disabled vehicle and road obstruction.

In [None]:
df[df['reason']=='Traffic'].groupby('title_code').count()['lat'].sort_values(ascending=True).plot(kind='barh', figsize=(10, 5), color='darkblue')
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('Traffic 911 Emergency Calls', fontsize=15)

### Fire 911 Emergency Calls
* The most common emergency titles are fire alarm, vehicle accident and fire investigation.

In [None]:
df[df['reason']=='Fire'].groupby('title_code').count()['lat'].sort_values(ascending=True).tail(10).plot(kind='barh', figsize=(10, 5), color='darkred')
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('Fire 911 Emergency Calls', fontsize=15)

### EMS 911 Emergency Calls
* The most common emergency titles are fall victim, respiratory emergency and cardiac emergency.

In [None]:
df[df['reason']=='EMS'].groupby('title_code').count()['lat'].sort_values(ascending=True).tail(10).plot(kind='barh', figsize=(10, 5), color='darkgreen')
plt.xlabel('Number of 911 Calls')
plt.ylabel('')
plt.title('EMS 911 Emergency Calls', fontsize=15)

<a id="ch6"></a>
# 6. Feature Engineering
<hr/>

### TimeStamp feature
We are going to use [pd.to_datetime](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) to convert the timeStamp column from strings to datetime objects. Then we use the `.dt` accessor to create 3 new columns called Hour, Month, and Day of Week.

In [None]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])

# the .dt accessor exposes datetime components for the whole column at once
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek

Notice how the Day of Week is an integer 0-6. Let's map the actual string names to the day of the week:

In [None]:
# dictionary string names
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

df['Day of Week'] = df['Day of Week'].map(dmap)
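
As an alternative to the dictionary, pandas can return the day names directly. This is a minimal sketch on three made-up timestamps (the `stamps` series is an assumption for illustration); it assumes the default English day names.

```python
import pandas as pd

# Made-up timestamps in the same format as the timeStamp column
stamps = pd.to_datetime(pd.Series([
    '2015-12-10 17:10:52',  # a Thursday
    '2016-01-01 08:00:00',  # a Friday
    '2020-07-05 23:59:59',  # a Sunday
]))

hours = stamps.dt.hour
months = stamps.dt.month

# day_name() gives 'Thursday', 'Friday', ...; keep the first three letters
day_abbr = stamps.dt.day_name().str[:3]

print(pd.DataFrame({'Hour': hours, 'Month': months, 'Day of Week': day_abbr}))
```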

###  Weekly and monthly calls
* Friday appears to be the day with the most calls. 
* Regarding the monthly calls, there appear to be more calls during the first half of the year.

In [None]:
fig, axes = plt.subplots(1,2, figsize=(15,5))

sns.countplot(x='Day of Week', data=df, palette='viridis', ax=axes[0])
axes[0].set_title('Weekly Calls', size=15)

sns.countplot(x='Month', data=df, hue='reason', palette='viridis', ax=axes[1])
axes[1].set_title('Monthly Calls', size=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)

sns.despine(bottom=False, left=True)
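
By default the days may not come out in calendar order, which makes the weekly pattern harder to read. A sketch of one way to fix the order, using a made-up series of day labels (`days` is an assumption for illustration):

```python
import pandas as pd

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Made-up day labels standing in for the 'Day of Week' column
days = pd.Series(['Fri', 'Mon', 'Fri', 'Sun', 'Wed', 'Mon', 'Fri'])

# Count the calls per day, then put the days in calendar order;
# days that never occur get a count of 0
counts = days.value_counts().reindex(day_order, fill_value=0)
print(counts)
```

For the plot itself, `sns.countplot` accepts the same list through its `order` parameter, for example `order=day_order`.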

### Date feature
Create a new column called `Date` that contains only the date part of the timeStamp column.

In [None]:
df['Date'] = df['timeStamp'].dt.date

Now we group by the Date column with the `count()` aggregate and plot the number of 911 calls per day for each reason.

### Traffic

In [None]:
df[df['reason']=='Traffic'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkblue')
plt.title('Traffic', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()

### Fire

In [None]:
df[df['reason']=='Fire'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkred')
plt.title('Fire', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()

### EMS

In [None]:
df[df['reason']=='EMS'].groupby('Date').count()['lat'].plot(figsize=(15,5), color='darkgreen')
plt.title('EMS', fontsize=15)
sns.despine(bottom=False, left=True)
plt.tight_layout()
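
Daily counts can be noisy, so it may help to look at the same series per month as well. The following sketch uses four made-up timestamps (the `stamps` series is an assumption for illustration) to show both aggregations:

```python
import pandas as pd

# Made-up call timestamps for illustration
stamps = pd.to_datetime(pd.Series([
    '2016-01-01 08:00', '2016-01-01 09:30', '2016-01-02 10:00',
    '2016-02-15 12:00',
]))

# Calls per calendar day
daily = stamps.groupby(stamps.dt.date).size()

# Calls per month: one row per call, indexed by time, then resampled
# ('MS' labels each bin with the first day of the month)
monthly = pd.Series(1, index=pd.DatetimeIndex(stamps)).resample('MS').sum()

print(daily)
print(monthly)
```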

<a id="ch7"></a>
# 7. Heatmap
<hr/>

Now let's create a heatmap with seaborn. We first need to restructure the DataFrame so that the columns are the hours and the index is the day of the week. There are several ways to do this; here we combine `groupby` with `unstack`.
* In the heatmap we can see that there are more calls between 14:00 and 17:00. 
* Friday and Wednesday have more calls.
* The calls appear to drop on Sunday.

In [None]:
dayHour = df.groupby(by=['Day of Week', 'Hour']).count()['reason'].unstack()
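
To make the reshaping step concrete, here is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration). It also puts the rows in calendar order and shows that `pd.crosstab` builds the same table of counts.

```python
import pandas as pd

# Made-up calls: one row per call
toy = pd.DataFrame({
    'Day of Week': ['Mon', 'Mon', 'Fri', 'Fri', 'Fri', 'Sun'],
    'Hour':        [9,     17,    17,    17,    9,     3],
})
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# size() counts rows per (day, hour) pair; unstack() pivots Hour into columns
grid = (toy.groupby(['Day of Week', 'Hour']).size()
           .unstack(fill_value=0)
           .reindex(day_order, fill_value=0))

# pd.crosstab produces the same counts in a single call
alt = pd.crosstab(toy['Day of Week'], toy['Hour']).reindex(day_order, fill_value=0)

print(grid)
```

Filling the empty combinations with 0 avoids blank cells in the heatmap.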

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(dayHour, cmap='viridis', linewidths=0.05)

<a id="ch8"></a>
# 8. Clustermap
<hr/>

In [None]:
# clustermap creates its own figure, so the size is passed through figsize
sns.clustermap(dayHour, cmap='viridis', linewidths=0.05, figsize=(12, 6))

<a id="ch9"></a>
# 9. References
---
* [An Introduction to Statistical Learning with Applications in R](http://faculty.marshall.usc.edu/gareth-james/ISL/) - This book provides an introduction to statistical learning methods.
* [Python for Data Science and Machine Learning Bootcamp](https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/) - Use Python for Data Science and Machine Learning.

# Feedback
* **Your feedback is much appreciated**
* **<b><font color='green'>Please UPVOTE if you LIKE this notebook</font></b>**
* **Comment if you have any doubts or you found any errors in the notebook**