## 1. Defining the problem

This dataset contains records of all calls made to the emergency helpline number 911. It includes the type of emergency for which each call was made, the location from which the call was made, and the date and time of the call.

People face emergencies for many different reasons. We can analyze this data to look for ways to reduce them.

For this purpose, we can find out which types of emergency people face most often. We can also determine the time periods in which different emergencies occur, or whether particular locations have the same problems over and over again.



## 2. Asking the questions

* What are the different types of emergency in the dataset?

* Which categories of emergency account for the highest number of calls?

* In which timeframe (hour, day, month and year) was the maximum number of calls made?

* Which category of emergency had the most calls in those timeframes?

* At which locations were the most frequent types of emergency calls made?



**Note:** If any further questions arise while proceeding through the analysis, note them down here and try to answer them.




## 3. Importing the data and checking for consistency

In [None]:
#Importing numpy and pandas
import numpy as np
import pandas as pd

In [None]:
#Importing dataset
df=pd.read_csv('../input/montcoalert/911.csv')

In [None]:
#Viewing the dataset
df.head()

In [None]:
#Checking columns name, null values and data types of each column
df.info()

We can see that the columns 'zip', 'twp', 'addr' and 'e' have abbreviated names that are not descriptive. There are also null values in the 'zip' and 'twp' columns. Moreover, the 'zip' and 'timeStamp' columns are not stored with appropriate data types.

We will rename these columns. The 'e' column needs to be checked and verified first, as it contains data that doesn't look correct.

We will then check the null values and decide whether keeping or removing the affected rows would affect our analysis.

Next, we will convert the columns to the proper data types.

We can also check the consistency of the time period. E.g. if the data runs from 2009 to 2015, we can check whether the first year (2009) and the last year (2015) have data available for all months. If not, we should consider removing those years, since partial years might affect our analysis.

Finally, we will remove extra spaces from every new column we create.




## 4. Data Cleaning

#### Renaming columns

In [None]:
#Renaming the column as follows:'zip' to 'zipcode', 'twp' to 'township',  'addr' to 'address'
df.rename({'zip':'zipcode', 'twp':'township', 'addr':'address'}, axis=1, inplace=True)

In [None]:
#Checking if the column name was successfully changed.
df.head(3)

#### Deleting unnecessary columns

In [None]:
#Checking unique values in column with name 'e'
df.e.unique()

Since there is only one unique value in the entire 'e' column, the column is of no use. We will delete it.

In [None]:
#deleting column 'e'
del df['e']

In [None]:
#Checking if the column is deleted
df.head(3)

#### Handling null values

In [None]:
#Checking for null values in percentage
df.isna().sum()*100 / len(df)

We can see that 12.086% of the values in the 'zipcode' column are null, compared with only 0.044% in the 'township' column.

We can remove the rows with null 'township' values, because dropping such a small share of rows won't affect our analysis.

For the 'zipcode' column, we will inspect the affected rows before deciding whether removing them would affect our further analysis.

In [None]:
#Deleting rows with null values in township column
df= df.dropna(subset=['township'])

In [None]:
#Checking rows with null values in other columns. In this case, checking nulls in the zipcode column
df[df.isnull().any(axis=1)]

We can see that the rows with null values in the 'zipcode' column have complete and unique information in their other columns.
So we will not delete those rows for now, because doing so would lose a large amount of data.
Instead, we will replace the null values with zero to keep the rest of the analysis simple.
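As a side note, here is a minimal sketch of the two options on a toy column (the values are made up for illustration and are not taken from the dataset). Filling with 0 is what this notebook does; the nullable `Int64` dtype is an alternative, not used here, that keeps missing values marked as missing while still storing integers.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({"zipcode": [19401.0, None, 19446.0]})

# Option used in this notebook: replace missing zip codes with 0.
filled = toy["zipcode"].fillna(0).astype("int64")

# Alternative (an assumption, not the notebook's choice): keep missing
# values as <NA> using pandas' nullable integer dtype.
nullable = toy["zipcode"].astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

With the first option, any later grouping by zip code will contain a bucket for 0 that really means "unknown".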

In [None]:
#Replacing null values with 0 in 'zipcode' column
#(assigning back avoids chained in-place modification, which newer pandas versions warn about)
df['zipcode'] = df['zipcode'].fillna(0)

In [None]:
#Verifying that the null values were removed or replaced
df.isna().sum()

#### Correcting column format

In [None]:
#We will format the columns as follows: 'zipcode' as int64, 'timeStamp' as datetime64[ns]
#(recent pandas versions require an explicit unit such as [ns] for datetime64)
df = df.astype({'zipcode':'int64', 'timeStamp':'datetime64[ns]'})

In [None]:
#Checking if the 'timeStamp' column was converted to datetime
df.info()

#### Checking for time consistency

To check the time consistency, we will split the timestamp into four separate columns: 'Hour', 'Day' (day of the week), 'Month' and 'Year'. This will also help us later in the analysis.

In [None]:
#creating four new columns from the 'timestamp' column.
df['Hour'] = pd.to_datetime(df['timeStamp']).dt.hour
df['Day'] = pd.to_datetime(df['timeStamp']).dt.dayofweek
df['Month'] = pd.to_datetime(df['timeStamp']).dt.month
df['Year'] = pd.to_datetime(df['timeStamp']).dt.year

In [None]:
#Checking if new columns were inserted
df.head(3)

We can see that the new columns have been created, but the 'Day' and 'Month' columns contain numbers instead of names. So we will convert these numbers into their respective names.

We will first check the values of the 'Day' and 'Month' columns and map them accordingly.
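For comparison, pandas can also produce the names directly from a datetime column. The sketch below uses two made-up timestamps; note that `dt.month_name()` returns full month names, so it would not reproduce the abbreviated labels used in this notebook without further slicing or mapping.

```python
import pandas as pd

# Two illustrative timestamps (not taken from the dataset).
ts = pd.Series(pd.to_datetime(["2016-01-04 08:00:00", "2016-07-10 21:30:00"]))

# Full weekday names shortened to three letters, e.g. 'Monday' -> 'Mon'.
day_names = ts.dt.day_name().str[:3]

# Full month names, e.g. 'January'.
month_names = ts.dt.month_name()

print(day_names.tolist())
print(month_names.tolist())
```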

In [None]:
# checking the unique values in the 'Day' column
df.Day.unique()

In [None]:
# checking the unique values in the 'Month' column
df.Month.unique()

In [None]:
#mapping 'Day' column to day names
day_map = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day'] = df['Day'].map(day_map)

In [None]:
#mapping 'Month' column to month names
month_map = {1:'Jan',2:'Feb',3:'March',4:'April',5:'May',6:'June',7:'July',8:'Aug',9:'Sept',10:'Oct',11:'Nov',12:'Dec'}
df['Month'] = df['Month'].map(month_map)

In [None]:
#checking the changes
print(df.Day.unique())
print(df.Month.unique())

Now we can check for consistency by plotting the 'Year' column with seaborn.

In [None]:
#Importing seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Plotting a bar plot of the count of different values in 'Year' column.
sns.countplot(x ='Year', data = df)

From this plot we can see that there are very few records for 2015. The year 2020 also has only about half as many records as the other years. This is probably because data collection started in the last part of 2015 and covers only the first few months of 2020, so both years are incomplete.

We will remove the data for these two years, because doing so keeps the year-to-year comparisons accurate and consistent throughout our analysis.
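A more direct way to confirm which years are incomplete is to count the distinct months recorded in each year. This is a minimal sketch on a toy frame with made-up timestamps; on the real data, the same `groupby` would be applied to the 'timeStamp' column.

```python
import pandas as pd

# Toy timestamps for illustration only.
toy = pd.DataFrame({
    "timeStamp": pd.to_datetime([
        "2015-12-10 17:10:52", "2015-12-11 08:00:00",
        "2016-01-05 09:30:00", "2016-02-14 12:00:00", "2016-03-01 18:45:00",
    ])
})

# Number of distinct calendar months that have at least one call, per year.
months_per_year = (
    toy.assign(year=toy["timeStamp"].dt.year, month=toy["timeStamp"].dt.month)
       .groupby("year")["month"]
       .nunique()
)

# Years covering fewer than 12 distinct months are only partially recorded.
partial_years = months_per_year[months_per_year < 12].index.tolist()
print(months_per_year)
print(partial_years)
```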

In [None]:
#Deleting rows with 2015 and 2020 in 'Year' column
df = df[~df.Year.isin([2015, 2020])]

In [None]:
#Viewing our cleaned data
df.tail()

##### This is the cleaned data on which we will perform our further analysis

## 5. Performing Analysis

### Q. What are the different types of emergency in the dataset?

To answer this question we have to find the different types of emergencies from the 'title' column.

In [None]:
# Checking the 20 most frequent emergency types
df['title'].value_counts().head(20)

#### Ans. The main types of emergency calls are: traffic-related calls, EMS (emergency medical service) calls, and fire-related calls.

### Q. Which categories of emergency account for the highest number of calls?

We can see from the output above that there are three main categories of emergency, 'Traffic', 'EMS' (emergency medical service) and 'Fire', along with various sub-categories. So, we will separate the main categories from the sub-categories by creating two new columns, 'category' and 'sub_category'.
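To illustrate the split on its own, here is a small sketch using three example titles in the same "category: sub-category" shape (the `titles` series is a stand-in, not the real column). Passing `n=1` splits only at the first colon, so the result always has exactly two columns even if a sub-category were to contain a colon itself.

```python
import pandas as pd

# Stand-in for the 'title' column; illustrative values only.
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "Traffic: VEHICLE ACCIDENT -",
    "Fire: GAS-ODOR/LEAK",
])

# n=1 splits at the first ':' only, so there are always two columns.
parts = titles.str.split(":", n=1, expand=True)

# Strip the space that follows the colon (and any other stray whitespace).
category = parts[0].str.strip()
sub_category = parts[1].str.strip()

print(category.tolist())
print(sub_category.tolist())
```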

In [None]:
# Splitting 'title' column into two columns at the first ':' only
df[['category','sub_category']] = df.title.str.split(':', n=1, expand=True)

In [None]:
#checking if the split was done
df.head(3)

Now we have two separate columns for the main category and the sub-category of each emergency.

We will plot these columns to find insights and answer our question.

In [None]:
#finding unique values in 'category' column in percentage
(df['category'].value_counts()/len(df['category']))*100

In [None]:
#plotting unique values in 'category' column
sns.countplot(x ='category', data = df)

##### Ans. The highest number of calls was made for EMS/medical service, followed by traffic-related calls. Comparatively few calls were made for fire-related service.

In [None]:
#finding the share of each value in the 'sub_category' column in percentage
((df['sub_category'].value_counts()/len(df['sub_category']))*100).head(30)

Here we notice that 'VEHICLE ACCIDENT' appears twice, in first place and in fourth place. This is because the first entry has a ' -' at the end, so two separate values are created: 'VEHICLE ACCIDENT -' and 'VEHICLE ACCIDENT'.

So, we will remove the extra space and '-' from those values. We will do this for the entire column because there are more values like this.
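The effect of this cleanup can be seen on a toy series (illustrative values only): before the replacement, the two spellings are counted separately; afterwards they merge into one label. `regex=False` makes the replacement a plain substring match.

```python
import pandas as pd

# Illustrative sub-category labels, some with a trailing ' -'.
sub = pd.Series([
    "VEHICLE ACCIDENT -", "VEHICLE ACCIDENT",
    "DISABLED VEHICLE -", "FIRE ALARM",
])

before = sub.value_counts()

# Remove the literal ' -' and any leftover surrounding whitespace.
cleaned = sub.str.replace(" -", "", regex=False).str.strip()
after = cleaned.value_counts()

print(len(before), len(after))
```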

In [None]:
#removing the literal " -" from all the cells in 'sub_category' column
df['sub_category'] = df['sub_category'].str.replace(" -", "", regex=False)

In [None]:
#Again checking the share of each value in the 'sub_category' column in percentage
((df['sub_category'].value_counts()/len(df['sub_category']))*100).head(30)

Previously 'VEHICLE ACCIDENT' accounted for only 22.98% of calls; after the correction it accounts for 28.55%. This is an important correction, since leaving the duplicate labels in place would have distorted our analysis.

Now let's proceed to plotting these values.

In [None]:
#plotting the 30 most frequent values in the 'sub_category' column in descending order
plt.figure(figsize=(8,10))
sns.countplot(y ='sub_category', data = df, order=df['sub_category'].value_counts().iloc[:30].index)

#### Ans. We can see that almost 28% of the calls were made for VEHICLE ACCIDENT, followed by DISABLED VEHICLE at 7%. Next come FIRE ALARM, FALL VICTIM, RESPIRATORY EMERGENCY and CARDIAC EMERGENCY at 5%, and ROAD OBSTRUCTION, SUBJECT IN PAIN and HEAD INJURY at almost 3%. The rest are at 1% or less.

We will consider only those sub-categories with a call share of 1% or above.

### Q. In which timeframe (hour, day, month and year) was the maximum number of calls made?

To answer this question we will plot the number of calls made during hours of the day, during each day, during each month, and during each year.

We will look for patterns in each plot and find insights.
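One practical detail when counting by month or day name: string labels are not ordered chronologically by default. A possible approach, sketched here on a toy column using the same month labels as `month_map`, is an ordered categorical so that counts come out in calendar order. The `month_order` list and the toy values are assumptions for illustration.

```python
import pandas as pd

# Calendar order, using the same labels as month_map in this notebook.
month_order = ["Jan", "Feb", "March", "April", "May", "June",
               "July", "Aug", "Sept", "Oct", "Nov", "Dec"]

# Toy 'Month' column; values are illustrative only.
toy = pd.DataFrame({"Month": ["March", "Jan", "March", "Dec", "Jan", "March"]})
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)

# Sorting by the categorical index yields calendar order, not count order.
calls_per_month = toy["Month"].value_counts().sort_index()
print(calls_per_month)
```

The same `month_order` list could also be passed as the `order` argument of `sns.countplot`.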

Plotting calls made at each hour of the day. Note: 0 represents midnight (12 am).

In [None]:
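#Plotting the number of calls for each hour of the day.
#(x is passed by keyword because recent seaborn versions no longer accept it positionally)
plt.figure(figsize=(8,5))
sns.countplot(x='Hour', data=df)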

Here we can clearly see that the number of calls made between 22:00 at night and 6:00 in the morning is far lower than during the rest of the day.

This is because it is nighttime and most people are asleep during those hours.

We can see a gradual rise in the number of calls from 8 am up to 5 pm. This may be because these are standard working hours, and people may run into problems while they work.

Previously we found that 28% of emergency calls were made for vehicle accidents.

This suggests, though it does not prove, that many of these accidents involve people driving in a rush during their working hours.

In [None]:
#Plotting the number of calls for each day of the week.
plt.figure(figsize=(8,5))
sns.countplot(x='Day', data=df)

We can see that the number of calls made on Saturday and Sunday is much lower than on the other days.

This may be because most people don't go to work on those days.

This is consistent with the idea that some emergencies occur because people are going to work, or because of problems they face during their working hours.

In [None]:
#Plotting the number of calls for each month.
plt.figure(figsize=(8,5))
sns.countplot(x='Month', data=df)

We do not see much change or any clear pattern across months.

In [None]:
#Plotting the number of calls for each year present in our dataset.
plt.figure(figsize=(8,5))
sns.countplot(x='Year', data=df)

We do not see much change or any clear pattern across years.

#### Ans. Most calls were made during waking hours, i.e. between 6 am and 10 pm. The gradual rise in calls from 8 am to 5 pm, together with vehicle accidents being the most frequent reason for a call, suggests that many people are in a hurry during their working hours.

### Q. Which category of emergency had the most calls in those timeframes?

In [None]:
#Plotting hourly calls, split by category
plt.figure(figsize=(15,5))
sns.countplot(x='Hour', hue='category', data=df)

The plot above is easier to read if we plot the three categories separately.
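Alongside the plots, the same hour-by-category breakdown can be read as a table. This is a minimal sketch with `pd.crosstab` on a toy frame (illustrative values only); each column holds the hourly counts for one category, and `idxmax` gives the busiest hour per category.

```python
import pandas as pd

# Toy frame with the same column names as the notebook; values are made up.
toy = pd.DataFrame({
    "Hour": [8, 8, 17, 17, 17, 23],
    "category": ["EMS", "Traffic", "Traffic", "Traffic", "EMS", "Fire"],
})

# Rows are hours, columns are categories, cells are call counts.
hourly = pd.crosstab(toy["Hour"], toy["category"])

# Hour with the most calls for each category (first one in case of ties).
peak_hour = hourly.idxmax()

print(hourly)
print(peak_hour)
```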

Analyzing the EMS emergency calls

In [None]:
#Extracting 'EMS' rows from the category column and plotting EMS calls by hour
EMS = df.loc[df['category'] == 'EMS']

plt.figure(figsize=(8,5))
sns.countplot(x='Hour', data=EMS)

We can see that most calls occurred between 8 am and 9 pm.

Next, we will find the EMS sub-categories with the highest number of calls.

In [None]:
#Finding the top reasons of EMS calls made (in percentage)
EMS = df[df['category']=='EMS']['sub_category']
(EMS.value_counts()/len(EMS)*100).head(30)

The most frequent EMS call is 'FALL VICTIM'. One possibility is that crime in some areas leads to people getting hurt and calling for help, although the label more plainly describes people injured in falls. We can explore this by plotting the areas where 'FALL VICTIM' was reported.

Next are 'RESPIRATORY EMERGENCY' and 'CARDIAC EMERGENCY'. This is expected, as these are illnesses that anyone may face at any time of the day.

The fourth most frequent call is 'VEHICLE ACCIDENT', which again ties in with the earlier insight that vehicle accidents may happen because people rush during working hours.

Creating and plotting the 'FALL VICTIM' calls by location

In [None]:
fall_victim= df[df['sub_category']=='FALL VICTIM']['township']

#plt.figure(figsize=(8,10))
#sns.countplot(y ='township', data = fall_victim, order=df['township'].value_counts().iloc[:30].index)

Since this does not show any result, there may be leading or trailing whitespace in the 'sub_category' values. So we will remove it and re-run the cell.
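A quick way to confirm this kind of problem is to compare the number of exact matches before and after stripping whitespace. The sketch below uses a toy series with a leading space, as would be left over from splitting "EMS: FALL VICTIM" at the colon (values are illustrative only).

```python
import pandas as pd

# Labels with a leading space, as left behind by a split on ':'.
sub = pd.Series([" FALL VICTIM", " FALL VICTIM", " CARDIAC EMERGENCY"])

# Exact comparison fails while the leading space is present.
matches_before = (sub == "FALL VICTIM").sum()

# After stripping, the comparison finds the rows.
matches_after = (sub.str.strip() == "FALL VICTIM").sum()

print(matches_before, matches_after)
```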

In [None]:
#removing leading and trailing whitespace
df.sub_category = df.sub_category.str.strip()

In [None]:
fall_victim= df[df['sub_category']=='FALL VICTIM']['township']

plt.figure(figsize=(8,10))
sns.countplot(y=fall_victim, order=fall_victim.value_counts().iloc[:30].index)

These are all towns in Pennsylvania, United States. We can see from the data that some towns, such as 'LOWER MERION', 'UPPER MERION', 'ABINGTON' and 'LOWER PROVIDENCE', have a high number of calls recorded as 'FALL VICTIM'. This might mean that these towns have a high crime rate, though call counts alone cannot establish that.

Analyzing the Traffic emergency calls

In [None]:
#Extracting 'Traffic' rows from the category column and plotting Traffic calls by hour
Traffic = df.loc[df['category'] == 'Traffic']

plt.figure(figsize=(8,5))
sns.countplot(x='Hour', data=Traffic)

There is a sudden spike in calls between 7 am and 9 am, and again a high number of calls between 4 pm and 7 pm.

Let's see the sub-categories for which a large number of calls were recorded.

In [None]:
Traffic = df[df['category']=='Traffic']['sub_category']
(Traffic.value_counts()/len(Traffic)*100).head(30)

In [None]:
plt.figure(figsize=(7,4))
sns.countplot(y=Traffic, order=Traffic.value_counts().iloc[:30].index)

The data above shows that almost 65% of the calls in the 'Traffic' category were made for 'VEHICLE ACCIDENT'. Another 20% were made for 'DISABLED VEHICLE'.

Combined with the hourly plot, this suggests that most vehicle accidents occur between 7 am and 9 am and between 4 pm and 7 pm.

A likely reason is that these are the periods when people travel to and from work. They might be in a hurry to reach the office, and on the way back they might be tired after a full day of work and rush to get home as early as possible.
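To put a number on the rush-hour pattern, one could compute the share of calls that fall inside the two windows named in the text. This sketch uses a toy frame with made-up hours; `Series.between` is inclusive at both ends, so the windows here cover hours 7 to 9 and 16 to 19.

```python
import pandas as pd

# Toy vehicle-accident calls; the hours are illustrative only.
toy = pd.DataFrame({
    "Hour": [7, 8, 12, 16, 17, 18, 23, 3],
    "sub_category": ["VEHICLE ACCIDENT"] * 8,
})

# Commute windows taken from the text (both ends inclusive).
morning = toy["Hour"].between(7, 9)
evening = toy["Hour"].between(16, 19)

# Percentage of calls that fall inside either window.
rush_share = (morning | evening).mean() * 100
print(rush_share)
```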

Analyzing the Fire emergency calls

In [None]:
#Extracting 'Fire' rows from the category column and plotting Fire calls by hour
Fire = df.loc[df['category'] == 'Fire']

plt.figure(figsize=(8,5))
sns.countplot(x='Hour', data=Fire)

Here too, most calls occur between 8 am and 7 pm.

Let's see the sub-categories for which a large number of calls were recorded.

In [None]:
Fire = df[df['category']=='Fire']['sub_category']
(Fire.value_counts()/len(Fire)*100).head(30)

In [None]:
plt.figure(figsize=(8,7))
sns.countplot(y=Fire, order=Fire.value_counts().iloc[:20].index)

'FIRE ALARM', at 38%, is the most frequent Fire call. One possible reason is that hotels and restaurants have to work at a fast pace during working hours, which may occasionally lead to a fire.

The second most frequent call is again 'VEHICLE ACCIDENT', for the reasons described above.

Next are 'FIRE INVESTIGATION' and 'GAS-ODOR/LEAK', which can happen for many different reasons.

#### Ans. 'VEHICLE ACCIDENT' and 'DISABLED VEHICLE' account for the highest number of calls overall. These calls peak when people start going to work and again when they return home. People might be in a hurry to reach the office, and on the way back they might be tired after a full day of work and rush to get home as early as possible. The next most frequent calls recorded were 'FIRE ALARM' and 'FALL VICTIM'.

### Q. At which locations were the most frequent types of emergency calls made?

We have found that the highest number of emergency calls was made for 'VEHICLE ACCIDENT'. So, let's see in which locations the most accidents occur.

First we will keep only those rows having 'VEHICLE ACCIDENT'.

In [None]:
#Towns with highest number of vehicle accidents.(in percentage)
veh_acc= df[df['sub_category'] == 'VEHICLE ACCIDENT']['township']
(veh_acc.value_counts()/len(veh_acc)*100).head(30)

Only a few towns, such as 'LOWER MERION', 'UPPER MERION', 'ABINGTON' and 'CHELTENHAM', each account for more than 5% of vehicle accidents. Maybe these towns have bad roads, or too few roads for the traffic they carry.

Next we will keep only those rows having 'DISABLED VEHICLE'.

In [None]:
#Towns with highest number of disabled vehicle.(in percentage)
dis_veh= df[df['sub_category'] == 'DISABLED VEHICLE']['township']
(dis_veh.value_counts()/len(dis_veh)*100).head(30)

The picture is the same as above. The same three towns, 'LOWER MERION', 'UPPER MERION' and 'ABINGTON', have the most disabled vehicle calls. This might be because of bad road conditions or heavy traffic.

Next we will keep only those rows having 'FIRE ALARM'.

In [None]:
#Towns with highest number of fire alarm.(in percentage)
fire_alm= df[df['sub_category'] == 'FIRE ALARM']['township']
(fire_alm.value_counts()/len(fire_alm)*100).head(30)

Again 'LOWER MERION' and 'ABINGTON' are at the top of the list, with the highest number of calls made.

Next we will keep only those rows having 'FALL VICTIM'.

In [None]:
#Towns with highest number of fall victim calls.(in percentage)
fall_vctm= df[df['sub_category'] == 'FALL VICTIM']['township']
(fall_vctm.value_counts()/len(fall_vctm)*100).head(30)

'LOWER MERION' and 'ABINGTON' are again in the second and third positions of the list. 'LOWER PROVIDENCE' is also near the top, which might mean that these cities have a high crime rate, though the call data alone cannot confirm this.
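The four lookups above repeat the same pattern, so they could be wrapped in a small helper. The function name `township_share` and the toy frame below are hypothetical and only illustrate the idea; `value_counts(normalize=True)` returns fractions, which are multiplied by 100 to get percentages.

```python
import pandas as pd

def township_share(frame, sub_category, top=30):
    """Percentage of calls per township for one sub-category (hypothetical helper)."""
    towns = frame.loc[frame["sub_category"] == sub_category, "township"]
    return (towns.value_counts(normalize=True) * 100).head(top)

# Toy frame with the notebook's column names; values are illustrative only.
toy = pd.DataFrame({
    "sub_category": ["FALL VICTIM", "FALL VICTIM", "FALL VICTIM", "FIRE ALARM"],
    "township": ["ABINGTON", "ABINGTON", "LOWER MERION", "ABINGTON"],
})

shares = township_share(toy, "FALL VICTIM")
print(shares)
```

On the real data, a call such as `township_share(df, 'VEHICLE ACCIDENT')` would be expected to give the same table as the corresponding cell above.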

#### Ans. From this we can see that 'LOWER MERION', 'UPPER MERION' and 'ABINGTON' are the cities from which the highest number of calls were made. This may be because these cities have bad roads or bad traffic conditions, or because they have a poor city controlling authority.

## 6. Explain Outcome

##### The highest number of calls was made for EMS/medical service.
##### Across all categories, the most frequent reasons for a call were Vehicle Accident and Disabled Vehicle.
##### Most of these calls were made between 8 am and 5 pm to report car accidents. This may be because people are in a rush during their working hours.

##### Also, cities like Lower Merion, Upper Merion and Abington may have bad road conditions or a poor traffic controlling authority.

### Recommendation

##### The traffic controlling authority should limit public vehicle driving speed, especially during working hours, between 8 am and 6 pm.
##### Roads should be properly constructed in high-traffic cities like Lower Merion, Upper Merion and Abington.