## Introduction

This is an exploratory analysis of the 911 emergency calls dataset for Montgomery County, PA.
You can download it from [Kaggle](https://www.kaggle.com/mchirico/montcoalert).

Let's start by taking a look at the raw data.

In [None]:
import pandas as pd 

data = pd.read_csv("../input/montcoalert/911.csv")

data.shape

In [None]:
data.head()

### Elementary description

The data contains $663522$ rows and nine columns. 

- `lat` and `lng` columns are the coordinates of the event
- `desc` contains the description of the emergency: the address, the police station, the date and the exact time
- `zip` is the zip code
- `title` contains the emergency type (*ems*, *fire*, *traffic*) and a more precise explanation after a colon, e.g. "back pains/injury" or "gas-odor/leak"
- `timeStamp` is the date and time of the event
- `twp` is the township
- `addr` is the address
- `e` is a constant column, always equal to `1`

We can safely remove the `e` column, as it doesn't contain any useful information. 
We can remove the `desc` column as well.
The only piece of information we could extract from it that we don't already have is the police station ID.
However, values in this column are somewhat arbitrary and don't follow an exact pattern, so the extraction would be very difficult.

In [None]:
data = data.drop(["desc", "e"], axis=1)
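To illustrate why the extraction is fragile, here is a minimal sketch of a regex-based attempt. The two description strings are hypothetical examples, not rows taken from the dataset, and the pattern assumes the station ID follows the word "Station"; real `desc` values may deviate from this.

```python
import pandas as pd

# Hypothetical description strings; the real `desc` values may be formatted differently.
desc = pd.Series([
    "MAIN ST & 1ST AVE; NORRISTOWN; Station 308A; 2015-12-10 @ 17:10:52;",
    "CHERRY LN; LOWER MERION; 2015-12-10 @ 17:40:00;",
])

# Capture the token following the word "Station"; rows without a match become NaN.
station = desc.str.extract(r"Station\s*:?\s*(\w+)", expand=False)
print(station.tolist())
```

Any row that doesn't match the assumed pattern silently ends up as `NaN`, so a single regular expression is unlikely to cover an irregular column reliably.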

### Are there `NaN` values?

Notice that there is a `NaN` value in the fourth row of the `zip` column. 
Let's see how many other `NaN` values there are.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

NAs = [data[name].isna().sum() for name in data.columns]

sns.barplot(x=data.columns, y=NAs)

There are a lot of `NaN` values in the `zip` column, some in the `twp` column and none in the others.
We're going to inspect that later on.
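The same counts can be read as numbers rather than from a plot. The sketch below uses a small toy frame (the values are made up for illustration) and shows both the absolute count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "zip": [19401.0, np.nan, np.nan],
    "twp": ["NORRISTOWN", None, "ABINGTON"],
    "addr": ["A", "B", "C"],
})

na_counts = toy.isna().sum()        # number of missing values per column
na_share = toy.isna().mean() * 100  # percentage of missing values per column
print(na_counts)
print(na_share.round(1))
```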

### Emergency types

Recall that the `title` column contains the type of the emergency, followed by a subtype after a colon.
Let's split this column into two new ones, `type` and `subtype`, and remove the `title` column.

In [None]:
import numpy as np

get_type = np.vectorize(lambda title: title.split(":")[0])

def get_subtype(title):
    subtype = " ".join(title.split(":")[1:])[1:]
    if subtype[-2:] == " -":
      subtype = subtype[:-2]
    return subtype


get_subtype = np.vectorize(get_subtype)

data["type"] = get_type(data["title"])
data["subtype"] = get_subtype(data["title"])

data = data.drop("title", axis=1)
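As an alternative sketch, the same split can be done with pandas string methods instead of `np.vectorize`. The sample titles below are illustrative. Note one difference: with `n=1` everything after the first colon is kept as is, whereas the function above joins any further colon-separated parts with spaces.

```python
import pandas as pd

# Illustrative titles in the "TYPE: SUBTYPE" form described above.
titles = pd.Series([
    "EMS: BACK PAINS/INJURY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT -",
])

# Split on the first colon only, giving one column per part.
parts = titles.str.split(":", n=1, expand=True)
types = parts[0].str.strip()

# Strip whitespace and a trailing " -" from the subtype, if present.
subtypes = parts[1].str.strip().str.replace(r"\s*-$", "", regex=True)
print(types.tolist())
print(subtypes.tolist())
```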

Now, we can plot the distribution of emergency types.

In [None]:
types = data["type"].dropna()

plt.figure(figsize=(5,5))
sns.countplot(x=types)

The most common reason for a 911 call is a medical emergency, followed by a traffic emergency. Fires are the least common.

### Plotting the emergency coordinates

One of the first things that comes to mind upon seeing data containing geographical coordinates is to plot these coordinates on a scatterplot.
It would allow us to see the areas with the highest number of emergencies in a natural way.
However, take a look at the map below.


In [None]:
# DataFrame.plot creates its own figure, so the size is passed directly
g = data.plot(kind="scatter", x="lng", y="lat", alpha=0.1, figsize=(10, 10))

This is concerning. 
There are records with a `lng` value higher than $50$, but this meridian doesn't even cross the United States.
Montgomery County's coordinates are roughly $(40,-75)$.
Let's take a look at the `zip` column of records with coordinates lying outside the $[39.5,40.5] \times [-75.5,-74.5]$ box.
We're going to refer to these records as *records with incorrect coordinates*.

In [None]:
# latitude conditions
mask_lat = (data["lat"] < 39.5) | (data["lat"] > 40.5)

# longitude conditions
mask_lng = (data["lng"] < -75.5) | (data["lng"] > -74.5)

# either condition: the record lies outside the bounding box
geo_mask = mask_lat | mask_lng

print(data["zip"].loc[geo_mask].isna().sum(), 
      "|", 
      data["zip"].loc[geo_mask].isna().sum() / len(data["zip"].loc[geo_mask]),
      "|",
      len(data["zip"].loc[geo_mask]),
      "|",
      len(data["zip"].loc[geo_mask]) / data.shape[0])

There are $86330$ records with incorrect coordinates, and $5591$ of them are also missing a zip code.
Notice that despite this, these records have non-null values in the `twp` and `addr` columns.

In [None]:
print("township: ", data["twp"].loc[geo_mask].isna().sum(), "\n address: ", data["addr"].loc[geo_mask].isna().sum())
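The bounding-box test can also be written with `Series.between`, which some readers find easier to scan. This is a sketch on toy coordinates, not a replacement for the mask above. `between` is inclusive on both ends by default, so points exactly on the border count as inside, as they do with the strict comparisons used earlier. One difference: a missing coordinate is treated as outside here, whereas the comparison-based mask would not flag it.

```python
import pandas as pd

# Toy coordinates: one point inside the box, two outside.
toy = pd.DataFrame({"lat": [40.1, 40.2, 51.3], "lng": [-75.3, -80.0, 0.1]})

# True where both coordinates fall inside [39.5, 40.5] x [-75.5, -74.5].
inside = toy["lat"].between(39.5, 40.5) & toy["lng"].between(-75.5, -74.5)
outside = ~inside
print(outside.tolist())
```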

We cannot simply remove $13\%$ of the records; that would be an enormous waste.
We don't even know if these records are actually incorrectly labeled or if there is another reason for their existence (later we're going to discuss the `TRANSFERRED CALL` subtype).
Whatever the real reason is, we're going to omit them when creating scatterplots with coordinates.

In [None]:
g = data.loc[~geo_mask].plot(kind="scatter", x="lng", y="lat", alpha=0.002, title="Coordinate plot")

Now, this looks like the actual Montgomery County.
We can see the shape of the borders and the fact that some emergency calls from neighbouring counties were handled by services located in Montgomery.

Let's try to enhance this plot a bit.
We're going to create a map of Montco using the scatterplot again, but this time we're going to color records sharing township with the same color.

In [None]:
df = data.loc[~geo_mask]

plt.figure(figsize=(15,15))
sns.scatterplot(data=df, x="lng", y="lat", hue="twp")

This is roughly what we expected.

### Retrieving zip codes

As we've already seen, more than $10\%$ of the records are missing a value in the `zip` column.
They do, however, have a non-null value in the `addr` column.
Therefore, if two records are sharing an address and one of them is missing a zip code, we can fill this missing value with the zip code from the other one.
Let's do that now.

In [None]:
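# build an address -> zip lookup from records that have a zip code
mask = ~data["zip"].isna()
addr_zip = data[["addr", "zip"]][mask]
addr_to_zip = pd.Series(addr_zip["zip"].values, index=addr_zip["addr"]).to_dict()

# fill only the missing zip codes; existing values are left untouched
data["zip"] = data["zip"].fillna(data["addr"].map(addr_to_zip))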

This is the number of records still missing a zip code.

In [None]:
data["zip"].isna().sum()

In [None]:
data["zip"].nunique()

We managed to reduce the number of records with missing zip codes by about $25\%$, so there are still many left.
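A minimal sketch of the address-based fill on a toy frame, assuming we only want to fill the gaps and keep existing zip codes as they are. The addresses and zip codes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: "MAIN ST" appears once with and once without a zip code.
toy = pd.DataFrame({
    "addr": ["MAIN ST", "MAIN ST", "OAK RD", "ELM ST"],
    "zip": [19401.0, np.nan, np.nan, 19002.0],
})

# One known zip code per address (the first one seen).
known = toy.dropna(subset=["zip"]).drop_duplicates("addr")
addr_to_zip = known.set_index("addr")["zip"]

# Fill only the missing values; addresses never seen with a zip stay NaN.
toy["zip"] = toy["zip"].fillna(toy["addr"].map(addr_to_zip))
print(toy)
```

`OAK RD` remains empty because no other record shares its address, which is why this approach can recover only part of the missing values.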

### Emergencies by area

Having filled some of the missing zip codes, we can try to find out which areas in Montco report the most emergencies.

In [None]:
df = data.groupby(["type", "zip"]).size().reset_index().pivot(columns="type", index="zip", values=0).fillna(0)
df["sum"] =  data["zip"].value_counts()
df = df.sort_values(by="sum", ascending=False)
df = df.drop("sum", axis=1)
df.head(10).plot.bar(stacked=True, title="Emergencies by type and zip code")
plt.xlabel("Zip code")
plt.ylabel("All emergencies reported")

Although most 911 calls are emergency medical services, there are some areas with only traffic or fire emergencies.

In [None]:
df = data.groupby(["type", "twp"]).size().reset_index().pivot(columns="type", index="twp", values=0).fillna(0)
df["sum"] = data["twp"].value_counts()
df = df.sort_values(by="sum", ascending=False)
df = df.drop("sum", axis=1)
df.head(10).plot.bar(stacked=True, title="Emergencies by type and township")
plt.xlabel("Township")
plt.ylabel("All emergencies reported")

Now, let's do the same thing for each type of emergency.

In [None]:
masks = {key: data["type"] == key for key in ["EMS", "Traffic", "Fire"]}

def emergency_area_plot(type, kind):
  df = data.loc[masks[type]].groupby(["type", kind]).size().reset_index().pivot(columns="type", index=kind, values=0).fillna(0)
  df["sum"] = data[kind].loc[masks[type]].value_counts()
  df = df.sort_values(by="sum", ascending=False)
  df = df.drop("sum", axis=1)
  df.head(10).plot.bar(stacked=True, title=f"{type} by {kind}")
  plt.xlabel(kind)
  plt.ylabel(f"All {type} calls")

emergency_area_plot("EMS", "zip")

In [None]:
emergency_area_plot("EMS", "twp")

In [None]:
emergency_area_plot("Traffic", "zip")

In [None]:
emergency_area_plot("Traffic", "twp")

In [None]:
emergency_area_plot("Fire", "zip")

In [None]:
emergency_area_plot("Fire", "twp")
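The `groupby` / `pivot` chain used for these plots builds a table of counts per area and type. As a sketch, `pd.crosstab` produces the same kind of table in one call. The township labels below are placeholders.

```python
import pandas as pd

# Toy records with placeholder township names.
toy = pd.DataFrame({
    "twp": ["A", "A", "B", "A", "C", "B"],
    "type": ["EMS", "Fire", "EMS", "EMS", "Traffic", "Traffic"],
})

# Rows: townships, columns: emergency types, cells: number of calls (0 if none).
counts = pd.crosstab(toy["twp"], toy["type"])

# Order townships by their total number of calls and keep the top two.
order = counts.sum(axis=1).sort_values(ascending=False).index
top = counts.loc[order].head(2)
print(top)
```

Calling `top.plot.bar(stacked=True)` on the result would give the same kind of stacked chart as above.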

Another interesting thing to see is where emergencies of a given type take place on the Montco map.
Fortunately, we have the coordinates of each emergency.

In [None]:
def emergency_type_map(type): 
  subtype_mask = data["type"] == type
  data.loc[~geo_mask & subtype_mask].plot(kind="scatter", x="lng", y="lat", alpha=0.002, title=type)

emergency_type_map("EMS")

In [None]:
emergency_type_map("Traffic")

In [None]:
emergency_type_map("Fire")

All of these are pretty similar in terms of distribution (not the intensity of course, we've already seen there are fewer fires than car accidents) and likely correspond to population density.

### Emergencies and time of the day

We're going to plot emergencies against time to see during which hours the number of 911 calls is highest.
To simplify things, we're going to add a new column, `hour`, containing the hour of a call, and plot the number of calls in each hour.


In [None]:
data["timeStamp"] = pd.to_datetime(data["timeStamp"])
data["hour"] = data["timeStamp"].dt.hour

df = data.groupby(["type", "hour"]).size().reset_index().pivot(columns="type", index="hour", values=0).fillna(0) / data.shape[0] * 100
ax = df.plot.bar(stacked=True, title="Average emergency distribution in a day")
plt.ylabel("Percentage of all emergencies reported")
plt.xlabel("Hour")
plt.xticks(rotation=0)

We see that most emergencies are reported between $8$ AM and $5$ PM.
Let's plot each type of emergency against time on its own.

In [None]:
def daytime_dist(type):
  # sort by hour so the bars appear in chronological order
  df = data["hour"].loc[masks[type]].value_counts().sort_index() / data.shape[0] * 100
  df.plot.bar(title=f"Average {type} emergency distribution in a day")
  plt.ylabel("Percentage of all emergencies reported")
  plt.xlabel("Hour")
  plt.xticks(rotation=0)

daytime_dist("EMS")

In [None]:
daytime_dist("Traffic")

In [None]:
daytime_dist("Fire")

In the traffic emergency distribution plot, unlike in the two others, we see two peaks: one between $7$ and $8$ AM and the other between $5$ and $6$ PM.
These likely correspond to the morning and afternoon rush hours.
Another interesting observation is that the highest number of EMS calls is reported in the morning, around $10$ AM, whereas the highest number of fire emergencies is reported in the afternoon, between $5$ and $6$ PM.

This is only the average. 
The number of emergencies in a given hour can depend on the day of the week.
We can see that in the heatmap below.

In [None]:
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
data["weekday"] = data["timeStamp"].dt.weekday

df = data.groupby(["weekday", "hour"]).count()["type"].unstack() / data.shape[0] * 100

plt.figure(figsize=(20,5))
g = sns.heatmap(df, yticklabels=days)

Indeed, there are fewer emergencies during the weekend than during the week.
Furthermore, notice that there are fewer emergencies on Monday than on Friday.
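The weekday-by-hour table behind the heatmap can be sketched on a few toy timestamps. This version uses `pd.crosstab` with `normalize="all"`, so every cell is a share of all calls. The timestamps are invented for illustration.

```python
import pandas as pd

# Toy timestamps; weekday 0 is Monday and 6 is Sunday.
ts = pd.to_datetime(pd.Series([
    "2016-01-04 08:15:00",  # Monday
    "2016-01-04 08:45:00",  # Monday
    "2016-01-09 23:05:00",  # Saturday
    "2016-01-10 08:30:00",  # Sunday
]))

# Percentage of all calls falling into each (weekday, hour) cell.
table = pd.crosstab(
    ts.dt.weekday, ts.dt.hour,
    rownames=["weekday"], colnames=["hour"],
    normalize="all",
) * 100
print(table)
```

Weekdays or hours with no calls are simply absent from such a small table. On the full dataset every combination is likely to occur.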

### Emergencies and seasons

To see if there is any relationship between the number and type of 911 calls and the seasons, we're going to plot the total number of emergencies in each month.
However, it may be the case that the measurements started in one month and ended in another a couple of years later.
That would inflate the counts for months that are covered more often than others.
Let's see if that's true.

In [None]:
print(data["timeStamp"].min().date())

In [None]:
print(data["timeStamp"].max().date())

The measurements started in mid-December 2015 and ended at the end of July 2020.
The simplest solution is to consider only data from 2016 to 2019.

In [None]:
mask = (data["timeStamp"].dt.year >= 2016) & (data["timeStamp"].dt.year <= 2019)

dt = data.loc[mask].copy()

Now, we can plot the number and type of emergencies in each month.
We're going to create a new column, `month`, containing the number of the month.

In [None]:
dt["month"] = dt["timeStamp"].dt.month

df = dt.groupby(["type", "month"]).size().reset_index().pivot(columns="type", index="month", values=0).fillna(0) / data.shape[0] * 100
ax = df.plot.bar(stacked=True, title="Average emergency distribution in a year (2016-2019)")
plt.ylabel("Percentage of all emergencies reported")
plt.xlabel("Month")
plt.xticks(rotation=0)

An almost uniform distribution, I dare say.

In [None]:
def month_dist(type):
  # sort by month so the bars appear in calendar order
  df = dt["month"].loc[masks[type]].value_counts().sort_index() / data.shape[0] * 100
  df.plot.bar(title=f"Average {type} emergency distribution in a year (2016-2019)")
  plt.ylabel("Percentage of all emergencies reported")
  plt.xlabel("Month")
  plt.xticks(rotation=0)

month_dist("EMS")

In [None]:
month_dist("Traffic")

In [None]:
month_dist("Fire")

Nothing interesting going on here.
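For reference, monthly shares can also be computed relative to the filtered subset itself, so that the bars add up to $100\%$. This is a sketch on made-up month numbers. The plots above instead divide by the size of the full dataset.

```python
import pandas as pd

# Toy month numbers standing in for a `month` column.
months = pd.Series([1, 1, 2, 12, 12, 12, 7, 2])

# Share of calls per month in percent, ordered by calendar month.
share = months.value_counts(normalize=True).sort_index() * 100
print(share)
```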

### Subtypes of emergencies

First, let's see how many subtypes there are.

In [None]:
data["subtype"].nunique()

Let's also see the number of subtypes for each supertype ("EMS", "Traffic", "Fire", that is).

In [None]:
print("EMS: ", data["subtype"].loc[masks["EMS"]].nunique())
print("Traffic: ", data["subtype"].loc[masks["Traffic"]].nunique())
print("Fire: ", data["subtype"].loc[masks["Fire"]].nunique())

There are too many subtypes to plot each one in as much detail as we did with `type`.
Still, we have some work to do.
Let's see which subtypes are the most common.

In [None]:
subtypes = data["subtype"].dropna()

plt.figure(figsize=(15,5))
sns.countplot(y=subtypes, order=subtypes.value_counts().iloc[:15].index, palette="crest")

We see that `VEHICLE ACCIDENT` is by far the most common subtype of a 911 call.
This is actually because `VEHICLE ACCIDENT` can be of type `EMS`, `Traffic`, or `Fire` (among other reasons).
Now, let's plot most common subtypes for each type.

In [None]:
def subtypes_plot(type):
  subtypes = data["subtype"].loc[masks[type]].dropna()
  plt.figure(figsize=(15,5))
  sns.countplot(y=subtypes, order=subtypes.value_counts().iloc[:25].index, palette="crest")

subtypes_plot("EMS")

In [None]:
subtypes_plot("Traffic")

In [None]:
subtypes_plot("Fire")

Notice the `TRANSFERRED CALL` subtype.
It explains why some coordinate pairs come from neighbouring counties.

As with types, we can plot emergencies of a given subtype on a map using the coordinates.
We can't do that for every subtype, because there are so many of them.
In the previous plots we suspected a strong correlation between the distribution of emergencies and population density.
This time let's try to choose (naively) subtypes that won't have this characteristic.

In [None]:
def coordinate_subtype(subtype, alpha=0.002): 
  subtype_mask = data["subtype"] == subtype
  data.loc[~geo_mask & subtype_mask].plot(kind="scatter", x="lng", y="lat", alpha=alpha, title=subtype)

coordinate_subtype("OVERDOSE", alpha=0.03)

We see some areas where overdose emergencies accumulate.
They overlap, very roughly, with the areas where `ASSAULT VICTIM` and `STABBING` emergencies accumulate (see below), which may indicate a high crime rate in these areas.

In [None]:
coordinate_subtype("ASSAULT VICTIM", 0.075)

In [None]:
coordinate_subtype("STABBING", 0.5)

Earlier, we didn't see much of a relationship between seasons and the number of emergencies.
Now that we look at the subtypes, we can formulate some (once again naive) hypotheses.
Consider the `SYNCOPAL EPISODE` subtype.
We can suspect that there are more cases of this emergency in the summer, as summer heat often causes people to faint.
The same goes for `CVA/STROKE`.

Let's test that by plotting the number of syncopal episode emergencies in each month.

In [None]:
def month_dist_sub(subtype):
  subtype_mask = data["subtype"] == subtype
  df = dt["month"].loc[subtype_mask].value_counts().sort_index() / data.shape[0] * 100
  df.plot.bar(title=f"Average {subtype} emergency distribution in a year (2016-2019)")
  plt.ylabel("Percentage of all emergencies reported")
  plt.xlabel("Month")
  plt.xticks(rotation=0)

month_dist_sub("SYNCOPAL EPISODE")

No seasonal pattern can be seen here.
Nor can one be seen below:

In [None]:
month_dist_sub("CVA/STROKE")