This dataset consists of 911 calls from Montgomery County, Pennsylvania (in the Philadelphia suburbs), from December 2015 through August 2016.
Source: Mike Chirico (Kaggle)

In [None]:
import pandas as pd
import seaborn as sns
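import matplotlib as plt  # note: this aliases the top-level package, not pyplot
import matplotlib.ticker  # make sure plt.ticker resolves in the cells below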
import numpy as np
%matplotlib inline
df = pd.read_csv('../input/911.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
# Missing values
df.isnull().sum()

In [None]:
# Unique twp values. I assume this stands for township?
print(len(df.twp.unique()))
df.twp.unique()

In [None]:
# Unique titles
print(len(df.title.unique()))
df.title.unique()

In [None]:
(df.e != 1).sum()

Every e (which I assume means "emergency") value is 1. Maybe that's because this data was taken from a larger dataset of both emergency and non-emergency calls.
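
A quick way to confirm that a column carries no information is to count its distinct values. Here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real frame; values are made up for illustration.
toy = pd.DataFrame({
    'twp': ['A', 'B', 'A', 'C'],
    'e': [1, 1, 1, 1],
})

# A column with a single distinct value is constant and can be ignored.
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

Running the same `nunique()` check on `df` should flag `e`, assuming it really is 1 everywhere.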

In [None]:
# Plot histogram of townships
sns.countplot(x=df.twp.values)

In [None]:
# Titles can be grouped by EMS, Fire and Traffic
print(df.title.map(lambda x: x.split(': ')[0]).unique())

In [None]:
df['call_type'] = df.title.map(lambda x: x.split(': ')[0])
df.head()

In [None]:
df['times'] = pd.to_datetime(df.timeStamp)
print(type(df['times'][0]))
df['Year-Month'] = df['times'].apply(lambda x: "%d-%d" % (x.year, x.month))
df['Year-Week'] = df['times'].apply(lambda x: "%d-%d" % (x.year, x.week))
#df['year'] = df['times'].apply(lambda x: "%d" % (x.year))
#df['month'] = df['times'].apply(lambda x: "%d" % (x.month))
sns.countplot(x=df['Year-Month'].values)

In [None]:
# Need to combine 2015-53 and 2016-53 - these are the same week, just broken up because it spans the year boundary.
def combine_last_week_of_2015(x):
    if x == '2016-53':
        return '2015-53'
    else:
        return x
df['Year-Week'] = df['Year-Week'].apply(combine_last_week_of_2015)
# Also, don't plot the first and last weeks, since they're incomplete.
year_week_full_intervals = df['Year-Week'][(df['Year-Week'] != '2015-50') & (df['Year-Week'] != '2016-33')]
ax = sns.countplot(x=year_week_full_intervals.values, color='c')
ax.set_title("Count of 911 calls by week")
ax.set_xlabel("Week")
t = ax.set_xticks(np.arange(0,34,5))
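
The split week above arises because the label pairs the calendar year with an ISO week number. A possible alternative, sketched here with the standard library only, is to take both the year and the week from the ISO calendar so that no manual patch is needed. The helper name `iso_year_week` is my own invention for this sketch:

```python
from datetime import date

def iso_year_week(d):
    """Return an 'ISOyear-ISOweek' label, e.g. '2015-53'."""
    iso_year, iso_week, _ = d.isocalendar()
    return "%d-%d" % (iso_year, iso_week)

# Dates on either side of the 2015/2016 boundary fall in the same ISO week.
labels = [iso_year_week(date(2015, 12, 31)), iso_year_week(date(2016, 1, 1))]
print(labels)
```

Since pandas timestamps are datetime objects, the same `isocalendar()` call should work inside the `apply` used for `Year-Week`.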

In [None]:
# Look at trends over the course of a week, day.
int_to_day_of_week = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['day_of_week'] = df['times'].apply(lambda x: int_to_day_of_week[x.dayofweek])
def plot_by_day_of_week(df, color=None, hue=None, hue_order=None):
    sns.countplot(x='day_of_week', data=df, color=color, hue=hue, hue_order=hue_order, order=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
plot_by_day_of_week(df, 'y')

This is surprising because I expected more calls on the weekend. Maybe it has to do with traffic or stress levels? More granularity would help. How about plotting the above separately for each call category?

In [None]:
ems_df = df[df['call_type'] == 'EMS']
traffic_df = df[df['call_type'] == 'Traffic']
fire_df = df[df['call_type'] == 'Fire']
plot_by_day_of_week(ems_df, 'r')

In [None]:
plot_by_day_of_week(fire_df, 'b')

In [None]:
plot_by_day_of_week(traffic_df, 'g')

In [None]:
plot_by_day_of_week(df, color=None, hue='call_type', hue_order=['EMS', 'Traffic', 'Fire'])

It turns out that most of the variability is due to Traffic calls, which are predictably sensitive to day-of-week and represent a sizable proportion of the samples. EMS and Fire calls are not strongly affected by day-of-week.
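
One way to put a number on "most of the variability" is the coefficient of variation (standard deviation divided by mean) of the day-of-week counts for each call type. This is only a sketch on made-up rows, and the choice of statistic is my assumption rather than something the plots above dictate:

```python
import pandas as pd

# Made-up (day, call type) rows for illustration only.
rows = ([('Monday', 'Traffic')] * 9 + [('Sunday', 'Traffic')] * 3 +
        [('Monday', 'EMS')] * 6 + [('Sunday', 'EMS')] * 5)
toy = pd.DataFrame(rows, columns=['day_of_week', 'call_type'])

# Rows are days, columns are call types, cells are call counts.
table = pd.crosstab(toy['day_of_week'], toy['call_type'])

# Spread of the daily counts relative to their mean, per call type.
cv = table.std() / table.mean()
print(cv)
```

A larger value means the call type swings more from day to day. On the real data the same two lines would start from `pd.crosstab(df['day_of_week'], df['call_type'])`.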

What about time of day?

In [None]:
df['seconds_since_midnight'] = df.times.apply(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
ax = sns.distplot(df.seconds_since_midnight, bins=24)
def seconds_to_time_formatter(seconds, pos):
    # Add some interval and use modulo to change start time.
    # Add am/pm
    hours_since_midnight = int(seconds / 3600)
    ampm_time = hours_since_midnight
    is_pm = False
    if hours_since_midnight < 12:
        if hours_since_midnight == 0:
            ampm_time = 12
    else:
        is_pm = True
        if hours_since_midnight > 12:
            ampm_time = hours_since_midnight - 12
    return "{0}{1}".format(ampm_time, 'pm' if is_pm else 'am')
ax.xaxis.set_major_formatter(plt.ticker.FuncFormatter(seconds_to_time_formatter))
# Show a tick for every other hour
ax.xaxis.set_major_locator(plt.ticker.MultipleLocator(3600 * 2))
ax.set_xlim(0,3600*24)
ax.set_xlabel("Time of day")

I can fix this by shifting the start of the x axis so that the peak sits in the middle of the plot and the quietest time falls at the edges. By which I mean the x value for the max y value (15/3:00 pm?) would be centered. The data looks like it would be a pretty good fit for the KDE once it's translated appropriately.

Because this daily slice of the dataset repeats every 24 hours, a translation is intuitive and valid. I'll graph this, but I should also see how well a sine curve would fit it.
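
A sine curve with a 24-hour period can be fit with ordinary least squares, because a + b*sin(wt) + c*cos(wt) is linear in a, b and c. The sketch below uses synthetic hourly counts that I made up (mean 100, amplitude 40, peak at 3 pm), not the real data:

```python
import numpy as np

hours = np.arange(24)
omega = 2 * np.pi / 24  # one full cycle per day

# Synthetic hourly counts (made up): mean 100, amplitude 40, peak at 3 pm.
counts = 100 + 40 * np.cos(omega * (hours - 15))

# Design matrix with a constant, a sine and a cosine column.
design = np.column_stack([np.ones(len(hours)),
                          np.sin(omega * hours),
                          np.cos(omega * hours)])
(a, b, c), *_ = np.linalg.lstsq(design, counts, rcond=None)

# Convert the sine/cosine pair into an amplitude and a peak time.
amplitude = np.hypot(b, c)
peak_hour = (np.arctan2(b, c) / omega) % 24

# R^2 tells how much of the hourly variation the sinusoid explains.
residuals = counts - design @ np.array([a, b, c])
r_squared = 1 - residuals.var() / counts.var()
```

On the real data, `counts` would be the number of calls in each hour, and an R^2 well below 1 would suggest that a single sinusoid misses features such as rush-hour spikes.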

From what I gather, the y axis shows the estimated probability density (per second of the day, since x is in seconds) rather than a count of calls. This is not intuitive in this case and needs to be corrected.
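
To make the density scale concrete: for a normalized histogram, density equals count / (N * bin width), so the bar areas sum to 1. A small check with NumPy on made-up call times:

```python
import numpy as np

rng = np.random.default_rng(0)
seconds = rng.integers(0, 24 * 3600, size=1000)  # made-up call times

edges = np.linspace(0, 24 * 3600, 25)  # 24 one-hour bins
counts, _ = np.histogram(seconds, bins=edges)
density, _ = np.histogram(seconds, bins=edges, density=True)

bin_width = edges[1] - edges[0]
# Undo the normalization to get back to calls per bin.
recovered_counts = density * len(seconds) * bin_width
total_area = (density * bin_width).sum()
```

This is why the y values are so small when x is measured in seconds: each one is a fraction of all calls per second.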

In [None]:
# Make 4am the start of the x axis
# Translate to the right by 20 hours, then mod by 24
from scipy.stats import norm
df['seconds_since_4am'] = (df.seconds_since_midnight + 20 * 3600) % (24 * 3600)
# Graph the same way as above
ax = sns.distplot(df.seconds_since_4am, bins=48, fit=norm)
def seconds_to_time_formatter_4am(seconds, pos):
    # Add some interval and use modulo to change start time.
    # Add am/pm
    hours_since_midnight = int(seconds / 3600)
    # Adjust for 4am start time
    hours_since_midnight = (hours_since_midnight + 4) % 24
    ampm_time = hours_since_midnight
    is_pm = False
    if hours_since_midnight < 12:
        if hours_since_midnight == 0:
            ampm_time = 12
    else:
        is_pm = True
        if hours_since_midnight > 12:
            ampm_time = hours_since_midnight - 12
    return "{0}{1}".format(ampm_time, 'pm' if is_pm else 'am')
ax.xaxis.set_major_formatter(plt.ticker.FuncFormatter(seconds_to_time_formatter_4am))
ax.xaxis.set_major_locator(plt.ticker.MultipleLocator(3600 * 2))
ax.set_xlim(0,3600*24)
ax.set_xlabel("Time of day")
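
The two formatters above differ only in the start hour, so they could be generated by one factory. This is a sketch; the name `make_time_formatter` and its `start_hour` parameter are hypothetical, not part of matplotlib:

```python
def make_time_formatter(start_hour=0):
    """Build a tick formatter for an x axis that begins at `start_hour`."""
    def formatter(seconds, pos=None):
        # Shift by the start hour and wrap around midnight.
        hour = (int(seconds // 3600) + start_hour) % 24
        suffix = 'pm' if hour >= 12 else 'am'
        hour_12 = hour % 12 or 12  # 0 and 12 both display as 12
        return "{0}{1}".format(hour_12, suffix)
    return formatter

format_from_4am = make_time_formatter(start_hour=4)
labels = [format_from_4am(h * 3600) for h in (0, 8, 11, 20, 23)]
print(labels)
```

The returned function takes `(seconds, pos)` like the hand-written versions, so it should be usable with `FuncFormatter` in the same way.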

I expected that once I shifted the data to start at 4am, the KDE would look more like a normal curve. However, the KDE shape is unchanged. It might benefit from a somewhat larger bandwidth, and it seems to behave strangely at the ends. I also fit a normal curve, and it doesn't fit the data all that well either: it fails to capture the somewhat wavy shape on the left side, which I assume has something to do with traffic.
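
My guess is that the odd behavior at the ends comes from the KDE not knowing that time of day wraps around: kernels near 0 and 24 hours spill past the axis limits. One common workaround, sketched here with SciPy on made-up data clustered around midnight, is to tile the sample one day to each side before estimating. The 30-minute bandwidth is an arbitrary choice:

```python
import numpy as np
from scipy.stats import gaussian_kde

DAY = 24 * 3600
BANDWIDTH = 1800.0  # kernel width in seconds; arbitrary for this sketch

rng = np.random.default_rng(0)
# Made-up call times clustered around midnight, so mass sits at both ends.
seconds = rng.normal(loc=0, scale=2 * 3600, size=2000) % DAY

def kde_with_bandwidth(sample, bandwidth):
    # A scalar bw_method is multiplied by the sample's standard deviation,
    # so divide by it to get a kernel of the requested absolute width.
    return gaussian_kde(sample, bw_method=bandwidth / sample.std(ddof=1))

# Plain KDE: treats 0 and DAY as unrelated, so density leaks past the ends.
plain = kde_with_bandwidth(seconds, BANDWIDTH)

# Wrapped KDE: copy the sample one day to the left and right.
tiled = np.concatenate([seconds - DAY, seconds, seconds + DAY])
wrapped = kde_with_bandwidth(tiled, BANDWIDTH)

grid = np.linspace(0, DAY, 289)  # every 5 minutes, both ends included
plain_density = plain(grid)
wrapped_density = 3 * wrapped(grid)  # tiled mass is spread over three days

step = grid[1] - grid[0]
wrapped_area = wrapped_density[:-1].sum() * step
```

In this toy case the plain estimate at midnight comes out at roughly half the wrapped one, which is the kind of edge dip I'd expect to see in the plot above.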

For good measure, here's a version of the histogram with the "kde" parameter set to false. This causes the count to be shown on the y axis.

In [None]:
ax = sns.distplot(df.seconds_since_4am, bins=48, kde=False)
ax.xaxis.set_major_formatter(plt.ticker.FuncFormatter(seconds_to_time_formatter_4am))
ax.xaxis.set_major_locator(plt.ticker.MultipleLocator(3600 * 2))
ax.set_xlim(0,3600*24)
ax.set_xlabel("Time of day")
ax.set_ylabel("Calls per half hour")

In [None]:
g = sns.FacetGrid(df, col='call_type', height=4)  # `size` was renamed to `height` in newer seaborn
g.map(sns.distplot, 'seconds_since_4am', kde=False, bins=48)
for i in range(3):
    ax = g.facet_axis(0,i)
    ax.xaxis.set_major_formatter(plt.ticker.FuncFormatter(seconds_to_time_formatter_4am))
    ax.xaxis.set_major_locator(plt.ticker.MultipleLocator(3600 * 2))
    ax.set_xlim(0,3600*24)
    ax.set_xlabel("Time of day")
g.set_ylabels('Calls per half hour')

Sure enough, plotting each call category separately makes it clear that most of the irregularity comes from traffic calls, which spike during rush hour and at noon. Another interesting trend is that EMS calls are most frequent in the late morning, which I can't think of an obvious explanation for. Going forward, it would be interesting to find out how time and location interact. For example, which neighborhoods in the Philadelphia area have the most 911 calls of each type? Do the time-of-day trends stay consistent across neighborhoods? Are certain areas more sensitive to time of day? Also, how do these trends change over the course of a week?
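
As a starting point for the time-by-location question, a cross-tabulation of township against hour of day might be enough. The rows below are made up, and treating `twp` as the location key is my assumption:

```python
import pandas as pd

# Made-up calls for two hypothetical townships.
toy = pd.DataFrame({
    'twp': ['A', 'A', 'B', 'B', 'B', 'A'],
    'timeStamp': ['2016-01-04 08:15:00', '2016-01-04 17:40:00',
                  '2016-01-05 08:05:00', '2016-01-05 08:50:00',
                  '2016-01-06 23:10:00', '2016-01-07 08:30:00'],
})
toy['hour'] = pd.to_datetime(toy['timeStamp']).dt.hour

# Rows are townships, columns are hours, cells are call counts.
hour_by_twp = pd.crosstab(toy['twp'], toy['hour'])

# Row-normalize so townships with different call volumes are comparable.
hour_share = hour_by_twp.div(hour_by_twp.sum(axis=1), axis=0)
print(hour_share)
```

Passing `hour_share` to `sns.heatmap` would then show at a glance whether the time-of-day profile differs between areas.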