The purpose of this notebook is to learn pandas and seaborn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import datetime
from collections import Counter
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

In [None]:
i = pd.read_csv('../input/911.csv')

In [None]:
i.info()

In [None]:
i.head()

In [None]:
i['isFirstHalfOfTheDay'] = i.timeStamp.apply(lambda x : datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").time() <= datetime.time(12,0))
i['isNight'] = i.timeStamp.apply(lambda x : datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").time() >= datetime.time(19,0))

In [None]:
# Let's plot the counts for the two halves of the day
# (pass the column by keyword; recent seaborn versions no longer accept it positionally)
sns.countplot(x=i.isFirstHalfOfTheDay)

No surprises there. One would expect there to be more emergencies towards the second half of the day, which is the case here.

In [None]:
# Let's extract the type and cause of the emergency from the title ("Type: Cause")
i['Type'] = i.title.apply(lambda x : x.split(':')[0].strip())
i['Cause'] = i.title.apply(lambda x : x.split(':')[1].strip('- '))
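
The extraction above assumes every title contains a colon. As a minimal sketch of a more forgiving variant, the hypothetical helper `split_title` below uses `str.partition`, which returns an empty cause instead of raising when the colon is missing. The sample titles are illustrative values, not rows read from the CSV. Note that `partition` keeps everything after the first colon, whereas `split(':')[1]` keeps only the text up to a second colon, so the two differ for causes that themselves contain a colon.

```python
def split_title(title):
    """Split a title of the form 'Type: Cause' into (type, cause)."""
    # partition never raises: a title without ':' yields an empty cause
    # instead of the IndexError that title.split(':')[1] would throw.
    kind, _, cause = title.partition(':')
    # Same clean-up as in the notebook: trim spaces and trailing dashes.
    return kind.strip(), cause.strip('- ')


# Illustrative titles (hypothetical sample values, not read from the CSV)
samples = ["EMS: BACK PAINS/INJURY", "Traffic: VEHICLE ACCIDENT -", "Fire"]
parsed = [split_title(t) for t in samples]
```

If this variant were preferred, it could presumably be applied with `i.title.apply(split_title)` and the resulting tuples unpacked into the two columns.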

In [None]:
i.Type.unique()

In [None]:
i.Cause.unique()

In [None]:
## Now let's see the count plot based on the type of the emergency
sns.countplot(x=i.Type)

As we can see, medical emergencies are the most frequent, followed by traffic and lastly fire.

Let's see how the emergencies are distributed throughout the day. For this, we split the day into four equal time slices:

* 00 - 06 -> 0
* 06 - 12 -> 1
* 12 - 18 -> 2
* 18 - 00 -> 3
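
Because the four slices are equal six-hour blocks, the mapping in the list above amounts to integer division of the hour by 6. The sketch below shows this with the standard library only; `timeslice_from_string` is a hypothetical helper name, and the default format string is the one used elsewhere in this notebook for `timeStamp`.

```python
import datetime


def timeslice_from_string(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the time slice (0-3) for a timestamp string."""
    hour = datetime.datetime.strptime(stamp, fmt).hour
    # Hours 0-5 -> 0, 6-11 -> 1, 12-17 -> 2, 18-23 -> 3
    return hour // 6


# A few boundary checks on made-up timestamps
examples = {
    "2015-12-10 00:00:00": timeslice_from_string("2015-12-10 00:00:00"),
    "2015-12-10 05:59:59": timeslice_from_string("2015-12-10 05:59:59"),
    "2015-12-10 06:00:00": timeslice_from_string("2015-12-10 06:00:00"),
    "2015-12-10 17:40:00": timeslice_from_string("2015-12-10 17:40:00"),
    "2015-12-10 18:00:00": timeslice_from_string("2015-12-10 18:00:00"),
}
```

On the DataFrame itself, a vectorized form such as `pd.to_datetime(i.timeStamp).dt.hour // 6` should give the same values without a Python-level loop, assuming the column parses cleanly.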

In [None]:
## Helper that maps a timestamp to one of the four time slices

def get_timeslice(timestamp):
    if timestamp.time() < datetime.time(6,0):
        return 0
    elif timestamp.time() >= datetime.time(6,0) and timestamp.time() < datetime.time(12,0):
        return 1
    elif timestamp.time() >= datetime.time(12,0) and timestamp.time() < datetime.time(18,0):
        return 2
    else:
        return 3

In [None]:
## Now let's assign each call to one of the 4 time slices, so we can run a count plot on it.
i['timeSlice'] = i.timeStamp.apply(lambda x : get_timeslice(datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")))

In [None]:
sns.countplot(x=i.timeSlice)

It appears most of the emergencies happen during the day, between 6 AM and 6 PM. Now, let's see how each type of emergency is distributed among the time slices.

In [None]:
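# Sum the `e` column per (timeSlice, Type); this equals the number of calls
# only if `e` is 1 on every row (an assumption about the dataset)
a = i.groupby(['timeSlice', 'Type']).e.sum()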

In [None]:
a.unstack().plot(kind='bar', stacked=True)

Irrespective of the time slice, medical emergencies occupy the top position, but for time slice 2, medical and traffic emergencies are almost equal in number. Below are the top 10 causes of emergencies in time slice 2.

In [None]:
i[i.timeSlice == 2].Cause.value_counts()[:10]

All of the top 10 causes except Road Obstruction seem to result in a medical emergency (it is logical that a traffic accident requires medical support).

In [None]:
# Here's another cool way to see the distribution of emergencies wrt Type and timeSlice
sns.heatmap(a.unstack())

Let's see if we can predict the Type of the emergency based on the time slice.

In [None]:
encoder = preprocessing.OneHotEncoder()
# A pandas Series has no reshape(); select the column as a one-column DataFrame instead
b = encoder.fit_transform(i[['timeSlice']])

In [None]:
r = LogisticRegression(multi_class='multinomial', solver='lbfgs')
r.fit(b, i.Type)

In [None]:
sum(r.predict(b) == i.Type)/len(i.Type)

Apparently not so well.
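
One way to read this result: a model that sees only the time slice can do no better on the training data than predicting the most common Type within each slice. The sketch below computes that ceiling with `Counter`. The function name `best_accuracy_from_single_feature` and the toy lists are hypothetical and are not taken from the 911 data.

```python
from collections import Counter, defaultdict


def best_accuracy_from_single_feature(features, labels):
    """Upper bound on training accuracy for any model that sees only `features`."""
    per_value = defaultdict(Counter)
    for f, y in zip(features, labels):
        per_value[f][y] += 1
    # Best possible rule: predict the most common label within each feature value
    correct = sum(c.most_common(1)[0][1] for c in per_value.values())
    return correct / len(labels)


# Toy data (hypothetical, not the 911 dataset)
slices = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
types = ["EMS", "EMS", "EMS", "Traffic", "EMS",
         "Traffic", "EMS", "Traffic", "EMS", "Fire"]
toy_accuracy = best_accuracy_from_single_feature(slices, types)
```

Running the same function on `i.timeSlice` and `i.Type` would give the ceiling to compare the logistic regression score against. If the two are close, the limitation is the single feature rather than the model.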