Welcome to this notebook. I use it as an exercise for the Udemy course "Python for Data Science and Machine Learning Bootcamp" (https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp), where this data serves as a capstone project for the data visualization part. I'll go through my learning process step by step, so that interested fellow learners can see and learn from the struggles I had while wrangling the data. I hope you find it helpful.

I would be glad to receive suggestions for improvements, upvotes and any comment you may have!

Let the analysis and visualization begin:
![](http://media.dmnews.com/images/2013/10/10/bigstock-modern-business-conce_473073.jpg)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

df=pd.read_csv(r"../input/911.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
# Drop every row that has a missing value, e.g. no zip code (~8% of the data), so only complete records remain
df.dropna(inplace=True)
df.describe()
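
Before dropping rows, it can help to check which columns actually contain the gaps. The following is only a minimal sketch on a made-up toy frame (the column names mirror the dataset, the values are hypothetical), comparing a plain `dropna()` with one restricted to the zip column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from 911.csv
toy = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "twp": ["NORRISTOWN", "LANSDALE", None, "POTTSTOWN"],
    "e": [1, 1, 1, 1],
})

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() without arguments removes every row with at least one missing value
complete = toy.dropna()

# Restricting the check to "zip" keeps rows that only lack another field
zip_only = toy.dropna(subset=["zip"])

print(len(toy), len(complete), len(zip_only))
```

Which variant is preferable depends on whether the later analysis needs every column or only the zip code.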

In [None]:
df.drop("e",axis=1).head(1)

In [None]:
df.plot.scatter(y="lat",x="lng")

There are some extreme outliers, so let's zoom in on the region where most of the calls are located.

In [None]:
fig=plt.figure(figsize=(10,10))
ax=fig.add_axes([0,0,1,1])
# Zoom in on the main cluster; the limits run west to east, so no axis inversion is needed
ax.set_xlim(-75.75, -74.98)
ax.set_ylim(39.90, 40.67)
ax.scatter(y=df["lat"],x=df["lng"],s=0.001)

This actually looks like a map: roads, cities and villages. Let's look at the map of Montgomery County:
![](http://www.usgwarchives.net/maps/pa/county/montgo/usgs/montgolg.jpg)

In [None]:
df[(df["zip"]<30000)&(df["zip"]>18500)]["zip"].hist(bins=40)

The 194xx zip codes are in the upper left corner and the 190xx codes are in the lower right region.
http://www.eachtown.com/ZIP/194/

Therefore zip and longitude/latitude should have a certain correlation.

In [None]:
print(df.zip.corr(df.lat))
print(df.zip.corr(df.lng))

The higher the zip code, the higher the latitude (north-south), i.e. the further north the district.
The same holds for longitude (east-west): districts with a higher zip code lie further west, i.e. at a more negative longitude.
![](http://www.wpb-radon.com/maps/Montgomery%20County%20radon.jpg)
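
To illustrate how the sign of the coefficient reads, here is a small sketch on made-up coordinates in which the higher zip codes were deliberately placed further north-west. The numbers are hypothetical and only roughly resemble the county; `Series.corr` computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy values: higher zip codes placed further north-west
toy = pd.DataFrame({
    "zip": [19001, 19038, 19401, 19446, 19464],
    "lat": [40.12, 40.10, 40.12, 40.24, 40.25],
    "lng": [-75.12, -75.17, -75.34, -75.28, -75.65],
})

# Pearson correlation between zip code and each coordinate
zip_lat = toy["zip"].corr(toy["lat"])
zip_lng = toy["zip"].corr(toy["lng"])

# A positive zip/lat value means "higher zip, further north";
# a negative zip/lng value means "higher zip, further west"
print(round(zip_lat, 2), round(zip_lng, 2))
```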

Let's look at the time:

In [None]:
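from datetime import datetime as dt
# Slice the timestamp string ("YYYY-MM-DD HH:MM:SS") and parse each part separately.
# strptime fills missing fields with defaults, so "Year" becomes January 1st of that year
# and "Month" and "Time" end up in the year 1900.
df["Day"]=df["timeStamp"].apply(lambda x: dt.strptime(str(x)[0:10], '%Y-%m-%d'))
df["Month"]=df["timeStamp"].apply(lambda x: dt.strptime(str(x)[5:7], '%m'))
df["Year"]=df["timeStamp"].apply(lambda x: dt.strptime(str(x)[0:4], '%Y'))
# '%H:%M:%S' instead of '%X' keeps the parsing independent of the locale
df["Time"]=df["timeStamp"].apply(lambda x: dt.strptime(str(x)[11:19], '%H:%M:%S'))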
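
As an alternative to slicing the string by hand, the column could be parsed once with `pd.to_datetime` and the parts read through the `.dt` accessor. This is only a sketch on two made-up timestamps. The lowercase column names are hypothetical and hold plain integers, unlike the datetime-valued columns created above, so any later filter on `"Year"` would need to be adapted if this variant were used instead.

```python
import pandas as pd

# Hypothetical timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as timeStamp
toy = pd.DataFrame({"timeStamp": ["2015-12-10 17:40:00", "2016-01-02 08:05:30"]})

# Parse the whole column once, then read the parts through the .dt accessor
ts = pd.to_datetime(toy["timeStamp"], format="%Y-%m-%d %H:%M:%S")
toy["year"] = ts.dt.year          # plain integers such as 2015
toy["month"] = ts.dt.month        # 1 to 12
toy["hour"] = ts.dt.hour          # 0 to 23
toy["weekday"] = ts.dt.day_name() # English weekday names

print(toy)
```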

In [None]:
df.groupby("Year").size()

In [None]:
by_year=df.groupby("Year").size()
by_year.plot(kind="bar")

Recording of the calls obviously started in late 2015.

In [None]:
by_month=df.groupby("Month").size()
by_month.plot(kind="bar")

This is obviously skewed by the 2015 figures, since recording only started in December of that year.

In [None]:
# "Year" holds January 1st of each year as a datetime, so this drops all 2015 rows
df2=df[df["Year"]!=pd.Timestamp("2015-01-01")]
by_month=df2.groupby("Month").size()
by_month.plot()

December/January and the summer months June/July/August seem to be more dangerous than spring and autumn.

In [None]:
by_day=df.groupby("Day").size()
by_day.plot()

This is not what I was interested in. I wanted to see weekdays. When are more calls incoming? Weekends?

In [None]:
# Thanks stackoverflow: https://stackoverflow.com/questions/30222533/create-a-day-of-week-column-in-a-pandas-dataframe-using-python
# .dt.weekday_name was removed in newer pandas versions; .dt.day_name() returns the same names
df["Weekday"]=df["Day"].dt.day_name()

In [None]:
df.groupby("Weekday").size().plot(kind="bar")
#weekday.plot()

Funny sorting (the weekdays come out alphabetically). Anyway. Contrary to my assumption, weekends appear to be safer than weekdays.
![](https://sd.keepcalm-o-matic.co.uk/i-w600/its-friday-forget-work-and-enjoy-your-weekend.jpg)
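
The alphabetical order comes from `groupby` sorting its keys. One way to get the days in calendar order is to `reindex` the counts with an explicit list. A minimal sketch with made-up counts (the numbers are hypothetical, only the index labels match what `day_name()` produces):

```python
import pandas as pd

# Hypothetical call counts per weekday, in the alphabetical order groupby produces
counts = pd.Series(
    [120, 150, 90, 80, 140, 145, 155],
    index=["Friday", "Monday", "Saturday", "Sunday",
           "Thursday", "Tuesday", "Wednesday"],
)

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# reindex rearranges the rows so they follow the given labels
ordered = counts.reindex(weekday_order)
print(ordered)
```

Calling `ordered.plot(kind="bar")` would then draw the bars from Monday to Sunday.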

Ok. Next: Let's look into the times of the calls:

In [None]:
by_time=df.groupby("Time").size()
by_time.size

In [None]:
by_time.plot()

Emergency calls seem to stay at a fairly stable level during the "normal day time" from 8:00 to 19:00. They ramp up in the morning, starting at 6:00, and ramp down from 19:00 to midnight. During the night (midnight to 6:00) there is a rather stable, low level of emergencies.
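
Grouping by the full time of day creates one group per second, so each group holds only a few calls and the curve is noisy. Aggregating by hour might show the daily pattern more clearly. A minimal sketch on made-up timestamps (the values are hypothetical):

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the timeStamp column
toy = pd.DataFrame({"timeStamp": [
    "2016-03-01 07:15:00", "2016-03-01 07:45:10",
    "2016-03-01 12:00:00", "2016-03-02 12:30:00",
    "2016-03-02 23:59:59",
]})

# Extract the hour (0 to 23) and count the calls that fall into each hour
hour = pd.to_datetime(toy["timeStamp"]).dt.hour
by_hour = toy.groupby(hour).size()
print(by_hour)
```

With at most 24 groups, `by_hour.plot(kind="bar")` gives one bar per hour of the day.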

Next: Let's look into the addresses. The only possibly interesting question that comes to my mind is whether there are "frequent callers". Let's see.

In [None]:
two_calls=df.groupby("addr")
two_calls.size().head(5)

Never mind. The addresses contain only streets, so the number of calls will (also) depend on the length of the street.

Next one to look into: "title". I think these are categories of incidents.

In [None]:
incidents=df.groupby("title")
incidents.size().head(5)

In [None]:
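# Sum only the dummy column "e" (presumably 1 for every call), which counts the calls per title;
# summing all columns fails on the text and datetime columns in newer pandas versions
incidents=df.groupby("title")[["e"]].sum()
top50=incidents.sort_values("e",ascending=False).head(50)
top50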

There are three categories of calls: Fire, Traffic and "EMS". EMS stands for Emergency Medical Services, so these are medical issues. It would be interesting to save these categories into columns for further analysis.

In [None]:
# Indicator helpers: return 1 if the title starts with the category name, else 0
def fire(title):
    return int(title.startswith("Fire"))

def traffic(title):
    return int(title.startswith("Traffic"))

def ems(title):
    return int(title.startswith("EMS"))

df["Fire"]=df["title"].apply(fire)
df["Traffic"]=df["title"].apply(traffic)
df["EMS"]=df["title"].apply(ems)

df.head(10)

In [None]:
df.Fire.sum()

In [None]:
df.Traffic.sum()

In [None]:
df.EMS.sum()

I think a categorical column will be useful for some visualizations.

In [None]:
def categorization(title):
    # Map a title to its category; titles outside the three known categories give None
    if title.startswith("Fire"):
        return "Fire"
    elif title.startswith("Traffic"):
        return "Traffic"
    elif title.startswith("EMS"):
        return "EMS"
    return None

df["Category"]=df["title"].apply(categorization)
df.head(5)
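
If every title has the form "Category: description" (an assumption, based on the titles seen so far), the category could also be taken directly from the text before the colon, and the 0/1 columns derived from it. A minimal sketch on three illustrative titles:

```python
import pandas as pd

# Illustrative titles following the assumed "Category: description" pattern
toy = pd.DataFrame({"title": [
    "EMS: BACK PAINS/INJURY",
    "Fire: GAS-ODOR/LEAK",
    "Traffic: VEHICLE ACCIDENT -",
]})

# Everything before the first colon is taken as the category
toy["Category"] = toy["title"].str.split(":").str[0]

# One 0/1 indicator column per category
indicators = pd.get_dummies(toy["Category"]).astype(int)

print(toy)
print(indicators)
```

Unlike the hard-coded checks, this would also pick up a category that is not one of the three known ones.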

In [None]:
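# Pass the column by keyword; newer seaborn versions no longer accept it positionally as x
sns.countplot(x="Category",data=df)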

Ok. Let's create a heatmap with categories on one axis and times on the other (weekday, hours).

category_matrix=
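
The unfinished `category_matrix` line could be completed in several ways. One possible sketch, assuming the matrix should hold the number of calls per category and weekday, counts the pairs with `groupby` and spreads the weekdays into columns with `unstack`. The toy rows below are made up:

```python
import pandas as pd

# Hypothetical calls with the Category and Weekday columns built earlier
toy = pd.DataFrame({
    "Category": ["EMS", "EMS", "Fire", "Traffic", "EMS", "Traffic"],
    "Weekday": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
})

# Count calls per (Category, Weekday) pair and turn the weekdays into columns;
# combinations that never occur are filled with 0
category_matrix = (
    toy.groupby(["Category", "Weekday"]).size().unstack(fill_value=0)
)
print(category_matrix)

# In the notebook, sns.heatmap(category_matrix) could then draw this matrix
```

Replacing `"Weekday"` with an hour column would give the category-by-hour version of the same matrix.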

To Do / To look into next:

twp = township - Histogram
desc = free text (I think) - let's analyze word frequencies, i.e. the most used words
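
As a starting point for the word analysis, the words could be counted with `collections.Counter`. This is only a sketch: the descriptions below are made up, and the rule of keeping only runs of two or more letters is an assumption to skip numbers and single characters.

```python
from collections import Counter
import re

# Hypothetical free-text descriptions, not taken from the desc column
descriptions = [
    "MAIN ST & OAK AVE; Station 308A;",
    "MAIN ST & ELM RD; Station 329;",
    "OAK AVE & ELM RD; Station 308A;",
]

words = Counter()
for text in descriptions:
    # Upper-case the text and keep runs of at least two letters as words
    words.update(re.findall(r"[A-Z]{2,}", text.upper()))

# The most frequent words first
print(words.most_common(3))
```

On the real column, something like `df["desc"]` would take the place of the `descriptions` list.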