Kaggle Dashboarding Day 1 - *Seattle Crisis Data*

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

import matplotlib.pyplot as plt
print(plt.style.available)
plt.style.use('ggplot')

**Read the .csv data into a DataFrame, then list the columns to look for data of interest.**

In [None]:
# read_csv already returns a DataFrame, so no separate initialization is needed
df = pd.read_csv("../input/crisis-data.csv")
print(list(df.columns.values))

**I thought it would be interesting to analyze the distribution of initial call types across the precincts. To start, I look for data that won't be useful: invalid precincts, missing call types, and invalid report dates.**

In [None]:
print(df.groupby('Reported Date').count())
print(df['Precinct'].value_counts())
print(df['Precinct'].isnull().value_counts())
print(df['Initial Call Type'].isnull().value_counts())

**Get rid of rows that don't contribute to the analysis.**

In [None]:
df = df[(df['Reported Date'] != '1900-01-01') & (df['Precinct'] != 'UNKNOWN')]
df = df.dropna(subset=['Precinct','Reported Date','Initial Call Type'], how='any')

**Check that the rows matching those criteria were actually removed.**

In [None]:
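# `in` on a Series tests the index labels, not the values, so compare the values instead.
# Both lines should print False once the unwanted rows are gone.
print((df['Reported Date'] == '1900-01-01').any())
print((df['Precinct'] == 'UNKNOWN').any())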
print(df['Precinct'].isnull().value_counts())
print(df['Initial Call Type'].isnull().value_counts())
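
**A note on membership checks: the `in` operator on a pandas Series tests the index labels, not the values, so checking the values takes a comparison reduced with `.any()`. A minimal sketch on a made-up Series (the toy values are assumptions for illustration, not rows from the dataset):**

```python
import pandas as pd

# Toy Series for illustration only; not taken from crisis-data.csv
precinct = pd.Series(['North', 'UNKNOWN', 'West'])

# `in` looks at the index labels (0, 1, 2), so this is False
# even though 'UNKNOWN' is one of the values
label_check = 'UNKNOWN' in precinct

# Comparing the values and reducing with .any() tests the values themselves
value_check = bool((precinct == 'UNKNOWN').any())

print(label_check, value_check)
```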

**I am most interested in learning about the top five most frequent call types and analyzing their distribution.**

In [None]:
print(df['Initial Call Type'].value_counts().head())

**Since I now know the top five most frequent call types, I'll filter the initial DataFrame so that I am only left with rows in which those call types are present.**

In [None]:
# 1298 is the count of the fifth most frequent call type, so this keeps the top five
filtered = df.groupby('Initial Call Type').filter(lambda x: len(x) >= 1298)
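
**One possible alternative to a hard-coded count threshold is to take the most frequent labels from `value_counts()` and filter with `isin()`. This is a sketch on made-up data; the names `calls`, `top_n`, `top_types`, and `filtered_calls` are hypothetical and not part of the notebook above:**

```python
import pandas as pd

# Toy data for illustration only; the real notebook reads crisis-data.csv
calls = pd.DataFrame({
    'Initial Call Type': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'E', 'F'],
})

# value_counts() sorts by frequency, so head(n) gives the n most frequent labels
top_n = 3
top_types = calls['Initial Call Type'].value_counts().head(top_n).index

# Keep only the rows whose call type is among those labels
filtered_calls = calls[calls['Initial Call Type'].isin(top_types)]
print(filtered_calls['Initial Call Type'].value_counts())
```

**The two approaches can differ when counts tie at the cutoff: a threshold keeps every tied label, while `head(n)` keeps exactly n.**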

**Now I want to check how many times each call type shows up within each precinct, so I'll group my DataFrame accordingly and add a column with the counts.**

In [None]:
totals = filtered.groupby(['Precinct','Initial Call Type'])['Template ID'].count().reset_index(name='Count')
print(totals)
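
**The grouped counts are in long format (one row per precinct and call type). Reshaping them to wide format gives one row per precinct and one column per call type, which is a convenient shape for a stacked bar chart. A minimal sketch with `DataFrame.pivot`, using made-up numbers; `totals_demo` and `wide` are hypothetical names:**

```python
import pandas as pd

# Toy long-format counts shaped like `totals`; the numbers are made up
totals_demo = pd.DataFrame({
    'Precinct': ['East', 'East', 'North', 'North'],
    'Initial Call Type': ['Disturbance', 'Suicide', 'Disturbance', 'Suicide'],
    'Count': [10, 20, 30, 40],
})

# One row per precinct, one column per call type
wide = totals_demo.pivot(index='Precinct', columns='Initial Call Type', values='Count')
print(wide)

# A single column can then be pulled out as a per-precinct sequence
suicide_by_precinct = tuple(wide['Suicide'])
```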

**Now all that's left is to create a stacked bar chart, since I'm most interested in two things: the total number of top-five calls in each precinct, and how those calls are distributed within each precinct.**

In [None]:
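N = 5
ind = np.arange(N)
# Counts per precinct, in the same order as the x-axis labels:
# East, North, South, Southwest, West
pers = (1832, 1940, 852, 728, 2183)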
suic = (1622, 2567, 843, 722, 2289)
dist = (1072, 1084, 462, 267, 1355)
susp = (356, 501, 266, 143, 422)
serv = (282, 443, 177, 113, 283)
width = 0.4

plt.figure(figsize=(15,10))
p1 = plt.bar(ind, pers, width)
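# Each series sits on the running total of the series below it
p2 = plt.bar(ind, suic, width, bottom=pers)
p3 = plt.bar(ind, dist, width, bottom=np.add(pers, suic))
p4 = plt.bar(ind, susp, width, bottom=np.sum([pers, suic, dist], axis=0))
p5 = plt.bar(ind, serv, width, bottom=np.sum([pers, suic, dist, susp], axis=0))

plt.ylabel('Number of Occurrences')
plt.title('Distribution of Initial Call Types by Precinct')
plt.xticks(ind, ('East','North','South','Southwest','West'))
# The tallest stack is about 6,500, so the ticks run up to 7,000
plt.yticks(np.arange(0, 7001, 500))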
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0]), ('Person in Crisis', 'Suicide', 'Disturbance', 'Suspicious Person', 'Service'))

plt.show()
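
**A stacked bar chart needs each series drawn on top of the cumulative total of the series before it. One way to avoid writing each `bottom` by hand is to keep a running total in a loop. The sketch below reuses the counts from the cell above; the pairing of each tuple with a call type follows the legend order and is an assumption, and `counts` and `bottom` are hypothetical names:**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Counts from the cell above; the label-to-series pairing follows the legend order (assumed)
precincts = ['East', 'North', 'South', 'Southwest', 'West']
counts = pd.DataFrame({
    'Person in Crisis': [1832, 1940, 852, 728, 2183],
    'Suicide': [1622, 2567, 843, 722, 2289],
    'Disturbance': [1072, 1084, 462, 267, 1355],
    'Suspicious Person': [356, 501, 266, 143, 422],
    'Service': [282, 443, 177, 113, 283],
}, index=precincts)

fig, ax = plt.subplots(figsize=(15, 10))

# Running total of everything drawn so far; starts at zero for every precinct
bottom = np.zeros(len(counts))
for call_type in counts.columns:
    ax.bar(counts.index, counts[call_type], width=0.4, bottom=bottom, label=call_type)
    bottom = bottom + counts[call_type].to_numpy()

ax.set_ylabel('Number of Occurrences')
ax.set_title('Distribution of Initial Call Types by Precinct')
ax.legend()
```

**After the loop, `bottom` holds the total height of each stack, which is a quick way to confirm the bars add up to the per-precinct totals.**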

**Thank you for looking over my notebook. I appreciate any and all feedback!**