The first notebook I prepared was [My Findings on the External factors on Exam Scores](https://www.kaggle.com/ksg30111992/my-findings-on-the-external-factors-on-exam-scores), where the data was fictional. My past experience as a student gave me a premise that helped me frame the analyses rather quickly.

With the Crisis data, I had a quick look, made a few visualizations, and then made some more. At some point, it stopped being "what insights can I get from this" and became more about "just how many visualizations are possible on this data". With that in mind, I'll try to keep this notebook simple. There will definitely be insights that I have overlooked or missed.

I start off this notebook by mentioning the Source of Data. This data is hosted by the [City of Seattle](https://data.seattle.gov/). The data is maintained in Kaggle by [Socrata](https://socrata.com/)'s API.

I'm still trying to find the best way to acknowledge data, but this one should do for now. I know Kaggle associates each notebook with its respective dataset and provides acknowledgements, but starting from this notebook, I'll try to acknowledge the data even if there is an acknowledgement by default.

I will also be using Pandas and Seaborn. Since I'm a beginner to Seaborn, there will be parts that might appear inefficient and could be improved significantly.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import datetime
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
sns.set()

# Any results you write to the current directory are saved as output.

**Loading the Data**
* I will be importing the data into a DataFrame. I will also be cleaning up the column names so that they look more like variables that can be referred to easily. In other words, I convert the column names to lower case and replace spaces and symbols with underscores.
* The dataset contains two columns _Reported Date_ and _Reported Time_, which I will be merging into one column so as to maintain consistency with the column _Occurred Date / Time_. Will this help me or not, I'm not very sure, but as long as it's similar, I'm happy with it.
* The data also contains a few records where Reported Date is 1/1/1900. I will be replacing these values with the respective value in Occurred Date/Time. In other words, where the Reported Date is unclear, I will be considering the Reported Date/Time and Occurred Date/Time to be the same.
* The data contains more than one year of data, so I will be taking the years into a variable which I can later use for drilling down totals into their respective years
* I will also be using a small list of months in 3 letters (not full names because the labels almost overlapped in some graphs) for sorting my month series data in calendar order and not alphabetical order. I could use [pd.Categorical](https://stackoverflow.com/questions/48042915/sort-a-pandass-dataframe-series-by-month-name) or [pd.CategoricalIndex](https://stackoverflow.com/questions/40816144/pandas-series-sort-by-month-index), but then I'd have to have a special column just for months. I find reindexing just the counts a lot easier, and it requires me to write less code.
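
As a minimal sketch of the steps listed above, the snippet below runs them on a tiny made-up frame instead of the real CSV. The `toy` frame and its values are assumptions for illustration only; the column names mirror the ones described in the list. The month label is built from the month number here; `dt.strftime("%b")` should give the same labels in an English locale.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; all values are made up
toy = pd.DataFrame({
    "Reported Date": ["2018-03-05", "1900-01-01", "2017-01-20"],
    "Reported Time": ["10:15:00", "00:00:00", "18:40:00"],
    "Occurred Date / Time": ["2018-03-05 09:30:00", "2018-02-11 17:00:00", "2017-01-20 18:00:00"],
})

# Merge the two reported columns into a single timestamp column
toy["Reported Date_Reported Time"] = pd.to_datetime(toy["Reported Date"] + " " + toy["Reported Time"])
toy["Occurred Date / Time"] = pd.to_datetime(toy["Occurred Date / Time"])
toy = toy.drop(columns=["Reported Date", "Reported Time"])

# Lower-case the names and replace "/" and spaces with underscores
toy.columns = (
    toy.columns.str.strip().str.lower()
    .str.replace("/", "_", regex=False)
    .str.replace(" ", "_", regex=False)
)

# Where the reported year is the 1900 placeholder, fall back to the occurred timestamp
is_placeholder = toy["reported_date_reported_time"].dt.year == 1900
toy["reported_date_reported_time"] = toy["reported_date_reported_time"].where(
    ~is_placeholder, toy["occurred_date___time"]
)

# Count reports per month, then reindex so the months follow calendar order
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_names = toy["reported_date_reported_time"].dt.month.map(lambda m: months[m - 1])
month_counts = month_names.value_counts().reindex(months, fill_value=0)
print(month_counts)
```

Reindexing the counts keeps the month order out of the main DataFrame, which is the trade-off described in the last bullet.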

In [None]:
crisis_data = pd.read_csv("../input/crisis-data.csv", parse_dates=[["Reported Date", "Reported Time"], "Occurred Date / Time"], infer_datetime_format=True)
crisis_data.columns = crisis_data.columns.str.strip().str.lower().str.replace("/","_").str.replace(" ","_")
# Some reported date values are 1/1/1900. These values, which are rather few in number, will be overwritten by the occurred date/time
crisis_data.reported_date_reported_time = np.where(crisis_data.reported_date_reported_time.dt.year==1900, crisis_data.occurred_date___time,crisis_data.reported_date_reported_time)
crisis_data_years = [2015,2016,2017,2018]
crisis_data_months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
crisis_data.head()

**Crises Reported each year**

The total number of crises reported has increased over the past 4 years. Reported crises have doubled when comparing 2015 and 2018. Note that these crisis reports are not made by individuals; rather, they are reported by the Officers of the Seattle Police Department.
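
For readers who want a shorter route to the same kind of year-by-call-type table, here is a sketch using `pd.crosstab` on made-up data. This is an alternative I am assuming would give an equivalent table, not the approach used in the cell that follows; the `toy` frame, its column names and its call type values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per crisis report
toy = pd.DataFrame({
    "reported": pd.to_datetime([
        "2015-06-01", "2016-02-10", "2016-07-04",
        "2018-01-15", "2018-03-09", "2018-11-30",
    ]),
    "call_type": ["911", "911", "ONVIEW", "911", "TELEPHONE OTHER, NOT 911", "911"],
})

# Rows are years, columns are call types, cells are report counts (0 where there were none)
yearly_by_call_type = pd.crosstab(toy["reported"].dt.year, toy["call_type"])

# Row sums give the total number of reports per year
yearly_totals = yearly_by_call_type.sum(axis=1)
print(yearly_by_call_type)
print(yearly_totals)
```

A table shaped like this can be passed straight to `.plot(kind="bar", stacked=True)` to get one stacked bar per year.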

In [None]:
pd.concat([crisis_data[crisis_data.reported_date_reported_time.dt.year==x].call_type.value_counts().rename(x) for x in crisis_data.reported_date_reported_time.dt.year.unique()],axis=1, sort=False).fillna(0).T.plot(kind="bar", stacked=True, layout=(2,2), title="Crises reported each year",figsize=(16,4))
crisis_by_initial_call_type = pd.concat([crisis_data[crisis_data.reported_date_reported_time.dt.year==x].initial_call_type.value_counts().rename(x) for x in crisis_data.reported_date_reported_time.dt.year.unique()],axis=1, sort=False).fillna(0)
crisis_by_initial_call_type["Total"] = crisis_by_initial_call_type.sum(axis=1)
crisis_by_initial_call_type.sort_values(by="Total",ascending=False).drop("Total",axis=1).head(5)[::-1].plot(kind="barh", stacked=True, layout=(2,2), title="Types of Crises reported each year",figsize=(16,8),cmap="Reds");

**Crises Reports by Month**

The 2015 data doesn't show crisis reports for the first 5 months. Crisis reports in 2018 have increased in almost all months compared with 2017; the exceptions are August (1141 in 2018 versus 1196 in 2017) and October (1036 in 2018 versus 1156 in 2017). I'm not considering December as the data is still incomplete at the time of compiling this.

In [None]:
plt.subplots(figsize=(16,4))
sns.heatmap(pd.concat([crisis_data[crisis_data.reported_date_reported_time.dt.year==x].reported_date_reported_time.dt.strftime("%b").value_counts().rename(x) for x in crisis_data_years[::-1]], axis=1, sort=False).fillna(0).reindex(index=crisis_data_months).T,cmap='Reds',annot=True, fmt='g').set_title("Crises reported by year by month");

**Crises reported by the hour**

The general trend is that most crises are reported by officers during the day. Two spikes are noticeable: one between 10am-12pm, and a higher one between 5pm-7pm. But these are just crisis reports; when did these crises actually occur, and how long is the delay between a crisis occurring and it being reported?

In [None]:
ax = pd.concat([
    crisis_data[crisis_data.reported_date_reported_time.dt.year==x].reported_date_reported_time.dt.hour.value_counts().rename(x) for x in crisis_data_years
],axis=1).plot(kind="line", figsize=(16,4), title="Crises reported by the hour", xticks=np.arange(24), legend=True,x_compat=True).set_xticklabels(["{0:0=2d}:00".format(x) for x in np.arange(24)],rotation=90);

**When did the reported crises occur?**

Inferring from the heatmaps, most crises are reported about an hour after they occurred.

As observed in "Crises reported by the hour", most crises are reported between 10am-12pm and 5pm-7pm. That is also in line with the start and end of the 9-5 workday. Zooming out of this data, I read this, to a certain extent, as stress at work pushing individuals to hurt themselves and those around them. If that is the case, then that's all the more reason for employers to have some sort of counselling program or something similar to help improve their employees' mental wellness. I'm pretty sure some organizations do, but I digress.
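
The occurred-versus-reported comparison can also be sketched on a small scale. The snippet below uses made-up timestamps (the `toy` frame and its column names are assumptions) to compute the delay in minutes and to build a 24x24 grid of reported hour against occurred hour, which is the shape of table the heatmaps in the next cell are drawn from.

```python
import pandas as pd

# Hypothetical occurred/reported timestamps
toy = pd.DataFrame({
    "occurred": pd.to_datetime([
        "2018-05-01 09:40", "2018-05-01 17:05", "2018-05-02 17:50", "2018-05-03 23:30",
    ]),
    "reported": pd.to_datetime([
        "2018-05-01 10:20", "2018-05-01 17:45", "2018-05-02 18:10", "2018-05-04 00:15",
    ]),
})

# Delay between a crisis occurring and being reported, in minutes
delay_minutes = (toy["reported"] - toy["occurred"]).dt.total_seconds() / 60

# Rows: hour reported, columns: hour occurred. Reindexing fills in the hours with no records.
hour_grid = pd.crosstab(toy["reported"].dt.hour, toy["occurred"].dt.hour).reindex(
    index=range(24), columns=range(24), fill_value=0
)
print(delay_minutes.median())
print(hour_grid.loc[18, 17])
```

Counts that sit one row below the diagonal of `hour_grid` are reports filed in the hour after the crisis occurred.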

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=2,figsize=(16,16))
hour_labels = ["{0:0=2d}:00".format(x) for x in np.arange(24)]
# One heatmap per year: columns are the hour the crisis occurred, rows are the hour it was reported
for ax, year in zip(axes.flatten(), crisis_data_years):
    occurred_vs_reported = pd.concat([crisis_data[(crisis_data.occurred_date___time.dt.hour==x) & (crisis_data.occurred_date___time.dt.year==year)].reported_date_reported_time.dt.hour.value_counts().rename(hour_labels[x]) for x in np.arange(24)],axis=1)
    # Reindex to hours 0-23 so the rows line up with the hard-coded tick labels
    occurred_vs_reported = occurred_vs_reported.reindex(np.arange(24)).fillna(0)
    sns.heatmap(occurred_vs_reported,ax=ax,yticklabels=hour_labels).set_title(str(year))
    ax.set_xlabel("Occurred Time")
    ax.set_ylabel("Reported Time");

**Crises reported per Precinct**

Referring to Seattle's [Precinct and Patrol Boundaries](http://www.seattle.gov/police/about-us/about-policing/precinct-and-patrol-boundaries), a crisis reported in a region of Seattle is picked up by its respective precinct.

The North and West precincts have the most crisis reports, and the South has the fewest. It's interesting to note that a lot of the crises in the data are reported around the northern and western corners.

In [None]:
sns.heatmap(pd.concat([crisis_data[crisis_data.reported_date_reported_time.dt.year==x].precinct.value_counts().rename(x) for x in crisis_data_years],axis=1,sort=False).assign(Total=crisis_data.precinct.value_counts()).sort_values(by="Total", ascending=False).drop("Total",axis=1),cmap="Reds",annot=True,fmt='g').set_title("Crises reported per Precinct");

**Crises by sectors in 2018**

According to Seattle's [Precinct and Patrol Boundaries](http://www.seattle.gov/police/about-us/about-policing/precinct-and-patrol-boundaries), there are 17 sectors within the city.

I have to admit, I got caught in a spiral trying to find out what representation would be easier for this. Multiple pies, or Nested Pies. Turns out that nested pie required a whole lot of work, so I went with the former, and that was no walk in the park either.

Sectors Edward (EAST), King (WEST), David (WEST), Lincoln (NORTH), Mary (WEST), Union (NORTH) and Nora (NORTH) had the most crises in 2018 (in that order). While the sector with the most crises was in the East Precinct, most crises, as observed earlier, originated from the Northern and Western precincts.

I also made a nested pie of Total crises reported by Precinct by Sector (for all years). Although I prefer having individual illustrations for individual sectors, this is another alternative I thought of.
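
The pies in this section label each slice with its actual count rather than a percentage. One way to do that, sketched here in isolation, is to recover the count from the percentage that the `autopct` callback receives. The helper name `make_value_labeler` and the counts are made up for illustration, and this differs from the index-tracking function used in the cell that follows.

```python
def make_value_labeler(values):
    """Build an autopct-style callback that returns counts instead of percentages."""
    total = sum(values)

    def labeler(pct):
        # pct is the slice's share of the whole in percent, so the count can be recovered from it
        count = int(round(pct * total / 100.0))
        # Return an empty label for empty slices so the labels do not overlap
        return str(count) if count > 0 else ""

    return labeler


# Hypothetical sector counts for a single precinct
sector_counts = [120, 45, 0, 35]
labeler = make_value_labeler(sector_counts)

# Simulate what a pie chart would pass to the callback: each slice's percentage
labels = [labeler(100.0 * v / sum(sector_counts)) for v in sector_counts]
print(labels)
```

For a single pie, the returned function can be passed as `autopct=labeler`. With several pies drawn in one call, each pie would need a labeler built from its own column, since the totals differ.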

In [None]:
seattle_precincts = crisis_data.precinct.unique()
# Removing nan and UNKNOWN precincts
seattle_precincts = seattle_precincts[~((seattle_precincts=="UNKNOWN")|(pd.isnull(seattle_precincts)))]
crisis_data_by_sector_by_precinct = pd.concat([crisis_data[(crisis_data.precinct==x) & (crisis_data.reported_date_reported_time.dt.year==2018)].sector.value_counts().rename("") for x in seattle_precincts],axis=1,sort=False).fillna(0)

# Pie charts in Pandas don't show values by default. The autopct argument receives a float equal to the slice percentage.
# To show values instead, I index into the numbers as illustrated here https://stackoverflow.com/questions/48299254/pandas-pie-plot-actual-values-for-multiple-graphs
# Here, test is a temporary function that steps through each element of my DataFrame and returns the relevant value

row_index = [0]
column_index = [0]
def test(value):
    a = crisis_data_by_sector_by_precinct.iloc[row_index[0],column_index[0]]
    row_index[0]=(row_index[0]+1)%17
# autopct sends values in a flattened format, meaning that a multidimensional DataFrame is flattened to a single dimension before the values are sent.
# This can be leveraged such that the column_index is incremented when the last row is hit
    if(row_index[0]==0):
        column_index[0] = (column_index[0]+1)%5
# Since most values in my DataFrame could be zeros, I will not be displaying these values to prevent overlapping
    return int(a) if a>0 else ""

crisis_data_by_sector_by_precinct.plot(kind="pie",subplots=True,title=["Sectors in {0} Precinct".format(x) for x in seattle_precincts],figsize=(15,10),layout=(2,3),legend=False,autopct=test,cmap="Reds_r");

In [None]:
fig, ax = plt.subplots()
size = 0.3
radius = 2
pctdistance = 0.9
# Total crises reported by precinct by sector by year. Summing up this frame by level will give total crises per precinct for level=0, crises per sector for level=1, crises per year for level=2
crisis_by_precinct_sector_year_alternate = pd.DataFrame(pd.concat([crisis_data.precinct,crisis_data.sector,crisis_data.reported_date_reported_time.dt.year.rename("reported_year")],axis=1).groupby(["precinct","sector"]).reported_year.value_counts())
index_level = [0,0,0]

# add labels for level0 (precinct)
def add_labels_level0(value):
    temp = crisis_by_precinct_sector_year_alternate.sum(level=0).values[index_level[0]]
    tempstr = crisis_by_precinct_sector_year_alternate.sum(level=0).index.values[index_level[0]]
    index_level[0]+=1
    return int(temp)
# add labels for level1 (sector)
def add_labels_level1(value):
    temp = crisis_by_precinct_sector_year_alternate.sum(level=1).values[index_level[1]]
    tempstr = crisis_by_precinct_sector_year_alternate.sum(level=1).index.values[index_level[1]]
    index_level[1]+=1
    return int(temp)
# add labels for level2 (year)
def add_labels_level2(value):
    temp = crisis_by_precinct_sector_year_alternate.sum().values[index_level[2]]
    tempstr = crisis_by_precinct_sector_year_alternate.sum().index.values[index_level[2]]
    index_level[2]+=1
    return int(temp)

# turns out you can't pass a dataframe or a series and expect labels to be rotated. I guess pandas ruined me that way. Anyway, the values are flattened and then plotted
ax.pie(crisis_by_precinct_sector_year_alternate.sum(level=0).values.flatten(),radius=radius,wedgeprops=dict(width=size,edgecolor='w'),autopct=add_labels_level0,pctdistance=pctdistance,labels=crisis_by_precinct_sector_year_alternate.sum(level=0).index.tolist(),rotatelabels=True,labeldistance=1)
radius=radius-size
ax.pie(crisis_by_precinct_sector_year_alternate.sum(level=1).values.flatten(),radius=radius,wedgeprops=dict(width=size,edgecolor='w'),autopct=add_labels_level1,pctdistance=pctdistance,labels=crisis_by_precinct_sector_year_alternate.sum(level=1).index.tolist(),rotatelabels=True,labeldistance=0.45)
ax.set(aspect="equal");

**Suicide related crisis, Behavioral/Emotional crisis vs all the others**

Over 33% of the crisis reports are suicide or emotional/behavioral related. With the total numbers increasing year after year, crises related to mental wellness don't seem to be going down. Now, this is just the initial_call_type; almost all the final_call_types for suicide related incidents are General complaints.

Looking at the final call types that are not General complaints, the top ones are Suicidal Person and Attempts, Pickup/Transport, and Disturbances.
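
The grouping behind these percentages can be sketched on made-up data as follows. The `toy` frame and its call type strings are assumptions; the matching on "suicid" and "emotion" mirrors the filters used in the cell that follows, while `np.select` and `pd.crosstab` are a more compact route that I am assuming gives equivalent counts.

```python
import numpy as np
import pandas as pd

# Hypothetical reports with a year and an initial call type (one value is missing)
toy = pd.DataFrame({
    "year": [2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "initial_call_type": [
        "SUICIDAL PERSON", "DISTURBANCE", "PERSON IN EMOTIONAL CRISIS",
        "Suicide attempt", "TRAFFIC", None, "EMOTIONAL CRISIS",
    ],
})

# Case-insensitive substring matches; missing call types count as no match
is_suicide = toy["initial_call_type"].str.contains("suicid", case=False, na=False)
is_emotional = toy["initial_call_type"].str.contains("emotion", case=False, na=False)

# Assign each report to one group; anything unmatched falls into "Others"
toy["group"] = np.select(
    [is_suicide, is_emotional],
    ["Suicide Related", "Emotional Related"],
    default="Others",
)

# Counts per year and group, then each group's share of its year in percent
counts = pd.crosstab(toy["year"], toy["group"])
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

Dividing each row by its own total is what turns the yearly counts into the per-year percentages shown on the pies.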

In [None]:
suicide_emotional_vs_others=pd.concat([
#     Find all records by year where initial_call_type contains suicid. This will include Suicide and Suicidal call types
    crisis_data[crisis_data.initial_call_type.str.contains("suicid",case=False,na=False) & crisis_data.reported_date_reported_time.dt.year.isin(crisis_data_years)].reported_date_reported_time.groupby(crisis_data.reported_date_reported_time.dt.year).count().rename("Suicide Related"),
#     Find all records by year where initial_call_type contains emotion. This will include Emotion or emotion related crises
    crisis_data[crisis_data.initial_call_type.str.contains("emotion",case=False,na=False) & crisis_data.reported_date_reported_time.dt.year.isin(crisis_data_years)].reported_date_reported_time.groupby(crisis_data.reported_date_reported_time.dt.year).count().rename("Emotional Related"),
#     Finally, Find all records by year that are not the above. These are all the rest
    crisis_data[~(crisis_data.initial_call_type.str.contains("suicid",case=False,na=False)|crisis_data.initial_call_type.str.contains("emotion",case=False,na=False)) & crisis_data.reported_date_reported_time.dt.year.isin(crisis_data_years)].reported_date_reported_time.groupby(crisis_data.reported_date_reported_time.dt.year).count().rename("Others")
],axis=1).T
temp_index=[0]
def add_labels(value):
    a = suicide_emotional_vs_others.T.values.flatten()[temp_index[0]]
    temp_index[0]+=1
    return ("{0}\n{1:.1f}%".format(a,value))
suicide_emotional_vs_others.plot(
    kind="pie", subplots=True, figsize=(12,12), layout=(2,2), legend=False, title="Suicide and Emotional Crises vs the Other Crises", autopct=add_labels
);

In [None]:
suicide_initial_vs_final_calltype = pd.concat([
    crisis_data[crisis_data.initial_call_type.str.contains("suicid",na=False,case=False)].reported_date_reported_time.dt.year.rename("year"),
    crisis_data[crisis_data.initial_call_type.str.contains("suicid",na=False,case=False)].initial_call_type,
    crisis_data[crisis_data.initial_call_type.str.contains("suicid",na=False,case=False)].final_call_type
],axis=1)
suicide_initial_vs_final_calltype_count = pd.concat([suicide_initial_vs_final_calltype[(suicide_initial_vs_final_calltype.year==x) & (~suicide_initial_vs_final_calltype.final_call_type.str.contains("GENERAL",case=False,na=False))].final_call_type.value_counts().rename(x) for x in crisis_data_years],axis=1,sort=False).fillna(0)
suicide_initial_vs_final_calltype_count.index.name="Final Call Types for Suicide Related Initial Report"
suicide_initial_vs_final_calltype_count.sort_values(by=2018,ascending=False).head(10)

**Crisis Intervention Team Officer**

5% of the crisis reports in 2018 requested a Crisis Intervention Team (CIT) officer. In most of these, a CIT officer arrived.

**Note:** The visualizations do not consider if the dispatched/arriving CIT Officer was requested in the first place, unless explicitly specified. In other words, the first three rows of the visualizations do not drill down into the CIT Officer request.
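
To make the distinction in the note concrete, here is a small sketch on made-up "Y"/"N" flags. The `toy` frame is hypothetical; the column names follow the cleaned names used in this notebook. It computes the share of reports that requested a CIT officer, then drills down into dispatched versus arrived for the reports where no officer was requested.

```python
import pandas as pd

# Hypothetical flags for six reports
toy = pd.DataFrame({
    "cit_officer_requested":  ["Y", "N", "N", "N", "Y", "N"],
    "cit_officer_dispatched": ["Y", "Y", "N", "Y", "Y", "N"],
    "cit_officer_arrived":    ["Y", "Y", "N", "N", "Y", "N"],
})

# Share of reports that requested a CIT officer, in percent
requested_share = (toy["cit_officer_requested"] == "Y").mean() * 100

# Drill down: among reports with no request, compare dispatched against arrived
not_requested = toy[toy["cit_officer_requested"] == "N"]
dispatched_vs_arrived = pd.crosstab(
    not_requested["cit_officer_dispatched"], not_requested["cit_officer_arrived"]
)
print(round(requested_share, 1))
print(dispatched_vs_arrived)
```

The first number ignores the drill-down entirely, which is the situation the note describes for the first three rows; the crosstab is the drilled-down view used in the last row.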

In [None]:
cit_details_by_call_type = (crisis_data[crisis_data.reported_date_reported_time.dt.year==2018])[["template_id","initial_call_type","final_call_type","cit_officer_requested","cit_officer_dispatched","cit_officer_arrived"]]
fig,ax = plt.subplots(nrows=4,ncols=3,figsize=(16,20))

cit_details_by_call_type.cit_officer_requested.value_counts().plot(kind="pie",autopct="%.2f",title="How Often was a CIT Officer requested in 2018?",ax=ax[0,0]).set(ylabel="")
cit_details_by_call_type.cit_officer_dispatched.value_counts().plot(kind="pie",autopct="%.2f",title="How Often was a CIT Officer dispatched in 2018?",ax=ax[0,1]).set(ylabel="")
cit_details_by_call_type.cit_officer_arrived.value_counts()[::-1].plot(kind="pie",autopct="%.2f",title="How Often did the CIT Officer arrive in 2018?",ax=ax[0,2]).set(ylabel="")

cit_details_by_call_type.pivot_table(index="initial_call_type",columns=cit_details_by_call_type.cit_officer_requested.rename("CIT Officer Requested"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[1,0],sharey=True).set(ylabel="Initial Call Type");
cit_details_by_call_type.pivot_table(index="final_call_type",columns=cit_details_by_call_type.cit_officer_requested.rename("CIT Officer Requested"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[2,0]).set(ylabel="Final Call Type");
cit_details_by_call_type.pivot_table(index="initial_call_type",columns=cit_details_by_call_type.cit_officer_dispatched.rename("CIT Officer Dispatched"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[1,1]).set(ylabel="Initial Call Type");
cit_details_by_call_type.pivot_table(index="final_call_type",columns=cit_details_by_call_type.cit_officer_dispatched.rename("CIT Officer Dispatched"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[2,1]).set(ylabel="Final Call Type");
cit_details_by_call_type.pivot_table(index="initial_call_type",columns=cit_details_by_call_type.cit_officer_arrived.rename("CIT Officer Arrived"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[1,2]).set(ylabel="Initial Call Type");
cit_details_by_call_type.pivot_table(index="final_call_type",columns=cit_details_by_call_type.cit_officer_arrived.rename("CIT Officer Arrived"),values="template_id",aggfunc="count").sort_values(by="N",ascending=False).head(5)[::-1].plot(kind="barh",ax=ax[2,2]).set(ylabel="Final Call Type");

sns.heatmap(ax=plt.subplot2grid((4,2),(3,0),rowspan=1,colspan=1),data=cit_details_by_call_type[cit_details_by_call_type.cit_officer_requested=="Y"].pivot_table(index="cit_officer_dispatched",columns="cit_officer_arrived",values="template_id",aggfunc="count",fill_value=0),annot=True,fmt="g").set(xlabel="CIT Officer Arrived",ylabel="CIT Officer Dispatched",title="CIT Officer was Requested");
sns.heatmap(ax=plt.subplot2grid((4,2),(3,1),rowspan=1,colspan=1),data=cit_details_by_call_type[cit_details_by_call_type.cit_officer_requested=="N"].pivot_table(index="cit_officer_dispatched",columns="cit_officer_arrived",values="template_id",aggfunc="count",fill_value=0),annot=True,fmt="g").set(xlabel="CIT Officer Arrived",ylabel="CIT Officer Dispatched",title="CIT Officer was NOT Requested");


**Summary**

* 2018 for Seattle saw more crises reported than the year before.
* Northern and Western precincts had the highest reports of crisis overall, although one of the Eastern sectors had the highest count in 2018
* These crises are reported during the day, with the first peak at 10am-12pm, then rising to the second peak at 5pm-7pm. This leads me to believe that work, while not a direct influence, is in some way contributing to these crises
* Considering initial reports, over 33% are either Suicide related or Emotional/Behavioral crises, and almost all of the suicide related ones have General complaints as their final call type. Does this mean that most were false alarms or worst case scenarios? I'm not sure about that
* Only 5% of the reports in 2018 requested that a CIT officer be dispatched. This didn't make much of a difference as a CIT officer was dispatched regardless of the request. I like to think of this as a good thing considering how many reports were either Suicide related or individuals in Emotional/Behavioral crisis.

**Final thoughts on this notebook**

To you, The Reader, thank you for getting this far. The data might seem all over the place, but I'm pretty happy with where this notebook got me. I consider this an exercise in looking at the bigger picture rather than at just one metric. I don't know how Data/Business Analysts tell stories so well on such large data; then again, I didn't know what I was up against when I started this notebook (mea culpa).

I deviated several times while preparing this notebook. First I thought I'd use the plot method of Pandas, and got lost in finding ways to visualize the data. Then I started anew and thought I'd use seaborn and visualize some data. I lost my way again, but this time got caught up on suicide and mental wellness; so much so that I was on the verge of concluding that a lot of people in Seattle were depressed. I took another step back, used a mix of seaborn and matplotlib, and looked at the crisis data as a whole.

While I liked using seaborn, I might not be using just that for all my visualizations. Towards the end, I made use of pivot tables, which I thought was nice. Some of the outputs returned unexpected text results, but I'm pretty sure they will be eliminated once I'm comfortable using variables and not plot off the data itself.

I can't shake the feeling that I ended the notebook abruptly. There was so much more I had expected to do (like cross referencing this data with the [WHO Suicide Statistics](https://www.kaggle.com/szamil/who-suicide-statistics) or something similar). I'll try it in another notebook, at a later time.

Comments and/or criticisms on the notebook? Any visualizations that I wrongly used or could be used better? Any best practices or general advice on how I can make these better in the future? 