## Hunting for Insights / Exploration - Policing Equity

The Center for Policing Equity (CPE) is research scientists, race and equity experts, data virtuosos, and community trainers working together to build more fair and just systems. Police departments across the United States have joined our National Justice Database, the first and largest collection of standardized police behavioral data. 

## Data Science for Good : Problem Statement 

How do you measure justice? And how do you solve the problem of racism in policing? We look for factors that drive racial disparities in policing by analyzing census and police department deployment data. The ultimate goal is to inform police agencies where they can make improvements by identifying deployment areas where racial disparities exist and are not explainable by crime rates and poverty levels. The biggest challenge is automating the combination of police data, census-level data, and other socioeconomic factors. 

## About this Kernel : 

This kernel aims to unearth the hidden insights from the data shared. This kernel will be updated regularly. 

Let's load the required libraries first.

In [None]:
import numpy as np 
import pandas as pd 
import folium
from folium import plugins
from io import StringIO
import geopandas as gpd
from pprint import pprint 
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is unused in this kernel, so it is not imported
from plotly import tools
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import seaborn as sns
import os 
init_notebook_mode(connected=True)

### About Dataset : Department Files

The dataset consists of separate data files for each police department. Let's quickly look at the department names.

In [None]:
depts = [f for f in os.listdir("../input/cpe-data/") if f.startswith("Dept")]
pprint(depts)
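To see what each department folder holds before opening individual files, a small helper can list the contents of every folder in one pass. This is a minimal sketch: `summarize_departments` is a hypothetical helper name, and it assumes only the `Dept` folder prefix used in the cell above.

```python
import os

def summarize_departments(root):
    """Map each Dept* folder under `root` to a sorted list of its entries."""
    summary = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Keep only department folders, mirroring the startswith("Dept") filter above.
        if name.startswith("Dept") and os.path.isdir(path):
            summary[name] = sorted(os.listdir(path))
    return summary

root = "../input/cpe-data/"
# Guard so the sketch does nothing when the Kaggle input folder is absent.
if os.path.isdir(root):
    for dept, entries in summarize_departments(root).items():
        print(dept, "->", entries)
```

Listing the second level up front makes it easier to spot departments that lack a shapefile or ACS folder.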

### About Dataset : Different Data Files for Police Departments

Each department folder contains several data files covering different metrics, such as education, race, and poverty. Let's have a look.

In [None]:
files = os.listdir("../input/cpe-data/Dept_23-00089/23-00089_ACS_data/")
files

Now let's start exploring these data files.

### Exploration 1 - Department : Dept_23-00089 | Metric : Race, Sex, Age

Let's load the dataset.

In [None]:
basepath = "../input/cpe-data/Dept_23-00089/23-00089_ACS_data/23-00089_ACS_race-sex-age/"
rca_df = pd.read_csv(basepath + "ACS_15_5YR_DP05_with_ann.csv")
rca_df.head()

The meaning of each column is given in a separate metadata file. Here is the description of the columns used in the above dataset.

In [None]:
a_df = pd.read_csv(basepath + "ACS_15_5YR_DP05_metadata.csv")

# for j, y in a_df.iterrows():
#     if y['Id'].startswith("Estimate"):
#         print (y['GEO.id'], y['Id'])

a_df.head()

So there are columns for Estimate, Margin of Error, and Percent, covering Sex, Age, Race, and Total Population. Let's start exploring these variables.
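Because the metadata file maps each column code to a description, it can be turned into a lookup table and filtered by measure type. The sketch below assumes only what the cells above show: the metadata has a `GEO.id` column holding the column codes and an `Id` column holding the descriptions, and each description starts with the measure name (for example `Estimate;`). `describe_columns` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from io import StringIO
import pandas as pd

def describe_columns(meta_df, measure="Estimate"):
    """Return {column code: description} for descriptions starting with `measure`."""
    mask = meta_df["Id"].astype(str).str.startswith(measure)
    return dict(zip(meta_df.loc[mask, "GEO.id"], meta_df.loc[mask, "Id"]))

# Tiny made-up metadata sample in the same two-column layout.
sample = StringIO(
    "GEO.id,Id\n"
    "GEO.id2,Id2\n"
    "HC01_VC03,Estimate; SEX AND AGE - Total population\n"
    "HC02_VC03,Margin of Error; SEX AND AGE - Total population\n"
    "HC03_VC03,Percent; SEX AND AGE - Total population\n"
)
meta = pd.read_csv(sample)
estimates = describe_columns(meta)
print(estimates)
```

Calling `describe_columns(a_df, "Percent")` on the real metadata would, under the same assumptions, list only the percentage columns.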

### Distribution of Total Population across Census Tracts

<br>

**Census Tracts:** 
Census tracts (CTs) are small, relatively stable geographic subdivisions of a county. In the United States they generally have a population between 1,200 and 8,000 people, with an optimum size of about 4,000.



In [None]:
total_population = rca_df["HC01_VC03"][1:]

trace = go.Histogram(x=total_population, marker=dict(color='orange', opacity=0.6))
layout = dict(title="Total Population Distribution - Across the Census Tracts", margin=dict(l=200), width=800, height=400)
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

male_pop = rca_df["HC01_VC04"][1:]
female_pop = rca_df["HC01_VC05"][1:]
# HC01 columns are estimates (counts), not percentages, so the traces are labelled accordingly
trace1 = go.Histogram(x=male_pop, name="male", marker=dict(color='blue', opacity=0.6))
trace2 = go.Histogram(x=female_pop, name="female", marker=dict(color='pink', opacity=0.6))
layout = dict(title="Population Distribution Breakdown - Across the Census Tracts", margin=dict(l=200), width=800, height=400)
data = [trace1, trace2]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

So about 50 census tracts have a population of around 3,000 to 4,000, and one census tract has a very high population. The female percentage is higher in only two of the census tracts.
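Readings taken from overlaid histograms can be cross-checked numerically. The sketch below assumes, as the cells above do, that the first data row of the ACS file is a descriptive header row that must be skipped and that the remaining values parse as numbers. `count_female_majority` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd

def count_female_majority(df, male_col="HC01_VC04", female_col="HC01_VC05"):
    """Count tracts where the female estimate exceeds the male estimate."""
    body = df.iloc[1:]  # skip the descriptive header row
    male = pd.to_numeric(body[male_col], errors="coerce")
    female = pd.to_numeric(body[female_col], errors="coerce")
    # Comparisons involving unparseable values (NaN) evaluate to False.
    return int((female > male).sum())

sample = pd.DataFrame({
    "HC01_VC04": ["Estimate; SEX AND AGE - Male", "1500", "2100", "900"],
    "HC01_VC05": ["Estimate; SEX AND AGE - Female", "1700", "2000", "950"],
})
print(count_female_majority(sample))
```

Running `count_female_majority(rca_df)` would give the corresponding count for the real data.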

### Distribution of Age Groups

Let's plot the population count of each age group by census tract.

In [None]:
age_cols = []
names = []
# Age-group estimates live in columns HC01_VC08 through HC01_VC20.
for i in range(8, 21):
    relcol = "HC01_VC" + str(i).zfill(2)
    age_cols.append(relcol)
    # Look up the readable label for this column in the metadata.
    name = a_df[a_df["GEO.id"] == relcol]["Id"].iloc[0].replace("Estimate; SEX AND AGE - ", "")
    names.append(name)

rca_df['GEO.display-label_cln'] = rca_df["GEO.display-label"].apply(lambda x : x.replace(", Marion County, Indiana", "").replace("Census Tract ", "CT: "))

traces = []
for i,agecol in enumerate(age_cols):
    x = rca_df["GEO.display-label_cln"][1:]
    y = rca_df[agecol][1:]
    trace = go.Bar(y=y, x=x, name=names[i])
    traces.append(trace)

tmp = pd.DataFrame()
vals = []
Geo = []
Col = []
for i,age_col in enumerate(age_cols):
    Geo += list(rca_df["GEO.display-label_cln"][1:].values)
    Col += list([names[i]]*len(rca_df[1:]))
    vals += list(rca_df[age_col][1:].values)

tmp['Geo'] = Geo
tmp['Col'] = Col
tmp['Val'] = vals
tmp['Val'] = tmp['Val'].astype(int)  * 0.01

data = [go.Scatter(x = tmp["Geo"], y = tmp["Col"], mode="markers", marker=dict(size=list(tmp["Val"].values)))]
layout = dict(title="Age Distribution by Census Tract - Marion County, Indiana", legend=dict(x=-0.1, y=1, orientation="h"), 
              margin=dict(l=150, b=100), height=600, barmode="stack")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

The above plot shows which age groups are concentrated in which census tracts. Let's look at another view of the age group distributions.
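The bubble chart above scales marker sizes with a fixed factor of 0.01. An alternative, sketched here, is the area-based sizing rule suggested in Plotly's bubble chart documentation, where `sizeref = 2 * max(size) / (desired maximum marker size ** 2)` is used together with `sizemode="area"`. `bubble_sizeref` is a hypothetical helper, and the sample sizes and the 40 pixel maximum are made-up values.

```python
def bubble_sizeref(sizes, max_marker_px=40):
    """Compute a sizeref so the largest bubble is about max_marker_px wide."""
    largest = max(sizes)
    return 2.0 * largest / (max_marker_px ** 2)

sizes = [1200, 3400, 8000]  # made-up population counts
sizeref = bubble_sizeref(sizes)

# This dict could be passed as the `marker` argument of go.Scatter.
marker = dict(size=sizes, sizemode="area", sizeref=sizeref, sizemin=2)
print(marker)
```

With this rule the bubble sizes adapt to whichever column is plotted, so the scaling constant does not have to be retuned by hand.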

In [None]:
trace1 = go.Histogram(x = rca_df["HC01_VC26"][1:], name="18+", marker=dict(opacity=0.4)) 
trace2 = go.Histogram(x = rca_df["HC01_VC27"][1:], name="21+", marker=dict(opacity=0.3)) 
trace3 = go.Histogram(x = rca_df["HC01_VC28"][1:], name="62+", marker=dict(opacity=0.4)) 
trace4 = go.Histogram(x = rca_df["HC01_VC29"][1:], name="65+", marker=dict(opacity=0.3)) 

titles = ["Age : 18+","Age : 21+","Age : 62+","Age : 65+",]
fig = tools.make_subplots(rows=2, cols=2, print_grid=False, subplot_titles=titles)
fig.append_trace(trace1, 1, 1);
fig.append_trace(trace2, 1, 2);
fig.append_trace(trace3, 2, 1);
fig.append_trace(trace4, 2, 2);
fig['layout'].update(height=600, title="Distribution of Age across the Census Tracts", showlegend=False);
iplot(fig, filename='simple-subplot');

Let's plot the population distribution by race. First, let's consider only the single-race variables.

In [None]:
single_race_df = rca_df[["HC01_VC49", "HC01_VC50", "HC01_VC51", "HC01_VC56", "HC01_VC64", "HC01_VC69"]][1:]
ops = [1, 0.85, 0.75, 0.65, 0.55, 0.45]
traces = []
for i, col in enumerate(single_race_df.columns):
    nm = a_df[a_df["GEO.id"] == col]["Id"].iloc[0].replace("Estimate; RACE - One race - ", "")
    trace = go.Bar(x=rca_df["GEO.display-label_cln"][1:], y=single_race_df[col], name=nm, marker=dict(opacity=0.6))
    traces.append(trace)
layout = dict(barmode="stack", title="Population Breakdown by Race (Single)", margin=dict(b=100), height=600, legend=dict(x=-0.1, y=1, orientation="h"))
fig = go.Figure(data=traces, layout=layout)
iplot(fig)

We can see that the population is predominantly White or Black or African American. It will be interesting to see which of the other races are most prominent. Let's remove the White and Black or African American populations and plot again.

In [None]:
traces = []
for i, col in enumerate(single_race_df.columns):
    nm = a_df[a_df["GEO.id"] == col]["Id"].iloc[0].replace("Estimate; RACE - One race - ", "")
    if nm in ["White", "Black or African American"]:
        continue
    trace = go.Bar(x=rca_df["GEO.display-label_cln"][1:], y=single_race_df[col], name=nm, marker=dict(opacity=0.6))
    traces.append(trace)
layout = dict(barmode="stack", title="Population Breakdown by Race (Single, excluding White and Black or African American)", margin=dict(b=100), height=400, legend=dict(x=-0.1, y=1, orientation="h"))
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
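Raw counts make smaller groups hard to compare across tracts of different sizes, so a per-tract share can be easier to read. This is a minimal sketch, assuming the selected columns hold numeric strings as in `single_race_df`. `to_row_shares` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def to_row_shares(counts_df):
    """Convert each row of counts into percentages of that row's total."""
    counts = counts_df.apply(pd.to_numeric, errors="coerce").fillna(0)
    totals = counts.sum(axis=1)
    # Rows with a zero total are mapped to 0 instead of dividing by zero.
    return counts.div(totals.where(totals != 0), axis=0).fillna(0) * 100

sample = pd.DataFrame({"HC01_VC49": ["300", "0"], "HC01_VC50": ["100", "0"]})
shares = to_row_shares(sample)
print(shares)
```

Passing the resulting frame to the same stacked bar code would show each tract as a bar of equal height split by share.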


Let's also look at the other files (incomplete exploration).


### Department : Dept_23-00089 | Metric : Poverty


In [None]:
basepath2 = "../input/cpe-data/Dept_23-00089/23-00089_ACS_data/23-00089_ACS_poverty/"
a_df = pd.read_csv(basepath2 + "ACS_16_5YR_S1701_metadata.csv")
a_df.T.head()

In [None]:
pov_df = pd.read_csv(basepath2 + "ACS_16_5YR_S1701_with_ann.csv")
pov_df.head()
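The problem statement calls for combining census tables, so a natural next step is joining the poverty table to the race, sex, and age table by tract. The sketch below is hedged: it assumes both tables carry a shared `GEO.id` tract identifier and a descriptive first row, `join_acs_tables` is a hypothetical helper, and `POV_RATE` is a made-up column name standing in for whichever poverty column is chosen. Note that the file names indicate different ACS vintages (`ACS_15` and `ACS_16`), which should be kept in mind when comparing values.

```python
import pandas as pd

def join_acs_tables(left, right, key="GEO.id", suffixes=("_race", "_pov")):
    """Inner-join two ACS tables on a tract identifier, skipping the header rows."""
    return left.iloc[1:].merge(right.iloc[1:], on=key, how="inner", suffixes=suffixes)

# Made-up samples; the first row of each mimics the descriptive header row.
race = pd.DataFrame({"GEO.id": ["Id", "T1", "T2"], "HC01_VC03": ["Total", "3200", "4100"]})
pov = pd.DataFrame({"GEO.id": ["Id", "T2", "T3"], "POV_RATE": ["Rate", "18.5", "9.1"]})

joined = join_acs_tables(race, pov)
print(joined)
```

An inner join keeps only tracts present in both tables; switching `how` to `"outer"` inside the helper would instead reveal tracts missing from either side.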

### Department : Dept_23-00089 | Metric : Owner Occupied Housing


In [None]:
basepath = "../input/cpe-data/Dept_23-00089/23-00089_ACS_data/23-00089_ACS_owner-occupied-housing/"
a_df = pd.read_csv(basepath + "ACS_16_5YR_S2502_metadata.csv")
a_df.T.head()

In [None]:
a_df = pd.read_csv(basepath + "ACS_16_5YR_S2502_with_ann.csv")
a_df.T.head()

### Department : Dept_23-00089 | Metric : Education


In [None]:
basepath = "../input/cpe-data/Dept_23-00089/23-00089_ACS_data/23-00089_ACS_education-attainment/"
a_df = pd.read_csv(basepath + "ACS_16_5YR_S1501_metadata.csv")
a_df.T.head()

In [None]:
a_df = pd.read_csv(basepath + "ACS_16_5YR_S1501_with_ann.csv")
a_df.T.head()

Some of the other files are shapefiles. Let's plot them as well.

### Indianapolis Police Zones

Let's plot the shapefile and its related data.

In [None]:
p1 = """../input/cpe-data/Dept_23-00089/23-00089_Shapefiles/Indianapolis_Police_Zones.shp"""
One = gpd.read_file(p1)  
One.head()

In [None]:
mapa = folium.Map([39.81, -86.26060805912148], height=400, zoom_start=10, tiles='CartoDB dark_matter')
folium.GeoJson(One).add_to(mapa)
mapa 

Let's plot the districts and jurisdictions contained in this shapefile.

In [None]:
f, ax = plt.subplots(1, figsize=(10, 8))
One.plot(column="DISTRICT", ax=ax, cmap='Accent',legend=True);
plt.title("Districts : Indianapolis Police Zones")
plt.show()

In [None]:
f, ax = plt.subplots(1, figsize=(10, 8))
One.plot(column="JURISDCTN", ax=ax, cmap='Accent', legend=True);
plt.title("Jurisdiction : Indianapolis Police Zones")
plt.show()

### CMPD Police Division Offices

In [None]:
p2 = """../input/cpe-data/Dept_35-00103/35-00103_Shapefiles/CMPD_Police_Division_Offices.shp"""
One = gpd.read_file(p2) 

pmap = folium.Map(location=[35.15637558, -80.75600938], height=400, tiles='CartoDB dark_matter', zoom_start=10)
for j, rown in One.iterrows():
    # Each office is a Point geometry: x is the longitude, y is the latitude.
    point = rown["geometry"]
    folium.CircleMarker([point.y, point.x], radius=4, color='red', fill=True).add_to(pmap)
pmap

### Boston Police Districts

In [None]:
p3 = """../input/cpe-data/Dept_11-00091/11-00091_Shapefiles/boston_police_districts_f55.shp"""
One = gpd.read_file(p3)  
mapa = folium.Map([42.3, -71.0], height=400, zoom_start=10, tiles='CartoDB dark_matter')
folium.GeoJson(One).add_to(mapa)
mapa 

### Dallas Districts

In [None]:
p4 = """../input/cpe-data/Dept_37-00049/37-00049_Shapefiles/EPIC.shp"""
One = gpd.read_file(p4)  
mapa = folium.Map([32.7, -96.7], zoom_start=10, height=400, tiles='CartoDB dark_matter')
folium.GeoJson(One).add_to(mapa)
mapa 

### In Progress
