# Introduction

Since Jan. 1, 2015, [The Washington Post](https://www.washingtonpost.com/) has been compiling a database of every fatal shooting in the US by a police officer in the line of duty. 

<center><img src=https://i.imgur.com/sX3K62b.png></center>

While there are many challenges regarding data collection and reporting, The Washington Post has been tracking more than a dozen details about each killing. These include the race, age and gender of the deceased, whether the person was armed, and whether the victim was experiencing a mental-health crisis. The Washington Post has gathered this supplemental information from law enforcement websites, local news reports, social media, and by monitoring independent databases such as "Killed by police" and "Fatal Encounters". The Post has also conducted additional reporting in many cases.

There are 4 additional datasets: US census data on poverty rate, high school graduation rate, median household income, and racial demographics. [Source of census data](https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml).

## Import Statements

In [137]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# This might be helpful:
from collections import Counter

## Notebook Presentation

In [138]:
pd.options.display.float_format = '{:,.2f}'.format

## Load the Data

In [139]:
df_hh_income = pd.read_csv('Median_Household_Income_2015.csv', encoding="windows-1252")
df_pct_poverty = pd.read_csv('Pct_People_Below_Poverty_Level.csv', encoding="windows-1252")
df_pct_completed_hs = pd.read_csv('Pct_Over_25_Completed_High_School.csv', encoding="windows-1252")
df_share_race_city = pd.read_csv('Share_of_Race_By_City.csv', encoding="windows-1252")
df_fatalities = pd.read_csv('Deaths_by_Police_US.csv', encoding="windows-1252")

# Preliminary Data Exploration

* What is the shape of the DataFrames? 
* How many rows and columns do they have?
* What are the column names?
* Are there any NaN values or duplicates?

In [140]:
print('Household income shape:', df_hh_income.shape)
print('Poverty share shape:', df_pct_poverty.shape)
print('High school graduation shape:', df_pct_completed_hs.shape)
print('Racial distribution shape:', df_share_race_city.shape)
print('Fatalities shape:', df_fatalities.shape)

Household income shape: (29322, 3)
Poverty share shape: (29329, 3)
High school graduation shape: (29329, 3)
Racial distribution shape: (29268, 7)
Fatalities shape: (2535, 14)


In [141]:
print('Household income columns:', df_hh_income.columns)
print('Poverty share columns:', df_pct_poverty.columns)
print('High school graduation columns:', df_pct_completed_hs.columns)
print('Racial distribution columns:', df_share_race_city.columns)
print('Fatalities columns:', df_fatalities.columns)

Household income columns: Index(['Geographic Area', 'City', 'Median Income'], dtype='object')
Poverty share columns: Index(['Geographic Area', 'City', 'poverty_rate'], dtype='object')
High school graduation columns: Index(['Geographic Area', 'City', 'percent_completed_hs'], dtype='object')
Racial distribution columns: Index(['Geographic area', 'City', 'share_white', 'share_black',
       'share_native_american', 'share_asian', 'share_hispanic'],
      dtype='object')
Fatalities columns: Index(['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender',
       'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level',
       'flee', 'body_camera'],
      dtype='object')


In [142]:
print('Household income NA values:', df_hh_income.isna().sum())
print('\nPoverty share NA values:', df_pct_poverty.isna().sum())
print('\nHigh school graduation NA values:', df_pct_completed_hs.isna().sum())
print('\nRacial distribution NA values:', df_share_race_city.isna().sum())
print('\nFatalities NA values:', df_fatalities.isna().sum())

Household income NA values: Geographic Area     0
City                0
Median Income      51
dtype: int64

Poverty share NA values: Geographic Area    0
City               0
poverty_rate       0
dtype: int64

High school graduation NA values: Geographic Area         0
City                    0
percent_completed_hs    0
dtype: int64

Racial distribution NA values: Geographic area          0
City                     0
share_white              0
share_black              0
share_native_american    0
share_asian              0
share_hispanic           0
dtype: int64

Fatalities NA values: id                           0
name                         0
date                         0
manner_of_death              0
armed                        9
age                         77
gender                       0
race                       195
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        65
body_

In [143]:
print('Household income duplicates:', df_hh_income.duplicated().sum())
print('Poverty share duplicates:', df_pct_poverty.duplicated().sum())
print('High school graduation duplicates:', df_pct_completed_hs.duplicated().sum())
print('Racial distribution duplicates:', df_share_race_city.duplicated().sum())
print('Fatalities duplicates:', df_fatalities.duplicated().sum())

Household income duplicates: 0
Poverty share duplicates: 0
High school graduation duplicates: 0
Racial distribution duplicates: 0
Fatalities duplicates: 0


## Data Cleaning - Check for Missing Values and Duplicates

Consider how to deal with the NaN values. Perhaps substituting 0 is appropriate. 

In [144]:
# The City column carries a place-type suffix ('CDP', 'city', 'town', ...),
# so it will not match the bare city names in df_fatalities as-is
df_hh_income.head(10)

Unnamed: 0,Geographic Area,City,Median Income
0,AL,Abanda CDP,11207
1,AL,Abbeville city,25615
2,AL,Adamsville city,42575
3,AL,Addison town,37083
4,AL,Akron town,21667
5,AL,Alabaster city,71816
6,AL,Albertville city,32911
7,AL,Alexander City city,29874
8,AL,Alexandria CDP,56058
9,AL,Aliceville city,21131


In [145]:
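# Append a throwaway token to 'Carson City' so the name is preserved in the next step.
# Assigning to the row yielded by iterrows() only changes a copy, so use .loc instead.
is_carson_city = df_hh_income['City'] == 'Carson City'
df_hh_income.loc[is_carson_city, 'City'] = 'Carson City drop_this_string'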

# Clean City column
#   Note: There are edge cases in the column that still need to be cleaned; however, I ignore these since they are dropped after the merge
trimmed_city = [' '.join(city.split(' ')[:-1]) for city in df_hh_income['City']]
df_hh_income['Trimmed City'] = trimmed_city
df_hh_income.head()

Unnamed: 0,Geographic Area,City,Median Income,Trimmed City
0,AL,Abanda CDP,11207,Abanda
1,AL,Abbeville city,25615,Abbeville
2,AL,Adamsville city,42575,Adamsville
3,AL,Addison town,37083,Addison
4,AL,Akron town,21667,Akron


In [146]:
# Merge fatalities and income datasets
df = df_fatalities.merge(
    df_hh_income,
    how='left',
    left_on=['city', 'state'],
    right_on=['Trimmed City', 'Geographic Area'],
)
df = df.drop(columns=['id', 'Geographic Area', 'City', 'Trimmed City'])

In [147]:
# Convert column of strings (with NaN and '(X)' values) to numbers.
# Anything that cannot be parsed becomes NaN.
df['Median Income'] = pd.to_numeric(df['Median Income'], errors='coerce')

# Impute missing values with median income
median_income = df['Median Income'].median()
df['Median Income'] = df['Median Income'].fillna(median_income)

# While at it, fill NaN ages with median value
df['age'] = df['age'].fillna(df['age'].median())

# Drop all remaining NaN values in categorical columns
df = df.dropna()
df

Unnamed: 0,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,Median Income
0,Tim Elliot,02/01/15,shot,gun,53.00,M,A,Shelton,WA,True,attack,Not fleeing,False,37072.00
1,Lewis Lee Lembke,02/01/15,shot,gun,47.00,M,W,Aloha,OR,False,attack,Not fleeing,False,65765.00
2,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.00,M,H,Wichita,KS,False,other,Not fleeing,False,45947.00
3,Matthew Hoffman,04/01/15,shot,toy weapon,32.00,M,W,San Francisco,CA,True,attack,Not fleeing,False,81294.00
4,Michael Rodriguez,04/01/15,shot,nail gun,39.00,M,H,Evans,CO,False,attack,Not fleeing,False,47791.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2530,Kesharn K. Burney,26/07/17,shot,vehicle,25.00,M,B,Dayton,OH,False,attack,Car,False,27683.00
2532,Deltra Henderson,27/07/17,shot,gun,39.00,M,B,Homer,LA,False,attack,Car,False,27050.00
2535,Alejandro Alvarado,27/07/17,shot,knife,34.00,M,H,Chowchilla,CA,False,attack,Not fleeing,False,34559.00
2540,Isaiah Tucker,31/07/17,shot,vehicle,28.00,M,B,Oshkosh,WI,False,attack,Car,True,42650.00


# Chart the Poverty Rate in each US State

Create a bar chart that ranks the poverty rate from highest to lowest by US state. Which state has the highest poverty rate? Which state has the lowest poverty rate?
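
One possible way to approach this, sketched on a few made-up rows rather than the real file: convert `poverty_rate` to numbers, average it per state, and sort. The toy values and the helper names `poverty_by_state` and `plot_poverty_by_state` are illustrative assumptions, and the sketch assumes the column may hold non-numeric placeholders, as `Median Income` does.

```python
import pandas as pd

# Made-up rows for illustration only; the real input would be df_pct_poverty.
toy_poverty = pd.DataFrame({
    'Geographic Area': ['AL', 'AL', 'CA', 'CA', 'NJ'],
    'City': ['A city', 'B town', 'C city', 'D CDP', 'E borough'],
    'poverty_rate': ['20.0', '30.0', '10.0', '-', '8.0'],
})


def poverty_by_state(df):
    """Mean poverty rate per state, sorted from highest to lowest."""
    # Unparseable entries (assumed placeholders such as '-') become NaN
    rates = pd.to_numeric(df['poverty_rate'], errors='coerce')
    return (rates.groupby(df['Geographic Area'])
                 .mean()
                 .sort_values(ascending=False))


def plot_poverty_by_state(state_poverty):
    """Bar chart of the ranked rates (hypothetical helper, not called here)."""
    import plotly.express as px
    fig = px.bar(x=state_poverty.index, y=state_poverty.values,
                 labels={'x': 'State', 'y': 'Mean poverty rate (%)'})
    fig.show()


state_poverty = poverty_by_state(toy_poverty)
print(state_poverty)
```

Averaging city-level rates treats every place equally regardless of population, so the result approximates a state's poverty rate rather than reproducing the official figure.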

# Chart the High School Graduation Rate by US State

Show the High School Graduation Rate in ascending order of US States. Which state has the lowest high school graduation rate? Which state has the highest?

# Visualise the Relationship between Poverty Rates and High School Graduation Rates

#### Create a line chart with two y-axes to show if the ratios of poverty and high school graduation move together.  

#### Now use a Seaborn .jointplot() with a Kernel Density Estimate (KDE) and/or scatter plot to visualise the same relationship

#### Use Seaborn's `.lmplot()` or `.regplot()` to show a linear regression between the poverty ratio and the high school graduation ratio. 
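
Before plotting, it may help to quantify the relationship. The sketch below uses made-up per-state means (not real figures) to compute a Pearson correlation and a least-squares line; `state_stats` and `plot_relationship` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

# Made-up per-state means, for illustration only. In the notebook these would
# come from grouping df_pct_poverty and df_pct_completed_hs by state.
state_stats = pd.DataFrame(
    {
        'poverty_rate': [10.0, 15.0, 20.0, 25.0],
        'percent_completed_hs': [90.0, 86.0, 84.0, 80.0],
    },
    index=['S1', 'S2', 'S3', 'S4'],
)

# Pearson correlation between the two columns
r = state_stats['poverty_rate'].corr(state_stats['percent_completed_hs'])

# Least-squares line: percent_completed_hs = slope * poverty_rate + intercept
slope, intercept = np.polyfit(
    state_stats['poverty_rate'], state_stats['percent_completed_hs'], deg=1
)
print(f'r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.1f}')


def plot_relationship(stats):
    """Two-axis line chart plus a Seaborn regression plot (not called here)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ordered = stats.sort_values('poverty_rate')
    fig, ax_left = plt.subplots(figsize=(12, 5))
    ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis
    ax_left.plot(ordered.index, ordered['poverty_rate'], color='crimson')
    ax_right.plot(ordered.index, ordered['percent_completed_hs'], color='navy')
    ax_left.set_ylabel('Poverty rate (%)')
    ax_right.set_ylabel('Completed high school (%)')

    plt.figure()
    sns.regplot(data=stats, x='poverty_rate', y='percent_completed_hs')
    plt.show()
```

A strongly negative `r` on the real data would suggest that states with higher poverty tend to have lower graduation rates, though correlation alone says nothing about cause.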

# Create a Bar Chart with Subsections Showing the Racial Makeup of Each US State

Visualise the share of the White, Black, Hispanic, Asian and Native American population in each US state using a bar chart with subsections. 
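
A rough sketch of the data preparation, assuming the share columns are stored as strings that may contain placeholders such as `'(X)'`. The rows are invented for illustration and `plot_stacked_shares` is a hypothetical helper.

```python
import pandas as pd

SHARE_COLS = ['share_white', 'share_black', 'share_native_american',
              'share_asian', 'share_hispanic']

# Made-up rows for illustration; the real input would be df_share_race_city.
toy_shares = pd.DataFrame({
    'Geographic area': ['AL', 'AL', 'CA'],
    'City': ['A city', 'B town', 'C city'],
    'share_white': ['60.0', '80.0', '50.0'],
    'share_black': ['30.0', '10.0', '10.0'],
    'share_native_american': ['1.0', '1.0', '2.0'],
    'share_asian': ['2.0', '4.0', '15.0'],
    'share_hispanic': ['5.0', '(X)', '30.0'],
})

# Convert every share column to numbers, then average per state
numeric = toy_shares[SHARE_COLS].apply(pd.to_numeric, errors='coerce')
state_shares = numeric.groupby(toy_shares['Geographic area']).mean()
print(state_shares)


def plot_stacked_shares(shares):
    """Stacked bar chart, one bar per state (not called here)."""
    import matplotlib.pyplot as plt
    ax = shares.plot(kind='bar', stacked=True, figsize=(14, 6))
    ax.set_ylabel('Mean share of population (%)')
    plt.show()
```

The stacked bars may not sum to exactly 100, since the columns are averaged across cities and the categories are not guaranteed to be mutually exclusive.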

# Create a Donut Chart of People Killed by Race

Hint: Use `.value_counts()`
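
Following the hint, a minimal sketch: count the race codes, then pass the counts to a pie chart with a hole in the middle. The values in `toy_race` are placeholders, and `plot_race_donut` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder values for illustration; the real input would be df_fatalities['race'].
toy_race = pd.Series(['W', 'W', 'B', 'H', 'W', 'B', None], name='race')

# value_counts() ignores missing values by default
race_counts = toy_race.value_counts()
print(race_counts)


def plot_race_donut(counts):
    """Donut chart of counts per race code (not called here)."""
    import plotly.express as px
    # hole > 0 turns the pie chart into a donut
    fig = px.pie(names=counts.index, values=counts.values, hole=0.5)
    fig.show()
```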

# Create a Chart Comparing the Total Number of Deaths of Men and Women

Use `df_fatalities` to illustrate how many more men are killed compared to women. 

# Create a Box Plot Showing the Age and Manner of Death

Break out the data by gender using `df_fatalities`. Is there a difference between men and women in the manner of death? 

# Were People Armed? 

In what percentage of police killings were people armed? Create a chart that shows what kind of weapon (if any) the deceased was carrying. How many of the people killed by police were armed with guns versus unarmed? 
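
One way to frame the calculation, as a sketch on invented values. It assumes `'unarmed'` marks unarmed people and that an `'undetermined'` label, if present, should be left out along with missing values. Whether entries such as `'toy weapon'` count as armed is a judgment call; this sketch counts them.

```python
import pandas as pd

# Invented values for illustration; the real input would be df_fatalities['armed'].
toy_armed = pd.Series(['gun', 'knife', 'unarmed', 'gun', 'toy weapon',
                       'undetermined', None, 'gun'], name='armed')

# Keep only rows where the armed status is actually known
known = toy_armed.dropna()
known = known[known != 'undetermined']

# Share of known cases where the person carried something other than nothing
pct_armed = (known != 'unarmed').mean() * 100

# Counts per weapon type, which can feed a bar chart directly
weapon_counts = known.value_counts()
guns_vs_unarmed = weapon_counts.reindex(['gun', 'unarmed'], fill_value=0)

print(f'{pct_armed:.1f}% armed')
print(guns_vs_unarmed)
```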

# How Old Were the People Killed?

Work out what percentage of people killed were under 25 years old.  

Create a histogram and KDE plot that shows the distribution of ages of the people killed by police. 

Create a separate KDE plot for each race. Is there a difference between the distributions? 
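
The three steps above could be sketched as follows, using invented ages. The percentage is computed on known ages only, because filling missing ages with the median beforehand would shift the result. `plot_age_distributions` is a hypothetical helper.

```python
import pandas as pd

# Invented rows for illustration; the real input would be df_fatalities.
toy_people = pd.DataFrame({
    'age': [17, 24, 25, 31, 40, None],
    'race': ['B', 'W', 'H', 'W', 'B', 'W'],
})

known_ages = toy_people['age'].dropna()
pct_under_25 = (known_ages < 25).mean() * 100
print(f'{pct_under_25:.1f}% were under 25')


def plot_age_distributions(df):
    """Histogram with KDE overlay, then one KDE curve per race (not called here)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.histplot(data=df, x='age', kde=True)
    plt.figure()
    # common_norm=False scales each race's curve independently,
    # so small groups are not flattened by large ones
    sns.kdeplot(data=df, x='age', hue='race', common_norm=False)
    plt.show()
```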

# Race of People Killed

Create a chart that shows the total number of people killed by race. 

# Mental Illness and Police Killings

What percentage of people killed by police showed signs of mental illness, according to the `signs_of_mental_illness` column?
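
Because `signs_of_mental_illness` holds `True`/`False` values in the table shown earlier, its mean is the fraction of `True` entries. A short sketch on placeholder values:

```python
import pandas as pd

# Placeholder flags for illustration; the real input would be
# df_fatalities['signs_of_mental_illness'].
toy_flags = pd.Series([True, False, False, True, False],
                      name='signs_of_mental_illness')

# The mean of a boolean column is the share of True values
pct_mental_illness = toy_flags.mean() * 100
print(f'{pct_mental_illness:.1f}% showed signs of mental illness')
```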

# In Which Cities Do the Most Police Killings Take Place?

Create a chart ranking the top 10 cities with the most police killings. Which cities are the most dangerous?  
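
A sketch of the ranking step on invented rows. It groups by city and state together, on the assumption that the same city name can appear in more than one state and would otherwise be merged into one count.

```python
import pandas as pd

# Invented rows for illustration; the real input would be df_fatalities.
toy_locations = pd.DataFrame({
    'city': ['Columbus', 'Columbus', 'Columbus', 'Houston', 'Houston'],
    'state': ['OH', 'OH', 'GA', 'TX', 'TX'],
})

# Count killings per (city, state) pair and keep the ten largest
top_cities = (toy_locations.groupby(['city', 'state'])
                           .size()
                           .sort_values(ascending=False)
                           .head(10))
print(top_cities)
```

Raw counts mostly reflect city size, so a high rank here does not by itself show that a city is more dangerous per resident.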

# Rate of Death by Race

Find the share of each race in the top 10 cities. Contrast this with the top 10 cities of police killings to work out the rate at which people are killed by race for each city. 
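
One possible reading of this task is to compare each race's share of killings in a city with its share of that city's population. The sketch below does this on invented numbers. The mapping `RACE_TO_SHARE` is an assumption about how the race codes line up with the census columns, and the resulting ratio measures over- or under-representation, not a per-capita death rate.

```python
import pandas as pd

# Assumed mapping from race codes in df_fatalities to census share columns
RACE_TO_SHARE = {'W': 'share_white', 'B': 'share_black',
                 'H': 'share_hispanic', 'A': 'share_asian'}

# Invented data for two hypothetical cities
toy_killings = pd.DataFrame({
    'city': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'race': ['W', 'B', 'B', 'B', 'W', 'H'],
})
toy_population = pd.DataFrame(
    {'share_white': [50.0, 60.0], 'share_black': [25.0, 10.0],
     'share_hispanic': [20.0, 25.0], 'share_asian': [5.0, 5.0]},
    index=['X', 'Y'],
)

# Percentage of each city's killings by race (rows sum to 100)
killed_share = pd.crosstab(toy_killings['city'], toy_killings['race'],
                           normalize='index') * 100
killed_share = killed_share.rename(columns=RACE_TO_SHARE)

# Ratio > 1: the group's share of killings exceeds its share of the population
representation = killed_share / toy_population[killed_share.columns]
print(representation)
```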

# Create a Choropleth Map of Police Killings by US State

Which states are the most dangerous? Compare your map with your previous chart. Are these the same states with high degrees of poverty? 
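
A minimal sketch of the map, assuming the `state` column holds two-letter postal codes as in the table shown earlier. The counts come from invented rows, and `plot_state_choropleth` is a hypothetical helper.

```python
import pandas as pd

# Invented rows for illustration; the real input would be df_fatalities['state'].
toy_states = pd.Series(['CA', 'CA', 'TX', 'CA', 'TX', 'WA'], name='state')

# Number of killings per state
state_counts = toy_states.value_counts()
print(state_counts)


def plot_state_choropleth(counts):
    """US choropleth coloured by count per state (not called here)."""
    import plotly.express as px
    fig = px.choropleth(
        locations=counts.index,        # two-letter state codes
        color=counts.values,
        locationmode='USA-states',     # interpret locations as US state codes
        scope='usa',
        labels={'color': 'Police killings'},
    )
    fig.show()
```

Raw counts favour populous states. A per-capita comparison would need state population figures, which are not among the datasets loaded here.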

# Number of Police Killings Over Time

Analyse the Number of Police Killings over Time. Is there a trend in the data? 
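
The `date` column appears to be in day/month/year order (for example `26/07/17`), so it is worth parsing it with an explicit format before counting. A sketch on a handful of dates copied from the table shown earlier:

```python
import pandas as pd

# A few dates in the same dd/mm/yy layout as df_fatalities['date']
toy_dates = pd.Series(['02/01/15', '04/01/15', '26/07/17', '27/07/17', '31/07/17'])

# An explicit format avoids day and month being swapped silently
parsed = pd.to_datetime(toy_dates, format='%d/%m/%y')

# Count killings per calendar month, in chronological order
monthly = parsed.dt.to_period('M').value_counts().sort_index()
monthly_dict = {str(month): int(count) for month, count in monthly.items()}
print(monthly_dict)
```

The monthly series can then be drawn as a line chart; a rolling mean over several months may make any trend easier to see.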

# Epilogue

Now that you have analysed the data yourself, read [The Washington Post's analysis here](https://www.washingtonpost.com/graphics/investigations/police-shootings-database/).