# Introduction


Since Jan. 1, 2015, [The Washington Post](https://www.washingtonpost.com/) has been compiling a database of every fatal shooting in the US by a police officer in the line of duty. 

<center><img src=https://i.imgur.com/sX3K62b.png></center>

While data collection and reporting pose many challenges, The Washington Post has been tracking more than a dozen details about each killing. These include the race, age and gender of the deceased, whether the person was armed, and whether the victim was experiencing a mental-health crisis. The Washington Post has gathered this supplemental information from law enforcement websites, local news reports, social media, and by monitoring independent databases such as "Killed by police" and "Fatal Encounters". The Post has also conducted additional reporting in many cases.

There are 4 additional datasets: US census data on poverty rate, high school graduation rate, median household income, and racial demographics. [Source of census data](https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml).

# Importing Necessary Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from collections import Counter

## Notebook Presentation

In [2]:
pd.options.display.float_format = '{:,.2f}'.format

## Load the Data

In [3]:
df_hh_income = pd.read_csv(
    'Median_Household_Income_2015.csv', encoding="windows-1252")
df_pct_poverty = pd.read_csv(
    'Pct_People_Below_Poverty_Level.csv', encoding="windows-1252")
df_pct_completed_hs = pd.read_csv(
    'Pct_Over_25_Completed_High_School.csv', encoding="windows-1252")
df_share_race_city = pd.read_csv(
    'Share_of_Race_By_City.csv', encoding="windows-1252")
df_fatalities = pd.read_csv('Deaths_by_Police_US.csv', encoding="windows-1252")
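The `encoding="windows-1252"` argument above suggests the files are not UTF-8. The following is a minimal sketch of why that argument matters, using a hypothetical one-row CSV built in memory (the city name and the poverty value are illustrative, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical one-row CSV. Encoded as Windows-1252, the "ñ" in
# "Cañon City" becomes the single byte 0xF1, which is not valid UTF-8 here.
raw = "Geographic Area,City,poverty_rate\nCO,Cañon City,12.3\n".encode("windows-1252")

# Decoding the same bytes as UTF-8 raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding to read_csv recovers the original text.
df_demo = pd.read_csv(io.BytesIO(raw), encoding="windows-1252")
print(utf8_ok, df_demo.loc[0, "City"])
```

If a file loads without errors but shows odd characters in place names, a mismatched encoding is one of the first things worth checking.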

# Preliminary Data Exploration

* What is the shape of the DataFrames? 
* How many rows and columns do they have?
* What are the column names?
* Are there any NaN values or duplicates?

In [4]:
print(f"Shape of dataframes are:")
print(
    f"\n\nHouseHold Income DataFrame (df_hh_income) shape:\n(row, columns): {df_hh_income.shape}")
print(
    f"\n\nPoverty DataFrame (df_pct_poverty) shape:\n(row, columns): {df_pct_poverty.shape}")
print(
    f"\n\nHigh School Completion DataFrame (df_pct_completed_hs) shape:\n(row, columns): {df_pct_completed_hs.shape}")
print(
    f"\n\nShare of city by Race DataFrame (df_share_race_city):\n(row, columns): {df_share_race_city.shape}")
print(
    f"\n\nDeath by Police DataFrame (df_fatalities):\n(row, columns): {df_fatalities.shape}")

Shape of dataframes are:


HouseHold Income DataFrame (df_hh_income) shape:
(row, columns): (29322, 3)


Poverty DataFrame (df_pct_poverty) shape:
(row, columns): (29329, 3)


High School Completion DataFrame (df_pct_completed_hs) shape:
(row, columns): (29329, 3)


Share of city by Race DataFrame (df_share_race_city):
(row, columns): (29268, 7)


Death by Police DataFrame (df_fatalities):
(row, columns): (2535, 14)


In [5]:
print("Column Names of DataFrames are:")
print(
    f"\n\nHouseHold Income DataFrame (df_hh_income) columns:\n {df_hh_income.columns}")
print(
    f"\n\nPoverty DataFrame (df_pct_poverty) columns:\n {df_pct_poverty.columns}")
print(
    f"\n\nHigh School Completion DataFrame (df_pct_completed_hs) columns:\n {df_pct_completed_hs.columns}")
print(
    f"\n\nShare of city by Race DataFrame (df_share_race_city):\n {df_share_race_city.columns}")
print(
    f"\n\nDeath by Police DataFrame (df_fatalities):\n {df_fatalities.columns}")

Column Names of DataFrames are:


HouseHold Income DataFrame (df_hh_income) columns:
 Index(['Geographic Area', 'City', 'Median Income'], dtype='object')


Poverty DataFrame (df_pct_poverty) columns:
 Index(['Geographic Area', 'City', 'poverty_rate'], dtype='object')


High School Completion DataFrame (df_pct_completed_hs) columns:
 Index(['Geographic Area', 'City', 'percent_completed_hs'], dtype='object')


Share of city by Race DataFrame (df_share_race_city):
 Index(['Geographic area', 'City', 'share_white', 'share_black',
       'share_native_american', 'share_asian', 'share_hispanic'],
      dtype='object')


Death by Police DataFrame (df_fatalities):
 Index(['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender',
       'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level',
       'flee', 'body_camera'],
      dtype='object')
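One detail in the output above: `df_share_race_city` names its first column `'Geographic area'` (lower-case "a"), while the other census tables use `'Geographic Area'`. If the tables are later joined on these columns, the names need to match. A minimal sketch with toy frames that copy only the column names (the values and the `*_demo` names are made up for illustration):

```python
import pandas as pd

# Toy frames: column names follow the printed output, values are invented.
df_income_demo = pd.DataFrame(
    {"Geographic Area": ["XX"], "City": ["Example city"], "Median Income": [40000]})
df_race_demo = pd.DataFrame(
    {"Geographic area": ["XX"], "City": ["Example city"], "share_white": [50.0]})

# Align the differently capitalised column name before joining.
df_race_demo = df_race_demo.rename(columns={"Geographic area": "Geographic Area"})

# With matching names, both key columns can be passed to `on`.
merged = df_income_demo.merge(
    df_race_demo, on=["Geographic Area", "City"], how="inner")
print(merged.columns.tolist())
```

Without the rename, `merge(on=["Geographic Area", "City"])` would raise a `KeyError` because the key is missing from one side.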


In [6]:
print("Calculating NaN values in each DataFrame:")
print(
    f"\n\nHouseHold Income DataFrame (df_hh_income) NaN values:\n{df_hh_income.isna().sum()}")
print(
    f"\n\nPoverty DataFrame (df_pct_poverty) NaN values:\n{df_pct_poverty.isna().sum()}")
print(
    f"\n\nHigh School Completion DataFrame (df_pct_completed_hs) NaN values:\n{df_pct_completed_hs.isna().sum()}")
print(
    f"\n\nShare of City Race DataFrame (df_share_race_city): NanValues:\n{df_share_race_city.isna().sum()}")
print(
    f"\n\nDeath by Police DataFrame (df_fatalities): NanValues:\n{df_fatalities.isna().sum()}")

Calculating NaN values in each DataFrame:


HouseHold Income DataFrame (df_hh_income) NaN values:
Geographic Area     0
City                0
Median Income      51
dtype: int64


Poverty DataFrame (df_pct_poverty) NaN values:
Geographic Area    0
City               0
poverty_rate       0
dtype: int64


High School Completion DataFrame (df_pct_completed_hs) NaN values:
Geographic Area         0
City                    0
percent_completed_hs    0
dtype: int64


Share of City Race DataFrame (df_share_race_city): NanValues:
Geographic area          0
City                     0
share_white              0
share_black              0
share_native_american    0
share_asian              0
share_hispanic           0
dtype: int64


Death by Police DataFrame (df_fatalities): NanValues:
id                           0
name                         0
date                         0
manner_of_death              0
armed                        9
age                         77
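gender                       0
race                       195
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        65
body_camera                  0
dtype: int64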

In [7]:
print("Calculating Duplicate values in each DataFrame:")
print(
    f"\n\nHouseHold Income DataFrame (df_hh_income) Duplicate values:\n{df_hh_income.duplicated().sum()}")
print(
    f"\n\nPoverty DataFrame (df_pct_poverty) Duplicate values:\n{df_pct_poverty.duplicated().sum()}")
print(
    f"\n\nHigh School Completion DataFrame (df_pct_completed_hs) Duplicate values:\n{df_pct_completed_hs.duplicated().sum()}")
print(
    f"\n\nShare of City Race DataFrame(df_share_race_city) Duplicate values:\n{df_share_race_city.duplicated().sum()}")
print(
    f"\n\nDeath by Police DataFrame (df_fatalities) Duplicate values:\n{df_fatalities.duplicated().sum()}")

Calculating Duplicate values in each DataFrame:


HouseHold Income DataFrame (df_hh_income) Duplicate values:
0


Poverty DataFrame (df_pct_poverty) Duplicate values:
0


High School Completion DataFrame (df_pct_completed_hs) Duplicate values:
0


Share of City Race DataFrame(df_share_race_city) Duplicate values:
0


Death by Police DataFrame (df_fatalities) Duplicate values:
0
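The shape, NaN, and duplicate checks above repeat the same calls for each DataFrame. As a sketch, they could be collected into one summary table with a small helper. The `summarize` function and the toy frame below are hypothetical and not part of the original notebook:

```python
import pandas as pd

def summarize(frames):
    """Return one row per DataFrame: rows, columns, NaN cells, duplicate rows."""
    records = []
    for name, df in frames.items():
        records.append({
            "name": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "nan_cells": int(df.isna().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(records).set_index("name")

# Toy frame for illustration: one repeated row and one missing value.
demo = pd.DataFrame({"City": ["A", "A", "B"], "value": [1.0, 1.0, None]})
summary = summarize({"demo": demo})
print(summary)
```

In the notebook, the call would look like `summarize({"df_hh_income": df_hh_income, "df_fatalities": df_fatalities})`, extended with the remaining three frames.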


## <b style="color: green"> There are no duplicated values. </b>

## <b style="color: orange"> But there are NaN values in several places. Let's see what we can do about that. </b>

In [11]:
# First, household income: count the rows that contain at least one NaN
df_hh_income[df_hh_income.isna().any(axis=1)].count()

Geographic Area    51
City               51
Median Income       0
dtype: int64

In [12]:
# Drop the rows with NaN values: Median Income is the key column here,
# so filling it with a placeholder value would distort the analysis
df_hh_income.dropna(inplace=True)
df_hh_income.isna().sum()

Geographic Area    0
City               0
Median Income      0
dtype: int64

In [15]:
df_fatalities.isna().sum()

id                           0
name                         0
date                         0
manner_of_death              0
armed                        9
age                         77
gender                       0
race                       195
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                        65
body_camera                  0
dtype: int64

In [14]:
# Let's deal with df_fatalities
df_fatalities[df_fatalities.isna().any(axis=1)]

| | id | name | date | manner_of_death | armed | age | gender | race | city | state | signs_of_mental_illness | threat_level | flee | body_camera |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59 | 110 | William Campbell | 25/01/15 | shot | gun | 59.00 | M | NaN | Winslow | NJ | False | attack | Not fleeing | False |
| 124 | 584 | Alejandro Salazar | 20/02/15 | shot | gun | NaN | M | H | Houston | TX | False | attack | Car | False |
| 241 | 244 | John Marcell Allen | 30/03/15 | shot | gun | 54.00 | M | NaN | Boulder City | NV | False | attack | Not fleeing | False |
| 266 | 534 | Mark Smith | 09/04/15 | shot and Tasered | vehicle | 54.00 | M | NaN | Kellyville | OK | False | attack | Other | False |
| 340 | 433 | Joseph Roy | 07/05/15 | shot | knife | 72.00 | M | NaN | Lawrenceville | GA | True | other | Not fleeing | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2528 | 2812 | Alejandro Alvarado | 27/07/17 | shot | knife | NaN | M | H | Chowchilla | CA | False | attack | Not fleeing | False |
| 2529 | 2819 | Brian J. Skinner | 28/07/17 | shot | knife | 32.00 | M | NaN | Glenville | NY | True | other | Not fleeing | False |
| 2530 | 2822 | Rodney E. Jacobs | 28/07/17 | shot | gun | 31.00 | M | NaN | Kansas City | MO | False | attack | Not fleeing | False |
| 2531 | 2813 | TK TK | 28/07/17 | shot | vehicle | NaN | M | NaN | Albuquerque | NM | False | attack | Car | False |


### I don't think we can replace any of these NaN values with a default value, because doing so would affect the analysis
### So, I will decide how to handle the NaN values separately for each analysis
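As a rough sketch of what deciding per analysis could look like, the toy frame below leaves the source data untouched and handles missing values only at the point of use. The rows are invented, and the two analyses shown are assumptions about what might come later, not steps taken from this notebook:

```python
import numpy as np
import pandas as pd

# Toy rows using three of the df_fatalities columns; values are invented.
demo = pd.DataFrame({
    "age": [59.0, np.nan, 54.0, 31.0],
    "race": [np.nan, "H", np.nan, "W"],
    "flee": ["Not fleeing", "Car", "Other", np.nan],
})

# Age analysis: drop only the rows that lack an age.
age_subset = demo.dropna(subset=["age"])
mean_age = age_subset["age"].mean()

# Race analysis: keep missing values visible as their own category
# instead of dropping or relabelling them.
race_counts = demo["race"].value_counts(dropna=False)

print(len(age_subset), mean_age)
print(race_counts)
```

The original frame keeps all of its rows, so a row that lacks `race` still contributes to the age analysis, and the reverse.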
