# US Police Shootings - A First Data Analysis Study #

This notebook analyses data on people killed by the US police between $2015$ and $2020$.

The data are provided by several Kaggle users:
1. [data-police-shootings](https://www.kaggle.com/mrmorj/data-police-shootings) uploaded by Andriy Samoshyn (used as main data source)
2. [fatal-police-shootings-in-the-us](https://www.kaggle.com/kwullum/fatal-police-shootings-in-the-us) uploaded by Karolina Wullum (used for data comparison)
3. [us-police-shootings](https://www.kaggle.com/ahsen1330/us-police-shootings) uploaded by Ahsen Nazir (used for data comparison)
4. [police-violence-in-the-us](https://www.kaggle.com/jpmiller/police-violence-in-the-us) uploaded by JohnM (used for data comparison)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#data visualisation
import seaborn as sns 
from matplotlib import pyplot as plt 
import plotly.graph_objects as go
import plotly.figure_factory as ff


## Looking at the data ##

We start by looking at the entries of the data. The table we are analysing contains $4895$ entries, meaning that at least $4895$ people were killed by the US police from $2015$ to $2020$.

In [None]:
df = pd.read_csv('../input/us-police-shootings/shootings.csv')
df.info()

Luckily, no column has null entries, and we see that the vast majority of the columns are non-numeric. Let us look in detail at the first rows of the table.

In [None]:
df.head()

## Fatal encounter study ##

We now explore the data about how the people were killed. 

In [None]:
from plotly.subplots import make_subplots
colors = ['indianred','crimson']
fig = go.Figure()
# Take labels and heights from the same value_counts() result so they cannot get out of order
counts = df['manner_of_death'].value_counts()
fig.add_trace(go.Bar(x = counts.index, y = counts, marker_color = colors, name ='manner of death'))
fig.update_layout(width = 800)
fig.show()

The vast majority were shot ($4647$), and a much smaller number were shot and Tasered ($248$). We now look at the deaths according to the threat level.

In [None]:
data = df['threat_level'].value_counts()
fig = go.Figure()
fig.add_trace(go.Bar(x = data.index, y = data, marker_color = 'indianred', name ='Threat level'))
fig.update_layout(width = 800)
fig.show()

As we see, $31.2\%$ of the people killed were not considered to be attacking.
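
A percentage like the one above can be read directly from normalised counts. The following is a minimal sketch on a hypothetical toy frame; the helper `category_shares` and the sample rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy records; in the notebook this would be df['threat_level']
toy = pd.DataFrame({
    'threat_level': ['attack', 'attack', 'other', 'undetermined', 'attack', 'other']
})

def category_shares(series):
    """Return the percentage of rows in each category, rounded to one decimal."""
    return (series.value_counts(normalize=True) * 100).round(1)

shares = category_shares(toy['threat_level'])
print(shares)
```

Using `normalize=True` keeps the labels and the percentages in the same object, so they cannot be mismatched when plotted.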

The distribution of deaths by race among people who were unarmed and not attacking is quite striking.

In [None]:
data = df.groupby(['threat_level','armed'])['race'].value_counts()
fig = go.Figure()
fig.add_trace(go.Bar(x=data.loc['other']['unarmed'].index, y = data.loc['other']['unarmed'], marker_color = 'indianred'))
fig.update_layout(title = 'Unarmed and not attacking deaths per Race')
fig.show()

Considering how much smaller the Black and Hispanic populations of the US are than the White population, the differences between the counts are too small, especially for Black people.

Next we look at the cases in which the police both shot and used a Taser. We want to understand better what led them to use both. In particular, we would like to see how these cases break down by threat level and type of weapon.

In [None]:
col = ['manner_of_death','threat_level']
data = df.groupby(col)['arms_category'].value_counts()
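# Count the shot-and-Tasered cases from the data instead of hard-coding 248
n_shot_tasered = (df['manner_of_death'] == 'shot and Tasered').sum()
print('Percentage of people shot and Tasered not considered attacking {}% \n'.format(round(data.loc['shot and Tasered']['other'].sum()*100/n_shot_tasered,1)))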
print('Weapon distribution of People not considered attacking, that have been shot and Tasered \n\n {}'.format(data.loc['shot and Tasered']['other']))


Of the people shot and Tasered, $47.6\%$ were not considered to be attacking. Among them only $8.5\%$ had a gun, and $13.5\%$ were actually unarmed.
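
Shares within a subgroup, such as the ones quoted above, can also be obtained from a row-normalised contingency table. This is a small sketch on hypothetical rows; only the column names `manner_of_death` and `threat_level` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy records using the notebook's column names
toy = pd.DataFrame({
    'manner_of_death': ['shot', 'shot', 'shot and Tasered',
                        'shot and Tasered', 'shot and Tasered', 'shot'],
    'threat_level':    ['attack', 'other', 'other',
                        'attack', 'other', 'attack'],
})

# normalize='index' makes every row sum to 1; multiply by 100 for percentages
row_pct = pd.crosstab(toy['manner_of_death'], toy['threat_level'],
                      normalize='index') * 100
print(row_pct.round(1))
```

Each row then answers the question "given this manner of death, how were the threat levels distributed?" without needing a hard-coded group size.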

## Age, Gender and Race ##

We now study the age, gender and race distribution of the deaths. We start by looking at the ages.

In [None]:
df['age'] = df['age'].astype(int)
df['age'].describe()

The average age is $36$. Sadly, the minimum age is $6$: two six-year-old children were killed, in the following events:
1. [Jeremy Mardis](http://en.wikipedia.org/wiki/Shooting_of_Jeremy_Mardis)
2. [Kameron Prescott](http://www.ksat.com/news/2019/03/13/deputies-involved-in-fatal-shooting-of-kameron-prescott-wont-face-charges/)

The oldest person was $91$ years old. The first and third quartiles are $27$ and $45$ years respectively.

Next we start looking at gender. 

In [None]:
fig = make_subplots(rows =1, cols =2,specs=[[{"type": "box"},{"type": "pie"}]], column_titles = ['Age Distribution per Gender', 'Gender Count and Ratio'])
# Use every record, so that each age is weighted by how often it occurs
fig.add_trace(go.Box(x = df['gender'], y = df['age'], showlegend = False, boxmean = True, name = 'gender age'),1,1)
# Take labels and values from the same value_counts() result so they stay aligned
gender_counts = df['gender'].value_counts()
fig.add_trace(go.Pie(labels = gender_counts.index, values = gender_counts, showlegend = True, name ='quantities'), row =1, col = 2)
fig.update_layout(boxmode='group', width = 800)
fig.show()

As we see, the vast majority are male. The table below shows the gender distribution per race.

In [None]:
data = df.groupby('race')['gender'].value_counts()
print('Gender distribution per Race \n \n{}'.format(data))

The next plot shows the ages according to race.

In [None]:
# Note: value_counts() keeps each distinct age only once per race, so the boxes below are unweighted
data = df.groupby(['race'])['age'].value_counts()
fig = go.Figure()
fig.add_trace(go.Box(x = [i[0] for i in data.index], y =[j[1] for j in data.index], marker_color = 'indianred', boxmean = True))
fig.update_layout(yaxis = dict(title = 'Age'))
fig.show()

We can also draw the age per race from the individual records, so that every person contributes to the box of their race.

In [None]:
fig = go.Figure()
for race in df['race'].unique():
    fig.add_trace(go.Box(y = df['age'][df['race'] == race], marker_color = 'indianred', name = race, boxmean = True))
# The y axis shows ages; the layout only needs to be set once, outside the loop
fig.update_layout(yaxis = dict(title = 'Age'), showlegend = False)

fig.show()

The share of deaths per race is the following.

In [None]:
fig = go.Figure()
fig.add_trace(go.Pie(labels= df['race'].value_counts().index, values = df['race'].value_counts()))
fig.show()

The number of White people killed is the highest, followed by the number of Black people. However, the population of each race is very different, so for a better understanding we should rescale the data with respect to the population of each race.

We were not able to retrieve the race demographics for every year; however, we found them for the years $2015-2017$ [here](http://en.wikipedia.org/wiki/Historical_racial_and_ethnic_demographics_of_the_United_States).
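
The rescaling described above amounts to dividing each count by the matching population. A minimal sketch follows, assuming the populations are expressed in millions; the helper `deaths_per_million` and all figures in it are hypothetical and only illustrate the label-based alignment.

```python
import pandas as pd

def deaths_per_million(deaths, population_millions):
    """Divide counts by population (in millions), matching on index labels."""
    deaths, population_millions = deaths.align(population_millions, join='inner')
    return (deaths / population_millions).round(3)

# Hypothetical counts and populations, deliberately listed in different orders
deaths = pd.Series({'White': 400, 'Black': 200, 'Hispanic': 150})
population = pd.Series({'Hispanic': 50.0, 'White': 200.0, 'Black': 40.0})

rates = deaths_per_million(deaths, population)
print(rates)
```

Because both objects are indexed by race, the division pairs each count with the right population even when the two are sorted differently, which a plain Python list cannot guarantee.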

In [None]:
df['date'] = pd.to_datetime(df['date'])
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.strftime('%b')
df['Week Day']= df['date'].dt.day_name()

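# US population per race in millions, indexed by race so that the division aligns on labels
# (Native = American Indian/Alaska Native + Pacific Islander)
races = ['White', 'Black', 'Hispanic', 'Asian', 'Native']
race_USA_2015 = pd.Series([197.534, 39.597, 56.496, 17.273, 2.597+0.554], index = races)
race_USA_2016 = pd.Series([197.479, 39.717, 57.398, 17.556, 2.676+0.595], index = races)
race_USA_2017 = pd.Series([197.285, 40.129, 58.846, 18.215, 2.726+0.608], index = races)
# reindex() drops 'Other' and fixes the order, whatever the order value_counts() returns
race_2017 = df['race'][df['Year'] == 2017].value_counts().reindex(races)
race_2016 = df['race'][df['Year'] == 2016].value_counts().reindex(races)
race_2015 = df['race'][df['Year'] == 2015].value_counts().reindex(races)
proportion_2017 = round(race_2017/race_USA_2017,3)
proportion_2016 = round(race_2016/race_USA_2016,3)
proportion_2015 = round(race_2015/race_USA_2015,3)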
colors = ['darkred','crimson', 'firebrick', 'sienna','peru']
fig = go.Figure()
fig.add_trace(go.Bar(x = proportion_2015.index, y = proportion_2015, name = '2015' ))
fig.add_trace(go.Bar(x = proportion_2016.index, y = proportion_2016, name = '2016' ))
fig.add_trace(go.Bar(x = proportion_2017.index, y = proportion_2017, name = '2017' ))
fig.update_layout(barmode='group',title_text = 'Deaths per race per million people in 2015-2017')
fig.show()

As we see, over the $2015-2017$ period Black people have the highest number of people killed by the police per million.
The rate for Native people grew steadily and more than doubled over the three years. In $2017$ they have the highest ratio of people killed to population of all races and years. Out of curiosity we looked for more on this and found an interesting article on the subject. You may consult it [here](https://edition.cnn.com/2017/11/10/us/native-lives-matter/index.html).

## Region and States ##

Here we look at the states and at the macro-regions North East, Mid West, West and South. As always, for a clear understanding we need to rescale the data with respect to the population of each state or region. We found state population data for the year $2019$ [here](https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population).

Before using them we need to clean the data as the population entry is written as text.

In [None]:
df_pop = pd.read_csv('../input/stateus-pop-and-race-dist/statesUS - states-Foglio2.csv')
# Strip the thousands separators in one vectorised step, then convert to integers
df_pop['Population'] = df_pop['Population'].str.replace(',', '').astype(int)
df_pop.head()

Now we can perform the analysis for $2019$.

In [None]:
state_2019 = df['state'][df['Year']==2019].value_counts()
prop_states_2019 = []
for i,state in enumerate(df_pop['Abbreviation']):
    try: prop_states_2019.append((state,state_2019[state]*1000000/df_pop['Population'].loc[i]))
    except KeyError: continue
fig = go.Figure()
fig.add_trace(go.Bar(x = [i[0] for i in prop_states_2019], y =[round(j[1],3) for j in prop_states_2019], marker = dict(color = [round(j[1],3) for j in prop_states_2019])))
fig.update_layout(title_text = 'Deaths per state per million inhabitants, 2019')
fig.show()

Here, Alaska and Oklahoma have the highest death rates, with about $8$ people killed by the police per million inhabitants, followed by West Virginia with about $6$ per million.

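Next we divide the states into four regions modelled on those used by the US census: North East, Mid West, West and South. Note that the mapping below assigns Delaware, DC and Maryland to the North East, whereas the Census Bureau places them in the South.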

In [None]:
region = {'CT':'NE','DE':'NE','DC':'NE','ME':'NE','MD':'NE','MA':'NE','NH':'NE','NJ':'NE','NY':'NE','PA':'NE','RI':'NE','VT':'NE',
          'ND':'MW','SD':'MW','NE':'MW','KS':'MW','MO':'MW','IA':'MW','MN':'MW','WI':'MW','MI':'MW','IL':'MW','IN':'MW','OH':'MW',
          'AL':'S','AR':'S','FL':'S','GA':'S','KY':'S','LA':'S','MS':'S','NC':'S','SC':'S','TN':'S','VA':'S','WV':'S','TX':'S','OK':'S',
          'CO':'W','ID':'W','MT':'W','NV':'W','UT':'W','WY':'W','AK':'W','CA':'W','HI':'W','OR':'W','WA':'W','AZ':'W','NM':'W'}
df['region'] = df['state'].map(region)
#df.loc[df['region'].isna()]['state'].unique()

Then we plot the contingency table of the regions and races.

In [None]:
tab = pd.crosstab(df['race'],df['region'], margins = False)
tab

As we see, for Black and White people the region with the most people killed is the South. On the other hand, for Asian, Hispanic, Native and Other it is the West.

We would like to test whether the two categorical variables race and region are dependent. We apply a $\chi^2$ contingency test, restricting the race variable to Black, White and Hispanic. As always, we need to rescale by the population of each race in each region. We retrieve the data from the Statistical Atlas website (see [here](https://statisticalatlas.com/United-States/Race-and-Ethnicity)).

In [None]:
n_tab = pd.concat([tab.iloc[1],tab.iloc[2],tab.iloc[5]],axis = 1)

NE_race_dist = [6.55,6.74,37.9] #Black, Hispanic, White
MW_race_dist = [6.92,4.67,52] 
S_race_dist = [22.3,18.4,69] 
W_race_dist = [3.41,20.8,38.1] 

n_tab.iloc[0] = n_tab.iloc[0]/MW_race_dist
n_tab.iloc[1] = n_tab.iloc[1]/NE_race_dist
n_tab.iloc[2] = n_tab.iloc[2]/S_race_dist
n_tab.iloc[3] = n_tab.iloc[3]/W_race_dist
n_tab = round(n_tab).astype(int)
n_tab

The difference between the number of Black people killed by the police in each region (per million) and that of the other two races is quite remarkable.

In [None]:
from scipy.stats import chi2_contingency
stat,p,dof,expected = chi2_contingency(n_tab)
p,expected

Based on our data, there is no statistical evidence that race and region are dependent after rescaling. Unsurprisingly, running the test before rescaling indicates a strong dependence, which is explained by the different share of each race in each region.
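
To make the independence test less of a black box, the sketch below recomputes by hand the expected counts and the uncorrected $\chi^2$ statistic of a contingency table. The function names and the two toy tables are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np

def independence_expected(table):
    """Expected counts if rows and columns were independent."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1, keepdims=True)   # shape (r, 1)
    col_totals = table.sum(axis=0, keepdims=True)   # shape (1, c)
    return row_totals @ col_totals / table.sum()

def chi2_statistic(table):
    """Pearson chi-square statistic, without continuity correction."""
    table = np.asarray(table, dtype=float)
    expected = independence_expected(table)
    return ((table - expected) ** 2 / expected).sum()

proportional = np.array([[10, 20], [20, 40]])  # rows proportional: no dependence
skewed = np.array([[30, 10], [10, 30]])        # rows clearly differ

print(chi2_statistic(proportional), chi2_statistic(skewed))
```

A table with proportional rows gives a statistic of zero, while the skewed one gives a large value. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction by default to 2x2 tables, so in that case its statistic can differ from this uncorrected one.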

However, it is interesting to study whether the deaths in each region are distributed among the races in proportion to their population. To see this we need the expected number of deaths of each race in each region.
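
The expected amount used here is simply the region's total number of deaths split according to each race's share of the population. A minimal sketch, assuming a hypothetical total of $1000$ deaths; `MW_race_dist` repeats the Mid West figures given earlier, while `expected_by_population` is an illustrative helper.

```python
import numpy as np

def expected_by_population(total_deaths, population):
    """Split total_deaths proportionally to the population of each group."""
    population = np.asarray(population, dtype=float)
    return total_deaths * population / population.sum()

MW_race_dist = [6.92, 4.67, 52]  # Black, Hispanic, White (millions)
total_mw = 1000                  # hypothetical total, for illustration only

expected_mw = expected_by_population(total_mw, MW_race_dist)
print(expected_mw.round(1))
```

The expected values always add up to the regional total, so comparing them with the observed counts shows which groups are above or below their population share.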

In [None]:
exp_H = np.array([(tab['MW'].sum()/sum(MW_race_dist))*MW_race_dist[1],(tab['NE'].sum()/sum(NE_race_dist))*NE_race_dist[1], (tab['S'].sum()/sum(S_race_dist))*S_race_dist[1],(tab['W'].sum()/sum(W_race_dist))*W_race_dist[1]])
exp_B = np.array([(tab['MW'].sum()/sum(MW_race_dist))*MW_race_dist[0],(tab['NE'].sum()/sum(NE_race_dist))*NE_race_dist[0], (tab['S'].sum()/sum(S_race_dist))*S_race_dist[0],(tab['W'].sum()/sum(W_race_dist))*W_race_dist[0]])
exp_W = np.array([(tab['MW'].sum()/sum(MW_race_dist))*MW_race_dist[2],(tab['NE'].sum()/sum(NE_race_dist))*NE_race_dist[2], (tab['S'].sum()/sum(S_race_dist))*S_race_dist[2],(tab['W'].sum()/sum(W_race_dist))*W_race_dist[2]])

In [None]:
fig = make_subplots(rows = 1, cols = 3)
fig.add_trace(go.Bar(x = tab.columns, y=tab.loc['Black'], name= 'Observed Black Deaths'),1,1)
fig.add_trace(go.Bar(x = tab.columns, y=exp_B,name = 'Expected Black Deaths'),1,1)

fig.add_trace(go.Bar(x = tab.columns, y=exp_H, name = 'Expected Hispanic Deaths'),1,2)
fig.add_trace(go.Bar(x = tab.columns, y=tab.loc['Hispanic'], name= 'Observed Hispanic Deaths'),1,2)

fig.add_trace(go.Bar(x = tab.columns, y=exp_W, name = 'Expected White Deaths'),1,3)
fig.add_trace(go.Bar(x = tab.columns, y=tab.loc['White'], name= 'Observed White Deaths'),1,3)

fig.update_traces(opacity = 0.6)
fig.update_layout(barmode='overlay',title_text = 'Deaths per Region: Observed vs Expected',showlegend = True)
fig.show()

We can clearly see that the number of Black people killed is well above the expected amount in each region (165 more on average). On the other hand, the number of White people killed is always below the expected amount in each region (193 fewer on average). To be completely rigorous we perform a final $\chi^2$ goodness-of-fit test.

In [None]:
obs_B = np.array(tab.loc['Black'])
obs_H = np.array(tab.loc['Hispanic'])
obs_W = np.array(tab.loc['White'])
from scipy.stats import chisquare
chisquare(obs_B,exp_B)
print('Black deaths p-value:{}, \n Hispanic deaths p-value:{}, \n White deaths p-value:{}.'.format(chisquare(obs_B,exp_B)[1],chisquare(obs_H,exp_H)[1],chisquare(obs_W,exp_W)[1]))

This strongly supports our observation that the deaths are not equidistributed among the races.

Finally, we can check whether the regional distributions of two races come from populations with the same median.

In [None]:
from scipy.stats import kruskal
kruskal(n_tab['White'],n_tab['Black'])

The p-value is about $2\%$: if the regional distribution of Black people killed by the police were the same as that of White people, a difference at least this large would occur only about $2\%$ of the time, so we reject that hypothesis.
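
As an illustration of how the Kruskal-Wallis result is read, the sketch below runs `scipy.stats.kruskal` on two small, completely separated samples. The sample values and the $5\%$ significance level are assumptions chosen for the example, not figures from the notebook.

```python
from scipy.stats import kruskal

# Hypothetical samples: every value in `high` exceeds every value in `low`
low = [1.0, 2.0, 3.0, 4.0]
high = [5.0, 6.0, 7.0, 8.0]

stat, p = kruskal(low, high)

alpha = 0.05  # assumed significance level
reject_same_distribution = p < alpha
print(stat, p, reject_same_distribution)
```

The test only uses the ranks of the values, so it does not assume normality. With four values per group, complete separation is the most extreme outcome possible, which bounds how small the p-value can get for samples of that size.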

We also look at the linear correlation table of the race and region variables.

In [None]:
data = pd.get_dummies(df[['race','state','region']]) 
ind = ['race_Asian', 'race_Black', 'race_Hispanic', 'race_Native', 'race_Other', 'race_White']
cols_1 = ['region_MW', 'region_NE', 'region_S', 'region_W']
cols_2 = ['state_AK', 'state_AL', 'state_AR','state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DC', 'state_DE',
          'state_FL', 'state_GA', 'state_HI', 'state_IA', 'state_ID', 'state_IL', 'state_IN', 'state_KS', 'state_KY', 'state_LA', 
          'state_MA', 'state_MD','state_ME', 'state_MI', 'state_MN', 'state_MO', 'state_MS', 'state_MT', 'state_NC', 'state_ND', 
          'state_NE', 'state_NH', 'state_NJ', 'state_NM','state_NV', 'state_NY', 'state_OH', 'state_OK', 'state_OR', 'state_PA',
          'state_RI', 'state_SC', 'state_SD', 'state_TN', 'state_TX', 'state_UT','state_VA', 'state_VT', 'state_WA', 'state_WI', 
          'state_WV', 'state_WY']

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(data.corr()[cols_1].loc[ind], annot = True, fmt = ".2f", cmap = "coolwarm")

We now start to investigate the relation between the races and the states.

In [None]:
fig = go.Figure()
fig.add_trace(go.Heatmap(z = data.corr()[cols_2].loc[ind], x=cols_2, y = ind, colorscale = 'thermal'))
fig.update_layout(title_text = 'Race-State deaths correlation table')
fig.show()

Then we would like to repeat the $\chi^2$ test of independence for the categorical variables race = {White, Black} and state.

Before doing so we need to manipulate our data, because some states (usually the smaller ones) have very few deaths. We set our threshold at $60$ total deaths per state: states whose population is too small to expect that many deaths at the national rate are merged into a single group.
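
The pooling step can be sketched as follows: convert the minimum number of deaths into a population threshold using the national rate, then merge every state below it. The helper `pool_small_states`, the two-letter codes and the population figures are hypothetical; only the label `OTR` and the value $60$ come from the notebook.

```python
import pandas as pd

def pool_small_states(pop, total_deaths, min_expected=60):
    """Merge states too small to expect `min_expected` deaths into 'OTR'."""
    national_rate = total_deaths / pop.sum()   # deaths per inhabitant
    threshold = min_expected / national_rate   # population needed to expect min_expected deaths
    small = pop < threshold
    pooled = pop[~small].copy()
    pooled['OTR'] = pop[small].sum()
    return threshold, pooled

# Hypothetical state populations
pop = pd.Series({'AA': 10_000_000, 'BB': 1_000_000,
                 'CC': 500_000, 'DD': 8_500_000})

threshold, pooled = pool_small_states(pop, total_deaths=400)
print(threshold, pooled.to_dict())
```

Pooling by expected rather than observed deaths keeps the grouping independent of the outcome being tested.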

In [None]:
US_pop = df_pop['Population'].sum()
Us_mean = len(df)/US_pop
Threshold = 60/Us_mean

low_states = df_pop['State'][df_pop['Population']<Threshold]
other_pop = df_pop['Population'][df_pop['Population']<Threshold].sum()
others_dist = np.zeros(3)
for state in low_states:
    others_dist  = others_dist + np.array(df_pop[['White (mil)', 'Hispanic (mil','Black (mil)']][df_pop['State'] == state])

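# Column order in others_dist is White, Hispanic, Black, so Black is index 2
n_row = {'State':'Other', 'Abbreviation':'OTR', 'Population': other_pop, 'White (mil)': others_dist[0][0],'Hispanic (mil':others_dist[0][1], 'Black (mil)': others_dist[0][2] }
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_pop = pd.concat([df_pop, pd.DataFrame([n_row])], ignore_index = True)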

Then we need to rescale with respect to the White and Black population of each state. As before the data comes from Statistical Atlas.

In [None]:
tab = pd.crosstab(df['race'],df['state'], margins = False)
low_ab = [df_pop['Abbreviation'][df_pop['State']==state] for state in low_states]
new_other = np.zeros(6).astype(int)
for ab in low_ab:
    A = np.array(tab[ab]).T
    new_other = new_other + A
    tab = tab.drop(ab,axis = 1)

n_col = pd.DataFrame({'OTR': new_other[0]})

n_col.index = ['Asian','Black','Hispanic','Native','Other','White']

tab = pd.concat([tab,n_col],axis = 1)

n_tab = pd.concat([tab.iloc[5],tab.iloc[2],tab.iloc[1]],axis = 1)
for state in n_tab.index:
    A = np.array(n_tab.loc[state])
    B = np.array(df_pop[['White (mil)', 'Hispanic (mil','Black (mil)']][df_pop['Abbreviation']== state])
    # Divide each race's count by that race's own population (B[0] is the White, Hispanic, Black row)
    entries = A/B[0]
    n_tab.loc[state] = entries[0],entries[1],entries[2]

n_tab = round(n_tab).astype(int)
n_tab = n_tab[['White','Black']]
n_tab

In [None]:
stat,p,dof,expected = chi2_contingency(n_tab)
p,expected

In conclusion, there is significant evidence that the numbers of White and Black people killed by the police depend on the state.

## History ##

We now study the evolution of law enforcement killings over time.

Let's begin with the number of deaths per year.

In [None]:
data = df.groupby('Year')['Month'].count()
fig = go.Figure()
fig.add_trace(go.Bar(x = data.index, y = data))
fig.update_layout(yaxis = dict(title = 'Deaths count'))
fig.show()

There is a slightly decreasing trend over the years. The count for $2020$ is small because the data stop in June.

We now plot the distribution of deaths per month.

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = df.groupby('Year')['Month'].value_counts(),boxmean = 'sd', name ='Deaths per month 2015-2020'))
fig.show()


We have an average of $74$ deaths per month; the maximum was $100$ and the minimum $22$ (June 2020, because the data stop on the 15th).
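
Monthly summaries like these can be computed by grouping the dates by calendar month. This is a minimal sketch on hypothetical dates; only the column name `date` matches the notebook.

```python
import pandas as pd

# Hypothetical event dates
dates = pd.to_datetime(['2015-01-03', '2015-01-20', '2015-02-11',
                        '2016-01-05', '2016-01-09', '2016-01-30'])
toy = pd.DataFrame({'date': dates})

# One count per (year, month) pair, kept in chronological order
monthly = toy.groupby(toy['date'].dt.to_period('M')).size()
summary = monthly.agg(['mean', 'min', 'max'])
print(monthly)
print(summary)
```

Grouping by a monthly period keeps January 2015 and January 2016 apart and sorts the result chronologically. Months without any event do not appear in the result, so a partially covered final month still shows up as a low count.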

We can specialise the previous study for each year.

In [None]:
data = df.groupby('Year')['Month'].value_counts()
fig = go.Figure()
fig.add_trace(go.Box(x = [i[0] for i in data.index], y = data,boxmean = True))
fig.update_layout(yaxis = dict(title = 'Deaths count'))
fig.show()

We can also look at the deaths in each month of each year. We first plot them grouped by month.

In [None]:
from plotly.subplots import make_subplots

fig = go.Figure()
for year in df['Year'].unique():
    df_year = df[df['Year']==year]
    entry = [(month,df_year['Month'][df_year['Month']==month].count()) for month in df_year['Month'].unique()]
    fig.add_trace(go.Bar(x = [i[0] for i in entry], y = [j[1] for j in entry], name = '{}'.format(year)))
    fig.update_layout(barmode='group', yaxis = dict(title = 'Deaths count'), showlegend = True)
fig.show()

Then we plot the monthly trend.

In [None]:
data = df.groupby('Year')['Month'].value_counts().to_frame(name = 'count').reset_index()
nmonth = {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}
data['month_n'] = data['Month'].map(nmonth)
data['month_year'] = data['Year'].astype(str)+' '+ data['Month']
fig = go.Figure()
d = data.sort_values(by = ['Year','month_n'])
fig.add_trace(go.Bar(x = d['month_year'], y = d['count'], marker = dict(color =d['count'])))
fig.update_layout(barmode='group', yaxis = dict(title = 'Victims count'), xaxis = dict(nticks = 6, dtick= 12,  tickangle = -40))
fig.show()

Finally we perform a Kruskal test to determine whether the monthly numbers of White, Black and Hispanic people killed, rescaled by population, differ from one another.

In [None]:
data = df.groupby(['Year','race'])['Month'].value_counts().to_frame(name = 'count').reset_index()

data_black = data.loc[(data['race'] == 'Black') & (data['Year'] <= 2017)].copy()
data_white = data.loc[(data['race'] == 'White')& (data['Year'] <= 2017)].copy()
data_hispanic = data.loc[(data['race'] == 'Hispanic') & (data['Year'] <= 2017)].copy()

wbh_dist_2015 = [197.534,39.597,56.496]
wbh_dist_2016 = [197.479, 39.717,57.398]
wbh_dist_2017 = [197.285, 40.129, 58.846]

# Population per race in tens of millions, keyed by year (same figures as the lists above)
white_pop = {2015: 19.7534, 2016: 19.7479, 2017: 19.7285}
black_pop = {2015: 3.9597, 2016: 3.9717, 2017: 4.0129}
hisp_pop = {2015: 5.6496, 2016: 5.7398, 2017: 5.8846}

# Map each row's year to its population, so the division does not depend on row order or on having 12 rows per year
data_white['count'] = data_white['count']/data_white['Year'].map(white_pop)
data_black['count'] = data_black['count']/data_black['Year'].map(black_pop)
data_hispanic['count'] = data_hispanic['count']/data_hispanic['Year'].map(hisp_pop)


print(kruskal(data_white['count'],data_black['count']))
print(kruskal(data_white['count'],data_hispanic['count']))
print(kruskal(data_hispanic['count'],data_black['count']))

In conclusion, the differences over time between the police shootings of White, Hispanic and Black people are consistently statistically significant.