# Fatal Police Shootings US June 2021 - EDA

In this notebook I will explore the phenomenon of fatal police shootings in the U.S. using data provided by the Washington Post (https://www.washingtonpost.com/graphics/investigations/police-shootings-database/, updated to June 1st) and other sources. The analysis is purely exploratory, and the considerations are personal and based on the data sources. I will start by analyzing some social variables; then, combining them with the Washington Post data, I'll try to extract other useful information. Please consider that I'm a beginner data science student and this is one of my first projects. Feel free to comment with any advice that could help me improve. If you are a beginner as well, I hope you'll find something useful in this notebook.

![](http://i.guim.co.uk/img/media/9765477859f031afe596005d82bc7b4aa1d3a76c/0_106_3500_2099/master/3500.jpg?width=445&quality=45&auto=format&fit=max&dpr=2&s=414b457190a3f8acba2d2224a749f8e4)

In [None]:
#libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import plotly.express as px
import plotly.graph_objects as go


#for geopandas legend
from mpl_toolkits.axes_grid1 import make_axes_locatable

%matplotlib inline

import warnings

warnings.filterwarnings("ignore")

#default style for plots
plt.rcParams["figure.figsize"] = 15, 9
plt.style.use('ggplot')

In [None]:
#Loading files
Fatal_Shoothing = pd.read_csv('../input/fatal-police-shoothing-us-geo/FatalPoliceShooting.csv', encoding="windows-1252")
Median_Income = pd.read_csv('../input/duplicate/MedianHouseholdIncome2015.csv', encoding="windows-1252")
Poverty_Rate = pd.read_csv('../input/duplicate/PercentagePeopleBelowPovertyLevel.csv', encoding="windows-1252")
Hs_Rate = pd.read_csv('../input/duplicate/PercentOver25CompletedHighSchool.csv', encoding="windows-1252")
Population_byCity = pd.read_csv('../input/duplicate/PopulationByCity.csv', encoding="windows-1252")
Race_byCity = pd.read_csv('../input/duplicate/ShareRaceByCity.csv', encoding="windows-1252")

<h1>Table of Contents<span class="tocSkip"></span></h1>
<li><span><a href="#Social-Analysis" data-toc-modified-id="Social-Analysis-2"><span class="toc-item-num">1&nbsp;&nbsp;</span>Social Analysis</a></span></li><li><span><a href="#Police-Shootings-Analysis" data-toc-modified-id="Police-Shootings-Analysis-3"><span class="toc-item-num">2&nbsp;&nbsp;</span>Police Shootings Analysis</a></span></li><li><span><a href="#Combined-Analysis" data-toc-modified-id="Combined-Analysis-4"><span class="toc-item-num">3&nbsp;&nbsp;</span>Combined Analysis</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-5"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li>

# Social Analysis

Let's start by exploring some social data about the US. First we will check the dataframes that we have and clean them up, then we will make some visualizations to better understand the data.

In [None]:
#Poverty Rate
Poverty_Rate.info()

We don't have null values, but every column has an object data type. We have to convert them in order to perform calculations. Let's check whether these columns are stored as object because some of their values are not numbers.

In [None]:
Poverty_Rate.poverty_rate.value_counts()

We have 201 observations recorded as '-', which is not a number. We can assume that these entries are missing values, so in order to convert the column to float I'm going to replace them with 0.0. Keep in mind that the zeros pull the state averages slightly downwards.

In [None]:
#Assign the result back instead of relying on an in-place replace on a single column
Poverty_Rate['poverty_rate'] = Poverty_Rate['poverty_rate'].replace('-', 0.0).astype('float')
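As a side note, here is a minimal sketch of an alternative way to handle the '-' placeholders: converting them to NaN with `pd.to_numeric(..., errors='coerce')` so that they are skipped by `mean()` instead of being counted as zeros. The toy series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column: '-' marks a missing poverty rate
raw = pd.Series(['10.0', '20.0', '-', '30.0'])

# Option used in this notebook: treat '-' as 0.0, which pulls the mean down
as_zero = raw.replace('-', 0.0).astype(float)

# Alternative: turn anything non-numeric into NaN, which mean() skips
as_nan = pd.to_numeric(raw, errors='coerce')

print(as_zero.mean())  # 15.0
print(as_nan.mean())   # 20.0
```

Which option is better depends on the goal: zeros keep every row in the average, while NaN keeps the average honest about what was actually measured.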


To better represent this dataframe in a plot, I'm going to aggregate the average value by state.

In [None]:
Poverty_Rate_byState = Poverty_Rate.groupby('Geographic Area')['poverty_rate'].mean().sort_values(ascending=False)
Poverty_Rate_byState.head()

In [None]:
Poverty_Rate_byState.plot(kind='bar', color='#029386')
plt.xticks(rotation=45)
plt.title('Poverty Rate by State')
plt.xlabel('State')
plt.ylabel('Poverty Rate(%)')
plt.show()

Here we simply see the US states sorted by poverty rate in decreasing order. Right now this is not very informative, but it is going to be handy for later considerations.
Now let's see how the poverty rate is distributed.

In [None]:
ax = sns.kdeplot(Poverty_Rate_byState, color = '#029386', shade=True)
#calculate the median
poverty_median = np.median(Poverty_Rate_byState)
#plot the median line
plt.axvline(poverty_median, c='#00035b')

ax.set_title('Poverty Rate by State Distribution')
ax.set_xlabel('Poverty Rate(%)')
ax.set_ylabel('Density')
plt.show()

From here we can see that the most common state-level poverty rate falls between 12% and 15%, but there is a good number of values in the 15% - 20% range too; in fact the median is slightly over 15%. So we could say that the US poverty rate is not very high; however, we have to consider that the poverty rate is not a fully accurate index, as it can suffer from sampling bias (https://blogs.worldbank.org/developmenttalk/why-measuring-poverty-impacts-more-difficult-simply-using-score-cards).

Let's move on to the analysis of the median income.

In [None]:
Median_Income.info()

It looks like we have the same problem as in the previous dataframe, so we will repeat the same process to convert the Median Income column to float.

In [None]:
#Placeholders and open-ended values that cannot be parsed as numbers are set to 0.0
Median_Income['Median Income'] = Median_Income['Median Income'].replace(['-','(X)','2,500-', '250,000+'], 0.0).astype('float')

By trial and error, I discovered that this dataframe also contains some 'semi-numeric' values ('2,500-' and '250,000+') that express very low and very high incomes. I have to drop these values (here, by setting them to 0.0) for the purpose of the analysis, despite the loss of some data (in particular the very high incomes). Again, for the plot, I will aggregate the average level by state.

In [None]:
Median_Income_State = Median_Income.groupby('Geographic Area')['Median Income'].mean().sort_values(ascending=False)

In [None]:
Median_Income_State.plot(kind='bar', color= '#76cd26')
plt.xticks(rotation=45)
plt.title('Average Household Income by State ($)')
plt.xlabel('State')
plt.ylabel('Average Household Income($)')
plt.show()

In [None]:
ax = sns.kdeplot(Median_Income_State, color = '#76cd26', shade=True)
income_median = np.median(Median_Income_State)
plt.axvline(income_median, c='#154406')
ax.set_title('Average Household Income Distribution')
ax.set_xlabel('Average Household Income($)')
ax.set_ylabel('Density')
plt.show()

We can see that the most common range is between \$40,000 and \$50,000, which is also where the median lies. If we look at the bar plots, we can see that New Jersey is the state with the highest median income and the lowest poverty rate. It is logical to think that these two variables are negatively correlated. We can check this correlation in a plot.

In [None]:
#Combine the datasets
IncomeVsPoverty = pd.concat([Median_Income_State, Poverty_Rate_byState], axis=1)

In [None]:
x = IncomeVsPoverty['Median Income']
y = IncomeVsPoverty['poverty_rate']
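#jointplot returns a JointGrid; x and y are passed as keywords, kind='scatter' draws the points
grid = sns.jointplot(x=x, y=y, kind='scatter', color='#029386', height=10)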

plt.show()
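The relationship shown in a joint plot can also be summarized with a single number. Below is a minimal sketch using the Pearson correlation coefficient from `Series.corr`; the values in the toy dataframe are hypothetical, only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical state-level values, for illustration only
toy = pd.DataFrame({
    'Median Income': [40000, 45000, 50000, 60000],
    'poverty_rate': [20.0, 17.0, 15.0, 10.0],
})

# Pearson correlation coefficient: between -1 and 1,
# negative when one variable goes down as the other goes up
r = toy['Median Income'].corr(toy['poverty_rate'])
print(round(r, 2))
```

On the real data the same call would be applied to the two columns of `IncomeVsPoverty`.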

Here we can see that there is indeed a negative correlation: the higher the median income, the lower the poverty rate.
We can now apply the same cleaning operations to the percentage of people over 25 who completed high school.

In [None]:
Hs_Rate['percent_completed_hs'] = Hs_Rate['percent_completed_hs'].replace(['-','(X)'], 0.0).astype('float')
Hs_Rate_State = Hs_Rate.groupby('Geographic Area')["percent_completed_hs"].mean().sort_values(ascending=False)


In [None]:
Hs_Rate_State.plot(kind='bar', color = '#047495')
plt.xticks(rotation=45)
plt.title("Percent of population over 25 who completed High School ")
plt.xlabel('State')
plt.ylabel('Percent of High School Graduates')
plt.show()

This plot isn't telling us anything interesting: as we can see, the high school graduation rate is almost the same in all states. However, I expect to see a positive correlation between this variable and the median income; let's check if that is true.

In [None]:
EducationVsIncome = pd.concat([Median_Income_State, Hs_Rate_State], axis=1)

In [None]:
x = EducationVsIncome['Median Income']
y = EducationVsIncome['percent_completed_hs']
sns.jointplot(x=x, y=y, kind='scatter', color='#f97306', height=10)
plt.show()


We can see that there is in fact a weak positive relation (weak because the high school graduation rate is almost the same across all states). There is, however, an outlier: Texas, the only state with a high school completion rate below 75%. In spite of this, its median income still falls between \$40,000 and \$50,000, which is the most common interval. Let's proceed with the last dataframe of the social analysis.

In [None]:
#Race share
Race_byCity.replace(['-'],0.0,inplace=True)
Race_byCity.replace(['(X)'],0.0,inplace=True)

Race_byCity["share_white"] = Race_byCity['share_white'].astype('float')
Race_byCity["share_black"] = Race_byCity['share_black'].astype('float')
Race_byCity["share_native_american"] = Race_byCity['share_native_american'].astype('float')
Race_byCity["share_asian"] = Race_byCity['share_asian'].astype('float')
Race_byCity["share_hispanic"] = Race_byCity['share_hispanic'].astype('float')

#Aggregation
Race_by_State = Race_byCity[[x for x in Race_byCity.columns if 'share' in x] + ['Geographic area']].groupby('Geographic area').mean().sort_values(by = 'share_white', ascending=False)

In [None]:
Race_by_State.plot(kind='bar', width=0.8, figsize=(30, 10), color=['#fd411e','#3e82fc', '#ffdf22','#0cdc73', '#030aa7'])
plt.xticks(rotation=45)
plt.xlabel('State')
plt.legend(['% White', '% African American', '% Native American', '% Asian', '% Hispanic'], loc='best')
plt.title("Percentage of State's Population according to Races")
plt.show()


This plot is sorted by the share of white population, the largest group in the US, for easier reading.

In [None]:
fig, ax = plt.subplots(1,1)
ax.xaxis.set_ticks([])

sns.kdeplot(Race_by_State.share_white, ax=ax, shade=True, color='#fd411e')
sns.kdeplot(Race_by_State.share_black, ax=ax, shade=True, color='#3e82fc')
sns.kdeplot(Race_by_State.share_native_american, ax=ax, shade=True, color='#ffdf22')
sns.kdeplot(Race_by_State.share_asian, ax=ax, shade=True, color='#0cdc73')
sns.kdeplot(Race_by_State.share_hispanic, ax=ax, shade=True, color='#030aa7')


ax.set_title("Distribution of Ethnicities per State")
ax.set_xlabel(' ')
ax.set_ylabel('Density') 
plt.legend(['% White', '% African American', '% Native American', '% Asian', '% Hispanic'])
plt.show()

From these plots we can see that, as we already knew, the majority of the population is white. The second most represented group is African Americans, and only in a few states do we have a large share of Hispanics. The states with a large share of Native Americans are the ones with Indian reservations (https://en.wikipedia.org/wiki/List_of_Indian_reservations_in_the_United_States), except for Alaska. We can see this in the distribution plot too: some ethnic groups are highly concentrated in a few states (like Native Americans and Asians), whereas others are spread more homogeneously across the nation with occasional peaks (like whites and African Americans).

Now let's move on to the fatal police shooting analysis.

# Police Shootings Analysis

This dataset has a lot of interesting variables that can support a good analysis; later on we will also combine some of them with the datasets used in the social analysis. First of all, let's look at these variables.

In [None]:
Fatal_Shoothing.head()

First I'll convert the date column into datetime objects, then I'll add Year and Month columns to the dataframe, and finally I'll aggregate the data to plot them.

In [None]:
Fatal_Shoothing['date'] = pd.to_datetime(Fatal_Shoothing['date'])

Fatal_Shoothing['Month'] = pd.DatetimeIndex(Fatal_Shoothing['date']).month
Fatal_Shoothing['Year'] = pd.DatetimeIndex(Fatal_Shoothing['date']).year

shooting_over_time = Fatal_Shoothing.groupby('Year')['Month'].count()

In [None]:
shooting_over_time.plot(color='#363737', marker='o', mec='#ffffff', mfc='#ff000d')
plt.title('Number of Fatal Shootings per year')
plt.xlabel('Year')
plt.ylabel('N. of Deaths')
plt.show()

We can see that the number of deaths due to fatal shootings has been stable over the years. Of course the value is lower for 2021 because we don't have all the data for that year yet. Now we can visualize the monthly distribution.

In [None]:
Fatal_Shoothing['date'].groupby(Fatal_Shoothing.date.dt.to_period('M')).count().plot(color='#363737', marker='o', mec='#ffffff', mfc='#ff000d')
plt.title("Monthly distribution of Fatal Shootings")
plt.xlabel('Year')
plt.ylabel('N. of Deaths')
plt.show()
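A quick way to look for seasonality numerically, rather than only by eye, is to average the monthly counts for each calendar month across the years. The sketch below does this on a handful of hypothetical dates (not taken from the dataset); on the real data the `date` column of `Fatal_Shoothing` would take their place.

```python
import pandas as pd

# Hypothetical dates, for illustration only
toy = pd.DataFrame({'date': pd.to_datetime([
    '2015-01-10', '2015-01-20', '2015-07-04',
    '2016-01-15', '2016-07-01', '2016-07-30',
])})

# Count cases per (year, month) ...
monthly = toy.groupby([toy['date'].dt.year.rename('year'),
                       toy['date'].dt.month.rename('month')]).size()

# ... then average each calendar month across the years
avg_by_month = monthly.groupby(level='month').mean()
print(avg_by_month)
```

If some calendar months were consistently higher than others, their averages would stand out here; similar values across months suggest no seasonality.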

Each marker represents one month. We don't see any seasonality, as we would expect; nonetheless there are some peaks over time: July 2015, March 2018, December 2019 and May 2020 (the month in which the homicide of George Floyd occurred).

Now let's see which ethnicity the victims belong to.

In [None]:
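Shoot_Race = Fatal_Shoothing.race.value_counts(normalize=True)
#Map the race codes to readable labels explicitly instead of relying on the sort order
#('O' for Other is an assumption; the other codes are the ones used later in this notebook)
Shoot_Race = Shoot_Race.rename(index={'W': 'White', 'B': 'African American', 'H': 'Hispanic', 'A': 'Asian', 'N': 'Native American', 'O': 'Other'})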

fig, ax = plt.subplots(1,1 , figsize=(10,8))
sns.barplot(y=Shoot_Race.index, x=Shoot_Race.values, palette='Blues_r')
ax.set_title('Percentage of deaths by Ethnicity')
plt.show()

This plot is misleading, because it doesn't take into account each group's share of the population. Of course most of the victims are white because, as we have seen previously, whites are the largest share of the US population. To build a more truthful visualization, let's create a series with the race shares in the USA using data from: https://data.census.gov/cedsci/table?q=Hispanic%20or%20Latino&tid=ACSDP1Y2019.DP05&hidePreview=false (updated to 2019).

In [None]:
Share_Race_US2019 = pd.Series([60.0, 12.4, 0.9, 5.7, 18.4, 2.6], index=['White', 'African American', 'Native American', 'Asian', 'Hispanic','Other'])
TotalShare_Shooting = Shoot_Race / Share_Race_US2019
TotalShare_Shooting = TotalShare_Shooting.sort_values(ascending=False)

fig, ax = plt.subplots(1,1 , figsize=(10,8))
sns.barplot(y=TotalShare_Shooting.index, x=TotalShare_Shooting.values, palette='Reds_r')
ax.set_title('Share of victims relative to share of US population, by race')
plt.show()
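To make the logic of this ratio explicit, here is a minimal sketch. The population shares are the 2019 figures listed in this notebook, while the victim shares are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Population shares (%) from the 2019 census figures used in this notebook
population_share = pd.Series({'White': 60.0, 'African American': 12.4, 'Hispanic': 18.4})

# Hypothetical victim shares (%), NOT taken from the dataset
victim_share = pd.Series({'White': 50.0, 'African American': 24.8, 'Hispanic': 18.4})

# Ratio > 1: the group is over-represented among victims
# Ratio < 1: the group is under-represented
ratio = (victim_share / population_share).sort_values(ascending=False)
print(ratio.round(2))
```

Because pandas aligns the two series on their index labels, the order of the groups in each series does not matter.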

So, relative to their share of the population, African Americans are the most over-represented group among the victims, followed by Native Americans, even though the latter represent only 0.9% of the total US population.

Now we will visualize the distribution of the victims' age.

In [None]:
ax = sns.kdeplot(Fatal_Shoothing.age, color='#137e6d', shade=True)

ax.set_title("Distribution of the Age of the Victims")
ax.set_xlabel('Age')
ax.set_ylabel('Density') 
plt.show()

Most of the victims are quite young, since the most common age range is 20 - 40; nevertheless we also have some cases at older ages. Let's see how this distribution changes with ethnicity.

In [None]:
fig, ax = plt.subplots(1,1)
ax.xaxis.set_ticks(np.arange(0, 100, 10))

sns.kdeplot(Fatal_Shoothing[Fatal_Shoothing.race == 'W'].age, ax=ax, shade=True, color='#fd411e', label='White')
sns.kdeplot(Fatal_Shoothing[Fatal_Shoothing.race == 'B'].age, ax=ax, shade=True, color='#3e82fc', label='African American')
sns.kdeplot(Fatal_Shoothing[Fatal_Shoothing.race == 'N'].age, ax=ax, shade=True, color='#ffdf22', label='Native American')
sns.kdeplot(Fatal_Shoothing[Fatal_Shoothing.race == 'A'].age, ax=ax, shade=True, color='#0cdc73', label='Asian')
sns.kdeplot(Fatal_Shoothing[Fatal_Shoothing.race == 'H'].age, ax=ax, shade=True, color='#030aa7', label='Hispanic')
sns.kdeplot(Fatal_Shoothing.age, ax=ax, color='#ff073a', label='Total')

ax.set_title("Distribution of the Age of the Victims vs. Ethnicity")
ax.set_xlabel('Age')
ax.set_ylabel('Density')
#Each curve carries its own label, so the legend cannot get out of order
ax.legend()
plt.show()
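Overlapping density curves can be hard to compare by eye, so a small table of quartiles per group can help. This is only a sketch on hypothetical rows (the ages below are made up); the column names `race` and `age` are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({'race': ['W', 'W', 'W', 'B', 'B', 'B'],
                    'age': [30, 40, 50, 20, 30, 34]})

# Quartiles of age per group: the '50%' column is the median
summary = toy.groupby('race')['age'].describe()[['25%', '50%', '75%']]
print(summary)
```

A lower median for one group would support the reading that its curve sits further to the left.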

We can see that the density for African Americans is shifted slightly to the left, indicating that African American victims tend to be younger than victims of other ethnicities.

Now we will see how fatal shootings vary with other characteristics of the victims and with the use of body cameras by the police.

In [None]:
#Creating the variables
v_1 = Fatal_Shoothing['signs_of_mental_illness'].value_counts(normalize=True)
v_2 = Fatal_Shoothing['threat_level'].value_counts(normalize=True)
v_3 = Fatal_Shoothing['body_camera'].value_counts(normalize=True)
v_4 = Fatal_Shoothing['manner_of_death'].value_counts(normalize=True)

#Visualization
fig, ax = plt.subplots(2,2, sharey=True)
sns.barplot(x=v_1.index, y=v_1.values, palette='CMRmap', ax=ax[0,0])
ax[0,0].set_title('Signs of Mental Illness (%)')
sns.barplot(x=v_2.index, y=v_2.values, palette='hsv', ax=ax[0,1])
ax[0,1].set_title('Threat Level (%)')
sns.barplot(x=v_3.index, y=v_3.values, palette='coolwarm', ax=ax[1,0])
ax[1,0].set_title('Body Camera (%)')
sns.barplot(x=v_4.index, y=v_4.values, palette='prism_r', ax=ax[1,1])
ax[1,1].set_title('Manner of Death (%)')
plt.show()


* Victims showed signs of mental illness in only 20% of the cases
* Most of the fatal shootings occurred when the police were not wearing body cameras
* The threat level was high in most of the cases
* The manner of death is actually not very informative

Now let's see if the victims were armed.

In [None]:
def classify_armed(x):
    #Group the detailed 'armed' values into four categories
    #Note: missing values (NaN) also end up in 'Armed'
    if x == 'unarmed':
        return 'Unarmed'
    if x == 'toy weapon':
        return 'Toy Weapon'
    if x == 'undetermined':
        return 'Undetermined'
    return 'Armed'


armed_share = Fatal_Shoothing.armed.apply(classify_armed).value_counts(normalize=True)
#Take the labels from the index so that they always match the slices
plt.pie(armed_share, labels=armed_share.index, autopct='%1.1f%%', colors=['#fd411e','#3e82fc', '#ffdf22','#0cdc73'],counterclock=False)
plt.title('Armed or Unarmed(%)')
plt.show()

Most of the victims were actually armed.

According to most statistics on crime rates, men tend to commit more crimes than women (https://www.statista.com/statistics/424145/prevalence-rate-of-violent-crime-in-the-us-by-gender/). Let's see the gender composition of our data.

In [None]:
plt.pie(Fatal_Shoothing.gender.value_counts(normalize=True), labels=['M', 'F'], autopct='%1.1f%%', colors=['#3e82fc','#fd411e'])
plt.title('Gender of the Victims (%)')
plt.show()

Most of the victims were men. Now let's move on to the last part of the analysis.

# Combined Analysis

Now I'll merge some of the previous dataframes to create other visualizations.

In [None]:
Race_byCity.head()

In [None]:
#Victims + city and state
cases = Fatal_Shoothing.groupby(['state','city']).count()['id'].reset_index()
cases.columns = ['state','city','cases']

#Need to correct col name in Race_byCity
Race_byCity.rename(columns = {'Geographic area': 'Geographic Area'}, inplace=True)

#Merging Social Analysis dataframes
Social_var = pd.merge(Poverty_Rate, Hs_Rate)
Social_var = pd.merge(Social_var, Median_Income)
Social_var = pd.merge(Social_var, Race_byCity)
Social_var.columns = ['state','city','poverty_rate','percent_completed_hs','median_income','share_white','share_black', 'share_native_american','share_asian','share_hispanic']

#Modifying Social_var city names: drop the last word of each name
Social_var['city'] = Social_var['city'].apply(lambda x: x.rsplit(' ',1)[0])

#Removing duplicate indexes
duplicate_index = Social_var[['state','city']][Social_var[['state','city']].duplicated()].index
Social_var = Social_var.drop(duplicate_index)

#Merging into one dataframe
Social_var = pd.merge(cases, Social_var, on=['state','city'], how='left')

Social_var.head()
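Since the last merge is a left join on city names, it is worth checking how many cities actually found a match. A minimal sketch of one way to do this is the `indicator=True` option of `pd.merge`, shown here on two small hypothetical frames that stand in for `cases` and the social variables.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only
cases_toy = pd.DataFrame({'state': ['CA', 'CA', 'TX'],
                          'city': ['Los Angeles', 'Fresno', 'Austin'],
                          'cases': [10, 3, 5]})
social_toy = pd.DataFrame({'state': ['CA', 'TX'],
                           'city': ['Los Angeles', 'Austin'],
                           'poverty_rate': [22.1, 18.0]})

# indicator=True adds a `_merge` column: 'both' for matched rows,
# 'left_only' for cities that have cases but no social data
merged = pd.merge(cases_toy, social_toy, on=['state', 'city'],
                  how='left', indicator=True)
print(merged['_merge'].value_counts())
```

Rows marked `left_only` are the ones that would later carry NaN in the social columns.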




Finally, I'll add total population data.

In [None]:
total = pd.merge(Social_var, Population_byCity, on=['state','city'], how='left').dropna()

#create a column for cases on population

total['cases_on_population'] = (total['cases']/total['population'])*100
total.head()
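The column above expresses cases as a percentage of the population. A common alternative convention, sketched below, is a rate per 100,000 residents per year, which gives numbers that are easier to read. The helper `rate_per_100k` and the figures in the example are hypothetical and not part of the original analysis.

```python
def rate_per_100k(cases: int, population: int, years: float = 1.0) -> float:
    """Cases per 100,000 residents per year (hypothetical helper)."""
    if population <= 0 or years <= 0:
        raise ValueError("population and years must be positive")
    return cases / population * 100_000 / years


# Made-up example: 13 cases in a city of 200,000 people over 6.5 years
print(round(rate_per_100k(13, 200_000, years=6.5), 2))  # 1.0
```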

Now let's see how cases vary with the social variables and with the shares of the three largest ethnic groups.

In [None]:
fig, ax = plt.subplots(2,3, figsize=(21,12), sharey=True)
#Only including cities with pop > 100
sns.scatterplot(y='cases_on_population', x='poverty_rate', data=total[(total.population > 100)], ax=ax[0,0],size='cases_on_population', color='#c65102', legend=False)
sns.scatterplot(y='cases_on_population', x='percent_completed_hs', data=total[(total.population > 100)], ax=ax[0,1],size='cases_on_population', color='#028f1e', legend=False)
#Remove incomes = 0
sns.scatterplot(y='cases_on_population', x='median_income', data=total[(total.population > 100) & (total.median_income > 0)], ax=ax[0,2],size='cases_on_population', color='#448ee4', legend=False)

sns.scatterplot(y='cases_on_population', x='share_white', data=total[(total.population > 100)], ax=ax[1,0],size='cases_on_population', color='#ceb301', legend=False)
sns.scatterplot(y='cases_on_population', x='share_black', data=total[(total.population > 100)], ax=ax[1,1],size='cases_on_population', color='#4b006e', legend=False)
sns.scatterplot(y='cases_on_population', x='share_hispanic', data=total[(total.population > 100)], ax=ax[1,2],size='cases_on_population', color='#e17701', legend=False)

ax[0,0].set_title('cases on poverty rate')
ax[0,1].set_title('cases on percentage complete HS')
ax[0,2].set_title('cases on median household income')
ax[1,0].set_title('cases on share white')
ax[1,1].set_title('cases on share african american')
ax[1,2].set_title('cases on share hispanic')
plt.show()

On the first row we can see that cities with higher poverty rates don't register more cases. As we have seen before, the percentage of people who completed high school is homogeneous across the USA, and in fact most cases are registered around 80%. Finally, most cases fall between \$25,000 and \$50,000 of median household income.

On the second row we can see that we register a lot of cases where the percentage of white population is high. This is to be expected because, as we have seen before, whites are the largest ethnic group in the US. We can also see that we don't have more cases where the share of African American or Hispanic population is larger, even though there are some interesting points.

Let's visualize some data on maps.


In [None]:
#Create a dataframe with total cases per state
sum_state = Fatal_Shoothing.groupby(['state']).apply(lambda x:x['manner_of_death'].count()).reset_index(name='Counts')

In [None]:
#Have to modify states name to fit the shape file

states = {
        'AK': 'Alaska','AL': 'Alabama','AR': 'Arkansas','AS': 'American Samoa','AZ': 'Arizona','CA': 'California','CO': 'Colorado',
        'CT': 'Connecticut','DC': 'District of Columbia','DE': 'Delaware','FL': 'Florida','GA': 'Georgia','GU': 'Guam','HI': 'Hawaii',
        'IA': 'Iowa','ID': 'Idaho','IL': 'Illinois','IN': 'Indiana','KS': 'Kansas','KY': 'Kentucky','LA': 'Louisiana','MA': 'Massachusetts',
        'MD': 'Maryland','ME': 'Maine','MI': 'Michigan','MN': 'Minnesota','MO': 'Missouri','MP': 'Northern Mariana Islands','MS': 'Mississippi',
        'MT': 'Montana','NA': 'National','NC': 'North Carolina','ND': 'North Dakota','NE': 'Nebraska','NH': 'New Hampshire','NJ': 'New Jersey',
        'NM': 'New Mexico','NV': 'Nevada','NY': 'New York','OH': 'Ohio','OK': 'Oklahoma','OR': 'Oregon','PA': 'Pennsylvania','PR': 'Puerto Rico',
        'RI': 'Rhode Island','SC': 'South Carolina','SD': 'South Dakota','TN': 'Tennessee','TX': 'Texas','UT': 'Utah','VA': 'Virginia',
        'VI': 'Virgin Islands','VT': 'Vermont','WA': 'Washington','WI': 'Wisconsin','WV': 'West Virginia','WY': 'Wyoming'
}

#Replace the state codes with full names (the index is set later, when joining with the shapefile)
sum_state['state'] = sum_state['state'].replace(states)
sum_state.head()

In [None]:
#Read the shapefile
usa = gpd.read_file('../input/fatal-police-shoothing-us-geo/cb_2016_us_state_500k.shp')
data_map = usa.set_index('NAME').join(sum_state.set_index('state'))
var='Counts'

#Min and Max value
vmin, vmax = np.min(data_map.loc[:,['Counts']]), np.max(data_map.loc[:,['Counts']])

#Visualization

fig, ax = plt.subplots(1,1, figsize=(20,20))

#Removed Alaska because it's out of scale
ax.set_xlim(-127,-65)
ax.set_ylim(20,60)

#color bar options
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1,)

data_map.plot(column=var, ax=ax, legend=True,cax=cax, cmap='inferno', edgecolor='0.8')

ax.set_xticks([])
ax.set_yticks([])
ax.set_facecolor("white")
ax.set_title('Fatal Shootings per state 2015-2021')
cax.set_title('N. of cases 2015-2021')

plt.show()

Here we can see that some of the southern states (California, Arizona, Texas and Florida) have registered a high number of cases, as well as Washington and Colorado. Let's see what the situation looks like city by city.

In [None]:
#Viz with plotly Graph Objects
#One row per city, so that coordinates and case counts refer to the same place
#(the mean coordinate of a city's shootings is used as its position)
city_map = Fatal_Shoothing.groupby(['state','city']).agg(cases=('id','count'), longitude=('longitude','mean'), latitude=('latitude','mean')).reset_index()

fig = go.Figure(data=go.Scattergeo(
        lon = city_map['longitude'], #mean geo coordinates per city
        lat = city_map['latitude'],  #mean geo coordinates per city
        text = city_map['city'] + ', ' + city_map['state'] + ': ' + city_map['cases'].astype(str), #text to visualize
        mode = 'markers',
        marker = dict(
            size = city_map['cases'].apply(lambda x: 4 if x < 5 else x), #total cases per city
            opacity = 0.8,
            symbol = 'circle',
            colorscale = 'Portland_r',
            reversescale = True,
            cmin = 1,
            color = city_map['cases'],
            cmax = city_map['cases'].max(),
            colorbar_title="N. of cases 2015-2021"
        )))

fig.update_layout(
        margin={"r":0,"t":30,"l":0,"b":0},
        title = 'Fatal Shootings per city 2015 - 2021',
        geo = dict(
            scope='usa',
            projection_type='albers usa',
            showland = False,
            landcolor = "blue",
            subunitcolor = "blue",
            countrycolor = "blue",
            subunitwidth = 0.5
        )
    )
fig.update_layout(title_x=0.5)
fig.show()

Here we can see the cases for individual cities. Here too California seems to have the majority of cases, in particular in the Los Angeles metropolitan area. Dublin, NC shows up with 90 cases, a figure that looks suspiciously high and is worth double-checking against the raw data. In the state of Washington there are two cities with high values: Tacoma and Spokane. In particular, in May 2021 two Tacoma officers were charged with the murder of Manuel Ellis (https://www.theguardian.com/us-news/2021/may/27/manuel-ellis-tacoma-three-officers-charged-death), and according to the linked source Spokane has the 3rd most deadly police force in America (https://www.scarspokane.org/police-brutality). Let's check whether the southern states have more cases in relative terms too.

In [None]:
#Drop Wisconsin because it's aggregated with Maryland
cases_by_State = total.groupby('state')['cases_on_population'].mean().sort_values(ascending=False).drop('WI')
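One thing to keep in mind about this aggregation is that it averages city-level rates, so a tiny city counts as much as a large one. The sketch below, on two hypothetical cities in a made-up state 'XX', compares that mean of rates with a pooled rate (total cases over total population of the cities in the table).

```python
import pandas as pd

# Hypothetical cities in one made-up state, for illustration only
toy = pd.DataFrame({'state': ['XX', 'XX'],
                    'cases': [1, 10],
                    'population': [1_000, 1_000_000]})
toy['cases_on_population'] = toy['cases'] / toy['population'] * 100

# Mean of the city-level rates: every city counts the same, whatever its size
mean_of_rates = toy.groupby('state')['cases_on_population'].mean()

# Pooled rate: total cases over total population
totals = toy.groupby('state')[['cases', 'population']].sum()
pooled_rate = totals['cases'] / totals['population'] * 100

print(mean_of_rates['XX'], pooled_rate['XX'])
```

In this toy case the single case in the small town dominates the mean of rates, which ends up far above the pooled rate; neither number is wrong, but they answer different questions.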


In [None]:
#Top10 states by cases on population
top_10 = cases_by_State[:10]
fig, ax = plt.subplots(1,1 , figsize=(10,8))
sns.barplot(y=top_10.index, x=top_10.values, palette='Greens_r')
ax.set_title('Top 10 states for cases on population')
plt.show()

Taking population into account, we can see that Colorado and Maryland & Wisconsin are the states with the most cases.

# Conclusions

* Since 2015 there has been no reduction in fatal shooting cases.
* Despite representing only 12.4% of the population, African Americans are the most over-represented ethnicity in fatal shooting cases.
* African American victims also tend to be younger.
* Most of the victims are men.
* 87.5% of the victims were armed.
* Significantly fewer cases were recorded with officers wearing body cameras.
* Colorado is the most dangerous state, having a high rate of cases compared to its population.
