# Fatal Police Shootings in the US
## by Fares Lassoued

## Preliminary Wrangling

I chose this dataset for my final project in Udacity's Data Analyst Nanodegree. You can find this notebook, along with the explanatory analysis, a slideshow, and my other projects, in this [repository](https://github.com/Zowlex/Data-Analyst-ND). Any feedback would be very helpful.

P.S: I graduated yesterday :)

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',100)
%matplotlib inline

uni_color = sb.color_palette()[0]

In [None]:
#defining functions

def pie_plot(df, cat_var):
    """
    Plot a pie chart of cat_var from df, with slices ordered by count, starting at 90° and drawn clockwise.
    """
    sorted_counts = df[cat_var].value_counts()
    plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90,
            counterclock = False);
    plt.axis('square')
    plt.show()

The dataset comes in 5 different files. The main one is `PoliceKillings.csv`; we will use the other CSV files to depict relationships and answer specific questions.

In [None]:
# data cleaning was done in a separate notebook 'data cleaning.ipynb'

path = '../input/fatal-police-shootings-clean'

median_house_income = pd.read_csv(f'{path}/median_house_income_clean.csv')
percentage_below_poverty_level = pd.read_csv(f'{path}/percentage_below_poverty_level_clean.csv')
percent_over25_comp_highschool = pd.read_csv(f'{path}/percent_over25_comp_highschool_clean.csv')
share_by_race = pd.read_csv(f'{path}/share_by_race_clean.csv')
police_killings = pd.read_csv(f'{path}/police_killings_clean.csv', parse_dates=['date'])

In [None]:
print(police_killings.shape)
print(police_killings.dtypes)
police_killings.head()

In [None]:
print(median_house_income.shape)
print(median_house_income.dtypes)
median_house_income.head()

In [None]:
print(percentage_below_poverty_level.shape)
print(percentage_below_poverty_level.dtypes)
percentage_below_poverty_level.head()

In [None]:
print(percent_over25_comp_highschool.shape)
print(percent_over25_comp_highschool.dtypes)
percent_over25_comp_highschool.head()

In [None]:
print(share_by_race.shape)
print(share_by_race.dtypes)
share_by_race.head()

### What is the structure of your dataset?

> The dataset covers 2254 police killings (after cleaning) since Jan. 1, 2015, with features such as id, name, manner_of_death, and so on. Most of the features (9 of 14) are qualitative; the rest are 2 boolean, 1 date, 1 numeric, and 1 id.

> There is additional information about ~29k cities: median household income, share by race, percentage of people over 25 who completed high school, and percentage below the poverty level.

### What is/are the main feature(s) of interest in your dataset?

> I'm most interested in identifying the most important factors associated with shootings.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect that a city's poverty level and high school graduation rate will be important features.

## Univariate Exploration

We will start by looking at the age of victims.

In [None]:
plt.hist(police_killings.age)
plt.xlabel('age')
plt.ylabel('count')
plt.show()

Age distribution is right skewed with a peak between 20 and 40.
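
A quick numeric check can back up the visual impression of right skew: in a right-skewed sample the mean usually sits above the median and the skewness is positive. A minimal sketch, assuming a small hypothetical `ages` series in place of `police_killings.age`:

```python
import pandas as pd

# Hypothetical ages standing in for police_killings.age
ages = pd.Series([18, 22, 25, 25, 25, 31, 34, 40, 47, 58, 63])

# In a right-skewed sample the long upper tail pulls the mean above the median
mean_above_median = ages.mean() > ages.median()

# Series.skew() returns the sample skewness; positive values mean a longer right tail
positive_skew = ages.skew() > 0

print(mean_above_median, positive_skew)
```

Running the same two checks on the real column is a cheap way to confirm what the histogram suggests.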

Let's try a different bin size.

In [None]:
plt.hist(police_killings.age, bins=40)
plt.xlabel('age')
plt.ylabel('count')
plt.show()

Interesting! With more bins, one age (around 25) stands out with the highest count; otherwise the shape is the same. Let's see the distribution of median_house_income.

In [None]:
plt.hist(median_house_income.median_income)
plt.xlabel('median income')
plt.ylabel('count')
plt.show()

The plot is right skewed with a peak around 45k. Because the cities with high median_income are hard to see on this scale, we will apply a log transformation to dig deeper.

In [None]:
log_data = np.log10(median_house_income.median_income)
log_bin_edges = np.arange(0, log_data.max()+0.1, 0.1)
plt.hist(log_data, bins = log_bin_edges)
plt.xlabel('log10(median_house_income)')
plt.show()

Let's zoom into the plot to start from 10^4, while keeping in mind that lower values exist.

In [None]:
log_data = np.log10(median_house_income.median_income)
log_bin_edges = np.arange(4, log_data.max()+0.1, 0.1) #the plot is zoomed to start from 10**4
plt.hist(log_data, bins = log_bin_edges)
plt.xlabel('log10(median_house_income)')
plt.show()

The log-transformed median_house_income looks roughly normal, with a peak between 40k and 60k, and doesn't show any abnormalities.
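
Because the axis above is in log10 units, reading dollar values off it takes a conversion. A minimal sketch, assuming a small hypothetical `incomes` series in place of `median_house_income.median_income`, that builds the same 0.1-wide log bins and maps the bin edges back to dollars:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes standing in for median_house_income.median_income
incomes = pd.Series([12_000, 30_000, 42_000, 45_000, 48_000, 55_000, 90_000, 150_000])

log_data = np.log10(incomes)

# Bins of width 0.1 in log10 space, starting at 10**4 as in the zoomed plot
log_bin_edges = np.arange(4, log_data.max() + 0.1, 0.1)

# Convert each edge back to dollars so the bins can be read directly
dollar_edges = np.round(10 ** log_bin_edges).astype(int)

# Same counts that plt.hist would draw for these bins
counts, _ = np.histogram(log_data, bins=log_bin_edges)
peak = counts.argmax()
print(f"fullest bin: ${dollar_edges[peak]:,} to ${dollar_edges[peak + 1]:,}")
```

The same `dollar_edges` values could be passed as tick labels so the plot keeps its log spacing but shows dollar amounts.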

Next we will plot the number of deaths per month.

In [None]:
plt.figure(figsize=(15,5))

dates = police_killings.set_index('date').groupby(pd.Grouper(freq='M'))['id'].count()
sb.lineplot(data=dates)
plt.ylabel('# of deaths per month')
plt.xticks(rotation=90);

Plotting the dates of police killings between Jan 2015 and Jan 2018 shows between 60 and 80 deaths per month until May 2017, when the count suddenly falls below 20 per month.
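
A sudden drop in a monthly count can also appear when a month is only partly covered by the records, so it is worth checking the date range before reading too much into it. A minimal sketch, assuming a hypothetical `dates` series in place of `police_killings.date`; it counts rows per calendar month and applies a simple heuristic for whether the last month is fully covered:

```python
import pandas as pd

# Hypothetical event dates standing in for police_killings.date
dates = pd.Series(pd.to_datetime([
    "2017-03-02", "2017-03-15", "2017-03-28",
    "2017-04-05", "2017-04-19", "2017-04-30",
    "2017-05-03",
]))

# Count events per calendar month; Period labels sort chronologically
monthly = dates.dt.to_period("M").value_counts().sort_index()

# Heuristic: treat the last month as fully covered only if the
# latest record falls on that month's final day
last_day = dates.max()
last_month_complete = last_day == last_day.to_period("M").end_time.normalize()

print(monthly)
print("last month fully covered:", last_month_complete)
```

This does not explain a drop by itself; it only shows whether the final months are comparable to the earlier ones.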

Now we'll look at the qualitative variables.

In [None]:
#body_cam
sb.countplot(data=police_killings, x='body_camera', color=uni_color);

In most killings there was no body camera.
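
Counts alone make it hard to say how lopsided a category is. A minimal sketch, assuming a hypothetical boolean series in place of `police_killings.body_camera`, that turns counts into proportions with `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical values standing in for police_killings.body_camera
body_camera = pd.Series(
    [False, False, False, True, False, False, True, False, False, False]
)

# normalize=True returns the share of each category instead of raw counts
shares = body_camera.value_counts(normalize=True)
print(shares.round(2))
```

The same call works for any of the qualitative columns explored in this section.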

In [None]:
#gender
sb.countplot(data=police_killings, x='gender', color=uni_color);

Males account for the vast majority of this data, with a ratio of around 90/10. Let's see the values for different races.

In [None]:
sb.countplot(data=police_killings, x='race', color=uni_color, order=police_killings.race.value_counts().index);

Ordered from highest to lowest count, white people are the most killed in this data, followed by black people, then hispanic people, and finally Asian, Native American and other with small counts. However, to be fair we should remake this plot taking into account each race's share of the population.

In [None]:
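# numeric_only=True skips text columns such as the city name (needed on recent pandas versions)
(share_by_race.mean(numeric_only=True)/100).reset_index(name='proportions')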

In [None]:
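us_pop_2015 = 320_000_000  # rough US population in 2015
# rough estimate: unweighted mean of city-level shares, scaled to the US population
prop = share_by_race.mean(numeric_only=True)/100 * us_pop_2015
# there's no population share for 'other' races, so we drop the last (least frequent) category from the count
killings_per_race_count = police_killings.race.value_counts()[:-1].astype(float)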

killings_per_race_count.loc['W']/=prop.loc['share_white']
killings_per_race_count.loc['B']/=prop.loc['share_black']
killings_per_race_count.loc['N']/=prop.loc['share_native_american']
killings_per_race_count.loc['H']/=prop.loc['share_hispanic']
killings_per_race_count.loc['A']/=prop.loc['share_asian']

In [None]:
sb.barplot(x=killings_per_race_count.index, y=killings_per_race_count.values, color=uni_color);

After dividing the number of killings per race by each race's estimated population, the plot shows that black people are killed at the highest rate during this period, followed by hispanic people, while white people are second lowest, contrary to what the previous plot conveyed.
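
This normalization amounts to dividing each group's count by that group's estimated population. A minimal sketch with made-up numbers (the counts and shares below are placeholders, not values from the dataset), expressed as a rate per million people so the result is easier to read:

```python
import pandas as pd

# Placeholder inputs, NOT taken from the dataset
counts = pd.Series({"group_a": 500, "group_b": 250})    # events per group
shares = pd.Series({"group_a": 0.60, "group_b": 0.12})  # population share per group
total_population = 320_000_000                          # rough total population

# Estimated number of people in each group
group_population = shares * total_population

# Events per one million people in the group
rate_per_million = counts / group_population * 1_000_000
print(rate_per_million.round(2))
```

In this toy example the smaller group has half as many events but a higher rate, which is the same kind of reversal seen between the two race plots.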

Let's check the manner_of_death variable.

In [None]:
pie_plot(police_killings, 'manner_of_death')

Most deaths are by shooting alone, with a small percentage (around 10%) shot and Tasered.

Let's explore the signs_of_mental_illness variable.

In [None]:
pie_plot(police_killings, 'signs_of_mental_illness')

Only around 25% of victims showed signs of mental illness. Let's check more variables, such as which weapons were involved, whether the suspect was fleeing, and the threat level.

In [None]:
# boolean mask over 'armed' categories: True for those that occur more than 50 times
most_used = police_killings.armed.value_counts()>50
sb.countplot(data=police_killings.loc[police_killings.armed.isin(most_used[most_used].index.tolist())],
             y='armed', color=uni_color);

Gun has the highest count for the 'armed' variable. However, there were around 200 cases that were unarmed or involved a toy weapon, which raises the question of how these people came to be killed. That's why we'll check other important variables: threat_level and flee.

In [None]:
plt.figure(figsize=(9,5))

plt.subplot(121)
ax1 = sb.countplot(data=police_killings, x='threat_level', color=uni_color)
ax1.title.set_text('threat_level')

plt.subplot(122)
ax2 = sb.countplot(data=police_killings, x='flee', color=uni_color)
ax2.title.set_text('flee')

We can clearly see that most suspects were showing an *attack* threat level and weren't fleeing.

Next, we'll check the states where the killings took place and dig deeper beyond that by exploring data related to the state with most killings.

In [None]:
plt.figure(figsize=(7,10))
sb.countplot(data=police_killings, y='state', color=uni_color, order=police_killings.state.value_counts().index);

California is the state where most killings took place while Rhode Island is the one with least killings.

We will compare these two states' share by race, poverty level and high school grad level in the coming sections.

N.B.: When comparing these two states, keep in mind that they differ hugely in population (CA: 39 million vs. RI: 1.1 million).

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

- The median_income distribution was right skewed, which is why we log transformed the variable. The transformed plot revealed a few areas with very low median income, while most values followed a roughly normal distribution.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

- The plot of killings per race shows that white people are the most killed in the data when race proportions are not taken into account. That's why we redid the plot, dividing the number of killings per race by each race's estimated population; the result shows that black people are killed at the highest rate.

- The age variable showed a high peak around the age of 25 under a different bin size; otherwise there was no need to tidy or change the form of the data.

- The number of killings per month dropped suddenly in May 2017.

## Bivariate Exploration

Let's plot age against different qualitative variables.

In [None]:
#age vs gender
sb.boxplot(data = police_killings, x = 'gender', y = 'age', color = uni_color);

The boxplot of male ages shows more outliers and a lower median than that of female ages.
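
A box plot marks the median rather than the mean, so it can help to compute both side by side. A minimal sketch, assuming a small hypothetical frame with the same `gender` and `age` column names:

```python
import pandas as pd

# Hypothetical rows using the same column names as police_killings
df = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F"],
    "age":    [22, 30, 35, 70, 28, 40, 45],
})

# The median is the line inside each box; the mean is pulled toward outliers
summary = df.groupby("gender")["age"].agg(["median", "mean", "count"])
print(summary)
```

In this toy data the male group has the lower median but the higher mean, because a single large value drags the mean upward.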

In [None]:
#age vs. race
sb.pointplot(data = police_killings, x = 'race', y = 'age', color = uni_color, linestyles='')
plt.ylabel('avg age');

This plot clearly shows the average age for each race. For example, the average age of white victims is around 40, while for black victims it is around 32.

Let's plot age against signs of mental illness.

In [None]:
#age vs. signs_of_mental_illness
sb.violinplot(data = police_killings, x = 'signs_of_mental_illness', y = 'age', color = uni_color, inner='quartile');

Signs of mental illness appear most frequently among people in their 30s, while ages above 50 take up a larger share of the distribution for people showing signs of mental illness.

Now let's compare the states we mentioned earlier: California vs. Rhode Island.

In [None]:
#median_house_income/percentage_below_poverty_level
plt.figure(figsize=(15,4))

state_income = median_house_income.loc[median_house_income.geographic_area.isin(['CA','RI'])]
state_poverty_lvl = percentage_below_poverty_level.loc[percentage_below_poverty_level.geographic_area.isin(['CA','RI'])]
state_comp = percent_over25_comp_highschool.loc[percent_over25_comp_highschool.geographic_area.isin(['CA','RI'])]

plt.subplot(131)
sb.pointplot(data=state_income,x='geographic_area', y='median_income',color = uni_color, linestyles='', ci='sd')

plt.subplot(132)
sb.pointplot(data=state_poverty_lvl,x='geographic_area', y='poverty_rate',color = uni_color, linestyles='')

plt.subplot(133)
sb.pointplot(data=state_comp,x='geographic_area', y='percent_completed_hs',color = uni_color, linestyles='');


In 2015, Rhode Island cities had on average a higher median_house_income, with a smaller standard deviation, than California cities. In addition, the poverty rate in RI cities is much lower than in CA cities. On average, the percentage of people over 25 who completed high school is higher in RI cities (around 88%) than in CA cities (around 82%).

In [None]:
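# numeric_only=True skips text columns such as the city name (needed on recent pandas versions)
data = share_by_race.loc[share_by_race.geographic_area.isin(['CA','RI'])].groupby('geographic_area').mean(numeric_only=True)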
data.plot(kind='bar',figsize=(15,4));

California's race shares are more varied than Rhode Island's.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Investigating the variables of interest shows that CA cities have, on average, lower median_house_income, a higher poverty rate, and a lower percentage of people over 25 who completed high school than RI cities. As expected, all three differences are consistent with RI having fewer killings.

> In addition, according to the data, people showing signs of mental illness are older on average than people who do not.

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Other features show no special relationships

## Multivariate Exploration

First we will look at the average age across race and gender.

In [None]:
plt.figure(figsize=(15,8))

# select 'age' before averaging so non-numeric columns are never passed to mean()
cat_means = police_killings.groupby(['gender', 'race'])['age'].mean()
cat_means = cat_means.reset_index(name = 'age_avg')
cat_means = cat_means.pivot(index = 'gender', columns = 'race',
                            values = 'age_avg')
sb.heatmap(cat_means,annot=True, fmt = '.3f',
           cbar_kws = {'label' : 'mean(age)'})
plt.title('');

This plot lets us see the difference in average age across gender and race: the lowest average age is for hispanic females and the highest is for white males.

Next we will look into weapon usage per state (among the most used weapons only).

In [None]:
#https://stackoverflow.com/questions/45122416/one-horizontal-colorbar-for-seaborn-heatmaps-subplots-and-annot-issue-with-xtick
plt.figure(figsize=(20,15))

cat_means = police_killings.loc[police_killings.armed.isin(most_used[most_used].index.tolist())].groupby(['state', 'armed']).count()['id']
cat_means = cat_means.reset_index(name = 'count')
cat_means = cat_means.pivot(index = 'armed', columns = 'state',
                            values = 'count')
sb.heatmap(cat_means,cbar_kws={'orientation': 'horizontal', 'label' : 'weapons count', "shrink": .80},annot=True,cmap=sb.cm.rocket_r, square=True);

We already know that gun is the most common value of 'armed', but now we can spot the states where it appears most, like California (CA) and Texas (TX), states with medium counts like Arizona (AZ), New York (NY), etc., and states with low counts like Rhode Island (RI), Vermont (VT), etc. However, these observations come from this dataset only, and we did not take into account each state's population, the number of existing guns, etc.
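
Raw counts favor populous states, so one way to compare states on a more equal footing is the share of each weapon within a state. A minimal sketch, assuming a hypothetical frame with the same `state` and `armed` column names; `pd.crosstab` with `normalize='columns'` makes each state's column sum to 1:

```python
import pandas as pd

# Hypothetical rows using the same column names as police_killings
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "TX", "TX", "RI"],
    "armed": ["gun", "gun", "knife", "unarmed", "gun", "knife", "gun"],
})

# Rows: weapon, columns: state; each column is normalized to proportions
weapon_share = pd.crosstab(df["armed"], df["state"], normalize="columns")
print(weapon_share.round(2))
```

The resulting table can be passed to `sb.heatmap` in the same way as the count table. Shares within a state are still not per-capita rates; those would need each state's population.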

Finally, we'll check the number of monthly deaths per race.

In [None]:
police_killings['year-month'] = police_killings.date.apply(lambda x: x.strftime('%b-%Y')) 

In [None]:
#https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column

plt.figure(figsize=(15,8))

dates1 = police_killings.groupby(['race','year-month'])['id'].agg('count').reset_index().rename(columns={'id':'count'}).sort_values(by='year-month')

custom_dict = {x:i for i,x in enumerate(police_killings.sort_values(by='date')['year-month'].unique())}

# draw one line per race, with months in chronological order
for race in ['B', 'W', 'N', 'H', 'A', 'O']:
    df = dates1[dates1['race'] == race]
    sb.lineplot(data=df.iloc[df['year-month'].map(custom_dict).argsort()], x='year-month', y='count', sort=False)

plt.xticks(rotation=90)
plt.ylabel('kills per month')
plt.legend(['B','W','N','H','A','O']);

We can see that the time patterns of killings for white, black and hispanic people are not similar, but they share the same period of fluctuation, which is about 2 months.
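
Month labels such as 'Jan-2015' sort alphabetically, which is why the cell above needs a custom ordering. A minimal sketch of an alternative, assuming hypothetical rows with the same `date` and `race` column names: grouping on a monthly `Period` keeps the months in chronological order without a lookup dictionary.

```python
import pandas as pd

# Hypothetical rows using the same column names as police_killings
df = pd.DataFrame({
    "date": pd.to_datetime(["2015-02-10", "2015-01-05", "2015-01-20",
                            "2015-03-01", "2015-02-11", "2015-01-09"]),
    "race": ["W", "B", "W", "B", "W", "W"],
})

# Periods compare chronologically, so groupby sorts months correctly
monthly = (
    df.groupby([df["date"].dt.to_period("M"), "race"])
      .size()
      .unstack(fill_value=0)   # one column per race, one row per month
)
print(monthly)
```

With one column per race, `monthly.plot()` would draw all the lines in one call; months with no records for a race show up as 0 instead of being skipped.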

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

- For most races there is no big difference in the average age of victims between genders, except for Hispanic victims, where the average age is 26.5 for females and 33 for males. This plot also confirms a previously discussed observation: the average age of white victims is higher than that of black victims.

- Showing the count per weapon in the second plot makes the picture of the states involved clearer, for example by revealing the states with the highest gun counts.

- The surprising pattern discussed earlier (the fall in the monthly number of killings by May 2017) applies to all races, and the counts start fluctuating again after May 2017. This pattern may be related to Trump's election (I searched for events around that period and this was the one that made sense to me).

### Were there any interesting or surprising interactions between features?

- I don't think there were any surprising interactions between features, given that most of our features are qualitative.