## PREPROCESSING THE DATA

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
from sklearn import linear_model, metrics
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../input/PoliceKillingsUS.csv", encoding="windows-1252")
df.head()

In [None]:
df = df.rename(columns={"city": "City"})

df.drop(["id", "name", "manner_of_death"], axis=1, inplace=True) # Deleting useless columns

df["age"] = df["age"].fillna(df["age"].mean()) # Dealing with missing AGE values. Set them to the mean of all ages.
df["age"] = df["age"].astype(int)

df.dropna(subset=["race"], inplace=True) # Deleting rows with missing values for race

df.drop(df.index[2363:], inplace=True) # Deleting deaths after 01/06/2017, as more info is missing about these, including vital information such as race

<br>
Adding a column to the dataset called "total_population" with the total US population of the corresponding race. 
Source: https://en.wikipedia.org/wiki/Demography_of_the_United_States
<br>

In [None]:
# Adding total_population column with data corresponding to race

conditions = [df["race"]=="A", df["race"]=="W", df["race"]=="H", df["race"]=="B", df["race"]=="N", df["race"]=="O"]
numbers = [14674252, 223553265, 50477594, 38929319, 2932248, 22579629]

df["total_population"] = np.select(conditions, numbers, default=0) # Numeric default keeps the column numeric

df.head()
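
The same lookup can also be written as a dictionary mapping. This is only a sketch on a toy frame (the `toy` DataFrame and the name `population_by_race` are made up for illustration); the population figures are the ones used in the cell above.

```python
import pandas as pd

# Population figures as used in this notebook, keyed by the dataset's race codes
population_by_race = {
    "A": 14674252, "W": 223553265, "H": 50477594,
    "B": 38929319, "N": 2932248, "O": 22579629,
}

# Toy frame standing in for df (illustrative rows only)
toy = pd.DataFrame({"race": ["W", "B", "H", "A"]})

# Series.map looks each code up in the dict; a code missing from the dict would become NaN
toy["total_population"] = toy["race"].map(population_by_race)
print(toy)
```

A dictionary keeps each code next to its number, which makes it harder to get the order of the two lists out of sync.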

## EXPLORATORY ANALYSIS

#### Total number of people killed, by race

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(data=df, x="race")

plt.title("Total number of people killed, by race", fontsize=17)

The dataset divides race into Asian, White, Hispanic, Black, Native American and Other. From the bar chart we can see that the overwhelming majority of those killed by police are White, Hispanic or Black, with White being the race with the largest number of victims. 
This makes sense since White is the largest racial group in the US, followed by Hispanic and Black.

#### Number of people killed as a proportion of respective races

In [None]:
# List of nr of people killed per race

races = ["A", "W", "H", "B", "N", "O"]
killed_per_race = []

for i in races:
    i_killings = df.race.loc[(df.race==i)].count()
    killed_per_race.append(i_killings)
    
print (killed_per_race)

In [None]:
# Total US population per race, in the same order as the races list
populations = [14674252, 223553265, 50477594, 38929319, 2932248, 22579629]

prop_killed_per_race = []

for killed, population in zip(killed_per_race, populations):
    prop_i_killed = killed / float(population)
    print (prop_i_killed)
    prop_killed_per_race.append(prop_i_killed)

In [None]:
plt.figure(figsize=(14,6))
plt.title("People killed as a proportion of their respective race", fontsize=17)
sns.barplot(x=races, y=prop_killed_per_race)

This bar chart shows the number of victims per race as a proportion of the total US population of the respective race.
Earlier, when we looked at the total number of people killed, we saw that twice as many Whites were killed as Blacks. However, if you look at the numbers as a proportion of the racial population, Blacks are approximately 3 times as likely to be killed by police as Whites.
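
The comparison comes down to dividing each group's count by its population and then taking the ratio of the two rates. Here is a minimal sketch of that arithmetic. The victim counts are hypothetical, chosen only to reflect the 2:1 ratio described above (the real counts are in `killed_per_race`); the population figures are the ones used earlier in this notebook.

```python
import pandas as pd

# Population figures used earlier in this notebook
population = pd.Series({"W": 223553265, "B": 38929319})

# Hypothetical victim counts in a 2:1 ratio, NOT the real numbers from the dataset
killed = pd.Series({"W": 200, "B": 100})

# Victims per million inhabitants of each group
rate_per_million = killed / population * 1e6

# How many times higher the rate for group B is compared to group W
ratio_b_to_w = rate_per_million["B"] / rate_per_million["W"]

print(rate_per_million.round(2))
print(round(ratio_b_to_w, 2))
```

With counts in a 2:1 ratio and these population figures, the rate ratio comes out a little under 3.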

#### Total number of people killed, by gender

In [None]:
female = df[df["gender"] == "F"].gender.count()
male = df[df["gender"] == "M"].gender.count()
perc_male = (male*100)/(male+female) 

plt.figure(figsize=(7,5))
sns.countplot(data=df, x="gender")

plt.title("Total number of people killed, by gender", fontsize=17)

print (str(perc_male) + "% " + "of the victims are male.")

#### General age distribution

In [None]:
plt.figure(figsize=(15,7))
age_dist = sns.distplot(df["age"], bins=40)
age_dist.set(xlabel="Age", ylabel="Density") # distplot normalises the histogram when the KDE is drawn

plt.title("Age distribution", fontsize=17)

#### Comparing age distributions of Blacks, Whites, and Hispanics

In [None]:
# First, create dataset with only Blacks, Whites, Hispanics

three_races = df.loc[(df["race"] == "B") | (df["race"] == "W") | (df["race"] == "H")]

g = sns.FacetGrid(data=three_races, hue="race", aspect=3, size=4)
g.map(sns.kdeplot, "age", shade=True)
g.add_legend(title="Race")


g.set_ylabels("Density") # A KDE plot shows density, not counts
plt.title("Age distribution, by race", fontsize=17)

The age distributions of Blacks and Hispanics are concentrated at younger ages, with a longer tail towards older ages (that is, they are skewed to the right), whereas the age distribution for Whites is more spread out. On average, Blacks and Hispanics are being killed at a younger age than Whites - which is consistent with the initial hypothesis that black males are subject to police killings at a young age.

In [None]:
avg_age_w = df.age[(df["race"] == "W")].mean() 
avg_age_b = df.age[(df["race"] == "B")].mean() 
avg_age_h = df.age[(df["race"] == "H")].mean() 

print ("Average age of white victims is " + str(avg_age_w))
print ("Average age of black victims is " + str(avg_age_b))
print ("Average age of hispanic victims is " + str(avg_age_h))
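
The three averages can also be computed in one step with a groupby, which makes it easy to add the median as a check against skew. This is a sketch on made-up ages (the `toy` frame is purely illustrative), not the real data.

```python
import pandas as pd

# Toy data with made-up ages, standing in for the three_races frame
toy = pd.DataFrame({
    "race": ["W", "W", "W", "B", "B", "B", "H", "H", "H"],
    "age":  [30, 40, 50, 20, 30, 40, 25, 35, 45],
})

# One row per race with the mean, the median and the number of rows
summary = toy.groupby("race")["age"].agg(["mean", "median", "count"])
print(summary)
```

On the real data, a mean clearly above the median would point to a tail towards older ages.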

#### Number of fatal shootings in each state

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(data=df, x="state")
plt.title("Number of police killings, by state", fontsize=27)

California, Texas and Florida are the states in which police killings are most frequent. These are also the three most populous states in the US.

#### Most dangerous cities

In [None]:
# Counting the killings per city and keeping the eight cities with the highest counts
df_city = df.filter(["City"], axis=1)
df_city["count"] = 1

grouped_city = df_city.groupby("City", as_index=False, sort=False).sum()

grouped_city = grouped_city.sort_values("count", ascending=False).head(8)

plt.figure(figsize=(15,8))
sns.barplot(data=grouped_city, x="City", y="count")
plt.title("Most dangerous cities", fontsize=17)

#### Visualizing police shootings using Tableau

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1504205405904' style='position: relative'><noscript><a href='#'><img alt='Sheet 1 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;In&#47;InteractivePoliceKillingsMap&#47;Sheet1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='path' value='views&#47;InteractivePoliceKillingsMap&#47;Sheet1?:embed=y&amp;:display_count=y&amp;publish=yes' /> <param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;In&#47;InteractivePoliceKillingsMap&#47;Sheet1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1504205405904');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

#### Most common ways of being armed

In [None]:
# Counting the killings per weapon category and keeping the eight most common ones
df_armed = df.filter(["armed"], axis=1)
df_armed["count"] = 1

grouped_armed = df_armed.groupby("armed", as_index=False, sort=False).sum()

grouped_armed = grouped_armed.sort_values("count", ascending=False).head(8)

plt.figure(figsize=(15,8))
sns.barplot(data=grouped_armed, x="armed", y="count")
plt.title("Most common ways of being armed", fontsize=17)

## Adding features (census data)
<br>
Using US census data, I have compiled datasets on median household income, poverty rate, high school graduation rate, and the racial demographics of each city. This information is then added to the original dataset. Below I merge these datasets and apply several machine learning algorithms to explore whether it's possible to predict the race of a victim based on these features.

#### Preprocessing the census data

In [None]:
def load_census(filename, area_col="Geographic Area"):
    """Load one census file and build a "city" key of the form "Name , ST"."""
    census = pd.read_csv("../input/" + filename, encoding="windows-1252")
    # Remove the place type only at the end of the name, so that names such as
    # "Georgetown" are not cut short. The space before the suffix is kept on purpose,
    # because the key built for the killings data uses " , " as separator.
    census["City"] = census["City"].replace(r"(city|CDP|town)$", "", regex=True)
    census["city"] = census["City"] + ", " + census[area_col]
    return census.drop([area_col, "City"], axis=1)

income = load_census("MedianHouseholdIncome2015.csv")
poverty = load_census("PercentagePeopleBelowPovertyLevel.csv")
race = load_census("ShareRaceByCity.csv", area_col="Geographic area")
highschool = load_census("PercentOver25CompletedHighSchool.csv")

#### Merging the datasets

In [None]:
df["city"] = df["City"] + " , " + df["state"] # Creating the same "city" format
merge1 = pd.merge(poverty, race, on="city", how="outer")
merge2 = pd.merge(merge1, income, on="city", how="outer")
merge3 = pd.merge(merge2, highschool, on="city", how="outer")
data = pd.merge(df, merge3, on="city", how="outer")
data.dropna(inplace=True)

census_cols = ["Median Income", "poverty_rate", "share_white", "share_black", "share_native_american",
               "share_asian", "share_hispanic", "percent_completed_hs"]

# "(X)" and "-" mark missing values in the census files: turn them into NaN, then cast to float
data[census_cols] = data[census_cols].replace(["(X)", "-"], np.nan).astype(float)

In [None]:
data.dropna(inplace=True)
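
Because the merge relies on the "city" strings matching exactly, it can be worth checking how many rows actually find a partner. Here is a sketch using the `indicator` option of `merge`, on toy frames whose city names and poverty rates are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the killings data and one census table (made-up names and values)
killings = pd.DataFrame({"city": ["Alpha , AA", "Beta , BB", "Gamma , CC"]})
census = pd.DataFrame({"city": ["Alpha , AA", "Beta , BB"],
                       "poverty_rate": [10.0, 20.0]})

# indicator=True adds a "_merge" column telling where each row came from
checked = killings.merge(census, on="city", how="left", indicator=True)

# Rows from the killings data that found no census match
unmatched = checked.loc[checked["_merge"] == "left_only", "city"]

print(checked["_merge"].value_counts())
print(unmatched.tolist())
```

A long list of unmatched keys usually means the two sides format the city name differently.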

In [None]:
# Converting necessary columns to numeric types
data["poverty_rate"] = data["poverty_rate"].astype(float)
data["share_white"] = data["share_white"].astype(float)
data["share_black"] = data["share_black"].astype(float)
data["share_native_american"] = data["share_native_american"].astype(float)
data["share_asian"] = data["share_asian"].astype(float)
data["share_hispanic"] = data["share_hispanic"].astype(float)
data["percent_completed_hs"] = data["percent_completed_hs"].astype(float)
data["Median Income"] = data["Median Income"].astype(int)

In [None]:
data.head()

## RANDOM FOREST ALGORITHM TO PREDICT RACE

In [None]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation has been removed from scikit-learn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Mapping True/False to 1/0

data["signs_of_mental_illness"] = data["signs_of_mental_illness"].astype(int)
data["body_camera"] = data["body_camera"].astype(int) # Read from data, not df, so the rows stay aligned after the merge

# LabelEncoder is used in the next cell to deal with categorical features

from sklearn.preprocessing import LabelEncoder

In [None]:
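# Label-encoding every column (note that numeric columns are turned into ordinal codes as well)
data_log = data.apply(LabelEncoder().fit_transform)

X = data_log # X refers to the same object as data_log, so the drop below affects both
y = data_log["race"]
X.drop(["race", "date", "total_population"], axis=1, inplace=True)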

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
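
After label encoding, the race shows up as integer codes in the reports further down. Here is a small sketch of how a dedicated encoder keeps the mapping so the codes can be translated back. The name `race_encoder` and the sample values are illustrative; the notebook itself encodes all columns at once.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a separate encoder on the race column only (sample values for illustration)
race_encoder = LabelEncoder()
codes = race_encoder.fit_transform(["W", "B", "H", "W", "A"])

# classes_ holds the original labels in sorted order; the position is the code
code_to_race = dict(enumerate(race_encoder.classes_))

# inverse_transform turns codes (for example model predictions) back into labels
decoded = race_encoder.inverse_transform(codes)

print(code_to_race)
print(decoded)
```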

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [None]:
rfc_pred = rfc.predict(X_test)
rfc.feature_importances_

In [None]:
feature_data = pd.DataFrame({"feature_name": X_train.columns, "feature_importance": rfc.feature_importances_}) # Using the columns the model was trained on
feature_data

The most important features in predicting race using the Random Forest algorithm are age and racial demographics.
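
Sorting the importances makes a ranking like this easier to read. This is a sketch on synthetic data from `make_classification` (the feature names `f0` to `f4` are placeholders), so it only shows the mechanics and not this notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=["f0", "f1", "f2", "f3", "f4"])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```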

In [None]:
print(classification_report(y_test, rfc_pred))

In [None]:
from sklearn.metrics import accuracy_score

# Accuracy score
rf_accuracy_score = accuracy_score(y_test, rfc_pred)
rf_accuracy_score

In [None]:
params = {"max_depth": [32,44,50],
         "n_estimators": [15,18,26,32],
          "min_samples_leaf": [40,50,60],
         "criterion": ["gini", "entropy"]}

from sklearn import model_selection

gs_rf = model_selection.GridSearchCV(estimator=rfc,
                                 param_grid=params,
                                 cv=5,
                                 scoring="accuracy")

gs_rf.fit(X_train, y_train)

In [None]:
# Extract the best parameters
gs_rf.best_params_

In [None]:
# Mean cross-validated accuracy of the best parameters (measured on the training folds, not on X_test)
gs_rf_accuracy_score = gs_rf.best_score_ 
gs_rf_accuracy_score
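
`best_score_` is a cross-validation average, so it isn't measured on the same rows as the test accuracy computed before the grid search. Here is a sketch of how the tuned model could be scored on the held-out test set as well, using synthetic data and a small, arbitrary parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# Small grid chosen only to keep the example fast
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
                      cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Mean accuracy over the cross-validation folds of the training data
cv_accuracy = search.best_score_

# Accuracy of the refitted best model on rows it has never seen
test_accuracy = accuracy_score(y_te, search.predict(X_te))

print(search.best_params_, cv_accuracy, test_accuracy)
```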

<br>
## LOGISTIC REGRESSION ALGORITHM TO PREDICT RACE

In [None]:
# Transforming columns into dummy variables

feature_cols = ["armed", "gender", "city", "City", "state", "threat_level", "flee", "signs_of_mental_illness", "body_camera"]
dummies = pd.get_dummies(data[feature_cols], drop_first=True)
dummies = pd.concat([data, dummies], axis=1)

# Dropping the original columns now that the dummy columns have been added
dummies.drop(["date", "total_population"] + feature_cols, axis=1, inplace=True)
dummies = dummies.dropna() # dropna() returns a new DataFrame, so the result has to be assigned
dummies.head()
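
To show what `drop_first=True` does, here is a toy example with made-up category values: each categorical column becomes one 0/1 column per category, minus the first category, which serves as the baseline.

```python
import pandas as pd

# Toy frame with made-up values for two categorical columns
toy = pd.DataFrame({"gender": ["M", "F", "M"],
                    "flee": ["Car", "Foot", "Not fleeing"]})

# The first category of each column (in sorted order) is dropped as the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one category per column avoids perfectly correlated dummy columns, which matters for linear models such as logistic regression.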

In [None]:
X = dummies.drop("race", axis=1)
y = dummies["race"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

In [None]:
predictions = logmodel.predict(X_test)
print (classification_report(y_test, predictions))

In [None]:
# Accuracy score
log_accuracy_score = accuracy_score(y_test, predictions)
log_accuracy_score

In [None]:
params = {"max_iter": [20,30,50],
         "C": [1.0, 2.0, 3.0]}

gs_logmodel = model_selection.GridSearchCV(estimator=logmodel,
                                 param_grid=params,
                                 cv=5,
                                 scoring="accuracy")

gs_logmodel.fit(X_train, y_train)

In [None]:
gs_logmodel.best_params_

In [None]:
# Mean cross-validated accuracy of the best parameters (measured on the training folds, not on X_test)
gs_logmodel_accuracy_score = gs_logmodel.best_score_
gs_logmodel_accuracy_score

<br>
## KNN ALGORITHM

In [None]:
from sklearn.neighbors import KNeighborsClassifier

X = dummies.drop("race", axis=1)
y = dummies["race"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

knn = KNeighborsClassifier(n_neighbors=1) # k=1
knn.fit(X_train, y_train)

In [None]:
pred = knn.predict(X_test)
print(classification_report(y_test, pred))

In [None]:
knn_accuracy_score = accuracy_score(y_test, pred)
knn_accuracy_score

In [None]:
error_rate = []

for i in range(1,30): # Checking every k value from 1 to 29

    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test)) 

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,30), error_rate, color="grey", marker="o", markerfacecolor="red")
plt.title("Error rate vs K value", fontsize=17)
plt.xlabel("K")
plt.ylabel("Error rate")

k=7 gives the lowest error rate, so we try fitting the model again, using this information.

In [None]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(classification_report(y_test, pred))

In [None]:
# Accuracy score
knn_accuracy_score_iter = accuracy_score(y_test, pred)
knn_accuracy_score_iter
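
Picking k from the test-set error, as done above, tends to give a slightly optimistic accuracy, since the test set has then been used for tuning. Here is a sketch of an alternative, on synthetic data: choose k with cross-validation on the training set only, then score once on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

k_values = range(1, 30)
cv_error = []
for k in k_values:
    # 5-fold cross-validated accuracy on the training data only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr,
                             cv=5, scoring="accuracy")
    cv_error.append(1 - scores.mean())

# k with the lowest cross-validated error
best_k = k_values[int(np.argmin(cv_error))]

# The test set is touched only once, for the final estimate
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)
print(best_k, test_accuracy)
```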

## Comparing accuracy scores

In [None]:
accuracy_pre = {"Random Forest": rf_accuracy_score, 
                "Logistic Regression": log_accuracy_score, 
                "KNN": knn_accuracy_score}

accuracy_post = {"Random Forest": gs_rf_accuracy_score, 
                 "Logistic Regression": gs_logmodel_accuracy_score, 
                 "KNN": knn_accuracy_score_iter}

positions = np.arange(len(accuracy_pre)) # Separate name, so the feature matrix X is not overwritten
ax = plt.subplot(111)
ax.bar(positions - 0.1, list(accuracy_pre.values()), width=0.2, color='b', align='center')
ax.bar(positions + 0.1, list(accuracy_post.values()), width=0.2, color='g', align='center')
ax.legend(('Before grid search','After grid search'))
plt.xticks(positions, list(accuracy_pre.keys()))
plt.title("Accuracy score", fontsize=17)
plt.show()

Logistic Regression and Random Forest yield the highest accuracy scores both before and after tuning. KNN performs better after tuning k, whereas Logistic Regression and Random Forest don't improve with grid search. Note that the grid search scores are mean cross-validation accuracies on the training data, while the other scores are measured on the test set, so the two are not strictly comparable.
KNN doesn't do much better than random choice, which suggests that the features may have little or no connection to the target class.
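
To put "random choice" into numbers, one option is a baseline that always predicts the most frequent class. This is a sketch with scikit-learn's `DummyClassifier` on synthetic, slightly imbalanced data, not on this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly imbalanced data standing in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.6, 0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# Always predicts the class that is most frequent in the training data
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print(baseline_accuracy)
```

A model is only informative to the extent that it beats this number. With one dominant class, the baseline alone can reach a fairly high accuracy.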

<br>
## FINDINGS - SUMMARY

* Relative to population size, Blacks are approximately 3 times as likely as Whites to become victims of police shootings.
* The average age of Black and Hispanic victims is lower (31 and 33 respectively) than that of White victims (40).
* California is the state with the most fatal police shootings, and Los Angeles is the city with the most (both in absolute numbers).
* The most common way of being armed is by gun.

**Critical afterthought**

The data has some obvious shortcomings. For instance, it only goes back 2.5 years. It would be interesting to look at data from before this period as well, but as previously mentioned, such data is hard to find. Furthermore, this data doesn't track deaths by means other than shooting, such as deaths in police custody.
Judging by the accuracy scores of the three algorithms, the features don't do very well in explaining the target class.