![Abolish the Police](images/police-cars-revolving-light.png)

# Terry Stop Analysis Project

**Author:** Sierra Stanton
***

## Overview

This project analyses records of police-reported stops documented by the Seattle Police Department under the Supreme Court case Terry v. Ohio.

In Terry v. Ohio, a landmark Supreme Court case argued in 1967 and decided in 1968, the court found that a police officer was not in violation of the "unreasonable search and seizure" clause of the Fourth Amendment, even though he stopped and frisked a couple of suspects only because their behavior was suspicious. Thus was born the notion of "reasonable suspicion", according to which an agent of the police may, for example, temporarily detain a person even in the absence of the clearer evidence that a full arrest would require. A Terry Stop is a brief detention of a person made on the basis of reasonable suspicion.

## Problem

Data.gov has released a public dataset representing Terry Stops in Seattle, Washington, and the various factors that might influence both the original stop and the outcome of said stop.

We'll build a classifier to help predict whether an arrest was made after a Terry Stop, given various factors like the presence of weapons, the subject's race and gender, and more.

Beyond predicting whether an arrest would be made based on certain perceived factors, this data lets us evaluate the practice itself and better understand how perception plays a role in police practices.

## Data Understanding

Data will be used from the following source:
* __[Data.gov](https://catalog.data.gov/dataset/terry-stops)__

Data.gov's Terry Stops (`Terry_Stops.csv`): this dataset represents records of police-reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop and contains the perceived demographics of the subject, as reported by the officer making the stop, and the officer's demographics, as reported to the Seattle Police Department.

We'll import packages from a variety of sources to aid in our exploration and modeling of our data.

In [1]:
# import necessary packages

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier

## Import: Terry Stop Data

Let's ensure we import the necessary data set and begin an exploration of our records.

In [2]:
# import Terry_Stops.csv from our data folder

df_ts = pd.read_csv('data/Terry_Stops.csv')
df_ts.head()

| | Subject Age Group | Subject ID | GO / SC Num | Terry Stop ID | Stop Resolution | Weapon Type | Officer ID | Officer YOB | Officer Gender | Officer Race | ... | Reported Time | Initial Call Type | Final Call Type | Call Type | Officer Squad | Arrest Flag | Frisk Flag | Precinct | Sector | Beat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | -1 | 20140000120677 | 92317 | Arrest | | 7500 | 1984 | M | Black or African American | ... | 11:32:00 | - | - | - | SOUTH PCT 1ST W - ROBERT | N | N | South | O | O2 |
| 1 | - | -1 | 20150000001463 | 28806 | Field Contact | | 5670 | 1965 | M | White | ... | 07:59:00 | - | - | - | | N | N | - | - | - |
| 2 | - | -1 | 20150000001516 | 29599 | Field Contact | | 4844 | 1961 | M | White | ... | 19:12:00 | - | - | - | | N | - | - | - | - |
| 3 | - | -1 | 20150000001670 | 32260 | Field Contact | | 7539 | 1963 | M | White | ... | 04:55:00 | - | - | - | | N | N | - | - | - |
| 4 | - | -1 | 20150000001739 | 33155 | Field Contact | | 6973 | 1977 | M | White | ... | 00:41:00 | - | - | - | | N | N | - | - | - |


In [3]:
df_ts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47297 entries, 0 to 47296
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         47297 non-null  object
 1   Subject ID                47297 non-null  int64 
 2   GO / SC Num               47297 non-null  int64 
 3   Terry Stop ID             47297 non-null  int64 
 4   Stop Resolution           47297 non-null  object
 5   Weapon Type               47297 non-null  object
 6   Officer ID                47297 non-null  object
 7   Officer YOB               47297 non-null  int64 
 8   Officer Gender            47297 non-null  object
 9   Officer Race              47297 non-null  object
 10  Subject Perceived Race    47297 non-null  object
 11  Subject Perceived Gender  47297 non-null  object
 12  Reported Date             47297 non-null  object
 13  Reported Time             47297 non-null  object
 14  Initial Call Type     

## Data Preparation

From our initial exploration above, we can see that many factors are relevant to analyzing why a Terry Stop was made, the nature of the stop, and its outcome. Please consult the __[data dictionary](https://data.seattle.gov/Public-Safety/Terry-Stops/28ny-9ts8)__ for additional information on individual columns.

### Clean Up: Terry Stop Data

1. Remove any null values that can skew our analysis
2. Investigate columns in order to isolate the most relevant
3. Identify and make our target variable binary

### Remove null values

In [4]:
# identify columns with null values

df_ts.isna().sum()

Subject Age Group             0
Subject ID                    0
GO / SC Num                   0
Terry Stop ID                 0
Stop Resolution               0
Weapon Type                   0
Officer ID                    0
Officer YOB                   0
Officer Gender                0
Officer Race                  0
Subject Perceived Race        0
Subject Perceived Gender      0
Reported Date                 0
Reported Time                 0
Initial Call Type             0
Final Call Type               0
Call Type                     0
Officer Squad               604
Arrest Flag                   0
Frisk Flag                    0
Precinct                      0
Sector                        0
Beat                          0
dtype: int64

In [5]:
# investigate column with null values to ensure it's not especially relevant to our analysis

df_ts['Officer Squad'].value_counts()

TRAINING - FIELD TRAINING SQUAD      5114
WEST PCT 1ST W - DAVID/MARY          1551
WEST PCT 2ND W - D/M RELIEF          1020
SOUTHWEST PCT 2ND W - FRANK           970
NORTH PCT 2ND WATCH - NORTH BEATS     885
                                     ... 
WEST PCT OPS - COMMERCIAL SEC           1
ROBBERY SQUAD B                         1
DV SQUAD D - ORDER SERVICE              1
TRAF - MOTORCYCLE UNIT - T2 SQUAD       1
CANINE - DAY SQUAD                      1
Name: Officer Squad, Length: 172, dtype: int64

While interesting, we can conclude this column isn't central to our analysis predicting arrests and can be removed.

In [6]:
# delete column with null values

df_ts.drop(columns=["Officer Squad"], inplace=True)

### Simplify relevant columns and make our target variable binary

In [7]:
df_ts['Officer YOB'].describe()
df_ts['Weapon Type'].value_counts()
df_ts['Officer Gender'].value_counts()
df_ts['Subject Perceived Race'].value_counts()
df_ts['Subject Perceived Gender'].value_counts()
df_ts['Frisk Flag'].value_counts()
df_ts['Precinct'].value_counts()

West         11464
North        10403
-             9857
East          6223
South         5665
Southwest     2320
SouthWest     1111
Unknown        200
OOJ             33
FK ERROR        21
Name: Precinct, dtype: int64
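
The counts above list both `Southwest` and `SouthWest`, along with placeholder labels such as `-`, `Unknown`, and `FK ERROR`. Below is a minimal sketch of one way to consolidate them before encoding. The helper name `clean_precinct` and the decision to fold `-` and `FK ERROR` into `Unknown` are assumptions made for illustration, not steps taken in this notebook, and the toy series stands in for `df_ts['Precinct']`.

```python
import pandas as pd

# Labels assumed to mean "no usable precinct" (an assumption for illustration)
PLACEHOLDERS = {"-", "FK ERROR", "Unknown"}

def clean_precinct(s: pd.Series) -> pd.Series:
    """Merge case variants and placeholder labels in a precinct column."""
    s = s.str.strip()
    # Replace every placeholder label with a single 'Unknown' category
    s = s.where(~s.isin(PLACEHOLDERS), "Unknown")
    # Capitalize the remaining names so 'SouthWest' and 'Southwest' collapse;
    # 'Unknown' and 'OOJ' are left untouched
    return s.where(s.isin({"Unknown", "OOJ"}), s.str.capitalize())

# Toy data standing in for df_ts['Precinct']
toy = pd.Series(["West", "Southwest", "SouthWest", "-", "FK ERROR", "OOJ"])
cleaned = clean_precinct(toy)
print(cleaned.value_counts())
```

Merging the two spellings matters for one-hot encoding later, since each distinct label would otherwise become its own feature column.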

In [8]:
df_ts['Officer YOB'].describe()

count    47297.000000
mean      1982.969766
std          9.083073
min       1900.000000
25%       1978.000000
50%       1985.000000
75%       1990.000000
max       1998.000000
Name: Officer YOB, dtype: float64

In [9]:
df_ts['Weapon Type'].value_counts()

None                                    32565
-                                       11935
Lethal Cutting Instrument                1482
Knife/Cutting/Stabbing Instrument         636
Handgun                                   291
Firearm Other                             100
Blunt Object/Striking Implement            86
Club, Blackjack, Brass Knuckles            49
Firearm                                    38
Mace/Pepper Spray                          28
Other Firearm                              23
Firearm (unk type)                         15
Taser/Stun Gun                             10
Club                                        9
Fire/Incendiary Device                      7
Rifle                                       7
None/Not Applicable                         7
Shotgun                                     3
Automatic Handgun                           2
Personal Weapons (hands, feet, etc.)        2
Brass Knuckles                              1
Blackjack                         

In [10]:
# columns we expect to be most relevant to our analysis
relevant_cols = df_ts[['Arrest Flag','Weapon Type','Frisk Flag','Precinct','Officer Gender','Subject Perceived Gender','Subject Perceived Race']]

# share of stops by perceived race, expressed as a percentage
stop_race = df_ts['Subject Perceived Race'].value_counts(normalize=True).mul(100)
srp = stop_race.rename('Percent').rename_axis('Perceived Race').reset_index()

# pass the DataFrame itself (not its name as a string) to seaborn
ax = sns.barplot(data=srp, x='Percent', y='Perceived Race')
ax.set_xlim(0, 100)

# label each horizontal bar with its percentage
for p in ax.patches:
    txt = f'{p.get_width():.2f}%'
    txt_x = p.get_width()
    txt_y = p.get_y() + p.get_height() / 2
    ax.text(txt_x, txt_y, txt, va='center')

In [None]:
df_ts['Stop Resolution'].value_counts()

In [None]:
total = 15657 + 11685 + 728 + 19048 + 179
less_impactful = 19048 + 179
percent_less_impactful = less_impactful / total
more_impactful = 15657 + 11685 + 728
percent_more_impactful = more_impactful / total
print(percent_less_impactful)
print(percent_more_impactful)
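
The cell above hard-codes the counts, so the shares would go stale if the dataset were refreshed. As a hedged alternative, the sketch below derives the same two shares directly from a column of outcomes. The helper `impact_shares` and the toy series are hypothetical stand-ins for `df_ts['Stop Resolution']`; the grouping of labels follows the split described in this section.

```python
import pandas as pd

# Outcomes treated as less impactful in this analysis
LESS_IMPACTFUL = {"Field Contact", "Citation / Infraction"}

def impact_shares(resolution: pd.Series) -> pd.Series:
    """Return the share of stops falling in each impact group."""
    group = resolution.isin(LESS_IMPACTFUL).map(
        {True: "less_impactful", False: "more_impactful"}
    )
    return group.value_counts(normalize=True)

# Toy data standing in for df_ts['Stop Resolution'] (not the real counts)
toy = pd.Series(
    ["Field Contact"] * 3
    + ["Arrest"] * 3
    + ["Offense Report"] * 2
    + ["Referred for Prosecution"]
    + ["Citation / Infraction"]
)
shares = impact_shares(toy)
print(shares.round(3))
```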

In [None]:
# visualize distinction of value counts among Stop Resolution values

f = plt.figure(figsize=(15, 6))
sns.set_style("dark")
sns.countplot(data=df_ts, x='Stop Resolution', order=df_ts['Stop Resolution'].value_counts().index, orient="v")
plt.title("Stop Resolution Outcomes Across Terry Stops");

After investigating our column, we can see that `Stop Resolution` is the best way to determine if a Terry Stop resulted in an outcome that could be considered significant or life-altering.

Of the disparate value counts, `Field Contact` and `Citation / Infraction` represent about 40.7% of results and could be considered less significant to a person's livelihood. In comparison, `Offense Report`, `Arrest`, and `Referred for Prosecution` represent about 59.3% of the data, and each is a life-altering outcome for the individual stopped.

Our objective, predicting whether a Terry Stop led to an arrest or a similarly serious outcome, is a binary classification problem, so we'll prepare our target variable, `Stop Resolution`, for the algorithms we'll use to make this estimation.

In [None]:
df_ts["Stop Resolution"].value_counts().to_dict()

In [None]:
# map to column

df_ts['Stop Resolution'] = df_ts['Stop Resolution'].map({'Field Contact': 0,
                                                   'Offense Report': 1,
                                                   'Arrest': 1,
                                                   'Referred for Prosecution': 1,
                                                   'Citation / Infraction': 0})


Let's also ensure other columns we deem significant are clearly prepared and represented for modeling.

In [None]:
df_ts['Weapon Type'].value_counts()

In [None]:
df_ts["Weapon Type"].value_counts().to_dict()

In [None]:
df_ts['Weapon Type'] = df_ts['Weapon Type'].map({'None': 'NA',
                                           '-': 'NA',
                                           'Lethal Cutting Instrument': 'Non-Firearm',
                                           'Knife/Cutting/Stabbing Instrument': 'Non-Firearm',
                                           'Handgun': 'Firearm',
                                           'Firearm Other': 'Firearm',
                                           'Blunt Object/Striking Implement': 'Non-Firearm',
                                           'Club, Blackjack, Brass Knuckles': 'Non-Firearm',
                                           'Firearm': 'Firearm',
                                           'Mace/Pepper Spray': 'Non-Firearm',
                                           'Other Firearm': 'Firearm',
                                           'Firearm (unk type)': 'Firearm',
                                           'Taser/Stun Gun': 'Non-Firearm',
                                           'Club': 'Non-Firearm',
                                           'Fire/Incendiary Device': 'Non-Firearm',
                                           'None/Not Applicable': 'NA',
                                           'Rifle': 'Firearm',
                                           'Shotgun': 'Firearm',
                                           'Personal Weapons (hands, feet, etc.)': 'Non-Firearm',
                                           'Automatic Handgun': 'Firearm',
                                           'Blackjack': 'Non-Firearm',
                                           'Brass Knuckles': 'Non-Firearm'})
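
`Series.map` with a dictionary silently turns any label missing from the dictionary into `NaN`, which is easy to miss with this many categories. The sketch below shows one possible guard; `map_with_check` is a hypothetical helper, and the shortened `weapon_map` and toy series are illustrative rather than the full mapping used above.

```python
import pandas as pd

def map_with_check(s: pd.Series, mapping: dict) -> pd.Series:
    """Map labels through `mapping`, raising if any label is not covered."""
    unmapped = set(s.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped labels: {sorted(unmapped)}")
    return s.map(mapping)

# Shortened mapping for illustration only
weapon_map = {"None": "NA", "-": "NA", "Handgun": "Firearm", "Club": "Non-Firearm"}

toy = pd.Series(["None", "Handgun", "Club", "-"])
mapped = map_with_check(toy, weapon_map)
print(mapped.tolist())

# A label outside the mapping is reported instead of becoming NaN
try:
    map_with_check(pd.Series(["Crossbow"]), weapon_map)
except ValueError as err:
    print(err)
```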

In [None]:
# drop identifier, date/time, and other columns we won't use for modeling
df_ts.drop(columns=['Subject Age Group','Subject ID','GO / SC Num','Terry Stop ID','Officer ID','Officer YOB','Officer Race',
                'Reported Date','Reported Time','Initial Call Type','Final Call Type','Arrest Flag','Call Type','Sector','Beat'], inplace=True)

In [None]:
df_ts.info()

## Train Test Split

In [None]:
X = df_ts.loc[:, ['Weapon Type','Frisk Flag','Precinct','Officer Gender','Subject Perceived Gender',
                  'Subject Perceived Race']]
y = df_ts.loc[:, 'Stop Resolution'] #see Smote

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=42)
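
The split above draws rows at random, so the share of positive outcomes can drift between the training and test sets. Below is a minimal sketch, on toy data, of passing `stratify` to `train_test_split` to keep the class balance the same in both sets. The names `X_toy` and `y_toy` are illustrative stand-ins for `X` and `y`, and whether stratification is wanted here is a judgment call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy stand-ins for X and y with a 60/40 class balance
X_toy = pd.DataFrame({"Frisk Flag": rng.choice(["Y", "N"], size=200)})
y_toy = pd.Series([1] * 120 + [0] * 80)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```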

In [None]:
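# fit the one-hot encoder on the training data only, then transform both sets
# handle_unknown='ignore' encodes categories unseen during fit as all zeros
ohe = OneHotEncoder(handle_unknown='ignore')

ohe.fit(X_train)
X_train_ohe = ohe.transform(X_train).toarray()
X_test_ohe = ohe.transform(X_test).toarray()

# create dataframe with training and testing data
# get_feature_names_out requires scikit-learn >= 1.0 (older versions use get_feature_names)
ohe_df1 = pd.DataFrame(X_train_ohe, columns=ohe.get_feature_names_out(X_train.columns))
ohe_df2 = pd.DataFrame(X_test_ohe, columns=ohe.get_feature_names_out(X_test.columns))
ohe_df = pd.concat([ohe_df1,ohe_df2])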
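
Encoding and modeling can also be chained so that the encoder is always fit on training data only. The sketch below uses scikit-learn's `make_pipeline` on toy data; the toy frames and the unseen `Crossbow` label are hypothetical, and this is an alternative arrangement rather than the approach taken in this notebook.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for X_train / y_train / X_test
X_train_toy = pd.DataFrame({
    "Weapon Type": ["NA", "Firearm", "NA", "Non-Firearm", "NA", "Firearm"],
    "Frisk Flag": ["N", "Y", "N", "Y", "N", "Y"],
})
y_train_toy = [0, 1, 0, 1, 0, 1]
X_test_toy = pd.DataFrame({
    "Weapon Type": ["NA", "Crossbow"],  # 'Crossbow' never appears in training
    "Frisk Flag": ["N", "Y"],
})

# The encoder is fit inside the pipeline, on training rows only;
# handle_unknown='ignore' lets unseen test categories pass as all zeros
clf = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train_toy, y_train_toy)
preds = clf.predict(X_test_toy)
print(preds)
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, which refit the encoder within each fold.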

## Create Logistic Regression Model

In [None]:
lr_one = LogisticRegression()
lr_one = lr_one.fit(X_train_ohe, y_train)

# preview model params
print(lr_one)

# predict
y_pred = lr_one.predict(X_test_ohe)

# evaluate model with a classification report
# print (rather than display) so the report's line breaks render correctly
print(classification_report(y_test, y_pred))

In [None]:
# note: the confusion matrix was the real test while the model predicted 0 across the board;
# that changed once the target variable was redefined
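
To make the note above concrete, here is a small sketch of reading a binary confusion matrix with scikit-learn. The label lists are invented for illustration and are not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 1 = significant outcome, 0 = less significant
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes and columns are predicted classes, ordered by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

A model that predicts 0 for every row would show zeros in the entire right-hand column, which a single accuracy figure can hide.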

## Iterate on Logistic Regression Model

In [None]:
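# C must be numeric, and the l1 penalty needs a solver that supports it (liblinear or saga)
lr_two = LogisticRegression(penalty='l1', C=0.75, solver='liblinear')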
lr_two = lr_two.fit(X_train_ohe, y_train)

# preview model params
print(lr_two)

# predict
y_pred = lr_two.predict(X_test_ohe)


# evaluate model with a classification report
print(classification_report(y_test, y_pred))

In [None]:
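# show the ten best-scoring C values
# assumes c_accuracy is a dict mapping each candidate C to its accuracy (it is not defined above)
sorted(c_accuracy.items(), key=lambda kv: kv[1], reverse=True)[:10]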
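
The cell above relies on a `c_accuracy` dictionary that this notebook never builds. One plausible way to construct it, assuming it maps each candidate `C` to a mean cross-validated accuracy, is sketched below. The candidate values and the synthetic `X_toy` / `y_toy` data are assumptions for illustration; in the notebook the encoded training data would take their place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train_ohe / y_train
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Map each candidate C to its mean 5-fold cross-validated accuracy
c_accuracy = {}
for c in [0.01, 0.1, 0.75, 1, 2, 10]:
    model = LogisticRegression(C=c, max_iter=1000)
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    c_accuracy[c] = scores.mean()

# Same ranking expression as in the cell above, limited to the top three
best = sorted(c_accuracy.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(best)
```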

In [None]:
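ss = StandardScaler()

# scale the one-hot encoded features; the raw X_train holds strings and cannot be scaled
X_train_sc = ss.fit_transform(X_train_ohe)
log_reg = LogisticRegression(C=2, solver='lbfgs', max_iter=5000)
log_reg.fit(X_train_sc, y_train)

# predictions on the training set
y_hat = log_reg.predict(X_train_sc)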

In [None]:
fig, ax = plt.subplots()
cm = confusion_matrix(y_train, y_hat)
# tick labels follow our binary target: 0 = less significant outcome, 1 = significant outcome
sns.heatmap(cm, ax=ax, annot=True,  square=True, 
            cbar=False, cmap="coolwarm", fmt='g',
            xticklabels=['0', '1'],yticklabels=['0', '1'] )

ax.set_xlabel('Predicted', fontdict={'size': 15})
ax.set_ylabel('True', fontdict={'size': 15})
ax.set_title('Logistic Regression: C2', fontdict={'size': 15})

## Create Decision Tree Model

In [None]:
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train_ohe, y_train)
# dt.score(X_train_ohe, y_train), default accuracy

In [None]:
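# plot_tree is imported directly from sklearn.tree above
# get_feature_names_out requires scikit-learn >= 1.0 (older versions use get_feature_names)
plot_tree(dt, feature_names=list(ohe.get_feature_names_out()), filled=True);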

In [None]:
# predict, pass in for F1 score to compare across models
# bring confusion matrix in
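
As a rough sketch of the comparison planned in the notes above, the snippet below fits a logistic regression and a depth-5 decision tree on synthetic data and collects an F1 score for each. The data from `make_classification` and the `models` dictionary are assumptions for illustration; in the notebook, `X_train_ohe`, `X_test_ohe`, `y_train`, and `y_test` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded Terry Stop features
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Fit each model on the training split and score it on the held-out split
f1_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    f1_scores[name] = f1_score(y_te, model.predict(X_te))

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Using the same held-out split and the same metric for every model keeps the comparison like for like.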