# Terry Stops
## Overview
There has been tension between law enforcement and the public over the notion of `reasonable suspicion`. In this project, I analyze Terry stop records to predict whether an arrest was made after a `Terry Stop`.

## Objective
Build a model that will predict, given the data, whether an arrest was made after a Terry stop.

## Methodology
1. `Data Exploration` : Examine the data structure and content, identify relevant variables and understand their meaning and distribution.

2. `Analysis` : Generate descriptive statistics and visualizations to gain insight into the patterns of Terry stops.

3. `Predictive Modeling` : Create models that will predict whether an arrest was made after a Terry stop.
 

## Data Understanding
The dataset contains various attributes related to the stops, including demographic information, stop location, stop reasoning, and outcomes.
This data represents records of police-reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop.

- Each record contains the perceived demographics of the subject, as reported by the officer making the stop, and the officer's demographics, as reported to the Seattle Police Department for employment purposes.

### Data description

`Subject Age Group`: Subject Age Group (10 year increments) as reported by the officer.

`Subject ID`: Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. 

`GO / SC Num`: General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data.

`Terry Stop ID`: Key identifying unique Terry Stop reports.

`Stop Resolution`: Resolution of the stop as reported by the officer.

`Weapon Type`: Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapon was found.

`Officer ID`: Key identifying unique officers in the dataset.

`Officer YOB`: Year of birth, as reported by the officer.

`Officer Gender`: Gender of the officer, as reported by the officer.

`Officer Race`: Race of the officer, as reported by the officer.

`Subject Perceived Race`: Perceived race of the subject, as reported by the officer.

`Subject Perceived Gender`: Perceived gender of the subject, as reported by the officer.

`Reported Date`: Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.

`Reported Time`: Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.

`Initial Call Type`: Initial classification of the call as assigned by 911.

`Final Call Type`: Final classification of the call as assigned by the primary officer closing the event.

`Call Type`: How the call was received by the communication center.

`Officer Squad`: Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP).

`Arrest Flag`: Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS).

`Frisk Flag`: Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop.

`Precinct`: Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

`Sector`: Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

`Beat`: Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.


In [179]:
# relevant imports

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
plt.style.use('seaborn')
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# sklearn imports
# Model Selection and Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

# Metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_curve, auc, confusion_matrix, classification_report


# Classifiers
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression


In [180]:
# creating a dataframe
df = pd.read_csv('/content/drive/MyDrive/project_3/Terry_Stops/Terry_Stops.csv')
df.head()

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,36 - 45,7726342469,20200000112069,12803715000,Field Contact,-,6953,1968,M,White,...,19:28:18.0000000,"DISTURBANCE, MISCELLANEOUS/OTHER",--DISTURBANCE - OTHER,911,,N,N,North,N,N2
1,46 - 55,17544297314,20210000007572,19456101086,Field Contact,-,6678,1970,M,White,...,06:01:35.0000000,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,911,,N,N,North,U,U2
2,26 - 35,-1,20150000005079,88327,Field Contact,,6382,1958,M,Nat Hawaiian/Oth Pac Islander,...,16:14:00.0000000,-,-,-,NORTH PCT 2ND W - JOHN - PLATOON 1,N,N,-,-,-
3,-,31307974123,20220000015393,31308022368,Field Contact,-,6799,1976,M,Hispanic or Latino,...,13:34:08.0000000,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,,N,N,West,M,M3
4,26 - 35,7727242683,20190000195849,8258954520,Field Contact,-,6953,1968,M,White,...,16:06:48.0000000,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON,ONVIEW,,N,N,North,N,N3


In [181]:
# function to show column details
def info(dataframe):
    # collect one summary row per column
    rows = []
    for col in dataframe.columns:
        missing = dataframe[col].isna().sum()
        total_len = len(dataframe[col])
        rows.append({'Column': col,
                     'Missing Percentage': round((missing / total_len) * 100, 2),
                     'Missing Values': missing,
                     'Length': total_len,
                     'Data type': dataframe[col].dtype})
    # DataFrame.append was removed in pandas 2.0, so the frame is built from the list of rows
    info_df = pd.DataFrame(rows, columns=['Column', 'Missing Percentage', 'Missing Values', 'Length', 'Data type'])
    info_df = info_df.sort_values(by='Missing Percentage', ascending=False)
    return info_df

In [182]:
# checking data details
info(df)

| | Column | Missing Percentage | Missing Values | Length | Data type |
|---|---|---|---|---|---|
| 17 | Officer Squad | 38.8 | 930838 | 2399030 | object |
| 0 | Subject Age Group | 0.0 | 0 | 2399030 | object |
| 12 | Reported Date | 0.0 | 0 | 2399030 | object |
| 21 | Sector | 0.0 | 0 | 2399030 | object |
| 20 | Precinct | 0.0 | 0 | 2399030 | object |
| 19 | Frisk Flag | 0.0 | 0 | 2399030 | object |
| 18 | Arrest Flag | 0.0 | 0 | 2399030 | object |
| 16 | Call Type | 0.0 | 0 | 2399030 | object |
| 15 | Final Call Type | 0.0 | 0 | 2399030 | object |
| 14 | Initial Call Type | 0.0 | 0 | 2399030 | object |


- `Officer Squad` is missing in about 38.8% of the rows (930,838 of 2,399,030). Given the total number of rows available and what this column represents, I'll drop only the rows with missing values, since the remaining data is sufficient.


In [183]:
# dropping null values
df.dropna(inplace=True)
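
Called without arguments, `dropna()` removes a row if any column in it is missing. A narrower variant, assuming only `Officer Squad` should decide which rows are dropped, passes `subset`. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data (illustrative only).
toy = pd.DataFrame({
    "Officer Squad": ["NORTH PCT 2ND W", np.nan, "WEST PCT 1ST W", np.nan],
    "Beat": ["N2", "U2", "-", "M3"],
})

# Percentage of missing values per column.
missing_pct = toy.isna().mean().mul(100).round(2)
print(missing_pct)

# Drop only the rows where `Officer Squad` is missing.
toy_clean = toy.dropna(subset=["Officer Squad"])
print(len(toy), "->", len(toy_clean))
```

When `Officer Squad` is the only column with missing values, both forms should remove the same rows; `subset` just makes the intent explicit.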

In [184]:
# investigating the Beat column
df.Beat.value_counts()

-     383904
N3     50482
E2     45107
K3     40807
M2     36593
M3     33927
N2     30358
E1     30014
R2     28810
B1     27606
U2     27262
M1     26918
F2     26574
K2     25843
B2     25069
D1     24725
L1     24467
L3     23005
L2     23005
S2     22747
D2     22747
E3     21414
O1     21070
S3     20812
K1     19135
Q3     19135
J1     19049
B3     18490
F3     17888
U1     17587
G2     17544
R1     17458
D3     17329
W2     16426
R3     16297
J3     16168
G3     16125
C3     15394
C1     15351
O2     14663
S1     14190
F1     14147
O3     14104
W1     14018
Q2     13416
N1     13244
J2     12857
C2     11739
G1     11180
U3     11094
W3      9804
Q1      8729
99      2279
S         86
Name: Beat, dtype: int64

`-` is the most frequently occurring value, so I'll change it to `Unknown`.

In [185]:
# function to replace a value
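def replace(dataframe, column, og_value, rep_value):
  '''
  dataframe - dataframe to modify
  column - column name
  og_value - value (or list of values) to be replaced
  rep_value - replacing value '''

  # assign the result back instead of calling inplace on the column,
  # which newer pandas versions no longer apply to the parent dataframe reliably
  dataframe[column] = dataframe[column].replace(to_replace=og_value, value=rep_value)

  return dataframe[column].value_counts()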

In [186]:
# replacing values in column Beat
replace(df, 'Beat', '-', 'Unknown')

Unknown    383904
N3          50482
E2          45107
K3          40807
M2          36593
M3          33927
N2          30358
E1          30014
R2          28810
B1          27606
U2          27262
M1          26918
F2          26574
K2          25843
B2          25069
D1          24725
L1          24467
L3          23005
L2          23005
S2          22747
D2          22747
E3          21414
O1          21070
S3          20812
K1          19135
Q3          19135
J1          19049
B3          18490
F3          17888
U1          17587
G2          17544
R1          17458
D3          17329
W2          16426
R3          16297
J3          16168
G3          16125
C3          15394
C1          15351
O2          14663
S1          14190
F1          14147
O3          14104
W1          14018
Q2          13416
N1          13244
J2          12857
C2          11739
G1          11180
U3          11094
W3           9804
Q1           8729
99           2279
S              86
Name: Beat, dtype: int64
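
The `-` placeholder shows up in several columns of this dataset, so the same substitution can also be applied to a list of columns in one step. This is a minimal sketch on made-up rows; the `toy` frame and the `placeholder_cols` list are assumptions for illustration:

```python
import pandas as pd

# Made-up rows where "-" marks an unknown value (illustrative only).
toy = pd.DataFrame({
    "Beat": ["-", "N3", "E2", "-"],
    "Sector": ["-", "N", "E", "-"],
    "Precinct": ["-", "North", "East", "-"],
})

# Assumed list of columns that use "-" as a placeholder.
placeholder_cols = ["Beat", "Sector", "Precinct"]

# Replace the placeholder in all listed columns at once.
toy[placeholder_cols] = toy[placeholder_cols].replace("-", "Unknown")
print(toy["Beat"].value_counts())
```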

## Feature Generation

Some of the columns contain data that is not useful as it is. I'll use various techniques to create new features and drop the original columns.

In [187]:
# The officer year of birth column will be replaced by the officer's age
# Age is calculated relative to the reporting date, i.e. the age at the time of reporting
# Convert the date strings to datetime objects and the year of birth to numbers

df['Reported Date'] = pd.to_datetime(df['Reported Date'])
df['Officer YOB'] = pd.to_numeric(df['Officer YOB'], errors='coerce')

df['age'] = df['Reported Date'].dt.year - df['Officer YOB']





In [188]:
# investigating age column
print(f"Youngest: {df['age'].min()}")
print(f"Oldest: {df['age'].max()}")

Youngest: 21
Oldest: 119


Given that, in Seattle, the minimum age to enroll at the time of this project is 20.5 years and the retirement age is 65, I'll drop any rows where the officer's age is below 21 or above 65.

In [189]:
df = df[(df['age'] <= 65) & (df['age'] >= 21)]


In [190]:
# investigating age column
print(f"Youngest: {df['age'].min()}")
print(f"Oldest: {df['age'].max()}")

Youngest: 21
Oldest: 65
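
The same range filter can be written with `Series.between`, which is inclusive on both ends by default. A small sketch with made-up dates and birth years (the `toy` frame is illustrative only):

```python
import pandas as pd

# Made-up report dates and officer birth years (illustrative only).
toy = pd.DataFrame({
    "Reported Date": pd.to_datetime(
        ["2015-03-01", "2019-07-15", "2020-01-10", "2021-05-05"]
    ),
    "Officer YOB": [1958, 1900, 1995, 2003],
})

# Age at the time of the report, using the year only.
toy["age"] = toy["Reported Date"].dt.year - toy["Officer YOB"]

# Keep ages from 21 to 65, both bounds included.
toy_valid = toy[toy["age"].between(21, 65)]
print(toy_valid["age"].tolist())
```

Because only the year is used, the computed age can be off by up to one year for officers whose birthday falls after the report date.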


Several values in the `Weapon Type` column can be grouped together.

In [191]:
df['Weapon Type'].value_counts()

None                               1383611
Lethal Cutting Instrument            62952
Handgun                               9804
Firearm Other                         4257
-                                     2623
Club, Blackjack, Brass Knuckles       2107
Firearm (unk type)                     645
Club                                   387
Rifle                                  172
Shotgun                                129
Automatic Handgun                       86
Blackjack                               43
Brass Knuckles                          43
Name: Weapon Type, dtype: int64

In [192]:
# replacing
replace(df, 'Weapon Type', '-', 'None')
replace(df, 'Weapon Type', ['Handgun', 'Firearm Other', 'Firearm (unk type)',
                            'Other Firearm', 'Rifle', 'Shotgun', 'Automatic Handgun'],'Firearm')
replace(df, 'Weapon Type', ['Club, Blackjack, Brass Knuckles', 'Blackjack', 'Brass Knuckles', 
                            'Club', 'Personal Weapons (hands, feet, etc.)'], 'Striking Object')

None                         1386234
Lethal Cutting Instrument      62952
Firearm                        15093
Striking Object                 2580
Name: Weapon Type, dtype: int64
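
The grouping above can also be expressed as a single dictionary passed to `Series.replace`, which keeps all category decisions in one place. This is a sketch on a made-up series; `group_map` covers only a few of the labels for brevity:

```python
import pandas as pd

# Made-up sample of weapon labels (illustrative only).
weapons = pd.Series(
    ["Handgun", "Rifle", "Club", "-", "None", "Lethal Cutting Instrument"]
)

# Partial mapping from raw label to grouped label (assumed subset).
group_map = {
    "-": "None",
    "Handgun": "Firearm",
    "Rifle": "Firearm",
    "Club": "Striking Object",
}

# Labels not listed in the mapping are left unchanged.
grouped = weapons.replace(group_map)
print(grouped.value_counts())
```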

In [193]:
# function for dropping columns
def drop(df, col):
  '''
  df - dataframe to modify in place
  col - column name or list of column names'''

  df.drop(col, axis=1, inplace=True)
  # print directly; returning print(...) would only return None
  print(f'Number of columns after dropping: {df.shape[1]}')

In [194]:
print(f'Number of columns before dropping: {df.shape[1]}')

Number of columns before dropping: 24


In [195]:
cols = ['Subject ID', 'GO / SC Num', 'Terry Stop ID', 'Officer YOB', 'Initial Call Type', 'Arrest Flag', 'Reported Date']
drop(df, cols)

Number of columns after dropping: 17


In [196]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Time,Final Call Type,Call Type,Officer Squad,Frisk Flag,Precinct,Sector,Beat,age
2,26 - 35,Field Contact,,6382,M,Nat Hawaiian/Oth Pac Islander,White,Male,16:14:00.0000000,-,-,NORTH PCT 2ND W - JOHN - PLATOON 1,N,-,-,Unknown,57
6,26 - 35,Field Contact,,7575,M,White,White,Male,16:56:00.0000000,-,-,WEST PCT 2ND W - D/M RELIEF,N,-,-,Unknown,33
7,56 and Above,Offense Report,,7700,M,White,White,Male,17:38:00.0000000,--DV - ASSIST VICTIM BY COURT ORDER,911,SOUTH PCT 2ND W - ROBERT - PLATOON 2,N,South,R,R2,27
9,26 - 35,Offense Report,,8303,M,Hispanic or Latino,White,Male,00:52:00.0000000,"--ASSAULTS, OTHER",ONVIEW,EAST PCT 2ND W - CHARLIE RELIEF,N,East,E,E2,28
10,18 - 25,Offense Report,Lethal Cutting Instrument,8450,M,Hispanic or Latino,White,Male,02:34:00.0000000,--DISTURBANCE - OTHER,911,EAST PCT 3RD W - E/G RELIEF,Y,East,G,G2,26


In [197]:
df['Final Call Type'].value_counts()

-                                           532598
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON     88064
--PROWLER - TRESPASS                         83549
--DISTURBANCE - OTHER                        76497
--ASSAULTS, OTHER                            68198
                                             ...  
NARCOTICS - FOUND                               43
ASLT - DV                                       43
-OFF DUTY EMPLOYMENT                            43
SHOTS -DELAY/INCLUDES HEARD/NO ASSAULT          43
FIGHT - JO - PHYSICAL (NO WEAPONS)              43
Name: Final Call Type, Length: 160, dtype: int64

In [198]:
replace(df, 'Final Call Type', '-', 'Unknown')

Unknown                                     532598
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON     88064
--PROWLER - TRESPASS                         83549
--DISTURBANCE - OTHER                        76497
--ASSAULTS, OTHER                            68198
                                             ...  
NARCOTICS - FOUND                               43
ASLT - DV                                       43
-OFF DUTY EMPLOYMENT                            43
SHOTS -DELAY/INCLUDES HEARD/NO ASSAULT          43
FIGHT - JO - PHYSICAL (NO WEAPONS)              43
Name: Final Call Type, Length: 160, dtype: int64

In [199]:
replace(df, 'Precinct', '-', 'Unknown')

Unknown      386011
North        338969
West         289218
East         183868
South        169979
Southwest     98814
Name: Precinct, dtype: int64

In [200]:
replace(df, 'Sector', '-', 'Unknown')

Unknown    383732
M           97438
E           96535
N           93955
K           85699
B           71036
L           70176
D           64801
R           62522
F           58566
S           57749
U           55857
O           49708
J           47945
G           44849
C           42484
Q           41280
W           40248
99           2279
Name: Sector, dtype: int64

In [201]:
replace(df, 'Call Type', '-', 'Unknown')

911                              602817
Unknown                          532598
ONVIEW                           229749
TELEPHONE OTHER, NOT 911          94729
ALARM CALL (NOT POLICE ALARM)      6880
TEXT MESSAGE                         43
SCHEDULED EVENT (RECURRING)          43
Name: Call Type, dtype: int64

In [202]:
df['Reported Time'] = df['Reported Time'].str[:5]

In [203]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Time,Final Call Type,Call Type,Officer Squad,Frisk Flag,Precinct,Sector,Beat,age
2,26 - 35,Field Contact,,6382,M,Nat Hawaiian/Oth Pac Islander,White,Male,16:14,Unknown,Unknown,NORTH PCT 2ND W - JOHN - PLATOON 1,N,Unknown,Unknown,Unknown,57
6,26 - 35,Field Contact,,7575,M,White,White,Male,16:56,Unknown,Unknown,WEST PCT 2ND W - D/M RELIEF,N,Unknown,Unknown,Unknown,33
7,56 and Above,Offense Report,,7700,M,White,White,Male,17:38,--DV - ASSIST VICTIM BY COURT ORDER,911,SOUTH PCT 2ND W - ROBERT - PLATOON 2,N,South,R,R2,27
9,26 - 35,Offense Report,,8303,M,Hispanic or Latino,White,Male,00:52,"--ASSAULTS, OTHER",ONVIEW,EAST PCT 2ND W - CHARLIE RELIEF,N,East,E,E2,28
10,18 - 25,Offense Report,Lethal Cutting Instrument,8450,M,Hispanic or Latino,White,Male,02:34,--DISTURBANCE - OTHER,911,EAST PCT 3RD W - E/G RELIEF,Y,East,G,G2,26


In [204]:
df['Stop Resolution'].unique()

array(['Field Contact', 'Offense Report', 'Arrest',
       'Citation / Infraction', 'Referred for Prosecution'], dtype=object)

The target column can now be engineered into a binary variable based on the outcome of the stop.
The cell above shows 5 different outcomes, which I'll combine according to the severity of the outcome.

In [205]:
# encode the target as 0 (no arrest) and 1 (arrest), matching the counts shown below
replace(df, 'Stop Resolution', ['Field Contact', 'Citation / Infraction'] , 0)

replace(df, 'Stop Resolution', ['Offense Report', 'Arrest', 'Referred for Prosecution'], 1)



1    931165
0    535694
Name: Stop Resolution, dtype: int64
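
An alternative way to build the same binary target is `isin` followed by a cast to integers, which always yields an integer column. A minimal sketch on a made-up series; the `arrest_like` list mirrors the grouping used above:

```python
import pandas as pd

# One made-up example per outcome (illustrative only).
resolution = pd.Series([
    "Field Contact",
    "Offense Report",
    "Arrest",
    "Citation / Infraction",
    "Referred for Prosecution",
])

# Outcomes treated as the positive class.
arrest_like = ["Offense Report", "Arrest", "Referred for Prosecution"]

# True/False membership converted to 1/0.
target = resolution.isin(arrest_like).astype(int)
print(target.tolist())
```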

In [211]:
info(df)

| | Column | Missing Percentage | Missing Values | Length | Data type |
|---|---|---|---|---|---|
| 0 | Subject Age Group | 0.0 | 0 | 1466859 | object |
| 9 | Final Call Type | 0.0 | 0 | 1466859 | object |
| 15 | Beat | 0.0 | 0 | 1466859 | object |
| 14 | Sector | 0.0 | 0 | 1466859 | object |
| 13 | Precinct | 0.0 | 0 | 1466859 | object |
| 12 | Frisk Flag | 0.0 | 0 | 1466859 | object |
| 11 | Officer Squad | 0.0 | 0 | 1466859 | object |
| 10 | Call Type | 0.0 | 0 | 1466859 | object |
| 8 | Reported Time | 0.0 | 0 | 1466859 | object |
| 1 | Stop Resolution | 0.0 | 0 | 1466859 | int64 |


In [210]:
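# function to round an 'HH:MM' time string to the nearest hour
def round_time(col):
    hours, minutes = col.split(':')

    hours = int(hours)
    minutes = int(minutes)

    # round up past the half hour, wrapping 23:31 onward to 00:00 instead of 24:00
    if minutes > 30:
        hours = (hours + 1) % 24

    rounded_col = f'{hours:02d}:00'

    return rounded_col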


In [208]:
# Removing the minutes by rounding each time to the nearest hour
df['Reported Time'] = df['Reported Time'].apply(round_time)
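
# A vectorized sketch of the same rounding on made-up sample times. It also wraps
# times from 23:31 onward to 00:00 rather than producing an hour of 24. The names
# `times`, `parts`, and `rounded` are illustrative.

```python
import pandas as pd

# Made-up 'HH:MM' strings (illustrative only).
times = pd.Series(["16:14", "16:56", "23:45", "00:30"])

# Split into integer hour and minute columns.
parts = times.str.split(":", expand=True).astype(int)
hours, minutes = parts[0], parts[1]

# Add one hour when past the half hour, then wrap around midnight.
rounded_hours = (hours + (minutes > 30).astype(int)) % 24

# Format back to zero-padded 'HH:00' strings.
rounded = rounded_hours.astype(str).str.zfill(2) + ":00"
print(rounded.tolist())
```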


In [209]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer ID,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Time,Final Call Type,Call Type,Officer Squad,Frisk Flag,Precinct,Sector,Beat,age
2,26 - 35,0,,6382,M,Nat Hawaiian/Oth Pac Islander,White,Male,16:00,Unknown,Unknown,NORTH PCT 2ND W - JOHN - PLATOON 1,N,Unknown,Unknown,Unknown,57
6,26 - 35,0,,7575,M,White,White,Male,17:00,Unknown,Unknown,WEST PCT 2ND W - D/M RELIEF,N,Unknown,Unknown,Unknown,33
7,56 and Above,1,,7700,M,White,White,Male,18:00,--DV - ASSIST VICTIM BY COURT ORDER,911,SOUTH PCT 2ND W - ROBERT - PLATOON 2,N,South,R,R2,27
9,26 - 35,1,,8303,M,Hispanic or Latino,White,Male,01:00,"--ASSAULTS, OTHER",ONVIEW,EAST PCT 2ND W - CHARLIE RELIEF,N,East,E,E2,28
10,18 - 25,1,Lethal Cutting Instrument,8450,M,Hispanic or Latino,White,Male,03:00,--DISTURBANCE - OTHER,911,EAST PCT 3RD W - E/G RELIEF,Y,East,G,G2,26
