## Module 3 Final Project Submission

Please fill out:
* Student name: Rachel Beery
* Student pace: Full Time
* Scheduled project review date/time: 
* Instructor name: Rafael
* Blog post URL:


Background: In Terry v. Ohio, a landmark Supreme Court case argued in 1967 and decided in 1968, the Court found that a police officer was not in violation of the "unreasonable search and seizure" clause of the Fourth Amendment, even though he stopped and frisked a couple of suspects only because their behavior was suspicious.

Thus was born the notion of "reasonable suspicion", according to which a police officer may, for example, temporarily detain a person even in the absence of the clearer evidence that a full-blown arrest would require. Terry Stops are stops made of suspicious drivers.

Objective: I will be building a classifier to predict whether an arrest was made after a Terry Stop. 

Approach: I follow the OSEMiN data science workflow (Obtain, Scrub, Explore, Model, iNterpret) to build a classifier that predicts whether an arrest was made after a Terry Stop, given information about the presence of weapons, the time of day of the call, etc. Note that this is a binary classification problem. I will also analyze whether race (of officer or of subject) plays a role in whether or not an arrest is made.

Data: In my project I will be utilizing the Terry Traffic Stops dataset that was provided by Flatiron School.



# Data Frame Column Descriptions

Subject Age Group: Subject Age Group (10 year increments) as reported by the officer. 

Subject ID: Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification. 

GO / SC Num: General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data. 

Terry Stop ID: Key identifying unique Terry Stop reports. 

Stop Resolution: Resolution of the stop as reported by the officer. 

Weapon Type: Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapon was found.

Officer ID: Key identifying unique officers in the dataset. 

Officer YOB: Year of birth, as reported by the officer. 

Officer Gender: Gender of the officer, as reported by the officer. 

Officer Race: Race of the officer, as reported by the officer. 

Subject Perceived Race: Perceived race of the subject, as reported by the officer. 

Subject Perceived Gender: Perceived gender of the subject, as reported by the officer. 

Reported Date: Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day. 

Reported Time: Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours. 

Initial Call Type: Initial classification of the call as assigned by 911. 

Final Call Type: Final classification of the call as assigned by the primary officer closing the event. 

Call Type: How the call was received by the communication center. 

Officer Squad: Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP). 

Arrest Flag: Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS). 

Frisk Flag: Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop. 

Precinct: Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. 

Sector: Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. 

Beat: Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

### Current questions for study group
- What should I use to fill the nulls for race, etc.?

In [3]:
# We will begin by importing all of the packages we anticipate using
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
import plotly.express as px
import math
import scipy.stats as stats
import missingno as ms

from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
linreg = LinearRegression()

In [8]:
# Setting the display defaults
pd.set_option('display.max_columns', 0)
# pd.set_option('display.max_rows',)

# Turning off scientific notation in pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [9]:
# Import the data and preview the first five rows
df = pd.read_csv(r"Terry_Stops.csv")
df.head()

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,-1,20140000120677,92317,Arrest,,7500,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
1,-,-1,20150000001463,28806,Field Contact,,5670,1965,M,White,-,-,2015-03-19T00:00:00,07:59:00,-,-,-,,N,N,-,-,-
2,-,-1,20150000001516,29599,Field Contact,,4844,1961,M,White,White,Male,2015-03-21T00:00:00,19:12:00,-,-,-,,N,-,-,-,-
3,-,-1,20150000001670,32260,Field Contact,,7539,1963,M,White,-,-,2015-04-01T00:00:00,04:55:00,-,-,-,,N,N,-,-,-
4,-,-1,20150000001739,33155,Field Contact,,6973,1977,M,White,Black or African American,Male,2015-04-03T00:00:00,00:41:00,-,-,-,,N,N,-,-,-
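
The preview shows `Reported Date` as an ISO timestamp with a zeroed time part and `Reported Time` as a separate `HH:MM:SS` string. Because the approach relies on the time of day of the call, the two need to be combined at some point. The following is a minimal sketch of one way to do that on a few rows copied from the preview; the derived column names `Reported Hour` and `Officer Age` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the head() output above; the real frame has 44,838 rows.
sample = pd.DataFrame({
    "Reported Date": ["2015-10-16T00:00:00", "2015-03-19T00:00:00", "2015-04-03T00:00:00"],
    "Reported Time": ["11:32:00", "07:59:00", "00:41:00"],
    "Officer YOB": [1984, 1965, 1977],
})

# Parse the date part, then add the time of day as a timedelta.
reported_date = pd.to_datetime(sample["Reported Date"])
reported_at = reported_date + pd.to_timedelta(sample["Reported Time"])

# Hypothetical derived features (the names are illustrative only).
sample["Reported Hour"] = reported_at.dt.hour
sample["Officer Age"] = reported_at.dt.year - sample["Officer YOB"]

print(sample[["Reported Hour", "Officer Age"]])
```

Keep in mind that, per the column descriptions, the reported time is when the report was filed, generally within 10 hours of the stop, so the hour is only a rough proxy for when the stop happened.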


In [10]:
# How big is this dataset?
df.shape

(44838, 23)

In [11]:
# Looking at our columns and seeing what data types they are
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44838 entries, 0 to 44837
Data columns (total 23 columns):
Subject Age Group           44838 non-null object
Subject ID                  44838 non-null int64
GO / SC Num                 44838 non-null int64
Terry Stop ID               44838 non-null int64
Stop Resolution             44838 non-null object
Weapon Type                 44838 non-null object
Officer ID                  44838 non-null object
Officer YOB                 44838 non-null int64
Officer Gender              44838 non-null object
Officer Race                44838 non-null object
Subject Perceived Race      44838 non-null object
Subject Perceived Gender    44838 non-null object
Reported Date               44838 non-null object
Reported Time               44838 non-null object
Initial Call Type           44838 non-null object
Final Call Type             44838 non-null object
Call Type                   44838 non-null object
Officer Squad               44258 non-null ob
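
`df.info()` reports almost no nulls, yet the preview above is full of `-` placeholders, so the missing data is hidden inside string columns. A minimal sketch of how those placeholders could be counted per column is shown below on a toy frame; the values are illustrative, and treating `-` as the only placeholder is an assumption.

```python
import pandas as pd

# Toy frame mimicking the placeholder pattern seen in head(); values are illustrative.
sample = pd.DataFrame({
    "Subject Age Group": ["-", "26 - 35", "-", "18 - 25"],
    "Subject Perceived Race": ["Asian", "-", "White", "-"],
    "Frisk Flag": ["N", "N", "-", "Y"],
    "Officer YOB": [1984, 1965, 1961, 1963],
})

# Exact-match comparison, so a range such as "26 - 35" is not counted.
is_placeholder = sample == "-"
placeholder_counts = is_placeholder.sum()
print(placeholder_counts)

# Optionally turn the placeholders into real NaN values so isna() can see them.
cleaned = sample.mask(is_placeholder)
print(cleaned.isna().sum())
```

On the real data the same two lines would run against `df` instead of `sample`.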

# Scrub

In [42]:
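# Check how often each subject appears before deciding whether to drop the column.
# Note: the KeyError below comes from re-running this cell after 'Subject ID'
# had already been dropped; run it before the drop to see the counts.
df['Subject ID'].value_counts()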

KeyError: 'Subject ID'

In [17]:
# Start by dropping the identifier columns we will not be using
# (columns= already selects the axis, so axis=1 is not needed)
df = df.drop(columns=['Subject ID', 'GO / SC Num', 'Terry Stop ID', 'Officer ID'])

In [18]:
df.head(5)

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
1,-,Field Contact,,1965,M,White,-,-,2015-03-19T00:00:00,07:59:00,-,-,-,,N,N,-,-,-
2,-,Field Contact,,1961,M,White,White,Male,2015-03-21T00:00:00,19:12:00,-,-,-,,N,-,-,-,-
3,-,Field Contact,,1963,M,White,-,-,2015-04-01T00:00:00,04:55:00,-,-,-,,N,N,-,-,-
4,-,Field Contact,,1977,M,White,Black or African American,Male,2015-04-03T00:00:00,00:41:00,-,-,-,,N,N,-,-,-


In [12]:
df['Subject Age Group'].value_counts()

26 - 35         14905
36 - 45          9460
18 - 25          9069
46 - 55          5768
56 and Above     2283
1 - 17           1935
-                1418
Name: Subject Age Group, dtype: int64

In [19]:
df['Stop Resolution'].value_counts()

Field Contact               17968
Offense Report              15124
Arrest                      10843
Referred for Prosecution      728
Citation / Infraction         175
Name: Stop Resolution, dtype: int64

In [20]:
df['Weapon Type'].value_counts()

None                                 32565
-                                     9671
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      496
Handgun                                281
Firearm Other                          100
Blunt Object/Striking Implement         66
Club, Blackjack, Brass Knuckles         49
Firearm                                 34
Mace/Pepper Spray                       20
Other Firearm                           18
Firearm (unk type)                      15
Club                                     9
Taser/Stun Gun                           7
None/Not Applicable                      7
Rifle                                    7
Fire/Incendiary Device                   4
Shotgun                                  3
Automatic Handgun                        2
Brass Knuckles                           1
Blackjack                                1
Name: Weapon Type, dtype: int64

In [None]:
# Treat the '-' placeholder and 'None/Not Applicable' as 'None' (no weapon recorded)
df['Weapon Type'] = df['Weapon Type'].replace(['-', 'None/Not Applicable'], 'None')

In [None]:
# Checking to make sure it worked
df['Weapon Type'].value_counts()
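
Even after the placeholders are folded into `None`, the remaining labels overlap heavily (for example `Firearm`, `Firearm Other`, `Other Firearm` and `Firearm (unk type)`), and several occur only a handful of times. One option is to collapse them into a few coarse groups before encoding. The sketch below uses the labels from the counts above; the grouping itself and the group names (`Blade`, `Firearm`, `Blunt`, `Other`) are my assumptions and would need to be checked against the project's goals.

```python
import pandas as pd

# Hypothetical grouping of the raw labels; anything unlisted falls into "Other".
WEAPON_GROUPS = {
    "None": "None", "-": "None", "None/Not Applicable": "None",
    "Lethal Cutting Instrument": "Blade",
    "Knife/Cutting/Stabbing Instrument": "Blade",
    "Handgun": "Firearm", "Firearm Other": "Firearm", "Firearm": "Firearm",
    "Other Firearm": "Firearm", "Firearm (unk type)": "Firearm",
    "Rifle": "Firearm", "Shotgun": "Firearm", "Automatic Handgun": "Firearm",
    "Blunt Object/Striking Implement": "Blunt",
    "Club, Blackjack, Brass Knuckles": "Blunt", "Club": "Blunt",
    "Brass Knuckles": "Blunt", "Blackjack": "Blunt",
}

# A few labels from the output above, used as toy input.
weapons = pd.Series([
    "None", "-", "Lethal Cutting Instrument", "Handgun",
    "Firearm Other", "Club", "Mace/Pepper Spray", "Taser/Stun Gun",
])

# map() leaves unlisted labels as NaN, which fillna() then sends to "Other".
grouped = weapons.map(WEAPON_GROUPS).fillna("Other")
print(grouped.value_counts())
```

An explicit dictionary is used instead of keyword matching because a substring rule such as "gun" would wrongly pull `Taser/Stun Gun` into the firearm group.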

In [21]:
df['Officer Gender'].value_counts()

M    39709
F     5100
N       29
Name: Officer Gender, dtype: int64

In [22]:
df['Officer Race'].value_counts()

White                            34109
Hispanic or Latino                2547
Two or More Races                 2487
Asian                             1850
Black or African American         1793
Not Specified                     1244
Nat Hawaiian/Oth Pac Islander      437
American Indian/Alaska Native      311
Unknown                             60
Name: Officer Race, dtype: int64

In [23]:
df['Subject Perceived Race'].value_counts()

White                                        21917
Black or African American                    13359
Unknown                                       2381
-                                             1762
Hispanic                                      1684
Asian                                         1429
American Indian or Alaska Native              1303
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       42
Name: Subject Perceived Race, dtype: int64
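
Both race columns carry more than one label for "no information": `Subject Perceived Race` has `Unknown` and `-`, and `Officer Race` has `Not Specified` and `Unknown`. Rather than imputing a race, one possible approach is to fold these into a single explicit `Unknown` category so the model can treat missingness as its own level. The following is a sketch of that option only; which labels count as missing is an assumption.

```python
import pandas as pd

# Toy values using labels from the two value_counts() outputs above.
sample = pd.DataFrame({
    "Officer Race": ["White", "Not Specified", "Unknown", "Asian"],
    "Subject Perceived Race": ["-", "White", "Unknown", "Other"],
})

# Labels treated as "no information"; this list is an assumption.
MISSING_LABELS = ["-", "Unknown", "Not Specified"]

for column in ["Officer Race", "Subject Perceived Race"]:
    sample[column] = sample[column].replace(MISSING_LABELS, "Unknown")

print(sample)
```

Note that `Other` is left alone here, since it describes a reported race rather than a missing one.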

In [28]:
df['Initial Call Type'].value_counts()

-                                                 13038
SUSPICIOUS STOP - OFFICER INITIATED ONVIEW         2920
SUSPICIOUS PERSON, VEHICLE OR INCIDENT             2814
DISTURBANCE, MISCELLANEOUS/OTHER                   2302
ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS)     1890
                                                  ...  
-ASSIGNED DUTY - STAKEOUT                             1
WARRANT PICKUP - FROM OTHER AGENCY                    1
VICE - PORNOGRAPHY                                    1
INJURED -  PERSON/INDUSTRIAL ACCIDENT                 1
MISSING - (ALZHEIMER, ENDANGERED, ELDERLY)            1
Name: Initial Call Type, Length: 165, dtype: int64

In [30]:
df['Final Call Type'].value_counts()

-                                             13038
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON       3481
--PROWLER - TRESPASS                           3126
--DISTURBANCE - OTHER                          2549
--ASSAULTS, OTHER                              2178
                                              ...  
DOWN - CHECK FOR DOWN PERSON                      1
--CROWD MGMNT (STAND BY ONLY)                     1
THEFT OF SERVICES                                 1
FIGHT - VERBAL/ORAL (NO WEAPONS)                  1
BIAS -RACIAL, POLITICAL, SEXUAL MOTIVATION        1
Name: Final Call Type, Length: 204, dtype: int64
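
`Initial Call Type` has 165 distinct values and `Final Call Type` has 204, many of which occur only once. One-hot encoding them as they are would add hundreds of very sparse columns. A common workaround is to keep the frequent categories and lump the rest together. The helper below is a sketch of that idea; the name `lump_rare`, the `Other` label and the threshold are hypothetical choices.

```python
import pandas as pd

def lump_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # where() keeps values for which the condition holds and replaces the rest.
    return series.where(series.isin(keep), other_label)

# Toy input built from labels shown in the output above.
calls = pd.Series([
    "DISTURBANCE, MISCELLANEOUS/OTHER",
    "DISTURBANCE, MISCELLANEOUS/OTHER",
    "SUSPICIOUS PERSON, VEHICLE OR INCIDENT",
    "SUSPICIOUS PERSON, VEHICLE OR INCIDENT",
    "VICE - PORNOGRAPHY",
    "WARRANT PICKUP - FROM OTHER AGENCY",
])

lumped = lump_rare(calls, min_count=2)
print(lumped.value_counts())
```

On the full data a much larger `min_count` would be needed; the right value depends on how many levels the model can reasonably handle.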

In [29]:
df['Call Type'].value_counts()

911                              19891
-                                13038
ONVIEW                            8493
TELEPHONE OTHER, NOT 911          3117
ALARM CALL (NOT POLICE ALARM)      292
TEXT MESSAGE                         6
SCHEDULED EVENT (RECURRING)          1
Name: Call Type, dtype: int64

In [31]:
df['Officer Squad'].value_counts()

TRAINING - FIELD TRAINING SQUAD        4756
WEST PCT 1ST W - DAVID/MARY            1470
WEST PCT 2ND W - D/M RELIEF             972
SOUTHWEST PCT 2ND W - FRANK             895
NORTH PCT 2ND WATCH - NORTH BEATS       885
                                       ... 
ZOLD CRIME ANALYSIS UNIT - ANALYSTS       1
VICE - GENERAL INVESTIGATIONS SQUAD       1
RECORDS - DAY SHIFT                       1
DV SQUAD D - ORDER SERVICE                1
BURG/THEFT/JUV - NORTH                    1
Name: Officer Squad, Length: 168, dtype: int64

In [32]:
df['Arrest Flag'].value_counts()

N    42203
Y     2635
Name: Arrest Flag, dtype: int64

In [34]:
df['Frisk Flag'].value_counts()

N    34388
Y     9972
-      478
Name: Frisk Flag, dtype: int64

In [35]:
df['Precinct'].value_counts()

West         10578
North         9867
-             9732
East          5931
South         5349
Southwest     2320
SouthWest      816
Unknown        200
OOJ             30
FK ERROR        15
Name: Precinct, dtype: int64

In [36]:
df['Sector'].value_counts()

-         9930
E         2337
M         2270
N         2191
K         1762
B         1658
L         1639
D         1512
K         1470
R         1455
F         1378
S         1348
U         1302
O         1161
J         1119
G         1087
C         1037
M         1027
Q          967
D          957
W          941
E          801
Q          609
N          574
O          497
F          486
R          473
S          416
B          388
G          370
U          367
J          337
W          330
C          298
L          291
99          53
Name: Sector, dtype: int64

In [37]:
df['Beat'].value_counts()

-         9877
N3        1175
E2        1092
M2         852
K3         816
          ... 
C2          63
99          53
99          27
OOJ         20
S            2
Name: Beat, Length: 107, dtype: int64
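
The location columns list what looks like the same label more than once: `Precinct` has both `Southwest` and `SouthWest`, `Sector` shows letters such as `K` and `M` twice, and `Beat` shows `99` twice. A likely cause is stray whitespace and inconsistent casing in the raw strings, although that is an assumption, since the raw values are not visible in the output. A minimal sketch of normalizing them:

```python
import pandas as pd

# Toy values; the trailing spaces are an assumed cause of the duplicate labels.
sample = pd.DataFrame({
    "Precinct": ["Southwest", "SouthWest", "West", "West "],
    "Sector": ["K", "K     ", "M", "M     "],
})

# Strip surrounding whitespace so visually identical labels compare equal.
sample["Sector"] = sample["Sector"].str.strip()
sample["Precinct"] = sample["Precinct"].str.strip()

# Fix the one known casing variant explicitly rather than re-casing every label.
sample["Precinct"] = sample["Precinct"].replace({"SouthWest": "Southwest"})

print(sample["Precinct"].value_counts())
print(sample["Sector"].value_counts())
```

The explicit replacement is used instead of `str.title()` because title-casing would also change labels such as `OOJ` and `FK ERROR`.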

In [None]:
# Now that we better understand the data we can start cleaning specific columns

In [39]:
# Our target column will be Stop Resolution.
# For binary classification, encode 'Arrest' as 1.0 and every other resolution as 0.0.
df['Stop Resolution'] = df['Stop Resolution'].replace('Arrest', 1.0)
df['Stop Resolution'] = df['Stop Resolution'].map(lambda x: 0.0 if (x != 1.0) else 1.0)

In [40]:
# Let's make sure it worked
df['Stop Resolution'].value_counts()

0.000    33995
1.000    10843
Name: Stop Resolution, dtype: int64
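
The two classes are not balanced, which matters when judging a classifier later on. As a rough sketch, the class shares and the accuracy of always predicting the majority class can be computed as follows. The series is rebuilt from the counts printed above so the example runs on its own; on the real data `df['Stop Resolution']` would be used instead.

```python
import pandas as pd

# Rebuild the target from the counts printed above (33,995 zeros and 10,843 ones).
target = pd.Series([0.0] * 33995 + [1.0] * 10843, name="Stop Resolution")

# Share of each class; normalize=True returns proportions instead of counts.
class_share = target.value_counts(normalize=True)

# Accuracy of always predicting the majority class, a floor for later models.
baseline_accuracy = class_share.max()

print(class_share.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model whose accuracy does not clearly beat this baseline is not learning much, so metrics such as recall or ROC AUC are worth reporting alongside accuracy.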