## Seattle Terry Stops Final Project Submission

* Student name: Rebecca Mih
* Student pace: Part Time Online
* Scheduled project review date/time: 
* Instructor name: James Irving
* Blog post URL: 


* Data Source:  https://www.kaggle.com/city-of-seattle/seattle-terry-stops

* Date of last update to the datasource: April 15, 2020


* Key references:
* https://assets.documentcloud.org/documents/6136893/SPDs-2019-Annual-Report-on-Stops-and-Detentions.pdf

* https://www.seattletimes.com/seattle-news/crime/federal-monitor-finds-seattle-police-are-conducting-proper-stops-and-frisks/

<div>
<img src="Terry Stops Sagepub.com.png"
           width="200"/>
</div>


## Background

https://caselaw.findlaw.com/us-supreme-court/392/1.html

This data represents records of police reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop.

 A Terry stop is a seizure under both state and federal law. A Terry stop is
defined in policy as a brief, minimally intrusive seizure of a subject based upon
**articulable reasonable suspicion (ARS) in order to investigate possible criminal activity.**
The stop can apply to people as well as to vehicles. The subject of a Terry stop is
**not** free to leave.

Section 6.220 of the Seattle Police Department (SPD) Manual defines Reasonable Suspicion as:
Specific, objective, articulable facts which, taken together with rational inferences, would
create a  **well-founded suspicion that there is a substantial possibility that a subject has
engaged, is engaging or is about to engage in criminal conduct.**

- Each record contains perceived demographics of the subject, as reported by the officer making the stop and officer demographics as reported to the Seattle Police Department, for employment purposes.
- Where available, data elements from the associated Computer Aided Dispatch (CAD) event (e.g. Call Type, Initial Call Type, Final Call Type) are included.


## Notes on Concealed Weapons in the State of Washington

WHAT ARE WASHINGTON’S CONCEALED CARRY LAWS?
Open carry of a firearm is lawful without a permit in the state of Washington except, according to the law, “under circumstances, and at a time and place that either manifests an intent to intimidate another or that warrants alarm for the safety of other persons.”

However, open carry of a loaded handgun in a vehicle is legal only with a concealed pistol license. Open carry of a loaded long gun in a vehicle is illegal.

The criminal charge of “carrying a concealed firearm” happens in this state when someone carries a concealed firearm without a concealed pistol license. It does not matter if the weapon was discovered in the defendant’s home, vehicle, or on his or her person.

## Objectives
### Target:

   * Identify Terry Stops which lead to Arrest or Prosecution (Binary Classification)
    
### Features:
   * Location (Precinct)
   * Day of the Week (Date)
   * Shift (Time)
   * Initial Call Type
   * Final Call Type
   * Stop Resolution
   * Weapon type
   * Officer Squad
   * Age of officer
   * Age of detainee
    
    
### Optional Features:
   * Race of officer
   * Race of detainee
   * Gender of officer
   * Gender of detainee
    
   

## Definition of Features Provided

Column Names and descriptions provided in the SPD dataset  <br>
* **Subject Age Group**	
Subject Age Group (10 year increments) as reported by the officer. <br><br>

* **Subject ID**	
Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification.  **Not Used** <br><br>

* **GO / SC Num**
General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data. **Not Used** <br><br>

* **Terry Stop ID**
Key identifying unique Terry Stop reports.  **Not Used**
<br><br>

* **Stop Resolution**
Resolution of the stop. **One hot encoding** <br><br>

* **Weapon Type**	
Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapon was found.  <br><br>

* **Officer ID**	
Key identifying unique officers in the dataset.
**Not Used** <br><br>

* **Officer YOB**	
Year of birth, as reported by the officer.  <br><br>

* **Officer Gender**	
Gender of the officer, as reported by the officer.
 <br><br>

* **Officer Race**	
Race of the officer, as reported by the officer. <br><br>

* **Subject Perceived Race**	
Perceived race of the subject, as reported by the officer. <br><br>

* **Subject Perceived Gender**	
Perceived gender of the subject, as reported by the officer. <br><br>

* **Reported Date**	
Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.  <br><br>

* **Reported Time**	
Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.  <br><br>

* **Initial Call Type**	
Initial classification of the call as assigned by 911.  <br><br>

* **Final Call Type**	
Final classification of the call as assigned by the primary officer closing the event.  <br><br>

* **Call Type**	
How the call was received by the communication center. <br><br>

* **Officer Squad**	
Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP). <br><br>

* **Arrest Flag**	
Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS). <br><br>

* **Frisk Flag**	
Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop. <br><br>

* **Precinct**	
Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

* **Sector**	
Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

* **Beat**	
Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

## Analysis Workflow (OSEMN)

1. **Obtain and Pre-process**
    - [x] Import data
    - [x] Remove unused columns
    - [x] Check data size, NaNs, and # of non-null values which are not valid data 
    - [x] Clean up missing values by imputing values or dropping
    - [x] Replace ? or other non-valid data by imputing values or dropping data
    - [x] Check for duplicates and remove if appropriate
    - [x] Change datatypes of columns as appropriate 
    - [x] Note which features are continuous and which are categorical<br><br>

2. **Data Scoping**
     - [x] Use value_counts() to identify dummy categories such as "-", or "?" for later re-mapping
     - [x] Identify most common word data
     - [x] Decide on which columns (features) to keep for further feature engineering
   
3. **Transformation of data (Feature Engineering)**
    - [x] Re-bin categories to reduce noise
    - [x] Re-map categories as needed
    - [x] Engineer text data to extract common word information
    - [x] Transform categoricals using 1-hot encoding or label encoding
    - [x] Perform log transformations on continuous variables (if applicable)
    - [x] Normalize continuous variables
    - [x] Use re-sampling if needed to balance the dataset <br> <br>
    
4. **Further Feature Selection**
     - [x] Use .describe() and .hist() histograms
     - [x] Identify outliers (based on auto-scaling of plots) and remove or impute as needed
     - [x] Perform visualizations on key features to understand  
     - [x] Inspect feature correlations (Pearson correlation) to identify collinear features<br><br>

5.  **Create a Vanilla Machine Learning Model**
    - [x] Split into train and test data 
    - [x] Run the model
    - [x] Review Quality indicators of the model <br><br>

6. **Run more advanced models**
    - [x] Compare the model quality
    - [x] Choose one or more models for grid searching <br><br>
    
7. **Revise data inputs if needed to improve quality indicators**
    - [x] By adding created features, and removing collinear features
    - [x] By improving unbalanced datasets through oversampling or undersampling
    - [x] By removing outliers through filters
    - [x] Through use of subject matter knowledge <br><br>
    
8. **Write the Report**
    - [X] Explain key findings and recommended next steps



## 1. Obtain and Pre-Process the Data

1. **Obtain and Pre-process**
    - [x] Import data
    - [x] Remove unused columns
    - [x] Check data size, NaNs, and # of non-null values which are not valid data 
    - [x] Clean up missing values by imputing values or dropping
    - [x] Replace ? or other non-valid data by imputing values or dropping data
    - [x] Check for duplicates and remove if appropriate
    - [x] Change datatypes of columns as appropriate 
    - [x] Decide the target column, if not already decided
    - [x] Determine if some data is not relevant to the question (drop columns or rows)
    - [x] Note which features will need to be re-mapped or encoded 
    - [x] Note which features might require feature engineering (example - date, time) <br><br>

In [1]:
#!pip install -U fsds_100719
from fsds_100719.imports import *
#import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns
import copy
import sklearn
import math
import datetime
pd.options.display.float_format = '{:.1f}'.format
pd.set_option('display.max_columns',0)
pd.set_option('display.max_info_rows',200)
%matplotlib inline


fsds_1007219  v0.6.4 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds_100719,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


In [2]:
def feature_plot(tree, top_n=20,figsize=(10,10)):
    importance_df = pd.Series(tree.feature_importances_,index=X_train.columns)
    importance_df.sort_values(ascending=True).tail(top_n).plot(
        kind='barh',figsize=figsize)
    return importance_df

## Write a function to evaluate the model
import sklearn.metrics as metrics

def evaluate_model(y_true, y_pred,X_true,clf,cm_kws=dict(cmap="Greens",
                                  normalize='true'),figsize=(10,4),plot_roc_auc=True):
    
    ## Classification Report / Scores 
    print(metrics.classification_report(y_true,y_pred))

    if plot_roc_auc:
        num_cols=2
    else:
        num_cols=1
        
    fig, ax = plt.subplots(figsize=figsize,ncols=num_cols)
    
    if not isinstance(ax,np.ndarray):
        ax=[ax]
    metrics.plot_confusion_matrix(clf,X_true,y_true,ax=ax[0],**cm_kws)
    ax[0].set(title='Confusion Matrix')
    
    if plot_roc_auc:
        try:
            y_score = clf.predict_proba(X_true)[:,1]

            fpr,tpr,thresh = metrics.roc_curve(y_true,y_score)
            # print(f"ROC-area-under-the-curve= {}")
            roc_auc = round(metrics.auc(fpr,tpr),3)
            ax[1].plot(fpr,tpr,color='darkorange',label=f'ROC Curve (AUC={roc_auc})')
            ax[1].plot([0,1],[0,1],ls=':')
            ax[1].legend()
            ax[1].grid()
            ax[1].set(ylabel='True Positive Rate',xlabel='False Positive Rate',
                  title='Receiver operating characteristic (ROC) Curve')
            plt.tight_layout()
            plt.show()
        except:
            pass
    try: 
        df_important = plot_importance(clf)
    except:
        df_important = None
    
#     return df_important
## visualize the decision tree
def visualize_tree(tree,feature_names=None,class_names=['0','1'],
                   kws={},save_filename=None,format_='png',save_and_show=False):
    """Visualizes a sklearn tree using sklearn.tree.export_graphviz"""
    from sklearn.tree import export_graphviz
    from IPython.display import SVG
    import graphviz #import Source
    from IPython.display import display
    
    if feature_names is None:
        feature_names=X_train.columns

    tree_viz_kws =  dict(out_file=None,rounded=True, rotate=False, filled = True)
    tree_viz_kws.update(kws)

    # tree.export_graphviz(dt) #if you wish to save the output to a dot file instead
    tree_data=export_graphviz(tree,feature_names=feature_names, 
                                   class_names=class_names,**tree_viz_kws)
    graph = graphviz.Source(tree_data,format=format_)#'png')
    
    if save_filename is not None:
        graph.render(save_filename)
        if save_and_show:
            display(graph)
        else:
            print(f'[i] Tree saved as {save_filename}.{format_}')
    else:
        display(graph)


In [3]:
df = pd.read_csv('Terry_Stops.csv',low_memory=False)
df.duplicated().sum()


0

In [4]:
df.head()

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,-1,20140000120677,92317,Arrest,,7500,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
1,-,-1,20150000001670,32260,Field Contact,,7539,1963,M,White,-,-,2015-04-01T00:00:00,04:55:00,-,-,-,,N,N,-,-,-
2,-,-1,20150000002451,46430,Field Contact,,7591,1985,M,Hispanic or Latino,-,-,2015-05-25T00:00:00,01:06:00,-,-,-,WEST PCT 3RD W - MARY,N,N,-,-,-
3,-,-1,20150000002815,51725,Field Contact,,7456,1979,M,White,-,-,2015-06-09T00:00:00,19:27:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-
4,-,-1,20150000002815,51727,Field Contact,,7456,1979,M,White,-,-,2015-06-09T00:00:00,19:32:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-


* Drop Columns which contain IDs, which are not useful features.

In [5]:
df.drop(columns = ['Subject ID', 'GO / SC Num', 'Terry Stop ID', 'Officer ID'], inplace=True)

In [6]:
df.duplicated().sum()
# After dropping some of the columns, some rows appear to be duplicated.
# However, since the date and time of the incident are NOT exact (i.e. the date could be 24 hours later, and the
# time could be 10 hours later), it's possible to get some that are similar on different consecutive dates.

112

In [7]:
df.columns

Index(['Subject Age Group', 'Stop Resolution', 'Weapon Type', 'Officer YOB',
       'Officer Gender', 'Officer Race', 'Subject Perceived Race',
       'Subject Perceived Gender', 'Reported Date', 'Reported Time',
       'Initial Call Type', 'Final Call Type', 'Call Type', 'Officer Squad',
       'Arrest Flag', 'Frisk Flag', 'Precinct', 'Sector', 'Beat'],
      dtype='object')

In [8]:
col_names = df.columns
print(col_names)

Index(['Subject Age Group', 'Stop Resolution', 'Weapon Type', 'Officer YOB',
       'Officer Gender', 'Officer Race', 'Subject Perceived Race',
       'Subject Perceived Gender', 'Reported Date', 'Reported Time',
       'Initial Call Type', 'Final Call Type', 'Call Type', 'Officer Squad',
       'Arrest Flag', 'Frisk Flag', 'Precinct', 'Sector', 'Beat'],
      dtype='object')


In [9]:
df.shape

# The rationale for checking the shape is to understand how big the dataset is and how many features it contains.
# This helps with planning for function vs lambda functions, and whether certain kinds of visualizations will be
# feasible for the analysis (with my computer hardware). With compute limitations, some types of correlation plots
# cause the kernel to die if there are more than 11 features.

(41104, 19)

* df.isna().sum()

isna().sum() counts the missing values in each feature

* df.info() 

df.info() helps you determine whether there are missing values or datatypes that need to be modified

* **Handy alternate checks if needed**
    - [x] df.isna().any()
    - [x] df.isnull().any()
    - [x] df.shape
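
These checks can be combined into a single per-column table. The sketch below is illustrative only: `missing_summary` and the toy frame are hypothetical names, and the placeholder strings (`"-"`, `"?"`) are an assumption based on the symbols this notebook looks for.

```python
import pandas as pd

def missing_summary(frame, placeholders=("-", "?")):
    """Count NaNs and placeholder strings per column (hypothetical helper)."""
    n_nan = frame.isna().sum()
    # isin() matches whole cell values only, so "A - B" is not counted as "-"
    n_placeholder = frame.isin(list(placeholders)).sum()
    summary = pd.DataFrame({"n_nan": n_nan, "n_placeholder": n_placeholder})
    summary["pct_invalid"] = (
        100 * (summary["n_nan"] + summary["n_placeholder"]) / len(frame)
    )
    return summary

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Officer Squad": ["SOUTH PCT 1ST W - ROBERT", None, "WEST PCT 3RD W - MARY"],
    "Precinct": ["South", "-", "-"],
})
summary = missing_summary(toy)
print(summary)
```

Counting placeholders alongside NaNs matters here because `isna().sum()` alone reports zero for columns whose missing values are stored as `"-"`.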

In [10]:
df.isna().sum()


Subject Age Group             0
Stop Resolution               0
Weapon Type                   0
Officer YOB                   0
Officer Gender                0
Officer Race                  0
Subject Perceived Race        0
Subject Perceived Gender      0
Reported Date                 0
Reported Time                 0
Initial Call Type             0
Final Call Type               0
Call Type                     0
Officer Squad               535
Arrest Flag                   0
Frisk Flag                    0
Precinct                      0
Sector                        0
Beat                          0
dtype: int64

In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Officer Squad'] = df['Officer Squad'].fillna('Unknown')

**Findings from isna().sum()**
* Officer Squad has 535 missing values (1.3% of the data)
    * Impute "Unknown"

In [12]:
df.isna().sum()

Subject Age Group           0
Stop Resolution             0
Weapon Type                 0
Officer YOB                 0
Officer Gender              0
Officer Race                0
Subject Perceived Race      0
Subject Perceived Gender    0
Reported Date               0
Reported Time               0
Initial Call Type           0
Final Call Type             0
Call Type                   0
Officer Squad               0
Arrest Flag                 0
Frisk Flag                  0
Precinct                    0
Sector                      0
Beat                        0
dtype: int64

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41104 entries, 0 to 41103
Data columns (total 19 columns):
Subject Age Group           object
Stop Resolution             object
Weapon Type                 object
Officer YOB                 int64
Officer Gender              object
Officer Race                object
Subject Perceived Race      object
Subject Perceived Gender    object
Reported Date               object
Reported Time               object
Initial Call Type           object
Final Call Type             object
Call Type                   object
Officer Squad               object
Arrest Flag                 object
Frisk Flag                  object
Precinct                    object
Sector                      object
Beat                        object
dtypes: int64(1), object(18)
memory usage: 6.0+ MB


In [14]:
df.duplicated().sum()

112

In [15]:
duplicates = df[df.duplicated(keep = False)]
#duplicates.head(118)

#### Use value_counts() - inspect for dummy variables, and determine next steps for data cleaning

1. Rationale:  This analysis is useful for flushing out missing values in the form of question marks, dashes or other symbols or dummy variables <br><br>

2.  It also gives a preliminary view of the number and distribution of categories in each feature, albeit by numbers rather than graphics <br><br>

3. For text data, value_counts serves as a preliminary investigation of the common important word data <br><br>


In [16]:
for col in df.columns:
    print(col, '\n', df[col].value_counts(), '\n')

Subject Age Group 
 26 - 35         13615
36 - 45          8547
18 - 25          8509
46 - 55          5274
56 and Above     1996
1 - 17           1876
-                1287
Name: Subject Age Group, dtype: int64 

Stop Resolution 
 Field Contact               16287
Offense Report              13976
Arrest                       9957
Referred for Prosecution      728
Citation / Infraction         156
Name: Stop Resolution, dtype: int64 

Weapon Type 
 None                                 32565
-                                     6213
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      308
Handgun                                262
Firearm Other                          100
Club, Blackjack, Brass Knuckles         49
Blunt Object/Striking Implement         37
Firearm                                 18
Firearm (unk type)                      15
Other Firearm                           13
Mace/Pepper Spray                       12
Club                          

####  Findings from value_counts() and Next Steps:

1. The "-" is used as a substitute for unknown, in many cases.  Perhaps it would be good to build a function to impute "unknown" for the "-" for multiple features
2. Race and gender need re-mapping
3. Call Types, Weapons need re-binning
4. Officer Squad text can be split and provide the precinct, and the watch.
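
The function suggested in finding 1 might look like the sketch below. It is not the code used later in this notebook: `replace_placeholders` is a hypothetical name, and treating `"-"` as "Unknown" in every listed column is an assumption that may not suit columns such as Weapon Type, where re-binning is planned instead.

```python
import pandas as pd

def replace_placeholders(frame, columns, placeholder="-", value="Unknown"):
    """Return a copy with `placeholder` swapped for `value` in the given columns."""
    out = frame.copy()
    # DataFrame.replace matches whole cell values, not substrings
    out[columns] = out[columns].replace(placeholder, value)
    return out

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Precinct": ["South", "-", "West"],
    "Call Type": ["911", "-", "ONVIEW"],
    "Officer Squad": ["WEST PCT 3RD W - MARY", "Unknown", "SOUTH PCT 1ST W - SAM"],
})
cleaned = replace_placeholders(toy, ["Precinct", "Call Type"])
print(cleaned)
```

Returning a copy keeps the original frame available for comparison while the mapping is being checked.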

**Next steps:**
- [x] Investigation of the Stop Resolution, to determine whether the target should be "Stop Resolution - Arrests" or "Arrest Flag", and whether "Frisk Flag" is useful for predicting arrests.

- [x] Decide whether time and location information can be extracted from the "Officer Squad" column instead of the columns for time, Precinct, Sector and Beats
 
    
    

In [17]:
# Viewing the data to get a sense of which Stop Resolutions are correlated to the "Arrest Flag"
df.sort_values(by=['Stop Resolution'], ascending=True).head(100)


Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
32174,36 - 45,Arrest,,1988,M,White,Multi-Racial,Male,2019-04-13T00:00:00,11:35:00,THREATS - DV - NO ASSAULT,"--ASSAULTS - HARASSMENT, THREATS",911,SOUTH PCT 1ST W - SAM,N,N,South,S,S1
32172,36 - 45,Arrest,,1991,M,White,Black or African American,Female,2019-04-09T00:00:00,01:13:00,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS, OTHER",911,WEST PCT 2ND W - KING,N,N,West,K,K3
7804,18 - 25,Arrest,,1990,M,Hispanic or Latino,Black or African American,Male,2017-08-15T00:00:00,20:36:00,TRAFFIC STOP - OFFICER INITIATED ONVIEW,--TRAFFIC - REFUSE TO STOP (PURSUIT),ONVIEW,WEST PCT 2ND W - D/M RELIEF,N,N,West,D,D2
32171,36 - 45,Arrest,,1986,M,White,White,Male,2019-04-08T00:00:00,22:03:00,BURG - COMM BURGLARY (INCLUDES SCHOOLS),--ROBBERY - STRONG ARM,911,WEST PCT 3RD W - QUEEN,N,N,West,K,K1
32170,36 - 45,Arrest,Firearm Other,1986,M,White,Black or African American,Male,2019-04-07T00:00:00,17:47:00,"WEAPN-IP/JO-GUN,DEADLY WPN (NO THRT/ASLT/DIST)","--WEAPON, PERSON WITH - GUN",911,EAST PCT 2ND W - CHARLIE RELIEF,N,Y,East,G,G3
32165,36 - 45,Arrest,,1982,M,White,Black or African American,Male,2019-03-27T00:00:00,19:03:00,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--SUSPICIOUS CIRCUM. - SUSPICIOUS VEHICLE,ONVIEW,SOUTH PCT 2ND W - OCEAN,N,N,South,O,O1
32163,36 - 45,Arrest,,1988,M,White,White,Male,2019-03-23T00:00:00,16:28:00,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,--NARCOTICS - OTHER,ONVIEW,SOUTH PCT OPS - DAY ACT,N,N,South,O,O1
7809,18 - 25,Arrest,,1984,M,White,White,Male,2017-08-10T00:00:00,15:10:00,"SUSPICIOUS PERSON, VEHICLE OR INCIDENT",--WARRANT SERVICES - MISDEMEANOR,911,EAST PCT 2ND W - E/G RELIEF,N,N,East,E,E1
32162,36 - 45,Arrest,,1991,M,White,White,Male,2019-03-23T00:00:00,08:18:00,SFD - ASSIST ON FIRE OR MEDIC RESPONSE,--WARRANT SERVICES - MISDEMEANOR,"TELEPHONE OTHER, NOT 911",TRAINING - FIELD TRAINING SQUAD,N,N,South,O,O1


In [18]:
# Check out what are the differences between a Stop Resolution of "Arrest" and the "Arrest Flag" 
df.loc[(df['Stop Resolution']=='Arrest') & (df['Arrest Flag']=="N")].shape

# This is the number of cases where the final stop resolution as reported by the officer, was "Arrest" and the
# Arrest Flag was N.  This indicates that many arrests are finalized after the actual Terry Stop

(8210, 19)

In [19]:
df.loc[(df['Stop Resolution']!='Arrest') & (df['Arrest Flag']=="Y")].shape

# Number of times an arrest was not made,  but the arrest flag was yes (an arrest was made during the Terry Stop)

(2, 19)

In [20]:
df.loc[(df['Stop Resolution']=='Arrest') & (df['Arrest Flag']=="Y")].shape

# These are the number of arrests DURING the Terry stop,  that had a final resolution of arrest

# Conclusion:  Use the Stop Resolution of Arrest to capture all the arrests made arising from a Terry stop
# The total number of arrests as reported by the officers is 8210 + 1747 = 9957, or ~ 25% of the total # of Terry stops

(1747, 19)
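
The three `.loc` checks above can also be read off a single cross-tabulation. This is a minimal sketch on toy data, not output from the real dataset; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Arrest", "Field Contact", "Offense Report", "Arrest"],
    "Arrest Flag": ["N", "Y", "N", "N", "N"],
})

# Rows: Stop Resolution, columns: Arrest Flag, with row and column totals
table = pd.crosstab(toy["Stop Resolution"], toy["Arrest Flag"], margins=True)
print(table)
```

On the real frame, the "Arrest" row of such a table would hold the two counts found above (Arrest Flag "N" and "Y"), and the "All" column their sum.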

In [21]:
# Check to see whether the Frisk Flag has usefulness
df.loc[(df['Stop Resolution']=='Arrest') & (df['Frisk Flag']=="Y")].shape

# Out of ~10,000 arrests (and ~9,000 frisks), roughly 30% of those arrested were frisked.
# It would appear that the 'Frisk Flag' is not helpful for predicting arrests.  Drop the 'Frisk Flag'

(3235, 19)

In [22]:
# Check whether 'Call Type' has usefulness
df.loc[(df['Stop Resolution']=='Arrest') & (df['Call Type']=="911")].shape

# Out of ~10,000 arrests roughly 59% came through 911.  Doesn't appear to be particularly useful for predicting arrests
# Drop the 'Call Type'

(5888, 19)
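
The checks above measure what share of arrests involved a frisk or a 911 call. A complementary view, which this notebook does not use, is the arrest rate within each category of the candidate feature. The sketch below shows the idea on toy data, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Field Contact", "Arrest", "Offense Report",
                        "Field Contact", "Field Contact"],
    "Frisk Flag": ["Y", "Y", "N", "N", "N", "N"],
})

# Fraction of stops ending in arrest, split by whether a frisk took place
is_arrest = toy["Stop Resolution"].eq("Arrest")
arrest_rate = is_arrest.groupby(toy["Frisk Flag"]).mean()
print(arrest_rate)
```

If the rates are similar across categories, that supports dropping the feature; a large gap would suggest it carries some signal.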

In [23]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
1,-,Field Contact,,1963,M,White,-,-,2015-04-01T00:00:00,04:55:00,-,-,-,Unknown,N,N,-,-,-
2,-,Field Contact,,1985,M,Hispanic or Latino,-,-,2015-05-25T00:00:00,01:06:00,-,-,-,WEST PCT 3RD W - MARY,N,N,-,-,-
3,-,Field Contact,,1979,M,White,-,-,2015-06-09T00:00:00,19:27:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-
4,-,Field Contact,,1979,M,White,-,-,2015-06-09T00:00:00,19:32:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-


## 2. Data Scoping 

1. Which is better to use as the target: the "Arrest Flag" column or the "Stop Resolution" column? <br><br>

* Arrest Flag is "Y" only when there was an actual arrest during the Terry Stop, which may not be easy to do, resulting in a lower number (1747). <br><br>

* Stop Resolution records ~10,000 arrests, roughly 25% of the total dataset. Since Stop Resolution is about officers recording the resolution of the Terry Stop, and there is likely a performance target for officers, they are likely to record this more accurately. <br><br>

* A quick check of "Frisk Flag", which indicates the Terry stops where a frisk was performed, does not seem well correlated with arrests. Recommend dropping "Frisk Flag". <br><br>

#### Conclusion: Use "Stop Resolution" Arrests as the target

  - [x] Create a new column called "Arrests" which encodes Stop Resolution Arrests as a "1" and all others "0".  
  - [x] Drop the "Arrest Flag" column
  - [x] Drop the "Frisk Flag" column <br><br>
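
A minimal sketch of the target encoding described in the first checklist item, on toy data. The column name "Arrests" follows the checklist; everything else is illustrative.

```python
import pandas as pd

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Field Contact", "Offense Report",
                        "Referred for Prosecution", "Arrest"],
})

# 1 where the stop was resolved as an arrest, 0 for every other resolution
toy["Arrests"] = (toy["Stop Resolution"] == "Arrest").astype(int)

# Class balance of the new target
balance = toy["Arrests"].value_counts(normalize=True)
print(balance)
```

Under this encoding "Referred for Prosecution" counts as 0. If prosecution referrals were meant to count as positives, as the Objectives section suggests, `toy["Stop Resolution"].isin([...])` with both labels would be one way to express that.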
    
2. Location data: a number of columns relate to location, such as "Precinct", "Officer Squad", "Sector", and "Beat", but they are indirect measures of the actual location of the Terry Stop. Inspection of the "Officer Squad" text shows the location assignment of the officer making the report. In ~10% of cases, Terry stops were performed by field training units or other units which are not captured by precinct (hence roughly 25% of the precincts are unknown). The training unit information is captured in the "Officer Squad" column.  <br><br>

3. For time data there is a "Reported Time", which is the time when the officer report was submitted rather than the time of the actual Terry stop, and according to the documentation could be delayed up to 10 hours. <br><br> 

    However, inspection of the text in "Officer Squad" shows that the reporting officer's watch is recorded. In the Seattle police squad there are 3 watches to cover each 24 hour period: Watch 1 (03:00 - 11:00), Watch 2 (11:00 - 19:00), and Watch 3 (19:00 - 03:00).  Since officer performance is rated based on number of cases and crimes prevented or apprehended, the "Officer Squad" data, which comes from the report, is likely to be the most reliable in terms of time.
    
#### Conclusion: Use "Officer Squad" text data for time and location

- [x] Parse the "Officer Squad" data to capture the location and time based on officer assignments, creating columns for location and watch. <br><br>

- [x] Drop the "Reported Time", "Precincts", "Sector", and "Beat" columns <br><br>
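
The parsing step could be sketched as below. The regular expressions are assumptions inferred from the sample "Officer Squad" values shown in `df.head()` (for example `SOUTH PCT 1ST W - ROBERT`); the output column names, the "Other" label, and the watch code 0 are hypothetical choices, and the real column may contain formats this sketch does not handle.

```python
import pandas as pd

# Sample values of the kind seen in the "Officer Squad" column
squads = pd.Series([
    "SOUTH PCT 1ST W - ROBERT",
    "WEST PCT 3RD W - MARY",
    "SOUTH PCT OPS - DAY ACT",
    "TRAINING - FIELD TRAINING SQUAD",
    "Unknown",
])

# Precinct: the word before "PCT"; watch: the digit before "ST/ND/RD W"
precinct = squads.str.extract(r"^(\w+) PCT", expand=False).str.title()
watch = squads.str.extract(r"(\d)(?:ST|ND|RD) W\b", expand=False)

parsed = pd.DataFrame({
    "squad_location": precinct.fillna("Other"),  # no precinct in the text
    "watch": watch.fillna("0").astype(int),      # 0 = watch not stated
})
print(parsed)
```

Squads with no precinct or watch in the text (training, operations, unknown) fall into the catch-all values, which keeps them as a distinct category.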


In [24]:
df.drop(columns=['Arrest Flag', 'Frisk Flag', 'Reported Time', 'Precinct', 'Sector', 'Beat'], inplace = True)

In [25]:
# Re-Check for duplicates
#duplicates = seattle_df[seattle_df.duplicated(subset =['id'], keep = False)]
#duplicates.sort_values(by=['id']).head()
duplicates = df[df.duplicated(keep = False)]
df.duplicated().sum()

2285

#### Finding from duplicated():
- At the beginning of the analysis, I checked for duplicates in the entire dataset (before removing columns of data, such as the ID columns) and found none. After dropping the ID columns, there are 118 rows in duplication, 59 pairs. After also dropping the flag, time, and location columns in the cell above, `df.duplicated().sum()` rises to 2285. <br><br>

- Because the date and time are not exact (the documentation says the date could have been entered 24 hours later, or the time could be off by 10 hours), Terry stops that are actually unique could have the same data once the ID columns are removed.<br><br>

- There are a few that are arrests.  Still open to decide whether to remove the duplicated data or not.  <br><br>

- What is curious is that the index number is not always consecutive between different pairs of duplicates.  This suggests that perhaps the data was input twice -- maybe due to some computer or internet glitches?
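
When reading these counts it helps to keep the two forms of `duplicated()` apart. The toy example below is illustrative only: `duplicated()` flags only the later copies, while `duplicated(keep=False)` flags every row that belongs to a duplicated group.

```python
import pandas as pd

# Toy data: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Arrest", "Field Contact", "Arrest", "Field Contact"],
    "Officer YOB": [1984, 1984, 1979, 1984, 1985],
})

n_extra = int(toy.duplicated().sum())               # later copies only
n_involved = int(toy.duplicated(keep=False).sum())  # all rows in duplicated groups

# Size of each duplicated group, and the original index labels involved
dupes = toy[toy.duplicated(keep=False)]
group_sizes = dupes.groupby(list(toy.columns)).size()
dupe_index = dupes.index.tolist()
print(n_extra, n_involved, dupe_index)
```

The group sizes show whether duplicates come in pairs or larger sets, and the index labels show how far apart the copies sit in the file.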

##  3. Data Transformation

   * Officer data: YOB, race, gender
   * Subject data- Age Group, race, gender
   * Stop Resolution (target column)
   * Weapons
   * Type of potential crime: Call type Initial and Final 
   * Date to day of week
   * Location and time: from Officer Squad
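
For the date item in this list, a minimal sketch of deriving the day of the week from "Reported Date", plus an approximate officer age from "Officer YOB". The new column names are hypothetical, and the age is approximate because it is computed from the report year only.

```python
import pandas as pd

# Toy rows using the date format seen in the dataset
toy = pd.DataFrame({
    "Reported Date": ["2015-10-16T00:00:00", "2019-04-13T00:00:00"],
    "Officer YOB": [1984, 1988],
})

reported = pd.to_datetime(toy["Reported Date"])
toy["Day of Week"] = reported.dt.day_name()
toy["Officer Age"] = reported.dt.year - toy["Officer YOB"]
print(toy)
```

Computing the age at the time of the report, rather than relative to today, keeps the feature stable no matter when the notebook is re-run.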
   

### A. Transform Age, Race and Gender Using Dictionary Mapping

In [26]:
# Re-mapping gender categories. 0 = Male, 1 = Female, 2 = Unknown

# officer_gender
officer_gender = {'M':0, 'F':1, 'N':2}
df['Officer Gender'] = df['Officer Gender'].map(officer_gender)

# subject perceived gender
subject_gender = {'Male':0, 'Female':1, 'Unknown':2,  '-':2, 
                 'Unable to Determine':2, 'Gender Diverse (gender non-conforming and/or transgender)':2}
df['Subject Perceived Gender'] = df['Subject Perceived Gender'].map(subject_gender)

In [27]:
#Check the mapping
df.loc[(df['Officer Gender']== 0.0)].shape, df.loc[(df['Subject Perceived Gender']== 0.0)].shape
df['Officer Gender'].value_counts()

0    36504
1     4593
2        7
Name: Officer Gender, dtype: int64

In [28]:
df['Subject Perceived Gender'].value_counts()

0    32049
1     8468
2      587
Name: Subject Perceived Gender, dtype: int64

In [29]:
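df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Gender'].isna())].shape
# Checking whether any arrests have a subject gender that failed to map (NaN).  In this case none.
# Note: a comparison with `== np.nan` is always False, so .isna() is needed for this test.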

(0, 13)

In [30]:
# Check the mapping
df['Officer Gender'].isna().sum(), df['Subject Perceived Gender'].isna().sum()

# NAs are not found in value_counts....

(0, 0)
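
As the comment above notes, `value_counts()` hides NaNs by default, so labels that a mapping dictionary misses can go unnoticed. The sketch below shows one way to surface them; the shortened `gender_map` and the toy series are illustrative, not the notebook's actual mapping.

```python
import pandas as pd

# Deliberately incomplete mapping: "Unable to Determine" is missing
gender_map = {"Male": 0, "Female": 1, "Unknown": 2, "-": 2}
raw = pd.Series(["Male", "Female", "-", "Unable to Determine"])

mapped = raw.map(gender_map)  # labels not in the dict become NaN

# Which original labels were not covered by the mapping?
unmapped_labels = raw[mapped.isna()].unique().tolist()

# dropna=False makes value_counts() report the NaN bucket as well
counts = mapped.value_counts(dropna=False)
print(unmapped_labels)
print(counts)
```

Running a check like this right after each `.map()` call confirms the dictionary covers every label before the original text values are overwritten.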

In [31]:
# Re-mapping subject age categories
subject_age = {'1 - 17':1, '18 - 25':2, '26 - 35':3, '36 - 45':4, '46 - 55':5, '56 and Above':6, '-':0}
df['Subject Age Group'] = df['Subject Age Group'].map(subject_age)

In [32]:
df['Subject Age Group'].isna().sum()

0

In [33]:
df['Subject Age Group'].value_counts()

3    13615
4     8547
2     8509
5     5274
6     1996
1     1876
0     1287
Name: Subject Age Group, dtype: int64

In [34]:
# Checking to see of those arrested, how many had an unknown age group
# There are 193 arrests of people whose age is unknown
df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Age Group']== 0)].shape

(193, 13)

In [35]:
# Check how many arrested had unknown race (or - or other)

df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "Unknown")].shape
#df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "-")].shape
#df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "Other")].shape
df['Subject Perceived Race'].value_counts()

White                                        20192
Black or African American                    12243
Unknown                                       2073
Hispanic                                      1684
-                                             1422
Asian                                         1278
American Indian or Alaska Native              1224
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       27
Name: Subject Perceived Race, dtype: int64

In [36]:
race_map = {'White': 'White', 'Black or African American':'African American', 'Hispanic':'Hispanic',
            'Hispanic or Latino':'Hispanic', 'Two or More Races':'Multi-Racial','Multi-Racial':'Multi-Racial',
           'American Indian or Alaska Native':'Native', 'American Indian/Alaska Native':'Native',  
            'Native Hawaiian or Other Pacific Islander':'Native', 'Nat Hawaiian/Oth Pac Islander':'Native',
           '-':'Unknown', 'Other':'Unknown', 'Not Specified':'Unknown','Unknown':'Unknown',
           'Asian': 'Asian',}

df['Subject Perceived Race'] = df['Subject Perceived Race'].map(race_map)
df['Officer Race'] = df['Officer Race'].map(race_map)

In [37]:
df['Officer Race'].value_counts()

White               31805
Hispanic             2255
Multi-Racial         2158
African American     1674
Asian                1563
Unknown               921
Native                728
Name: Officer Race, dtype: int64

In [38]:
df['Subject Perceived Race'].value_counts()

White               20192
African American    12243
Unknown              3647
Hispanic             1684
Asian                1278
Native               1251
Multi-Racial          809
Name: Subject Perceived Race, dtype: int64

### B. Transform Stop Resolution Using Dictionary and .map()

In [39]:
# Now address the Stop Resolution categories
df['Stop Resolution'].value_counts()

Field Contact               16287
Offense Report              13976
Arrest                       9957
Referred for Prosecution      728
Citation / Infraction         156
Name: Stop Resolution, dtype: int64

In [40]:
# Re-map the Stop Resolution, to combine categories Arrest and Referred for Prosecution
# Map Arrest and Referred for Prosecution to 1,  and all others 0
stop_resolution = {'Field Contact': 0, 'Offense Report': 0, 'Arrest': 1,
             'Referred for Prosecution': 1, 'Citation / Infraction': 0}

df['Stop Resolution']=df['Stop Resolution'].map(stop_resolution)
df['Stop Resolution'].value_counts()

0    30419
1    10685
Name: Stop Resolution, dtype: int64

### C. Transform Weapon Type Using a Dictionary and .map()

In [41]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad
0,0,1,,1984,0,African American,Asian,0,2015-10-16T00:00:00,-,-,-,SOUTH PCT 1ST W - ROBERT
1,0,0,,1963,0,White,Unknown,2,2015-04-01T00:00:00,-,-,-,Unknown
2,0,0,,1985,0,Hispanic,Unknown,2,2015-05-25T00:00:00,-,-,-,WEST PCT 3RD W - MARY
3,0,0,,1979,0,White,Unknown,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA
4,0,0,,1979,0,White,Unknown,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA


In [42]:
# Now re-map Weapon Type feature.  First check the categories of Weapons
df['Weapon Type'].value_counts()

None                                 32565
-                                     6213
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      308
Handgun                                262
Firearm Other                          100
Club, Blackjack, Brass Knuckles         49
Blunt Object/Striking Implement         37
Firearm                                 18
Firearm (unk type)                      15
Other Firearm                           13
Mace/Pepper Spray                       12
Club                                     9
Rifle                                    5
None/Not Applicable                      4
Taser/Stun Gun                           4
Shotgun                                  3
Automatic Handgun                        2
Blackjack                                1
Brass Knuckles                           1
Fire/Incendiary Device                   1
Name: Weapon Type, dtype: int64

In [43]:
weapon_type = {'None':'None', 'None/Not Applicable':'None', 'Fire/Incendiary Device':'Incendiary',
              'Lethal Cutting Instrument':'Lethal Blade', 'Knife/Cutting/Stabbing Instrument':'Lethal Blade',
              'Handgun':'Firearm', 'Firearm Other':'Firearm','Firearm':'Firearm', 'Firearm (unk type)':'Firearm',
              'Other Firearm':'Firearm', 'Rifle':'Firearm', 'Shotgun':'Firearm', 'Automatic Handgun':'Firearm',
              'Club, Blackjack, Brass Knuckles':'Blunt Force', 'Club':'Blunt Force', 
              'Brass Knuckles':'Blunt Force', 'Blackjack':'Blunt Force',
              'Blunt Object/Striking Implement':'Blunt Force', '-':'Unknown',
              'Taser/Stun gun':'Taser', 'Mace/Pepper Spray':'Spray',}

df['Weapon Type']=df['Weapon Type'].map(weapon_type)
df['Weapon Type'].value_counts()

None            32569
Unknown          6213
Lethal Blade     1790
Firearm           418
Blunt Force        97
Spray              12
Incendiary          1
Name: Weapon Type, dtype: int64
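
`Series.map()` with a dictionary returns NaN for any label that is not a key. In the cell above the data label is `Taser/Stun Gun` (4 rows) while the dictionary key is `Taser/Stun gun`, which appears consistent with the missing `Taser` category in the output and the 4 missing `Weapon Type` values reported by `df.isna().sum()` further down. A minimal sketch of a pre-mapping check, using toy data and a deliberately mismatched hypothetical mapping:

```python
import pandas as pd

# Toy data; the labels mirror a few of those in the value_counts above.
weapons = pd.Series(['Handgun', 'Taser/Stun Gun', '-', 'Club', 'Taser/Stun Gun'])

# Hypothetical mapping with one key whose capitalisation differs from the data.
weapon_map = {'Handgun': 'Firearm', 'Taser/Stun gun': 'Taser',
              '-': 'Unknown', 'Club': 'Blunt Force'}

# Labels present in the data but absent from the mapping turn into NaN under .map()
unmapped = sorted(set(weapons.unique()) - set(weapon_map))
print(unmapped)

mapped = weapons.map(weapon_map)
print(mapped.isna().sum())
```

Running a check like this before `.map()` makes it easier to catch key mismatches before a later `dropna()` silently removes the affected rows.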

### D. Transform the Date using to_datetime, .weekday, and .day

* Calculate the day of the week from the reported date
    - [x] Day of the week: 0 = Monday, 6 = Sunday
    <br><br>
    
* Bucket the day of the month into first, middle and last weeks, because perhaps more crimes / arrests occur when the bills come due
    - [x] Time of month: 1 = First week (days 1-7), 2 = 2nd and 3rd weeks (days 8-22), 3 = last week of the month (days 23-31)



In [44]:
df['Reported Date'].head()

0    2015-10-16T00:00:00
1    2015-04-01T00:00:00
2    2015-05-25T00:00:00
3    2015-06-09T00:00:00
4    2015-06-09T00:00:00
Name: Reported Date, dtype: object

In [45]:
# Transform the Reported date into a day of the week,  or the time of month 
# Day of the week: 0 = Monday, 6 = Sunday
# Time of month: 1 = First week, 2 = 2nd and 3rd weeks,  3 = last week of the month

df['Reported Date']=pd.to_datetime(df['Reported Date'])
df['Weekday']=df['Reported Date'].dt.weekday

df['Time of Month'] = df['Reported Date'].dt.day

month_map = {1:1, 2:1,3:1,4:1, 5:1, 6:1, 7:1,8:2, 9:2, 10:2, 11:2, 12:2, 13:2, 14:2, 15:2, 
                     16:2, 17:2, 18:2, 19:2, 20:2, 21:2, 22:2, 23:3, 24:3, 25:3, 26:3, 27:3, 28:3, 29:3, 30:3, 31:3}

df['Time of Month'] = df['Time of Month'].map(month_map)
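
The 31-entry dictionary can also be expressed as three bins. The following is a sketch of an alternative, not the method used in this notebook; it assumes the same cut points as `month_map` (days 1-7, 8-22, 23-31) and uses `pd.cut`, whose bins are right-closed by default.

```python
import pandas as pd

days = pd.Series(range(1, 32))  # every possible day of the month

# Right-closed bins: (0, 7] -> 1, (7, 22] -> 2, (22, 31] -> 3
time_of_month = pd.cut(days, bins=[0, 7, 22, 31], labels=[1, 2, 3]).astype(int)

print(time_of_month.value_counts().sort_index())
```

Changing a cut point then means editing one number in `bins` instead of several dictionary entries.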


In [46]:
df.isna().sum()

Subject Age Group           0
Stop Resolution             0
Weapon Type                 4
Officer YOB                 0
Officer Gender              0
Officer Race                0
Subject Perceived Race      0
Subject Perceived Gender    0
Reported Date               0
Initial Call Type           0
Final Call Type             0
Call Type                   0
Officer Squad               0
Weekday                     0
Time of Month               0
dtype: int64

### E. Use Officer Squad data to create the location information (Precinct or Officer Team) and the time of day of the arrest (Officer Watch)

* Use Pandas Regex .str.extract to get the name of the precinct and the Watch if available

* Analyse if some precincts / units never make arrests 

* The Officer Squad text is likely a more reliable indicator of location and time of day than the Reported Date, assuming it gives the squad name / location and the watch that handled the report, not a specific person's schedule. <br><br>
* Reports can be filed 1 day, or 10 hours, after the stop, so the Reported Date and Time is not the actual time of the Terry stop. <br><br>
* Features created from Officer Squad: <br><br>
    - [x] Precinct or Squad name following the Terry stop
    - [x] Watch: <br>
        0 = Unknown, if the watch is not normally recorded<br>
        1 = Watch 1 03:00 - 11:00<br>
        2 = Watch 2 11:00 - 19:00<br>
        3 = Watch 3 19:00 - 03:00<br>
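
The two features listed above can be sketched on a handful of squad strings. The strings below are taken from sample rows shown in this notebook; the assumption that the first word is the precinct / unit and the first digit is the watch is only checked against those samples.

```python
import pandas as pd

# Squad strings copied from sample rows in this notebook
squads = pd.Series(['SOUTH PCT 1ST W - ROBERT',
                    'WEST PCT 3RD W - MARY',
                    'WEST PCT OPS - ACT NIGHT',
                    'Unknown'])

# First word of the squad name -> precinct / unit
precinct = squads.str.extract(r'(\w+)', expand=False)

# First digit, if any -> watch number; squads without a digit get 0 (unknown)
watch = squads.str.extract(r'(\d)', expand=False).fillna('0').astype(int)

print(pd.DataFrame({'Precinct': precinct, 'Watch': watch}))
```

Passing `expand=False` returns a Series rather than a one-column DataFrame, and filling with the string `'0'` before `astype(int)` keeps the column a single integer type.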
  

In [47]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad,Weekday,Time of Month
0,0,1,,1984,0,African American,Asian,0,2015-10-16,-,-,-,SOUTH PCT 1ST W - ROBERT,4,2
1,0,0,,1963,0,White,Unknown,2,2015-04-01,-,-,-,Unknown,2,1
2,0,0,,1985,0,Hispanic,Unknown,2,2015-05-25,-,-,-,WEST PCT 3RD W - MARY,0,3
3,0,0,,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,1,2
4,0,0,,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,1,2


In [48]:
# Use Python Regex commands to clean up the Call Types and Officer Squad

In [49]:
df['Officer Squad'].value_counts()

df['Precinct'] = df['Officer Squad'].str.extract(r'(\w+)')

In [50]:
df['Watch'] = df['Officer Squad'].str.extract(r'(\d)', expand=False).fillna(0)
df.head(100)

# Some Officer Squads do not record the Watch number
# Don't leave the NaNs in the Watch column, fill with 0
# Watch definition: 0 = Unknown, 1 = 1st Watch, 2 = 2nd Watch, 3 = 3rd Watch

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad,Weekday,Time of Month,Precinct,Watch
0,0,1,,1984,0,African American,Asian,0,2015-10-16,-,-,-,SOUTH PCT 1ST W - ROBERT,4,2,SOUTH,1
1,0,0,,1963,0,White,Unknown,2,2015-04-01,-,-,-,Unknown,2,1,Unknown,0
2,0,0,,1985,0,Hispanic,Unknown,2,2015-05-25,-,-,-,WEST PCT 3RD W - MARY,0,3,WEST,3
3,0,0,,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,1,2,NORTH,2
4,0,0,,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,1,2,NORTH,2
5,0,0,,1969,0,White,Native,0,2015-06-11,-,-,-,WEST PCT 3RD W - K/Q RELIEF,3,2,WEST,3
6,0,0,,1984,0,African American,Unknown,2,2015-06-12,-,-,-,SOUTH PCT 1ST W - ROBERT,4,2,SOUTH,1
7,0,0,,1983,0,White,Unknown,2,2015-06-12,-,-,-,SOUTH PCT 1ST W - ROBERT,4,2,SOUTH,1
8,0,0,,1966,0,Hispanic,Unknown,2,2015-06-27,-,-,-,SOUTH PCT 1ST W - R/S RELIEF,5,3,SOUTH,1
9,0,0,,1973,0,White,Unknown,0,2015-07-02,-,-,-,WEST PCT OPS - ACT NIGHT,3,1,WEST,0


In [51]:
df.isna().sum()

Subject Age Group           0
Stop Resolution             0
Weapon Type                 4
Officer YOB                 0
Officer Gender              0
Officer Race                0
Subject Perceived Race      0
Subject Perceived Gender    0
Reported Date               0
Initial Call Type           0
Final Call Type             0
Call Type                   0
Officer Squad               0
Weekday                     0
Time of Month               0
Precinct                    0
Watch                       0
dtype: int64

In [52]:
# Identify the precincts that do not typically make arrests, by comparing the number of arrests (Stop Resolution == 1)
# to the total number of Terry stops. 


arrest_df = df.loc[df['Stop Resolution'] == 1]  # Dataframe only for those Terry stops that resulted in arrests

arrest_df['Precinct'].value_counts(), df['Precinct'].value_counts()  # compare the value_counts for both dataframes

# Subsetting to only the Stop Resolution of arrest 

(WEST         3061
 NORTH        2261
 EAST         1879
 SOUTH        1481
 TRAINING     1058
 SOUTHWEST     733
 Unknown       115
 TRAF           25
 CRISIS         19
 GANG           18
 CANINE          8
 SWAT            6
 AUTO            4
 DV              3
 SAU             3
 PAWN            2
 JOINT           2
 MAJOR           2
 BURG            2
 HR              1
 ROBBERY         1
 NARC            1
 Name: Precinct, dtype: int64, WEST          10735
 NORTH         10079
 EAST           5976
 SOUTH          5475
 TRAINING       4312
 SOUTHWEST      3576
 Unknown         535
 TRAF             88
 GANG             64
 CRISIS           54
 CANINE           38
 MAJOR            33
 SWAT             28
 SAU              16
 HARBOR           16
 BURG             13
 JOINT            10
 HR               10
 DV                8
 AUTO              8
 NAVIGATION        6
 NARC              5
 COMMUNITY         5
 PAWN              3
 OPS               3
 PUBLIC            2
 ROBBE

In [53]:
# Calculate the fraction of Terry stops that ended in arrest for each precinct / unit, by dividing the arrest counts by the total stop counts

arrest_percentage = arrest_df['Precinct'].value_counts() / df['Precinct'].value_counts()
print(f'The percentage of arrests based on terry stops, by squad \n\n',arrest_percentage)

The percentage of arrests based on terry stops, by squad 

 AUTO         0.5
BURG         0.2
CANINE       0.2
COMM         nan
COMMUNITY    nan
CRISIS       0.4
DV           0.4
EAST         0.3
GANG         0.3
HARBOR       nan
HR           0.1
JOINT        0.2
MAJOR        0.1
NARC         0.2
NAVIGATION   nan
NORTH        0.2
OPS          nan
PAWN         0.7
PUBLIC       nan
RECORDS      nan
ROBBERY      0.5
SAU          0.2
SOUTH        0.3
SOUTHWEST    0.2
SWAT         0.2
TRAF         0.3
TRAINING     0.2
Unknown      0.2
VICE         nan
WEST         0.3
ZOLD         nan
Name: Precinct, dtype: float64
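
The `nan` entries above arise because dividing two `value_counts()` results aligns them on their index, and units that never appear in `arrest_df` have no numerator. Since `Stop Resolution` is already coded 0/1, a group-wise mean gives the same ratio and reports no-arrest units as 0.0 instead of NaN. This is a sketch of that alternative on toy data; the unit names are illustrative only.

```python
import pandas as pd

# Toy data: 1 = arrest, 0 = no arrest
stops = pd.DataFrame({
    'Precinct': ['WEST', 'WEST', 'WEST', 'NORTH', 'NORTH', 'HARBOR'],
    'Stop Resolution': [1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column per group is the arrest rate for that group
arrest_rate = stops.groupby('Precinct')['Stop Resolution'].mean()
print(arrest_rate)

# Units with no arrests show up as 0.0 rather than NaN
no_arrest_units = arrest_rate[arrest_rate == 0].index.tolist()
print(no_arrest_units)
```

With this variant, rows from no-arrest units would be filtered on a rate of 0 rather than removed by `dropna()`.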


In [54]:
# Create a dictionary for mapping the squads which have successful arrest.  Those officer squads which have
# reported Terry stops with no arrests will be dropped from the dataset
successful_arrest_map=arrest_percentage.to_dict()
# successful_arrest_map # Take a look at the dictionary

df['Precinct Success']=df['Precinct'].map(successful_arrest_map)

In [55]:
df.isna().sum()

Subject Age Group            0
Stop Resolution              0
Weapon Type                  4
Officer YOB                  0
Officer Gender               0
Officer Race                 0
Subject Perceived Race       0
Subject Perceived Gender     0
Reported Date                0
Initial Call Type            0
Final Call Type              0
Call Type                    0
Officer Squad                0
Weekday                      0
Time of Month                0
Precinct                     0
Watch                        0
Precinct Success            36
dtype: int64

In [56]:
# 36 Terry stops (rows) belong to units / precincts which do not have any arrests since 2015
# (the NaN entries in the ratio above). Likely these units are not expected to make arrests

df.to_csv('terry_stops_cleanup1.csv')

In [57]:
# Drop the Terry stops from units which do not routinely make arrests

# dropna() removes every row with a NaN in any column (Weapon Type as well as Precinct Success)
df.dropna(inplace=True)

# Drop unneeded columns
df.drop(columns=['Call Type', 'Reported Date', 'Officer Squad'], inplace = True)


In [58]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Initial Call Type,Final Call Type,Weekday,Time of Month,Precinct,Watch,Precinct Success
0,0,1,,1984,0,African American,Asian,0,-,-,4,2,SOUTH,1,0.3
1,0,0,,1963,0,White,Unknown,2,-,-,2,1,Unknown,0,0.2
2,0,0,,1985,0,Hispanic,Unknown,2,-,-,0,3,WEST,3,0.3
3,0,0,,1979,0,White,Unknown,2,-,-,1,2,NORTH,2,0.2
4,0,0,,1979,0,White,Unknown,2,-,-,1,2,NORTH,2,0.2


### F. Transform Initial or Final Call Types

In [None]:
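def clean_call_types(col_name):
    """Placeholder for a reusable call-type cleaner; the cleaning steps are run inline in the cells below."""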

In [None]:
df['Final Call Type'].value_counts(dropna=False)

In [None]:
# Create an index of the true and false values for the condition == '-'
idx = df['Final Call Type'] =='-'

In [None]:
# Use true/false index - Boolean index
# Pass in the index and the column name to replace the - with Unknown

df.loc[idx,'Final Call Type'] = 'Unknown'

In [None]:
final_calls = df['Final Call Type']

In [None]:
df['Final Re-map'] = final_calls.apply(lambda x:x.replace('--','').split('-')[0].strip())

In [None]:
df['Final Re-map'].value_counts(dropna=False).sort_index()

In [None]:
df.isna().sum()

In [None]:
df['Final Re-map'] = df['Final Re-map'].str.extract(r'(\w+)')


In [None]:
df['Final Re-map'] = df['Final Re-map'].str.lower()

In [None]:
last_map = df['Final Re-map'].value_counts().to_dict()
df['Final Re-map'].isna().sum()

In [None]:
last_map = {'unknown': 'unknown',
             'suspicious': 'suspicious',
             'assaults': 'assault',
             'disturbance': 'disturbance',
             'prowler': 'trespass',
             'dv': 'domestic violence',
             'warrant': 'warrant',
             'theft': 'theft',
             'narcotics': 'under influence',
             'robbery': 'theft',
             'burglary': 'theft',
             'traffic': 'traffic',
             'property': 'property damage',
             'weapon': 'weapon',
             'crisis': 'person in crisis',
             'automobiles': 'auto',
             'assist': 'assist others',
             'sex': 'vice',
             'mischief': 'mischief',
             'arson': 'arson',
             'fraud': 'fraud',
             'vice': 'vice',
             'drive': 'auto',
             'misc': 'misdemeanor',
             'premise': 'trespass',
             'alarm': 'suspicious',
             'intox': 'under influence',
             'rape': 'rape',
             'child': 'child',
             'trespass': 'trespass',
             'person': 'person in crisis',
             'homicide': 'homicide',
             'burg': 'theft',
             'kidnap': 'kidnap',
             'animal': 'animal',
             'hazards': 'hazard',
             'aslt': 'assault',
             'casualty': 'homicide',
             'fight': 'disturbance',
             'shoplift': 'theft',
             'auto': 'auto', 
             'haras': 'disturbance',
             'purse': 'theft',
             'weapn': 'weapon',
             'fireworks': 'arson',
             'follow': 'disturbance',
             'dist': 'disturbance',
             'haz': 'hazard',
             'nuisance': 'mischief',
             'threats': 'disturbance',
             'liquor': 'under influence',
             'mvc': 'auto',
             'shots': 'weapon',
             'harbor': 'auto',
             'down': 'homicide',
             'service': 'unknown',
             'hospital': 'unknown',
             'bomb': 'arson',
             'undercover': 'under influence',
             'burn': 'arson',
             'lewd': 'vice',
             'dui': 'under influence',
             'crowd': 'unknown',
             'order': 'assist others',
             'escape': 'assist others',
             'commercial': 'trespass',
             'noise': 'disturbance'}

In [None]:
df['Final Re-map'] = df['Final Re-map'].map(last_map)
df['Final Re-map'].value_counts(dropna=False)
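
The inline steps in this section (replace a bare `-`, strip the leading `--`, keep the text before the first hyphen, take the first word, lowercase, then map) can be gathered into one helper. The function below is a hypothetical sketch; its signature is an assumption, the call-type strings are made up, and the mapping is a three-entry excerpt of `last_map`.

```python
import pandas as pd

def clean_call_types(series, category_map):
    """Reduce free-text call types to a keyword, then map it to a category.

    Hypothetical helper that only repackages the inline steps of this section.
    """
    keyword = (
        series.replace('-', 'Unknown')             # a bare '-' means no call type
        .str.replace('--', '', regex=False)        # strip the leading '--'
        .str.split('-').str[0]                     # keep text before the first '-'
        .str.strip()
        .str.extract(r'(\w+)', expand=False)       # first word only
        .str.lower()
    )
    return keyword.map(category_map)

# Made-up call-type strings and an excerpt of the mapping
calls = pd.Series(['--DV - ARGUMENTS, DISTURBANCE', '--ASSAULTS, OTHER', '-'])
excerpt = {'dv': 'domestic violence', 'assaults': 'assault', 'unknown': 'unknown'}

result = clean_call_types(calls, excerpt).tolist()
print(result)
```

The same helper could then be applied to `Initial Call Type`, provided its keywords are covered by the mapping; any keyword that is not a key comes back as NaN.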

In [None]:
df.isna().sum()

In [None]:
#Drop all NaNs
df.dropna(inplace=True)

df.to_csv('terry_stops_cleanup2.csv')
#df.dropna(inplace=True)
#df.shape

In [None]:
df.reset_index(inplace=True)

In [None]:
df.head(100)

In [None]:
df.drop(columns = ['Initial Call Type', 'Final Call Type', 'Precinct Success'], inplace=True)

In [None]:
df.info()

## 4. Optional Feature Engineering for Training data only

# Calculate how successful particular precincts were at making arrests
#arrest_percentage = arrest_df['Precinct'].value_counts() / df['Precinct'].value_counts()
#print(f'The percentage of arrests based on terry stops, by squad \n\n',arrest_percentage)

# Create a dictionary for mapping the squads which have successful arrest.  Those officer squads which have
# reported Terry stops with no arrests will be dropped from the dataset
#successful_arrest_map=arrest_percentage.to_dict()
# successful_arrest_map # Take a look at the dictionary

#df['Precinct Success']=df['Precinct'].map(successful_arrest_map) # map the dictionary to the dataframe with a new column

# Perform the same analysis to see which call types lead to more arrests

#arrest_df = df.loc[df['Stop Resolution'] == 1] # Re-Create the arrest_df in case there were removals earlier
#arrest_df['Final Call Type'].value_counts(),  df['Final Call Type'].value_counts()

#arrest_categories = arrest_df['Final Call Type'].value_counts() / df['Final Call Type'].value_counts() 
#arrest_map = arrest_categories.to_dict()
#arrest_map # look at the dictionary 

#df['Final Call Success'] = df['Final Call Type'].map(arrest_map)

## 5. Processing Chosen Feature Columns


In [None]:
df.columns

In [None]:
category_cols = ['Subject Age Group', 'Weapon Type', 'Officer YOB',
       'Officer Gender', 'Officer Race',
       'Subject Perceived Race', 'Subject Perceived Gender', 'Weekday',
       'Time of Month', 'Precinct', 'Watch', 'Final Re-map']

target_col = ['Stop Resolution']

In [None]:
df_to_split = pd.DataFrame()

from sklearn.preprocessing import MinMaxScaler

# Convert categories to integer codes with cat.codes

for header in category_cols:
    df_to_split[header] = df[header].astype('category').cat.codes
    
df_to_split.info()

In [None]:
df_to_split.head()
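
For reference, `cat.codes` numbers the categories in sorted label order, so the integers carry no meaning beyond identity, and a missing value would be coded as -1. A small sketch on a toy column (labels borrowed from the remapped race categories) that also keeps a lookup for translating codes back to labels:

```python
import pandas as pd

# Toy column; the labels are borrowed from the remapped race categories
col = pd.Series(['White', 'Asian', 'White', 'Native'])

as_cat = col.astype('category')
codes = as_cat.cat.codes                          # integer code per row
lookup = dict(enumerate(as_cat.cat.categories))   # code -> original label

print(codes.tolist())
print(lookup)
```

Keeping `lookup` for each encoded column makes later feature-importance output easier to read.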

## 6. Fitting an Initial  CatBoostClassifier Model



In [None]:
y = df['Stop Resolution']
X = df.drop('Stop Resolution',axis=1)

In [None]:
from sklearn.model_selection import train_test_split

## Train test split
X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42)#,stratify=y)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))
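
The `stratify=y` argument is commented out in the split above. As a sketch of what it would do, the toy example below uses a target with about 26% positives, roughly the share of 1s in the `Stop Resolution` counts earlier (10685 of 41104); the feature column is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 26% positives and a placeholder feature
y = pd.Series([1] * 260 + [0] * 740, name='Stop Resolution')
X = pd.DataFrame({'feature': range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# With stratify=y both splits keep (almost exactly) the overall class ratio
print(y_train.mean(), y_test.mean())
```

Without stratification the two shares can drift apart by chance, which matters more when the positive class is the minority.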

In [None]:
## Define X and y to split
#X = df_to_split
#y = pd.Series(df[target_col].to_numpy().ravel())
#y.name = 'Stop Resolution'

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

Seattle Police Department Section 6.220 of the Police Manual

http://www.seattle.gov/police-manual/title-6---arrests-search-and-seizure/6220---voluntary-contacts-terry-stops-and-detentions

http://www.seattle.gov/police/information-and-data/terry-stops/terry-stops-dashboard

Effective Date: 01/01/2020

This policy applies to all sworn employees conducting voluntary contacts and/or stops/detentions based upon reasonable suspicion (Terry).

This policy does not apply to detentions based upon probable cause and community caretaking functions pursuant to RCW 71.05.153.

6.220 - POL – 1 Definitions

Seizure: A seizure occurs any time an officer, by means of physical force or show of authority, restricts the liberty of a person.  A seizure may also occur if an officer uses words, actions, or demeanor that would make a reasonable person believe that they are not free to leave.

Voluntary Contacts: During voluntary contacts, officers will not use any words, actions, demeanor, or other show of authority that would indicate that a person is not free to leave; voluntary contacts are not seizures.

Voluntary Contacts fall under two categories:

Social Contact: A voluntary and consensual encounter between the police and a subject with the intent of engaging in casual and/or non-investigative conversation. The subject is free to leave and/or decline any of the officer’s requests at any point; social contacts are not seizures.

Non-Custodial Interview: A voluntary and consensual investigatory interview that an officer conducts with a subject during which the subject is free to leave and/or decline any of the officer’s requests at any point; non-custodial interviews are not seizures.

Terry Stop: A brief, minimally intrusive seizure of a subject based upon articulable reasonable suspicion in order to investigate possible criminal activity. The stop can apply to people as well as vehicles. The subject of a Terry stop is not free to leave. A Terry stop is a seizure under both the state and federal constitutions.

- A Terry stop is a detention, based on reasonable suspicion, during which an officer may develop facts to establish probable cause or dispel suspicion.

- Stops and detentions initiated under probable cause will be made pursuant to Manual Sections:

- 6.010- Arrests;

- 6.280-Warrant Arrests;

- 16.230-Issuing Tickets and Traffic Contact Reports;

- 16.110-Crisis Intervention or;

- 15.020 - Charge-By-Officer

Reasonable Suspicion: Specific, objective, articulable facts, which, taken together with rational inferences, would create a well-founded suspicion that there is a substantial possibility that a subject has engaged, is engaging or is about to engage in criminal conduct.

- The reasonableness of a Terry stop is considered in view of the totality of the circumstances, the officer’s training and experience, and what the officer knew before the stop.

- During a stop, an officer may learn new information that can lead to additional reasonable suspicion or probable cause that a crime has occurred, but that new information cannot provide the justification for the original stop.

6.220 - POL – 2 Conducting a Terry Stop

1. Terry Stops are Seizures Based Upon Reasonable Suspicion

This policy prohibits Terry stops when an officer lacks reasonable suspicion that a subject has been, is, or is about to engage in the commission of a crime.

Searches and seizures by officers are lawful to the extent they meet the requirements of the 4th Amendment (see Terry v. Ohio, 392 U.S. 1 (1968)) and Washington Constitution Art. 1, Section 7.

2. During a Terry Stop, Officers Will Limit the Seizure to a Reasonable Scope

Officers will articulate in their Report, the justification for the initiation, scope and duration of a Terry stop.

Actions that would indicate to a reasonable person that they are under arrest or indefinitely detained may convert a Terry stop to an arrest; however, taking any of these actions does not necessarily turn a Terry stop into an arrest.

Unless justified by the articulable reasons for the original stop, officers must have additional articulable justification for further limiting a person’s freedom during a Terry stop, such as:

- Taking a subject’s identification or driver license away from the immediate vicinity

- Ordering a motorist to exit a vehicle

- Putting a pedestrian up against a wall

- Directing a person to stand or remain standing, or to sit on a patrol car bumper or any other place not of their choosing

- Directing a person to lie or sit on the ground

- Applying handcuffs

- Transporting any distance away from the scene of the initial stop, including for the purpose of witness identification

- Placing a subject into a police vehicle

- Pointing a firearm at a person or occupied vehicle

- Frisking for weapons

- De minimis force

3. During a Terry Stop, Officers Will Limit the Seizure to a Reasonable Amount of Time

Subjects may be seized for only that period of time necessary to effect the purpose of the stop. Any delays in completing the necessary actions will be objectively reasonable.

Officers may not extend a detention solely to await the arrival of a supervisor.

4. During all Terry Stops, Officers Will Take Reasonable Steps to Be Courteous and Professional

When reasonable, as early in the contact as safety permits, the officer making contact with the subject (contact officer) will inform the suspect of the following:

- The officer’s name;

- The officer’s rank or title;

- The fact that the officer is a Seattle Police Officer;

- The reason for the stop; and

- That the stop is being recorded, if applicable (See 16.090 – In-Car and Body Worn Video).

When releasing a person at the end of a Terry stop, officers will advise the person that they are free to leave, offer an explanation of the circumstances and reasons for the Terry stop, and provide the person a business card with the event number as a receipt. Officers will not extend a detention to explain the Terry stop or provide a receipt.

5. Officers Cannot Require Subjects to Identify Themselves or Answer Questions on a Terry Stop

During a Terry stop, officers may request identification; however, subjects are not obligated to provide identification or information upon request.

Exceptions: As listed in 6.220—POL-3 Conducting a Detention to Issue a Notice of Infraction, Issue a Citation, and Other Exceptions.

6. Officers May Conduct a Frisk of Stopped Subject(s) Only if They Have an Articulable and Reasonable Safety Concern that the Person is Armed and Presently Dangerous

The purpose and scope of a frisk is to discover weapons or other items which pose a danger to the officer or those nearby. It is not a generalized search of the entire person. The decision to conduct a frisk is based upon the totality of the circumstances and the reasonable conclusions drawn from the officer’s training and experience.  Generally, the frisk will be limited to a pat-down of outer clothing.  Once the officer ascertains that no weapon is present after the frisk is completed, the officer’s limited authority to frisk is completed (i.e. the frisk will stop).

- A weapons frisk is a limited search determined by the state and federal constitutions.

- All consent searches will be conducted and memorialized via body-worn video, in-car video or signed consent form pursuant to Manual Section 6.180.

- Officers will not frisk for weapons on a social contact or noncustodial interview.

- A frisk will not be used as a pretext to search for incriminating evidence.

- The fact that a Terry stop occurs in a high-crime area is not by itself sufficient to justify a frisk.

Frisk factors may include, but are not limited to:

- Prior knowledge that the subject carries a weapon;

- Suspicious behavior, such as failure to comply with instructions to keep hands in sight; and

- Observations, such as suspicious bulges, consistent with carrying a concealed weapon.

7. Under Washington State Law, Traffic Violations Will Not Be Used as a Pretext to Investigate Unrelated Crimes

- Pretext is stopping a suspect for an infraction to investigate criminal activity for which the officer has neither reasonable suspicion nor probable cause.

- The Washington State Constitution forbids use of pretext as a justification for a warrantless search or seizure.

- Officers will consciously, and independently determine that a traffic stop is reasonably necessary in order to address a suspected traffic infraction.

8. Supervisors Will Screen All Incidents In-Person When an Officer Places Handcuffs on a Subject

Officers will not extend a detention solely to await the arrival of a supervisor.

When un-handcuffing a subject for release, the officer will immediately notify a supervisor, inform the subject that they are free to leave and inform them that a sergeant is en route to the scene.

- If the subject declines to speak with a supervisor or wishes to leave before the supervisor arrives, the officer will attempt to offer the subject the supervisor's contact information.

- If the subject decides to wait for the supervisor, the officer will wait at the location for the supervisor to arrive.

- If the subject does not wish to remain on-scene to speak with the supervisor, the officer may arrange to meet the supervisor at another location to screen the incident.

9. When Making an Arrest, Officers May Seize Non-Arrested Companions for Articulable and Reasonable Officer Safety Concerns

Officers will only maintain the seizure of non-arrested companions based on safety concerns for as long as the objective rationale for the seizure continues to exist.  The scope and nature of the seizure must be objectively reasonable based on the factors justifying the detention.

Officers will articulate objective safety concerns for the officers, the arrestee, their companions, or other persons when seizing non-arrested companions.

Factors to consider when seizing non-arrested companions include (but are not limited to):

- The type of arrest;

- The number of officers;

- The number of people present at the scene of the arrest;

- The time of day;

- The behavior of those present at the scene;

- The location of the arrest;

- The presence or suspected presence of a weapon;

- Officer knowledge of the arrestee or the companions; and/or

- Potentially affected persons

This is not an exhaustive list. Justification to detain non-arrested companions will be made based upon the totality of the circumstances.

6.220 - POL – 3 Conducting a Detention to Issue a Notice of Infraction, Issue a Citation, and Other Exceptions

1. Certain Statutory Exceptions Require the Subject to Provide Identification:

- When the subject is a driver stopped for a traffic infraction investigation (RCW 46.61.021). Failure to provide identification is a misdemeanor.

- When the subject is attempting to purchase liquor (RCW 66.20.180).

- When the subject is carrying a concealed pistol (RCW 9.41.050). Failure to provide a concealed pistol license (CPL) is a civil infraction.

Officers may not transport a person to any police facility or jail for the sole purpose of identifying them unless they have probable cause for arrest.

While investigating a crime or possible crime, executing a search or arrest warrant, or issuing a citation or parks exclusion notice, officers may arrest subjects for false reporting (SMC 12A.16.040(D)) when subjects provide false written or oral identification.

2. Officers Can Detain Subjects to Identify Them in Order to Issue a Notice of Infraction

Under SMC 12A.02.140 and RCW 7.80.060, when an officer has probable cause to issue a Notice of Infraction for any City ordinance violation, the officer may detain the subject for a reasonable period of time to identify the subject.

When officers have probable cause to issue a Notice of Infraction, and the subject refuses to identify themselves, the officer may request that a fingerprinting kit or Mobile ID be delivered to the scene and detain the subject for a reasonable amount of time to facilitate the fingerprinting.

6.220 - POL – 4 Documenting a Terry Stop

1. Officers Will Document All Terry Stops

The documentation should contain all information requested in the Field Contact, but at a minimum will contain the following elements:

- The original, objective facts justifying the reasonable suspicion for the stop or detention;

- Any subsequent, objective facts that lengthen the original detention;

- The scope and duration of the stop or detention;

- The disposition of the stop or detention, including whether an arrest resulted;

- Whether a frisk or consensual search was conducted;

- The facts justifying the frisk or consensual search;

- The results of the frisk or consensual search;

- Demographic information pertaining to the subject, including perceived race, perceived age, and perceived gender; and

- Any complications or delays that contributed to an inability to fill out all information on the Field Contact.

Officers will clearly articulate the objective facts they rely upon in determining reasonable suspicion and probable cause.

Officers will document all Terry stops on a Field Contact. Officers will use a separate Field Contact for each person seized during a Terry stop.

- Officers are required to complete a Field Contact regardless of the outcome of the Terry stop.

- Where an officer develops probable cause for arrest during the course of the stop, a Field Contact is still required.

2. Officers Will Submit All Field Contacts Before They Leave at the End of Their Shift

Exception: Field Contacts documenting Off-Duty Terry Stops that do not lead to an arrest will be submitted by the completion of the next Department work shift.

3. Officers Will Document All Other Detentions

- Social contacts are not detentions and do not require documentation per this policy.

- If the scope of the social contact evolves into a Terry stop, the officer will document the detention via a Field Contact.

- Detentions based on probable cause do not require a Field Contact but require the officer to document the stop via a Report, Infraction/Citation, Traffic Contact Report, Trespass Warning, or Parks Trespass Warning/Exclusion.

Supervisors will ensure the correct documentation of all detentions.

4. Supervisors Will Review the Documentation of Terry Stops

Absent extenuating circumstances, by the end of each shift, supervisors will review their officers’ Reports and Field Contacts that document the Terry stops made during the shift to determine if they were supported by reasonable suspicion and are consistent with SPD policy and with federal and state law.

If a supervisor concludes that a Terry stop appears to be inconsistent with SPD policy, the supervisor, in consultation with their chain of command, shall address the concern and make the appropriate referral pursuant to Section 5.002. Such action may include PAS documentation and/or referral to OPA.  The supervisor shall document these concerns and any actions taken on a Supplement when approving the Report or Field Contact.

- If a supervisor finds the documentation of the detention insufficient, the supervisor will return the documentation to the officer for corrections before the end of that shift.
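
The minimum documentation elements listed under 6.220 - POL – 4 (1) read much like a record schema, so a completeness check of the kind a supervisor review implies can be sketched in a few lines. This is only an illustrative sketch: the field names, the placeholder markers, and the two sample records are hypothetical and are not taken from the SPD Field Contact form or from the dataset's actual columns.

```python
# Hypothetical field names standing in for some of the elements listed in
# 6.220 - POL - 4 (1). They are illustrative, not the dataset's column names.
REQUIRED_ELEMENTS = [
    "stop_justification",
    "scope_and_duration",
    "disposition",
    "frisk_or_search_conducted",
    "perceived_race",
    "perceived_age",
    "perceived_gender",
]

# Values treated as "not documented" (an assumption made for this sketch).
MISSING_MARKERS = {None, "", "-", "Unknown"}


def missing_elements(record):
    """Return the required elements that are absent or blank in one record."""
    return [
        name for name in REQUIRED_ELEMENTS
        if record.get(name) in MISSING_MARKERS
    ]


# Two made-up records, for illustration only.
complete = {
    "stop_justification": "Matched description from dispatch",
    "scope_and_duration": "Brief questioning, 10 minutes",
    "disposition": "Released",
    "frisk_or_search_conducted": False,
    "perceived_race": "White",
    "perceived_age": "26 - 35",
    "perceived_gender": "Male",
}
incomplete = dict(complete, perceived_age="-", disposition="")

print(missing_elements(complete))    # []
print(missing_elements(incomplete))  # ['disposition', 'perceived_age']
```

The same idea carries over to a DataFrame by applying the check row by row, once the hypothetical names are replaced with the real column names and the real placeholder values found in the data.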

