## Seattle Terry Stops Final Project Submission

* Student name: Rebecca Mih
* Student pace: Part Time Online
* Scheduled project review date/time: May 5, 2020 12:00pm
* Instructor name: James Irving
* Blog post URL: https://github.com/sn95033/Terry-Stops-Analysis


* **Data Source:**  https://www.kaggle.com/city-of-seattle/seattle-terry-stops

    * Date of last update to the datasource: April 15, 2020


<div>
<img src= "Seattle Police Dept.jpg"
           width="200"/>
</div>


## Background

https://caselaw.findlaw.com/us-supreme-court/392/1.html

This data represents records of police reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop.

 A Terry stop is a seizure under both state and federal law. A Terry stop is
defined in policy as a brief, minimally intrusive seizure of a subject based upon
**articulable reasonable suspicion (ARS) in order to investigate possible criminal activity.**
The stop can apply to people as well as to vehicles. The subject of a Terry stop is
**not** free to leave.

Section 6.220 of the Seattle Police Department (SPD) Manual defines Reasonable Suspicion as:
Specific, objective, articulable facts which, taken together with rational inferences, would
create a  **well-founded suspicion that there is a substantial possibility that a subject has
engaged, is engaging or is about to engage in criminal conduct.**

- Each record contains perceived demographics of the subject, as reported by the officer making the stop, and officer demographics as reported to the Seattle Police Department for employment purposes.
- Where available, data elements from the associated Computer Aided Dispatch (CAD) event (e.g. Call Type, Initial Call Type, Final Call Type) are included.


## Notes on Concealed Weapons in the State of Washington

WHAT ARE WASHINGTON’S CONCEALED CARRY LAWS?
Open carry of a firearm is lawful without a permit in the state of Washington except, according to the law, “under circumstances, and at a time and place that either manifests an intent to intimidate another or that warrants alarm for the safety of other persons.”

**However, open carry of a loaded handgun in a vehicle is legal only with a concealed pistol license. Open carry of a loaded long gun in a vehicle is illegal.**

The criminal charge of “carrying a concealed firearm” happens in this state when someone carries a concealed firearm **without a concealed pistol license**. It does not matter if the weapon was discovered in the defendant’s home, vehicle, or on his or her person.

## Objectives
### Target:

   * Build a classifier which predicts Terry Stops that lead to Arrest (Binary Classification), given information about the presence of weapons, the time of day of the call, etc.  
    
### Features:
   * Report Date
   * Report time
   * Initial Call Type
   * Final Call Type
   * Call Type
   * Stop Resolution
   * Weapon type
   * Officer Squad
   * Officer Year of Birth
   * Perceived Age of subject
   * Race of officer
   * Perceived Race of subject
   * Gender of officer
   * Perceived Gender of subject
   
### Engineered Features:
    * Day of the week
    * Monthly cadence
    * Precinct
    * Watch 
    * Officer Age
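
As a minimal sketch of how some of these engineered features could be derived with pandas, assuming the `Reported Date` and `Officer YOB` columns described in this notebook. The toy rows and the new column names (`Day of Week`, `Report Month`, `Officer Age`) are illustrative assumptions, not the notebook's final names.

```python
import pandas as pd

# Toy rows; the dates and birth years mirror the format of the dataset.
toy = pd.DataFrame({
    "Reported Date": ["2015-10-16T00:00:00", "2019-04-13T00:00:00"],
    "Officer YOB": [1984, 1988],
})

reported = pd.to_datetime(toy["Reported Date"])
toy["Day of Week"] = reported.dt.day_name()
toy["Report Month"] = reported.dt.month
# Approximate age at the time of the report: report year minus year of birth.
toy["Officer Age"] = reported.dt.year - toy["Officer YOB"]
print(toy[["Day of Week", "Report Month", "Officer Age"]])
```

Using the report year rather than the current year keeps the age tied to when the stop was recorded.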
    
 ### Experiments:
 * **The Next Steps will be a set of experiments to look at how the models can improve based on:**
    - [1] Feature Selection:  Initial Call Type Versus Final Call Type 
    - [2] Model type:  XGBoost-RF  vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is a SMOTE variant designed for datasets that mix categorical and continuous variables. <br><br>
    
* **The Next Experiment will be (Bold Type):** <br><br>


* The data set is one-hot encoded, prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = Catboost + Initial Call Type
    - D = Catboost + Final Call Type
    - E = SMOTE + Catboost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + Catboost + Final Call Type
    - I = CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type
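
The one-hot encoding step noted above can be sketched as follows. This assumes pandas `get_dummies` as the encoder, which is one common choice and not necessarily the one used later in the notebook. The toy values are illustrative.

```python
import pandas as pd

# Toy rows; the column names mirror the dataset, the values are illustrative.
toy = pd.DataFrame({
    "Weapon Type": ["None", "Handgun", "None"],
    "Officer Gender": ["M", "F", "M"],
})

# One indicator column per category, named "<column>_<category>".
encoded = pd.get_dummies(toy, columns=["Weapon Type", "Officer Gender"])
print(sorted(encoded.columns))
```

Each categorical column is replaced by one indicator column per category, so high-cardinality features such as the call types are usually re-binned first.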

 
 

## Definition of Features Provided

Column Names and descriptions provided in the SPD dataset  <br>
* **Subject Age Group**	
Subject Age Group (10 year increments) as reported by the officer. <br><br>

* **Subject ID**	
Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification.  **Not Used** <br><br>

* **GO / SC Num**
General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data. **Not Used** <br><br>

* **Terry Stop ID**
Key identifying unique Terry Stop reports.  **Not Used**
<br><br>

* **Stop Resolution**
Resolution of the stop. **One hot encoding** <br><br>

* **Weapon Type**	
Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapon was found.  <br><br>

* **Officer ID**	
Key identifying unique officers in the dataset.
**Not Used** <br><br>

* **Officer YOB**	
Year of birth, as reported by the officer.  <br><br>

* **Officer Gender**	
Gender of the officer, as reported by the officer.
 <br><br>

* **Officer Race**	
Race of the officer, as reported by the officer. <br><br>

* **Subject Perceived Race**	
Perceived race of the subject, as reported by the officer. <br><br>

* **Subject Perceived Gender**	
Perceived gender of the subject, as reported by the officer. <br><br>

* **Reported Date**	
Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.  <br><br>

* **Reported Time**	
Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.  <br><br>

* **Initial Call Type**	
Initial classification of the call as assigned by 911.  <br><br>

* **Final Call Type**	
Final classification of the call as assigned by the primary officer closing the event.  <br><br>

* **Call Type**	
How the call was received by the communication center.

* **Officer Squad**	
Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP). <br><br>

* **Arrest Flag**	
Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS). <br><br>

* **Frisk Flag**	
Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop. <br><br>

* **Precinct**	
Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

* **Sector**	
Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

* **Beat**	
Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred. <br><br>

## Analysis Workflow (OSEMN)

1. **Obtain and Pre-process**
    - [x] Import data
    - [x] Remove unused columns
    - [x] Check data size, NaNs, and # of non-null values which are not valid data 
    - [x] Clean up missing values by imputing values or dropping
    - [x] Replace ? or other non-valid data by imputing values or dropping data
    - [x] Check for duplicates and remove if appropriate
    - [x] Change datatypes of columns as appropriate 
    - [x] Note which features are continuous and which are categorical<br><br>

2. **Data Scoping**
     - [x] Use value_counts() to identify dummy categories such as "-", or "?" for later re-mapping
     - [x] Identify most common word data
     - [x] Decide on which columns (features) to keep for further feature engineering
   
3. **Transformation of data (Feature Engineering)**
    - [x] Re-bin categories to reduce noise
    - [x] Re-map categories as needed
    - [x] Engineer text data to extract common word information
    - [x] Transform categoricals using 1-hot encoding or label encoding
    - [x] Perform log transformations on continuous variables (if applicable)
    - [x] Normalize continuous variables
    - [x] Use re-sampling if needed to balance the dataset <br> <br>
    
4. **Further Feature Selection**
     - [x] Use .describe() and .hist() histograms
     - [x] Identify outliers (based on auto-scaling of plots) and remove or impute as needed
     - [x] Perform visualizations to understand key features
     - [x] Inspect feature correlations (Pearson correlation) to identify co-linear features<br><br>

5.  **Create a Vanilla Machine Learning Model**
    - [x] Split into train and test data 
    - [x] Run the model
    - [x] Review Quality indicators of the model <br><br>

6. **Run more advanced models**
    - [x] Compare the model quality
    - [x] Choose one or more models for grid searching <br><br>
    
7. **Revise data inputs if needed to improve quality indicators**
    - [x] By adding created features, and removing colinear features
    - [x] By improving unbalanced datasets through oversampling or undersampling
    - [x] by removing outliers through filters
    - [x] through use of subject matter knowledge <br><br>
    
8. **Write the Report**
    - [X] Explain key findings and recommended next steps



## 1. Obtain and Pre-Process the Data

1. **Obtain and Pre-process**
    - [x] Import data
    - [x] Remove unused columns
    - [x] Check data size, NaNs, and # of non-null values which are not valid data 
    - [x] Clean up missing values by imputing values or dropping
    - [x] Replace ? or other non-valid data by imputing values or dropping data
    - [x] Check for duplicates and remove if appropriate
    - [x] Change datatypes of columns as appropriate 
    - [x] Decide the target column, if not already decided
    - [x] Determine if some data is not relevant to the question (drop columns or rows)
    - [x] Note which features will need to be re-mapped or encoded 
    - [x] Note which features might require feature engineering (example - date, time) <br><br>
  

In [1]:
#!pip install -U fsds_100719
from fsds_100719.imports import *
#import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt
#import seaborn as sns
import copy
import sklearn
import math
import datetime
#import plotly.express as px
#import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

#!pip3 install xgboost
import xgboost as xbg
from xgboost import XGBRFClassifier,XGBClassifier

import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score, roc_curve

pd.options.display.float_format = '{:.2f}'.format
pd.set_option('display.max_columns',0)
pd.set_option('display.max_info_rows',200)
%matplotlib inline


fsds_1007219  v0.7.21 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


| Handle | Package | Description |
|---|---|---|
| dp | IPython.display | Display modules with helpful display and clearing commands. |
| fs | fsds_100719 | Custom data science bootcamp student package |
| mpl | matplotlib | Matplotlib's base OOP module with formatting artists |
| plt | matplotlib.pyplot | Matplotlib's matlab-like plotting module |
| np | numpy | scientific computing with Python |
| pd | pandas | High performance data structures and tools |
| sns | seaborn | High-level data visualization library based on matplotlib |


[i] Pandas .iplot() method activated.


In [2]:
# Evaluate a fitted classifier and return its metrics table and top feature importances
def evaluate_model(y_true, y_pred,X_true,clf,
                    cm_kws=dict(cmap="YlGnBu",normalize='true'),figsize=(10,4),plot_roc_auc=True, 
                    expt_name='Model'):
    
    '''Evaluate a fitted classifier, display the metrics, and save the figures.
    Inputs: 
        y_true: true target values for the test data
        y_pred: target values predicted by the model for the test data
        X_true: feature matrix of the test data (used for the confusion matrix and ROC curve)
        
        clf:  fitted classifier used for the model (examples: xgb-rf, Catboost)
        
        cm_kws: keyword settings for plotting and normalizing the confusion matrix
                Defaults: cmap="YlGnBu", normalize="true"
        figsize: size of the plot, default=(10,4)
        plot_roc_auc: whether to plot the ROC curve and compute the AUC, default=True

        expt_name: Pass in the experiment name, so that the saved images will be unique
                  default = 'Model'
    Outputs:  
              result_df: dataframe which contains the classification metrics 
                (precision, recall, f1-score and support per class, plus macro and weighted averages)
                
              df_important: The top 5 feature importances, the accuracy and AUC
                (None if the importances or the AUC are unavailable)
    
    Saves:   Confusion matrix plot
             roc_auc plot - plot of AUC for the model
             Feature importance plot
    '''
    
    ## Get the Accuracy metrics
    
    accuracy_result = round(accuracy_score(y_true, y_pred),3)
    metrics_report = metrics.classification_report(y_true,y_pred, 
                                            target_names = ['Not Arrested', 'Arrested'],
                                           output_dict=True)
    print('Model Name = ', expt_name)
    print(f'Accuracy Score = {accuracy_result:.3}')

    ## Save scores into the results dataframe
    result_df = pd.DataFrame(metrics_report).transpose()
    result_df.drop(labels='accuracy',axis = 0, inplace=True)
    #result_df.drop(labels='support', axis = 1, inplace=True)
    # Swap Rows  https://stackoverflow.com/questions/55439469/swapping-two-rows-together-with-index-within-the-same-pandas-dataframe
    # result_df.iloc[np.r_[0:len(result_df) -2, -1, -2]] 

    result_df.rename(index= {'weighted avg':'Weighted Avg', 'accuracy':'Accuracy',
                            'macro avg': 'Macro Avg',}, inplace=True)

    result_df.rename(columns = {'precision': 'Precision', 'recall':'Recall', 
                                'f1-score':'F1 Score','support':'Support'}, inplace=True)
    display(result_df)

    
    if plot_roc_auc:
        num_cols=2
    else:
        num_cols=1
        
    fig, ax = plt.subplots(figsize=figsize,ncols=num_cols)
  
    
    if not isinstance(ax,np.ndarray):
        ax=[ax]
        
    try:
        metrics.plot_confusion_matrix(clf,X_true,y_true,ax=ax[0],**cm_kws)
        ax[0].grid(False)
        ax[0].set(title='Confusion Matrix')
        plt.savefig(("Confusion Matrix {}.png").format(expt_name))
    except:
            print('Confusion Matrix Not Working')
        
    
    if plot_roc_auc:
        try:
            y_score = clf.predict_proba(X_true)[:,1]

            fpr,tpr,thresh = metrics.roc_curve(y_true,y_score)
            roc_auc = round(metrics.auc(fpr,tpr),3)
            
            ax[1].plot(fpr,tpr,color='teal',label=f'ROC Curve (AUC={roc_auc})')
            ax[1].plot([0,1],[0,1],ls=':')
            ax[1].legend()
            ax[1].grid(False)
            ax[1].set(ylabel='True Positive Rate',xlabel='False Positive Rate',
                  title='Receiver operating characteristic (ROC) Curve')
            plt.tight_layout()
            # Save before plt.show(): with the inline backend the figure is closed once shown
            plt.savefig(("ROC Curve {}.png").format(expt_name))
            plt.show()
            print("AUC = ", roc_auc)
    
        except:
            print('ROC-AUC not working')
    try: 
        df_important = plot_importance(clf, expt_name = expt_name)
        df_important = df_important.sort_values(ascending=False).head()
        df_important["Accuracy"] = accuracy_result
        df_important['AUC'] = roc_auc
        df_important = df_important.reset_index()
        df_important = df_important.rename(columns = {'index':'Description', 0:expt_name})
       
      
    except:
        df_important = None
        print('importance plotting not working')
    
    return result_df, df_important



In [3]:

def plot_importance(tree, top_n=20,figsize=(10,10),expt_name='Model'):
    
    '''Feature Selection tool, which plots the feature importance based on results
    
    Inputs:
      tree: classification learning function utilized
      top_n: top n features contributing to the model, default = 20
      figsize:  size of the plot,  default=(10,10)
      expt_name: Pass in the experiment name, so that the saved feature importance image will be unique
                  default = Model
                  
    Returns: df_importance - series of the model features sorted by importance
    Saves:  Feature importance figure as  "Feature expt_name.png", Default expt_name = "Model" '''
    
    df_importance = pd.Series(tree.feature_importances_,index=X_train.columns).sort_values(ascending=True)
    df_importance.tail(top_n).plot(kind='barh',figsize=figsize, color='teal')
    plt.savefig(("Feature {}.png").format(expt_name))
    #plt.savefig("Feature Importance 2.png", transparent = True)
    
    
    return df_importance


In [4]:
def clean_call_types(df_to_clean, col_name, new_col):
    '''Transform Call Type text into a single identifier
    
    Inputs:  df_to_clean - dataframe (modified in place),  col_name - column which has the Call type,
             new_col - name of the new column
    
    Outputs: A dictionary of value counts for the new column, which can be used for .map()'''
    idx = df_to_clean[col_name] == '-' # Create an index of the true and false values for the condition == '-'
    df_to_clean.loc[idx, col_name] = 'Unknown'
    column_series = df_to_clean[col_name]
    df_to_clean[new_col] = column_series.apply(lambda x:x.replace('--','').split('-')[0].strip())
    #df_to_clean[new_col].value_counts(dropna=False).sort_index()
    #df_to_clean.isna().sum()
    df_to_clean[new_col] = df_to_clean[new_col].str.extract(r'(\w+)')
    df_to_clean[new_col] = df_to_clean[new_col].str.lower()
    last_map = df_to_clean[new_col].value_counts().to_dict()
    return last_map
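

A minimal usage sketch of the helper above, run on three call-type strings in the format the dataset uses. The function is restated compactly so the block runs on its own. Passing `expand=False` to `str.extract` is an assumption added here so that a Series is returned, and the name `final_call` is hypothetical.

```python
import pandas as pd

def clean_call_types(df_to_clean, col_name, new_col):
    """Compact restatement of the notebook helper, for illustration only."""
    df_to_clean.loc[df_to_clean[col_name] == '-', col_name] = 'Unknown'
    first_part = df_to_clean[col_name].apply(
        lambda x: x.replace('--', '').split('-')[0].strip())
    # Keep only the first word of the leading segment, lower-cased.
    df_to_clean[new_col] = first_part.str.extract(r'(\w+)', expand=False).str.lower()
    return df_to_clean[new_col].value_counts().to_dict()

toy = pd.DataFrame({'Final Call Type': [
    '--ASSAULTS, OTHER',
    '--ASSAULTS - HARASSMENT, THREATS',
    '-',
]})
call_map = clean_call_types(toy, 'Final Call Type', 'final_call')
print(toy['final_call'].tolist())
print(call_map)
```

Both assault variants collapse to the single identifier `assaults`, which is the noise reduction the re-binning step is after.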
    

In [5]:
df = pd.read_csv('Terry_Stops.csv',low_memory=False)
df.duplicated().sum()


0

In [6]:
#df.head()

* Drop Columns which contain IDs, which are not useful features.

In [7]:
df.drop(columns = ['Subject ID', 'GO / SC Num', 'Terry Stop ID', 'Officer ID'], inplace=True)

In [8]:
df.duplicated().sum()
# After dropping some of the columns, some rows appear to be duplicated.
# However, since the date and time of the incident are NOT exact (i.e. the date could be 24 hours later, and the
# time could be 10 hours later), it's possible to get some that are similar on different consecutive dates.

112

In [9]:
#df.columns

In [10]:
col_names = df.columns
print(col_names)

Index(['Subject Age Group', 'Stop Resolution', 'Weapon Type', 'Officer YOB',
       'Officer Gender', 'Officer Race', 'Subject Perceived Race',
       'Subject Perceived Gender', 'Reported Date', 'Reported Time',
       'Initial Call Type', 'Final Call Type', 'Call Type', 'Officer Squad',
       'Arrest Flag', 'Frisk Flag', 'Precinct', 'Sector', 'Beat'],
      dtype='object')


In [11]:
df.shape

# The rationale for this is to understand how big the dataset is and how many features it contains.
# This helps with planning for functions vs lambda functions, and whether certain kinds of visualizations will be feasible
# for the analysis (with my computer hardware).  With compute limitations, some types of correlation plots cause the kernel to die
# if there are more than 11 features.

(41104, 19)

* df.isna().sum()

isna().sum() determines how many values are missing from a given feature

* df.info() 

df.info() helps you determine if there are missing values or datatypes that need to be modified

* **Handy alternate checks if needed**
    - [x] df.isna().any()
    - [x] df.isnull().any()
    - [x] df.shape

In [12]:
df.isna().sum()


Subject Age Group             0
Stop Resolution               0
Weapon Type                   0
Officer YOB                   0
Officer Gender                0
Officer Race                  0
Subject Perceived Race        0
Subject Perceived Gender      0
Reported Date                 0
Reported Time                 0
Initial Call Type             0
Final Call Type               0
Call Type                     0
Officer Squad               535
Arrest Flag                   0
Frisk Flag                    0
Precinct                      0
Sector                        0
Beat                          0
dtype: int64

In [13]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Officer Squad'] = df['Officer Squad'].fillna('Unknown')

**Findings from isna().sum()**
* Officer Squad has 535 missing values (1.3% of the data)
    * Impute "Unknown"
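
The percentage above (535 of 41104 rows, about 1.3%) can be read directly from `isna().mean()`. The block below is a small sketch on toy data rather than the real file, so its numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing squad values out of four rows.
toy = pd.DataFrame({
    "Officer Squad": ["WEST PCT 3RD W - MARY", np.nan, "NORTH PCT 2ND W - NORA", np.nan],
    "Precinct": ["West", "-", "North", "-"],
})

# Fraction of missing values per column, expressed as a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Impute the placeholder label for the missing squads.
toy["Officer Squad"] = toy["Officer Squad"].fillna("Unknown")
```

Note that the "-" placeholders in `Precinct` are not counted as missing by `isna()`, which is why the `value_counts()` inspection is still needed.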

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41104 entries, 0 to 41103
Data columns (total 19 columns):
 #   Column                    Dtype 
---  ------                    ----- 
 0   Subject Age Group         object
 1   Stop Resolution           object
 2   Weapon Type               object
 3   Officer YOB               int64 
 4   Officer Gender            object
 5   Officer Race              object
 6   Subject Perceived Race    object
 7   Subject Perceived Gender  object
 8   Reported Date             object
 9   Reported Time             object
 10  Initial Call Type         object
 11  Final Call Type           object
 12  Call Type                 object
 13  Officer Squad             object
 14  Arrest Flag               object
 15  Frisk Flag                object
 16  Precinct                  object
 17  Sector                    object
 18  Beat                      object
dtypes: int64(1), object(18)
memory usage: 6.0+ MB


In [15]:
df.duplicated().sum()

112

In [16]:
duplicates = df[df.duplicated(keep = False)]


#### Use value_counts() - inspect for dummy variables, and determine next steps for data cleaning

1. Rationale:  This analysis is useful for flushing out missing values in the form of question marks, dashes or other symbols or dummy variables <br><br>

2.  It also gives a preliminary view of the number and distribution of categories in each feature, albeit by numbers rather than graphics <br><br>

3. For text data, value_counts serves as a preliminary investigation of the common important word data <br><br>


In [17]:
for col in df.columns:
    print(col, '\n', df[col].value_counts(), '\n')
    
# Most of the Stop Resolutions do not result in arrest - they result in a field contact or an offense report
# We will combine those Arrested (9957) with those Referred for Prosecution (728)
# under the assumption that the police department and prosecutors are aware of the legal proof needed for arrest, and that the 
# case will likely result in arrest

Subject Age Group 
 26 - 35         13615
36 - 45          8547
18 - 25          8509
46 - 55          5274
56 and Above     1996
1 - 17           1876
-                1287
Name: Subject Age Group, dtype: int64 

Stop Resolution 
 Field Contact               16287
Offense Report              13976
Arrest                       9957
Referred for Prosecution      728
Citation / Infraction         156
Name: Stop Resolution, dtype: int64 

Weapon Type 
 None                                 32565
-                                     6213
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      308
Handgun                                262
Firearm Other                          100
Club, Blackjack, Brass Knuckles         49
Blunt Object/Striking Implement         37
Firearm                                 18
Firearm (unk type)                      15
Other Firearm                           13
Mace/Pepper Spray                       12
Club                          

####  Findings from value_counts() and Next Steps:

1. The "-" is used as a substitute for unknown, in many cases.  Perhaps it would be good to build a function to impute "unknown" for the "-" for multiple features
2. Race and gender need re-mapping
3. Call Types, Weapons need re-binning
4. Officer Squad text can be split and provide the precinct, and the watch.
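
A function along the lines of finding 1 might look like the sketch below. The name `replace_dash_with_unknown` and its parameters are hypothetical, and the toy rows are illustrative.

```python
import pandas as pd

def replace_dash_with_unknown(frame, columns, placeholder="-", label="Unknown"):
    """Return a copy of `frame` with `placeholder` replaced by `label` in `columns`."""
    cleaned = frame.copy()
    for col in columns:
        # Series.replace matches whole values, so "26 - 35" is left untouched.
        cleaned[col] = cleaned[col].replace(placeholder, label)
    return cleaned

toy = pd.DataFrame({
    "Subject Age Group": ["26 - 35", "-", "18 - 25"],
    "Weapon Type": ["None", "-", "Handgun"],
})
cleaned = replace_dash_with_unknown(toy, ["Subject Age Group", "Weapon Type"])
print(cleaned)
```

Returning a copy keeps the raw values available for comparison while the cleaning rules are still being decided.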

**Next steps:**
- [x] Investigation of the Stop Resolution, to determine whether the target should be "Stop Resolution - Arrests" or "Arrest Flag", and whether "Frisk Flag" is useful for predicting arrests.

- [x] Decide whether time and location information can be extracted from the "Officer Squad" column instead of the columns for time, Precinct, Sector and Beats
 
    
    

In [18]:
# Viewing the data to get a sense of which Stop Resolutions are correlated to the "Arrest Flag"
df.sort_values(by=['Stop Resolution'], ascending=True).head(100)


Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
32174,36 - 45,Arrest,,1988,M,White,Multi-Racial,Male,2019-04-13T00:00:00,11:35:00,THREATS - DV - NO ASSAULT,"--ASSAULTS - HARASSMENT, THREATS",911,SOUTH PCT 1ST W - SAM,N,N,South,S,S1
32172,36 - 45,Arrest,,1991,M,White,Black or African American,Female,2019-04-09T00:00:00,01:13:00,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS, OTHER",911,WEST PCT 2ND W - KING,N,N,West,K,K3
7804,18 - 25,Arrest,,1990,M,Hispanic or Latino,Black or African American,Male,2017-08-15T00:00:00,20:36:00,TRAFFIC STOP - OFFICER INITIATED ONVIEW,--TRAFFIC - REFUSE TO STOP (PURSUIT),ONVIEW,WEST PCT 2ND W - D/M RELIEF,N,N,West,D,D2
32171,36 - 45,Arrest,,1986,M,White,White,Male,2019-04-08T00:00:00,22:03:00,BURG - COMM BURGLARY (INCLUDES SCHOOLS),--ROBBERY - STRONG ARM,911,WEST PCT 3RD W - QUEEN,N,N,West,K,K1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23853,26 - 35,Arrest,Firearm,1989,M,White,Native Hawaiian or Other Pacific Islander,Male,2020-03-08T00:00:00,18:45:44,SUSPICIOUS STOP - OFFICER INITIATED ONVIEW,"--WEAPON, PERSON WITH - GUN",ONVIEW,TRAINING - FIELD TRAINING SQUAD,Y,Y,East,C,C1
7719,18 - 25,Arrest,,1979,M,White,Black or African American,Male,2017-07-10T00:00:00,01:13:00,FIGHT - IP - PHYSICAL (NO WEAPONS),"--ASSAULTS, OTHER",ONVIEW,WEST PCT 3RD W - DAVID BEATS,N,N,West,M,M2
32277,36 - 45,Arrest,,1990,M,Hispanic or Latino,Black or African American,Female,2019-05-02T00:00:00,22:27:00,ASLT - IP/JO - WITH OR W/O WPNS (NO SHOOTINGS),"--ASSAULTS - HARASSMENT, THREATS",911,TRAINING - FIELD TRAINING SQUAD,N,N,South,R,R2
7722,18 - 25,Arrest,,1971,M,White,Black or African American,Male,2017-07-20T00:00:00,01:41:00,ASLT - WITH OR W/O WEAPONS (NO SHOOTINGS),"--ASSAULTS, OTHER",911,TRAINING - FIELD TRAINING SQUAD,N,N,North,J,J1


In [19]:
# Check out what are the differences between a Stop Resolution of "Arrest" and the "Arrest Flag" 
df.loc[(df['Stop Resolution']=='Arrest') & (df['Arrest Flag']=="N")].shape

# This is the number of cases where the final stop resolution as reported by the officer, was "Arrest" and the
# Arrest Flag was N.  This indicates that many arrests are finalized after the actual Terry Stop

(8210, 19)

In [20]:
df.loc[(df['Stop Resolution']!='Arrest') & (df['Arrest Flag']=="Y")].shape

# Number of times the Stop Resolution was not "Arrest", but the Arrest Flag was Y (an arrest was made during the Terry Stop)

(2, 19)

In [21]:
df.loc[(df['Stop Resolution']=='Arrest') & (df['Arrest Flag']=="Y")].shape

# This is the number of arrests DURING the Terry stop that had a final resolution of arrest

# Conclusion:  Few Terry stops result in an arrest during the stop itself. Followup investigation is needed.
# Use the Stop Resolution of Arrest to capture all the arrests arising from a Terry stop
# The total number of arrests as reported by the officers is 8210 + 1747 = 9957, or ~24% of the total # of Terry stops

(1747, 19)
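
The three `.loc` checks above (8210, 2 and 1747 rows on the real data) can also be read from a single cross-tabulation. This is a sketch on toy rows, so the counts it prints are illustrative; the labels `Arrest` and `Other` are assumptions introduced for readability.

```python
import pandas as pd

toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Arrest", "Arrest", "Field Contact", "Offense Report"],
    "Arrest Flag":     ["N", "N", "Y", "N", "Y"],
})

# Collapse the resolution into "Arrest" vs "Other", then count against the flag.
resolution = toy["Stop Resolution"].eq("Arrest").map({True: "Arrest", False: "Other"})
table = pd.crosstab(resolution, toy["Arrest Flag"])
print(table)
```

Each cell of the table corresponds to one of the filtered `.shape` calls, which makes disagreements between the two columns easy to spot.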

In [22]:
# Check to see whether the Frisk Flag has usefulness
df.loc[(df['Stop Resolution']=='Arrest') & (df['Frisk Flag']=="Y")].shape

# Out of ~10,000 arrests (and ~9000 frisks), the number of arrests where the subject was frisked was 3235 (~32% of arrests)
# May still have value as a predictor

(3235, 19)

In [23]:
# Check whether 'Call Type' has usefulness
df.loc[(df['Stop Resolution']=='Arrest') & (df['Call Type']=="911")].shape

# Out of ~10,000 arrests roughly 59% came through 911.  Doesn't appear to be particularly useful for predicting arrests
# Drop the 'Call Type'

(5888, 19)

In [24]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,-,Arrest,,1984,M,Black or African American,Asian,Male,2015-10-16T00:00:00,11:32:00,-,-,-,SOUTH PCT 1ST W - ROBERT,N,N,South,O,O2
1,-,Field Contact,,1963,M,White,-,-,2015-04-01T00:00:00,04:55:00,-,-,-,Unknown,N,N,-,-,-
2,-,Field Contact,,1985,M,Hispanic or Latino,-,-,2015-05-25T00:00:00,01:06:00,-,-,-,WEST PCT 3RD W - MARY,N,N,-,-,-
3,-,Field Contact,,1979,M,White,-,-,2015-06-09T00:00:00,19:27:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-
4,-,Field Contact,,1979,M,White,-,-,2015-06-09T00:00:00,19:32:00,-,-,-,NORTH PCT 2ND W - NORA,N,N,-,-,-


## 2. Data Scoping 

1. Which is better to use as the target: the "Arrest Flag" column or the "Stop Resolution" column? <br><br>

* Arrest Flag is "Y" only when there was an actual arrest during the Terry Stop, which may not be easy to do, resulting in a much lower count (1747 of the stops with an "Arrest" resolution). <br><br>

* Stop Resolution records ~10,000 arrests, roughly 24% of the total dataset.  Since Stop Resolution is about officers recording the resolution of the Terry Stop, and there is likely a performance target for officers, they are likely to record this more accurately. <br><br>

* A quick check of "Frisk Flag", which is an indicator of those Terry stops where a frisk was performed, shows that it does not seem well correlated with arrests.  Recommend dropping "Frisk Flag". <br><br>

#### Conclusion: Use "Stop Resolution" Arrests as the target

  - [x] Create a new column called "Arrests" which encodes Stop Resolution Arrests as a "1" and all others "0".  
  - [x] Drop the "Arrest Flag" column
  - [x] Drop the "Frisk Flag" column <br><br>
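
The checklist above can be sketched as follows on toy rows. The column name `Arrests` comes from the checklist and the data is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Stop Resolution": ["Arrest", "Field Contact", "Offense Report", "Arrest"],
    "Arrest Flag": ["N", "N", "N", "Y"],
    "Frisk Flag": ["N", "Y", "N", "Y"],
})

# 1 when the officer-reported resolution is "Arrest", 0 otherwise.
toy["Arrests"] = (toy["Stop Resolution"] == "Arrest").astype(int)
toy = toy.drop(columns=["Arrest Flag", "Frisk Flag"])

# Share of positive rows, a quick check of class balance.
arrest_rate = toy["Arrests"].mean()
print(arrest_rate)
```

On the real data the same mean would come out near 0.24, which is the class imbalance that motivates the SMOTE experiments.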
    
2. Location data: there are a number of columns which relate to location, such as "Precinct", "Officer Squad", "Sector" and "Beat", but they are indirect measures of the actual location of the Terry Stop. Inspection of the "Officer Squad" text shows the location assignment of the officer making the report. In ~10% of cases, Terry stops were performed by field training units or other units which are not captured by precinct (hence roughly 25% of the precincts are unknown). The training unit information is captured in the "Officer Squad" column.  <br><br>

3. For time data there is a "Reported Time" -- which is the time when the officer report was submitted, and according to the documentation could be delayed up to 10 hours, rather than the time of the actual Terry stop. <br><br> 

    However, inspection of the text in "Officer Squad" shows that the reporting officer's watch is recorded. In the Seattle police squad there are 3 watches to cover each 24 hour period: Watch 1 (03:00 - 11:00), Watch 2 (11:00 - 19:00), and Watch 3 (19:00 - 03:00).  Since officer performance is rated based on number of cases and crimes prevented or apprehended, the "Officer Squad" data which comes from the report is likely to be the most reliable in terms of time.
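
To make the watch boundaries concrete, here is a small sketch that maps a clock time to a watch number. Treating each interval as start-inclusive and end-exclusive is an assumption, since the source does not say which watch owns the boundary minute, and `watch_for` is a hypothetical helper name.

```python
from datetime import time

def watch_for(t):
    """Return the watch number (1, 2 or 3) covering clock time `t`."""
    if time(3, 0) <= t < time(11, 0):
        return 1
    if time(11, 0) <= t < time(19, 0):
        return 2
    # Watch 3 wraps past midnight: 19:00 up to 03:00.
    return 3

print(watch_for(time(4, 55)), watch_for(time(11, 32)), watch_for(time(1, 13)))
```

A mapping like this could be used to cross-check the watch parsed from "Officer Squad" against "Reported Time", keeping in mind that the reported time can lag the stop by up to 10 hours.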
    
#### Conclusion: Use "Officer Squad" text data for time and location

- [x] Parse the "Officer Squad" data to capture the location and time based on officer assignments, creating columns for location and watch. <br><br>

- [x] Drop the "Reported Time", "Precinct", "Sector", and "Beat" columns <br><br>
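
One possible way to parse the squad text is a regular expression with named groups. The pattern below is an assumption based on squad strings such as "SOUTH PCT 1ST W - ROBERT" seen in the sample rows. The fallback labels `Other` and `0` for squads that do not follow that pattern are hypothetical choices.

```python
import pandas as pd

squads = pd.Series([
    "SOUTH PCT 1ST W - ROBERT",
    "WEST PCT 2ND W - KING",
    "WEST PCT 3RD W - QUEEN",
    "TRAINING - FIELD TRAINING SQUAD",
    "Unknown",
])

# Assumed pattern: "<NAME> PCT <n>(ST|ND|RD) W ..."; non-matching rows come back as missing.
parts = squads.str.extract(r"^(?P<precinct>\w+) PCT (?P<watch>\d)(?:ST|ND|RD) W")
parts["precinct"] = parts["precinct"].fillna("Other")
parts["watch"] = parts["watch"].fillna("0").astype(int)  # 0 = watch not stated
print(parts)
```

Training and other special units fall into the fallback category here; they may deserve their own label, since the notebook notes they account for a sizeable share of stops.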


In [25]:
df.drop(columns=['Arrest Flag', 'Frisk Flag', 'Reported Time', 'Precinct', 'Sector', 'Beat'], inplace = True)

In [26]:
# Re-Check for duplicates
#duplicates = seattle_df[seattle_df.duplicated(subset =['id'], keep = False)]
#duplicates.sort_values(by=['id']).head()
duplicates = df[df.duplicated(keep = False)]
df.duplicated().sum()

2285

#### Finding from duplicated():
- At the beginning of the analysis, I checked the entire dataset for duplicates (before removing columns such as "ID") and found none. After dropping the ID, there are 118 duplicated rows (59 pairs). <br><br>

- The date and time are not exact (the documentation says the date could have been entered up to 24 hours later, and the time could be off by up to 10 hours), so genuinely distinct Terry stops could share identical data once the ID columns are removed.<br><br>

- A few of the duplicates are arrests. Whether to remove the duplicated data is still an open decision.  <br><br>

- Curiously, the index numbers of duplicated rows are not always consecutive. This suggests that the data may have been entered twice, perhaps due to a computer or internet glitch.
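
To make the duplicate counts easier to interpret, here is a minimal sketch on a made-up frame (the rows and column values are illustrative only, not taken from the dataset). It shows that `duplicated().sum()` counts only the extra copies after the first occurrence, while `duplicated(keep=False).sum()` counts every row that has a twin, so the two numbers can differ.

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical, and rows 1, 3 and 4 are identical
toy = pd.DataFrame({
    "Officer Squad": ["A", "B", "A", "B", "B", "C"],
    "Stop Resolution": ["Arrest", "Field Contact", "Arrest",
                        "Field Contact", "Field Contact", "Arrest"],
})

# Extra copies only: the first occurrence of each group is not flagged
extra_copies = toy.duplicated().sum()

# Every row that belongs to a group of identical rows
rows_involved = toy.duplicated(keep=False).sum()

# Size of each group of identical rows, keeping only groups larger than one
group_sizes = toy.groupby(list(toy.columns)).size()
repeated_groups = group_sizes[group_sizes > 1]

print(extra_copies, rows_involved)
print(repeated_groups)
```

Looking at the group sizes also shows whether the duplicates really come in pairs or in larger groups.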

##  3. Data Transformation

   * Officer data: YOB, race, gender
   * Subject data- Age Group, race, gender
   * Stop Resolution (target column)
   * Weapons
   * Type of potential crime: Call type Initial and Final 
   * Date to day of week
   * Location and time: from Officer Squad
   

### A. Transform Gender Using Dictionary Mapping .map()
   

In [27]:
# Re-map the gender categories to integer codes: 0 = male, 1 = female, 2 = unknown / other

# officer_gender
officer_gender = {'M': 0, 'F': 1, 'N': 2}
df['Officer Gender'] = df['Officer Gender'].map(officer_gender)

# subject perceived gender
subject_gender = {'Male': 0, 'Female':1, 'Unknown':2,  '-':2, 
                 'Unable to Determine':2, 'Gender Diverse (gender non-conforming and/or transgender)':2}
df['Subject Perceived Gender'] = df['Subject Perceived Gender'].map(subject_gender)

In [28]:
#Check the mapping (after re-mapping, male is encoded as 0)
df.loc[(df['Officer Gender']== 0)].shape, df.loc[(df['Subject Perceived Gender']== 0)].shape
df['Officer Gender'].value_counts()

0    36504
1     4593
2        7
Name: Officer Gender, dtype: int64

In [29]:
df['Subject Perceived Gender'].value_counts()

0    32049
1     8468
2      587
Name: Subject Perceived Gender, dtype: int64

In [30]:
# Check the mapping
df['Officer Gender'].isna().sum(), df['Subject Perceived Gender'].isna().sum()  #NAs are not found 

(0, 0)
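
One thing worth keeping in mind with `.map()` is that any category missing from the dictionary silently becomes `NaN`, which is why the `isna()` check above matters. The sketch below is a small hypothetical helper (`unmapped_values` is my own name, and the "Nonbinary" label is invented for the toy data) that lists the uncovered categories before mapping.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null values of `series` that have no key in `mapping`."""
    return sorted(set(series.dropna().unique()) - set(mapping))

# Toy data: "Nonbinary" is deliberately left out of the dictionary
toy_gender = pd.Series(["Male", "Female", "-", "Nonbinary", "Male"])
toy_map = {"Male": 0, "Female": 1, "-": 2}

missing = unmapped_values(toy_gender, toy_map)   # categories the map does not cover
mapped = toy_gender.map(toy_map)                 # uncovered categories become NaN

print(missing)
print(mapped.isna().sum())
```

Running a check like this before each `.map()` call would flag new categories if the data source is updated.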

### B. Transform Age Using Dictionary Mapping .map() and binning (.cut)

In [31]:
# Tidy the age-group labels and map '-' to 'Unknown'; keep them as text because they will be one-hot encoded
subject_age = {'1 - 17':'1-17', '18 - 25':'18-25', '26 - 35':'26-35', '36 - 45':'36-45', '46 - 55':'46-55', 
               '56 and Above':'56+', '-':'Unknown'}
df['Subject Age Group'] = df['Subject Age Group'].map(subject_age)

In [32]:
df['Subject Age Group'].isna().sum()

0

In [33]:
df['Subject Age Group'].value_counts()

26-35      13615
36-45       8547
18-25       8509
46-55       5274
56+         1996
1-17        1876
Unknown     1287
Name: Subject Age Group, dtype: int64

In [34]:
# Calculate the officer's age, then bin it into the same bins as the subject age
df['Reported Year']=pd.to_datetime(df['Reported Date']).dt.year
df['Reported Year']

0        2015
1        2015
2        2015
3        2015
4        2015
         ... 
41099    2020
41100    2020
41101    2020
41102    2020
41103    2020
Name: Reported Year, Length: 41104, dtype: int64

In [35]:
df['Officer Age'] = df['Reported Year'] - df['Officer YOB']
df['Officer Age'].value_counts(dropna=False)

31     2842
30     2592
32     2515
33     2443
29     2246
34     2206
28     2140
26     1923
27     1869
35     1723
25     1544
24     1425
36     1226
37     1184
38     1096
39      970
40      945
42      806
41      753
23      730
45      698
46      679
44      664
48      618
43      593
47      582
49      543
50      533
54      436
51      399
53      377
52      310
55      308
56      242
22      241
57      203
58      177
59       88
60       69
61       39
63       32
21       19
62       19
65       16
64       13
67       12
68        4
118       3
116       2
119       2
70        1
69        1
117       1
66        1
115       1
Name: Officer Age, dtype: int64

In [36]:

df['Officer Age'] =pd.cut(x=df['Officer Age'], bins=[1,18,25,35,45,55,70,120], labels = ['1-17', 
                          '18-25','26-35','36-45', '46-55', '56+', 'Unknown'])
df['Officer Age'].value_counts(dropna=False)

26-35      22499
36-45       8935
46-55       4785
18-25       3959
56+          917
Unknown        9
1-17           0
Name: Officer Age, dtype: int64
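
A note on the bin edges: by default `pd.cut` uses right-closed intervals, so `bins=[1,18,25,...]` produces (1, 18], (18, 25], and so on, which puts an age of exactly 18 under the '1-17' label. In the output above the '1-17' bin is empty, so this does not affect the results here. The sketch below, on a few made-up ages, compares the default with a left-closed variant whose edges are my own choice to match the label text.

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 56, 70, 118])   # illustrative ages only
labels = ['1-17', '18-25', '26-35', '36-45', '46-55', '56+', 'Unknown']

# Default right=True: intervals are (1, 18], (18, 25], ... so 18 is labelled '1-17'
right_closed = pd.cut(ages, bins=[1, 18, 25, 35, 45, 55, 70, 120], labels=labels)

# right=False: intervals are [1, 18), [18, 26), ... so 18 is labelled '18-25'
left_closed = pd.cut(ages, bins=[1, 18, 26, 36, 46, 56, 71, 121],
                     labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```

For whole-number ages the two versions differ only at age 18.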

In [37]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad,Reported Year,Officer Age
0,Unknown,Arrest,,1984,0,Black or African American,Asian,0,2015-10-16T00:00:00,-,-,-,SOUTH PCT 1ST W - ROBERT,2015,26-35
1,Unknown,Field Contact,,1963,0,White,-,2,2015-04-01T00:00:00,-,-,-,Unknown,2015,46-55
2,Unknown,Field Contact,,1985,0,Hispanic or Latino,-,2,2015-05-25T00:00:00,-,-,-,WEST PCT 3RD W - MARY,2015,26-35
3,Unknown,Field Contact,,1979,0,White,-,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45
4,Unknown,Field Contact,,1979,0,White,-,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45


### C. Transform Race Using Dictionary Mapping .map()

In [38]:
# Check how many arrested had unknown race (or - or other)

df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "Unknown")].shape
#df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "-")].shape
#df.loc[(df['Stop Resolution']=='Arrest') & (df['Subject Perceived Race']== "Other")].shape
df['Subject Perceived Race'].value_counts()

White                                        20192
Black or African American                    12243
Unknown                                       2073
Hispanic                                      1684
-                                             1422
Asian                                         1278
American Indian or Alaska Native              1224
Multi-Racial                                   809
Other                                          152
Native Hawaiian or Other Pacific Islander       27
Name: Subject Perceived Race, dtype: int64

In [39]:
race_map = {'White': 'White', 'Black or African American':'African American', 'Hispanic':'Hispanic',
            'Hispanic or Latino':'Hispanic', 'Two or More Races':'Multi-Racial','Multi-Racial':'Multi-Racial',
           'American Indian or Alaska Native':'Native', 'American Indian/Alaska Native':'Native',  
            'Native Hawaiian or Other Pacific Islander':'Native', 'Nat Hawaiian/Oth Pac Islander':'Native',
           '-':'Unknown', 'Other':'Unknown', 'Not Specified':'Unknown','Unknown':'Unknown',
           'Asian': 'Asian',}

df['Subject Perceived Race'] = df['Subject Perceived Race'].map(race_map)
df['Officer Race'] = df['Officer Race'].map(race_map)

In [40]:
df['Officer Race'].value_counts(dropna = False)

White               31805
Hispanic             2255
Multi-Racial         2158
African American     1674
Asian                1563
Unknown               921
Native                728
Name: Officer Race, dtype: int64

In [41]:
df['Subject Perceived Race'].value_counts(dropna=False)

White               20192
African American    12243
Unknown              3647
Hispanic             1684
Asian                1278
Native               1251
Multi-Racial          809
Name: Subject Perceived Race, dtype: int64

### D. Transform Stop Resolution Using Dictionary Mapping .map()

In [42]:
# Now address the Stop Resolution categories
df['Stop Resolution'].value_counts(dropna=False)

Field Contact               16287
Offense Report              13976
Arrest                       9957
Referred for Prosecution      728
Citation / Infraction         156
Name: Stop Resolution, dtype: int64

In [43]:
# Re-map the Stop Resolution, to combine categories Arrest and Referred for Prosecution
# Map Arrest and Referred for Prosecution to 1,  and all others 0
stop_resolution = {'Field Contact': 0, 'Offense Report': 0, 'Arrest': 1,
             'Referred for Prosecution': 1, 'Citation / Infraction': 0}

df['Stop Resolution']=df['Stop Resolution'].map(stop_resolution)
df['Stop Resolution'].value_counts(dropna=False)

0    30419
1    10685
Name: Stop Resolution, dtype: int64

### E. Transform Weapon Type Using a Dictionary and .map()

In [44]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad,Reported Year,Officer Age
0,Unknown,1,,1984,0,African American,Asian,0,2015-10-16T00:00:00,-,-,-,SOUTH PCT 1ST W - ROBERT,2015,26-35
1,Unknown,0,,1963,0,White,Unknown,2,2015-04-01T00:00:00,-,-,-,Unknown,2015,46-55
2,Unknown,0,,1985,0,Hispanic,Unknown,2,2015-05-25T00:00:00,-,-,-,WEST PCT 3RD W - MARY,2015,26-35
3,Unknown,0,,1979,0,White,Unknown,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45
4,Unknown,0,,1979,0,White,Unknown,2,2015-06-09T00:00:00,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45


In [45]:
# Now re-map Weapon Type feature.  First check the categories of Weapons
df['Weapon Type'].value_counts(dropna = False)

None                                 32565
-                                     6213
Lethal Cutting Instrument             1482
Knife/Cutting/Stabbing Instrument      308
Handgun                                262
Firearm Other                          100
Club, Blackjack, Brass Knuckles         49
Blunt Object/Striking Implement         37
Firearm                                 18
Firearm (unk type)                      15
Other Firearm                           13
Mace/Pepper Spray                       12
Club                                     9
Rifle                                    5
None/Not Applicable                      4
Taser/Stun Gun                           4
Shotgun                                  3
Automatic Handgun                        2
Brass Knuckles                           1
Fire/Incendiary Device                   1
Blackjack                                1
Name: Weapon Type, dtype: int64

In [46]:
weapon_type = {'None':'Unknown', 'None/Not Applicable':'Unknown', 'Fire/Incendiary Device':'Incendiary',
              'Lethal Cutting Instrument':'Lethal Blade', 'Knife/Cutting/Stabbing Instrument':'Lethal Blade',
              'Handgun':'Firearm', 'Firearm Other':'Firearm','Firearm':'Firearm', 'Firearm (unk type)':'Firearm',
              'Other Firearm':'Firearm', 'Rifle':'Firearm', 'Shotgun':'Firearm', 'Automatic Handgun':'Firearm',
              'Club, Blackjack, Brass Knuckles':'Blunt Force', 'Club':'Blunt Force', 
              'Brass Knuckles':'Blunt Force', 'Blackjack':'Blunt Force', 'Incendiary':'Incendiary',
              'Blunt Object/Striking Implement':'Blunt Force', '-':'Unknown', 'Unknown': 'Unknown',
              'Taser/Stun Gun':'Taser', 'Mace/Pepper Spray':'Spray', 'Blunt Force':'Blunt Force',
              "Taser":"Taser", "Spray":'Spray', 'Lethal Blade':'Lethal Blade' }

df['Weapon Type']=df['Weapon Type'].map(weapon_type)
df['Weapon Type'].value_counts(dropna=False)

Unknown         38782
Lethal Blade     1790
Firearm           418
Blunt Force        97
Spray              12
Taser               4
Incendiary          1
Name: Weapon Type, dtype: int64

### F. Transform the Date using to_datetime, .weekday, and .day

* Calculate the day of the week of the reported date
    - [x] Day of the week: 0 = Monday, 6 = Sunday
    <br><br>
    
* Calculate the first, middle, and last weeks of the month, because perhaps more crimes are committed and more arrests are made when the bills come due
    - [x] Time of month: 1 = first week, 2 = 2nd and 3rd weeks, 3 = last week of the month
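
As an alternative to writing out a 31-entry dictionary, the same time-of-month grouping could be sketched with `pd.cut`. This assumes the same day boundaries and 1/2/3 codes as the `month_map` dictionary used in this notebook (days 1-7, 8-22, 23-31); the sample dates are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series([
    "2015-10-16", "2015-04-01", "2015-05-25",
    "2015-06-09", "2015-06-07", "2015-06-23",
]))

weekday = dates.dt.weekday          # 0 = Monday, 6 = Sunday
day_of_month = dates.dt.day

# (0, 7] -> 1, (7, 22] -> 2, (22, 31] -> 3
time_of_month = pd.cut(day_of_month, bins=[0, 7, 22, 31], labels=[1, 2, 3]).astype(int)

print(pd.DataFrame({"date": dates, "weekday": weekday, "time_of_month": time_of_month}))
```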



In [47]:
df['Reported Date'].head()

0    2015-10-16T00:00:00
1    2015-04-01T00:00:00
2    2015-05-25T00:00:00
3    2015-06-09T00:00:00
4    2015-06-09T00:00:00
Name: Reported Date, dtype: object

In [48]:
# Transform the Reported date into a day of the week,  or the time of month 
# Day of the week: 0 = Monday, 6 = Sunday
# Time of month: 1 = First week, 2 = 2nd and 3rd weeks,  3 = last week of the month

df['Reported Date']=pd.to_datetime(df['Reported Date'])  # Processed earlier for Officer YOB calculation
df['Weekday']=df['Reported Date'].dt.weekday

df['Time of Month'] = df['Reported Date'].dt.day

month_map = {1:1, 2:1,3:1,4:1, 5:1, 6:1, 7:1,8:2, 9:2, 10:2, 11:2, 12:2, 13:2, 14:2, 15:2, 
                     16:2, 17:2, 18:2, 19:2, 20:2, 21:2, 22:2, 23:3, 24:3, 25:3, 26:3, 27:3, 28:3, 29:3, 30:3, 31:3}

df['Time of Month'] = df['Time of Month'].map(month_map)


In [49]:
df.isna().sum()

Subject Age Group           0
Stop Resolution             0
Weapon Type                 0
Officer YOB                 0
Officer Gender              0
Officer Race                0
Subject Perceived Race      0
Subject Perceived Gender    0
Reported Date               0
Initial Call Type           0
Final Call Type             0
Call Type                   0
Officer Squad               0
Reported Year               0
Officer Age                 0
Weekday                     0
Time of Month               0
dtype: int64

In [50]:
df.head()

Unnamed: 0,Subject Age Group,Stop Resolution,Weapon Type,Officer YOB,Officer Gender,Officer Race,Subject Perceived Race,Subject Perceived Gender,Reported Date,Initial Call Type,Final Call Type,Call Type,Officer Squad,Reported Year,Officer Age,Weekday,Time of Month
0,Unknown,1,Unknown,1984,0,African American,Asian,0,2015-10-16,-,-,-,SOUTH PCT 1ST W - ROBERT,2015,26-35,4,2
1,Unknown,0,Unknown,1963,0,White,Unknown,2,2015-04-01,-,-,-,Unknown,2015,46-55,2,1
2,Unknown,0,Unknown,1985,0,Hispanic,Unknown,2,2015-05-25,-,-,-,WEST PCT 3RD W - MARY,2015,26-35,0,3
3,Unknown,0,Unknown,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45,1,2
4,Unknown,0,Unknown,1979,0,White,Unknown,2,2015-06-09,-,-,-,NORTH PCT 2ND W - NORA,2015,36-45,1,2


### G. Use Officer Squad data to create the location information (Precinct or Officer Team) and the time of day of the arrest (Officer Watch)

* Use Pandas Regex .str.extract to get the name of the precinct and the Watch if available

* Analyse if some precincts / units never make arrests 

* The Officer Squad text is likely the more reliable estimate, assuming that it records the squad name / location and the watch that handled the report, rather than a specific person's schedule. <br><br>
* The Reported Date and Time are not the actual Terry stop time, since reports can come in up to 1 day or 10 hours later. <br><br>
* Features created from Officer Squad: <br><br>
    - [x] Precinct or Squad name following the Terry stop
    - [x] Watch: <br>
        0 = Unknown, if the watch is not normally recorded<br>
        1 = Watch 1 03:00 - 11:00<br>
        2 = Watch 2 11:00 - 19:00<br>
        3 = Watch 3 19:00 - 03:00<br>
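
Here is a small sketch of the extraction idea on a few squad strings that appear in `df.head()`. The stricter watch pattern (a digit 1-3 followed by ST/ND/RD and the "W" marker) is my own assumption about the text format, not something confirmed by the data documentation; it avoids picking up unrelated digits elsewhere in the squad name.

```python
import pandas as pd

squads = pd.Series([
    "SOUTH PCT 1ST W - ROBERT",
    "WEST PCT 3RD W - MARY",
    "NORTH PCT 2ND W - NORA",
    "Unknown",
])

# First word of the squad text, used as the precinct / unit name
precinct = squads.str.extract(r"^(\w+)", expand=False)

# Watch number only when it is written as 1ST W, 2ND W or 3RD W; otherwise "Unknown"
watch = squads.str.extract(r"\b([1-3])(?:ST|ND|RD) W\b", expand=False).fillna("Unknown")

print(pd.DataFrame({"squad": squads, "precinct": precinct, "watch": watch}))
```

Passing `expand=False` returns a Series rather than a one-column DataFrame, which makes the assignment to a single column a little more explicit.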
  

In [51]:
# Use Python Regex commands to clean up the Call Types and Officer Squad

In [52]:
df['Officer Squad'].value_counts()

df['Precinct'] = df['Officer Squad'].str.extract(r'(\w+)')

In [53]:
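# Extract the first digit in the squad text as the watch number (raw string so that \d is not treated as a string escape)
df['Watch'] = df['Officer Squad'].str.extract(pat = r'(\d)')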
df['Watch'].value_counts(dropna=False)

2      14196
3      11806
1       8423
NaN     6676
4          3
Name: Watch, dtype: int64

In [54]:
watch_map = { "1" : "First Watch", "2": "2nd Watch", "3":"3rd Watch", np.nan:'Unknown', "4":'Unknown'}

df['Watch'] = df['Watch'].map(watch_map)
df['Watch'].value_counts(dropna=False)


# Some Officer Squads do not record the Watch number
# Don't leave the NaNs in the Watch column; label them 'Unknown' (together with the stray watch "4")
# Watch labels: Unknown, First Watch, 2nd Watch, 3rd Watch

2nd Watch      14196
3rd Watch      11806
First Watch     8423
Unknown         6679
Name: Watch, dtype: int64

In [None]:
df.isna().sum()

In [None]:
# Identify the precincts that are not typically making arrests by comparing the number of arrests
# (Stop Resolution = 1) to the total number of Terry stops.


arrest_df = df.loc[df['Stop Resolution'] == 1]  # Dataframe only for those Terry stops that resulted in arrests

arrest_df['Precinct'].value_counts(), df['Precinct'].value_counts()  # compare the value_counts for both dataframes

In [None]:
# Calculate the arrest rate per precinct by dividing the arrest counts by the total number of Terry stops

arrest_percentage = arrest_df['Precinct'].value_counts() / df['Precinct'].value_counts()
print('The fraction of Terry stops resulting in arrest, by squad \n\n',arrest_percentage)

In [None]:
# Precincts that have not made an arrest to date show up as NaN here and are dropped later
# (arrest_percentage.fillna(0, inplace=True) would keep them instead)

display('The percentage of arrests based on terry stops, by squad',arrest_percentage *100)

In [None]:
# Create a dictionary for mapping the squads which have successful arrest.  Those officer squads which have
# reported Terry stops with no arrests will be dropped from the dataset
successful_arrest_map=arrest_percentage.to_dict()
# successful_arrest_map # Take a look at the dictionary

df['Precinct Success']=df['Precinct'].map(successful_arrest_map)
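
Because "Stop Resolution" is already encoded as 0/1, the per-precinct arrest rate can also be sketched as a group mean. The toy frame below is made up (including the "TRAINING" unit name); it shows that dividing two `value_counts()` results gives `NaN` for a unit with no arrests, which is what later allows `dropna()` to remove it, while the group mean gives an explicit 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "Precinct": ["NORTH", "NORTH", "SOUTH", "SOUTH", "SOUTH", "TRAINING"],
    "Stop Resolution": [1, 0, 1, 1, 0, 0],
})

# Approach used in this notebook: arrests per precinct / stops per precinct
rate_by_division = (toy.loc[toy["Stop Resolution"] == 1, "Precinct"].value_counts()
                    / toy["Precinct"].value_counts())

# Alternative: mean of the 0/1 target within each precinct
rate_by_mean = toy.groupby("Precinct")["Stop Resolution"].mean()

print(rate_by_division)
print(rate_by_mean)
```

With the group-mean version, units without arrests would need to be filtered explicitly (for example, by keeping only rates above 0) instead of relying on `NaN`.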

In [None]:
df.isna().sum()

In [None]:
# There are 36 units / precincts which do not have any arrests since 2015
# Likely these units are not expected to make arrests

#df.to_csv('terry_stops_cleanup3.csv') #save with all manipulations except for Call Types, without dropping

In [None]:
# Drop the Terry stops from units which do not routinely make arrests
df.dropna(inplace=True)  # Drop the squads with no arrests
df.reset_index(inplace=True)  # Reset the Index
df.drop(columns=['Call Type', 'Reported Date', 'Officer Squad'], inplace = True) # Drop Processed Columns
#df.to_csv('terry_stops_cleanup4.csv') #Save after dropping squads with no arrests and columns and reset index

In [None]:
df.shape

### H. Transform Initial or Final Call Types

In [None]:
final_map = clean_call_types(df,'Final Call Type', 'Final Call Re-map')

initial_map = clean_call_types(df, 'Initial Call Type', 'Initial Call Re-map')

In [None]:
final_map

In [None]:
initial_map

In [None]:
# Check to see if keys of the two dictionaries are the same
diff = set(final_map) - set(initial_map)  # the keys in final_map and not in initial_map
diff2 = set(initial_map) - set(final_map) # the keys that are in initial_map, and not in final_map

diff, diff2
# Expand the existing call map to include additional keys

In [None]:
#  This call dictionary was built on the final calls,  not the initial calls text.  So add the initial calls and input values

call_dictionary = {'unknown': 'unknown',
             'suspicious': 'suspicious',
             'assaults': 'assault',
             'disturbance': 'disturbance',
             'prowler': 'trespass',
             'dv': 'domestic violence',
             'warrant': 'warrant',
             'theft': 'theft',
             'narcotics': 'under influence',
             'robbery': 'theft',
             'burglary': 'theft',
             'traffic': 'traffic',
             'property': 'property damage',
             'weapon': 'weapon',
             'crisis': 'person in crisis',
             'automobiles': 'auto',
             'assist': 'assist others',
             'sex': 'vice',
             'mischief': 'mischief',
             'arson': 'arson',
             'fraud': 'fraud',
             'vice': 'vice',
             'drive': 'auto',
             'misc': 'misdemeanor',
             'premise': 'trespass',
             'alarm': 'suspicious',
             'intox': 'under influence',
             'rape': 'rape',
             'child': 'child',
             'trespass': 'trespass',
             'person': 'person in crisis',
             'homicide': 'homicide',
             'burg': 'theft',
             'kidnap': 'kidnap',
             'animal': 'animal',
             'hazards': 'hazard',
             'aslt': 'assault',
             'casualty': 'homicide',
             'fight': 'disturbance',
             'shoplift': 'theft',
             'auto': 'auto', 
             'haras': 'disturbance',
             'purse': 'theft',
             'weapn': 'weapon',
             'fireworks': 'arson',
             'follow': 'disturbance',
             'dist': 'disturbance',
             'haz': 'hazard',
             'nuisance': 'mischief',
             'threats': 'disturbance',
             'liquor': 'under influence',
             'mvc': 'auto',
             'shots': 'weapon',
             'harbor': 'auto',
             'down': 'homicide',
             'service': 'unknown',
             'hospital': 'unknown',
             'bomb': 'arson',
             'undercover': 'under influence',
             'burn': 'arson',
             'lewd': 'vice',
             'dui': 'under influence',
             'crowd': 'unknown',
             'order': 'assist',
             'escape': 'assist',
             'commercial': 'trespass',
             'noise': 'disturbance',
             'awol': 'kidnap',
              'bias': 'unknown',
              'carjacking': 'kidnap',
              'demonstrations':'disturbance',
              'directed':'unknown',
              'doa':'assist',
              'explosion':'arson',
              'foot': 'trespass',
              'found':'unknown',
              'gambling': 'vice',
              'help':'assist',
              'illegal':'assist',
              'injured':'assist',
              'juvenile':'child',
              'littering': 'nuisance',
              'missing': 'kidnap',
              'off':'suspicious',
              'open':'unknown',
              'overdose':'under influence',
              'panhandling':'disturbance',
              'parking':'disturbance',
              'parks':'disturbance',
              'peace':'disturbance',
              'pedestrian':'disturbance',
              'phone':'disturbance',
              'request':'assist',
              'sfd':'assist',
              'sick':'assist',
              'sleeper':'disturbance',
              'suicide':'assist'}

In [None]:
df['Final Call Re-map'] = df['Final Call Re-map'].map(call_dictionary)
df['Final Call Re-map'].value_counts(dropna=False)

In [None]:
df['Initial Call Re-map'] = df['Initial Call Re-map'].map(call_dictionary)
df['Initial Call Re-map'].value_counts(dropna=False)

In [None]:
df.isna().sum()

In [None]:
#Drop all NaNs
df.dropna(inplace=True)
df.reset_index(inplace=True)
df.to_csv('terry_stops_cleanup4.csv')


In [None]:
df.shape

In [None]:
df.drop(columns = ['Initial Call Type', 'Final Call Type', 'Precinct Success', 'Officer YOB',
                  'Reported Year', 'level_0', 'index'], inplace=True)

In [None]:
cat_df = df.copy()

df.to_csv('terry_stops_categorical.csv')
cat_df.dtypes

In [None]:
df['Precinct'].value_counts(dropna=False)

## 4. Vanilla Model

    


### Initial Model: 1-hot encoded, XGBoost with Initial Call data


In [None]:
initial_call_df_to_split = df.drop(columns = ['Final Call Re-map','Stop Resolution'])

In [None]:
category_cols = initial_call_df_to_split.columns

target_col = df['Stop Resolution']

In [None]:

from sklearn.model_selection import train_test_split

X = pd.get_dummies(initial_call_df_to_split, drop_first=True)
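
For reference, a minimal sketch of what `pd.get_dummies` does with and without `drop_first=True`, using a made-up two-column frame. Text columns are expanded into indicator columns, while numeric columns (such as an integer-coded weekday) pass through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "Watch": ["First Watch", "2nd Watch", "3rd Watch", "Unknown"],
    "Weekday": [0, 1, 2, 3],
})

full = pd.get_dummies(toy)                      # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (in sorted order) is dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping the first level avoids a redundant column per feature, which matters more for linear models than for tree-based models like the ones used here.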


In [None]:
X.dtypes

In [None]:
X

# Encode every column as integer category codes so that a correlation matrix can be computed
cat_cols = df.columns
for header in cat_cols:
    df[header] = df[header].astype('category').cat.codes

sns.axes_style("white")

pearson = df.corr(method = 'pearson')

sns.set(rc={'figure.figsize':(20,12)})

# Generate a mask for the upper triangle
mask = np.zeros_like(pearson)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(data=pearson, mask=mask, cmap="YlGnBu", 
                 annot=True, square=True, cbar_kws={'shrink': 0.5})

# Save the correlations information (after the heatmap has been drawn)
plt.savefig("Correlation.png")
plt.savefig("Correlation 2.png", transparent = True)

### Check the Correlation Matrix for Multi-Collinear features

* **Conclusions**

    - [x] The data is categorical, so strong linear correlations are perhaps not to be expected.
    - [x] "Initial Call Type" and "Final Call Type" are, unsurprisingly, correlated with each other, so only one of them should be kept in the model. 
    

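# Superseded: df_to_split is not defined at this point; X and y are built in the surrounding cells
#y = df_to_split['Stop Resolution']
#X = df_to_split.drop('Stop Resolution',axis=1)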

In [None]:
#from sklearn.model_selection import train_test_split

#X = pd.get_dummies(df_to_split, drop_first=True)

y = df['Stop Resolution']
y.dtypes

In [None]:


## Train test split
X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)#,stratify=y)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))
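
The `stratify=y` argument is commented out in the split above. As a hedged illustration of what it would do, the sketch below uses a synthetic target with 25% positives (the data is made up) and shows that a stratified split keeps the same class proportion in both the training and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 25% of them in the positive ("arrest") class
toy_X = pd.DataFrame({"feature": range(200)})
toy_y = pd.Series([1] * 50 + [0] * 150, name="Stop Resolution")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# Share of positives in each split
train_share = y_tr.mean()
test_share = y_te.mean()
print(round(train_share, 3), round(test_share, 3))
```

Without stratification the shares can drift apart by chance, which makes train and test metrics slightly harder to compare on an imbalanced target.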


In [None]:
xgb_rf = XGBRFClassifier()
xgb_rf.fit(X_train, y_train)
print('Training Accuracy score: ' ,round(xgb_rf.score(X_train,y_train),2))
print('Test Accuracy score: ',round(xgb_rf.score(X_test,y_test),2))
y_hat_test = xgb_rf.predict(X_test)

In [None]:
metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, xgb_rf, expt_name='Test A')

In [None]:
metrics_summary = pd.DataFrame()

metrics_summary = metrics_df.copy() # Initialize the Metric Summary
#metrics_summery = pd.merge(metrics_summary, metrics_df, how = 'left', on = metrics_summary['key'])
metrics_summary

In [None]:
comparison = pd.DataFrame()
comparison = comps.copy() # Initialize the comparison table
comparison


## 5. Vanilla Model Results & Experimental Plan

* Results: 
    - [x] "Initial Call Type" is the most important feature, with "Weapon" and "Officer Age" as the 2nd and 3rd most important features, respectively. <br>
    - [x] Training accuracy of 0.76, and testing accuracy of 0.74 <br>
    - [x] However, the confusion matrix shows that this accuracy comes mainly from the "non-arrests" being classified better than the arrests. The true negatives were well predicted (96%), while the true positives were poorly predicted (14%), meaning that 86% of the actual arrests were missed (false negatives).
    - [x] This seems to make sense given the class imbalance (only 25% of the data were arrests) <br>
    - [x] The AUC was well above random chance but still low (< 0.8) <br> <br>
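
To show how the rates above relate to each other, here is a small sketch with hypothetical counts, chosen only so that the rates mirror the ones reported (they are not the notebook's actual confusion matrix). Normalising each row by the number of actual cases gives the true negative rate and the true positive rate (recall); a 14% recall on arrests corresponds to 86% of actual arrests being predicted as non-arrests.

```python
import numpy as np

# Hypothetical counts; rows = actual class, columns = predicted class
#                 pred 0  pred 1
cm = np.array([[720,  30],    # actual non-arrest: TN, FP
               [215,  35]])   # actual arrest:     FN, TP

# Divide each row by its total so the rates are per actual class
row_rates = cm / cm.sum(axis=1, keepdims=True)
tnr, fpr = row_rates[0]
fnr, tpr = row_rates[1]

accuracy = np.trace(cm) / cm.sum()

print(f"TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} TPR={tpr:.2f} accuracy={accuracy:.3f}")
```

With roughly a quarter of the rows in the positive class, a model can reach about 75% accuracy while still missing most arrests, which is why accuracy alone is a weak metric here.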



### B - XGB + Final Call Type
 * **The Next Steps will be a set of experiments to look how the models can improve based on:**
    - [1] Feature Selection:  Initial Call Type Versus Final Call Type 
    - [2] Model type:  XGBoost-RF  vs CatBoost
    - [3] Balancing the dataset from best model of [1] and [2] using either SMOTE or SMOTENC. SmoteNC is an algorithm specifically developed for categorical and continuous variables. <br><br>
    
* **The Next Experiment will be (Bold Type):** <br><br>


* The data set is one-hot encoded, prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - **B = XGB + Final Call Type**
    - C = Catboost + Final Call Type
    - D = Catboost + Initial Call Type
    - E = SMOTE + Catboost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + Catboost + Final Call Type
    - I = CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type
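
For readers unfamiliar with SMOTE, the sketch below illustrates its core idea in plain NumPy: each synthetic minority row is placed on the line segment between a minority row and one of its nearest minority neighbours. This is a simplified teaching sketch with a function name of my own (`smote_sketch`), not the imbalanced-learn implementation used in the experiments.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a sampled
    minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest rows

    base = rng.integers(0, len(X_min), size=n_new)              # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with two numeric features
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_sketch(minority, n_new=10, k=3)
print(synthetic.shape)
```

On one-hot encoded columns this interpolation produces fractional values between 0 and 1, which is the motivation for SMOTENC. Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class balance.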

 
 

In [None]:
#df = backup_df.copy()
#df

In [None]:
# Change input to drop Initial Call Type and keep Final Call Type
final_call_df_to_split = df.drop(columns = ['Initial Call Re-map','Stop Resolution'])
#category_cols = df_to_split.columns
#target_col = ['Stop Resolution']

X = pd.get_dummies(final_call_df_to_split,)# drop_first=True)
# Convert catogories to cat.codes

#for header in category_cols:#    df_to_split[header] = df[header].astype('category').cat.codes
#df_to_split.head()

In [None]:
#y = df['Stop Resolution']


X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)#,stratify=y)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))

xgb_rf = XGBRFClassifier()
xgb_rf.fit(X_train, y_train)
print('Training Accuracy score: ' ,round(xgb_rf.score(X_train,y_train),2))
print('Test Accuracy score: ',round(xgb_rf.score(X_test,y_test),2))

y_hat_test = xgb_rf.predict(X_test)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, xgb_rf, expt_name='Test B')

## 6. CatBoost with Final Call Type


In [None]:
#metrics_summary = pd.DataFrame()
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary
#test = pd.concat([metrics_summary, metrics_df], axis = 1)
#test

In [None]:
#test = pd.merge(comparison,comps, left_index = True, right_index=True)
# test
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

### C - Catboost + Final Call Type

In [None]:
#!pip install -U catboost
from catboost import CatBoostClassifier

In [None]:
clf = CatBoostClassifier()
clf.fit(X_train,y_train,logging_level='Silent')
print('Training score: ' ,round(clf.score(X_train,y_train),2))
print('Test score: ',round(clf.score(X_test,y_test),2))

y_hat_test = clf.predict(X_test)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, clf, expt_name='Test C')

metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

## 7. Catboost with Initial Call Type


### D - Catboost + Initial Call Type

In [None]:
# Same feature set as Test A: keep Initial Call Type by dropping the Final Call re-map
initial_call_df_to_split = df.drop(columns=['Final Call Re-map', 'Stop Resolution'])

X = pd.get_dummies(initial_call_df_to_split)  # drop_first left at its default (False)

# y is re-used from the earlier experiments
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3,
                                                    random_state=42)  # stratify=y not used

In [None]:
clf = CatBoostClassifier()
clf.fit(X_train,y_train,logging_level='Silent')
print('Training score: ' ,round(clf.score(X_train,y_train),2))
print('Test score: ',round(clf.score(X_test,y_test),2))

y_hat_test = clf.predict(X_test)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, clf, expt_name='Test D')



In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

## 8. SMOTE + Best of Others



### E - SMOTE + Catboost + Final Call
 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is an algorithm developed specifically for datasets that mix categorical and continuous variables. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - **E = SMOTE + CatBoost + Final Call Type**
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + CatBoost + Final Call Type
    - I = SMOTE + CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type
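
To make the balancing step a little more concrete, here is a minimal NumPy sketch of the core idea behind SMOTE: each synthetic row is placed on the line segment between a minority-class row and one of its nearest minority neighbours. The function name `smote_like_samples` and the toy points are hypothetical; this is not the imbalanced-learn implementation, which handles many more details.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and their nearest minority neighbours (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority rows.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

minority_rows = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(minority_rows, n_new=6)
print(new_rows.shape)
```

Because the new rows are interpolated, one-hot columns can end up with values strictly between 0 and 1, which is one motivation for the categorical-aware SMOTENC variant.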

 
 

In [None]:
#!pip install -U imbalanced-learn
# Use the final_call dataframe calculated earlier

In [None]:
# Shouldn't need the y, because we can re-use it from above.  But just in case
#y = df['Stop Resolution']
#y.dtypes

X = pd.get_dummies(final_call_df_to_split, )#drop_first=True)
    

X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)#stratify=y)

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

X_train, y_train = smote.fit_resample(X_train, y_train)  # oversample the training split only
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))

In [None]:
clf = CatBoostClassifier()
clf.fit(X_train,y_train,logging_level='Silent')
print('Training score: ' ,round(clf.score(X_train,y_train),2))
print('Test score: ',round(clf.score(X_test,y_test),2))

y_hat_test = clf.predict(X_test)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, clf, expt_name='Test E')


In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

### F. SMOTE + XGB + Final Call
 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is an algorithm developed specifically for datasets that mix categorical and continuous variables. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - **F = SMOTE + XGB + Final Call Type**
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + CatBoost + Final Call Type
    - I = SMOTE + CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type

 
 

In [None]:
# The SMOTE-resampled data on XGB-RF, just for fun
xgb_rf = XGBRFClassifier()
xgb_rf.fit(X_train, y_train)
print('Training score: ' ,round(xgb_rf.score(X_train,y_train),2))
print('Test score: ',round(xgb_rf.score(X_test,y_test),2))

y_hat_test = xgb_rf.predict(X_test)

#evaluate_model(y_test,y_hat_test, X_test, xgb_rf)
metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, xgb_rf, expt_name='Test F')

In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

### G. SMOTE + SVM + Final Call Type
 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is an algorithm developed specifically for datasets that mix categorical and continuous variables. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - **G = SMOTE + SVM + Final Call Type**
    - H = SMOTENC + CatBoost + Final Call Type
    - I = SMOTE + CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type

 
 

In [None]:
# Try a Support Vector Machine,  for the heck of it


In [None]:
from sklearn.svm import SVC,LinearSVC,NuSVC
clf = SVC()
clf.fit(X_train,y_train)
y_hat_test = clf.predict(X_test)

# Evaluate the fitted SVC (clf), the model that produced y_hat_test
metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, clf, expt_name='Test G')

In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

### H. SMOTENC + CatBoost + Final Call Type

* **SMOTENC is a version of SMOTE for datasets that mix continuous and categorical features.**<br><br>

 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - **H = SMOTENC + CatBoost + Final Call Type**
    - I = SMOTE + CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type
    
    
   **Experiments H and J were abandoned because, even using the workaround described on GitHub, SMOTENC does not work when every feature is categorical.**
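
SMOTENC needs to know which column positions are categorical, and a hard-coded index list has to match the column order exactly. One possible way to derive the positions from the dtypes is sketched below on a made-up frame (the values are invented for illustration). The resulting list could then be passed as `categorical_features`. Note that SMOTENC still requires at least one non-categorical column.

```python
import pandas as pd

# Hypothetical toy frame: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Subject Age Group": pd.Categorical(["18-25", "26-35", "18-25"]),
    "Watch": [1, 2, 3],
    "Precinct": pd.Categorical(["North", "West", "North"]),
})

# Positions of the columns whose dtype is 'category'.
categorical_idx = [
    i for i, dtype in enumerate(toy.dtypes)
    if isinstance(dtype, pd.CategoricalDtype)
]
print(categorical_idx)  # [0, 2]
```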

 



In [None]:
cat_df_to_split = cat_df.drop(columns = ['Initial Call Re-map', 'Stop Resolution'])
category_cols = cat_df_to_split.columns
category_cols

In [None]:
X = cat_df_to_split
y = cat_df['Stop Resolution']

In [None]:
X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)#stratify=y)

In [None]:
# Cast every column to 'category' except the five listed below,
# which keep their original dtypes
keep_cols = ['Officer Gender', 'Subject Perceived Gender', 'Weekday', 'Time of Month', 'Watch']
X_train_cat = X_train.drop(columns=keep_cols).astype('category')
for col in keep_cols:
    X_train_cat[col] = X_train[col]
X_train_cat.dtypes

In [None]:
from imblearn.over_sampling import SMOTENC

In [None]:

# Now modify the training data by oversampling (SMOTENC)

# Try Categorical SMOTE
#

smote_nc = SMOTENC(categorical_features = [0,1,2,3,4,5,6,7,8])

X_train, y_train = smote_nc.fit_resample(X_train_cat, y_train)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))


In [None]:

clf = CatBoostClassifier()
clf.fit(X_train,y_train,cat_features = [0,1,2,3,4,5,6,7,8], logging_level='Silent')
print('Training score: ' ,round(clf.score(X_train,y_train),2))
print('Test score: ',round(clf.score(X_test,y_test),2))

y_hat_test = clf.predict(X_test)

#evaluate_model(y_test,y_hat_test,X_test,clf)
#metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, xgb_rf, expt_name='Test H')

In [None]:
display(y_test, y_hat_test, X_test)

In [None]:
importance = pd.concat([X, y], axis=1).corr().loc['Stop Resolution']
importance.plot(kind='barh', figsize=(10,24));

In [None]:
y

In [None]:
pearson = pd.concat([y,X], axis=1)#.corr(method = 'pearson')
pearson

In [None]:
# Check the correlation matrix for correlated variables and plot it
# (runs on the data from the most recent kernel run)

pearson = pd.concat([y, X], axis=1).corr(method='pearson')

sns.set(rc={'figure.figsize': (20, 12)})

# Generate a mask for the upper triangle
mask = np.zeros(pearson.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(data=pearson, mask=mask, cmap="YlGnBu",
                 annot=False, square=True, cbar_kws={'shrink': 0.5})
  

## 9. CBC-search + Final Call Type with or without SMOTENC


### I. SMOTE + CBC-search + Final Call Type
 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is an algorithm developed specifically for datasets that mix categorical and continuous variables. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + CatBoost + Final Call Type
    - **I = SMOTE + CBC-search + Final Call Type**
    - J = SMOTENC + CBC-search + Final Call Type

 
 

In [None]:
#df_to_split5 = pd.DataFrame()
#no need to re-calculate
#final_call_df_to_split = df.drop(columns = ['Initial Call Re-map','Stop Resolution'])
#X = pd.get_dummies(final_call_df_to_split, drop_first=True)

In [None]:
final_call_df_to_split.dtypes

In [None]:
X = pd.get_dummies(final_call_df_to_split, drop_first=True)
category_cols = X.columns
category_cols

# not needed y = df['Stop Resolution']

In [None]:
# Fix random_state so the split is reproducible, as in the other experiments
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

from imblearn.over_sampling import SMOTE
smote = SMOTE()

X_train, y_train = smote.fit_resample(X_train, y_train)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))

In [None]:
from catboost import Pool, CatBoostClassifier

category_cols = X.columns
train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)

In [None]:
cb_base = CatBoostClassifier(iterations=500, depth=12,
                            boosting_type='Ordered',
                            learning_rate=0.03,
                            thread_count=-1,
                            eval_metric='AUC',
                            silent=True,
                            allow_const_label=True)#,
                           #task_type='GPU')

In [None]:
cb_base.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=10)
cb_base.best_score_
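
The `early_stopping_rounds=10` argument above asks training to stop once the evaluation metric has gone a set number of rounds without improving. The plain-Python sketch below illustrates that patience rule on a made-up list of AUC values; the function name is hypothetical and CatBoost's own overfitting detector may differ in its details.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, stopped_index) for a higher-is-better metric.
    Training stops once `patience` rounds have passed without beating the best score."""
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(scores) - 1

# Made-up AUC values per boosting round.
auc_by_round = [0.70, 0.74, 0.77, 0.76, 0.765, 0.75, 0.78]
best, stopped = best_iteration_with_patience(auc_by_round, patience=3)
print(best, stopped)  # 2 5
```

In this toy run the later value of 0.78 is never reached, which is the usual trade-off of a short patience window. Since the test pool doubles as the evaluation set here, the reported test score may be somewhat optimistic; a separate validation split would be a cleaner design.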

In [None]:
print('Training score: ' ,round(cb_base.score(X_train,y_train),2))
print('Test score: ',round(cb_base.score(X_test,y_test),2))

y_hat_test = cb_base.predict(X_test)

#evaluate_model(y_test,y_hat_test,X_test,cb_base)
metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, cb_base, expt_name='Test I')

In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

In [None]:
importance = pd.DataFrame(cb_base.feature_importances_, index = X.columns)
importance.rename(columns = {0: "Importance"}, inplace = True)
importance


In [None]:
importance1 = pd.Series(cb_base.feature_importances_, index = X.columns)
importance1.sort_values(ascending = False, inplace = True)
top_features1 = importance1.head(10).index
importance2 = pd.concat([X,y], axis=1)[[*top_features1, 'Stop Resolution']].corr().loc['Stop Resolution']
importance2.drop('Stop Resolution',inplace = True)
importance2
importance2.plot(kind = 'barh',);
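
The cell above ranks features by importance and then looks at how the top ones correlate with the target. A self-contained sketch of the same pattern on synthetic data is shown below. The column names, the importance values and the target are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins: 'signal' drives the target, 'noise' does not.
X_toy = pd.DataFrame({
    "signal": rng.integers(0, 2, n),
    "noise": rng.integers(0, 2, n),
})
y_toy = X_toy["signal"].rename("target")

# Hypothetical importances, as a fitted model might report them.
importances = pd.Series([90.0, 10.0], index=X_toy.columns)
top_features = importances.sort_values(ascending=False).head(1).index

# Correlation of each top feature with the target.
corr_with_target = (
    pd.concat([X_toy[top_features], y_toy], axis=1)
    .corr()
    .loc["target"]
    .drop("target")
)
print(corr_with_target)
```

Importance says how much the model relies on a feature, while the sign of the correlation hints at the direction of the relationship, so the two views complement each other.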


### J. SMOTENC  + CBC-search + Final Call type

 * **The next steps are a set of experiments to see how the models can improve based on:**
    - [1] Feature selection: Initial Call Type versus Final Call Type
    - [2] Model type: XGBoost-RF vs CatBoost
    - [3] Balancing the dataset for the best model from [1] and [2] using either SMOTE or SMOTENC. SMOTENC is an algorithm developed specifically for datasets that mix categorical and continuous variables. <br><br>
    
* **The next experiment is shown in bold:** <br><br>


* The data set is one-hot encoded prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + CatBoost + Final Call Type
    - I = SMOTE + CBC-search + Final Call Type
    - **J = SMOTENC + CBC-search + Final Call Type**

 
 


In [None]:
# Try the same approach, with SMOTENC first

from catboost import Pool, CatBoostClassifier

In [None]:
cat_df_to_split = cat_df.drop(columns = ['Initial Call Re-map', 'Stop Resolution'])
category_cols = cat_df_to_split.columns
category_cols

y = df['Stop Resolution']



In [None]:
X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)# stratify=y)#



### SMOTENC Bug - doesn't work with all categorical data

**Below is a "workaround" described on GitHub, which also doesn't work (see the link):**<br><br>

    X["temp"] = 0
    n_features = X.shape[1] - 1
    n_features, X

    indices = range(n_features)
    print(indices)
    smote_nc = SMOTENC(categorical_features = indices)
    X_resampled, y_resampled = smote_nc.fit_resample(X_train, y_train)

    X_resampled.drop(columns = 'temp', inplace = True)

    #display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))
    X_resampled

#### According to an issue on GitHub, SMOTENC cannot work if all columns are categorical
#### https://github.com/scikit-learn-contrib/imbalanced-learn/issues/562
#### For now, use one-hot encoding and SMOTE.  Forget about SMOTENC
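
When every column is categorical, one simple fallback that needs no interpolation is plain random oversampling, which duplicates minority-class rows until the classes are the same size. Note that this is a different technique from SMOTE. The sketch below uses a hypothetical helper `random_oversample` on made-up data. Newer imbalanced-learn releases also provide a `SMOTEN` class aimed at all-categorical data, which may be worth checking.

```python
import pandas as pd

def random_oversample(X, y, random_state=42):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest one."""
    counts = y.value_counts()
    target_size = counts.max()
    pieces = []
    for label in counts.index:
        rows = X[y == label]
        n_extra = target_size - len(rows)
        if n_extra > 0:
            # Sample with replacement to fill the gap.
            extra = rows.sample(n=n_extra, replace=True, random_state=random_state)
            rows = pd.concat([rows, extra])
        pieces.append(rows)
    X_bal = pd.concat(pieces)
    y_bal = y.loc[X_bal.index]  # labels follow the (possibly repeated) row index
    return X_bal.reset_index(drop=True), y_bal.reset_index(drop=True)

# Made-up all-categorical data with a 4:2 class imbalance.
X_toy = pd.DataFrame({"Precinct": ["N", "W", "N", "E", "S", "N"]})
y_toy = pd.Series([0, 0, 0, 0, 1, 1])

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_bal.value_counts())
```

As with SMOTE, this should only be applied to the training split so that the test set keeps its original class balance.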

In [None]:
#final_call_df_to_split = df.drop(columns = ['Initial Call Re-map', 'Stop Resolution'])
#X = pd.get_dummies(final_call_df_to_split, drop_first=True)

#y = df['Stop Resolution']


#X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
#                                                    random_state=42,)#stratify=y)#

#smote_nc = SMOTENC(categorical_features = [0,1,3,4,6,9,11])

X_train, y_train = smote_nc.fit_resample(X_train, y_train)
display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))



In [None]:
X_train.columns

In [None]:
category_cols = X.columns  # cat_features must name columns that exist in X_train

train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)
cb_base = CatBoostClassifier(iterations=500, depth=12,
                            boosting_type='Ordered',
                            learning_rate=0.03,
                            thread_count=-1,
                            eval_metric='AUC',
                            silent=True,
                            allow_const_label=True)#,
                           #task_type='GPU')

cb_base.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=10)
cb_base.best_score_

In [None]:
print('Training score: ' ,round(cb_base.score(X_train,y_train),2))
print('Test score: ',round(cb_base.score(X_test,y_test),2))

y_hat_test = cb_base.predict(X_test)

#evaluate_model(y_test,y_hat_test,X_test,cb_base)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, cb_base, expt_name='Test J')

In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary_t = metrics_summary.transpose()
metrics_summary_t

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

In [None]:
#display(cb_base.feature_importances_)
importance #= pd.drop(rows = 'Stop Resolution')
#importance = pd.Series(cb_base.feature_importances_, index = X.columns)
#importance.sort_values(ascending = True, inplace = True)
#top_features = importance.head(10).index
#importance = pd.concat([X,y], axis=1)[[*top_features, 'Stop Resolution']].corr().loc['Stop Resolution']
#importance.plot(kind='barh');

## 10.  Calculate the correlation between the top features found by CatBoost


### Use SMOTENC with CatBoost Built-in Features

 - [x] One-hot encoding built-in (pass in the categoricals)
 - [x] Grid search

In [None]:
#df_to_split = df.drop(columns = ['Initial Call Re-map', 'Stop Resolution'])
#category_cols = df_to_split.columns


# Convert catogories to cat.codes
#X = pd.get_dummies(final_call_df_to_split, drop_first=True)

#y = df['Stop Resolution']

#for header in category_cols:
#    df_to_split[header] = df[header].astype('category').cat.codes
    

X_train, X_test, y_train,y_test  = train_test_split(X,y,test_size=.3,
                                                    random_state=42,)#stratify=y)#

smote_nc = SMOTENC(categorical_features = [0,11])
X_train, y_train = smote_nc.fit_resample(X_train, y_train)

display(y_train.value_counts(normalize=False),y_test.value_counts(normalize=False))


In [None]:
category_cols = X.columns
train_pool =  Pool(data=X_train, label=y_train, cat_features=category_cols)
test_pool = Pool(data=X_test, label=y_test,  cat_features=category_cols)
cb_base = CatBoostClassifier(iterations=500, depth=12,
                            boosting_type='Ordered',
                            learning_rate=0.03,
                            thread_count=-1,
                            eval_metric='AUC',
                            silent=True,
                            allow_const_label=True)#,
                           #task_type='GPU')

cb_base.fit(train_pool,eval_set=test_pool, plot=True, early_stopping_rounds=10)
cb_base.best_score_

In [None]:
print('Training score: ' ,round(cb_base.score(X_train,y_train),2))
print('Test score: ',round(cb_base.score(X_test,y_test),2))

y_hat_test = cb_base.predict(X_test)

#evaluate_model(y_test,y_hat_test,X_test,cb_base)

metrics_df, comps = evaluate_model(y_test,y_hat_test, X_test, cb_base, expt_name='Test J')

In [None]:
metrics_summary = pd.concat([metrics_summary, metrics_df], axis = 1)
metrics_summary_t = metrics_summary.transpose()
metrics_summary_t

In [None]:
comparison = pd.merge(comparison,comps, left_index = True, right_index=True)
comparison

## Conclusions on ML Models


### The following experiments were performed:

* The data set is one-hot encoded, prior to modelling <br><br>


    - A = Vanilla Model = XGB + Initial Call Type
    - B = XGB + Final Call Type
    - C = CatBoost + Final Call Type
    - D = CatBoost + Initial Call Type
    - E = SMOTE + CatBoost + Final Call Type
    - F = SMOTE + XGB + Final Call Type
    - G = SMOTE + SVM + Final Call Type
    - H = SMOTENC + CatBoost + Final Call Type
    - I = SMOTE + CBC-search + Final Call Type
    - J = SMOTENC + CBC-search + Final Call Type

 
 

    
 - [x] Which features are best: There was a clear improvement in prediction accuracy (4-8%) when "Final Call Type" was used instead of "Initial Call Type". One likely reason is the difficulty of manually encoding the rather cryptic call text that comes in with initial 911 calls: for about 500 samples I found it hard to determine the correct code assignment. In contrast, the final call type was better behaved, probably because officers re-code the call fairly uniformly.
 
 - [x] CatBoost performed better than XGB-RF and SVM (1-3% improvement), presumably because it is a better classifier for all-categorical data, which I had.
 
 - [x] Correcting the class imbalance using SMOTE and SMOTENC (for categoricals) reduced the false negatives at similar accuracy (1-2% improvement).
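
To illustrate why false negatives are tracked alongside accuracy on an imbalanced target, the sketch below compares two made-up sets of predictions with the same accuracy but different false negative rates. The labels and the helper `binary_rates` are hypothetical and use only the standard library.

```python
def binary_rates(y_true, y_pred):
    """Accuracy and false negative rate for 0/1 labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_negative_rate": fn / (tp + fn),
    }

# Made-up labels: 2 positives out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                          # never predicts the positive class
one_hit_one_miss = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # finds one positive, one false alarm

print(binary_rates(y_true, always_negative))
print(binary_rates(y_true, one_hit_one_miss))
```

Both prediction sets score 0.8 accuracy, yet the second misses only half of the positives instead of all of them.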

## Appendix

### Other Analyses

In [None]:
# Key concept: in training we know that some precincts are more successful than others
# at getting to an arrest. Instead of a one-hot encoded value, use the fraction of
# stops that ended in an arrest as the value for the precinct.

# Calculate how successful particular precincts were at making arrests

arrest_percentage = arrest_df['Precinct'].value_counts() / df['Precinct'].value_counts()
print('Fraction of Terry stops that ended in an arrest, by precinct:\n\n', arrest_percentage)

In [None]:
# Create a dictionary mapping each precinct to its fraction of successful arrests.
# Precincts that reported Terry stops with no arrests will be dropped from the dataset.
successful_arrest_map = arrest_percentage.to_dict()

# successful_arrest_map  # Take a look at the dictionary

df['Precinct Success'] = df['Precinct'].map(successful_arrest_map)  # map the dictionary onto the dataframe as a new column
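
The mapping above is a form of target (mean) encoding. A minimal sketch on made-up data is given below. It assumes the rates are learned from the training rows only and that precincts unseen in training fall back to the overall rate; both are design choices for illustration rather than what the cell above does, since that cell computes the rates on the full dataframe.

```python
import pandas as pd

# Made-up training and test rows; 'Arrest' is 1 when the stop ended in an arrest.
train = pd.DataFrame({
    "Precinct": ["North", "North", "West", "West", "West", "East"],
    "Arrest":   [1, 0, 0, 0, 1, 0],
})
test = pd.DataFrame({"Precinct": ["North", "South"]})

# Arrest rate per precinct, learned from the training rows only.
rate_by_precinct = train.groupby("Precinct")["Arrest"].mean()
overall_rate = train["Arrest"].mean()

# Map the rates onto both sets; unseen precincts fall back to the overall rate.
train["Precinct Success"] = train["Precinct"].map(rate_by_precinct)
test["Precinct Success"] = test["Precinct"].map(rate_by_precinct).fillna(overall_rate)
print(test)
```

Learning the rates on the training split alone keeps information about the test labels out of the features.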

In [None]:
df.head()

In [None]:
''' Perform the same analysis to see which call types lead to more arrests

arrest_df = df.loc[df['Stop Resolution'] == 'Arrest'] # Re-Create the arrest_df in case there were removals earlier
arrest_df['Final Call Type'].value_counts(),  df['Final Call Type'].value_counts()

arrest_categories = arrest_df['Final Call Type'].value_counts() / df['Final Call Type'].value_counts() 
arrest_map = arrest_categories.to_dict()
arrest_map # look at the dictionary '''

#df['Final Call Success'] = df['Final Call Type'].map(arrest_map)

In [None]:
'''# Create dataframe eval_df and metrics_df to hold the values from each model
eval_df = pd.DataFrame(
          [["Model", "XGB", "XGB", "CATBOOST", "CATBOOST", "SMOTE + CB", "SMOTE + XGB", "SMOTE+1hot+CB", 
          "SMOTE+CBC", "SMOTE+1hot+CBC"], ['Feature', 'Initial Call Type', 'Final Call Type', 
         'Final Call Type', 'Initial Call Type', 'Final Call Type', 'Final Call Type', 'Final Call Type', 
         'Final Call Type', "Final Call Type"], ['Encoding', 'Categorical', 'Categorical', 'Categorical', 
         'Categorical','Categorical', 'Categorical','1 Hot', 'Categorical', '1 Hot'], ['SMOTE', 'No', 'No', 'No', 'No', 
         'Yes', 'Yes', 'Yes', 'No', 'Yes']], index = [1,2,3,4], columns=['Description','Test A', 
         'Test B','Test C', 'Test D', 'Test E', 'Test F', 'Test G', 'Test H', 'Test I',])

eval_df.head()'''

## Key references:

* https://assets.documentcloud.org/documents/6136893/SPDs-2019-Annual-Report-on-Stops-and-Detentions.pdf <br><br>
* https://www.seattletimes.com/seattle-news/crime/federal-monitor-finds-seattle-police-are-conducting-proper-stops-and-frisks/ <br><br>
* https://catboost.ai/docs/concepts/python-reference_catboost_grid_search.html <br><br>
* https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db