In [15]:
import pandas as pd
import numpy as np

This data contains arrests made by the Charleston PD between January 1, 2018 and December 31, 2018. The charge listed is the first charge on the booking sheet and does not reflect any additional charges that may appear on a given arrest report.

In [16]:
data = pd.read_csv("charleston_arrests.csv")

The dataset originally contained the following attributes:
- FID, ArrestID, Name_ID = unique identifiers of the case and/or the person arrested in the case.
- streetnbr, street, city, state, zip, X, Y, x2 and y2 = location-specific attributes tied to the geographic coordinates or address of the arrest.
- Arrest_Dat, ArrestTime = time-specific attributes, giving the date and time of the arrest, respectively.
- age, race, sex = demographic attributes of the suspect.
- Beat = the city is divided into divisions, which are further divided into patrol areas (beats).
- Team = the police team that made the arrest. Different police teams are assigned to different beats.
- recblock
- charge = the actual crime the perpetrator is charged with.

I dropped the following columns because they were unique identifiers, too specific, or redundant in the information they provide.
    

In [17]:
data = data.drop(['FID', 'ArrestID', 'streetnbr', 'street', 'city', 'state', 'Name_ID', 'X', 'Y', 'x2', 'y2', 'recblock'], axis = 1)
data

Unnamed: 0,Arrest_Dat,ArrestTime,zip,age,race,sex,Team,Beat,charge
0,02/16/2018,35,29403,25,W,M,T1,122,OPEN CONTAINER ALCOHOL
1,05/07/2018,830,29401,20,W,M,T2,224,PROHIBITED AREAS FOR SOLICITORS
2,03/26/2018,2223,29407,18,B,M,T4,429,SIMPLE POSS MARIJUANA 1ST
3,06/14/2018,118,29401,35,W,M,T2,222,TRESPASSING AFTER WARNING
4,01/01/2018,210,29403,21,W,M,T1,122,PUBLIC INTOXICATION
...,...,...,...,...,...,...,...,...,...
4867,12/31/2018,2340,29414,34,W,F,T4,433,OPEN CONTAINER BEER/WINE IN VEH
4868,12/31/2018,2024,29407,65,W,M,T4,430,DUI 1ST
4869,12/31/2018,2245,29401,27,W,M,T2,222,DISORDERLY CONDUCT
4870,12/31/2018,2238,29401,28,W,F,T2,222,DISORDERLY CONDUCT


At this point I dropped any records that did not have a recorded sex or race. I also made sex a binary attribute.

In [19]:
data.replace(' ', np.nan, inplace=True)
# Drop rows with no recorded sex or race
has_sex = pd.notnull(data['sex'])
has_race = pd.notnull(data['race'])
# Combine both masks so the filter is applied once against an aligned index
data = data[has_sex & has_race].copy()

# Make 'sex' a numerical attribute (F = 0, M = 1); assign back rather than
# calling replace(inplace=True) on a column selection
data['sex'] = data['sex'].replace({'F': 0, 'M': 1})
data.dropna(inplace=True)
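
As a quick sanity check of the cleaning step, the sketch below applies the same idea to a tiny hand-made frame. The rows in `toy` are invented for illustration and are not taken from the arrest data.

```python
import numpy as np
import pandas as pd

# Invented rows: blanks stand in for missing sex/race values
toy = pd.DataFrame({
    'sex':  ['M', 'F', ' ', 'M'],
    'race': ['W', ' ', 'B', 'B'],
    'age':  [25, 31, 40, 19],
})

# Blank strings become NaN so the usual missing-value tools apply
toy = toy.replace(' ', np.nan)

# Keep only rows where both sex and race are present
toy = toy.dropna(subset=['sex', 'race']).copy()

# Encode sex as 0/1 with an explicit mapping
toy['sex'] = toy['sex'].map({'F': 0, 'M': 1})

print(toy)
```

Only the first and last rows survive, since each of the other two is missing either sex or race.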

After removing the attributes deemed not useful and any records with missing data, I engineered two features, Month from Arrest_Dat and Hour from ArrestTime, because certain crimes may be more likely at certain times of the day or year. Arrest_Dat and ArrestTime were then dropped, as they were no longer needed.

I also made race a binary attribute by representing white as 0 and any other race as 1.

Finally, certain attributes were converted to numeric types, since they were read in from the data as strings and needed to be numeric later in the process.

In [5]:
# Feature engineer the month from arrest date
data['Month'] = data['Arrest_Dat'].str[0:2]
# Feature engineer hour from arrest time
data['Hour'] = data['ArrestTime'].astype(str).str.zfill(4).str[0:2]
data['race'] = np.where(data['race'] == 'W', 0, 1)

In [6]:
# Drop the date and time columns since I've extracted the information necessary
data = data.drop(['Arrest_Dat', 'ArrestTime'], axis = 1)

# Make month, hour, and age numerical attributes
data['Month'] = pd.to_numeric(data['Month'])
data['Hour'] = pd.to_numeric(data['Hour'])
data['age'] = pd.to_numeric(data['age'])
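
To make the string slicing concrete, here is a small sketch on two invented rows laid out like Arrest_Dat and ArrestTime. It shows why the time is zero-padded first: an arrest at 00:35 is stored as the integer 35, and without padding the first two characters would be read as hour 35.

```python
import pandas as pd

# Invented sample rows in the same layout as Arrest_Dat / ArrestTime
toy = pd.DataFrame({
    'Arrest_Dat': ['02/16/2018', '12/31/2018'],
    'ArrestTime': [35, 2340],   # 00:35 and 23:40 stored as integers
})

# Month is the first two characters of the MM/DD/YYYY string
toy['Month'] = pd.to_numeric(toy['Arrest_Dat'].str[0:2])

# Left-pad the time to four digits so 35 becomes '0035' before slicing the hour
toy['Hour'] = pd.to_numeric(toy['ArrestTime'].astype(str).str.zfill(4).str[0:2])

print(toy[['Month', 'Hour']])
```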

In my first run-through, I found that one-hot encoding the charge attribute as read in from the data produced too many new dimensions and prevented some models from running at all. I therefore aggregated this attribute into the following broader categories, which significantly reduced the dimensionality produced by one-hot encoding.

In [7]:
# Function which takes the 'charge' column and categorizes different charges into broader categories
def categorize_charges(charge):
    arson = ['ARSON 2ND DEGREE', 'ARSON 3RD DEGREE']
    assault_and_battery = ['ASSAULT & BATTERY 1ST DEGREE','ASSAULT & BATTERY 2ND DEGREE','ASSAULT & BATTERY 2ND DEGREE, F',
                           'ASSAULT & BATTERY 3RD DEGREE','ASSAULT & BATTERY 3RD DEGREE, M']
    burglary = ['BURGLARY 1ST DEGREE', 'BURGLARY 1ST DEGREE, F', 'BURGLARY 2ND  DEGREE  (VIOLENT)', 
                'BURGLARY 2ND  DEGREE  (VIOLENT), F','BURGLARY 2ND  DEGREE (B)  (VIOLENT), F',
                'BURGLARY 2ND DEGREE','BURGLARY 3RD DEGREE']
    assault_on_police = ['ASSAULT ON POLICE', 'ASSAULT ON POLICE WHILE RESISTING ARREST','ASSAULT ON POLICE, M']
    csc = ['CSC 1ST DEGREE','CSC 2ND DEGREE','CSC 3RD DEGREE']
    csc_minor = ['CSC W/ MINOR (11-14 YOA)', 'CSC W/ MINOR (UNDER 11 YOA)','CSC W/MINOR - 3RD DEGREE']
    property_damage = ['DAMAGE TO PERSONAL PROPERTY','DAMAGE TO REAL PROPERTY','DAMAGE TO REAL PROPERTY, F',
                       'DAMAGING/REMOVING PRIVATE PROPERTY']
    defraud = ['DEFRAUD LOTTERY TICKET','DEFRAUDING PUBLIC ACCOMMODATIONS']
    discharge_firearms = ['DISCHARGING FIREARM','DISCHARGING OF FIREARM']
    disorderly_conduct = ['DISORDERLY CONDUCT','DISORDERLY CONDUCT, M']
    drug_distribution = ['DISTR. CONTROLLED SUBSTANCE CLOSE PROX SCHOOL',
                         'DISTRIBUTE/DELIVER IMITATION CONTROLLED SUBSTANCE', 'DISTRIBUTION / PURCHASING COCAINE',
                         'DISTRIBUTION / PURCHASING COCAINE, M', 'DISTRIBUTION / PURCHASING HEROIN',
                         'DISTRIBUTION / PURCHASING MARIJUANA', 
                         'DISTRIBUTION OF COCAINE BASE','DISTRIBUTION OF COCAINE BASE, F']
    dui = ['DUI 1ST', 'DUI 2ND', 'DUI 2ND, M', 'DUI 3RD', 'DUI 3RD, M']
    dus = ['DUS (FOR DUI) 1ST','DUS (FOR DUI) 2ND', 'DUS (FOR DUI) 3RD +', 'DUS 1ST', 'DUS 1ST, M',
           'DUS 2ND','DUS 2ND, M', 'DUS 3RD +']
    dv = ['DV 1ST','DV 2ND', 'DV 2ND, M', 'DV 3RD']
    dv_han = ['DVHAN', 'DVHAN, F']
    financial = ['FINANCIAL IDENTITY FRAUD', 'FINANCIAL TRANSACTION CARD FRAUD','FINANCIAL TRANSACTION CARD THEFT']
    habitual_offender = ['HABITUAL OFFENDER', 'HABITUAL OFFENDER, F']
    harrassment = ['HARASSMENT 1ST DEGREE', 'HARASSMENT 2ND DEGREE']
    leaving_accident = ['LEAVING SCENE OF ACCIDENT (ATTENDED VEHICLE)', 
                        'LEAVING SCENE OF ACCIDENT (PERSONAL INJURY/DEATH)']
    loud_noise = ['LOUD NOISE','LOUD NOISE, M']
    minor_in_possession = ['MINOR IN POSSESSION OF BEER/WINE','MINOR IN POSSESSION OF LIQUOR']
    no_drivers_license = ['NO DRIVERS LICENSE', 'NO MOPED DRIVERS LICENSE','NO SC DRIVERS LICENSE']
    open_container = ['OPEN CONTAINER ALCOHOL', 'OPEN CONTAINER BEER/WINE IN VEH',
                      'OPEN CONTAINER BEER/WINE IN VEH, M','OPEN LIQUOR CONTAINER IN VEH']
    possession_of_drugs = ['POSS COCAINE','POSS COCAINE BASE','POSS COCAINE, F','POSS COCAINE, M',
                           'POSS CONTROLLED SUBSTANCE','POSS CONTROLLED SUSB SCHEDULE I-V','POSS DRUG PARAPHERNALIA',
                           'POSS DRUG PARAPHERNALIA, M', 'POSS HEROIN','POSS MARIJUANA 2ND +',
                           'POSS MARIJUANA 2ND +, F','POSS MDMA / ECSTASY','POSS METHAMPHETAMINE']
    possession_of_firearms = ['POSS OF FIREARM / KNIFE DURING COMM. OF VIOLENT CRIME',
                              'POSS OF FIREARM/AMMO BY CONVICT VIOLENT FELONY']
    property_enhancement = ['PROPERTY CRIME ENHANCEMENT','PROPERTY CRIME ENHANCEMENT, F']
    pwid_manuf = ['PWID/MANUF COCAINE','PWID/MANUF COCAINE, F', 'PWID/MANUF MARIJUANA', 'PWID/MANUF MARIJUANA, F']
    pwid_dist = ['PWID/MANUF/DISTR HEROIN','PWID/MANUF/DISTR HEROIN, M', 'PWID/MANUF/DISTR MDMA / ECSTASY']
    pwid_purchase = ['PWID/MANUF/DISTR/PURCH COCAINE BASE','PWID/MANUF/DISTR/PURCH COCAINE BASE, F',
                     'PWID/MANUF/DISTR/PURCH METHAMPHETAMINE']
    pwid_traffic = ['PWID/MANUF/DISTR/TRAF CONTROLLED SUBSTANCE','PWID/MANUF/DISTR/TRAF CONTROLLED SUBSTANCE, F',
                    'PWID/MANUF/DISTR/TRAF CONTROLLED SUSB SCHEDULE I-V',
                    'PWID/MANUF/DISTR/TRAF CONTROLLED SUSB SCHEDULE I-V, F']
    speeding = ['SPEEDING 1-10','SPEEDING 11-15','SPEEDING 16-25']
    simple_poss = ['SIMPLE POSS MARIJUANA 1ST','SIMPLE POSS MARIJUANA 1ST, M']
    tampering = ['TAMPERING WITH VEHICLE', 'TAMPERING WITH VEHICLE, M']
    trafficking = ['TRAFFICKING COCAINE 10-28G','TRAFFICKING COCAINE BASE 10-28G',
                   'TRAFFICKING COCAINE BASE 10-28G, F','TRAFFICKING COCAINE BASE 28-100G','TRAFFICKING HEROIN 28G +',
                   'TRAFFICKING HEROIN 4-14G','TRAFFICKING HEROIN 4-14G, F','TRAFFICKING MDMA (ECSTASY)',
                   'TRAFFICKING MDMA (ECSTASY), F','TRAFFICKING METHAMPHETAMINE 28-100G']

    trespassing = ['TRESPASSING','TRESPASSING - POSTED NOTICE','TRESPASSING AFTER WARNING','TRESPASSING, M']
    unlawful_carry = ['UNLAWFUL CARRY OF CONCEALED WEAPON','UNLAWFUL CARRY OF WEAPON ON SCHOOL PROPERTY',
                      'UNLAWFUL CARRYING OF HANDGUN','UNLAWFUL CARRYING OF HANDGUN, M']
    out_of_agency = ['OUT OF AGENCY WARRANT','OUT OF AGENCY WARRANT, M']
    failure_to_stop = ['FAILURE TO STOP BLUE LIGHT/ SIREN','FAILURE TO STOP BLUE LIGHT/ SIREN, F']
    larceny = ['LARCENY','LARCENY BY FALSE PRETENSE']
    public_disorderly_conduct = ['PUBLIC DISORDERLY CONDUCT','PUBLIC DISORDERLY CONDUCT, M']
    
    
    charge_dict = {'ARSON': arson, 'ASSAULT & BATTERY': assault_and_battery, 'BURGLARY': burglary, 
                  'ASSAULT ON POLICE': assault_on_police, 'CSC': csc, 'CSC MINOR': csc_minor, 
                  'PROPERTY DAMAGE': property_damage, 'DEFRAUD': defraud, 'DISCHARGE FIREARMS': discharge_firearms,
                  'DISORDERLY CONDUCT': disorderly_conduct, 'DRUG DISTRIBUTION': drug_distribution,
                  'DUI': dui, 'DUS': dus, 'DV': dv, 'DV HAN': dv_han, 'FINANCIAL FRAUD': financial, 
                  'HABITUAL OFFENDER': habitual_offender, 'HARRASSMENT': harrassment, 
                  'LEAVING ACCIDENT': leaving_accident, 'LOUD NOISE': loud_noise,
                   'MINOR IN POSSESSION': minor_in_possession,'NO DRIVERS LICENSE': no_drivers_license, 
                   'OPEN CONTAINER': open_container,'POSSESSION OF DRUGS': possession_of_drugs, 
                   'POSSESSION OF FIREARMS': possession_of_firearms,
                  'PROPERTY ENHANCEMENT': property_enhancement, 'PWID MANUFACTURING': pwid_manuf, 
                  'PWID DISTRIBUTION': pwid_dist, 'PWID PURCHASING': pwid_purchase, 'PWID TRAFFICKING': pwid_traffic,
                   'SPEEDING': speeding, 'SIMPLE POSSESSION': simple_poss, 'TRAFFICKING': trafficking, 
                   'TAMPERING': tampering, 'TRESSPASSING': trespassing,'UNLAWFUL CARRY': unlawful_carry,
                  'OUT OF AGENCY': out_of_agency, 'FAILURE TO STOP': failure_to_stop, 'LARCENY': larceny,
                  'PUBLIC DISORDERLY CONDUCT': public_disorderly_conduct}
    
    for key in charge_dict:
        charge_list = charge_dict[key]
        if charge in charge_list:
            return key
    return charge
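
The function above rebuilds every list on each call and scans them in order. One possible alternative, sketched here with only two of the categories, inverts the mapping once into a flat dictionary so that each lookup is a single dictionary access. The names `CATEGORY_MEMBERS`, `CHARGE_TO_CATEGORY`, and `categorize_charge_fast` are hypothetical and are not part of the original notebook.

```python
# Two categories copied from the function above; the rest would follow the same pattern
CATEGORY_MEMBERS = {
    'DUI': ['DUI 1ST', 'DUI 2ND', 'DUI 2ND, M', 'DUI 3RD', 'DUI 3RD, M'],
    'LOUD NOISE': ['LOUD NOISE', 'LOUD NOISE, M'],
}

# Invert once: raw charge -> broader category
CHARGE_TO_CATEGORY = {
    raw: category
    for category, members in CATEGORY_MEMBERS.items()
    for raw in members
}

def categorize_charge_fast(charge):
    # Charges that belong to no category are returned unchanged,
    # matching the fallback of the original function
    return CHARGE_TO_CATEGORY.get(charge, charge)

print(categorize_charge_fast('DUI 2ND, M'))
print(categorize_charge_fast('PUBLIC INTOXICATION'))
```

A function like this could be passed to `.map` in the same way as `categorize_charges`.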


In [8]:
# Call the categorize_charges function and apply it on the 'charge' column
data['charge'] = data['charge'].map(categorize_charges)
data.head()

Unnamed: 0,zip,age,race,sex,Team,Beat,charge,Month,Hour
0,29403,25,0,1,T1,122,OPEN CONTAINER,2,0
1,29401,20,0,1,T2,224,PROHIBITED AREAS FOR SOLICITORS,5,8
2,29407,18,1,1,T4,429,SIMPLE POSSESSION,3,22
3,29401,35,0,1,T2,222,TRESSPASSING,6,1
4,29403,21,0,1,T1,122,PUBLIC INTOXICATION,1,2
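
To illustrate why grouping helps with one-hot encoding, the sketch below counts the dummy columns produced before and after collapsing a few charges. The four charges are taken from the lists above, but the grouping rule used here (keep the first word) is a simplification for this toy example only and is not the notebook's actual mapping.

```python
import pandas as pd

# Four raw labels that collapse into two broader groups
raw = pd.Series(['DUI 1ST', 'DUI 2ND', 'DUS 1ST', 'DUS 2ND'], name='charge')

# Hypothetical grouping rule for this toy example only: keep the first word
grouped = raw.str.split().str[0]

# One dummy column is created per distinct value
raw_dummies = pd.get_dummies(raw)
grouped_dummies = pd.get_dummies(grouped)

print(raw_dummies.shape[1], grouped_dummies.shape[1])
```

Each distinct charge string becomes its own column, so merging variants of the same offence directly reduces the width of the encoded data.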


Here I built functions to run each of the models I wanted to use with the dataset. At one point I also had KNN and neural networks, but KNN took a significant amount of time to run and was always beaten by SVM's accuracy, and the neural networks never ran to completion (probably due to the dimensionality of the data), so I have not included them in the final version.

In [9]:
# All the supervised learning model code
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

def svm_model(features_data, label_data):

    svm_scaler = StandardScaler()
    svm_pca = PCA()
    svcClassifier = SVC()
    svm_pipeline = Pipeline(steps = [('scaler', svm_scaler),('PCA', svm_pca), ('SVC', svcClassifier)])
    param_grid = {
        'PCA__n_components': list(range(5, 19)),
        'SVC__kernel': ['linear', 'rbf', 'poly']
    }

    gridSearch = GridSearchCV(svm_pipeline, param_grid, cv = 5)
    svm_predicted = cross_val_predict(gridSearch, features_data, label_data, cv= 3)
    svm_scoreFinal = accuracy_score(label_data, svm_predicted)
    print("Accuracy for SVM: ", svm_scoreFinal)

    print("Classification Report SVM:", classification_report(label_data, svm_predicted))
    

def decision_tree_model(features_data, label_data):
    ft_train, ft_test, class_train, class_test = train_test_split(features_data, label_data, test_size = 0.2, 
                                                                  random_state = 0)

    decision_tree = tree.DecisionTreeClassifier(criterion='entropy')
    decision_tree = decision_tree.fit(ft_train, class_train)

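    # Hold-out accuracy on the 20% test split; computed here but not reported
    predicted = decision_tree.predict(ft_test)
    score = accuracy_score(class_test, predicted)

    # Reported figure: mean accuracy over 10-fold cross-validation on the full data
    scores = cross_val_score(decision_tree,features_data, label_data, cv = 10)
    print("Accuracy for Decision Trees: ", sum(scores)/10)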
    
def naive_bayes_model(features_data, label_data):
    clf = GaussianNB()
    accuracies = cross_val_score(clf, features_data, label_data, scoring = 'accuracy', cv = 10)
    print('Accuracy for Naive Bayes: ', np.average(accuracies))
    predictions = cross_val_predict(clf, features_data, label_data, cv = 10)
    print('Confusion Matrix Naive Bayes: ', confusion_matrix(label_data, predictions))
    target_names = ['class 0', 'class 1']
    print('Classification Report Naive Bayes: ', classification_report(label_data, predictions, target_names = target_names))

def random_forest_model(features_data, label_data):
    model = RandomForestClassifier()
    param_grid = {
        'max_depth' : list(range(35, 55)),
        'min_samples_leaf' : [8, 10, 12],
        'max_features' : ['sqrt', 'log2']
    }

    clf = GridSearchCV(model, param_grid, cv = 5)
    metrics = cross_val_score(clf, features_data, label_data, cv = 5)
    print('Random Forest Accuracy: ', np.average(metrics))
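
Both `svm_model` and `random_forest_model` pass a `GridSearchCV` object into `cross_val_predict` or `cross_val_score`. This amounts to nested cross-validation: the inner search tunes hyperparameters on each outer training fold, and the outer folds estimate accuracy on data the search never saw. A minimal sketch of that pattern follows, using synthetic data and a deliberately small grid so it runs quickly; the sample sizes and grid values are arbitrary choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the arrest features; sizes are arbitrary
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('SVC', SVC())])
param_grid = {'SVC__kernel': ['linear', 'rbf']}

# Inner loop: 3-fold grid search picks the kernel
inner_search = GridSearchCV(pipeline, param_grid, cv=3)

# Outer loop: 3-fold CV scores the whole search procedure
outer_scores = cross_val_score(inner_search, X, y, cv=3)

print(len(outer_scores), outer_scores.mean())
```

The cost grows multiplicatively (outer folds times inner folds times grid size), which is worth keeping in mind when widening the grids above.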


At this point, I ran the data through the models with different attributes kept and dropped. In this run, zip code was dropped.

In [10]:
# Dropping zip code
no_zip_data = data.drop('zip', axis = 1)
no_zip_data = pd.get_dummies(no_zip_data, prefix_sep='_', drop_first=False)

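# Move race to the last column (the insert index is computed before the pop, so it lands at the end)
cols = list(no_zip_data)
cols.insert((len(cols) - 1), cols.pop(cols.index('race')))
no_zip_data = no_zip_data[cols]

# Positional split: features are columns 0-162 and the label is column 164, so column 163 is not used
features_data = no_zip_data.iloc[:, 0:163]
label_data = no_zip_data.iloc[:,164]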

svm_model(features_data, label_data)
decision_tree_model(features_data, label_data)
naive_bayes_model(features_data, label_data)
random_forest_model(features_data, label_data)

Accuracy for SVM:  0.7075706214689266
Classification Report SVM:              precision    recall  f1-score   support

          0       0.69      0.62      0.65      1947
          1       0.72      0.78      0.75      2478

avg / total       0.71      0.71      0.71      4425

Accuracy for Decision Trees:  0.6492418533657602
Accuracy for Naive Bayes:  0.5493724236236139
Confusion Matrix Naive Bayes:  [[1790  157]
 [1837  641]]
Classification Report Naive Bayes:               precision    recall  f1-score   support

    class 0       0.49      0.92      0.64      1947
    class 1       0.80      0.26      0.39      2478

avg / total       0.67      0.55      0.50      4425

Random Forest Accuracy:  0.6851991972222478
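
The positional slices used in this run depend on knowing exactly how many dummy columns `get_dummies` produced, which makes them easy to get off by one whenever the set of dropped attributes changes. A possible alternative, sketched below on invented rows, selects the label by name and treats every remaining column as a feature. This is an illustration only; switching the notebook to it could change the reported numbers slightly.

```python
import pandas as pd

# Invented rows with the same kinds of columns as the cleaned data
toy = pd.DataFrame({
    'age':    [25, 20, 18],
    'race':   [0, 0, 1],
    'Team':   ['T1', 'T2', 'T4'],
    'charge': ['DUI', 'LOUD NOISE', 'DUI'],
})

encoded = pd.get_dummies(toy)

# Select the label by name and treat every other column as a feature
label_data = encoded['race']
features_data = encoded.drop(columns='race')

print(features_data.columns.tolist())
```

With this split there is no need to move race to the last column or to update any indices between experiments.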


In this run, I kept the zip code but dropped Team and Beat.

In [11]:
# Dropping team and beat but keeping zip code
zip_data = data.drop(['Team', 'Beat'], axis = 1)
zip_data = pd.get_dummies(zip_data, prefix_sep='_', drop_first=False)


# Move race to last column
cols = list(zip_data)
cols.insert((len(cols) - 1), cols.pop(cols.index('race')))
zip_data = zip_data[cols]

features_data = zip_data.iloc[:, 0:133]
label_data = zip_data.iloc[:,134 ]

svm_model(features_data, label_data)
decision_tree_model(features_data, label_data)
naive_bayes_model(features_data, label_data)
random_forest_model(features_data, label_data)

Accuracy for SVM:  0.6833898305084746
Classification Report SVM:              precision    recall  f1-score   support

          0       0.69      0.52      0.59      1947
          1       0.68      0.81      0.74      2478

avg / total       0.68      0.68      0.68      4425

Accuracy for Decision Trees:  0.6203479597546028
Accuracy for Naive Bayes:  0.4940068022346983
Confusion Matrix Naive Bayes:  [[1830  117]
 [2122  356]]
Classification Report Naive Bayes:               precision    recall  f1-score   support

    class 0       0.46      0.94      0.62      1947
    class 1       0.75      0.14      0.24      2478

avg / total       0.63      0.49      0.41      4425

Random Forest Accuracy:  0.6673350925979946


In this run, age was dropped.

In [12]:
# Dropping age
no_age_data = data.drop(['age'], axis = 1)
no_age_data = pd.get_dummies(no_age_data, prefix_sep='_', drop_first=False)

# Move race to last column
cols = list(no_age_data)
cols.insert((len(cols) - 1), cols.pop(cols.index('race')))
no_age_data = no_age_data[cols]

features_data = no_age_data.iloc[:, 0:175]
label_data = no_age_data.iloc[:,176 ]

svm_model(features_data, label_data)
decision_tree_model(features_data, label_data)
naive_bayes_model(features_data, label_data)
random_forest_model(features_data, label_data)

Accuracy for SVM:  0.6851977401129944
Classification Report SVM:              precision    recall  f1-score   support

          0       0.66      0.59      0.62      1947
          1       0.70      0.76      0.73      2478

avg / total       0.68      0.69      0.68      4425

Accuracy for Decision Trees:  0.6178720536081539
Accuracy for Naive Bayes:  0.5477943402863257
Confusion Matrix Naive Bayes:  [[1783  164]
 [1837  641]]
Classification Report Naive Bayes:               precision    recall  f1-score   support

    class 0       0.49      0.92      0.64      1947
    class 1       0.80      0.26      0.39      2478

avg / total       0.66      0.55      0.50      4425

Random Forest Accuracy:  0.6788771267952235


In this run, I dropped Month and Hour.

In [13]:
# Dropping month and hour
no_time_data = data.drop(['Month', 'Hour'], axis = 1)
no_time_data = pd.get_dummies(no_time_data, prefix_sep='_', drop_first=False)

# Move race to last column
cols = list(no_time_data)
cols.insert((len(cols) - 1), cols.pop(cols.index('race')))
no_time_data = no_time_data[cols] 

features_data = no_time_data.iloc[:, 0:174]
label_data = no_time_data.iloc[:,175 ]

svm_model(features_data, label_data)
decision_tree_model(features_data, label_data)
naive_bayes_model(features_data, label_data)
random_forest_model(features_data, label_data)

Accuracy for SVM:  0.6971751412429379
Classification Report SVM:              precision    recall  f1-score   support

          0       0.69      0.56      0.62      1947
          1       0.70      0.80      0.75      2478

avg / total       0.70      0.70      0.69      4425

Accuracy for Decision Trees:  0.6616986633745933
Accuracy for Naive Bayes:  0.5566097215062445
Confusion Matrix Naive Bayes:  [[1772  175]
 [1787  691]]
Classification Report Naive Bayes:               precision    recall  f1-score   support

    class 0       0.50      0.91      0.64      1947
    class 1       0.80      0.28      0.41      2478

avg / total       0.67      0.56      0.51      4425

Random Forest Accuracy:  0.6985366283507526
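
The four experiments above repeat the same encode-and-split steps with a different set of dropped columns. One way to factor that out is sketched below. `prepare` and `ABLATIONS` are hypothetical names, the rows are invented, and the model calls are left as a comment so the sketch stays small.

```python
import pandas as pd

# Invented rows with the columns of the cleaned data
toy = pd.DataFrame({
    'zip':    [29403, 29401, 29407],
    'age':    [25, 20, 18],
    'race':   [0, 0, 1],
    'sex':    [1, 1, 1],
    'Team':   ['T1', 'T2', 'T4'],
    'Beat':   [122, 224, 429],
    'charge': ['OPEN CONTAINER', 'LOUD NOISE', 'SIMPLE POSSESSION'],
    'Month':  [2, 5, 3],
    'Hour':   [0, 8, 22],
})

# Columns removed in each experiment, mirroring the four runs above
ABLATIONS = {
    'no zip': ['zip'],
    'no team/beat': ['Team', 'Beat'],
    'no age': ['age'],
    'no month/hour': ['Month', 'Hour'],
}

def prepare(frame, dropped, label='race'):
    # One-hot encode what is left, then split features and label by name
    encoded = pd.get_dummies(frame.drop(columns=dropped))
    return encoded.drop(columns=label), encoded[label]

shapes = {}
for name, dropped in ABLATIONS.items():
    features_data, label_data = prepare(toy, dropped)
    shapes[name] = features_data.shape
    # svm_model(features_data, label_data) and the other models would be called here

print(shapes)
```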
