The dataset is a list of crime reports, some of which have been labeled according to whether they pertain to a home invasion. I set out to build a machine learning model that classifies the unlabeled descriptions as describing a home invasion or not.

This code sample is meant to be a quick highlight of my pre-processing/machine learning abilities. 

In [25]:
import pandas as pd
import numpy as np
import re
import sklearn 

In [26]:
alls = pd.read_csv("allegations.csv", sep = '|')
alls.head(10)

Unnamed: 0,cr_id,text,home_invasion
0,1042384,It is reported that the involved officer and h...,
1,1042532,It is reported that Officers Sierra and Mosque...,
2,1043217,IT IS REPORTED THAT THE INVOLVED MEMBER DISCOV...,
3,1043569,It is reported that the involved member was re...,
4,1043812,The involved officers attempted to stop a vehi...,
5,1044135,"It is reported that the subject, Dion Richards...",
6,1044692,It is reported that while the offender was han...,
7,1045186,It is reported the involved officer/victim was...,0.0
8,1045352,It is reported that during a narcotic investig...,
9,1045759,It is reported that the involved officer obser...,


In [27]:
len(alls.index)

19138

# 1. Prepare data 

   1a. preprocess

In [28]:
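def preprocess(sentence):
    # lowercase so "Officer" and "OFFICER" map to the same token
    sentence = sentence.lower()
    # treat slashes as word breaks (e.g. "officer/victim" -> "officer victim")
    sentence = re.sub("/", " ", sentence)
    # drop everything that is not a letter or a space (digits, punctuation)
    sentence = re.sub("[^A-Za-z ]", "", sentence)
    return sentence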

In [29]:
alls['text'] = alls['text'].apply(preprocess)

In [30]:
reviewed_alls = alls[(alls['home_invasion'] == 0) | (alls['home_invasion'] == 1)]
reviewed_alls.head()

Unnamed: 0,cr_id,text,home_invasion
7,1045186,it is reported the involved officer victim was...,0.0
10,1047231,it is reported that during a foot pursuit the ...,0.0
13,1047919,it is reported that the involved officer and h...,0.0
17,1048962,the victim alleges that an unknown male black ...,0.0
19,1048966,the reporting party victim stated that she tel...,0.0


In [31]:
np.mean(reviewed_alls['home_invasion']) # 7% of cases involve home invasion - imbalanced classes

0.07123775601068566

In [32]:
len(reviewed_alls.index) # n = 2246 

2246

1b. vectorize (using a bag-of-words count vectorizer for simplicity; there is a lot to experiment with here, including lemmatization, 2- or 3-word phrases rather than single words, and other vectorization methods)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(reviewed_alls['text'])
x = x.toarray()
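
As a rough sketch of the variations mentioned in 1b, the cell below compares plain word counts with counts that include 2-word phrases, and with TF-IDF weighting. The two sentences are made up for illustration and do not come from allegations.csv; `ngram_range` and `TfidfVectorizer` are standard scikit-learn options, but whether they would help on this dataset has not been tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from allegations.csv
docs = [
    "officers entered the home without a warrant",
    "officers stopped the vehicle without cause",
]

# Single words only (the default, as used in this notebook)
unigram_vec = CountVectorizer()
unigram_counts = unigram_vec.fit_transform(docs)

# Single words plus 2-word phrases such as "the home"
bigram_vec = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigram_vec.fit_transform(docs)

# TF-IDF weights instead of raw counts
tfidf_vec = TfidfVectorizer()
tfidf_weights = tfidf_vec.fit_transform(docs)

# Adding phrases grows the feature space; TF-IDF keeps the same vocabulary
print(unigram_counts.shape, bigram_counts.shape, tfidf_weights.shape)
```

Phrases can capture context that single words miss, at the cost of many more columns, which matters when there are only a couple of thousand labeled rows.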

1c. split into test/train

In [34]:
from sklearn.model_selection import train_test_split
np.random.seed(2021)
x_train, x_test, y_train, y_test = train_test_split(x, reviewed_alls['home_invasion'], test_size = .2)

In [35]:
len(x_train[0]) # features

8349

In [36]:
len(x_train) # samples

1796
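
With only about 7% positive cases, a purely random split can give the train and test sets slightly different positive rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up stand-in data (`X_demo`, `y_demo`) rather than the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 7% positive labels
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 930)

# stratify keeps the share of positives the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2021
)

print(y_tr.mean(), y_te.mean())
```

Passing `random_state` directly also makes the split reproducible without relying on the global NumPy seed.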

# 2. Fit models

For the sake of brevity, we'll evaluate two models:
1. Naive Bayes - simple (needs little tuning) and a good baseline to compare other models against
2. Random Forest - can handle a lot of features with relatively few samples, and offers plenty of hyperparameters to tune if needed

We'll use cross-validation rather than a separate validation set, since the sample size is relatively small.

2a. Naive Bayes

In [37]:
from sklearn.naive_bayes import MultinomialNB 
from sklearn.model_selection import cross_val_predict

nb = MultinomialNB()
y_pred = cross_val_predict(nb, x_train, y_train)



In [38]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_train, y_pred)
conf_mat

array([[1649,   11],
       [  43,   93]])

In [39]:
from sklearn.metrics import f1_score
f1_score(y_train, y_pred)

0.7749999999999999
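
To make the F1 score less of a black box, here is the same number worked out by hand from the Naive Bayes confusion matrix above. In scikit-learn's layout, rows are the true class and columns the predicted class.

```python
# Counts read off the Naive Bayes confusion matrix above
tn, fp = 1649, 11   # true class 0: correctly rejected, falsely flagged
fn, tp = 43, 93     # true class 1: missed, correctly flagged

precision = tp / (tp + fp)   # share of predicted home invasions that are correct
recall = tp / (tp + fn)      # share of actual home invasions that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.894 0.684 0.775
```

Precision is reasonably high, but recall is the weak spot: roughly a third of the true home invasions are missed.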

2b. Random Forest

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [41]:
# select best hyperparameters using grid search

rf = RandomForestClassifier()
# define search space
space = dict()
space['n_estimators'] = [100, 200, 500] 
space['min_samples_leaf'] = [1, 2]
space['max_features'] = [200, 400, 600]

# search 
search = GridSearchCV(rf, space, scoring='f1', n_jobs=-1, cv=5, refit=False) # refit=False: best_params_ is still available; the forest is rebuilt from it below
search.fit(x_train,y_train) 

# visualize results 
# pd.DataFrame(search.cv_results_)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid
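
On the `refit` question: with `refit=True` (the default), `GridSearchCV` retrains the best parameter setting on all of the data passed to `fit` and exposes it as `best_estimator_`. A minimal sketch on synthetic data follows; `X_demo`, `y_demo` and the tiny grid are placeholders, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for x_train / y_train
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, weights=[0.8, 0.2], random_state=0
)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "min_samples_leaf": [1, 2]},
    scoring="f1",
    cv=3,
    refit=True,  # the default: refit the best setting on all of X_demo
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
best_rf = demo_search.best_estimator_  # already fitted, ready for .predict
demo_preds = best_rf.predict(X_demo)
```

With `refit=False`, as in this notebook, `best_params_` is still populated for a single scoring metric, so building a fresh forest from it works just as well and avoids one extra fit.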

In [42]:
# build a random forest based on the best parameters 
rf_clf = RandomForestClassifier(**search.best_params_) 
y_pred = cross_val_predict(rf_clf, x_train, y_train)



In [43]:
conf_mat = confusion_matrix(y_train, y_pred)
conf_mat

array([[1657,    3],
       [  28,  108]])

In [44]:
f1_score(y_train, y_pred)

0.874493927125506

The model still produces far more false negatives (28) than false positives (3), so depending on the real-world cost of each kind of error, we could lower the probability threshold for declaring something a home invasion.
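
One way to try a different threshold is to ask `cross_val_predict` for class probabilities instead of labels and apply the cutoff manually. This is a sketch on synthetic data (`X_demo`, `y_demo` and the 0.3 cutoff are arbitrary placeholders), not a result for the allegations data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Hypothetical imbalanced data standing in for x_train / y_train
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold probability of the positive class for every row
proba = cross_val_predict(
    demo_rf, X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

# Compare the usual 0.5 cutoff with a lower, more sensitive one
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_demo, preds).ravel()
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

A lower cutoff can only flag more rows as positive, so false negatives cannot go up while false positives usually do; which trade-off is acceptable depends on the application.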

# 3. Final evaluation on test data for the selected model

In [45]:
rf_clf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=600, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [46]:
y_pred = rf_clf.predict(x_test)

In [47]:
conf_mat = confusion_matrix(y_test, y_pred)
conf_mat

array([[426,   0],
       [  1,  23]])

In [48]:
f1_score(y_test, y_pred)

0.9787234042553191