# Using Various Classifiers to Predict Arrests

Let's see how accurately different classifiers can predict whether a police report ends in an arrest.

First, let's import the libraries we need and load our data.

In [1]:
import models.predict as predictions
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

In [2]:
police_data_2019 = pd.read_csv('data/preprocess/bayes_2019.csv')
police_data_2021 = pd.read_csv('data/preprocess/bayes_2021.csv')

Next, let's create our training and test sets from the 2019 police data. We also prepare the 2021 data, keeping only its test portion.

In [3]:
X_train, X_test, y_train, y_test = predictions.cnb_process(police_data_2019)
_, X_test_2021, _, y_test_2021 = predictions.cnb_process(police_data_2021, test_size=1)

Let's create our first classifier!

## Categorical Naive Bayes

In [4]:
cnb = predictions.cnb_predictor(X_train, y_train)
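
The helper `predictions.cnb_predictor` lives in the project's own `models.predict` module and isn't shown in this notebook. As a rough sketch of what such a helper might do, assuming it wraps scikit-learn's `CategoricalNB` (an assumption, not something the notebook confirms), fitting on integer-encoded categorical features could look like this. The toy arrays and the names `X_toy`, `y_toy`, and `toy_cnb` are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: each column is an integer-encoded categorical feature.
X_toy = np.array([
    [0, 1],
    [1, 0],
    [0, 0],
    [1, 1],
    [2, 1],
    [2, 0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # 1 = arrest, 0 = no arrest

# alpha is the additive (Laplace) smoothing strength; 1.0 is the library default.
toy_cnb = CategoricalNB(alpha=1.0)
toy_cnb.fit(X_toy, y_toy)

toy_predictions = toy_cnb.predict(X_toy)            # hard 0/1 labels
toy_probabilities = toy_cnb.predict_proba(X_toy)    # one column per class
print(toy_predictions)
```

`CategoricalNB` expects every feature to be encoded as non-negative integers, which is presumably why the data is preprocessed before it is loaded here.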

Now that we have a classifier based on our training data, let's see how good it is at predicting whether or not a given police report ends in an arrest.

For starters, let's check the accuracy.

In [5]:
y_predicted = cnb.predict(X_test)
missed = (y_test != y_predicted).sum()

acc = accuracy_score(y_test, y_predicted)
print(f'Original size: {X_test.shape[0]}')
print(f'Misclassified: {missed}')
print(f'Accuracy: {acc}')

Original size: 106468
Misclassified: 3466
Accuracy: 0.967445617462524


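With an accuracy of about 96.7%, things look okay at first glance.

But accuracy alone can be misleading when one class is much more common than the other, so let's look a little closer.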

### Precision and Recall

We'll start with the confusion matrix. Rows are the true labels and columns are the predicted labels.

In [6]:
pd.DataFrame(confusion_matrix(y_test, y_predicted))

| | 0 | 1 |
|---|---|---|
| 0 | 100596 | 987 |
| 1 | 2479 | 2406 |


Next, let's have a look at the classification report.

In [7]:
report = classification_report(y_test, y_predicted, output_dict=True)
pd.DataFrame(report).transpose()

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.97595 | 0.990284 | 0.983064 | 101583.0 |
| 1 | 0.709107 | 0.492528 | 0.5813 | 4885.0 |
| accuracy | 0.967446 | 0.967446 | 0.967446 | 0.967446 |
| macro avg | 0.842528 | 0.741406 | 0.782182 | 106468.0 |
| weighted avg | 0.963706 | 0.967446 | 0.964631 | 106468.0 |


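It seems that there are a lot of false negatives: recall for arrests (class 1) is only about 0.49, meaning the classifier misses 2,479 of the 4,885 actual arrests. Its precision is better, though. About 71% of the reports it tags as arrests really are arrests.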
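
As a sanity check, the class-1 figures in the classification report can be re-derived by hand from the four confusion-matrix counts. This is plain arithmetic on the numbers printed above, nothing more; the variable names are just for illustration.

```python
# Counts copied from the Categorical Naive Bayes confusion matrix
# (rows are true labels, columns are predicted labels).
tn, fp = 100596, 987    # true class 0: predicted 0, predicted 1
fn, tp = 2479, 2406     # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of the reports predicted as arrests, how many were arrests
recall = tp / (tp + fn)      # of the actual arrests, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'Precision: {precision:.6f}')
print(f'Recall:    {recall:.6f}')
print(f'F1:        {f1:.6f}')
```

The three values agree with the `1` row of the report, which confirms that the weak spot is recall rather than precision.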

Let's compare to another classifier.

## One-Class SVM (using stochastic gradient descent)

Here, I treat arrests as anomalies. Since I have a [fairly high number of samples](https://scikit-learn.org/stable/modules/outlier_detection.html), I'm going to use a one-class SVM fitted with stochastic gradient descent (SGD).

Let's try it out!

In [8]:
# Fraction of training samples labelled 1 (arrest), i.e. the expected share of anomalies.
anomaly_fraction = (y_train == 1).mean()
svm = predictions.oc_svm_predictor(X_train, anomaly_fraction)

First, we need to flip the labels: the classifier tags non-anomalies (inliers) as 1 and anomalies as -1, while in our data an arrest (the anomaly) is labelled 1 and a non-arrest 0.

In [9]:
original_predictions = svm.predict(X_test)
# Map inliers (+1) to 0 (no arrest) and everything else (outliers) to 1 (arrest).
y_predicted_svm = np.where(original_predictions == 1, 0, 1)

Now, let's run the same evaluation as before: accuracy, confusion matrix, and classification report.

In [10]:
missed = (y_test != y_predicted_svm).sum()

acc = accuracy_score(y_test, y_predicted_svm)
print(f'Original size: {X_test.shape[0]}')
print(f'Misclassified: {missed}')
print(f'Accuracy: {acc}')

Original size: 106468
Misclassified: 5914
Accuracy: 0.9444527933275726


In [11]:
pd.DataFrame(confusion_matrix(y_test, y_predicted_svm))

| | 0 | 1 |
|---|---|---|
| 0 | 100553 | 1030 |
| 1 | 4884 | 1 |


In [12]:
report = classification_report(y_test, y_predicted_svm, output_dict=True)
pd.DataFrame(report).transpose()

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.953678 | 0.989861 | 0.971433 | 101583.0 |
| 1 | 0.00097 | 0.000205 | 0.000338 | 4885.0 |
| accuracy | 0.944453 | 0.944453 | 0.944453 | 0.944453 |
| macro avg | 0.477324 | 0.495033 | 0.485885 | 106468.0 |
| weighted avg | 0.909966 | 0.944453 | 0.926877 | 106468.0 |


This is not good at all. Only 1 of the 4,885 actual arrests was correctly classified, and there were 1,030 false positives to go along with it.
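
To put the two accuracy figures in context, it helps to compare them with a trivial baseline that always predicts "no arrest". The sketch below only uses the class counts from the `support` column and the accuracies printed earlier; the variable names are illustrative.

```python
# Class counts taken from the "support" column of the classification reports.
n_no_arrest, n_arrest = 101583, 4885
total = n_no_arrest + n_arrest

# Accuracy of a trivial model that labels every report as "no arrest".
baseline_acc = n_no_arrest / total

# Accuracies reported above for the two classifiers.
cnb_acc = 0.967446
svm_acc = 0.944453

print(f'Always "no arrest": {baseline_acc:.6f}')
print(f'Categorical NB:     {cnb_acc:.6f} ({cnb_acc - baseline_acc:+.6f} vs. baseline)')
print(f'One-class SVM:      {svm_acc:.6f} ({svm_acc - baseline_acc:+.6f} vs. baseline)')
```

Because only about 4.6% of the test reports are arrests, the baseline already scores roughly 95.4%. The Categorical Naive Bayes classifier beats it by a little over one percentage point, while the one-class SVM falls below it.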