# TeamA Machine Learning Project
## Los Angeles Crime Data 2020-Present
### Saulo Guzman and Alex Philipsen

---

## Importing Libraries and Data

In [1]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics

In [2]:
uneditedDF = pd.read_csv("Crime_Data_from_2020_to_Present.csv", nrows=100000)

---
## Cleaning Data
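We use five columns: area code (`AREA`), crime code (`Crm Cd`), time of occurrence (`TIME OCC`), latitude (`LAT`), and longitude (`LON`). The crime code is the prediction target; the other four are the features.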

In [3]:
crimeDF = uneditedDF[['AREA', 'Crm Cd', 'TIME OCC', 'LAT', 'LON']]
print(crimeDF.head(1))

   AREA  Crm Cd  TIME OCC      LAT       LON
0     7     510      2130  34.0375 -118.3506
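
`TIME OCC` appears to be a 24-hour clock value stored as an integer (2130 for 21:30). If so, the numeric gap between 1959 and 2000 is 41 even though the two times are one minute apart, which distorts any distance-based model. A minimal sketch of one way to handle this, assuming the HHMM encoding holds for every row, converts the column to minutes since midnight. The helper name `hhmm_to_minutes` and the `MINUTES` column are hypothetical and not part of the original notebook.

```python
import pandas as pd

def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert integer HHMM clock values (e.g. 2130) to minutes since midnight."""
    hours = hhmm // 100
    minutes = hhmm % 100
    return hours * 60 + minutes

# Small illustrative frame; the values are made up, not taken from the dataset.
sample = pd.DataFrame({"TIME OCC": [5, 1959, 2000, 2130]})
sample["MINUTES"] = hhmm_to_minutes(sample["TIME OCC"])
print(sample)
```

After the conversion, 19:59 and 20:00 differ by 1 rather than 41.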


---
## Implementing KNN Classifier

In [4]:
# Splitting data between train, test, and valid sets.
# Each frac applies to the rows remaining after the previous drop,
# so the sets hold 40%, 18%, and 12.6% of the original rows.
trainSet = crimeDF.sample(frac=0.4)
crimeDF = crimeDF.drop(trainSet.index)
testSet = crimeDF.sample(frac=0.3)
crimeDF = crimeDF.drop(testSet.index)
validSet = crimeDF.sample(frac=0.3)
crimeDF = crimeDF.drop(validSet.index)

# Splitting each set into X and y

X_train = trainSet[['AREA', 'TIME OCC', 'LAT', 'LON']]
y_train = trainSet['Crm Cd']
X_test = testSet[['AREA', 'TIME OCC', 'LAT', 'LON']]
y_test = testSet['Crm Cd']
X_valid = validSet[['AREA', 'TIME OCC', 'LAT', 'LON']]
y_valid = validSet['Crm Cd']

areaCodeToName = (uneditedDF[['AREA', 'AREA NAME']]
                    .drop_duplicates()
                    .sort_values(by='AREA')
                    )
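
Because each `sample(frac=...)` call in the cell above draws from what remains after the previous `drop`, the three sets come to 40%, 18%, and 12.6% of the rows, and the remainder is unused. If the intent was a 40/30/30 split (an assumption; the notebook does not state the target proportions), a sketch using scikit-learn's `train_test_split` with a fixed `random_state` for reproducibility might look like this. The frame `demoDF` is synthetic and only mimics the notebook's column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for crimeDF; column names follow the notebook, values are random.
rng = np.random.default_rng(0)
demoDF = pd.DataFrame({
    "AREA": rng.integers(1, 22, size=1000),
    "Crm Cd": rng.integers(100, 1000, size=1000),
    "TIME OCC": rng.integers(0, 24, size=1000) * 100,
    "LAT": rng.uniform(33.7, 34.3, size=1000),
    "LON": rng.uniform(-118.7, -118.1, size=1000),
})

# First carve off 40% for training, then split the remaining 60% in half,
# which yields 30% test and 30% validation of the original rows.
trainSet, restSet = train_test_split(demoDF, train_size=0.4, random_state=42)
testSet, validSet = train_test_split(restSet, test_size=0.5, random_state=42)

print(len(trainSet), len(testSet), len(validSet))
```

Fixing `random_state` also makes the accuracy figures repeatable from run to run, which the unseeded `sample` calls do not.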

In [14]:
for num_neighbors in range(1, 22):
    knnClassifier = KNeighborsClassifier(n_neighbors=num_neighbors)
    knnClassifier.fit(X_train, y_train)
    y_pred = knnClassifier.predict(X_test)
    accuracy = sklearn.metrics.accuracy_score(y_true=y_test, y_pred=y_pred)
    print(f"The accuracy for KNN for {num_neighbors} on the test set is: {accuracy}")

The accuracy for KNN for 1 on the test set is: 0.08655555555555555
The accuracy for KNN for 2 on the test set is: 0.084
The accuracy for KNN for 3 on the test set is: 0.08327777777777778
The accuracy for KNN for 4 on the test set is: 0.08444444444444445
The accuracy for KNN for 5 on the test set is: 0.08905555555555555
The accuracy for KNN for 6 on the test set is: 0.092
The accuracy for KNN for 7 on the test set is: 0.09583333333333334
The accuracy for KNN for 8 on the test set is: 0.09916666666666667
The accuracy for KNN for 9 on the test set is: 0.10038888888888889
The accuracy for KNN for 10 on the test set is: 0.10166666666666667
The accuracy for KNN for 11 on the test set is: 0.10272222222222223
The accuracy for KNN for 12 on the test set is: 0.10288888888888889
The accuracy for KNN for 13 on the test set is: 0.10483333333333333
The accuracy for KNN for 14 on the test set is: 0.10544444444444444
The accuracy for KNN for 15 on the test set is: 0.10716666666666666
The accuracy for 

The best accuracy achieved by the unweighted KNN classifier was around 0.11, at num_neighbors = 20. Next, we try weighting neighbors by distance.
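
Choosing `n_neighbors` by test-set accuracy means the reported test score is no longer an unbiased estimate. The notebook already builds `X_valid` and `y_valid`, so one option is to tune on those and score the test set only once. The sketch below illustrates that pattern on synthetic data from `make_classification`; the data, the helper name `pick_best_k`, and the range of k values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def pick_best_k(X_train, y_train, X_valid, y_valid, k_values):
    """Return (best_k, best_accuracy) measured on the validation set."""
    best_k, best_acc = None, -1.0
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = accuracy_score(y_valid, model.predict(X_valid))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

# Synthetic multi-class data standing in for the crime features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

best_k, valid_acc = pick_best_k(X_train, y_train, X_valid, y_valid, range(1, 22))

# Score the held-out test set once, with the k chosen on validation data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"best k = {best_k}, validation accuracy = {valid_acc:.3f}, "
      f"test accuracy = {test_acc:.3f}")
```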

In [15]:
weightedKnn = KNeighborsClassifier(n_neighbors=20, weights='distance')
weightedKnn.fit(X_train, y_train)
weighted_y_pred = weightedKnn.predict(X_test)
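# Score the weighted model's own predictions; y_pred still holds the last unweighted model's output.
print(f"Accuracy for weighted KNN by distance with 20 neighbors: {sklearn.metrics.accuracy_score(y_pred=weighted_y_pred, y_true=y_test)}")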

Accuracy for weighted KNN by distance with 20 neighbors: 0.11266666666666666


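The accuracy recorded above matches the unweighted result because it was computed from `y_pred`, the predictions of the last unweighted model in the loop, rather than from `weighted_y_pred`; it therefore does not measure the effect of weighting. The features are also not normalized, so next we test KNN on normalized data.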
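
To see why scale matters here: KNN's default metric is Euclidean distance, and if `TIME OCC` is an HHMM clock value it spans roughly 0 to 2359, while `LAT` and `LON` vary by fractions of a degree across the city. The time column would then dominate every distance. A small worked example, using invented points and assumed minimum and maximum bounds for each feature, shows how min-max scaling changes which point counts as the nearer neighbor.

```python
import numpy as np

# Columns: TIME OCC, LAT, LON. The points are invented for illustration.
a = np.array([2130.0, 34.0375, -118.3506])
near_in_space = np.array([1130.0, 34.0380, -118.3500])  # same block, 10 hours apart
near_in_time = np.array([2100.0, 34.2500, -118.6000])   # across the city, 30 minutes apart

def euclidean(p, q):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(p - q))

# Assumed feature bounds (hypothetical, roughly city-sized) for min-max scaling.
lo = np.array([0.0, 33.7, -118.7])
hi = np.array([2359.0, 34.4, -118.1])

def scale(p):
    """Map each feature into [0, 1] using the assumed bounds."""
    return (p - lo) / (hi - lo)

# Raw features: the time column swamps the coordinates.
raw_space = euclidean(a, near_in_space)
raw_time = euclidean(a, near_in_time)

# Scaled features: every column contributes on a comparable range.
scaled_space = euclidean(scale(a), scale(near_in_space))
scaled_time = euclidean(scale(a), scale(near_in_time))

print(f"raw:    same block = {raw_space:.2f}, across city = {raw_time:.2f}")
print(f"scaled: same block = {scaled_space:.3f}, across city = {scaled_time:.3f}")
```

With raw features the cross-city point is the nearer neighbor; after scaling, the same-block point is.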

In [16]:
from sklearn.preprocessing import MinMaxScaler

In [18]:
class NormalizedKNN:
    """Wraps a classifier so inputs are min-max scaled before fitting and predicting."""

    def __init__(self, X_train, y_train, model):
        self.X = X_train
        self.y = y_train
        self.model = model
        self.scaler = MinMaxScaler()

    def normalizeAndFit(self):
        # Learn the scaling from the training data only, then fit the model on it.
        X_norm = pd.DataFrame(self.scaler.fit_transform(self.X), columns=self.X.columns)
        self.model.fit(X_norm, self.y)

    def normalizeAndPredict(self, X):
        # Reuse the training-set scaling; normalizeAndFit must be called first.
        X_norm = pd.DataFrame(self.scaler.transform(X), columns=X.columns)
        return self.model.predict(X_norm)

    def getAccuracy(self, X_test, y_test):
        y_pred = self.normalizeAndPredict(X_test)
        return sklearn.metrics.accuracy_score(y_test, y_pred)


In [19]:
num_neighbors = 21
model = KNeighborsClassifier(n_neighbors=num_neighbors)
normalizedKNN = NormalizedKNN(X_train=X_train, y_train=y_train, model=model)

normalizedKNN.normalizeAndFit()
accuracy = normalizedKNN.getAccuracy(X_test=X_test, y_test=y_test)
print(f"Accuracy for a normalized KNN model with {num_neighbors} neighbors: {accuracy}")

Accuracy for a normalized KNN model with 21 neighbors: 0.11427777777777778


With normalization, accuracy rises slightly (from about 0.113 to about 0.114), but the model is still not very useful.
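
An accuracy near 11% is easier to judge against a baseline that always predicts the most frequent class. The sketch below, on synthetic data, compares scikit-learn's `DummyClassifier` with a `MinMaxScaler` plus `KNeighborsClassifier` pipeline, which is a standard scikit-learn alternative to a hand-written wrapper such as `NormalizedKNN`. The dataset and variable names are assumptions for illustration; none of the numbers it prints relate to the crime data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic multi-class data standing in for the crime features.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Baseline: always predict the most common class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The pipeline fits the scaler on training data only, then reuses it at predict time.
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=21))
scaled_knn.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
knn_acc = scaled_knn.score(X_test, y_test)
print(f"baseline accuracy:   {baseline_acc:.3f}")
print(f"scaled KNN accuracy: {knn_acc:.3f}")
```

On the crime data, the same comparison would show whether KNN beats simply guessing the most common crime code.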

---
## Predicting the Possibility of Crime with KNN

For this, we will use the Faker library to generate synthetic records that represent the absence of crime, so that the classifier can learn to distinguish crime from no crime.

In [37]:
from faker import Faker