# Model Training (KNN)

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import joblib

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

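Read the processed feature file and the labels file, then combine them column-wise. `pd.concat` with `axis=1` pairs rows by index, so this relies on both files listing the bookings in the same order.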

In [2]:
data = pd.read_csv("./data/final_processed.csv")
labels = pd.read_csv("./data/final_labels.csv")
combined = pd.concat([data, labels.label], axis=1)
combined.head()

| Unnamed: 0 | bookingID | Accuracy | Speed | acceleration | gyro | second | label |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 10.127617 | 22.946083 | 12.988328 | 0.749086 | 1589.0 | 0 |
| 1 | 1 | 3.706486 | 21.882141 | 12.790147 | 0.717864 | 1034.0 | 1 |
| 2 | 2 | 3.930626 | 9.360483 | 13.40341 | 0.463685 | 825.0 | 1 |
| 3 | 4 | 10.0 | 19.780001 | 21.053265 | 0.661675 | 1094.0 | 1 |
| 4 | 6 | 4.586721 | 16.394695 | 14.498268 | 0.626294 | 1094.0 | 0 |
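
The positional combine above can be checked, or replaced by a key-based join, along the following lines. This is a minimal sketch on tiny made-up frames, and it assumes the labels file also carries a `bookingID` column, which the notebook does not confirm. The names `features`, `labels_demo`, `combined_demo`, and `merged_demo` are illustrative only.

```python
import pandas as pd

# Tiny stand-in frames with made-up values (not the real dataset).
features = pd.DataFrame({"bookingID": [0, 1, 2], "Speed": [22.9, 21.9, 9.4]})
# Assumption: the labels file also has a bookingID column.
labels_demo = pd.DataFrame({"bookingID": [0, 1, 2], "label": [0, 1, 1]})

# concat(axis=1) pairs rows by index label, so it is only safe when both
# frames list the bookings in the same order.
same_order = features["bookingID"].equals(labels_demo["bookingID"])
combined_demo = pd.concat([features, labels_demo["label"]], axis=1)

# A merge on the key column does not depend on row order.
merged_demo = features.merge(labels_demo, on="bookingID", how="inner")
print(same_order, merged_demo.shape)
```

If `same_order` is false on the real files, the merge is the safer of the two options.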


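Separate the training features from the label. `bookingID` is dropped because it is an identifier; note that the leftover index column `Unnamed: 0` is not dropped, so it remains among the features.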

In [3]:
x = combined.drop(["label","bookingID"], axis=1)
y = combined[['label']]

Split the dataset into 80% for training and 20% for testing.

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state = 1)
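
The confusion matrix at the end of this notebook shows that roughly a quarter of the test rows are positive, so the classes are imbalanced. One possible variation, not what the cell above does, is a stratified split that keeps the class ratio the same in both parts. The sketch below uses synthetic stand-in data; `x_demo`, `y_demo`, and the other names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 25% positives (not the real dataset).
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo asks for the same class ratio in the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```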

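Find the best value of `n_neighbors` for the KNN classifier by trying every value from 1 to 300 and keeping the one with the highest accuracy on the test set. Because the test set is used for this selection, the metrics reported on it later tend to be optimistic.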

In [5]:
score = 0
for i in range(1, 301):
    classifier = KNeighborsClassifier(n_neighbors=i)
    classifier.fit(x_train, y_train.values.ravel())
    # Score once per value of k; score() returns mean accuracy.
    current_score = classifier.score(x_test, y_test)
    if current_score > score:
        score = current_score
        n_neighbors = i
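
An alternative to the loop above is to choose `n_neighbors` by cross-validation on the training rows only, so that the test set is scored once after the value is fixed. The following is a rough sketch using scikit-learn's `GridSearchCV` on synthetic data; the smaller search range (1 to 30), the 5 folds, and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real features come from final_processed.csv.
x_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=1)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)

# Search k with 5-fold cross-validation on the training rows only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="accuracy",
)
search.fit(x_tr, y_tr)
best_k = search.best_params_["n_neighbors"]

# The held-out rows are scored once, after k has been chosen.
test_accuracy = search.score(x_te, y_te)
print(best_k, test_accuracy)
```

By default `GridSearchCV` refits the best configuration on all training rows, so `search` can be used directly for prediction afterwards.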

Create the classifier with the best value, fit it on the training data, then save the trained model.

In [6]:
classifier = KNeighborsClassifier(n_neighbors=n_neighbors)
classifier = classifier.fit(x_train, y_train.values.ravel())
# Use a context manager so the file handle is closed after writing.
with open("./models/knn_model", "wb") as model_file:
    pickle.dump(classifier, model_file, protocol=4)
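
To use the saved model later it has to be read back with `pickle.load`. The sketch below shows the round trip on a small synthetic model written to a temporary directory rather than `./models/knn_model`; the data and the names `model`, `restored`, and `model_path` are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic model standing in for the trained classifier.
x_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
model = KNeighborsClassifier(n_neighbors=5).fit(x_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "knn_model")
    # Save with the same protocol as the notebook cell.
    with open(model_path, "wb") as f:
        pickle.dump(model, f, protocol=4)
    # Load it back; only unpickle files from a trusted source.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(x_demo) == model.predict(x_demo)).all())
print(same_predictions)
```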

Check the 10-fold cross-validated ROC AUC on the training set and the F1 score on the test set, then view the confusion matrix of the test predictions.

In [7]:
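# The classifier was already fitted in the previous cell, so predict directly.
y_pred = classifier.predict(x_test)
# "Mean Score" below is the mean ROC AUC over 10 folds of the training set.
scores = cross_val_score(classifier, x_train, y_train.values.ravel(), cv=10, scoring="roc_auc")
print("Mean Score: {}".format(scores.mean()))
print("f1_score: {}".format(f1_score(y_test, y_pred)))
# Rows of the confusion matrix are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Confusion Matrix: {} {} {} {}".format(tn, fp, fn, tp))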

Mean Score: 0.6655379791792522
f1_score: 0.21827411167512686
Confusion Matrix: 2887 59 865 129
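
The four confusion-matrix counts are enough to recover accuracy, precision, and recall, which helps explain the low F1 score. The short calculation below plugs in the counts printed above; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 2887, 59, 865, 129

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Accuracy comes out at about 0.77 and precision at about 0.69, but recall is only about 0.13: the model misses 865 of the 994 positive test rows, which is what pulls the F1 score down to about 0.22.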
