# Model Evaluation

This file records the results of several model runs and, after parameter tuning, compares their prediction performance on held-out test data.

### Data Preparation

Keystroke dynamics data from 20 users are selected and pre-processed before training. The data are first split 80/20 into training and test sets, stratified by subject. The planned pipeline then scales the features and reduces their dimensionality; note that the cells in this section apply PCA without a prior scaling step.

In [11]:
# Import all required libraries
import mlflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

#### Loading Data and Selection on Features and Subjects

In [12]:
# Load dataset and extract first 20 subjects
whole_dataset = pd.read_csv('DSL-StrongPasswordData.csv')

# groupby sorts the subject labels, so this picks the first 20 in sorted order
first_20subject = whole_dataset.groupby('subject').count().index[:20]
selected_dataset = whole_dataset[whole_dataset['subject'].isin(first_20subject)]

# The DD and UD timings of each key have been shown to be highly correlated with each other
# in a separate analysis. Hence, we drop all features starting with 'DD'.
all_features = selected_dataset.columns[3:34]
selected_features = [x for x in all_features if not x.startswith('DD')]

# Get a copy of dataset with selected feature columns
df = selected_dataset[selected_features].copy()
# df['subject'] = selected_dataset['subject']

# Show samples
df.head()

# Get feature columns
#feature_data = selected_dataset[selected_features]

| | H.period | UD.period.t | H.t | UD.t.i | H.i | UD.i.e | H.e | UD.e.five | H.five | UD.five.Shift.r | ... | UD.Shift.r.o | H.o | UD.o.a | H.a | UD.a.n | H.n | UD.n.l | H.l | UD.l.Return | H.Return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1491 | 0.2488 | 0.1069 | 0.0605 | 0.1169 | 0.1043 | 0.1417 | 1.0468 | 0.1146 | 1.4909 | ... | 0.6523 | 0.1016 | 0.112 | 0.1349 | 0.0135 | 0.0932 | 0.2583 | 0.1338 | 0.2171 | 0.0742 |
| 1 | 0.1111 | 0.234 | 0.0694 | 0.0589 | 0.0908 | 0.0449 | 0.0829 | 1.1141 | 0.0689 | 0.7133 | ... | 0.6307 | 0.1066 | 0.0618 | 0.1412 | 0.1146 | 0.1146 | 0.1496 | 0.0839 | 0.1917 | 0.0747 |
| 2 | 0.1328 | 0.0744 | 0.0731 | 0.056 | 0.0821 | 0.0721 | 0.0808 | 0.96 | 0.0892 | 0.5311 | ... | 0.5741 | 0.1365 | 0.1566 | 0.1621 | 0.0711 | 0.1172 | 0.1533 | 0.1085 | 0.1762 | 0.0945 |
| 3 | 0.1291 | 0.1224 | 0.1059 | 0.1436 | 0.104 | 0.0998 | 0.09 | 0.9656 | 0.0913 | 1.1651 | ... | 0.6096 | 0.0956 | 0.0574 | 0.1457 | 0.0172 | 0.0866 | 0.1475 | 0.0845 | 0.2387 | 0.0813 |
| 4 | 0.1249 | 0.1068 | 0.0895 | 0.0781 | 0.0903 | 0.0686 | 0.0805 | 0.7824 | 0.0742 | 0.8213 | ... | 0.6389 | 0.043 | 0.1545 | 0.1312 | 0.027 | 0.0884 | 0.1633 | 0.0903 | 0.1614 | 0.0818 |
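
The column filter used in this cell can be checked without the CSV file. The sketch below rebuilds the timing column names from the key sequence visible in the table header and applies the same `startswith('DD')` filter. The `keys` list, the helper `timing_columns`, and the `DD.*` and `H.Shift.r` names (not visible in the printed header) are assumptions made for illustration, not values read from the dataset.

```python
# Key sequence as it appears in the column names of the sample table (assumed complete).
keys = ["period", "t", "i", "e", "five", "Shift.r", "o", "a", "n", "l", "Return"]


def timing_columns(keys):
    """Build hold (H), down-down (DD) and up-down (UD) column names.

    Hypothetical helper: the column order is an assumption that matches
    the H/UD order shown in the sample table.
    """
    columns = []
    for current, following in zip(keys, keys[1:]):
        columns.append(f"H.{current}")
        columns.append(f"DD.{current}.{following}")
        columns.append(f"UD.{current}.{following}")
    columns.append(f"H.{keys[-1]}")  # the last key has a hold time only
    return columns


all_features = timing_columns(keys)
# Same filter as in the notebook: drop every DD timing.
selected_features = [name for name in all_features if not name.startswith("DD")]

print(len(all_features), len(selected_features))
```

Under these assumptions the filter keeps 21 of 31 columns, which agrees with the 21 feature columns reported for `X_train`.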


#### Perform training and test data splitting

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    df, selected_dataset['subject'],
    test_size=0.20, random_state=42, stratify=selected_dataset['subject'])
X_train.shape

(6400, 21)
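
To illustrate what `stratify` does in the split, here is a minimal sketch on stand-in labels. It assumes 20 subjects with 400 rows each, which is consistent with the 6400 training rows reported here but is not confirmed by this section; the label names and the `rows` list are hypothetical.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels: 20 subjects, 400 rows each (assumed balance).
subjects = [f"subject_{idx:02d}" for idx in range(20) for _ in range(400)]
rows = list(range(len(subjects)))  # row ids standing in for the feature rows

train_rows, test_rows, y_train_demo, y_test_demo = train_test_split(
    rows, subjects, test_size=0.20, random_state=42, stratify=subjects
)

# With stratification every subject contributes the same share to each split.
train_counts = Counter(y_train_demo)
test_counts = Counter(y_test_demo)

print(len(train_rows), len(test_rows))
print(set(train_counts.values()), set(test_counts.values()))
```

Without `stratify`, the per-subject counts in the test set would vary from subject to subject, which makes per-subject accuracy harder to compare.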

#### PCA reduction

In [23]:
# Reduce dimensionality with PCA, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)

print(pca.n_components_)
print(pca.explained_variance_ratio_)

9
[0.47922889 0.12483348 0.10319736 0.06831754 0.05296734 0.04189186
 0.03423702 0.02602116 0.02376261]
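
The choice of 9 components follows from the printed ratios: a fractional `n_components` makes PCA keep the smallest number of components whose cumulative explained variance exceeds that fraction. The short check below reuses the values printed above; the variable names are illustrative only.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above.
ratios = [0.47922889, 0.12483348, 0.10319736, 0.06831754, 0.05296734,
          0.04189186, 0.03423702, 0.02602116, 0.02376261]

# Running total of explained variance after 1, 2, ... components.
cumulative = list(accumulate(ratios))
for n, total in enumerate(cumulative, start=1):
    print(f"{n} components: {total:.4f}")

# First component count whose cumulative variance exceeds the 95% target.
n_needed = next(n for n, total in enumerate(cumulative, start=1) if total > 0.95)
print("components needed:", n_needed)
```

Eight components explain about 93% of the variance, so the ninth is the one that crosses the 95% threshold.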


In [24]:
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(cv=10, random_state=42, max_iter=1000).fit(X_train_reduced, y_train)

X_test_reduced = pca.transform(X_test)
print("Accuracy on training set: {:.3f}".format(clf.score(X_train_reduced, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test_reduced, y_test)))

Accuracy on training set: 0.707
Accuracy on test set: 0.704


In [22]:
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(cv=10, random_state=42, max_iter=1000).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(clf.score(X_test, y_test)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy on training set: 0.907
Accuracy on test set: 0.902
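
The convergence warning in this cell recommends scaling the data. One possible way to do that, sketched here on synthetic data, is to chain `StandardScaler` and `LogisticRegressionCV` in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The synthetic features, the three classes, and `cv=3` are assumptions chosen to keep the example small; this sketch does not reproduce the accuracies reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 classes, 5 features on very different scales.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 100)
centers = rng.normal(size=(3, 5)) * 3.0
features = centers[labels] + rng.normal(size=(300, 5))
features = features * np.array([1.0, 10.0, 0.1, 100.0, 0.01])

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.20, random_state=42, stratify=labels
)

# The pipeline fits the scaler on the training data and reuses it at predict time.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(cv=3, random_state=42, max_iter=1000),
)
pipeline.fit(X_tr, y_tr)

train_acc = pipeline.score(X_tr, y_tr)
test_acc = pipeline.score(X_te, y_te)
print("Accuracy on training set: {:.3f}".format(train_acc))
print("Accuracy on test set: {:.3f}".format(test_acc))
```

Applied to the notebook, the same pattern would mean fitting the pipeline on `X_train` and scoring it on `X_test`; whether it removes the warning or changes the accuracy there would need to be checked by running it.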
