# Utility Analysis on Machine Learning Performance

This notebook presents the utility analysis of the proposed fingerprinting method, measured as the difference in performance between an ML algorithm trained on the original dataset and the same algorithm trained on its fingerprinted copy.
Since fingerprinting methods necessarily change values in the dataset, our aim as the designers of the method is to keep the quality of the data as close to the original as possible.

In our experiments we first train the ML models on the original data and test the performance of the final model on a holdout test set. We then train a model with the same hyperparameters on the fingerprinted dataset and evaluate it on the same holdout set. This mimics the real scenario in which the recipient of the data has only the fingerprinted data to train their model and uses that model to make predictions on new, previously unseen data.

We will test the utility of the models trained on fingerprinted data using different types of classifiers, namely:
* Decision Tree
* Logistic Regression
* Gradient Boosting

Our experiments will have the following workflow:
* choose a random holdout test set (20%)
* train the classifier on the remaining data and cross-validate
* record the performance on the test set
* fingerprint the remaining data
* train the classifier on the fingerprinted data and cross-validate
* record the performance on the (non-fingerprinted) test set

This is to be repeated a number of times (e.g. 1000).
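The workflow above can be sketched as a loop. This is a minimal sketch under stated assumptions, not the toolbox implementation: the data is synthetic, `perturb` is a hypothetical stand-in for the fingerprinting step (it simply flips a small fraction of binary values), `run_experiment` is an illustrative name, and the hyperparameters are fixed instead of being cross-validated in every run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def perturb(X, fraction, rng):
    """Hypothetical stand-in for fingerprinting: flip a fraction of binary values."""
    X_marked = X.copy()
    mask = rng.random(X.shape) < fraction
    X_marked[mask] = 1 - X_marked[mask]
    return X_marked


def run_experiment(X, y, n_runs=10, fraction=0.05, seed=0):
    """Return an array of (accuracy_original, accuracy_fingerprinted) per run."""
    rng = np.random.default_rng(seed)
    results = []
    for run in range(n_runs):
        # choose a random 20% holdout test set
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=run
        )
        # train on the original training data, score on the holdout
        model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
        acc_original = model.fit(X_train, y_train).score(X_test, y_test)
        # mark only the training data; the holdout stays untouched
        X_train_fp = perturb(X_train, fraction, rng)
        model_fp = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
        acc_fp = model_fp.fit(X_train_fp, y_train).score(X_test, y_test)
        results.append((acc_original, acc_fp))
    return np.array(results)


# synthetic binary features, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X = (X > 0).astype(int)
results = run_experiment(X, y)
print("mean accuracy (original, fingerprinted):", results.mean(axis=0))
```

Each row of `results` pairs the two accuracies obtained on the same holdout split, so the per-run differences can be compared directly.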

In [1]:
# need this for correct imports
import sys, os
if 'C:/Users/tsarcevic/PycharmProjects/fingerprinting-toolbox/' not in sys.path:
    sys.path.append('C:/Users/tsarcevic/PycharmProjects/fingerprinting-toolbox/')
os.chdir('../')

## Decision Tree on German Credit data (1000 entries)

We first present one run of our experiments.

In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

from utils import *

In [3]:
# set aside a random portion of the data (20%) and keep it untouched until the test phase
dataset_name = "german_credit_full"
german_credit, primary_key = import_dataset(dataset_name)
# take a random holdout
holdout = german_credit.sample(frac=0.2, random_state=0)
german_credit = german_credit.drop(holdout.index, axis=0)

# one-hot encode the categorical values; label-encode them first because this OneHotEncoder cannot handle string values
label_enc = LabelEncoder()
categorical_attributes = german_credit.select_dtypes(include='object').columns
for cat in categorical_attributes:
    german_credit[cat] = label_enc.fit_transform(german_credit[cat])
    # reuse the mapping fitted on the training data so the holdout codes stay consistent
    holdout[cat] = label_enc.transform(holdout[cat])
c = len(german_credit.columns)
data = german_credit.values[:, 1:(c-1)]
target = german_credit.values[:, (c-1)]

categorical_features_idx = [i-1 for i in range(len(german_credit.columns)) if german_credit.columns[i] in categorical_attributes]
encoder = OneHotEncoder(categorical_features=categorical_features_idx)
data = encoder.fit_transform(data).toarray().astype(int)

Dataset: datasets/german_credit_full.csv
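The encoding cell above relies on the `categorical_features` argument of `OneHotEncoder`, which recent scikit-learn releases no longer accept. A possible equivalent, sketched here on made-up toy data (the column names and values are illustrative, not taken from the German Credit dataset), uses `ColumnTransformer` and fits the encoder on the training part only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data for illustration only; column names and values are made up.
train = pd.DataFrame(
    {"purpose": ["car", "tv", "car", "education"], "duration": [12, 24, 6, 36]}
)
holdout = pd.DataFrame({"purpose": ["tv", "business"], "duration": [18, 48]})

categorical = ["purpose"]  # columns to one-hot encode

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # keep numerical columns unchanged
    sparse_threshold=0,       # always return a dense array
)

# fit on the training data only, then apply the same mapping to the holdout
X_train = encoder.fit_transform(train)
X_holdout = encoder.transform(holdout)
print(X_train.shape, X_holdout.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the holdout (here `"business"`) is encoded as all zeros instead of raising an error, so both matrices keep the same columns.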


We define our model and the candidate hyperparameters for the hyperparameter search. With a randomized search over these candidates we determine which hyperparameters yield the model with the best performance on the given data. We measure the performance (accuracy, F1) via cross-validation.
Finding the best possible model is not the focus of our analysis; our goal is simply to see the difference in performance when the exact same model is trained with fingerprinted data, so a "good-enough" model will do for this purpose.

In [4]:
model = DecisionTreeClassifier(random_state=0)

criterion_range = ["gini", "entropy"]
max_depth_range = range(1, 30)

First we will search for the best hyperparameters based on F1 score:

In [5]:
param_dist = dict(criterion=criterion_range, max_depth=max_depth_range)
rand = RandomizedSearchCV(model, param_dist, cv=10, n_iter=20, scoring="f1", random_state=0)
rand.fit(data, target)
print("The best F1 score achieved is: " + str(rand.best_score_) + " with hyperparameters: " + str(rand.best_params_))

The best F1 score achieved is: 0.8317228925455917 with hyperparameters: {'max_depth': 3, 'criterion': 'entropy'}


In [6]:
rand = RandomizedSearchCV(model, param_dist, cv=10, n_iter=20, scoring="accuracy", random_state=0)
rand.fit(data, target)
print("The best accuracy achieved is: " + str(rand.best_score_) + " with hyperparameters: " + str(rand.best_params_))

The best accuracy achieved is: 0.72375 with hyperparameters: {'max_depth': 3, 'criterion': 'entropy'}


Both performance measures achieved the peak on the setting (max_depth=3, criterion=entropy) so we will use these hyperparameters for our Decision Tree model.
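The two searches above could also be run in a single pass, since `RandomizedSearchCV` accepts several metrics in `scoring` together with a `refit` metric that selects the best candidate. The sketch below assumes synthetic data and a smaller search budget than the notebook, purely to keep it quick to run:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_dist = dict(criterion=["gini", "entropy"], max_depth=list(range(1, 30)))
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_dist,
    cv=5,
    n_iter=10,
    scoring=["f1", "accuracy"],  # evaluate both metrics for every candidate
    refit="f1",                  # pick the best candidate by F1
    random_state=0,
)
search.fit(X, y)

cv_results = search.cv_results_
best = search.best_index_
print("best hyperparameters:", search.best_params_)
print("F1:", cv_results["mean_test_f1"][best])
print("accuracy:", cv_results["mean_test_accuracy"][best])
```

Both metrics are then computed on identical candidates and folds, which makes it easy to check whether they agree on the best setting.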

In [7]:
# train the model
model = DecisionTreeClassifier(random_state=0, criterion='entropy', max_depth=3)

In [8]:
model.fit(data, target)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [9]:
X_test = holdout.values[:, 1:(c-1)]
X_test = encoder.transform(X_test).toarray().astype(int)  # reuse the encoder fitted on the training data
y_test = holdout.values[:, (c-1)]

model.score(X_test, y_test)

0.73

This is our accuracy for the model trained with the original data.
Now we use the same hyperparameters to build the Decision Tree model and train it on fingerprinted data. Let us choose the dataset that is fingerprinted with gamma=7 and xi=2. We will use the same subset for training as in the first case. 

#### Fingerprinted data

In [10]:
german_credit_fp = read_data_with_target("german_credit", "categorical_neighbourhood", [7, 2], 0)
# remove the holdout
german_credit_fp = german_credit_fp.drop(holdout.index, axis=0)

In [11]:
# preprocess
# one-hot encode the categorical values; label-encode them first because this OneHotEncoder cannot handle string values
label_enc = LabelEncoder()
for cat in categorical_attributes:
    german_credit_fp[cat] = label_enc.fit_transform(german_credit_fp[cat])
data_fp = german_credit_fp.values[:, 1:(c-1)]
target_fp = german_credit_fp.values[:, (c-1)]

encoder = OneHotEncoder(categorical_features=categorical_features_idx)
data_fp = encoder.fit_transform(data_fp).toarray().astype(int)

Now we define our model with the same hyperparameters as above and train it on fingerprinted data.

In [12]:
model = DecisionTreeClassifier(random_state=0, criterion='entropy', max_depth=3)

In [13]:
model.fit(data_fp, target_fp)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

After we trained the model, let us see the performance on the holdout set (which is not fingerprinted).

In [14]:
model.score(X_test, y_test)

0.715

The accuracy obtained is 0.715, which is lower than that of the model trained on the original data (0.73).

This experiment should be performed a number of times to be able to generalize the conclusions. 

## Logistic Regression on German Credit data (1000 entries)

Now we do essentially the same as described previously, but this time we test the performance of Logistic Regression. The workflow stays the same. We will reuse the objects defined above for faster execution.

#### Original data

In [15]:
from sklearn.linear_model import LogisticRegression

In [21]:
import warnings
warnings.filterwarnings("ignore")
model = LogisticRegression(random_state=0, max_iter=200)

solver_range = ["liblinear", "newton-cg", "lbfgs", "saga"]
C_range = range(10, 101, 10)
param_dist = dict(C=C_range, solver=solver_range)
rand = RandomizedSearchCV(model, param_dist, cv=10, n_iter=10, scoring="f1", random_state=0)

In [22]:
rand.fit(data, target)
print("The best F1 score achieved is: " + str(rand.best_score_) + " with hyperparameters: " + str(rand.best_params_))

The best F1 score achieved is: 0.8282794656372932 with hyperparameters: {'solver': 'liblinear', 'C': 60}


In [23]:
rand = RandomizedSearchCV(model, param_dist, cv=10, n_iter=10, scoring="accuracy", random_state=0)
rand.fit(data, target)
print("The best accuracy achieved is: " + str(rand.best_score_) + " with hyperparameters: " + str(rand.best_params_))

The best accuracy achieved is: 0.75 with hyperparameters: {'solver': 'liblinear', 'C': 60}


We choose the model with hyperparameters (solver=liblinear and C=60).

In [25]:
# train the model
model = LogisticRegression(random_state=0, max_iter=200, solver="liblinear", C=60)
model.fit(data, target)

LogisticRegression(C=60, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [26]:
model.score(X_test, y_test)

0.74

#### Fingerprinted data

In [27]:
model = LogisticRegression(random_state=0, max_iter=200, solver="liblinear", C=60)
model.fit(data_fp, target_fp)

LogisticRegression(C=60, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
model.score(X_test, y_test)

0.755

The model built on fingerprinted data seems to be better than the model built on the original data, which may be pure coincidence. For reliable results, this experiment needs to be repeated multiple times.
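Once the experiment has been repeated, the paired accuracies can be summarized to judge whether a difference like the one above is more than coincidence. The following is one possible sketch using a bootstrap confidence interval for the mean difference. The function name `summarize_paired` is illustrative, and the accuracy values are placeholders made up for the example, not results from this notebook.

```python
import numpy as np


def summarize_paired(acc_original, acc_fingerprinted, n_boot=2000, seed=0):
    """Mean paired difference (fingerprinted - original) with a 95% bootstrap interval."""
    diff = np.asarray(acc_fingerprinted) - np.asarray(acc_original)
    rng = np.random.default_rng(seed)
    # resample the per-run differences with replacement
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [2.5, 97.5])
    return diff.mean(), (low, high)


# Placeholder accuracies, made up for illustration only.
acc_original = [0.72, 0.70, 0.74, 0.71, 0.75, 0.69]
acc_fingerprinted = [0.705, 0.71, 0.73, 0.70, 0.755, 0.68]

mean_diff, (low, high) = summarize_paired(acc_original, acc_fingerprinted)
print(f"mean difference: {mean_diff:.4f}, 95% interval: [{low:.4f}, {high:.4f}]")
```

If the interval comfortably includes zero, the observed difference between the two models is consistent with run-to-run variation.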