# Utility Analysis on Machine Learning Performance (Breast Cancer)

This notebook presents the utility analysis of the proposed fingerprinting method, measured as the difference in performance between an ML algorithm trained on the original dataset and one trained on its fingerprinted copy. 
Since fingerprinting methods necessarily change values in the dataset, our aim as designers of the method is to keep the quality of the data as close to the original as possible. 

In our experiments we first train the ML models on the original data and measure the performance of the final model on a holdout test set. We then train a model with the same hyperparameters on the fingerprinted dataset and evaluate it on the same, non-fingerprinted holdout set. This mimics the real scenario in which the recipient of the data has only the fingerprinted data to train their model and uses that model to predict on new, previously unseen data. 

We will test the utility of our fingerprinted data using different types of classifiers, namely:
* Decision Tree
* Logistic Regression
* Gradient Boosting
* Random Forest

Our experiments will have the following workflow:
* choose a random holdout test set (20%)
* train the classifier on the remaining data and cross-validate
* record the performance on the test set
* fingerprint the remaining data
* train the classifier on the fingerprinted data and cross-validate
* record the performance on the (non-fingerprinted) test set

This is to be repeated a number of times (e.g. 1000).
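
The workflow above can be sketched as a loop. This is a minimal illustration under stated assumptions, not the notebook's actual experiment code: the data is synthetic, `perturb` is a hypothetical stand-in for the fingerprint insertion (it simply flips a small fraction of binary cells), `run_experiment` is a hypothetical helper, and the cross-validation steps are left out for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def perturb(X, rng, fraction=0.05):
    """Hypothetical stand-in for fingerprint insertion: flips a fraction of binary cells."""
    X_marked = X.copy()
    mask = rng.random(X.shape) < fraction
    X_marked[mask] = 1 - X_marked[mask]
    return X_marked


def run_experiment(X, y, n_runs=20, base_seed=25):
    """Return per-run holdout accuracies for models trained on original and marked data."""
    acc_original, acc_marked = [], []
    for run in range(n_runs):
        seed = base_seed + run  # vary the split and the marking every run
        rng = np.random.default_rng(seed)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)

        # Model trained on the original training part
        model = DecisionTreeClassifier(max_depth=2, random_state=seed)
        acc_original.append(model.fit(X_train, y_train).score(X_test, y_test))

        # Only the training part is marked; the holdout stays untouched
        X_train_marked = perturb(X_train, rng)
        model_marked = DecisionTreeClassifier(max_depth=2, random_state=seed)
        acc_marked.append(
            model_marked.fit(X_train_marked, y_train).score(X_test, y_test))
    return np.array(acc_original), np.array(acc_marked)


# Synthetic binary features, used only to make the sketch runnable
X_cont, y = make_classification(n_samples=286, n_features=10, random_state=0)
X = (X_cont > 0).astype(int)

acc_original, acc_marked = run_experiment(X, y)
print("mean accuracy (original):", acc_original.mean())
print("mean accuracy (marked):  ", acc_marked.mean())
print("mean difference:         ", (acc_original - acc_marked).mean())
```

Collecting the per-run accuracies in arrays, rather than only their means, keeps the option of looking at the spread of the differences across runs.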

In [1]:
# need this for correct imports
import sys, os
if 'C:/Users/tsarcevic/PycharmProjects/fingerprinting-toolbox/' not in sys.path:
    sys.path.append('C:/Users/tsarcevic/PycharmProjects/fingerprinting-toolbox/')
os.chdir('../')

## Decision Tree on Breast Cancer data (286 entries)

We first present one run of our experiments.

In [2]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from schemes.categorical_neighbourhood.categorical_neighbourhood import CategoricalNeighbourhood

from utils import *

In [3]:
# 0) import data
breast_cancer, primary_key = import_dataset("breast_cancer_full")
breast_cancer = breast_cancer.drop('Id', axis=1)

Dataset: datasets/breast_cancer_full.csv


In [4]:
# 1) calculate benchmark hyperparameters that will be used in all experiments with Decision Tree

# 1.1) get dummies from the categorical data
c = len(breast_cancer.columns)
target = breast_cancer.values[:, (c-1)]
breast_cancer = breast_cancer.drop("recurrence", axis=1)
breast_cancer = pd.get_dummies(breast_cancer)
data = breast_cancer.values

# 1.2) define the model and possible hyperparameters
random_state = 25 # increase every run
model = DecisionTreeClassifier(random_state=random_state)
criterion_range = ["gini", "entropy"]
max_depth_range = range(1, 30)

# 1.3) perform the random hyperparameter search
param_dist = dict(criterion=criterion_range, max_depth=max_depth_range)
rand_search = RandomizedSearchCV(model, param_dist, cv=10, n_iter=58, scoring="accuracy", random_state=random_state)
rand_search.fit(data, target)
best_params = rand_search.best_params_
print("The best accuracy achieved is: " + str(rand_search.best_score_) + " with hyperparameters: " + str(best_params))

The best accuracy achieved is: 0.7482517482517482 with hyperparameters: {'max_depth': 2, 'criterion': 'gini'}
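
A side note on the search above: `n_iter=58` matches the number of distinct hyperparameter combinations (2 criteria times 29 depths). As far as we understand the scikit-learn documentation, `RandomizedSearchCV` samples without replacement when all parameters are given as lists, so this search would effectively cover the whole grid. The small check below only counts the combinations; it does not rerun the search.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the cell above
param_dist = {"criterion": ["gini", "entropy"], "max_depth": list(range(1, 30))}

n_combinations = len(ParameterGrid(param_dist))
print(n_combinations)  # 2 criteria x 29 depths
```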


In [5]:
# 2) split data into main data and a stratified test holdout
X_data, X_holdout, y_data, y_holdout = train_test_split(data, target, test_size=0.2, stratify=target, random_state=random_state)
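
The fingerprinted data is later split with the same `target`, `test_size` and `random_state`, and the model trained on it is scored on the holdout from this cell. That relies on both splits selecting the same rows. The sketch below illustrates this on toy data: the feature matrices and labels are made up, and only the number of rows (286) is taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 286
target = np.array(["no"] * 201 + ["yes"] * 85)  # toy labels, an assumption
row_ids = np.arange(n_rows)

# Two different feature matrices with the same number of rows
data_a = rng.integers(0, 2, size=(n_rows, 5))
data_b = rng.integers(0, 2, size=(n_rows, 5))

# Each call returns [data_train, data_test, ids_train, ids_test]
split_a = train_test_split(data_a, row_ids, test_size=0.2,
                           stratify=target, random_state=25)
split_b = train_test_split(data_b, row_ids, test_size=0.2,
                           stratify=target, random_state=25)

ids_holdout_a, ids_holdout_b = split_a[3], split_b[3]
same_rows = np.array_equal(ids_holdout_a, ids_holdout_b)
print(same_rows, len(ids_holdout_a))
```

The split depends on the number of rows, the stratification labels and the seed, not on the feature values, so the row correspondence holds only as long as fingerprinting neither reorders nor drops rows.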

In [6]:
# 3) train a model on original data
model = DecisionTreeClassifier(max_depth=best_params['max_depth'], criterion=best_params['criterion'], random_state=random_state)
model.fit(X_data, y_data)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=25,
            splitter='best')

In [7]:
# 4) evaluate the model trained on original data on the holdout
model.score(X_holdout, y_holdout)

0.6551724137931034

In [8]:
# 5) fingerprint the data
# 5.1) define a scheme 
scheme = CategoricalNeighbourhood(gamma=1, xi=2, fingerprint_bit_length=8)
# 5.2) fingerprint the data
secret_key = 2322  # increase every run
fp_dataset = scheme.insertion(dataset_name="breast_cancer", buyer_id=1, secret_key=secret_key)
fp_dataset = fp_dataset.drop("Id", axis=1)
print(type(fp_dataset))

Start the insertion algorithm of a scheme for fingerprinting categorical data (neighbourhood) ...
	gamma: 1
	xi: 2
Dataset: datasets/breast_cancer.csv

Generated fingerprint for buyer 1: 00011010
Inserting the fingerprint...

Training balltrees in: 0.04 sec.
Fingerprint inserted.
Time: 0 sec.
<class 'pandas.core.frame.DataFrame'>




In [9]:
# 6) train the model with fingerprinted data

# 6.1) preprocess the fingerprinted data (get dummies)
fp_dataset = pd.get_dummies(fp_dataset)
data_fp = fp_dataset.values

# 6.2) split off the holdout, reusing the split parameters from step 2
X_data_fp, X_holdout_fp, y_data_fp, y_holdout_fp = train_test_split(data_fp, target, test_size=0.2, stratify=target, random_state=random_state)

# 6.3) train the model (y_data_fp equals y_data: same target, stratification and random_state)
model2 = DecisionTreeClassifier(max_depth=best_params['max_depth'], criterion=best_params['criterion'], random_state=random_state)
model2.fit(X_data_fp, y_data_fp)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=25,
            splitter='best')
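
One thing to watch when `pd.get_dummies` is applied separately to the original and the fingerprinted data: if fingerprinting happens to remove the last occurrence of a category value (or introduce a new one), the two encoded matrices end up with different columns, and a model trained on one cannot be scored on the other. A possible safeguard is to reindex the fingerprinted dummies to the original columns. The sketch below uses two toy frames; the values are made up for illustration.

```python
import pandas as pd

# Toy data: the fingerprinted copy no longer contains the value "lt40"
original = pd.DataFrame({"menopause": ["lt40", "ge40", "premeno"],
                         "node-caps": ["yes", "no", "no"]})
fingerprinted = pd.DataFrame({"menopause": ["ge40", "ge40", "premeno"],
                              "node-caps": ["yes", "no", "no"]})

orig_dummies = pd.get_dummies(original)
fp_dummies = pd.get_dummies(fingerprinted)

# Align to the original columns: missing dummies are added as all-zero
# columns, and dummies unknown to the original encoding are dropped
fp_aligned = fp_dummies.reindex(columns=orig_dummies.columns, fill_value=0)

print(list(fp_dummies.columns))
print(list(fp_aligned.columns))
```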

In [10]:
# 7) evaluate on the (non-fingerprinted) holdout
model2.score(X_holdout, y_holdout)

0.6896551724137931
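
To put the two holdout scores into perspective, it helps to translate them into counts. Assuming all 286 entries are used and that `train_test_split` rounds the test size up, the holdout contains 58 rows, so a single run can only change accuracy in steps of 1/58. The arithmetic below uses only the two scores reported above.

```python
import math

n_total = 286
n_holdout = math.ceil(0.2 * n_total)  # size of the 20% holdout

acc_original = 0.6551724137931034     # score from step 4
acc_fingerprinted = 0.6896551724137931  # score from step 7

correct_original = round(acc_original * n_holdout)
correct_fingerprinted = round(acc_fingerprinted * n_holdout)

print(n_holdout, correct_original, correct_fingerprinted)
print("difference in correctly classified rows:",
      correct_fingerprinted - correct_original)
```

Under these assumptions the gap between the two models amounts to two holdout rows, which is why a single run says little and the experiment is repeated many times.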