# Default Performance
The purpose of this notebook is to record the default fitness scores of the algorithm 
space used by [Dhahri et al. 2019](https://www.hindawi.com/journals/jhe/2019/4253641/) in analyzing the [Breast Cancer Wisconsin (Diagnostic) Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)).
The tables generated here used see-classify with hash `76a745b0ce0b6ab08052fb52f4392c6b54b8ca85`.

In this notebook we examine two different measures of accuracy. The first is the accuracy score (ACC), defined as $\text{ACC} = \frac{\text{# of correct labels}}{\text{total # of items}}$. This was used in our fitness function when experimenting with GA, where $\text{fitness} = 1 - \text{ACC}$. The second is the one used by Dhahri et al. for their fitness function: the average score from 10-Fold Cross-Validation, where the score for each fold is also $\frac{\text{# of correct labels}}{\text{total # of items}}$.
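
As a minimal sketch of the two definitions above, the snippet below computes ACC and the corresponding fitness in plain Python. The helper names `accuracy` and `fitness` and the label lists are hypothetical and for illustration only; the notebook itself relies on `ClassifierFitness` from the see library.

```python
def accuracy(predicted, actual):
    """Fraction of labels predicted correctly (ACC)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def fitness(predicted, actual):
    """Fitness as defined above: 1 - ACC, so lower is better."""
    return 1 - accuracy(predicted, actual)


# Hypothetical labels, not taken from the dataset.
actual = [0, 1, 1, 0, 1, 0, 1, 1]
predicted = [0, 1, 0, 0, 1, 0, 1, 0]

acc = accuracy(predicted, actual)  # 6 of the 8 labels match
fit = fitness(predicted, actual)
```

Because fitness is simply the complement of accuracy, a perfect classifier has a fitness of 0 and every table below can be read as an error rate.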

In the first section, we determine the default performance of each classifier using just the accuracy score (No CV). This section corresponds to the experiment that we ran and can be used to compare the fitness values of the best tuned solutions found by our experiment with GA.

In the second section, we examine the default performances using the average 10-Fold CV Score. This allows us to compare the default scores of the algorithm implementations that we are using to those used by Dhahri 2019 as shown in their [Table 2 Row 2](https://www.hindawi.com/journals/jhe/2019/4253641/tab2/)\*. Additionally, as in the first section, this allows us to compare our results after GA with the same fitness function to the default scores.

\* Although the table's title is F1 Measurements, we believe that the metric they averaged (Row 2) is accuracy as we have stated here rather than the F1 metric because they reference this row in the context of accuracy: "According to Table 2, the AdaBoost classifier seemed to exhibit the best **accuracy** of 98.24%."

In [5]:
# Path hack so that we can import see library.
import sys, os
sys.path.insert(0, os.path.abspath('..'))

The idea behind this approach is to split the dataset once into a training, testing, and validation set (60-20-20). Then we record the default performance of the classifiers on each of the data subsets in a table.
Performance is measured as the fitness value (1 - accuracy) of the classifier predicting on each of the subsets
after training on the training set. The fitness value of a classifier training and predicting on the training
set is then its training fitness.

The accuracy metric we use is $\frac{\text{# of correct labels}}{\text{total # of items}}$
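
The 60-20-20 split comes from two successive cuts: 20% of the full dataset is held out first, then 25% of the remaining 80% (which is again 20% of the original). The sketch below illustrates that arithmetic with a hypothetical `split` helper on 100 row indices; it is not the see library's `generate_train_test_set`, whose internals are not shown here.

```python
import random


def split(items, test_size, rng):
    """Shuffle a copy of items and return (train, test), where test holds a test_size fraction."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
indices = range(100)    # hypothetical dataset of 100 row indices

# Stage 1: hold out 20% of everything as the validation set.
remainder, validation = split(indices, 0.2, rng)
# Stage 2: hold out 25% of the remaining 80% as the testing set (0.25 * 0.8 = 0.2).
training, testing = split(remainder, 0.25, rng)

sizes = (len(training), len(testing), len(validation))
```

With 100 items the three subsets contain 60, 20, and 20 items and do not overlap.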

Regenerate the training, testing, and **validation** sets.

In [6]:
from sklearn.preprocessing import StandardScaler
from see.classifiers import Classifier
from see.classifier_fitness import ClassifierFitness
from see.classifier_helpers.helpers import generate_train_test_set
from see.classifier_helpers.fetch_data import fetch_wisconsin_data

X, y = fetch_wisconsin_data()

# Data Preprocessor
standard_scaler = StandardScaler()

X = standard_scaler.fit_transform(X)

temp = generate_train_test_set(X, y, test_size=0.2)
validation_set = temp.testing_set
temp = generate_train_test_set(temp.training_set.X, temp.training_set.y, test_size=0.25)
training_set = temp.training_set
testing_set = temp.testing_set

In [3]:
import numpy as np
from see.base_classes import pipedata
# Default performance on the training, testing, and validation splits
Classifier.use_dhahri_space()
algorithm_space = Classifier.algorithmspace
scores = np.zeros(len(algorithm_space))

for i, name in enumerate(algorithm_space):
    clf = algorithm_space[name]()
    predictions = clf.evaluate(training_set, training_set)
    score = ClassifierFitness().evaluate(predictions, training_set.y)
    scores[i] = score

testing_scores = np.zeros(len(algorithm_space))
for i, name in enumerate(algorithm_space):
    clf = algorithm_space[name]()
    predictions = clf.evaluate(training_set, testing_set)
    score = ClassifierFitness().evaluate(predictions, testing_set.y)
    testing_scores[i] = score
    
validation_scores = np.zeros(len(algorithm_space))
for i, name in enumerate(algorithm_space):
    clf = algorithm_space[name]()
    predictions = clf.evaluate(training_set, validation_set)
    score = ClassifierFitness().evaluate(predictions, validation_set.y)
    validation_scores[i] = score

In [4]:
# Default performance on our cuts of the training, testing, and validation sets
# when trained on the training set. The performance corresponds to 
# training fitness, testing fitness, and validation fitness.
import pandas as pd
df = pd.DataFrame([scores, testing_scores, validation_scores], index=['training set', 'testing set', 'validation set'])
df.columns = list(algorithm_space.keys())
df.style.set_caption('Scores on Default Parameters and trained with training set')

Unnamed: 0,Ada Boost,Decision Tree,Extra Trees,Gaussian Naive Bayes,Gradient Boosting,Linear Discriminant Analysis,Logistic Regression,K Nearest Neighbors,Random Forest,SVC
training set,0.0,0.0,0.0,0.073314,0.0,0.029326,0.008798,0.029326,0.0,0.008798
testing set,0.035088,0.070175,0.026316,0.078947,0.04386,0.035088,0.026316,0.017544,0.035088,0.035088
validation set,0.04386,0.052632,0.026316,0.035088,0.04386,0.04386,0.017544,0.052632,0.035088,0.026316


The reported best value in Dhahri 2019 is an accuracy of 0.9824, or a fitness value of 0.0176.

The idea behind this section is to more closely match the study of Dhahri, which used a
10-fold Cross Validation scheme to measure the accuracy of the classifiers.

We replicate this in two ways, the second of which is closer to an extension. In Dhahri 2019, they used
the entire dataset in the fitness function, wherein they applied 10-Fold Cross Validation. This
means that they do not have a validation step. This is the first replication.

Our second replication is with a validation step. To create the validation dataset, we take 20% from
the original dataset. We only use 80% of the data to perform 10-Fold Cross Validation within the GA.
For the final evaluation we train on the 80% and measure performance using accuracy when predicting
on the 20%.

For accuracy, we again use the same metric as before. We report the **average accuracy over all 10 folds** from the 10-fold Cross Validation.
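
To make "average accuracy over all 10 folds" concrete, here is a minimal pure-Python sketch. The fold generator, the majority-label stand-in classifier, and the labels are all hypothetical. The notebook itself calls scikit-learn's `cross_validate`, which uses stratified folds for classifiers; this sketch uses plain contiguous folds and does not attempt stratification.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def majority_label(labels):
    """Hypothetical stand-in classifier: always predict the most common training label."""
    return max(set(labels), key=labels.count)


def mean_cv_accuracy(y, k=10):
    """Average per-fold accuracy, where each fold's accuracy is correct / total."""
    fold_accuracies = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        guess = majority_label([y[i] for i in train_idx])
        correct = sum(y[i] == guess for i in test_idx)
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / len(fold_accuracies)


# Hypothetical labels: 30 items, two thirds of them in class 1.
y = [1, 1, 0] * 10
cv_accuracy = mean_cv_accuracy(y, k=10)
cv_fitness = 1 - cv_accuracy
```

Each fold is scored on its own held-out items and the ten scores are averaged with equal weight, which is the quantity reported as the CV score in the tables of this section (after converting to fitness).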

Without Validation Set and closer to Dhahri 2019

In [5]:
from sklearn.model_selection import cross_validate, train_test_split
# from sklearn.decomposition import PCA

ten_cv_test_scores = np.zeros(len(algorithm_space))
ten_cv_train_scores = np.zeros(len(algorithm_space))
# pca = PCA(12)
# X = pca.fit_transform(X)

for i, name in enumerate(algorithm_space):
    clf = algorithm_space[name]()
    clf = clf.create_clf() # Returns sklearn object
    return_dict = cross_validate(clf, X, y, cv=10, return_train_score=True)
    ten_cv_test_scores[i] = return_dict['test_score'].mean()
    ten_cv_train_scores[i] = return_dict['train_score'].mean()

In [6]:
import pandas as pd

df_without_validation = pd.DataFrame(
    [1 - ten_cv_train_scores, 1 - ten_cv_test_scores],
    index=["Average 10-Fold Training Fitness (entire dataset)", "Average 10-Fold CV Score (entire dataset)"],
)
df_without_validation.columns = list(algorithm_space.keys())
df_without_validation.style.set_caption("CV Scores on Default Parameters")
df_without_validation.transpose()

Unnamed: 0,Average 10-Fold Training Fitness (entire dataset),Average 10-Fold CV Score (entire dataset)
Ada Boost,0.0,0.03869
Decision Tree,0.0,0.08609
Extra Trees,0.0,0.03515
Gaussian Naive Bayes,0.060535,0.068484
Gradient Boosting,0.0,0.036873
Linear Discriminant Analysis,0.034174,0.043954
Logistic Regression,0.011716,0.01933
K Nearest Neighbors,0.020894,0.033365
Random Forest,0.0,0.042105
SVC,0.012498,0.022901


With Validation Set for the final evaluation

In [7]:
from sklearn.model_selection import cross_validate, train_test_split

# We use 80% during GA and leave out 20% for final evaluation
ga_X, validation_X, ga_y, validation_y = train_test_split(X, y, random_state=42, test_size=.2)

ten_cv_test_scores = np.zeros(len(algorithm_space))
ten_cv_train_scores = np.zeros(len(algorithm_space))

final_evaluation = np.zeros(len(algorithm_space))
for i, name in enumerate(algorithm_space):
    clf = algorithm_space[name]()
    clf = clf.create_clf() # Returns sklearn object
    return_dict = cross_validate(clf, ga_X, ga_y, cv=10, return_train_score=True)
    ten_cv_test_scores[i] = return_dict['test_score'].mean()
    ten_cv_train_scores[i] = return_dict['train_score'].mean()
    
    clf.fit(ga_X, ga_y)
    final_evaluation[i] = clf.score(validation_X, validation_y)

In [8]:
import pandas as pd

df_with_validation = pd.DataFrame(
    [1 - ten_cv_train_scores, 1 - ten_cv_test_scores, 1 - final_evaluation],
    index=[
        "Average 10-Fold Training Fitness (using 80% of dataset)",
        "Average 10-Fold CV Score (using 80% of dataset)",
        "Final Evaluation", # after training on 80% and testing on 20% of data
    ],
)
df_with_validation.columns = list(algorithm_space.keys())
df_with_validation.style.set_caption("CV Scores on Default Parameters")
df_with_validation.transpose()

Unnamed: 0,Average 10-Fold Training Fitness (using 80% of dataset),Average 10-Fold CV Score (using 80% of dataset),Final Evaluation
Ada Boost,0.0,0.032995,0.026316
Decision Tree,0.0,0.074638,0.052632
Extra Trees,0.0,0.033043,0.026316
Gaussian Naive Bayes,0.062029,0.072609,0.035088
Gradient Boosting,0.0,0.039565,0.04386
Linear Discriminant Analysis,0.034676,0.046087,0.04386
Logistic Regression,0.011966,0.026425,0.026316
K Nearest Neighbors,0.021,0.033043,0.052632
Random Forest,0.0,0.035169,0.04386
SVC,0.011721,0.028599,0.026316


In [9]:
# Stack Tables
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
df_stacked = pd.concat([df_without_validation, df_with_validation])
df_stacked = df_stacked.transpose()
df_stacked.style.set_caption("Average Scores")

Unnamed: 0,Average 10-Fold Training Fitness (entire dataset),Average 10-Fold CV Score (entire dataset),Average 10-Fold Training Fitness (using 80% of dataset),Average 10-Fold CV Score (using 80% of dataset),Final Evaluation
Ada Boost,0.0,0.03869,0.0,0.032995,0.026316
Decision Tree,0.0,0.08609,0.0,0.074638,0.052632
Extra Trees,0.0,0.03515,0.0,0.033043,0.026316
Gaussian Naive Bayes,0.060535,0.068484,0.062029,0.072609,0.035088
Gradient Boosting,0.0,0.036873,0.0,0.039565,0.04386
Linear Discriminant Analysis,0.034174,0.043954,0.034676,0.046087,0.04386
Logistic Regression,0.011716,0.01933,0.011966,0.026425,0.026316
K Nearest Neighbors,0.020894,0.033365,0.021,0.033043,0.052632
Random Forest,0.0,0.042105,0.0,0.035169,0.04386
SVC,0.012498,0.022901,0.011721,0.028599,0.026316


We extract two specific columns from the table above and show them immediately below. Notice that the **average 10-Fold CV Scores** of each classifier under the default hyperparameters are approximately equal whether we use the entire dataset or 80% of it. This suggests that it could be okay to **diverge** from Dhahri 2019 and use just **80%** of the Wisconsin Breast Cancer (Diagnostic) Dataset within GA and leave the remaining 20% for a final validation step.

In [10]:
df_subset = df_stacked[['Average 10-Fold CV Score (entire dataset)', 'Average 10-Fold CV Score (using 80% of dataset)']]
df_subset.style.set_caption("Compare CV Scores when using entire dataset or 80% of the dataset")

Unnamed: 0,Average 10-Fold CV Score (entire dataset),Average 10-Fold CV Score (using 80% of dataset)
Ada Boost,0.03869,0.032995
Decision Tree,0.08609,0.074638
Extra Trees,0.03515,0.033043
Gaussian Naive Bayes,0.068484,0.072609
Gradient Boosting,0.036873,0.039565
Linear Discriminant Analysis,0.043954,0.046087
Logistic Regression,0.01933,0.026425
K Nearest Neighbors,0.033365,0.033043
Random Forest,0.042105,0.035169
SVC,0.022901,0.028599
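
One way to quantify how close the two columns are is the absolute gap per classifier. The sketch below copies the rounded fitness values from the table above into plain dictionaries; the variable names are hypothetical.

```python
# Fitness values (1 - mean 10-fold CV accuracy) copied from the table above.
cv_entire = {
    "Ada Boost": 0.03869, "Decision Tree": 0.08609, "Extra Trees": 0.03515,
    "Gaussian Naive Bayes": 0.068484, "Gradient Boosting": 0.036873,
    "Linear Discriminant Analysis": 0.043954, "Logistic Regression": 0.01933,
    "K Nearest Neighbors": 0.033365, "Random Forest": 0.042105, "SVC": 0.022901,
}
cv_80_percent = {
    "Ada Boost": 0.032995, "Decision Tree": 0.074638, "Extra Trees": 0.033043,
    "Gaussian Naive Bayes": 0.072609, "Gradient Boosting": 0.039565,
    "Linear Discriminant Analysis": 0.046087, "Logistic Regression": 0.026425,
    "K Nearest Neighbors": 0.033043, "Random Forest": 0.035169, "SVC": 0.028599,
}

# Absolute gap between the two schemes for every classifier.
abs_gap = {name: abs(cv_entire[name] - cv_80_percent[name]) for name in cv_entire}
largest = max(abs_gap, key=abs_gap.get)
```

With these rounded values the largest gap belongs to Decision Tree, at a little over 0.011, or roughly 1.1 percentage points of accuracy.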


To compare with the default scores reported by Dhahri 2019, we include the accuracy scores from their Table 2, Row 2, converted to fitness values (1 - accuracy).

In [11]:
dhahri_scores = np.array([0.9823, 0.938, 0.9734, 0.9557, 0.9557, 0.9533, 0.9463, 0.9111, 0.9645, 0.5110])
dhahri_fitness = 1 - dhahri_scores
df_subset.insert(loc=0, column='Dhahri reported average %', value=dhahri_fitness)
df_subset

Unnamed: 0,Dhahri reported average %,Average 10-Fold CV Score (entire dataset),Average 10-Fold CV Score (using 80% of dataset)
Ada Boost,0.0177,0.03869,0.032995
Decision Tree,0.062,0.08609,0.074638
Extra Trees,0.0266,0.03515,0.033043
Gaussian Naive Bayes,0.0443,0.068484,0.072609
Gradient Boosting,0.0443,0.036873,0.039565
Linear Discriminant Analysis,0.0467,0.043954,0.046087
Logistic Regression,0.0537,0.01933,0.026425
K Nearest Neighbors,0.0889,0.033365,0.033043
Random Forest,0.0355,0.042105,0.035169
SVC,0.489,0.022901,0.028599


Examining the percentage differences:

In [12]:
(df_subset.iloc[:,1] - df_subset.iloc[:,0])/ df_subset.iloc[:,0] * 100

Ada Boost                       118.590261
Decision Tree                    38.855203
Extra Trees                      32.144270
Gaussian Naive Bayes             54.590766
Gradient Boosting               -16.764258
Linear Discriminant Analysis     -5.880869
Logistic Regression             -64.004518
K Nearest Neighbors             -62.469447
Random Forest                    18.606375
SVC                             -95.316768
dtype: float64

In [13]:
(df_subset.iloc[:,2] - df_subset.iloc[:,0])/ df_subset.iloc[:,0] * 100

Ada Boost                       86.413385
Decision Tree                   20.383357
Extra Trees                     24.223602
Gaussian Naive Bayes            63.902248
Gradient Boosting              -10.687997
Linear Discriminant Analysis    -1.312727
Logistic Regression            -50.791209
K Nearest Neighbors            -62.830733
Random Forest                   -0.932163
SVC                            -94.151527
dtype: float64
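
Because the fitness values are close to zero, relative differences can look much larger than the underlying gap in accuracy. The sketch below reworks the Ada Boost entry by hand from the rounded table values; `percent_difference` is a hypothetical helper that mirrors the expression used in the two cells above.

```python
def percent_difference(ours, reported):
    """Relative difference of our fitness from the reported fitness, in percent."""
    return (ours - reported) / reported * 100


# Ada Boost values copied from the comparison table (rounded as displayed).
reported = 0.0177      # Dhahri 2019, converted to fitness
ours_entire = 0.03869  # our average 10-fold CV fitness on the entire dataset

relative = percent_difference(ours_entire, reported)  # roughly 118.6 percent
absolute = ours_entire - reported                     # roughly 0.021
```

A relative difference of about 118.6% here corresponds to an absolute gap of about 0.021, or roughly 2.1 percentage points of accuracy.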

Note: It is strange that we cannot reproduce the default results. Our interpretation of Table 2 is most likely wrong, but I think I am going in circles trying to reproduce this now.