![image.png](https://www.guavus.com/wp-content/uploads/2018/05/data-centric-174718966.jpg)
> Photo by <a href="https://www.guavus.com/becoming-data-centric-no-longer-optional/">Guavus Technology</a>
  

# HAM10000: Neural Networks for Skin Lesion Classification


> "Training of neural networks for automated diagnosis of pigmented skin lesions is hampered by the small size and lack of diversity of available datasets of dermatoscopic images. We tackle this problem by releasing the HAM10000 (“Human Against Machine with 10000 training images”) dataset."<br>
[Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018)](https://www.nature.com/articles/sdata2018161).


This notebook continues the work on the HAM10000 dataset started in this [Notebook](https://www.kaggle.com/jnegrini/ham10000-analysis-and-model-comparison). 

Here is an attempt to improve the model by devoting more time to the Data Augmentation process and to Data Quality. The main inspiration was a live presentation by Andrew Ng on 24th March on how we should shift our focus from Model-Centric to Data-Centric AI. The presentation is available [here](https://www.youtube.com/watch?v=06-AZXmwHjo).

## <center style="background-color:Gainsboro; width:40%;">Contents</center>
1. [Overview](#1.-Overview)<br>
1.1. [Content](#1.1.-Content)<br>
1.2. [Acknowledgements](#1.2.-Acknowledgements)<br>
2. [The Model](#2.-The-Model)<br>
3. [How to be more Data-Centric?](#3.-How-to-be-more-Data-Centric?)<br>
3.1 [Data Augmentation](#3.1-Data-Augmentation)<br>
3.2 [ISIC Website and Additional Images](#3.2-ISIC-Website-and-Additional-Images)<br>
3.3 [Synthetic Data](#3.3-Synthetic-Data)<br>
4. [Results and Conclusion](#4.-Results-and-Conclusion)<br>

***Please remember to upvote if you find this Notebook helpful!***

# **1. Overview**

Dermatoscopy is a diagnostic technique that can improve the diagnosis of benign and malignant pigmented skin lesions. Besides increasing the accuracy of skin cancer detection (compared to naked eye exams), dermatoscopic images can also be used to train artificial neural networks (ANNs). In the past, promising attempts have been made to use ANNs to classify skin lesions. However, the lack of data and computing power limited the application of this method.

The [ISIC archive](https://isic-archive.com/) is the largest public database for dermatoscopic image analysis research, and it is where the original HAM10000 was made available. In 2018, the database contained approximately 13,000 dermatoscopic images. Currently, the database holds over 60,000 images, demonstrating the power of collaboration between different scientific groups. 

As mentioned by the authors, the original paper and release of HAM10000 aimed to boost the research on the automated diagnosis of dermatoscopic images. We can say they have certainly achieved that goal after three successful challenges and an impressive expansion of the database.


## 1.1. Content ##

The HAM10000 dataset is composed of 10,015 dermatoscopic images of pigmented skin lesions. The data was collected from Australian and Austrian patients. Competitions and several notebooks have already tackled this problem. 

In this Notebook I will try to improve my previous results by using a more Data-Centric approach, where minimal effort is spent on building and setting up the CNN. Here we will discuss data augmentation optimisation and how more samples can be helpful (or not so much).


## 1.2. Acknowledgements ##

The dataset has been collated and published by [Tschandl, P., Rosendahl, C. & Kittler, H.](https://www.nature.com/articles/sdata2018161). The complementary dataset was created by downloading the files from the [ISIC archive](https://isic-archive.com/).

>Import Libraries

In [None]:
#Reproducible Results
from numpy.random import seed
seed(1)
import tensorflow
tensorflow.random.set_seed(1)
#Basics
import numpy as np
import pandas as pd
import os
from glob import glob
from PIL import Image
import matplotlib.pyplot as plt 
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,precision_recall_fscore_support
import seaborn as sns
#CNN Model
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras.layers import Activation,Dense, Dropout, Flatten, Conv2D, MaxPool2D,AveragePooling2D,GlobalMaxPooling2D,BatchNormalization
from keras.utils import to_categorical # convert to one-hot-encoding
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
#Optimisation
from skopt import gp_minimize
from skopt.space import Real, Integer,Categorical
from skopt.utils import use_named_args
from skopt.plots import plot_convergence

# 2. The Model

Usually a notebook should start by performing an EDA. However, this is not the goal of this study. An EDA was previously performed in this [Notebook](https://www.kaggle.com/jnegrini/ham10000-analysis-and-model-comparison). In this section, the basic CNN model is used to evaluate how the different Data Augmentation methods can help model accuracy. 

## Key Steps

* Add the images to the Dataframe
* Separate the dataframe into Features and Targets data
* Create Training and Test sets (80 - 20 ratio)
* Normalise the input. Following best practice, the normalisation uses the training set mean and standard deviation as reference. The test data must not be normalised with its own statistics, as it should remain unseen
* One Hot Encoding to transform the Target labels
* Separate the training set into Training and Validation sets (90 - 10 ratio)
* The CNN requires each image to be shaped as a 3-dimensional array. For faster computations, here we use (height = 28px, width = 28px, channels = 3)
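
The normalisation and encoding steps listed above can be sketched on toy data. This is only an illustration under assumptions: the arrays are random stand-ins for the images, and the names `x_train_norm`, `x_test_norm` and `one_hot` are hypothetical, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for image arrays: 8 training and 2 test "images" of 28x28x3
x_train = rng.integers(0, 256, size=(8, 28, 28, 3)).astype("float64")
x_test = rng.integers(0, 256, size=(2, 28, 28, 3)).astype("float64")

# The statistics come from the training set only
train_mean = x_train.mean()
train_std = x_train.std()

x_train_norm = (x_train - train_mean) / train_std
x_test_norm = (x_test - train_mean) / train_std  # reuse the training statistics

# One-hot encode integer class labels for 7 classes
labels = np.array([0, 1, 6])
one_hot = np.eye(7, dtype="float32")[labels]

print(x_train_norm.mean(), x_train_norm.std())
print(one_hot)
```

The training array ends up with mean 0 and standard deviation 1 (up to floating-point error), while the test array is only approximately standardised because it borrows the training statistics.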

>Functions

In [None]:
def df_prep(skin_df):
    features=skin_df.drop(columns=['cell_type_idx'],axis=1)
    target=skin_df['cell_type_idx']

    # Create First Train and Test sets
    x_train_o, x_test_o, y_train_o, y_test_o = train_test_split(features, target, test_size=0.20,random_state=123)

    #The normalisation is done using the training set Mean and Std. Deviation as reference
    x_train = np.asarray(x_train_o['image'].tolist())
    x_test = np.asarray(x_test_o['image'].tolist())

    x_train_mean = np.mean(x_train)
    x_train_std = np.std(x_train)

    x_train = (x_train - x_train_mean)/x_train_std
    x_test = (x_test - x_train_mean)/x_train_std

    # Perform one-hot encoding on the labels
    y_train = to_categorical(y_train_o, num_classes = 7)
    y_test = to_categorical(y_test_o, num_classes = 7)

    #Splitting training into Train and Validatation sets
    x_train, x_validate, y_train, y_validate = train_test_split(x_train, y_train, test_size = 0.1,random_state=123)

    #Reshaping the Images into 3 channels (RGB)
    x_train = x_train.reshape(x_train.shape[0], *(28, 28, 3))
    x_test = x_test.reshape(x_test.shape[0], *(28, 28, 3))
    x_validate = x_validate.reshape(x_validate.shape[0], *(28, 28, 3))
    
    return x_train,x_validate,x_test,y_train,y_validate,y_test, y_test_o

def history(model,dataaugment):
    model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=['accuracy'])
    fit_history = model.fit(dataaugment.flow(x_train,y_train, batch_size=batch_size),
                        epochs = epochs, validation_data = (x_validate,y_validate),
                        verbose = 0, steps_per_epoch=x_train.shape[0] // batch_size,shuffle = False,
                        callbacks=[learning_rate_reduction,early_stopping_monitor])

    #Test set accuracy and class probabilities
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    predictions = model.predict(x_test)

    #Score this model's own predictions against the labels of the current test set
    y_true = np.argmax(y_test, axis=1)
    y_pred = np.argmax(predictions, axis=1)
    overall_results = precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division = 1)
    results_per_class = precision_recall_fscore_support(y_true, y_pred, average=None, zero_division = 1)
    keras.backend.clear_session()
    return (accuracy,overall_results,results_per_class,predictions)

    
def MultipleRuns(model,dataaugment,n_runs):
    #Note: the same model object is trained in every run, so the weights carry over
    #from one run to the next; rebuild the model per run for independent repetitions
    accuracy = []
    precision = []
    recall= []
    f1 = []    
    f1_class = []
    for i in range(0,n_runs):
        acc,prec_rec_f1,class_prec_rec_f1,pred = history(model,dataaugment)
        accuracy.append(acc)
        precision.append(prec_rec_f1[0])
        recall.append(prec_rec_f1[1])
        f1.append(prec_rec_f1[2])
        f1_class.append(class_prec_rec_f1[2])
    #Order of the returned lists: Accuracy, Precision, Recall, F1-Score
    Mean_results = [np.mean(accuracy),np.mean(precision),np.mean(recall),np.mean(f1)]
    StaDev_results = [np.std(accuracy),np.std(precision),np.std(recall),np.std(f1)]
    return (Mean_results,StaDev_results,pred)

def baseline_CNN():
    model = Sequential()
    model.add(Conv2D(16, kernel_size = (3,3), input_shape = input_shape, activation = 'relu', padding = 'same'))
    model.add(Conv2D(32, kernel_size = (3,3), activation = 'relu'))
    model.add(MaxPool2D(pool_size = (2,2)))

    model.add(Conv2D(32, kernel_size = (3,3), activation = 'relu', padding = 'same'))
    model.add(Conv2D(64, kernel_size = (3,3), activation = 'relu'))
    model.add(MaxPool2D(pool_size = (2,2), padding = 'same'))

    model.add(Conv2D(64, kernel_size = (3,3), activation = 'relu'))
    model.add(Conv2D(64, kernel_size = (3,3), activation = 'relu', padding = 'same'))
    model.add(MaxPool2D(pool_size = (2,2), padding = 'same'))

    model.add(Flatten())

    model.add(Dense(64, activation = 'relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(7, activation='softmax'))
    
    return model

def MetricsScore(preds,y_test):
    #Convert the softmax outputs passed in as `preds` into predicted class indices
    preds = np.argmax(preds, axis=1)
    results_all = precision_recall_fscore_support(y_test, preds, average='macro',zero_division = 1)
    results_class = precision_recall_fscore_support(y_test, preds, average=None, zero_division = 1)
    return results_all,results_class


def barplot(X,Y,title, x_label,y_label):
    df = pd.DataFrame(Y,X, columns = ['Metrics'])
    plt.figure(figsize=(18,10))
    ax = sns.barplot(data =df, x=df.index, y = 'Metrics',palette = "Blues_d")
    #Bar Labels
    for p in ax.patches:
            ax.annotate("%.1f%%" % (100*p.get_height()), (p.get_x() + p.get_width() / 2., abs(p.get_height())),
            ha='center', va='bottom', color='black', xytext=(-3, 5),rotation = 'horizontal',textcoords='offset points')
    sns.despine(top=True, right=True, left=True, bottom=False)
    ax.set_xlabel(x_label,fontsize = 14,weight = 'bold')
    ax.set_ylabel(y_label,fontsize = 14,weight = 'bold')
    ax.set(yticklabels=[])
    ax.axes.get_yaxis().set_visible(False) 
    plt.title(title, fontsize = 16,weight = 'bold');
    plt.show()
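
As a quick sanity check on the `baseline_CNN` architecture defined above, the spatial size of the feature maps can be traced by hand. The sketch below is not part of the original notebook: the helpers `conv_out`, `pool_out` and `baseline_feature_map_size` are hypothetical names, and it assumes stride-1 convolutions with 3x3 kernels and 2x2 pooling with stride 2, which are the Keras defaults for the layers used.

```python
import math

def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2, padding="valid"):
    """Spatial size after pooling with stride equal to the pool size."""
    return math.ceil(size / pool) if padding == "same" else size // pool

def baseline_feature_map_size(size=28):
    """Follow the layer order of baseline_CNN and return the final spatial size."""
    size = conv_out(size, padding="same")   # Conv2D(16, padding='same')
    size = conv_out(size)                   # Conv2D(32)
    size = pool_out(size)                   # MaxPool2D
    size = conv_out(size, padding="same")   # Conv2D(32, padding='same')
    size = conv_out(size)                   # Conv2D(64)
    size = pool_out(size, padding="same")   # MaxPool2D(padding='same')
    size = conv_out(size)                   # Conv2D(64)
    size = conv_out(size, padding="same")   # Conv2D(64, padding='same')
    size = pool_out(size, padding="same")   # MaxPool2D(padding='same')
    return size

final_size = baseline_feature_map_size(28)
flattened = final_size * final_size * 64    # 64 filters in the last Conv2D
print(final_size, flattened)                # 2 256
```

With 28x28 inputs the network reaches the `Flatten` layer with a 2x2x64 tensor, i.e. 256 features, which is why the dense head can stay small.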


>Data Preparation

In [None]:
#Lesion Dictionary
lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions ',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'}

#Lesion Dictionary
lesion_code_dict = {
    'nv': 0,
    'mel': 1,
    'bkl': 2,
    'bcc': 3,
    'akiec': 4,
    'vasc': 5,
    'df': 6}

categories = list(lesion_type_dict.values())

base_skin_dir = os.path.join('..', 'input')

#Dictionary for Image Names
imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x for x in glob(os.path.join(base_skin_dir, '*','*', '*.jpg'))}

#Read File csv
skin_df = pd.read_csv('../input/skin-cancer-mnist-ham10000/HAM10000_metadata.csv')

#Create useful Columns - Images Path, Lesion Type and Lesion Categorical Code
skin_df['path'] = skin_df['image_id'].map(imageid_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get) 
skin_df['cell_type_idx'] = skin_df['dx'].map(lesion_code_dict.get) 
skin_df['image'] = skin_df['path'].map(lambda x: np.asarray(Image.open(x).resize((28,28))))

>Training and test sets

In [None]:
x_train,x_validate,x_test,y_train,y_validate,y_test,y_test_baseline = df_prep(skin_df)

>Basic Model parameters

In [None]:
#Model Parameters
input_shape = (28, 28, 3)
num_classes = 7

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)

epochs = 20
batch_size = 64

#Callbacks
#The metric is registered as 'accuracy', so the validation metric is named 'val_accuracy'
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', patience=5, verbose=0, factor=0.5, min_lr=0.00001)
early_stopping_monitor = EarlyStopping(patience=20,monitor='val_accuracy')

#Data Augmentation
dataaugment_baseline = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=90,  # randomly rotate images by up to 90 degrees in either direction
        zoom_range = 0.1, # randomly zoom images
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images horizontally
        vertical_flip=True,  # randomly flip images vertically
        shear_range = 10)  # shear intensity (shear angle in degrees)

Here we define the CNN model, a simple, fast and effective architecture, to obtain a Baseline Accuracy and see how much it can be improved by optimising the Data Augmentation parameters.

Some basic Data Augmentation is already applied to the Baseline model. These parameters were obtained by experimentation and previous work.
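
To make the idea of augmentation concrete, here is a minimal NumPy sketch of label-preserving transformations. It is a simplification and an assumption of this example, not what `ImageDataGenerator` does internally: it only uses flips and right-angle rotations, whereas the generator also applies arbitrary-angle rotations, shifts, zoom and shear with interpolation. The function name `random_flip_rotate` is hypothetical.

```python
import numpy as np

def random_flip_rotate(image, rng):
    """Return a randomly flipped and 90-degree-rotated copy of a square H x W x C image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]          # vertical flip
    k = int(rng.integers(0, 4))            # number of 90-degree rotations
    return np.rot90(image, k=k, axes=(0, 1)).copy()

rng = np.random.default_rng(1)
image = np.arange(28 * 28 * 3, dtype="float32").reshape(28, 28, 3)

# Each call yields a new view of the same lesion; the label stays unchanged
augmented = [random_flip_rotate(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Every augmented copy contains exactly the same pixel values as the original, only rearranged, which is why the class label can be reused.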

In [None]:
dataaugment_baseline.fit(x_train)

#CNN Baseline Results - Average of 3 Runs
accuracy = []
precision = []
recall= []
f1 = []
f1_class = []
for i in range(0,3):
    baseline_model = baseline_CNN()
    baseline_model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=['accuracy'])
    history_baseline = baseline_model.fit(dataaugment_baseline.flow(x_train,y_train, batch_size=batch_size),
                            epochs = epochs, validation_data = (x_validate,y_validate),
                            verbose = 0, steps_per_epoch=x_train.shape[0] // batch_size,shuffle = False, 
                            callbacks=[learning_rate_reduction,early_stopping_monitor])

    #Predictions and Baseline Accuracy
    baseline_predictions = baseline_model.predict(x_test)
    _, acc = baseline_model.evaluate(x_test, y_test, verbose=0)
    Baseline_Overall,Baseline_Class = MetricsScore(baseline_predictions,y_test_baseline)
    accuracy.append(acc)
    precision.append(Baseline_Overall[0])
    recall.append(Baseline_Overall[1])
    f1.append(Baseline_Overall[2])
    f1_class.append(Baseline_Class[2])
    #print(baseline_accuracy_t)
    keras.backend.clear_session()
    
Baseline_results = [np.mean(accuracy),np.mean(precision),np.mean(recall),np.mean(f1)]

The test metrics presented below are the average results of three runs. Now we have baseline numbers that we need to improve upon.

>Plot model results

In [None]:
barplot(['Accuracy','Precision','Recall','F-Score'],Baseline_results,'Baseline Model Metrics', 'Metric','Value')

# 3. How to be more Data-Centric?

From the Andrew Ng lecture, I extracted the following main points on how to be more Data-Centric:

* **Pay attention to your labelling**: make sure your labelling process is consistent and is not causing any bias.
>For this application, we have the best Doctors in the world labelling our data, as such, I do not believe I can do any better. However, I noticed some Melanoma images that contain a black background with the lesion within a circle. This pattern could be causing model bias


* **Feature Engineering**: use strategies and field knowledge to create/combine features that can help the model generalisation. 
>Here only images are being used as inputs. At this point I do not see how I could "get creative" and use new Features other than the training images. 


* **Effort in Data Augmentation**: increasing the number of samples by rotating the images and applying other transformations is a well-known strategy. However, how much better can the result get if we put a bit more effort into tuning the Data Augmentation hyperparameters?
>To understand the model sensitivity to Data Augmentation, we will tune the parameters with Bayesian Optimisation, and check how much the accuracy increases. It would be interesting if with our basic CNN model we could arrive at accuracy values I encountered with XCeption architecture. 


* **More Samples, please**: additional samples are the most effective way to improve model generalisation. A dataset with images not included in the original HAM10000 was created, and it is available [here](https://www.kaggle.com/jnegrini/skin-lesions-act-keratosis-and-melanoma)
>Luckily for this application, the ISIC Archive contains more images to be used, only a bit of effort needs to be employed to download and organise the images according to the HAM10000


* **Synthetic data - last, but not least**: due to the effort and somewhat complex process of using GANs to generate new samples, Andrew Ng recommended that this strategy should be used as a final attempt. 

# 3.1 Data Augmentation

## Bayesian Optimisation to find Optimal Parameters

For an in-depth explanation of how Bayesian Optimisation works, I suggest reading this paper by [Peter Frazier](https://arxiv.org/abs/1807.02811). In addition, the lectures by [Nando de Freitas](https://www.youtube.com/watch?v=vz3D36VXefI) are extremely valuable and available on YouTube. He also explains Gaussian Processes in a marvellous way that blew my mind. Finally, [Jeff Heaton](https://www.youtube.com/watch?v=sXdxyUCCm8s) has a great example of how to use BO to set the architecture of an ANN.

I also have another notebook where I used BO to find the best hyperparameters for a [ML model](https://www.kaggle.com/jnegrini/bayesian-optimization-for-hyperparameter-selection/edit/run/54955048). Greater details of how to set up the BO parameters were added there.

Below is the code showing how to set up Bayesian Optimisation with the scikit-optimize (`skopt`) library to find the best Data Augmentation parameters. 

In [None]:
#Algorithm Search Space

#Data Augmentation Parameters (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator)
space  = [Integer(0,90, name = 'rotation_range'),
        Real(0,0.5, name = 'zoom_range'),
        Real(0,0.5, name = 'width_shift_range'),
        Real(0,0.5, name = 'height_shift_range'),
        Real(0,1, name = 'shear_range'),
        Real(0,1, name = 'rescale')]

# Function called by the optimiser; use_named_args maps the values proposed
# for each dimension in `space` onto keyword arguments
@use_named_args(space)
def objective(**params):
    dataaugment = ImageDataGenerator(**params) #set the parameters defined in the space variable
    dataaugment.fit(x_train) #fit to the training set
    Optmodel = baseline_CNN()
    result, _,_ = MultipleRuns(Optmodel,dataaugment,2) #train with the candidate augmentation

    return 1-result[3] #gp_minimize minimises the objective, so return 1 - mean F1-Score

res_gp = gp_minimize(objective, space, n_calls=25, n_initial_points = 10, random_state=0)

The convergence plot is shown next. After a few trials, it was possible to conclude that there were no major improvements after 25 iterations for this particular problem.

In [None]:
print("Best F1 Score %.4f" % (1-res_gp.fun))
plot_convergence(res_gp);

The best parameters are displayed below. It is quite ironic to use optimisation to see that most parameters should be left at "0".

In [None]:
print("Best data augmentation parameters")
print("rotation_range: %f" % (res_gp.x[0]))
print("zoom_range: %f" % (res_gp.x[1]))
print("width_shift_range: %f" % (res_gp.x[2]))
print("height_shift_range: %f" % (res_gp.x[3]))
print("shear_range: %f" % (res_gp.x[4]))
print("rescale: %f" % (res_gp.x[5]))

Now, let's use the parameters given by our optimiser to build ImageDataGenerators and the CNN model.

In [None]:
Optm_params = {
    'rotation_range':res_gp.x[0],
    'zoom_range':res_gp.x[1],
    'width_shift_range':res_gp.x[2],
    'height_shift_range':res_gp.x[3],
    'shear_range':res_gp.x[4],
    'rescale':res_gp.x[5]}

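OptmAugment = ImageDataGenerator(**Optm_params)
OptmAugment.fit(x_train)

#Train a fresh baseline CNN with the optimised augmentation - average of 3 runs
Optm_model = baseline_CNN()
Optm_results, Optm_stddev, Optpred = MultipleRuns(Optm_model,OptmAugment,3)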

In a previous Notebook, I had to use the XCeption architecture to reach ~80% accuracy, using a larger input size (75x100 instead of 28x28). After all these trials are done, it will be interesting to see how much we improve on my previous result.

<blockquote style="margin-right:auto; margin-left:auto; background-color: #FFF2CC; padding: 1em; margin:24px;">
<strong>Tips on Data Augmentation with Bayesian Optimisation</strong>
<ul>
<li>Neural Network outputs vary due to the random weight initialisation, sample shuffling and other details. Your optimiser should use the metric averaged over three or more runs, to make sure that it finds the best value and that the result is reproducible later.
<li>The Data Augmentation parameters seem to be sensitive to the specifics of the model, <em>e.g. by changing the input size from 28x28 to 75x100 I obtained different hyperparameters</em>.
<li>From my experience using this Bayesian Optimisation library, the number of runs does not have to be very large, <em>e.g. after 40 runs it starts to repeat data points it has already visited</em>. Clearly, this will depend on the complexity of your Search Space and Objective.
</ul>
</blockquote>
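
The first tip can be illustrated without training any network. The sketch below is a toy under stated assumptions: `noisy_score` is a hypothetical stand-in for one training run (a smooth score plus Gaussian run-to-run noise), and `averaged_objective` mimics the pattern of returning one minus the mean score over several runs.

```python
import random
import statistics

def noisy_score(param, rng):
    """Hypothetical stand-in for one training run: true score plus run-to-run noise."""
    true_score = 1.0 - (param - 0.3) ** 2
    return true_score + rng.gauss(0.0, 0.05)

def averaged_objective(param, n_runs=3, seed=0):
    """Value to minimise: 1 - mean score over n_runs repetitions."""
    rng = random.Random(seed)
    scores = [noisy_score(param, rng) for _ in range(n_runs)]
    return 1.0 - statistics.mean(scores)

# Evaluate the same parameter many times with a single run and with 5 averaged runs
single = [1.0 - noisy_score(0.3, random.Random(s)) for s in range(200)]
averaged = [averaged_objective(0.3, n_runs=5, seed=s) for s in range(200)]

print("spread with 1 run :", statistics.stdev(single))
print("spread with 5 runs:", statistics.stdev(averaged))
```

Averaging reduces the spread of the objective at a fixed parameter, so the optimiser is less likely to chase a lucky run. The price is a proportional increase in training time per evaluation.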

# 3.2 ISIC Website and Additional Images

The ISIC archive is the largest public database for dermatoscopic image analysis research, and it is where the original HAM10000 dataset was made available. In 2018, the ISIC database contained approximately 13,000 dermatoscopic images. Currently, the database holds over 60,000 images, demonstrating the power of collaboration between different scientific groups.

To analyse how additional data can improve the skin lesion classification, I have created an additional dataset with 3,100 images of the least represented classes. The images were downloaded from the ISIC archive and the metadata.csv file was edited to be in the same format as the original HAM10000 dataset. 

More information regarding this dataset can be found [here](https://www.kaggle.com/jnegrini/skin-lesions-act-keratosis-and-melanoma). A basic EDA of this "**new**" dataset is presented [here](https://www.kaggle.com/jnegrini/exploratory-data-analysis?scriptVersionId=58163682).

<blockquote style="margin-right:auto; margin-left:auto; background-color: #FFF2CC; padding: 1em; margin:24px;">
<strong>Complementary dataset content:</strong>
<ul>
<li>Contains 3,100 images of six types of skin lesions
<li>Actinic Keratoses (akiec) - 600<br>
<li>Basal Cell Carcinoma (bcc) - 600<br>
<li>Seborrheic Keratoses / Solar Lentigo (bkl) - 600<br>
<li>Dermatofibroma lesions (df) - 126<br>
<li>Melanoma (mel) - 900<br>
<li>Vascular lesions (vasc) - 275<br>
</ul>
</blockquote>

The two dataframes are merged into one. Next we set up the basic CNN model with the **non-optimised** Data Augmentation strategy to check the improvement over the Baseline.

>Creation of the new dataframe containing both datasets and running the model again

In [None]:
#Recreating the first DF
base_skin_dir = os.path.join('..', 'input')
imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x for x in glob(os.path.join(base_skin_dir, '*','*', '*.jpg'))}
skin_df = pd.read_csv('../input/skin-cancer-mnist-ham10000/HAM10000_metadata.csv')
skin_df['path'] = skin_df['image_id'].map(imageid_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get) 
skin_df['cell_type_idx'] = skin_df['dx'].map(lesion_code_dict.get) 
base_skin_dir_plus = os.path.join('..', 'input','skin-lesions-act-keratosis-and-melanoma')

#Creating the 2nd DF
imageid_path_dict_plus = {os.path.splitext(os.path.basename(x))[0]: x for x in glob(os.path.join(base_skin_dir_plus, '*','*', '*.jpg'))}
skin_df_plus = pd.read_csv('../input/skin-lesions-act-keratosis-and-melanoma/ISIC-images/metadata.csv')
skin_df_plus['path'] = skin_df_plus['image_id'].map(imageid_path_dict_plus.get)
skin_df_plus['cell_type'] = skin_df_plus['dx'].map(lesion_type_dict.get) 
skin_df_plus['cell_type_idx'] = skin_df_plus['dx'].map(lesion_code_dict.get) 
skin_df_plus.drop_duplicates(inplace = True)

#Merge Two dataframes
dataframes = [skin_df,skin_df_plus]
skin_final = pd.concat(dataframes)

#Create Training and Test sets
skin_final['image'] = skin_final['path'].map(lambda x: np.asarray(Image.open(x).resize((28,28))))
x_train,x_validate,x_test,y_train,y_validate,y_test,y_test_o = df_prep(skin_final)

#Fit Data augmentation
dataaugment_baseline.fit(x_train)

#Run CNN model
Plusmodel = baseline_CNN()
Plus_results, Plus_stddev, Pluspred = MultipleRuns(Plusmodel,dataaugment_baseline,2)

To better understand the impact of additional data to the model, let's see how the accuracy increases as more data are fed to the model. The code below creates and trains six models with different percentages of data. The first model contains only 50% of samples, the second 60% and so on. 

The results of each model are shown in the plot below:

In [None]:
fractions = [0.5,0.6,0.7,0.8,0.9,1]
acc_results = []
std_devs = []

for i in fractions:
    df = skin_final.sample(frac=i,random_state=0)
    #df_prep returns seven values; the last one (integer test labels) is not needed here
    x_train,x_validate,x_test,y_train,y_validate,y_test,_ = df_prep(df)

    #Fit Data augmentation
    dataaugment_baseline.fit(x_train)

    #Run CNN model
    Fracmodel = baseline_CNN()
    #MultipleRuns returns (means, std devs, predictions); index 0 of each list is the Accuracy
    mean_result, stddev_result, _ = MultipleRuns(Fracmodel,dataaugment_baseline,3)
    acc_results.append(mean_result[0])
    std_devs.append(stddev_result[0])

In [None]:
acc_results = np.array(acc_results)
std_devs = np.array(std_devs)
fractions = np.array(fractions)
fractions = fractions*100

#To be able to Despine the graph
ax = plt.subplot(111)
ax.plot(fractions, acc_results, 'b-', marker = 'o')

# Hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

ax = plt.fill_between(fractions, acc_results - std_devs, acc_results + std_devs, color='b', alpha=0.2)
plt.title('Test set Accuracy Values and Dataset Fraction',fontsize = 14,weight = 'bold');
plt.xlabel('Fraction of the Dataset (%)',fontsize = 14,weight = 'bold');
plt.ylabel('Accuracy - Test set',fontsize = 14,weight = 'bold');

* As expected, with 50% of the data the accuracy is lower, and it increases almost linearly up to 70%
* It is interesting that the accuracy peaks at the 70% mark, because that subset contains approximately the number of samples of the original dataset, HAM10000
* From 70% to 80% there is a decrease in accuracy, and the standard deviation also decreases (which is good)
* The model with 100% of the data, containing both datasets (about 13,000 samples), presents ~69% accuracy, only a 0.5% increase over the model trained with 70% of the samples.

It is curious that the accuracy does not continue to increase with the additional 3,000 samples. In the next section, we analyse the different strategies and how they impacted the Accuracy, Precision, Recall and F1-Score of the model.
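
For reference, the subset sizes behind the fractions can be worked out with simple arithmetic. This is a rough sketch: it uses the nominal image counts quoted in this notebook (10,015 and 3,100), ignores any rows removed by `drop_duplicates`, and approximates the rounding done by the split functions. The helper `subset_sizes` is a hypothetical name.

```python
ham10000 = 10_015   # images in HAM10000
extra = 3_100       # nominal size of the complementary dataset
total = ham10000 + extra

def subset_sizes(pct):
    """Approximate (subset, train, validation, test) sizes for a percentage of the merged data."""
    n = total * pct // 100
    n_test = n * 20 // 100                 # 20% held out for testing
    n_validate = (n - n_test) * 10 // 100  # 10% of the remainder for validation
    n_train = n - n_test - n_validate
    return n, n_train, n_validate, n_test

for pct in (50, 60, 70, 80, 90, 100):
    n, n_train, n_validate, n_test = subset_sizes(pct)
    print(f"{pct:>3}% -> {n:>6} samples (~{n_train} train, ~{n_validate} validation, ~{n_test} test)")
```

Under these assumptions the 70% subset holds about 9,180 samples, somewhat fewer than HAM10000 on its own. Note also that a random subset of the merged data has a different class mix from the original dataset.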

# 4. Final Model and Putting Everything Together

This final model combines the improved Data Augmentation parameters, the additional data and the XCeption network architecture. Let's see how much the Accuracy and F-Score improve compared to a previous Notebook where I also used the XCeption architecture.

The goal is to reach Accuracy above **0.81** and obtain better **F1-Scores** than the previous [Notebook](https://www.kaggle.com/jnegrini/ham10000-analysis-and-model-comparison) achieved. The image below summarises the F1-Score of the previous work with the XCeption architecture:
![](https://www.kaggleusercontent.com/kf/57671423/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..yN47vmOTQEwpGmyXlsFneg.yola8-C0HnNgccE3vStdz-tO6vcB9VurLXOCBfPboqOgD-n2S2i-_7CQ23xUgUoD3Dbw7MXR69hIp_qE5RGI3B5bTEQl9dzYdCVWI-2ZA-OURjFRhwivk-WwJvCxXzKhGxbIq-hQgGF4fV1KhnBYPNYshzGVAYegPGwVT_jV3Su6hp12yzrpnypwZvpQnSuRHjsXVp8-w2Tv2ZLlY12kpzN2LCmk7Yuz-177m6zjVjdvszRYvXXgMkjPkK31FJXgqbssXxJQjPB_bgssZwJLntiA2OeGn0DCyzbaayiVzj5Mm-hIkfu_g3AIw4rVBgPleXqnWzmY8h3Y6zpmbQjDcMBB_Uuq5NuqH3mriEVUa5HcZRlS5IPHtAyLLhtKpZRjUbscrKWD1uFZJP2HSGS9Fe1E-q-QvXWzUpoCGEohCp-nN7howTCO6k4ONNJmsKGu5sLRP9-sRrKMoBaCYyGkBc-avvTfJuS4hQ9vlIk8cCca31qnWAAf-iO8pHt8Ryzfur5YENWgw7ZsgLIo6dNfs9pB4OxaKtvaTWtRyK9RA9gSWb9F7zJ37hMwAQ2DDMsnoskCAQ8PMIZk_h_iiQVMWosN7SKkwn47E3q8g_Dyw-sc-GOhOyFXWkXz_PLOHgP8LwakRoFtac1eThsorjLXwRCKSs04y0mS5NmyCx4G97JPozmY14jLC4eYbmMz50eF.mHwY44zL49NMrJjWVp-Q6w/__results___files/__results___36_0.png)

In [None]:
#XCeption Model
#Fit Data augmentation
OptmAugment.fit(x_train)

from keras.applications import Xception
#
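#Note: keras.applications.Xception requires inputs of at least 71x71 pixels, so the call
#below raises a ValueError for 28x28 images; load the images at a larger size (e.g. the
#75x100 used in the previous Notebook) and update training_shape accordingly
training_shape = (28,28, 3)
base_model = Xception(include_top=False,weights='imagenet',input_shape = training_shape)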

XCeptionmodel = base_model.output
XCeptionmodel = Flatten()(XCeptionmodel)

XCeptionmodel = BatchNormalization()(XCeptionmodel)
XCeptionmodel = Dense(128, activation='relu')(XCeptionmodel)
XCeptionmodel = Dropout(0.2)(XCeptionmodel)

XCeptionmodel = BatchNormalization()(XCeptionmodel)
XCeptionoutput = Dense(num_classes, activation = 'softmax')(XCeptionmodel)
XCeptionmodel = Model(inputs=base_model.input, outputs=XCeptionoutput)

for layer in base_model.layers:
    layer.trainable = True

#Single XCeption run: history returns test accuracy, macro metrics, per-class metrics and predictions
XCeptionaccuracy,XCeption_overall,XCeption_class,XCeptiony_pred = history(XCeptionmodel,OptmAugment)

print("XCeption Test: accuracy = %f" % (XCeptionaccuracy))
print("XCeption Test: macro F1-Score = %f" % (XCeption_overall[2]))

#Run XCeption model - average of 3 runs
Final_results, Final_stddev, Finalpred = MultipleRuns(XCeptionmodel,OptmAugment,3)

# 4. Results

One of the classes of the dataset is Melanoma, a malignant form of cancer. For this reason, it is important to analyse not only the Accuracy of the model, but also the other metrics to understand how the model performs. 

As a refresher, the image from [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall) below illustrates the concept behind **Precision** and **Recall**. Unfortunately, these metrics usually present an inverse relationship, i.e. it is often only possible to increase one at the cost of reducing the other.

![image.png](https://d1zx6djv3kb1v7.cloudfront.net/wp-content/media/2020/10/Accuracy-Recall-Precision-F1-Score-in-Python-4.png)

<blockquote style="margin-right:auto; margin-left:auto; background-color: #FFF2CC; padding: 1em; margin:24px;">
<strong>Precision and Recall in this Context</strong>
<ul>
<li>A model with high Precision has far more TP than FP. 
<li>For example, if a high-Precision model classifies a sample as "Melanoma", it is very likely to be a "Melanoma". However, since Precision does not account for FN, it does not say how many "Melanoma" samples were missed, e.g. misclassified as "Vascular Lesion"
<li>A model with high Recall has far more TP than FN. 
<li>Contrary to Precision, Recall takes into account the number of "Melanoma" samples that were missed. A high Recall means that nearly all "Melanoma" samples were correctly labelled. However, since Recall does not account for FP, it does not tell us how many "Vascular Lesion" samples were misclassified as "Melanoma"
</ul>
</blockquote>
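
A small worked example may help to see how these metrics diverge. The labels below are hypothetical (10 melanomas among 100 lesions, and a cautious model that flags only 4 of them), and `precision_recall` is an illustrative helper rather than the scikit-learn function used elsewhere in this notebook.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and Recall for one class, computed from TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical test set: 10 melanomas followed by 90 nevi
y_true = ["mel"] * 10 + ["nv"] * 90
# A cautious model: only the first 4 melanomas are flagged, everything else is "nv"
y_pred = ["mel"] * 4 + ["nv"] * 96

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred, positive="mel")

print(accuracy)   # 0.94
print(precision)  # 1.0
print(recall)     # 0.4
```

The toy model looks excellent in terms of Accuracy and Precision while missing 6 out of 10 melanomas, which is exactly the failure mode that Recall exposes.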
    
To balance both metrics, the F-Score can be used. Its Beta parameter can be tuned to weight Recall β times as important as Precision.
    
![image.png](https://wikimedia.org/api/rest_v1/media/math/render/svg/136f45612c08805f4254f63d2f2524bc25075fff) 
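
The F-Score definition, F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall), can be checked numerically. The function `f_beta` below is a small illustrative helper and the Precision and Recall values are made up for the example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of Precision and Recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with perfect Precision but poor Recall
precision, recall = 1.0, 0.4

f1 = f_beta(precision, recall, beta=1.0)       # Precision and Recall weighted equally
f2 = f_beta(precision, recall, beta=2.0)       # Recall weighted more heavily
f_half = f_beta(precision, recall, beta=0.5)   # Precision weighted more heavily

print(round(f1, 4), round(f2, 4), round(f_half, 4))  # 0.5714 0.4545 0.7692
```

For a screening task where a missed melanoma is costly, a β above 1 penalises the low Recall more strongly, as the drop from F1 to F2 shows.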

In [None]:
barplot(['Accuracy','Precision','Recall','F1-Score'],Baseline_results,'Baseline Model Metrics', 'Metric','Value')
barplot(['Accuracy','Precision','Recall','F1-Score'],Optm_results,'Optimised Data Augmentation Model Metrics', 'Metric','Value')
barplot(['Accuracy','Precision','Recall','F1-Score'],Plus_results,'Model with Additional Data Metrics', 'Metric','Value')
barplot(['Accuracy','Precision','Recall','F1-Score'],Final_results,'Final Model (XCeption) Metrics', 'Metric','Value')

* For this application, Accuracy can be incredibly misleading
* One could argue that for this application Recall would be more important than Precision. Due to the Melanoma risk, a False Negative could leave the patient with the cancer untreated
* We can now understand why the Accuracy decreased with more samples. The model trained with the additional dataset is evaluated on more samples (i.e. a larger denominator) while maintaining a similar numerator to the Baseline Model (the number of TP + TN has not changed much, and the Precision and Recall metrics did not change dramatically)
* Data Augmentation seems to improve the model ability to ...


# 5. Conclusion

This Notebook tried to improve the model accuracy by focusing the effort on Data-Centric approaches. Here we used data augmentation and the collection of more samples as strategies to improve model generalisation.

The experiment with data augmentation achieved a decent accuracy improvement by applying a Bayesian Optimisation method. Fine-tuning the Data Augmentation hyperparameters brought better outcomes than adding the extra samples, using the same model training parameters. 

The amount of work it took to collect and create the complementary dataset did not pay off for this application. There are a few possible reasons for this outcome:
* I fixed the batch size and the number of epochs to 64 and 20, respectively. One hypothesis is that with more samples the network is capable of achieving better results, but that it requires more training time and different training parameters
* Perhaps the accuracy has not increased, but Precision or Recall could have improved