# EDA

The dataset used to train the model is Skin Cancer MNIST: HAM10000, which can be found on Kaggle. There are also some great EDA notebooks under the kernels for this dataset. The EDA here is largely based on [this notebook](https://www.kaggle.com/sid321axn/step-wise-approach-cnn-model-77-0344-accuracy).

Major TODOs:
- Create a Dockerfile and clean up the README
- Look at class activation maps and other localization techniques
- Automate performance tracking: create a CSV that stores metadata on which settings/hyperparameters were used
- Set up hyperparameter tuning using Keras Tuner
- Create a pipeline using TFRecords, `tf.data` datasets, and TensorBoard
- Create an iOS mobile app
- Allow various models to be trained
- Try incorporating non-image data (i.e. patient info) into a single end-to-end model
- Split this notebook up and keep model training separate from EDA. Also add notebooks for model performance analysis and localization/visualization (which take in a trained model)
- Add class weighting (with description)

## Data Processing

In [None]:
import os
from glob import glob
import pandas as pd
import numpy as np

import altair as alt
import matplotlib.pyplot as plt
from PIL import Image

from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers

In [None]:
alt.data_transformers.disable_max_rows()

Build a dictionary that maps each image ID to its file path, and create a lookup table of readable names for our classes.

In [None]:
base_dir = os.path.join('..', 'data')

# Merging images from both folders HAM10000_images_part1.zip and HAM10000_images_part2.zip into one dictionary

image_path_dict = {os.path.splitext(os.path.basename(x))[0]: x
                     for x in glob(os.path.join(base_dir, '*', '*.jpg'))}

# This dictionary is useful for displaying more human-friendly labels later on

lesion_type_dict = {
    'nv': 'Melanocytic nevi',
    'mel': 'Melanoma',
    'bkl': 'Benign keratosis-like lesions',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}

In [None]:
print(f'There are {len(image_path_dict)} images in our dataset')

Here we will read and process the data. This will help later with creating labels.

In [None]:
skin_df = pd.read_csv(os.path.join(base_dir, 'datasets_54339_104884_HAM10000_metadata.csv'))

# Creating New Columns for better readability

skin_df['path'] = skin_df['image_id'].map(image_path_dict.get)
skin_df['cell_type'] = skin_df['dx'].map(lesion_type_dict.get) 
skin_df['cell_type_idx'] = pd.Categorical(skin_df['cell_type']).codes
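
Note that `pd.Categorical(...).codes` assigns integers by the sorted order of the category values, so `cell_type_idx` follows the alphabetical order of the readable names rather than the order of `lesion_type_dict`. A minimal sketch of recovering the index-to-name mapping (the names `categories`, `idx_to_name`, and `codes` are illustrative, not from the notebook):

```python
import pandas as pd

# Readable class names, in the same order as lesion_type_dict above
names = [
    'Melanocytic nevi', 'Melanoma', 'Benign keratosis-like lesions',
    'Basal cell carcinoma', 'Actinic keratoses', 'Vascular lesions',
    'Dermatofibroma',
]

# Categories inferred from string data are sorted; codes index into them
categories = pd.Categorical(names)
idx_to_name = dict(enumerate(categories.categories))
codes = dict(zip(names, categories.codes))

print(idx_to_name)
```

A mapping like `idx_to_name` is handy later for turning `np.argmax` of a prediction back into a readable label.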

In [None]:
skin_df.head()

Next, check for null values. Test different methods of imputation.

In [None]:
skin_df.isnull().sum()
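
One simple imputation strategy to try, if `age` turns out to be the only column with missing values, is filling with the median. This is a sketch on a toy frame; the `age_imputed` column name is an assumption, and other strategies (a per-class median, or dropping the rows) may work better.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'dx': ['nv', 'nv', 'mel', 'mel'],
    'age': [40.0, np.nan, 70.0, 60.0],
})

# Median of the observed ages (NaN values are skipped by default)
median_age = toy_df['age'].median()

# Keep the raw column and store the imputed values separately
toy_df['age_imputed'] = toy_df['age'].fillna(median_age)
print(toy_df)
```

Keeping the original column alongside the imputed one makes it easy to compare strategies later.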

In [None]:
print(skin_df.dtypes)

## EDA

First look at the distribution of our target variable.

In [None]:
alt.Chart(skin_df, height=300).mark_bar().encode(
    x='count()',
    y='cell_type',
    color='cell_type',
    tooltip='count()'
)
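
HAM10000 is heavily skewed toward melanocytic nevi, which is what motivates the class-weighting item in the TODO list. A minimal sketch of inverse-frequency weights, using the common `n_samples / (n_classes * class_count)` heuristic; the toy label list and the `class_weights` name are illustrative only.

```python
from collections import Counter

# Toy integer labels standing in for skin_df['cell_type_idx']
labels = [4, 4, 4, 4, 4, 4, 5, 5, 2, 0]

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Rare classes get weights above 1, frequent classes below 1
class_weights = {
    cls: n_samples / (n_classes * count)
    for cls, count in sorted(counts.items())
}
print(class_weights)
```

A dictionary of this form, keyed by integer class index, is the shape that the `class_weight` argument of Keras `model.fit` expects.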

The ground truth labels in this dataset were established by four methods:

1. Histopathology (histo): Histopathologic diagnoses of excised lesions were performed by specialized dermatopathologists.
2. Confocal: Reflectance confocal microscopy is an in-vivo imaging technique with near-cellular resolution; some facial benign keratoses were verified by this method.
3. Follow-up: If nevi monitored by digital dermatoscopy did not show any changes during 3 follow-up visits or 1.5 years, the dataset authors accepted this as evidence of biologic benignity. Only nevi, but no other benign diagnoses, were labeled with this type of ground truth because dermatologists usually do not monitor dermatofibromas, seborrheic keratoses, or vascular lesions.
4. Consensus: For typical benign cases without histopathology or follow-up, the dataset authors provide an expert-consensus rating by authors PT and HK. They applied the consensus label only if both authors independently gave the same unequivocal benign diagnosis. Lesions with this type of ground truth were usually photographed for educational reasons and did not need further follow-up or biopsy for confirmation.

In [None]:
alt.Chart(skin_df, height=300).mark_bar().encode(
    x='count()',
    y='dx_type',
    color='dx_type',
    tooltip='count()'
)

Look at the distribution of the localization field.

In [None]:
alt.Chart(skin_df, height=400).mark_bar().encode(
    x='count()',
    y='localization',
    color='localization',
    tooltip='count()'
)

Look at the distribution of patient age

In [None]:
alt.Chart(skin_df[skin_df['age'].notnull()]).mark_bar().encode(
    alt.X("age:Q", bin=True),
    y='count()',
)

Look at sex distribution in our data

In [None]:
alt.Chart(skin_df, height=400).mark_bar().encode(
    x='count()',
    y='sex',
    color='sex',
    tooltip='count()'
)

Look at the median age for each cell type (the target).

In [None]:
alt.Chart(skin_df[skin_df['age'].notnull()], height=400).mark_bar().encode(
    x='median(age)',
    y='cell_type',
    color='cell_type'
)

## Data Quality

Look for multiple images of the same lesion (`lesion_id`) and make sure the datasets are stratified.

In [None]:
df = skin_df.groupby('lesion_id').count()

In [None]:
df.sort_values(by='image_id', ascending=False).head(10)

In [None]:
print(f'Original dataset had {skin_df.shape[0]} records, there are {df.shape[0]} unique lesions')

## Create Train, Test, and Val Sets

- TODO Set aside a test set to evaluate model after it has been trained
- TODO Use TF Records and tf dataset to store training dataset as an alternative to data generator below

We see that some lesions have numerous images, so we will keep a single image per lesion (`lesion_id`). Then we will take a stratified sample across our target variable to create the train, test, and validation sets.

First create a dataframe containing a single image for each lesion. Note that we could also try including these duplicates, as long as all images of a lesion stay in a single train, test, or val set when we split the dataset.

In [None]:
# Set a seed (random_state) for reproducibility and deterministic train/val/test sets
df_dataset = skin_df.sample(frac=1, random_state=123).drop_duplicates(subset='lesion_id').copy()
df_dataset.reset_index(drop=True, inplace=True)
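
If the duplicate images are kept instead, the split has to be made on `lesion_id` rather than on individual rows, so that no lesion appears in more than one set. A minimal sketch on toy data (the helper name `split_by_group` is hypothetical; scikit-learn's `GroupShuffleSplit` offers the same idea with more options):

```python
import random

def split_by_group(groups, test_frac=0.2, seed=123):
    """Return (train_idx, test_idx) so that each group lands in one set only."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_frac))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Toy lesion ids: some lesions have several images
lesion_ids = ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'e', 'f']
train_idx, test_idx = split_by_group(lesion_ids)

train_lesions = {lesion_ids[i] for i in train_idx}
test_lesions = {lesion_ids[i] for i in test_idx}
print(train_lesions, test_lesions)
```

This sketch does not stratify by class; combining grouping with stratification would need an extra step.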

In [None]:
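# Ordered to match cell_type_idx: pd.Categorical sorts the readable names
# alphabetically (Actinic keratoses, Basal cell carcinoma, ...)
CLASS_LABELS = [
    'akiec',
    'bcc',
    'bkl',
    'df',
    'nv',
    'mel',
    'vasc',
]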

First use a data generator to train the model; moving to TFRecords to orchestrate training is left as a TODO.

Visualize some of the images

In [None]:
n_samples = 5

fig, m_axs = plt.subplots(7, n_samples, figsize = (4*n_samples, 3*7))
for n_axs, (type_name, type_rows) in zip(m_axs, df_dataset.sort_values(['cell_type']).groupby('cell_type')):
    n_axs[0].set_title(type_name)
    for c_ax, (_, c_row) in zip(n_axs, type_rows.sample(n_samples, random_state=1234).iterrows()):
        c_ax.imshow(np.asarray(Image.open(c_row['path'])))
        c_ax.axis('off')
# The line below will save the images to disk
#fig.savefig('category_samples.png', dpi=300)

In [None]:
df_dataset.head()

In [None]:
img_shape = np.asarray(Image.open(df_dataset['path'][0])).shape
print('Image shape:', img_shape)

## Original Image

In [None]:
img = np.asarray(Image.open(df_dataset['path'][0]))

In [None]:
img.shape

In [None]:
f, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.imshow(img)
ax.axis('off')
ax.set_aspect('auto')

plt.show() 

## Augmented Image

In [None]:
augmented = tf.image.random_brightness(img, max_delta=0.2)
# augmented = tf.image.random_flip_up_down(img)
# augmented = tf.image.random_flip_left_right(img)
# augmented = tf.image.random_saturation(image=img, lower=0.7, upper=1.3)
# augmented = tf.image.random_hue(image=img, max_delta=0.03)
# augmented = tf.image.random_contrast(image=img, lower=0.7, upper=1.3)

In [None]:
augmented.shape

In [None]:
f, ax = plt.subplots(1, 1, figsize=(5, 5))

ax.imshow(augmented)
ax.axis('off')
ax.set_aspect('auto')

plt.show() 
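
One detail worth keeping in mind: for floating-point inputs, `tf.image.random_brightness` adds the sampled delta directly to the pixel values, so `max_delta=0.2` is sized for images scaled to [0, 1] (integer images such as the uint8 array above are rescaled internally). The NumPy stand-in below (`add_brightness` is a hypothetical helper, not part of TensorFlow) illustrates how the same delta behaves at the two scales:

```python
import numpy as np

def add_brightness(image, delta):
    """Stand-in for a brightness shift on a float image: add delta to every pixel."""
    return image + delta

pixels_255 = np.array([0.0, 127.5, 255.0], dtype=np.float32)  # raw scale
pixels_unit = pixels_255 / 255.0                              # scaled to [0, 1]

delta = 0.2
# Shift of the mid-grey pixel, expressed as a fraction of the full range
change_255 = (add_brightness(pixels_255, delta)[1] - pixels_255[1]) / 255.0
change_unit = add_brightness(pixels_unit, delta)[1] - pixels_unit[1]

print(f'shift on [0, 255] scale: {change_255:.4f} of the range')
print(f'shift on [0, 1] scale:   {change_unit:.4f} of the range')
```

On the raw scale the shift is well under one percent of the range, so float images are best normalized before this kind of augmentation is applied.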

Create stratified train/test/val sets

In [None]:
X = df_dataset['path']
y = df_dataset['cell_type_idx']

Set a seed (random_state) for reproducibility and deterministic train/val/test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.111, random_state=321)
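
The second `test_size` is 0.111 rather than 0.1 because it is applied to the 90% that remains after the test split. The arithmetic, as a quick check (actual row counts will differ slightly because of rounding):

```python
test_frac = 0.1            # first split: fraction of the full dataset
val_frac_of_rest = 0.111   # second split: fraction of what remains

val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = 1 - test_frac - val_frac

print(f'train: {train_frac:.3f}, val: {val_frac:.3f}, test: {test_frac:.3f}')
```

The result is roughly an 80/10/10 split of the deduplicated dataset.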

In [None]:
train_data = pd.DataFrame({
    'path': X_train,
    'cell_type_idx': y_train
})

In [None]:
NUM_TRAIN = len(train_data)

In [None]:
def convert_image_to_array(path):
    return np.asarray(Image.open(path), dtype=np.float32)

In [None]:
def create_model_file(X_path, y):
    """
    X_path: (pandas series) contains the file paths to the images
    y: (pandas series of type int) the target label
    
    return a pair of numpy arrays representing (features, target)
    """
    
    # Stack the images into one (n, height, width, 3) float32 array scaled to [0, 1]
    X = np.stack([convert_image_to_array(path) for path in X_path])
    X /= 255.

    # One-hot encode the integer labels
    y = to_categorical(y.values, num_classes=len(CLASS_LABELS))

    return (X, y)

In [None]:
def model_predict(path, model):
    x = convert_image_to_array(path=path)
    x /= 255.
    x = np.expand_dims(x, axis=0)
    return model.predict(x)

In [None]:
val_data = create_model_file(X_path=X_val, y=y_val)

Create a data generator for training

In [None]:
def data_gen(data, batch_size, image_size=(450, 600), dtype=np.float32):
    # Get total number of samples in the data
    n = len(data)
    steps = n//batch_size

    # Get a numpy array of all the indices of the input data
    indices = np.arange(n)

    while True:
        # Reshuffle once per epoch so that batches within an epoch do not overlap
        np.random.shuffle(indices)
        for i in range(steps):
            # Allocate fresh arrays so previously yielded batches are not overwritten
            batch_data = np.zeros((batch_size, image_size[0], image_size[1], 3), dtype=dtype)
            batch_labels = np.zeros((batch_size, len(CLASS_LABELS)), dtype=dtype)

            # Get the next batch
            next_batch = indices[(i*batch_size):(i+1)*batch_size]
            for j, idx in enumerate(next_batch):
                img_path = data.iloc[idx]['path']
                label = data.iloc[idx]['cell_type_idx']

                # one hot encoding
                encoded_label = to_categorical(label, num_classes=len(CLASS_LABELS))
                # read the image and normalize pixels to [0, 1] before augmenting:
                # for float images the tf.image ops below assume that range
                img = np.asarray(Image.open(img_path), dtype=dtype) / 255.

                # add image augmentation
                if np.random.uniform() < 0.15:
                    img = tf.image.random_brightness(img, max_delta=0.2)
                if np.random.uniform() < 0.15:
                    img = tf.image.random_flip_up_down(img)
                if np.random.uniform() < 0.15:
                    img = tf.image.random_flip_left_right(img)
                if np.random.uniform() < 0.15:
                    img = tf.image.random_saturation(image=img, lower=0.7, upper=1.3)
                if np.random.uniform() < 0.15:
                    img = tf.image.random_hue(image=img, max_delta=0.03)
                if np.random.uniform() < 0.15:
                    img = tf.image.random_contrast(image=img, lower=0.7, upper=1.3)

                # augmentation can push values outside [0, 1]
                img = tf.clip_by_value(img, 0., 1.)

                batch_data[j] = img
                batch_labels[j] = encoded_label

            yield batch_data, batch_labels
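
The index bookkeeping in a generator like this is easier to check in isolation. The sketch below (`batch_indices` is a hypothetical helper, not part of the notebook) reshuffles once per epoch and yields full batches only, so every sample is visited at most once per epoch:

```python
import numpy as np

def batch_indices(n, batch_size, seed=0):
    """Yield index arrays forever, reshuffling at the start of each epoch."""
    rng = np.random.default_rng(seed)
    steps = n // batch_size          # the incomplete last batch is dropped
    indices = np.arange(n)
    while True:
        rng.shuffle(indices)         # one shuffle per epoch
        for step in range(steps):
            yield indices[step * batch_size:(step + 1) * batch_size].copy()

gen = batch_indices(n=10, batch_size=4)
epoch = [next(gen) for _ in range(2)]   # 10 // 4 = 2 batches per epoch
seen = np.concatenate(epoch)
print(seen)
```

With 10 samples and a batch size of 4, two samples are left out of each epoch; which two changes with every reshuffle.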

## Keras Utility Functions

Define some functions that simplify fine-tuning pre-trained models.

In [None]:
def freeze_layers(model, freeze_layer_name):
    # Freeze every layer up to and including `freeze_layer_name`
    for layer in model.layers:
        layer.trainable = False
        if layer.name == freeze_layer_name:
            break
            
def unfreeze_batch_norm(model):
    for layer in model.layers:
        if layer.__class__.__name__ == 'BatchNormalization':
            layer.trainable = True

def print_layer_trainable(model):
    for layer in model.layers:
        print('{0}:\t{1}'.format(layer.trainable, layer.name))
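
To see which layers a call like `freeze_layers(model, name)` affects without loading a real network, the same loop can be run on stand-in objects. `FakeLayer` and `freeze_up_to` below are purely illustrative; real Keras layers expose the same `name` and `trainable` attributes.

```python
from dataclasses import dataclass

@dataclass
class FakeLayer:
    name: str
    trainable: bool = True

def freeze_up_to(layers, last_frozen_name):
    """Freeze layers in order, stopping after `last_frozen_name`."""
    for layer in layers:
        layer.trainable = False
        if layer.name == last_frozen_name:
            break

stack = [FakeLayer(n) for n in ['conv1', 'add_1', 'conv2', 'add_2', 'head']]
freeze_up_to(stack, 'add_1')

for layer in stack:
    print(f'{layer.trainable}:\t{layer.name}')
```

Note that if the given name matches no layer, the loop never breaks and every layer ends up frozen, so it is worth checking the output of `print_layer_trainable` after freezing.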

## Build the model

TODO Add in MobileNetV2, EfficientNet, DenseNet

In [None]:
from tensorflow.keras.applications.xception import Xception

input_tensor = layers.Input(shape=(450, 600, 3), name='ImageInput')

model = Xception(include_top=False, weights='imagenet', input_tensor=input_tensor)

In [None]:
#model.summary()

Determine where to freeze and cut off base model

In [None]:
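transfer_layer_name = 'block14_sepconv1_act'
# NOTE: `add_N` names are auto-generated by Keras and can shift if the model is
# built more than once in a session; confirm the name with model.summary()
freeze_layer_name = 'add_10'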

transfer_layer = model.get_layer(transfer_layer_name)

In [None]:
conv_model = tf.keras.Model(inputs=model.input, outputs=transfer_layer.output)

In [None]:
freeze_layers(conv_model, freeze_layer_name)

In [None]:
def build_model(base_model, num_classes, pooling='avg', final_conv_layer='vgg_separable'):
    # Get the output of the base model on which we will build
    x = base_model.layers[-1].output
    
    if final_conv_layer == 'xception':
        x = layers.SeparableConv2D(2048, (3, 3), padding='same', use_bias=False, name='block14_sepconv2')(x)
        x = layers.BatchNormalization(name='block14_sepconv2_bn')(x)
        x = layers.Activation('relu', name='block14_sepconv2_act')(x)
    elif final_conv_layer == 'non_separable':
        x = layers.Conv2D(2048, (3, 3), padding='same', use_bias=False, name='block14_conv2')(x)
        x = layers.BatchNormalization(name='block14_conv2_bn')(x)
        x = layers.Activation('relu', name='block14_conv2_act')(x)
    elif final_conv_layer == 'vgg_separable':
        x = layers.SeparableConv2D(2048, (3,3), activation='relu', padding='same', name='block14_sepconv2')(x)
    elif final_conv_layer == 'vgg':
        x = layers.Conv2D(2048, (3,3), activation='relu', padding='same', name='block14_sepconv2')(x)
    else:
        raise ValueError('`final_conv_layer` should be one of the following: xception, non_separable, vgg_separable, or vgg')
    
    if pooling == 'global_avg':
        x = layers.GlobalAveragePooling2D(name='global_avg_pool')(x)
    elif pooling == 'global_max':
        x = layers.GlobalMaxPooling2D(name='global_max_pool')(x)
    elif pooling == 'max':
        x = layers.MaxPooling2D((2,2), name='local_max_pool')(x)
        x = layers.Flatten(name='flatten')(x)
    elif pooling == 'avg':
        x = layers.AveragePooling2D((2,2), name='local_avg_pool')(x)
        x = layers.Flatten(name='flatten')(x)
    else:
        raise ValueError('`pooling` should be one of the following: global_avg, global_max, max, or avg')
        
    x = layers.Dense(num_classes, activation='softmax', name='prediction')(x)

    # Create model.
    model = tf.keras.Model(base_model.input, x, name='Xception')
    return model
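
The choice of `pooling` has a large effect on the size of the final `Dense` layer. The estimate below assumes the Keras Xception layout (two unpadded 3x3 convolutions, the first with stride 2, followed by four stride-2 stages with 'same' padding) and the 450x600 input used here. The helper names are illustrative, and `model.summary()` remains the authoritative source for the actual shapes.

```python
import math

def xception_feature_map(height, width):
    """Estimate the spatial size of Xception's final feature map."""
    dims = []
    for n in (height, width):
        n = (n - 3) // 2 + 1      # block1_conv1: 3x3, stride 2, no padding
        n = n - 2                 # block1_conv2: 3x3, stride 1, no padding
        for _ in range(4):        # four stride-2 stages with 'same' padding
            n = math.ceil(n / 2)
        dims.append(n)
    return tuple(dims)

def dense_params(n_features, num_classes):
    return n_features * num_classes + num_classes   # weights + biases

h, w = xception_feature_map(450, 600)
channels, num_classes = 2048, 7

# 'avg' / 'max': 2x2 local pooling, then flatten
local_features = (h // 2) * (w // 2) * channels
# 'global_avg' / 'global_max': one value per channel
global_features = channels

print('feature map:', (h, w, channels))
print('dense params with local pooling: ', dense_params(local_features, num_classes))
print('dense params with global pooling:', dense_params(global_features, num_classes))
```

Under these assumptions the default `pooling='avg'` gives a prediction layer with many times more parameters than the global variants, which is worth weighing against the size of the training set.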

In [None]:
model = build_model(base_model=conv_model, num_classes=len(CLASS_LABELS))

In [None]:
model.summary()

In [None]:
unfreeze_batch_norm(model)

In [None]:
print_layer_trainable(model)

## Train the model

In [None]:
epochs = 20
batch_size = 8
model_path='../serialized_models/model_X{epoch:02d}-{val_loss:.4f}.h5'

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=model_path, save_best_only=True),
    tf.keras.callbacks.EarlyStopping(patience=5)
]
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(
    x=data_gen(data=train_data, batch_size=batch_size),
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_data,
    steps_per_epoch=NUM_TRAIN // batch_size,
)
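
The placeholders in `model_path` are filled in by `ModelCheckpoint` using standard Python string formatting, with the epoch number and the logged metrics as keyword arguments. A sketch of the resulting file name for made-up values:

```python
model_path = '../serialized_models/model_X{epoch:02d}-{val_loss:.4f}.h5'

# Made-up values standing in for what Keras passes at the end of an epoch
example_name = model_path.format(epoch=3, val_loss=0.65432)
print(example_name)
```

Because `save_best_only=True` monitors `val_loss` by default, a new file is written only for epochs that improve on the best validation loss so far.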

## Evaluate model performance

Run model on a single image

In [None]:
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [None]:
test_num = 9

X_test_example = X_test[test_num]
y_test_example = y_test[test_num]

y_hat = model_predict(path=X_test_example, model=model)

print(f'Ground truth label: {y_test_example} \n Predicted label: {np.argmax(y_hat)} \t Probability: {np.max(y_hat)}')

Run model on the entire test set

In [None]:
test_data = create_model_file(X_path=X_test, y=y_test)

In [None]:
model.evaluate(x=test_data[0], y=test_data[1])
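
Because the classes are imbalanced, overall accuracy can hide poor performance on rare classes. A sketch of a confusion matrix and per-class recall with NumPy, on made-up labels (in the notebook, `y_true` would come from `np.argmax(test_data[1], axis=1)` and `y_pred` from `np.argmax(model.predict(test_data[0]), axis=1)`; the `confusion_matrix` helper here is a hand-rolled illustration):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=3)
# Recall per class: correct predictions divided by the number of true samples
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall)
```

A class with no samples in the test set would produce a division by zero here, so the row sums are worth checking first.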