## <font color=blue> Melanoma Image Classification using CNN </font> - by Sankalp Gupta

<font color=blue> Problem statement: </font> Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early. It accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

<font color=blue> Instructions: </font>
- Importing Skin Cancer Data
- To do: Take necessary actions to read the data
- Importing all the important libraries

#### <font color=blue> Step 1: </font> Import necessary libraries

In [None]:
import os, pathlib, glob, shutil

import numpy as np, pandas as pd

import matplotlib.pyplot as plt
from PIL import Image

import tensorflow as tf
from tensorflow import keras

<font color=blue> Notes: </font>
- This assignment uses a dataset of about 2357 images of skin cancer types.
- The Train and Test directories each contain 9 sub-directories.
- Each of the 9 sub-directories contains the images of one skin cancer type.

#### <font color=blue> Step 2: </font> Connect with the Image Data directory

In [None]:
data_dir = './melanomas'

train_data = data_dir + '/Train'
test_data  = data_dir + '/Test'

In [None]:
def create_folder_if_not_existing(f):
    # Create folder f (and any missing parent folders) if it is not already present
    if not os.path.exists(f):
        print('Folder ' + f + ' does not exist. Creating one')
        os.makedirs(f)
    else:
        print('Folder ' + f + ' exists')

In [None]:
#Create directories if they do not exist - This is done to prevent the program from crashing
create_folder_if_not_existing(data_dir)
create_folder_if_not_existing(train_data)
create_folder_if_not_existing(test_data)

In [None]:
data_dir_train = pathlib.Path(train_data)
data_dir_test  = pathlib.Path(test_data)

In [None]:
image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
image_count_test  = len(list(data_dir_test.glob('*/*.jpg')))

print("Number of Training Images: ", f"{image_count_train:>4}")
print("Number of Test     Images: ", f"{image_count_test:>4}")
print('-'*32)
print("Total Number of    Images: ", f"{(image_count_train + image_count_test):>4}")

#### <font color=blue> Step 3: </font> Load images using the image_dataset_from_directory utility from keras.utils

<font color=blue> Instructions: </font>
- Write your train dataset here
- Use 80% of the images for training and 20% for validation.
- Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
- Note: make sure you resize your images to the size img_height*img_width while writing the dataset
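
To make the role of the fixed seed and the 80/20 split concrete, here is a minimal, framework-free sketch. It is not the Keras implementation; the helper `split_paths` and the file names are hypothetical and only illustrate the idea of shuffling deterministically and holding out the last 20% for validation.

```python
import random

def split_paths(paths, validation_split=0.2, seed=123):
    """Shuffle paths deterministically, then hold out the last fraction for validation."""
    shuffled = sorted(paths)               # start from a stable order
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle every run
    n_val = int(validation_split * len(shuffled))
    n_train = len(shuffled) - n_val
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the training images
paths = [f"img_{i:04d}.jpg" for i in range(100)]
train_paths, val_paths = split_paths(paths)
print(len(train_paths), len(val_paths))
```

Because the training and validation datasets are created by two separate calls, both calls need the same seed and the same split fraction; otherwise the two subsets could overlap.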

In [None]:
from keras.utils import image_dataset_from_directory

In [None]:
batch_size = 32
img_height, img_width = 180, 180

In [None]:
#Creating Training Dataset - 80% of the images used for training
train_ds = image_dataset_from_directory(
    train_data,
    labels='inferred',
    color_mode='rgb',
    validation_split=0.2,
    subset='training',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

In [None]:
#Creating Validation Dataset - 20% of the images used for validation
val_ds = image_dataset_from_directory(
    train_data,
    labels='inferred',
    color_mode='rgb',
    validation_split=0.2,
    subset='validation',
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

<font color=blue> Instructions: </font>
- List out all the classes of skin cancer and store them in a list. 
- You can find the class names in the class_names attribute on these datasets. 
- These correspond to the directory names in alphabetical order.
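
As a rough illustration of how alphabetical directory names map to integer labels, the sketch below builds a few class folders in a temporary directory and sorts them. The helper `infer_class_names` and the three folder names are hypothetical; the real dataset has nine class folders.

```python
import tempfile
from pathlib import Path

def infer_class_names(root):
    """Return sub-directory names in alphabetical order; the list index is the label."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())

# Hypothetical class folders created in a temporary directory for illustration
with tempfile.TemporaryDirectory() as tmp:
    for name in ["nevus", "melanoma", "basal cell carcinoma"]:
        (Path(tmp) / name).mkdir()
    demo_class_names = infer_class_names(tmp)

print(demo_class_names)
```

In this sketch the label of an image is simply the position of its folder name in the sorted list, so `melanoma` would get label 1.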

In [None]:
class_names = train_ds.class_names
print(class_names)

#### <font color=blue> Step 4: </font> Visualize one sample image for each class

<font color=blue> Instructions: </font>
- Visualize the data
- Todo: Write code to visualize one instance of each of the nine classes present in the dataset.
- Your code goes here; you can use the training or the validation data for the visualization.

In [None]:
num_classes = len (class_names)
num_classes

In [None]:
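plt.figure(figsize=(16, 8))

# Unbatch so that each element is a single (image, label) pair; filtering whole
# batches on their first label could miss a class that never comes first in a batch
samples_ds = train_ds.unbatch()

for i in range(num_classes):
    class_ds = samples_ds.filter(lambda _, l: tf.math.equal(l, i))
    ax = plt.subplot(2, 5, i+1)

    for image, label in class_ds.take(1):
        plt.imshow(image.numpy().astype('uint8'))

        l = int(label.numpy())
        title_str = str(l+1) + ':' + class_names[l]
        plt.title(title_str)

        plt.axis('off')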

#### <font color=blue> Step 5: </font> Build CNN Model

<font color=blue> Notes: </font>
- The image_batch is a tensor of the shape (32, 180, 180, 3). This is a batch of 32 images of shape 180x180x3 (the last dimension refers to color channels RGB). The label_batch is a tensor of the shape (32,), these are corresponding labels to the 32 images.

- Dataset.cache() keeps the images in memory after they're loaded off disk during the first epoch.

- Dataset.prefetch() overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds   = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

<font color=blue> Instructions: </font>
- Create the model
- Todo: Create a CNN model that can accurately classify the 9 classes present in the dataset.
- Use the Rescaling layer (tensorflow.keras.layers.Rescaling, or layers.experimental.preprocessing.Rescaling in older TensorFlow versions) to normalize pixel values to the [0, 1] range.
- The RGB channel values are in the [0, 255] range.
- This is not ideal for a neural network. Here, it is good to standardize the values to be in the [0, 1] range.
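
The normalization described above is plain arithmetic. The NumPy sketch below, using a hypothetical 2x2 image, applies the same multiplication by 1/255 that a `Rescaling(1.0/255)` layer is configured with in the models further down.

```python
import numpy as np

# Hypothetical 2x2 RGB image with raw 8-bit pixel values
image = np.array([[[0, 128, 255], [64, 64, 64]],
                  [[255, 255, 255], [0, 0, 0]]], dtype=np.uint8)

# Multiplying by 1/255 maps the [0, 255] range onto [0, 1]
scaled = image.astype(np.float32) * (1.0 / 255)

print(scaled.min(), scaled.max())
```

Doing the scaling inside the model rather than in the input pipeline means the same normalization is applied automatically at inference time.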

#### <font color=blue> Import Keras libraries for CNN Model building </font>

In [None]:
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential, Model, load_model

In [None]:
from tensorflow.keras.layers import Rescaling
from tensorflow.keras.layers import Input, Add, Dropout, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten
from tensorflow.keras.layers import Conv2D, AveragePooling2D, MaxPooling2D, GlobalAveragePooling2D

#### <font color=blue> Define a function to visualize model results </font>

In [None]:
def visualize_results (history, epochs):
    epochs_range = range(epochs)
    plt.figure(figsize=(8, 8))

    acc      = history.history['accuracy']
    val_acc  = history.history['val_accuracy']

    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc,      label='Training Accuracy')
    plt.plot(epochs_range, val_acc,  label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.title('Training and Validation Accuracy')

    loss     = history.history['loss']
    val_loss = history.history['val_loss']

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss,     label='Training Loss')
    plt.plot(epochs_range, val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.title('Training and Validation Loss')

    plt.show()

#### <font color=blue> Model 1: </font> ResNet50

In [None]:
from tensorflow.keras.applications import ResNet50

In [None]:
# Get base ResNet50 model 
base_model = ResNet50(weights='imagenet', include_top=False)

# As we are using ResNet model only for feature extraction and not adjusting the weights
# we freeze the layers in base model
for layer in base_model.layers:
    layer.trainable = False
        
# Get base model output
base_model_output = base_model.output

# Add global average pooling to collapse each feature map to a single value
x = GlobalAveragePooling2D()(base_model_output)

# Adding fully connected layer
x = Dense(512, activation='relu')(x)
x = Dense(num_classes, activation='softmax', name='fcnew')(x)
    
model = Model(inputs=base_model.input, outputs=x)

In [None]:
#Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
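
For readers unfamiliar with the loss used in the compile call, here is a small NumPy sketch of a softmax output followed by sparse categorical cross-entropy on integer labels. The logits and labels are hypothetical, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability assigned to the true integer class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# Hypothetical logits for 2 samples and 3 classes, with integer labels
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]])
labels = np.array([0, 2])
probs = softmax(logits)
loss = sparse_categorical_crossentropy(labels, probs)
print(probs.sum(axis=1), loss)
```

The "sparse" variant fits this notebook because the datasets yield integer class indices rather than one-hot vectors.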

In [None]:
#View Model Summary
model.summary()

In [None]:
#Train the model
epochs = 20
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

In [None]:
#Visualize the model results
visualize_results (history, epochs)

#### <font color=blue> Model 2: </font> Simple CNN model with 4 CNN layer blocks, each with 2 Conv layers and 1 Maxpool

In [None]:
model = Sequential()
model.add(Rescaling(1.0/255, input_shape=(180,180,3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'Same', activation ='relu', input_shape = (180,180,3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
 
model.add(Conv2D(filters =128, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters =128, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(filters =256, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters =256, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(num_classes, activation = "softmax"))
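
As a quick sanity check on the architecture above, the following sketch traces the feature-map size through the four blocks. It assumes 180x180 inputs, size-preserving 3x3 convolutions, and 2x2 max pooling that halves the spatial size and rounds down.

```python
# Spatial size after each 2x2 max-pooling step, starting from 180x180 inputs.
# The padded 3x3 convolutions keep the spatial size; pooling halves it (rounding down).
size = 180
filters_per_block = [32, 64, 128, 256]
for filters in filters_per_block:
    size = size // 2
    print(f"after block with {filters} filters: {size}x{size}x{filters}")

flattened = size * size * filters_per_block[-1]
dense_params = flattened * 512 + 512  # weights plus biases of Dense(512)
print(flattened, dense_params)
```

Most of the trainable parameters sit in the first dense layer after `Flatten`, which is worth keeping in mind when looking for signs of overfitting.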

In [None]:
#Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

In [None]:
#View Model Summary
model.summary()

In [None]:
#Train the model
epochs = 20
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

In [None]:
#Visualize the model results
visualize_results (history, epochs)

#### <font color=blue> Model 3: </font> Add Dropouts - to remove overfitting
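
To show what a dropout layer does during training, here is a minimal NumPy sketch of inverted dropout. The function `dropout`, the rate, and the activations are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
activations = np.ones((4, 8))       # hypothetical activations
dropped = dropout(activations, rate=0.25, rng=rng)
print(dropped)
```

Rescaling the surviving units keeps the expected activation unchanged, so no adjustment is needed at inference time, when dropout is switched off.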

In [None]:
model = Sequential()
model.add(Rescaling(1.0/255, input_shape=(180,180,3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'Same', activation ='relu', input_shape = (180,180,3)))
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))  # dropout rates in this model are assumed values, not tuned

model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters =128, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters =128, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters =256, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(Conv2D(filters =256, kernel_size = (3,3), padding = 'Same', activation ='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation = "softmax"))

In [None]:
#Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

In [None]:
#View Model Summary
model.summary()

In [None]:
#Train the model
epochs = 20
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

In [None]:
#Visualize the model results
visualize_results (history, epochs)

<font color=blue> Instructions: </font>
- Todo: Write your findings after the model fit and check whether there is evidence of overfitting or underfitting.

Write your findings here

- Todo: After you have analysed the model fit history for the presence of underfitting or overfitting, choose an appropriate data augmentation strategy.
- Your code goes here
- Todo: Visualize how your augmentation strategy works for one instance of a training image.
- Your code goes here
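
One possible augmentation strategy is sketched below in plain NumPy: random horizontal flips and rotations by multiples of 90 degrees. This is an assumption about what might suit the data, on the reasoning that a lesion has no canonical orientation; the function `augment` and the tiny test image are hypothetical.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and rotated copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                 # horizontal flip
    quarter_turns = int(rng.integers(0, 4))      # 0, 90, 180 or 270 degrees
    return np.rot90(image, k=quarter_turns, axes=(0, 1))

rng = np.random.default_rng(123)
image = np.arange(4 * 4 * 3).reshape(4, 4, 3)    # hypothetical tiny square image
augmented = augment(image, rng)
print(augmented.shape)
```

Both operations only rearrange pixels, so the augmented image contains exactly the same values as the original and the class label is unchanged.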

Todo: Write your findings after the model fit and check whether there is evidence of overfitting or underfitting. Is there any improvement compared with the previous model run?

Todo: Find the distribution of classes in the training dataset.

Context: Real-life datasets often have class imbalance, where one class has a proportionately higher number of samples than the others. Class imbalance can have a detrimental effect on the final model quality, so as a sanity check it is important to examine the distribution of classes in the data.

Your code goes here.

Todo: Write your findings here:
- Which class has the least number of samples?
- Which classes dominate the data in terms proportionate number of samples?
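
A minimal way to obtain the class distribution is to count the image files in each class folder. The sketch below does this with the standard library on a hypothetical miniature dataset; the helper `class_distribution` and the folder contents are made up for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

def class_distribution(train_root):
    """Count .jpg files in each class sub-directory of train_root."""
    return Counter(p.parent.name for p in Path(train_root).glob('*/*.jpg'))

# Hypothetical miniature dataset built in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name, n_images in [("melanoma", 3), ("nevus", 5), ("dermatofibroma", 1)]:
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        for i in range(n_images):
            (class_dir / f"{i}.jpg").touch()
    counts = class_distribution(tmp)

for name, n in counts.most_common():
    print(f"{name:<16}{n:>4}")
```

In the notebook, the same helper could be pointed at the training directory to answer the two questions above.
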
Todo: Rectify the class imbalance

Context: You can use a Python package known as Augmentor (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes has very few samples.

#### <font color=blue> Model 4: </font> Use Augmentor

In [None]:
!pip install Augmentor

To use Augmentor, the following general procedure is followed:

- Instantiate a Pipeline object pointing to a directory containing your initial image data set.
- Define a number of operations to perform on this data set using your Pipeline object.
- Execute these operations by calling the Pipeline’s sample() method.

In [None]:
path_to_training_dataset="To do"

In [None]:
import Augmentor
for i in class_names:
    # Join with os.path.join so that a missing trailing separator does not break the path
    p = Augmentor.Pipeline(os.path.join(path_to_training_dataset, i))
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(500)  # Add 500 samples per class so that none of the classes is sparse

Augmentor has stored the augmented images in the output sub-directory of each of the skin cancer type sub-directories. Let's take a look at the total count of augmented images.

In [None]:
image_count_train = len(list(data_dir_train.glob('*/output/*.jpg')))
print(image_count_train)

Let's see the distribution of classes after adding the augmented images to the original training data.

In [None]:
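# Paths and labels of the augmented images written by Augmentor
path_list_new = glob.glob(os.path.join(data_dir_train, '*', 'output', '*.jpg'))
lesion_list_new = [os.path.basename(os.path.dirname(os.path.dirname(p))) for p in path_list_new]
len(path_list_new)

In [None]:
# Paths and labels of the original (non-augmented) training images
path_list = glob.glob(os.path.join(data_dir_train, '*', '*.jpg'))
lesion_list = [os.path.basename(os.path.dirname(p)) for p in path_list]
original_df = pd.DataFrame({'Path': path_list, 'Label': lesion_list})

df2 = pd.DataFrame({'Path': path_list_new, 'Label': lesion_list_new})

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
new_df = pd.concat([original_df, df2], ignore_index=True)
new_df['Label'].value_counts()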

We have now added 500 images to every class to reduce the class imbalance. More images can be added if needed to improve the training process.

Todo: Train the model on the data created using Augmentor

In [None]:
#Creating Training Dataset - 80% of the images used for training
data_dir_train="path to directory with training data + data created using augmentor"
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    seed=123,
    validation_split = 0.2,
    subset='training',
    image_size=(img_height, img_width),
    batch_size=batch_size)

In [None]:
#Creating Validation Dataset - 20% of the images used for validation
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    seed=123,
    validation_split = 0.2,
    subset='validation',
    image_size=(img_height, img_width),
    batch_size=batch_size)

#### <font color=blue> Model 5: </font> Add Batch Normalization
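
For orientation, here is a NumPy sketch of what batch normalization computes in training mode: each channel is normalized over the batch and spatial axes, then scaled and shifted. The function `batch_norm`, the value of `eps`, and the random activations are assumptions for illustration and leave out the moving averages a real layer tracks for inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    """Normalize each channel over the batch and spatial axes, then scale and shift."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4, 4, 2))  # hypothetical (batch, H, W, C) activations
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(y.mean(axis=(0, 1, 2)), y.std(axis=(0, 1, 2)))
```

After normalization each channel has a mean close to 0 and a standard deviation close to 1, which tends to make training with plain SGD less sensitive to the scale of earlier layers.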

In [None]:
#Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

In [None]:
#View Model Summary
model.summary()

In [None]:
#Train the model
epochs = 30
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

In [None]:
#Visualize the model results
visualize_results (history, epochs)

#### <font color=blue> Model 6: </font> Increase epochs to 50

In [None]:
#Train the model
epochs = 50
history = model.fit(train_ds, validation_data=val_ds, epochs=epochs)

In [None]:
#Visualize the model results
visualize_results (history, epochs)