## Problem statement:

Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early; it accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

### Importing Skin Cancer Data
#### To do: Take necessary actions to read the data

### Importing all the important libraries

In [None]:
# Libraries for read/write operations
import os
import glob
import pathlib

# Libraries for calculations
import numpy as np
import pandas as pd

# Libraries for graphical visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# libraries for machine learning
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

In [None]:
# If you are using the data by mounting Google Drive, uncomment and use the following:
# from google.colab import drive
# drive.mount('/content/gdrive')

## Ref: https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166

This assignment uses a dataset of about 2357 images of skin cancer types. The Train and Test directories each contain 9 sub-directories, one for each of the 9 skin cancer types.

In [None]:
# I've used my local Windows-based system for the assignment. Kindly update the path in the variable 'local_dir_path' as required during evaluation.
# Please don't add Train and Test to the path below; they are appended by the logic in the following cell.

dataset_dir_path = r"datasets"
local_dir_path = os.path.join(dataset_dir_path, "Skin cancer ISIC The International Skin Imaging Collaboration")

In [None]:
# Defining the path for train and test images

data_dir_train = pathlib.Path(f"{local_dir_path}/Train")
data_dir_test = pathlib.Path(f"{local_dir_path}/Test")

In [None]:
# List the count of images within the Train and Test Files respectively

image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
print(image_count_train)

image_count_test = len(list(data_dir_test.glob('*/*.jpg')))
print(image_count_test)

### Load using keras.preprocessing

Let's load these images from disk using the helpful `image_dataset_from_directory` utility.

### Create a dataset

Define some parameters for the loader:

In [None]:
batch_size = 32
img_height = 180
img_width = 180

Use 80% of the images for training, and 20% for validation.

In [None]:
# Training Dataset logic
# Initialized 'seed = 123' while creating the dataset using tf.keras.preprocessing.image_dataset_from_directory as per instructions in the Starter Notebook

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    seed = 123,
    validation_split = 0.2,  # 80% sample for training and 20% for validation
    subset = 'training',
    image_size = (img_height, img_width),
    batch_size = batch_size
)

In [None]:
# Validation Dataset logic
# Initialized 'seed = 123' while creating the dataset using tf.keras.preprocessing.image_dataset_from_directory as per instructions in the Starter Notebook

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train,
    seed = 123,
    validation_split = 0.2,
    subset = 'validation',
    image_size = (img_height, img_width),
    batch_size = batch_size
)
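
How many images land in each subset can be estimated with a little arithmetic. The sketch below assumes the loader reserves `int(validation_split * n)` files for validation and gives the rest to training, which is how I understand Keras to behave; the helper name `split_counts` and the count of 1,000 files are hypothetical and used only for illustration.

```python
def split_counts(n_files: int, validation_split: float) -> tuple:
    """Return (n_train, n_val) for a directory holding n_files images.

    Assumption: the validation subset gets int(validation_split * n_files)
    files and the training subset gets everything else.
    """
    n_val = int(validation_split * n_files)
    return n_files - n_val, n_val


# Hypothetical directory of 1,000 images with the 80/20 split used above.
n_train, n_val = split_counts(1000, 0.2)
print(f"training: {n_train}, validation: {n_val}")
```

Because both loader calls pass the same `seed` and `validation_split`, the two subsets should not overlap.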

In [None]:
# List all the class names. These correspond to the directory names in alphabetical order.

class_names = train_ds.class_names
print(class_names)
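
The link between directory names and integer labels can be imitated without TensorFlow. This is only a sketch of the idea, not the loader's actual implementation: the helper `infer_class_names` and the three directory names are hypothetical, and the folders are created in a temporary directory just for the demonstration.

```python
import pathlib
import tempfile


def infer_class_names(root: pathlib.Path) -> list:
    """Return sub-directory names in alphabetical order (label i -> names[i])."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Hypothetical directory layout, created in a temporary folder for the demo.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    for name in ["nevus", "melanoma", "vascular lesion"]:
        (root / name).mkdir()
    demo_class_names = infer_class_names(root)

# The integer label of a class is its position in the sorted list.
label_of = {name: index for index, name in enumerate(demo_class_names)}
print(demo_class_names)
print(label_of)
```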

### Visualize the data

In [None]:
plt.figure(figsize=(10, 10))

for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

An **image batch** is a collection of images stacked together along the batch dimension. In **machine learning and deep learning**, models often process data in batches rather than individual samples.

For example, in a batch of images, each image might have dimensions **(height, width, channels)**, and if the batch size is **32**, the shape of the batch tensor would be **(32, height, width, channels)**. This allows the model to process multiple images simultaneously, which can improve efficiency and speed during training or inference.
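
The batch shape described above can be reproduced with NumPy alone. This is a minimal sketch using stand-in images filled with zeros; the `demo_` names are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

height, width, channels = 180, 180, 3
demo_batch_size = 32

# Stand-in images filled with zeros; real images would hold pixel values.
demo_images = [
    np.zeros((height, width, channels), dtype=np.float32)
    for _ in range(demo_batch_size)
]

# Stacking along a new leading axis produces the batch tensor.
demo_batch = np.stack(demo_images, axis=0)
print(demo_batch.shape)  # (32, 180, 180, 3)
```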

In [None]:
for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break

From the above cell output, it's clear that it is an image batch of **32 images** of shape **180 x 180 x 3**.
The last dimension refers to the color channels **RGB (Red, Green, Blue)**.

**AUTOTUNE** is a special value that allows TensorFlow to automatically tune the prefetch buffer size dynamically at runtime based on the available memory and other factors. 
This can help optimize the performance of your input pipeline without manually tuning the buffer size.

**Dataset.cache** keeps the images in memory after they're loaded off disk during the first epoch. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache.

**Dataset.prefetch** overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE  # non-experimental alias of tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Create the model
Create a CNN model that can accurately classify the 9 classes present in the dataset.
Use `tf.keras.layers.Rescaling` to normalize pixel values to the `[0, 1]` range.
The RGB channel values are in the `[0, 255]` range, which is not ideal for a neural network.
Here, it is good practice to standardize the values to be in the `[0, 1]` range.
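
What the rescaling step does to pixel values can be sketched with NumPy. The 2x2 image below is hypothetical, and the sketch divides by 255 rather than multiplying by `1. / 255` as the layer does; the two should agree up to floating-point rounding.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values.
pixels = np.array(
    [[[0, 128, 255], [64, 64, 64]],
     [[255, 255, 255], [10, 20, 30]]],
    dtype=np.float32,
)

# Map the [0, 255] range onto [0, 1].
scaled = pixels / 255.0
print(scaled.min(), scaled.max())
```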

In [None]:
# Total classes are 9
num_classes = 9

model = Sequential(
    [
        tf.keras.layers.Rescaling(1. / 255, input_shape = (img_height, img_width, 3)),
        
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Flatten(),
        
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes)
    ]
)
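
As a rough sanity check on the architecture above, the parameter count can be worked out by hand. The sketch assumes the Keras defaults as I understand them (2x2 max pooling with stride 2 and no padding, plus a bias term in every layer); the helper names are hypothetical. If those assumptions hold, the total should agree with what `model.summary()` reports.

```python
def conv2d_params(kernel: int, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(in_features: int, units: int) -> int:
    """Weights plus one bias per unit for a Dense layer."""
    return (in_features + 1) * units


size, channels = 180, 3          # input height/width and RGB channels
total = 0
for filters in (16, 32, 64):
    total += conv2d_params(3, channels, filters)
    channels = filters           # 'same' padding keeps height and width
    size //= 2                   # assumed 2x2 max pooling halves them (floor)

flat = size * size * channels    # length of the Flatten output
total += dense_params(flat, 128) + dense_params(128, 9)

print(size, flat, total)
```

Almost all of the parameters sit in the first `Dense` layer, which is one reason a model like this can memorize a small training set.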

### Compile the model
Choose an appropriate optimizer and loss function for model training.

In [None]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
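
Because the last `Dense` layer has no activation, the model outputs raw scores (logits), and `from_logits=True` tells the loss to apply softmax itself. A minimal pure-Python sketch of that computation for a single sample follows; the function names and the logits are hypothetical, and this is not Keras's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(softmax(logits)[label])


# Hypothetical logits for a 3-class problem; the true class is index 0.
demo_logits = [2.0, 0.5, -1.0]
demo_loss = sparse_categorical_crossentropy(demo_logits, 0)
print(round(demo_loss, 4))
```

A model that spreads its probability evenly over 9 classes would score `ln(9)`, roughly 2.2, which is a handy baseline when reading the loss curves.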

In [None]:
# View the summary of all layers
model.summary()

### Train the model

In [None]:
%%time

epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs
)

### Visualize the results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Write your findings after the model fit, and see if there is evidence of model overfit or underfit

**Observations and Findings**

1. The model's training accuracy rose steadily at first, dipped slightly at the **$14^{th}$ epoch**, and then rose steadily again up to **88%**.
2. The model's validation accuracy fluctuated: it first rose, then declined slightly, and then increased up to **54%**.
3. The model's training loss declined steadily.
4. The model's validation loss follows a U-shaped curve: it declined first and then increased.
5. The model's high training accuracy and low validation accuracy indicate overfitting: the model has captured noise and incidental details in the training data.
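
The same signals can be read from the numbers rather than the plots. The sketch below uses a hypothetical helper, `summarize_history`, and made-up values that only imitate the pattern described in the findings; they are not the results of this run.

```python
def summarize_history(history: dict) -> dict:
    """Report the epoch with the lowest validation loss and the final accuracy gap."""
    val_loss = history["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history["accuracy"][-1] - history["val_accuracy"][-1]
    # Epochs are reported 1-based, as Keras prints them during training.
    return {"best_epoch": best_index + 1, "final_accuracy_gap": gap}


# Made-up numbers that only imitate an overfitting pattern.
demo_history = {
    "accuracy": [0.30, 0.50, 0.65, 0.80, 0.88],
    "val_accuracy": [0.28, 0.45, 0.52, 0.55, 0.54],
    "val_loss": [1.90, 1.55, 1.40, 1.60, 1.95],
}
demo_summary = summarize_history(demo_history)
print(demo_summary)
```

In the notebook, the same helper could be called as `summarize_history(history.history)`. A large positive gap together with a validation loss that bottoms out early is the overfitting pattern noted above.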

As a result, we'll need to modify the existing training data using data augmentation techniques, which involve adjusting the data slightly (rotation, flipping, zooming in/out, etc.), and then train the model again.

### Augment the Training Data

In [None]:
augmented_data = keras.Sequential(
    [
        # Non-experimental layer names, consistent with tf.keras.layers.Rescaling used in the model
        layers.RandomFlip("horizontal_and_vertical", input_shape = (img_height, img_width, 3)),
        layers.RandomRotation(0.2, fill_mode = 'reflect'),
        layers.RandomZoom(0.2, fill_mode = 'reflect')
    ]
)
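
To make the flipping step concrete, here is a tiny NumPy sketch of what a horizontal and a vertical flip do to an image array. It only illustrates the geometry; the 2x3 single-channel "image" is hypothetical, and the Keras layer additionally decides at random whether to apply each flip.

```python
import numpy as np

# Tiny hypothetical "image": 2 rows x 3 columns, single channel for readability.
image = np.arange(6).reshape(2, 3)

flipped_horizontally = np.fliplr(image)  # mirror left <-> right
flipped_vertically = np.flipud(image)    # mirror top <-> bottom

print(image)
print(flipped_horizontally)
print(flipped_vertically)
```

The pixel values are unchanged and only their positions move, so the class label of the image stays the same.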

### Visualize the Augmented Data

In [None]:
# Visualize how the augmentation strategy transforms nine images from one training batch.

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    # training=True keeps the random layers active outside of model.fit
    augmented_images = augmented_data(images, training=True)
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(augmented_images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

### Create the model using Augmented Data

In [None]:
model = Sequential(
    [
        augmented_data,
        tf.keras.layers.Rescaling(1. / 255, input_shape = (img_height, img_width, 3)),
        
        tf.keras.layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        
        tf.keras.layers.Flatten(),
        
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(num_classes)
    ]
)

### Compile the Model

In [None]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

### Train the Model

In [None]:
%%time

history = model.fit(
    train_ds,
    validation_data = val_ds,
    epochs = epochs    # Declared already in earlier coding steps so using same value here
)

### Visualize the results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Write your findings after the model fit, see if there is an evidence of model overfit or underfit. Do you think there is some improvement now as compared to the previous model run?

**Observations and Findings**

1. The model's training and validation accuracy after data augmentation now follow similar trends.
2. The model's training loss declined steadily.
3. The model's validation loss is higher than its training loss.
4. The model's training and validation accuracies are both low, so the model is underfitting.

As a result we'll now try another approach to check if these results can be improved.

#### Find the distribution of classes in the training dataset.
#### **Context:** Real-life datasets often have class imbalance: one class can have a proportionately higher number of samples than the others. Class imbalance can have a detrimental effect on the final model quality. Hence, as a sanity check, it is important to inspect the distribution of classes in the data.

In [None]:
path_list = []
class_labels_list = []

In [None]:
required_images = os.path.abspath(os.path.join(data_dir_train, '**/*.jpg'))
list_of_images = glob.glob(required_images, recursive = True)

In [None]:
# The parent directory name of each image is its class label
for image in list_of_images:
    class_label = os.path.basename(os.path.dirname(image))
    class_labels_list.append(class_label)
    path_list.append(image)

In [None]:
class_labels_path_df = pd.DataFrame(
    dict(
        class_label = class_labels_list,
        image_path = path_list
    )
)
class_labels_path_df

### Visualize the Class Distribution

In [None]:
# Pie chart to visualize the percentage-wise class distribution

plt.figure(figsize = (7, 7))
class_distribution_value_counts = class_labels_path_df.class_label.value_counts(ascending = True)
plt.pie(class_distribution_value_counts.values, labels = class_distribution_value_counts.index, autopct='%.2f%%', startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.title('Class Distribution', pad=30, loc='center')
plt.show()

In [None]:
# Seaborn Count Plot to visualize class distribution

plt.figure(figsize = (10, 10))
sns.countplot(y = "class_label", data = class_labels_path_df, order=class_distribution_value_counts.index, palette = "Set1", hue = "class_label")
plt.xlabel('Count of Images in a particular Class type', fontsize = 10)
plt.ylabel('Class types', fontsize = 10)
plt.title('Count of Images belonging to a class Type v/s Type of class', fontsize = 10)
plt.show()

#### Write your findings here: 
#### - Which class has the least number of samples?
#### - Which classes dominate the data in terms of the proportionate number of samples?

**Observations and Findings:**
- Based upon the above visualizations we can see that there's a clear case of class imbalance.
- **seborrheic keratosis** class has the least number of samples (**3.44%**).
- **pigmented benign keratosis** with **20.63%** dominates the classes followed by **melanoma** with **19.56%**.
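
The percentages quoted above are each class's count divided by the total number of images. A minimal sketch of that calculation follows; the helper `class_shares` and the toy label list are hypothetical and do not reflect the real dataset counts.

```python
from collections import Counter


def class_shares(labels):
    """Map each class label to its percentage share of all samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


# Hypothetical labels, not the real dataset counts.
demo_labels = ["melanoma"] * 5 + ["nevus"] * 4 + ["seborrheic keratosis"] * 1
demo_shares = class_shares(demo_labels)

# The class with the smallest share is the one most in need of extra samples.
rarest = min(demo_shares, key=demo_shares.get)
print(demo_shares, rarest)
```

In the notebook, the same helper could be applied to `class_labels_path_df.class_label`.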

#### Rectify the class imbalance
#### **Context:** You can use a python package known as `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes have very few samples.

In [None]:
!pip install Augmentor

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.
2. Define a number of operations to perform on this data set using your `Pipeline` object.
3. Execute these operations by calling the `Pipeline` object's `sample()` method.

In [None]:
import Augmentor

path_to_training_dataset = str(data_dir_train) + "/"

for class_name in class_names:
    # One pipeline per class directory; augmented images are saved as JPEG
    p = Augmentor.Pipeline(path_to_training_dataset + class_name, save_format='jpg')
    p.rotate(probability = 0.7, max_left_rotation = 10, max_right_rotation = 10)
    # Add 500 samples per class to make sure that none of the classes are sparse
    p.sample(500)