# Problem statement: 
Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early; it accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

### Importing Skin Cancer Data
#### To do: Take necessary actions to read the data

### Importing all the important libraries

In [None]:
import pathlib
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import PIL
import cv2
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import glob

In [None]:
# Please set this flag to True if mounting google drive else set to False
mounting_from_gdrive = False

In [None]:
## If you are using the data by mounting the google drive, use the following :
#from google.colab import drive
#drive.mount('/content/gdrive', force_remount=True)
#mounting_from_gdrive = True

This assignment uses a dataset of about 2357 images of skin cancer types. The `Train` and `Test` directories each contain 9 sub-directories, one per skin cancer type, and each sub-directory holds the images of that type.
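As a quick sanity check of that layout, the sketch below compares the class folders found under two split directories. The helper name `class_folders` and the tiny temporary directory tree are illustrative assumptions, not part of the assignment; a real check would point at the actual `Train` and `Test` folders.

```python
import pathlib
import tempfile

def class_folders(split_dir):
    """Return the sorted names of the immediate sub-directories of split_dir."""
    return sorted(p.name for p in pathlib.Path(split_dir).iterdir() if p.is_dir())

# Hypothetical miniature dataset, built only to demonstrate the check.
with tempfile.TemporaryDirectory() as root:
    root = pathlib.Path(root)
    for split in ("Train", "Test"):
        for name in ("melanoma", "nevus"):
            (root / split / name).mkdir(parents=True)
    train_classes = class_folders(root / "Train")
    test_classes = class_folders(root / "Test")

# True when both splits expose exactly the same class folders.
print(train_classes == test_classes)
```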

In [None]:
# Defining the path for train and test images
## Todo: Update the paths of the train and test dataset
if mounting_from_gdrive:
  path_to_dataset = "gdrive/My Drive/Colab Notebooks/Skin cancer ISIC The International Skin Imaging Collaboration/"
else:
  path_to_dataset = "./Skin cancer ISIC The International Skin Imaging Collaboration/"
#path_to_dataset = "./Skin cancer ISIC The International Skin Imaging Collaboration/"
path_to_training_dataset = path_to_dataset + "Train"
path_to_test_dataset = path_to_dataset + "Test"

data_dir_train = pathlib.Path(path_to_training_dataset)
data_dir_test = pathlib.Path(path_to_test_dataset)

In [None]:
image_count_train = len(list(data_dir_train.glob('*/*.jpg')))
print("train:", image_count_train)
image_count_test = len(list(data_dir_test.glob('*/*.jpg')))
print("test:", image_count_test)
print("test + train:", image_count_train + image_count_test)

### Load using keras.preprocessing

Let's load these images off disk using the helpful image_dataset_from_directory utility.

### Create a dataset

Define some parameters for the loader:

In [None]:
batch_size = 32
img_height = 180
img_width = 180

Use 80% of the images for training, and 20% for validation.
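The split sizes follow from simple arithmetic. The sketch below assumes the loader holds out `int(validation_split * n)` files for validation and keeps the rest for training; that is my understanding of the Keras utility rather than something stated in this notebook. The helper `split_sizes` and the image count of 1000 are hypothetical.

```python
def split_sizes(n_images, validation_split=0.2):
    """Return (training count, validation count) for a fractional hold-out."""
    # Assumption: the validation count is truncated to an integer.
    n_val = int(validation_split * n_images)
    return n_images - n_val, n_val

# 1000 is a made-up image count used only for illustration.
n_train, n_val = split_sizes(1000)
print(n_train, n_val)
```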

In [None]:
## Write your train dataset here
## Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
## Note: make sure you resize your images to img_height x img_width while writing the dataset
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train, labels='inferred', batch_size=batch_size,
    image_size=(img_height, img_width), seed=123,
    validation_split=0.2, subset='training')

In [None]:
## Write your validation dataset here
## Note: use seed=123 while creating your dataset using tf.keras.preprocessing.image_dataset_from_directory
## Note: make sure you resize your images to img_height x img_width while writing the dataset
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_train, labels='inferred', batch_size=batch_size,
    image_size=(img_height, img_width), seed=123,
    validation_split=0.2, subset='validation')

In [None]:
# The test set is loaded whole, without a validation split.
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    data_dir_test, labels='inferred', batch_size=batch_size,
    image_size=(img_height, img_width))

In [None]:
# List out all the classes of skin cancer and store them in a list. 
# You can find the class names in the class_names attribute on these datasets. 
# These correspond to the directory names in alphabetical order.
class_names = train_ds.class_names
print(class_names)
num_classes = len(class_names)

### Visualize the data
#### Todo, create a code to visualize one instance of all the nine classes present in the dataset

In [None]:
# Collect the first image encountered for each class label.
# You can use the training or validation data to visualize.
class_to_img = {}
for images, labels in train_ds:
  for count, class_value in enumerate(labels.numpy()):
    class_value = int(class_value)
    if class_value not in class_to_img:
      class_to_img[class_value] = images[count]

  # Stop as soon as every class has one example.
  if len(class_to_img) == num_classes:
    break

print(sorted(class_to_img.keys()))

In [None]:
plt.figure(figsize=(10,10))
for i in range(9):
  img_data = class_to_img[i]
  plt.subplot(3,3, i+1)
  plt.imshow(np.float32(img_data/255))
  plt.title(class_names[i])
  plt.tight_layout()

Each element of `train_ds` is an `(image_batch, label_batch)` pair. The `image_batch` is a tensor of shape `(32, 180, 180, 3)`: a batch of 32 images of shape `180x180x3` (the last dimension holds the RGB color channels). The `label_batch` is a tensor of shape `(32,)` containing the labels that correspond to the 32 images.
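The batch dimension can be worked out by hand. This is a minimal sketch, assuming the loader does not drop the remainder, so the final batch may hold fewer than 32 images. The helper `batch_layout` and the image count of 100 are hypothetical.

```python
import math

def batch_layout(n_images, batch_size=32):
    """Return (number of batches, size of the final batch)."""
    if n_images <= 0:
        return 0, 0
    n_batches = math.ceil(n_images / batch_size)
    # Every batch is full except possibly the last one.
    last_batch = n_images - (n_batches - 1) * batch_size
    return n_batches, last_batch

# 100 is a made-up image count used only for illustration.
layout = batch_layout(100)
print(layout)
```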

`Dataset.cache()` keeps the images in memory after they're loaded off disk during the first epoch.

`Dataset.prefetch()` overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Function to plot accuracy and loss

In [None]:
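def accuracyandvalidationplot(history):
  # Plot training vs. validation accuracy and loss from a Keras History object.
  acc = history.history['accuracy']
  val_acc = history.history['val_accuracy']

  loss = history.history['loss']
  val_loss = history.history['val_loss']

  # Derive the x-axis from the history itself rather than the global `epochs`.
  epochs_range = range(len(acc))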

  plt.figure(figsize=(8, 8))
  plt.subplot(1, 2, 1)
  plt.plot(epochs_range, acc, label='Training Accuracy')
  plt.plot(epochs_range, val_acc, label='Validation Accuracy')
  plt.legend(loc='lower right')
  plt.title('Training and Validation Accuracy')

  plt.subplot(1, 2, 2)
  plt.plot(epochs_range, loss, label='Training Loss')
  plt.plot(epochs_range, val_loss, label='Validation Loss')
  plt.legend(loc='upper right')
  plt.title('Training and Validation Loss')
  plt.show()

### Create the model
#### Todo: Create a CNN model that can accurately detect the 9 classes present in the dataset. Use ```layers.experimental.preprocessing.Rescaling``` to normalize pixel values to the range (0,1). The RGB channel values are in the `[0, 255]` range, which is not ideal for a neural network; here it is good to standardize the values to the `[0, 1]` range.
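The rescaling itself is a single multiplication by `1/255`. The sketch below is a plain-Python illustration of that arithmetic, not the Keras layer; the function name `rescale` is hypothetical.

```python
def rescale(pixel, scale=1.0 / 255):
    """Map a channel value from the [0, 255] range to the [0, 1] range."""
    return pixel * scale

# Darkest, mid-grey and brightest channel values.
rescaled = [round(rescale(v), 3) for v in (0, 128, 255)]
print(rescaled)
```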

In [None]:
### Your code goes here
model1 = Sequential([
  layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width, 3)),

  layers.Conv2D(32, 2, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes, activation='softmax')
])



### Compile the model
Choose an appropriate optimiser and loss function for model training.
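Because the labels are integer class indices and the last layer applies softmax, sparse categorical cross-entropy with `from_logits=False` is a natural fit. For one sample the loss is the negative log of the probability assigned to the true class. The sketch below is a hand-rolled illustration with made-up probabilities, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: -log of the probability given to the true class."""
    return -math.log(probs[label])

# Made-up softmax output for a 3-class problem.
probs = [0.7, 0.2, 0.1]
loss_confident = sparse_categorical_crossentropy(probs, 0)  # true class got 0.7
loss_wrong = sparse_categorical_crossentropy(probs, 2)      # true class got 0.1
print(loss_confident, loss_wrong)
```

The loss grows quickly as the probability given to the true class shrinks, which is what pushes the network toward confident, correct predictions.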

In [None]:
### Todo, choose an appropriate optimiser and loss function
model1.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

In [None]:
# View the summary of all layers
model1.summary()

### Train the model

In [None]:
epochs = 20
history1 = model1.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### Visualizing training results

In [None]:
accuracyandvalidationplot(history1)

In [None]:
model1.evaluate(test_ds)

#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit

The training accuracy keeps increasing with every epoch, while the validation accuracy plateaus quickly. After 20 epochs the training accuracy is therefore very high and the validation accuracy is very low. This clearly indicates overfitting.
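One way to put a number on that observation is the gap between the final training and validation accuracy. The sketch below assumes a dictionary shaped like `history.history`; the helper name `generalization_gap` and the accuracy values are made up for illustration and are not results from this notebook.

```python
def generalization_gap(history_dict):
    """Final-epoch training accuracy minus final-epoch validation accuracy."""
    return history_dict["accuracy"][-1] - history_dict["val_accuracy"][-1]

# Made-up curves: training keeps climbing while validation plateaus.
made_up_history = {
    "accuracy": [0.40, 0.70, 0.90],
    "val_accuracy": [0.38, 0.50, 0.52],
}
gap = generalization_gap(made_up_history)
print(round(gap, 2))
```

A large positive gap points to overfitting; a small gap with low accuracy on both curves points to underfitting.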

#### Augmentation strategy: adding images by flipping, rotation and zoom
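The rotation factor is easier to read in degrees. As I understand the Keras documentation, the `RandomRotation` factor is a fraction of a full turn (2π), applied in both directions. The helper `rotation_range_degrees` below is a hypothetical illustration of that conversion.

```python
import math

def rotation_range_degrees(factor):
    """Convert a rotation factor (fraction of 2*pi) to a (min, max) range in degrees."""
    max_degrees = math.degrees(factor * 2 * math.pi)
    return -max_degrees, max_degrees

# A factor of 0.1 is one tenth of a full turn in each direction.
low, high = rotation_range_degrees(0.1)
print(round(low, 1), round(high, 1))
```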

In [None]:
# Todo, after you have analysed the model fit history for presence of underfit or overfit, choose an appropriate data augmentation strategy.
data_augmentation = keras.Sequential(
  [
    layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical", 
                                                 input_shape=(img_height, 
                                                              img_width,
                                                              3), seed=123),
    layers.experimental.preprocessing.RandomRotation(0.1, seed=123),
    layers.experimental.preprocessing.RandomZoom(0.1,0.1, seed=123),
  ]
)

In [None]:
# Todo, visualize how your augmentation strategy works for one instance of training image.
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
  for i in range(9):
    augmented_images = data_augmentation(images)
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(augmented_images[0].numpy().astype("uint8"))
    plt.axis("off")



### Todo:
### Create the model, compile and train the model


In [None]:
model2 = Sequential([
  data_augmentation,
  layers.experimental.preprocessing.Rescaling(1./255),

  layers.Conv2D(32, 2, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes, activation='softmax')
])

### Compiling the model

In [None]:
model2.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

In [None]:
model2.summary()

### Train the model

In [None]:
epochs = 20
history2 = model2.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

### Visualizing the results

In [None]:
accuracyandvalidationplot(history2)

In [None]:
model2.evaluate(test_ds)

#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit. Do you think there is some improvement now as compared to the previous model run?
Augmentation reduced the overfitting in this model. However, overall accuracy is now lower, which we can summarise as **underfitting**.

#### **Todo:** Find the distribution of classes in the training dataset.
#### **Context:** Real-life datasets often have class imbalance: one class can have a proportionately higher number of samples than the others. Class imbalance can have a detrimental effect on the final model quality. Hence, as a sanity check, it is important to check the distribution of classes in the data.
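The check amounts to dividing each class count by the total. The sketch below uses made-up counts and the hypothetical helper `class_proportions`; it only illustrates the arithmetic and does not use this dataset's numbers.

```python
def class_proportions(counts):
    """Return each class's share of the total number of samples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Made-up counts for two hypothetical classes.
made_up_counts = {"class_a": 400, "class_b": 100}
proportions = class_proportions(made_up_counts)
print(proportions)
```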

In [None]:
class_counts = [len(list(data_dir_train.glob(class_name + '/*.jpg'))) for class_name in class_names]
class_counts_df = pd.DataFrame({"class": class_names, "count": class_counts})
class_counts_df['proportion'] = round(class_counts_df['count']/class_counts_df['count'].sum(),2)
class_counts_df

In [None]:
plt.figure(figsize=(25,5))
plt.bar(class_counts_df['class'], class_counts_df['proportion'])
for index, value in enumerate(class_counts_df['proportion']):
    plt.text(index - 0.05, value + 0.001, str(value))
plt.title("Cancer dataset class distributions")

#### **Todo:** Write your findings here: 
#### - Which class has the least number of samples?
dermatofibroma and seborrheic keratosis have the least number of samples

#### - Which classes dominate the data in terms of the proportionate number of samples?
pigmented benign keratosis, melanoma, basal cell carcinoma and nevus


#### **Todo:** Rectify the class imbalance
#### **Context:** You can use a python package known as `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes have very few samples.

In [None]:
!pip install Augmentor

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline`'s `sample()` method.


In [None]:
import Augmentor
for i in class_names:
    print(path_to_training_dataset + "/" + i)
    p = Augmentor.Pipeline(path_to_training_dataset + "/" + i)
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(1000)  # Add 1000 samples per class so that none of the classes is sparse.

### Let's see the distribution of augmented data after adding new images to the original training data.

In [None]:
class_counts = [len(list(data_dir_train.glob(class_name + '/**/*.jpg'))) for class_name in class_names]
class_counts_df = pd.DataFrame({"class": class_names, "count": class_counts})
class_counts_df['proportion'] = round(class_counts_df['count']/class_counts_df['count'].sum(),2)
class_counts_df

In [None]:
plt.figure(figsize=(25,5))
plt.bar(class_counts_df['class'], class_counts_df['proportion'])
for index, value in enumerate(class_counts_df['proportion']):
    plt.text(index - 0.05, value + 0.001, str(value))
plt.title("Cancer dataset class distributions")

Augmentor has stored the augmented images in the `output` sub-directory of each skin cancer type's sub-directory. Let's take a look at the total count of augmented images.
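The three glob patterns used in this notebook select different sets of files. The sketch below demonstrates them on a small temporary directory tree; the file names are made up, and only the `output` folder name follows the layout described above.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    train = pathlib.Path(root)
    # One hypothetical class folder with one original and two augmented images.
    (train / "melanoma" / "output").mkdir(parents=True)
    (train / "melanoma" / "original_1.jpg").touch()
    (train / "melanoma" / "output" / "augmented_1.jpg").touch()
    (train / "melanoma" / "output" / "augmented_2.jpg").touch()

    n_original = len(list(train.glob("*/*.jpg")))          # class folder only
    n_augmented = len(list(train.glob("*/output/*.jpg")))  # output folder only
    n_all = len(list(train.glob("*/**/*.jpg")))            # class folder and below

print(n_original, n_augmented, n_all)
```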

In [None]:
len(list(data_dir_train.glob('*/output/*.jpg')))

So, we have now added 1000 images to each class to maintain some class balance. We can add more images if we want to improve the training process.
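Adding the same number of images to every class does not equalize the classes, but it does shrink the ratio between the largest and the smallest one. The sketch below shows that arithmetic with made-up counts; `ratio_after_adding` is a hypothetical helper.

```python
def ratio_after_adding(counts, added):
    """Return the largest-to-smallest class ratio before and after adding samples."""
    before = max(counts) / min(counts)
    after = (max(counts) + added) / (min(counts) + added)
    return before, after

# Made-up class sizes of 100 and 400, each topped up by 1000 images.
before, after = ratio_after_adding([100, 400], 1000)
print(before, round(after, 2))
```

An alternative would be to sample a different number of images per class so that all classes reach the same target size; that is a design choice this notebook does not make.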

#### **Todo**: Train the model on the data created using Augmentor

In [None]:
batch_size = 32
img_height = 180
img_width = 180

#### **Todo:** Create a training dataset

In [None]:
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset = 'training',
  image_size=(img_height, img_width),
  batch_size=batch_size)

#### **Todo:** Create a validation dataset

In [None]:
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir_train,
  seed=123,
  validation_split = 0.2,
  subset = 'validation',
  image_size=(img_height, img_width),
  batch_size=batch_size)

#### **Todo:** Create your model (make sure to include normalization)

In [None]:
model3 = Sequential([
  data_augmentation,
  layers.experimental.preprocessing.Rescaling(1./255),

  layers.Conv2D(32, 2, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),

  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes, activation='softmax')
])

#### **Todo:** Compile your model (Choose optimizer and loss function appropriately)

In [None]:
model3.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

#### **Todo:**  Train your model

In [None]:
epochs = 30
history3 = model3.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

#### **Todo:**  Visualize the model results

In [None]:
accuracyandvalidationplot(history3)

In [None]:
model3.evaluate(test_ds)

#### **Todo:**  Analyze your results here. Did you get rid of underfitting/overfitting? Did class rebalance help?



The first model overfit. After we added augmentation with flip, rotation and zoom, we got an underfit model. When we used Augmentor to fix the class imbalance, we got a good model with high train and test accuracy.