# CNN Assignment on Melanoma Cancer Detection 

##### Problem Statement
- Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early; it accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

##### About Dataset
- The dataset consists of 2357 images of malignant and benign oncological diseases, collected by the International Skin Imaging Collaboration (ISIC). All images were sorted according to the ISIC classification, and all subsets contain the same number of images, with the exception of melanomas and moles, which are slightly over-represented.

##### The data set contains the following diseases:

- Actinic keratosis
- Basal cell carcinoma
- Dermatofibroma
- Melanoma
- Nevus
- Pigmented benign keratosis
- Seborrheic keratosis
- Squamous cell carcinoma
- Vascular lesion

#### Importing all required Libraries

In [None]:
# For Data Processing
import pandas as pd
import numpy as np
# For Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For other tasks
import pathlib
import os
import PIL
# For CNN
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
from glob import glob

#### Importing Dataset

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

## Data Reading/Data Understanding
- Defining the path for `train and test images`

- Assigning variables to the train and test datasets

In [None]:
train_data_dir = pathlib.Path("/content/gdrive/MyDrive/Colab-Notebooks/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train")
test_data_dir = pathlib.Path('/content/gdrive/MyDrive/Colab-Notebooks/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Test')

- Getting the list of images in each dataset

In [None]:
image_count_train = len(list(train_data_dir.glob('*/*.jpg')))
print("The total number of images in train dataset is:",image_count_train)
image_count_test = len(list(test_data_dir.glob('*/*.jpg')))
print("The total number of images in test dataset is:",image_count_test)
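
The two totals above do not show how the images are distributed across the class folders, which matters later when class imbalance is addressed. A minimal sketch of a per-class count, assuming each class sits in its own sub-folder of the dataset directory; the helper name `count_images_per_class` is hypothetical and not part of the original notebook:

```python
import pathlib


def count_images_per_class(data_dir, pattern="*.jpg"):
    """Return {class_folder_name: number_of_matching_images} for each sub-folder."""
    data_dir = pathlib.Path(data_dir)
    counts = {}
    # Each immediate sub-folder is treated as one class
    for class_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for _ in class_dir.glob(pattern))
    return counts
```

In this notebook it could be called as `count_images_per_class(train_data_dir)`, which would return one entry per class folder.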

## Dataset Creation
- Create `train & validation dataset` from the train directory with a batch size of 32. Also, make sure you resize your images to 180*180.

- Creating a Dataset

In [None]:
batch_size = 32
img_height = 180
img_width = 180

In [None]:
train_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,
    labels='inferred',
    label_mode='int',
    class_names=None, 
    color_mode='rgb', 
    batch_size=batch_size, 
    image_size=(img_height, img_width),
    shuffle=True,
    seed=123,
    validation_split = 0.2,
    subset = 'training',
    interpolation='bilinear',
    follow_links=False,
    crop_to_aspect_ratio=False
)

In [None]:
val_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,  # the validation subset is split off from the Train directory
    labels='inferred', 
    label_mode='int',
    class_names=None, 
    color_mode='rgb', 
    batch_size=batch_size, 
    image_size=(img_height, img_width),
    shuffle=True, 
    seed=123,
    validation_split = 0.2,
    subset = 'validation',
    interpolation='bilinear',
    follow_links=False,
    crop_to_aspect_ratio=False
)
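
With `validation_split = 0.2`, one fifth of the files in a directory is held out as the `validation` subset and the rest forms the `training` subset. The arithmetic can be sketched as follows; the helper `split_counts` is hypothetical, and the truncating `int(...)` is an assumption about how the fractional count is rounded, so the numbers Keras reports may differ by one image:

```python
def split_counts(num_files, validation_split):
    """Return (num_training, num_validation) for a fractional validation split."""
    # Assumption: the validation count is truncated to a whole number of files
    num_val = int(validation_split * num_files)
    return num_files - num_val, num_val


# Hypothetical example: 1000 image files with validation_split=0.2
print(split_counts(1000, 0.2))  # (800, 200)
```

Because both calls use the same `seed`, the two subsets are meant to be complementary parts of the same shuffled file list.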

- Getting the list of `class names` for the train and validation datasets.

In [None]:
class_names = train_ds.class_names
list(class_names)

In [None]:
class_names = val_ds.class_names
list(class_names)

- Getting the shape of the image_batch and labels_batch

In [None]:
for image_batch, labels_batch in train_ds:
  print(image_batch.shape)
  print(labels_batch.shape)
  break

## Dataset visualisation
- Write code to visualize one instance of each of the nine classes present in the dataset

In [None]:
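plt.figure(figsize=(10, 10))
# Collect the first image found for each class so that every class appears exactly once
samples = {}
for images, labels in train_ds:
    for image, label in zip(images, labels):
        samples.setdefault(int(label), image)
    if len(samples) == len(class_names):
        break
for i, label in enumerate(sorted(samples)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(samples[label].numpy().astype("uint8"))
    plt.title(class_names[label])
    plt.axis("off")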

The `image_batch` is a tensor of shape `(32, 180, 180, 3)`: a batch of 32 images of shape `180x180x3` (the last dimension refers to the RGB color channels). The `labels_batch` is a tensor of shape `(32,)` holding the labels that correspond to the 32 images.

`Dataset.cache()` keeps the images in memory after they're loaded off disk during the first epoch.

`Dataset.prefetch()` overlaps data preprocessing and model execution while training.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Model Building & training
- Create a CNN model that can accurately classify the 9 classes present in the dataset. While building the model, rescale the images to normalize pixel values to the range [0, 1].

In [None]:
# Standardize the RGB channel values to the [0, 1] range
normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]

In [None]:
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))

In [None]:
num_class = len(class_names)
model = Sequential([
    layers.Rescaling(scale = 1./255, input_shape = (img_height,img_width,3)),
    layers.Conv2D(16,3,padding='same',activation= 'relu'),
    layers.Conv2D(32,3,padding='same',activation= 'relu'),
    layers.Conv2D(64,3,padding='same',activation= 'relu'),
    layers.Flatten(),
    layers.Dense(128,activation='relu'),
    layers.Dense(num_class)
])

In [None]:
model.compile(optimizer='adam',
          loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          metrics=['accuracy'])
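
Because the final `Dense(num_class)` layer has no activation and the loss is configured with `from_logits=True`, the model outputs raw scores (logits) rather than probabilities. To read a prediction as class probabilities, a softmax has to be applied to those scores. A plain-Python sketch of that step, where `softmax` and `predict_label` are hypothetical helpers written for illustration and not Keras functions:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```

The predicted class is the same whether the argmax is taken over the logits or over the probabilities; the softmax only matters when a confidence value is needed.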

In [None]:
model.summary()
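
The parameter count that `model.summary()` reports can be reproduced by hand, which helps explain where the capacity of the network sits. The sketch below assumes 3x3 kernels, `padding='same'`, RGB input of 180x180 and nine output classes, as in the models of this notebook; the helper names are hypothetical. With `pool=True` it follows the later architecture, which places a `MaxPooling2D()` layer after each convolution:

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units


def cnn_param_count(size=180, filters=(16, 32, 64), num_classes=9, pool=False):
    total, channels = 0, 3
    for f in filters:
        total += conv2d_params(3, channels, f)
        channels = f
        if pool:
            size //= 2  # a 2x2 max pooling halves height and width (rounding down)
    flat = size * size * channels  # number of units after Flatten
    total += dense_params(flat, 128) + dense_params(128, num_classes)
    return total


print(cnn_param_count(pool=False))  # 265445673
print(cnn_param_count(pool=True))   # 3989801
```

In both cases almost all parameters belong to the first `Dense` layer, so shrinking the feature map before `Flatten` has a much larger effect on model size than changing the convolutional layers.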

In [None]:
epochs = 20
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
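
Curves like these are usually read by comparing the final training and validation accuracy: a large gap between the two is a common sign of overfitting. A small sketch that extracts those numbers from a history dictionary; `summarize_history` is a hypothetical helper, and the example values are made up purely for illustration, not results from this notebook:

```python
def summarize_history(history_dict):
    """Report final accuracies, their gap, and the epoch with the best validation accuracy."""
    acc = history_dict["accuracy"]
    val_acc = history_dict["val_accuracy"]
    best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {
        "final_train_acc": acc[-1],
        "final_val_acc": val_acc[-1],
        "gap": acc[-1] - val_acc[-1],
        "best_val_epoch": best_index + 1,  # epochs are counted from 1
    }


# Made-up numbers for illustration only
example = {"accuracy": [0.40, 0.65, 0.90], "val_accuracy": [0.38, 0.52, 0.50]}
summary = summarize_history(example)
print(summary["best_val_epoch"], round(summary["gap"], 2))
```

In this notebook the helper could be called as `summarize_history(history.history)` after each training run.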

In [None]:
# As the model is overfitting, let's apply augmentation to reduce the overfitting
# Let's use random flip, rotation and zoom for augmentation
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal",input_shape=(img_height,img_width,3)),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1)])

In [None]:
# Visualize how the augmentation strategy works on one training image
# Set the figure size
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
  for i in range(9):
    augmented_images = data_augmentation(images)
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(augmented_images[0].numpy().astype("uint8"))
    plt.axis("off")

In [None]:
# A Dropout layer can be used when there is evidence of overfitting
# Let's add a dropout layer to the model, as our model is overfitting
model = Sequential([
  data_augmentation,
  layers.Rescaling(1./255),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.2),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_class, name="outputs")
])

In [None]:
# Compiling the model with the Adam optimizer, cross-entropy as the loss function and accuracy as the metric
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
# printing the model summary
model.summary()

In [None]:
epochs = 20
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

In [None]:
# Installing Augmentor
!pip install Augmentor

In [None]:
# Importing the Augmentor package
import Augmentor

# taking the Training dataset path
# path_to_training_dataset='gdrive/My Drive/Colab Notebooks/Skin cancer ISIC The International Skin Imaging Collaboration/Train/'
training_path="/content/gdrive/MyDrive/Colab-Notebooks/CNN_assignment/Skin cancer ISIC The International Skin Imaging Collaboration/Train/"

for i in class_names:
    # Instantiate a pipeline on the training folder of one specific class
    p = Augmentor.Pipeline(training_path + i)
    # Rotate images by up to 10 degrees in either direction, with probability 0.7
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    # Add 500 samples per class so that none of the classes is sparse
    p.sample(500)
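
Adding the same number of samples to every class does not equalize the class sizes, but it does shrink the relative difference between the largest and the smallest class. A small sketch of that effect; the helpers and the class sizes below are hypothetical and are not the actual counts of this dataset:

```python
def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class size (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


def add_samples(counts, added_per_class=500):
    """Class sizes after adding a fixed number of generated samples to each class."""
    return {name: n + added_per_class for name, n in counts.items()}


# Hypothetical class sizes for illustration only
before = {"class_a": 100, "class_b": 400}
after = add_samples(before)
print(imbalance_ratio(before), imbalance_ratio(after))  # 4.0 1.5
```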

In [None]:
# Counting the images that Augmentor wrote to the `output` folder of each class
image_count_augmented = len(list(train_data_dir.glob('*/output/*.jpg')))
print(f'Number of augmented images added to the training dataset: {image_count_augmented}')

In [None]:
# generating path list
path_list = glob(os.path.join(train_data_dir, '*', 'output', '*.jpg'))
len(path_list)

In [None]:
# taking the skin cancer type in a list
lesion_list_new = [os.path.basename(os.path.dirname(os.path.dirname(y))) for y in glob(os.path.join(train_data_dir, '*','output', '*.jpg'))]
len(lesion_list_new)

In [None]:
# creating a new dictionary with the file path and class type
dataframe_dict_new = dict(zip(path_list, lesion_list_new))

In [None]:
# creating a dataframe with the above dictionary
df2 = pd.DataFrame(list(dataframe_dict_new.items()),columns = ['Path','Label'])
new_df = df2
# new_df = original_df.append(df2)

In [None]:
# printing the number of images under each type
new_df['Label'].value_counts()

In [None]:
train_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir,
    labels='inferred',
    label_mode='int',
    class_names=None, 
    color_mode='rgb', 
    batch_size=batch_size, 
    image_size=(img_height, img_width),
    shuffle=True,
    seed=123,
    validation_split = 0.2,
    subset = 'training',
    interpolation='bilinear',
    follow_links=False,
    crop_to_aspect_ratio=False
)

In [None]:
val_ds = tf.keras.utils.image_dataset_from_directory(
    train_data_dir, 
    labels='inferred', 
    label_mode='int',
    class_names=None, 
    color_mode='rgb', 
    batch_size=batch_size, 
    image_size=(img_height, img_width),
    shuffle=True, 
    seed=123,
    validation_split = 0.2,
    subset = 'validation',
    interpolation='bilinear',
    follow_links=False,
    crop_to_aspect_ratio=False
)

In [None]:
# Cache and prefetch the datasets, letting tf.data tune the prefetch buffer size
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

# Normalizing the data
normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixel values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))

# Creating a model after handling the class imbalance
model = Sequential([
  data_augmentation,
  layers.Rescaling(1./255),
  layers.Conv2D(16, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.2),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_class)
])

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

In [None]:
# Let's train the model for 30 epochs
epochs = 30
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()