Problem statement: Build a CNN-based model that can accurately detect melanoma. Melanoma is a type of cancer that can be deadly if not detected early; it accounts for 75% of skin cancer deaths. A solution that can evaluate images and alert dermatologists to the presence of melanoma has the potential to reduce much of the manual effort needed in diagnosis.

### Importing Skin Cancer Data
#### To do: Take necessary actions to read the data

### Importing all the important libraries

In [None]:
import pathlib
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import PIL
from tensorflow import keras
from tensorflow.keras import layers
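from tensorflow.keras.models import Sequential
# Layer classes used by the model-building cells further down
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense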

In [None]:
## The image files are stored on the local drive and read by the code below to build the train and test numpy arrays

This assignment uses a dataset of about 2357 images of skin cancer types. The train and test directories each contain 9 sub-directories, one for each of the 9 skin cancer types.
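
Before loading any pixels, it can help to confirm the folder layout described above. The following is a minimal sketch, assuming one sub-directory per class; the helper name `count_images_per_class` and the demo folder and file names are hypothetical and not part of the original notebook.

```python
import pathlib
import tempfile

def count_images_per_class(root):
    """Return {class_folder_name: number_of_files} for each sub-directory of root."""
    root = pathlib.Path(root)
    counts = {}
    # sorted() gives a stable, alphabetical class order
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts

# Tiny self-contained demo on a throwaway directory (hypothetical class names and files).
with tempfile.TemporaryDirectory() as tmp:
    for name, n in [("melanoma", 3), ("nevus", 2)]:
        class_dir = pathlib.Path(tmp) / name
        class_dir.mkdir()
        for k in range(n):
            (class_dir / f"img_{k}.jpg").touch()
    demo_counts = count_images_per_class(tmp)

print(demo_counts)
```

Pointing the same helper at the real train or test directory should list 9 class folders if the layout matches the description.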

In [None]:
# Defining the path for train and test images
## Todo: Update the paths of the train and test dataset
trainpath = pathlib.Path(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Skin cancer ISIC The International Skin Imaging Collaboration\Train")
testpath = pathlib.Path(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Skin cancer ISIC The International Skin Imaging Collaboration\Test")


In [None]:
## The code below scans the train dataset path, reads the images in the individual folders and converts them to numpy arrays.
## This forms the X_train array.
## In parallel, it writes the labels to another array, which forms the y_train array.

# sorted() keeps the label numbering stable, since iterdir() does not guarantee an order
trainpathlist = sorted(trainpath.iterdir())
trainclasslist = []
for i in trainpathlist:
    length = len(i.parts)-1
    trainclasslist.append(i.parts[length])
    
subtrainpath = []
for i in trainclasslist:
    subtrainpath.append(pathlib.Path(str(trainpath)+'/'+ i))

traininterlist = []
listlength=0
X_trainarray=[]
y_trainarray = []
for i in range(0,9):
    traininterlist = list(subtrainpath[i].iterdir())
    listlength = listlength + len(traininterlist)
    print("Label Name = " + str(subtrainpath[i].name))
    print("Label Number = " + str(i))
    print("number of train images = " + str(len(traininterlist)))
    print()
    for j in traininterlist:
        image = tf.keras.utils.load_img(j)
        image = image.resize((180,180))
        X_trainarray.append(tf.keras.utils.img_to_array(image))
        y_trainarray.append(i)
X_train = np.array(X_trainarray)
y_train = np.array(y_trainarray)

print(X_train.shape)
print(y_train.shape)


In [None]:
## The above process is repeated on the test dataset and X_test and y_test arrays are formed.

# sorted() keeps the test label numbering consistent with the train labels
testpathlist = sorted(testpath.iterdir())
testclasslist = []
for i in testpathlist:
    length = len(i.parts)-1
    testclasslist.append(i.parts[length])
    
subtestpath = []
for i in testclasslist:
    subtestpath.append(pathlib.Path(str(testpath)+'/'+ i))

testinterlist = []
listlength=0
X_testarray=[]
y_testarray = []
for i in range(0,9):
    testinterlist = list(subtestpath[i].iterdir())
    listlength = listlength + len(testinterlist)
    print("Label Name = " + str(subtestpath[i].name))
    print("Label Number = " + str(i))
    print("number of test images = " + str(len(testinterlist)))
    print()
    for j in testinterlist:
        image = tf.keras.utils.load_img(j)
        image = image.resize((180,180))
        X_testarray.append(tf.keras.utils.img_to_array(image))
        y_testarray.append(i)
X_test = np.array(X_testarray)
y_test = np.array(y_testarray)

print(X_test.shape)
print(y_test.shape)


In [None]:
# Normalising the X_train and X_test datasets.
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# normalise
X_train /= 255
X_test /= 255

## One Hot Encoding the y_train and y_test datasets
y_train = tf.keras.utils.to_categorical(y_train,9)
y_test = tf.keras.utils.to_categorical(y_test,9)
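
The two preprocessing steps above can be checked on a toy example. This is a sketch using plain NumPy rather than Keras: `np.eye(num_classes)[labels]` builds one row per sample with a single 1 at the label index, which is the layout `to_categorical` produces. The tiny `images` and `labels` arrays are hypothetical stand-ins, not data from the assignment.

```python
import numpy as np

# Hypothetical stand-ins: two 2x2 RGB "images" with 8-bit pixel values, and their labels.
images = np.array([[[[0, 128, 255]] * 2] * 2,
                   [[[255, 255, 255]] * 2] * 2], dtype=np.uint8)
labels = np.array([0, 3])
num_classes = 9

# Scale pixel values from [0, 255] to [0, 1]
scaled = images.astype("float32") / 255.0

# One-hot encode: row i has a 1 in column labels[i] and 0 elsewhere
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(scaled.shape, scaled.min(), scaled.max())
print(one_hot)
```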

In [None]:
## Google Colab was used for model building because of the GPU requirement, so the above numpy arrays were saved on the local drive
## and then uploaded to Google Drive, from where the Colab code loads them.

In [None]:
## Below code is used to save the numpy arrays on the local drive
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\X_train.npy",X_train)
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\X_test.npy",X_test)
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\y_train.npy",y_train)
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\y_test.npy",y_test)

In [None]:
## Defining the batch size and the epochs
batch_size = 32
epochs = 20

Use 80% of the images for training, and 20% for validation.

In [None]:
## Loading the numpy arrays to google colab
from google.colab import drive
drive.mount('/content/drive')

X_train = np.load("/content/drive/MyDrive/Assignment_Numpy/X_train.npy")
y_train = np.load("/content/drive/MyDrive/Assignment_Numpy/y_train.npy")
X_test = np.load("/content/drive/MyDrive/Assignment_Numpy/X_test.npy")
y_test = np.load("/content/drive/MyDrive/Assignment_Numpy/y_test.npy")

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train1,X_val,y_train1,y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=5)
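
The split above is a plain random split. One optional refinement, not used in the original run, is to split class by class so that every class keeps roughly the same 80/20 proportion. The following is a NumPy-only sketch under that assumption; the function name `stratified_split` and the demo labels are hypothetical.

```python
import numpy as np

def stratified_split(y_one_hot, val_fraction=0.2, seed=5):
    """Return (train_idx, val_idx) with roughly val_fraction of every class in validation."""
    rng = np.random.default_rng(seed)
    class_ids = y_one_hot.argmax(axis=1)      # recover integer labels from one-hot rows
    train_idx, val_idx = [], []
    for c in np.unique(class_ids):
        idx = rng.permutation(np.flatnonzero(class_ids == c))  # shuffle this class only
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Hypothetical imbalanced labels: 50 samples of class 0 and 10 of class 1.
demo_labels = np.array([0] * 50 + [1] * 10)
demo_one_hot = np.eye(2)[demo_labels]
train_idx, val_idx = stratified_split(demo_one_hot)

print(len(train_idx), len(val_idx))
print(np.bincount(demo_labels[val_idx]))
```

The returned index arrays can then be used to slice the image and label arrays, for example `X_train[train_idx]` and `y_train[train_idx]`.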

### Create the model


In [None]:

#Build Model
model = Sequential()

model.add(Conv2D(32,kernel_size=(5,5),padding='same',strides=(1,1),activation='relu',input_shape=((180,180,3))))
model.add(Conv2D(32,kernel_size=(5,5),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())


model.add(Conv2D(64,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(Conv2D(64,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.05))

model.add(Conv2D(128,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(Conv2D(128,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.05))


model.add(Flatten())

model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(9, activation='softmax'))

model.summary()
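
The shapes reported by `model.summary()` can be cross-checked by hand. With `padding='same'` and stride 1, each convolution keeps the spatial size, while each `MaxPooling2D(pool_size=(2, 2))` with default settings halves it, rounding down. The sketch below works through that arithmetic for the architecture above; the helper name `pooled_size` is hypothetical.

```python
def pooled_size(size, num_pools, pool=2):
    """Spatial size after num_pools max-pooling layers (stride equal to the pool size)."""
    for _ in range(num_pools):
        size //= pool   # floor division: an odd size loses its last row/column
    return size

side = pooled_size(180, num_pools=3)              # 180 -> 90 -> 45 -> 22
flatten_units = side * side * 128                 # the last conv block outputs 128 channels
first_dense_params = flatten_units * 1024 + 1024  # weights plus biases of the first Dense(1024)

print(side, flatten_units, first_dense_params)
```

Most of the parameters therefore sit in the first dense layer rather than in the convolutions.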




### Compile the model
Choose an appropriate optimiser and loss function for model training

In [None]:
model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer = 'sgd',
              metrics=['accuracy'])


In [None]:
# View the summary of all layers
model.summary()

### Train the model

In [None]:
history = model.fit(x = X_train1, y = y_train1,batch_size=batch_size,epochs=epochs,verbose=1,
                    validation_data=(X_val,y_val))

### Visualizing training results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit

### Write your findings here

In [1]:
## The model reaches a training accuracy of only about 74.5% and a much lower validation accuracy of about 40%; this large gap indicates probable overfitting.
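
The gap between training and validation accuracy can also be read directly from the training history instead of from the plot. This is a minimal sketch; the helper name `accuracy_gap` and the per-epoch values in `demo_history` are hypothetical, chosen only to end near the figures quoted above. With a real run, `history.history` would be passed in its place.

```python
def accuracy_gap(history_dict):
    """Return final training accuracy, final validation accuracy, and their difference."""
    train_acc = history_dict["accuracy"][-1]
    val_acc = history_dict["val_accuracy"][-1]
    return train_acc, val_acc, train_acc - val_acc

# Hypothetical per-epoch values, not the output of an actual training run.
demo_history = {"accuracy": [0.35, 0.60, 0.745], "val_accuracy": [0.30, 0.38, 0.40]}
train_acc, val_acc, gap = accuracy_gap(demo_history)

print(f"train={train_acc:.3f} val={val_acc:.3f} gap={gap:.3f}")
```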

### Visualizing the results

#### Todo: Write your findings after the model fit, see if there is an evidence of model overfit or underfit. Do you think there is some improvement now as compared to the previous model run?

#### **Todo:** Find the distribution of classes in the training dataset.
#### **Context:** Many times real life datasets can have class imbalance, one class can have proportionately higher number of samples compared to the others. Class imbalance can have a detrimental effect on the final model quality. Hence as a sanity check it becomes important to check what is the distribution of classes in the data.
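
Because the labels are one-hot encoded, the class distribution is simply the column sums of the label array. The sketch below shows this on hypothetical toy labels with three classes; the helper name `class_distribution` is an assumption for illustration.

```python
import numpy as np

def class_distribution(y_one_hot):
    """Return (counts, fractions) per class from a one-hot encoded label array."""
    counts = y_one_hot.sum(axis=0).astype(int)   # each column sums to that class's sample count
    fractions = counts / counts.sum()
    return counts, fractions

# Hypothetical toy labels: 8 samples of class 0, 2 of class 1, 4 of class 2.
demo_labels = np.array([0] * 8 + [1] * 2 + [2] * 4)
demo_one_hot = np.eye(3)[demo_labels]
counts, fractions = class_distribution(demo_one_hot)

print(counts)
print(np.round(fractions, 3))
```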

In [None]:
import seaborn as sns

# Assumption: traindf is not defined earlier in the notebook; it is taken here to be
# the one-hot encoded y_train as a DataFrame, so each column sum is a class count.
traindf = pd.DataFrame(y_train)

distributiondf = pd.DataFrame()
distributiondf['Type'] = ['actinic keratosis','basal cell carcinoma','dermatofibroma','melanoma','nevus','pigmented benign keratosis',
                          'seborrheic keratosis','squamous cell carcinoma','vascular lesion']
distributiondf['Number_of_images'] = traindf.sum().astype(int).tolist()
print(distributiondf)

sns.barplot(x=distributiondf['Type'],y=distributiondf['Number_of_images'])
plt.xticks(rotation=90)
plt.show()

#### **Todo:** Write your findings here: 

#### - Which class has the least number of samples?
   ### Answer: - Class "seborrheic keratosis" has the least number of samples, while class "pigmented benign keratosis" has the highest number of samples
   
#### - Which classes dominate the data in terms of the proportionate number of samples?
   ### Answer : - Classes "basal cell carcinoma", "melanoma", "pigmented benign keratosis" and "nevus" dominate the data in terms of the proportionate number of samples


#### **Todo:** Rectify the class imbalance
#### **Context:** You can use a python package known as `Augmentor` (https://augmentor.readthedocs.io/en/master/) to add more samples across all classes so that none of the classes have very few samples.

In [None]:
!pip install Augmentor

To use `Augmentor`, the following general procedure is followed:

1. Instantiate a `Pipeline` object pointing to a directory containing your initial image data set.<br>
2. Define a number of operations to perform on this data set using your `Pipeline` object.<br>
3. Execute these operations by calling the `Pipeline’s` `sample()` method.


In [None]:
class_names = ['actinic keratosis','basal cell carcinoma','dermatofibroma','melanoma','nevus','pigmented benign keratosis',
                          'seborrheic keratosis','squamous cell carcinoma','vascular lesion']

#### The code below scans every sub-folder in the training folder and augments the images. A total of 500 samples is created for each class to address the class imbalance issue.

#### This code is executed on the local machine; as before, numpy arrays are then created, uploaded to Google Drive and used in the Google Colab code. The code is given below.
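
Adding the same number of samples to every class narrows the relative gap between classes without making them equal. The sketch below illustrates this with hypothetical class counts, which are not the counts of this dataset; the helper name `imbalance_ratio` is an assumption.

```python
def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class (1.0 means perfectly balanced)."""
    return max(counts) / min(counts)

original_counts = [80, 450, 120]   # hypothetical per-class image counts
added_per_class = 500              # same number of augmented samples for every class
augmented_counts = [c + added_per_class for c in original_counts]

before = imbalance_ratio(original_counts)    # 450 / 80
after = imbalance_ratio(augmented_counts)    # 950 / 580

print(f"before={before:.3f} after={after:.3f}")
```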

In [None]:
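import Augmentor

# Raw string wrapped in pathlib.Path: in a plain string "\17" is read as an escape sequence,
# and a str has no iterdir() method.
trainpath = pathlib.Path(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Skin cancer ISIC The International Skin Imaging Collaboration\Train")

# sorted() keeps the class order stable, since iterdir() does not guarantee an order
trainpathlist = sorted(trainpath.iterdir())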
trainclasslist = []
for i in trainpathlist:
    length = len(i.parts)-1
    trainclasslist.append(i.parts[length])
    
subtrainpath = []
for i in trainclasslist:
    subtrainpath.append(pathlib.Path(str(trainpath)+'/'+ i))

for i in range(0,9):
    p = Augmentor.Pipeline(subtrainpath[i])
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.sample(500) ## We are adding 500 samples per class to make sure that none of the classes are sparse.

In [None]:
trainpath = pathlib.Path(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Augmented Train")

# sorted() keeps the label numbering consistent with the earlier train and test arrays
trainpathlist = sorted(trainpath.iterdir())
trainclasslist = []
for i in trainpathlist:
    length = len(i.parts)-1
    trainclasslist.append(i.parts[length])
    
subtrainpath = []
for i in trainclasslist:
    subtrainpath.append(pathlib.Path(str(trainpath)+'/'+ i))

traininterlist = []
listlength=0
X_trainarray=[]
y_trainarray = []
for i in range(0,9):
    traininterlist = list(subtrainpath[i].iterdir())
    listlength = listlength + len(traininterlist)
    print("Label Name = " + str(subtrainpath[i].name))
    print("Label Number = " + str(i))
    print("number of train images = " + str(len(traininterlist)))
    print()
    for j in traininterlist:
        image = tf.keras.utils.load_img(j)
        image = image.resize((180,180))
        X_trainarray.append(tf.keras.utils.img_to_array(image))
        y_trainarray.append(i)
X_train2 = np.array(X_trainarray)
y_train2 = np.array(y_trainarray)

print(X_train2.shape)
print(y_train2.shape)


In [None]:
X_train2 = X_train2.astype('float32')
X_train2 /= 255
y_train2 = tf.keras.utils.to_categorical(y_train2,9)

In [None]:
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Augmented_X_train.npy",X_train2)
np.save(r"D:\Siddharth Upgrad\Case Study\17. Melanoma detection\CNN_assignment\Augmented_y_train.npy",y_train2)

In [None]:
## LOADING THE AUGMENTED NUMPY ARRAYS TO GOOGLE COLAB
X_train2 = np.load("/content/drive/MyDrive/Assignment_Numpy/Augmented_X_train.npy")
y_train2 = np.load("/content/drive/MyDrive/Assignment_Numpy/Augmented_y_train.npy")
X_test2 = np.load("/content/drive/MyDrive/Assignment_Numpy/X_test.npy")
y_test2 = np.load("/content/drive/MyDrive/Assignment_Numpy/y_test.npy")

In [None]:
## CREATING TRAIN AND VALIDATION SPLIT FOR THE AUGMENTED DATA
X_train1,X_val,y_train1,y_val = train_test_split(X_train2,y_train2,test_size=0.2,random_state=5)

#### **Todo:** Create your model (make sure to include normalization)

In [None]:

#Build Model
model = Sequential()

model.add(Conv2D(64,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu',input_shape=((180,180,3))))
model.add(Conv2D(64,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())


model.add(Conv2D(128,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(Conv2D(128,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Conv2D(256,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(Conv2D(256,kernel_size=(3,3),padding='same',strides=(1,1),activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))

model.add(Flatten())

model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(1024,activation='relu'))
model.add(Dense(9, activation='softmax'))




#### **Todo:** Compile your model (Choose optimizer and loss function appropriately)

In [None]:
## Same loss and optimiser as the first model: categorical cross-entropy with SGD

model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer = 'sgd',
              metrics=['accuracy'])

#### **Todo:**  Train your model

In [None]:
history = model.fit(x = X_train1, y = y_train1,batch_size=batch_size,epochs=epochs,verbose=1,
                    validation_data=(X_val,y_val))

#### **Todo:**  Visualize the model results

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

#### **Todo:**  Analyze your results here. Did you get rid of underfitting/overfitting? Did class rebalance help?



In [None]:
## When the above model is executed, the train accuracy improves to around 94% and the validation accuracy to around 84%.
## This shows a significant improvement in the overall model due to class rebalance.
