# Tom Segal's Entry For the Competition "SIIM-ISIC Melanoma Classification" ##

This is my third ML project.

The first project was the MNIST digit recognizer, 10 labels for 1-color images.

The second project was flower identification, >100 labels for 3-color images.

In this project tumors are differiented into malignant (melanoma-inducing) and benign.

In this project, downsampling is used in order to even the label distribution. An unconventional loss function, Focal Loss, is used in order to compensate for the remaining difference in the distribution.

Project overview https://www.kaggle.com/c/siim-isic-melanoma-classification

data https://www.kaggle.com/c/siim-isic-melanoma-classification/data



this project relies on research of the following notebooks:

https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords

https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/155579

https://www.kaggle.com/agentauers/incredible-tpus-finetune-effnetb0-b6-at-once

https://www.kaggle.com/ibtesama/siim-baseline-keras-vgg16


additional references:

https://pypi.org/project/focal-loss/


I acknowledge and appreciate the support of the kaggle community.


In [None]:
!pip install focal-loss

## install missing libraries ##

In [None]:
import tensorflow as tf
print("tensorflow version: " + tf.__version__)
from kaggle_datasets import KaggleDatasets
import pandas as pd
import os
import matplotlib.pyplot as plt
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import VGG16
from keras.applications import DenseNet201
from keras.layers import Flatten, Dense
from keras.models import Model
from keras.optimizers import Adam
from tensorflow.python.keras import backend
from focal_loss import BinaryFocalLoss

## define constants ##

In [None]:
random_state = 19 # using a constant random seed makes the results more consistent and helps comparing between them.

## read the data as panda dataframe for examination ##

In [None]:
# Get the path of the Current System (GCS)
GCS_PATH = KaggleDatasets().get_gcs_path("siim-isic-melanoma-classification")
# get the train data in dataframe format for quick examination of the data
dataframe_train = pd.read_csv("../input/siim-isic-melanoma-classification/train.csv")


## examine the data ##

view the first few dataframe entries

In [None]:
dataframe_train.head(10)

plot some samples

In [None]:
# image_paths = GCS_PATH_TRAIN + "\\" + dataframe_train["image_name"]+".jpg" # \\ because \ is an escape character
# image_paths = "../input/siim-isic-melanoma-classification/jpeg/train/" + dataframe_train["image_name"] + ".jpg" 
image_paths = "../input/jpeg-melanoma-256x256/train/" + dataframe_train["image_name"] + ".jpg" 
f, ax = plt.subplots(3, 5, figsize = (10,6))
for i in range(15):
    #print(image_paths[i])
    img = cv2.imread(image_paths[i])
    #print(img.shape)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # the default cv2 format is BGR
    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis("off")
plt.show()
    

examine the label distribution

In [None]:
dataframe_train["target"].value_counts()

584/32542 = ~1.8% of the tumors are malignant, such that the labels are highly unbalanced.

this can be treated using oversampling or undersampling.

## undersampling ##

only a portion of the benign tumor training data will be used in order to make the labels more balanced.

In [None]:
downsampling = 1000
# sample 1000 benign samples and merge them together with all of the malignant samples
dataframe_train_benign_downsampled = dataframe_train[dataframe_train["target"]==0].sample(downsampling)
dataframe_train_malignant = dataframe_train[dataframe_train["target"]==1]
# join the two parts together. Note that now the two sample types are not mixed anymore in the data
# but appear in two blocks.
dataframe_train_downsampled = pd.concat([dataframe_train_benign_downsampled, dataframe_train_malignant])

show benign tumor samples

In [None]:
image_paths = ["../input/jpeg-melanoma-256x256/train/" + dataframe_train_benign_downsampled["image_name"].values[i] + ".jpg" for i in range(downsampling)]
f, ax = plt.subplots(3, 5, figsize = (10,6))
for i in range(15):
    #print(image_paths[i])
    img = cv2.imread(image_paths[i])
    #print(img.shape)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # the default cv2 format is BGR

    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis("off")
plt.show()

show malignant tumor samples

In [None]:
image_paths = ["../input/jpeg-melanoma-256x256/train/" + dataframe_train_malignant["image_name"].values[i] + ".jpg" for i in range(dataframe_train_malignant.shape[0])]

#print(image_paths[1])
f, ax = plt.subplots(3, 5, figsize = (10,6))
for i in range(15):
    #print(image_paths[i])
    img = cv2.imread(image_paths[i])
    #print(img.shape)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # the default cv2 format is BGR
    ax[i//5, i%5].imshow(img)
    ax[i//5, i%5].axis("off")
plt.show()

## simplify the data  ##

create datasets with only the images and the labels

In [None]:
image_paths = ["../input/jpeg-melanoma-256x256/train/" + dataframe_train_benign_downsampled["image_name"].values[i] + ".jpg" for i in range(downsampling)]


In [None]:
dataframe_train_labels = []
dataframe_train_images = []
for i in range(dataframe_train_downsampled.shape[0]):
    dataframe_train_labels.append(dataframe_train_downsampled["target"].values[i])
    dataframe_train_images.append("../input/jpeg-melanoma-256x256/train/" + dataframe_train_downsampled["image_name"].values[i] + ".jpg")
    
# create a dataframe from the columns
nparray_train_reduced_tuples = zip(dataframe_train_images, dataframe_train_labels)
# dataframe_train_reduced = pd.DataFrame(np.array([dataframe_train_labels, dataframe_train_images]), columns = ["label","image"])
dataframe_train_reduced = pd.DataFrame(nparray_train_reduced_tuples, columns = ["image","label"])
# dataframe_train_reduced = pd.DataFrame(np.array([dataframe_train_labels, dataframe_train_images]))
dataframe_train_reduced.head()

## split the data into train and validation ##

In [None]:
x_train, x_val, y_train, y_val = train_test_split(dataframe_train_reduced["image"], dataframe_train_reduced["label"],
                                                 test_size = 0.2, random_state = random_state)
dataframe_train_split = pd.DataFrame(zip(x_train,y_train), columns = ["image","label"])
dataframe_val = pd.DataFrame(zip(x_val,y_val), columns = ["image","label"])

## data normaliztion & augmentation ##

In [None]:
gen_train = ImageDataGenerator(
    rescale = 1./255, # rescale the images (RGB [0,255])
    width_shift_range = 0.15, height_shift_range = 0.15, # randomly shift the pictures by 15% in both axes
    horizontal_flip = True, vertical_flip = True, # randomly flip the images in both axes
)
train_generator = gen_train.flow_from_dataframe(dataframe_train_reduced, x_col = "image", y_col = "label",
                                               target_size = (256,256), batch_size = 8,
                                               shuffle = True, # important as mentioned above
                                               class_mode = "raw")
val_generator = gen_train.flow_from_dataframe(dataframe_val, x_col = "image", y_col = "label",
                                               target_size = (256,256), batch_size = 8,
                                               shuffle = True, # not sure if important
                                               class_mode = "raw")


## define the model ##

for the model a pretrained VGG16 model will be used, pre-weighted on "imagenet", with a flatten and a dense model placed on top.



In [None]:
model = VGG16(weights = "imagenet",
             include_top = False, # because a new top will be added to match the dimensions of this dataset
             input_shape = (256,256,3))
x = Flatten()(model.output)
output = Dense(1,activation = "sigmoid")(x)
model = Model(model.input, output)



## compile the model ##

In [None]:
model.compile(loss = "binary_crossentropy", metrics = [tf.keras.metrics.AUC()], optimizer = Adam(lr=0.00001))

## train the model ##

batch_size = 8
steps_per_epoch = dataframe_train_reduced.shape[0] // batch_size
epochs = 3
validation_steps = dataframe_val.shape[0] // batch_size
history = model.fit_generator(train_generator, steps_per_epoch = steps_per_epoch, epochs = epochs,
                    validation_data = val_generator, validation_steps = validation_steps)

## results ##

Epoch 1/3
198/198 [==============================] - 2103s 11s/step - loss: 0.5735 - auc_1: 0.7306 - val_loss: 0.5575 - val_auc_1: 0.8149
Epoch 2/3
198/198 [==============================] - 2106s 11s/step - loss: 0.5137 - auc_1: 0.8001 - val_loss: 0.4799 - val_auc_1: 0.8388
Epoch 3/3
198/198 [==============================] - 2104s 11s/step - loss: 0.4836 - auc_1: 0.8234 - val_loss: 0.4836 - val_auc_1: 0.8667

## using focal loss instead of crossentropy ##

https://arxiv.org/abs/1708.02002

"We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training."

In [None]:
def focal_loss(alpha=0.25,gamma=2.0):
    def focal_crossentropy(y_true, y_pred):
        bce = backend.binary_crossentropy(y_true, y_pred)
        
        y_pred = backend.clip(y_pred, backend.epsilon(), 1.- backend.epsilon())
        p_t = (y_true*y_pred) + ((1-y_true)*(1-y_pred))
        
        alpha_factor = 1
        modulating_factor = 1

        alpha_factor = y_true*alpha + ((1-alpha)*(1-y_true))
        modulating_factor = backend.pow((1-p_t), gamma)

        # compute the final loss and return
        return backend.mean(alpha_factor*modulating_factor*bce, axis=-1)
    return focal_crossentropy

In [None]:
model.compile(loss=BinaryFocalLoss(gamma=2), metrics = [tf.keras.metrics.AUC()], optimizer = Adam(lr=0.00001))

batch_size = 8
steps_per_epoch = dataframe_train_reduced.shape[0] // batch_size
epochs = 3
validation_steps = dataframe_val.shape[0] // batch_size
history2 = model.fit_generator(train_generator, steps_per_epoch = steps_per_epoch, epochs = epochs,
                    validation_data = val_generator, validation_steps = validation_steps)

## results ##

Epoch 1/3
198/198 [==============================] - 2109s 11s/step - loss: 0.1244 - auc_6: 0.8345 - val_loss: 0.0998 - val_auc_6: 0.8968
Epoch 2/3
198/198 [==============================] - 2109s 11s/step - loss: 0.1140 - auc_6: 0.8596 - val_loss: 0.1055 - val_auc_6: 0.8924
Epoch 3/3
198/198 [==============================] - 2113s 11s/step - loss: 0.1112 - auc_6: 0.8663 - val_loss: 0.1087 - val_auc_6: 0.8911

better than with the crossentropy loss metric

## using DenseNet201 instead of VGG16 ##

In [None]:
model2 = DenseNet201(weights = "imagenet",
             include_top = False, # because a new top will be added to match the dimensions of this dataset
             input_shape = (256,256,3))
x = Flatten()(model2.output)
output = Dense(1,activation = "sigmoid")(x)
model2 = Model(model2.input, output)



In [None]:
model2.compile(loss=BinaryFocalLoss(gamma=2), metrics = [tf.keras.metrics.AUC()], optimizer = Adam(lr=0.00001))

In [None]:
batch_size = 8
steps_per_epoch = dataframe_train_reduced.shape[0] // batch_size
epochs = 3
validation_steps = dataframe_val.shape[0] // batch_size
history3 = model2.fit_generator(train_generator, steps_per_epoch = steps_per_epoch, epochs = epochs,
                    validation_data = val_generator, validation_steps = validation_steps)

## results ##

almost twice as fast.

Epoch 1/3
198/198 [==============================] - 1194s 6s/step - loss: 0.2557 - auc_2: 0.6912 - val_loss: 0.2462 - val_auc_2: 0.6968
Epoch 2/3
198/198 [==============================] - 1223s 6s/step - loss: 0.2170 - auc_2: 0.7629 - val_loss: 0.1782 - val_auc_2: 0.8301
Epoch 3/3
198/198 [==============================] - 1197s 6s/step - loss: 0.1912 - auc_2: 0.8044 - val_loss: 0.1833 - val_auc_2: 0.8421

## results summary ##

so far the best result is that of VGG16 with focal loss


## submit the results ##

In [None]:
predictions = [] # the test predictions will be stored here
# read the test csv file and obtain the image paths from it
dataframe_test = pd.read_csv("../input/siim-isic-melanoma-classification/test.csv")
test_image_paths = ["../input/jpeg-melanoma-256x256/test/" + image_name + ".jpg" for image_name in dataframe_test["image_name"]]
print(test_image_paths[5])

In [None]:
predictions = [] # the test predictions will be stored here
# read the test csv file and obtain the image paths from it
dataframe_test = pd.read_csv("../input/siim-isic-melanoma-classification/test.csv")
test_image_paths = ["../input/jpeg-melanoma-256x256/test/" + image_name + ".jpg" for image_name in dataframe_test["image_name"]]
# go over the image paths, load their respective images, make a prediction for them and save the predictions
i=0
for test_image_path in test_image_paths:
    img = cv2.imread(test_image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = np.reshape(img,(1,256,256,3))
    predictions.append(model.predict(img))
    if i%100 == 0:
        print("finished " + str(i) + " out of " + str(len(test_image_paths)))
    i += 1



In [None]:
submission = pd.read_csv("../input/siim-isic-melanoma-classification/sample_submission.csv")
submission["target"] = predictions
submission.to_csv("submission.csv", index = False)

submission.head(30)