## SIIM-ISIC: EDA and Model training
---
<!-- <font size="2"> -->
Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective.

Through this notebook, I aim to give you some insight into the data and what to expect, and to help you build a model that uses the image, the patient's sex, the anatomical site of the image, and the patient's approximate age as inputs.

Melanoma is a deadly disease, but if caught early, most melanomas can be cured with minor surgery. Image analysis tools that automate the diagnosis of melanoma will improve dermatologists' diagnostic accuracy. Better detection of melanoma has the opportunity to positively impact millions of people.
<!-- </font>  -->

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# !pip install tf-nightly
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
import matplotlib.pyplot as plt
import cv2
from sklearn.model_selection import train_test_split
from kaggle_datasets import KaggleDatasets

AUTOTUNE = tf.data.experimental.AUTOTUNE
GCS_path = KaggleDatasets().get_gcs_path('melanoma-256x256')
GCS_path

<font size="3">The block of code below handles the itsy-bitsy tasks like imputing missing values and converting paths to usable formats.</font>

---

In [None]:
dataframe = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')

# Build the full JPEG path for each image, then drop the bare image name.
dataframe['image_path'] = '../input/siim-isic-melanoma-classification/jpeg/train/' + dataframe['image_name'] + '.jpg'
dataframe.pop('image_name')

# Encode the anatomical site as integer codes.
# `cat.codes` marks missing values as -1 rather than NaN, so calling fillna on the codes does nothing.
# Impute before encoding instead (here with the most frequent category, which is one possible choice).
site = dataframe['anatom_site_general_challenge']
dataframe['anatom_site'] = pd.Categorical(site.fillna(site.mode()[0]))
anatom_site_categories = dataframe.anatom_site.cat.categories
dataframe['anatom_site'] = dataframe.anatom_site.cat.codes

# Same treatment for sex.
dataframe['sex'] = pd.Categorical(dataframe['sex'].fillna(dataframe['sex'].mode()[0]))
sex_categories = dataframe.sex.cat.categories
dataframe['sex'] = dataframe.sex.cat.codes

# Missing ages are replaced with the mean age.
dataframe['age_approx'] = dataframe['age_approx'].fillna(dataframe['age_approx'].mean())
dataframe = dataframe.astype({"sex": np.int32, "age_approx": np.int32, 'target': np.int8})

dataframe = dataframe.drop(columns=['diagnosis', 'patient_id', 'anatom_site_general_challenge'])
dataframe.to_csv('dataframe.csv')

Since the variables 'sex' and 'anatomical site' are categorical, they have to be converted to numbers before they can be fed to the model. The following are the mappings between the categories and their numeric codes:

In [None]:
print("Anatom_sites:")
for idx, anatom_site in enumerate(anatom_site_categories):
    print(str(idx)+": "+anatom_site, end='\n')

print("\nSex:")
for idx, sex in enumerate(sex_categories):
    print(str(idx)+": "+sex, end='\n')
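
The codes printed above follow the sorted order of the category labels. As a rough illustration of that encoding, here is a minimal pure-Python sketch. The helper `encode_categories` and the sample values are hypothetical and only mimic how pandas assigns categorical codes, including the `-1` code it uses for missing values that were not imputed beforehand.

```python
def encode_categories(values):
    """Map each label to its index in the sorted list of unique labels; None becomes -1."""
    categories = sorted({v for v in values if v is not None})
    lookup = {label: code for code, label in enumerate(categories)}
    codes = [lookup.get(v, -1) for v in values]
    return categories, codes

# Illustrative values only, not rows from train.csv.
sample = ["male", "female", None, "male"]
categories, codes = encode_categories(sample)
print(categories)  # ['female', 'male']
print(codes)       # [1, 0, -1, 1]
```

Because the codes depend on the sorted label order, the same mapping has to be used wherever the encoded values are consumed.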

<font size="3">Now let's look at some EDA.</font>

In [None]:
dataframe.sample(5)

In [None]:
dataframe.benign_malignant.value_counts()

<font size="3">
From the above value counts, we can see that the dataset is heavily imbalanced: there are far more benign images than malignant ones. <br/> <br/>
    
This mirrors what usually happens in practice, where most examined lesions turn out to be benign. Cancer is serious and dangerous, but with data this imbalanced a model struggles to learn a proper distinction between benign and malignant images, because it can reach a low loss by predicting benign almost every time. <br/>

---
    
What I have experimented with here is class weights. This technique magnifies the model's loss when it classifies an image of a person with malignant melanoma as benign.<br/> 
    
There is also another technique called class oversampling. It samples the class with few data points and balances the dataset by making copies of this under-represented data.
</font>
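
To make the oversampling idea concrete, here is a minimal sketch in plain Python. It is not the approach used in this notebook (which reads TFRecords through `tf.data`); the function `oversample_minority` and the toy rows are assumptions made up for illustration.

```python
import random

def oversample_minority(rows, label_key="target", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    # Draw with replacement, so the same minority row can be copied more than once.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 benign rows and 2 malignant rows (hypothetical, not from the competition data).
rows = [{"id": i, "target": 0} for i in range(8)] + [{"id": 100 + i, "target": 1} for i in range(2)]
balanced = oversample_minority(rows)
print(sum(r["target"] for r in balanced), len(balanced))  # 8 16
```

Oversampling should only be applied to the training split; copying malignant rows into the validation data would make the validation metrics misleading.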
    

### Some images with Melanoma

In [None]:
pos_values = dataframe['target'].values != 0
pos_dataframe = dataframe.iloc[pos_values, :]

plt.figure(figsize=(20,8))


for i in range(24):
    img = cv2.imread(pos_dataframe['image_path'].iloc[i])  # select by column name, not position
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    ax = plt.subplot(4,6,i+1)
    plt.imshow(img)
    plt.axis('off')


### Some images without Melanoma

In [None]:
neg_dataframe =  dataframe.iloc[~pos_values, :]

plt.figure(figsize=(20,8))


for i in range(24):
    img = cv2.imread(neg_dataframe['image_path'].iloc[i])  # select by column name, not position
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    ax = plt.subplot(4,6,i+1)
    plt.imshow(img)
    plt.axis('off')

Age is an important factor in cancer risk, so it helps to see which age groups are most represented among the malignant cases and which among the benign ones.

In [None]:
plt.suptitle("Distribution of Melanoma patient skin images with respect to age ")
plt.hist(pos_dataframe['age_approx'].values,bins=15)
plt.show()

In [None]:
plt.suptitle("Distribution of Non-Melanoma patient skin images with respect to age")
plt.hist(neg_dataframe['age_approx'].values,bins=15)
plt.show()

The plot below shows that most of the images were taken of the torso, and that there are also relatively many examples from the lower body. This suggests the hypothesis that this skin cancer can be caused by exposure to sunlight: the torso is often exposed to the sun and is therefore vulnerable to harmful UV rays, which could have contributed to this type of cancer.

In [None]:
plt.bar(dataframe['anatom_site'].value_counts().keys(),dataframe['anatom_site'].value_counts().values)
plt.suptitle('Occurrence of images of various body parts, for both targets')

From the block below onward, the data is prepared for our accelerator (GPU) so that experimentation and training are fast and efficient. In the following cells, the TFRecord files are read and decoded into a tuple of (image, sex, age_approx, anatom_site) together with the target, and the result is later batched.

In [None]:
BATCH_SIZE = 10
EPOCHS = 10
validation_split = 0.15

files_train = tf.io.gfile.glob(GCS_path+'/train*.tfrec')
files_test = tf.io.gfile.glob(GCS_path+'/test*.tfrec')

split = int(len(files_train) * validation_split)
validation_filenames = files_train[:split]
training_filenames = files_train[split:]
print("Pattern matches {} data files. Splitting dataset into {} training files and {} validation files".format(len(files_train), len(training_filenames), len(validation_filenames)))
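
The split above is made at the file level, so the real validation fraction depends on how many TFRecord files there are. The sketch below reproduces the same arithmetic on hypothetical file names; `split_files` is an illustrative helper, and the count of 16 files is an assumption, not the actual number in the dataset.

```python
def split_files(files, validation_split):
    """Reserve the first int(len(files) * validation_split) files for validation."""
    n_val = int(len(files) * validation_split)
    return files[n_val:], files[:n_val]

# Hypothetical file names; the real list comes from tf.io.gfile.glob.
files = ["train{:02d}.tfrec".format(i) for i in range(16)]
train_files, val_files = split_files(files, 0.15)
print(len(train_files), len(val_files))  # 14 2
```

With 16 files, a nominal 15% split actually holds out 2 of 16 files (12.5%), because `int()` truncates. With six or fewer files the same setting would leave the validation set empty.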


In [None]:
@tf.function
def read_tfrecord(example):
    features = {
        'image'                        : tf.io.FixedLenFeature([], tf.string),
        'image_name'                   : tf.io.FixedLenFeature([], tf.string),
        'patient_id'                   : tf.io.FixedLenFeature([], tf.int64),
        'sex'                          : tf.io.FixedLenFeature([], tf.int64),
        'age_approx'                   : tf.io.FixedLenFeature([], tf.int64),
        'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
        'diagnosis'                    : tf.io.FixedLenFeature([], tf.int64),
        'target'                       : tf.io.FixedLenFeature([], tf.int64)
    }
    example = tf.io.parse_single_example(example, features)
    
    image = tf.image.decode_jpeg(example["image"], channels=3)
    image = tf.cast(image, dtype= 'float32') / 255.0
    image = tf.image.resize(image, [256,256])
    
    return (image, 
            example['sex'],
            example['age_approx'],
            example['anatom_site_general_challenge']
           ), example['target']

@tf.function
def augment(data, target):
    img = data[0]
    img = tf.image.random_brightness(img, 0.25)
    img = tf.image.random_contrast(img, 0.5, 0.6)
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_flip_up_down(img)
    img = tf.image.random_hue(img, 0.02)
    # img = tf.image.random_jpeg_quality(img, 85, 100)
    img = tf.cast(img, tf.float32)
    
    return (img, data[1], data[2], data[3]), target


def load_dataset(filenames):
    # read from TFRecords
    option_no_order = tf.data.Options()
    option_no_order.experimental_deterministic = False
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
    dataset = dataset.with_options(option_no_order)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTOTUNE)
    return dataset

In [None]:
def load_batched_dataset(filenames, training=True):
    dataset = load_dataset(filenames)
    if training:
        # Repeat and shuffle only the training stream.
        dataset = dataset.repeat()
        dataset = dataset.shuffle(buffer_size=100, reshuffle_each_iteration=True)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    if training:
        # Random augmentation is applied to training batches only,
        # so validation metrics are computed on unmodified images.
        dataset = dataset.map(augment, num_parallel_calls=AUTOTUNE)
    dataset = dataset.prefetch(AUTOTUNE)
    return dataset

training_dataset = load_batched_dataset(training_filenames)
validation_dataset = load_batched_dataset(validation_filenames, training=False)

In [None]:
neg, pos = np.bincount(dataframe['target'])
total = len(dataframe)
print("negative (benign): "+str(neg))
print("positive (malignant): "+str(pos))
print("total examples: "+str(total))

This is our deep learning model. It is inspired by Andrej Karpathy's talk on PyTorch at Tesla, in which he introduces the idea of a HydraNet, a network in which images are processed by multiple heads. Our model borrows the multi-branch idea, but here only one branch is a CNN that processes the image; the other branches carry the remaining variables (sex, approximate age, and anatomical site).

The CNN branch is an ensemble of four well-known computer vision architectures: DenseNet169, ResNet152V2, InceptionResNetV2, and Xception.

In [None]:
input_1 = tf.keras.Input(shape=(256,256,3), name='image')
# shape=(1,) is a one-element tuple; (1) is just the integer 1.
input_2 = tf.keras.Input(shape=(1,), name='sex')
input_3 = tf.keras.Input(shape=(1,), name='age_approx')
input_4 = tf.keras.Input(shape=(1,), name='anatom_site')

cnn1 = tf.keras.applications.densenet.DenseNet169(input_shape=(256,256,3), include_top=False, weights=None)(input_1)
cnn1 = tf.keras.layers.Conv2D(filters=64, kernel_size=(1,1), activation='relu', use_bias=True)(cnn1)
cnn1 = tf.keras.layers.Flatten()(cnn1)
cnn1 = tf.keras.layers.Dense(2048, activation='relu', use_bias=False)(cnn1)
cnn1 = tf.keras.layers.Dropout(0.4)(cnn1)
cnn1 = tf.keras.layers.Dense(1024, activation='relu', use_bias=False)(cnn1)
cnn1 = tf.keras.layers.Dropout(0.3)(cnn1)
cnn1 = tf.keras.layers.Dense(1, activation='tanh', use_bias=True)(cnn1)

cnn2 = tf.keras.applications.resnet_v2.ResNet152V2(input_shape=(256,256,3), include_top=False, weights=None)(input_1)
cnn2 = tf.keras.layers.Conv2D(filters=64, kernel_size=(1,1), activation='relu', use_bias=True)(cnn2)
cnn2 = tf.keras.layers.Flatten()(cnn2)
cnn2 = tf.keras.layers.Dense(2048, activation='relu', use_bias=False)(cnn2)
cnn2 = tf.keras.layers.Dropout(0.4)(cnn2)
cnn2 = tf.keras.layers.Dense(1024, activation='relu', use_bias=False)(cnn2)
cnn2 = tf.keras.layers.Dropout(0.3)(cnn2)
cnn2 = tf.keras.layers.Dense(1, activation='tanh', use_bias=True)(cnn2)

cnn3 = tf.keras.applications.inception_resnet_v2.InceptionResNetV2(input_shape=(256,256,3), include_top=False, weights=None)(input_1)
cnn3 = tf.keras.layers.Conv2D(filters=64, kernel_size=(1,1), activation='relu', use_bias=True)(cnn3)
cnn3 = tf.keras.layers.Flatten()(cnn3)
cnn3 = tf.keras.layers.Dense(2048, activation='relu', use_bias=False)(cnn3)
cnn3 = tf.keras.layers.Dropout(0.4)(cnn3)
cnn3 = tf.keras.layers.Dense(1024, activation='relu', use_bias=False)(cnn3)
cnn3 = tf.keras.layers.Dropout(0.3)(cnn3)
cnn3 = tf.keras.layers.Dense(1, activation='tanh', use_bias=True)(cnn3)

cnn4 = tf.keras.applications.xception.Xception(input_shape=(256,256,3), include_top=False, weights=None)(input_1)
cnn4 = tf.keras.layers.Conv2D(filters=64, kernel_size=(1,1), activation='relu', use_bias=True)(cnn4)
cnn4 = tf.keras.layers.Flatten()(cnn4)
cnn4 = tf.keras.layers.Dense(2048, activation='relu', use_bias=False)(cnn4)
cnn4 = tf.keras.layers.Dropout(0.4)(cnn4)
cnn4 = tf.keras.layers.Dense(1024, activation='relu', use_bias=False)(cnn4)
cnn4 = tf.keras.layers.Dropout(0.3)(cnn4)
cnn4 = tf.keras.layers.Dense(1, activation='tanh', use_bias=True)(cnn4)

cnn_output = tf.keras.layers.Concatenate(axis=-1)([cnn1, cnn2, cnn3, cnn4])
cnn_output = tf.keras.layers.Dense(1, activation='relu', use_bias=True)(cnn_output)

x = tf.keras.layers.Concatenate(axis=-1)([cnn_output, input_2, input_3, input_4])
x = tf.keras.layers.Dense(10, activation='relu', use_bias=True)(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(10, activation='relu', use_bias=True)(x)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(1, activation='sigmoid', use_bias=True)(x)

model = tf.keras.models.Model(inputs=[input_1,input_2,input_3,input_4], outputs=x)

model.summary()
METRICS = [
    tf.keras.metrics.TruePositives(name='tp'),
    tf.keras.metrics.FalsePositives(name='fp'),
    tf.keras.metrics.TrueNegatives(name='tn'),
    tf.keras.metrics.FalseNegatives(name='fn'), 
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tf.keras.metrics.AUC(name='auc'),
]
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=METRICS)


In [None]:
weight_for_0 = (1 / neg)*(total)/2.0 
weight_for_1 =(1 / pos)*(total)/2.0

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
STEPS = int(np.ceil(total / BATCH_SIZE))  # use BATCH_SIZE instead of a hard-coded 10; steps must be an integer

model.fit(training_dataset, 
          epochs=1, 
          validation_steps=5,
          validation_data = validation_dataset,
          # class_weight=class_weight,
          steps_per_epoch=STEPS)
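
As a worked example of the class-weight formula used in the cell above, the sketch below evaluates `total / (2 * count)` for each class. The counts are hypothetical values chosen for easy arithmetic, not the real dataset counts, and `balanced_class_weights` is an illustrative helper.

```python
def balanced_class_weights(neg, pos):
    """Return {0: w0, 1: w1} with w_c = total / (2 * count_c)."""
    total = neg + pos
    return {0: total / (2.0 * neg), 1: total / (2.0 * pos)}

# Hypothetical counts chosen for easy arithmetic, not the real dataset counts.
weights = balanced_class_weights(neg=9800, pos=200)
print(round(weights[0], 4), weights[1])  # 0.5102 25.0
```

With these weights each class contributes the same total weight to the loss (9800 × 0.5102 ≈ 200 × 25 = 5000), and the weights sum over all examples to the dataset size, so the overall loss scale stays roughly unchanged.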

The accuracy of this model might be pretty high, but in practice what the model learns is to say that the lesion in every image is benign. This is a very common situation when the data is so highly imbalanced.
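
To see why accuracy is misleading here, consider a classifier that labels every image as benign. The short sketch below computes its accuracy and recall for hypothetical class counts (the numbers are illustrative assumptions, not the real dataset counts).

```python
def majority_baseline_metrics(neg, pos):
    """Accuracy and recall of a classifier that labels every image as benign."""
    accuracy = neg / (neg + pos)  # every benign image is correct, every malignant one is wrong
    recall = 0.0                  # no malignant image is ever detected
    return accuracy, recall

accuracy, recall = majority_baseline_metrics(neg=9800, pos=200)  # hypothetical counts
print(accuracy, recall)  # 0.98 0.0
```

A model with 98% accuracy and zero recall is useless for screening, which is why the precision, recall, and AUC metrics tracked during training are more informative than accuracy alone.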

Another problem I encountered with this type of model training is that the RAM gets progressively filled up, and after around 2.5 epochs the process runs out of memory, even though the dataset isn't being cached. And even if it were, the TFRecords I am using are only 700 MB at most.

If you, the reader of this notebook, find or have an opinion on how the performance can be improved, do comment and let me know. 

Thank you!! 