## Melanoma (Skin Cancer) Classification

Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. It can spread to other organs rapidly if it is not treated at an early stage.
The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection, potentially aided by data science, can make treatment more effective.

In this competition, given an image of a skin lesion, we are asked to predict whether it is benign or malignant.

So let's get started.

In [None]:
!pip install livelossplot

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2
import PIL
import skimage.io
import matplotlib.pyplot as plt
from IPython.display import Image, display
# Plotly for the interactive pie charts below
import plotly.graph_objects as go
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.applications.resnet50 import ResNet50
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Sequential, Model, load_model
from keras.layers import (Conv2D, MaxPooling2D, MaxPool2D, AveragePooling2D,
                          GlobalMaxPooling2D, UpSampling2D, Dense, Dropout, Input,
                          Flatten, BatchNormalization, Activation, Add, Multiply, Lambda)
from keras.regularizers import l2
from keras.optimizers import Adam, SGD, RMSprop
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras.utils import to_categorical
from livelossplot import PlotLossesKeras

I'm gonna be using the jpeg files for training and testing.

In [None]:
train_dir='/kaggle/input/siim-isic-melanoma-classification/jpeg/train/'
test_dir='/kaggle/input/siim-isic-melanoma-classification/jpeg/test/'
train=pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/train.csv')
test=pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/test.csv')
submission=pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')

In [None]:
train.head()

Since this is medical data, I'm expecting it to be imbalanced.

In [None]:
train['target'].value_counts()

In [None]:
dist=train['target'].value_counts()
# compute the share of benign cases from the counts instead of hard-coding them
print("Benign cases are {:.2f}%".format(dist[0]/dist.sum()*100))


The difference is huge: only about 1.8% of the images in our data are malignant.

**anatom_site_general_challenge** in the dataset refers to the location on the body of the lesion shown in the image.

In [None]:
labels=train['anatom_site_general_challenge'].value_counts().index
values=train['anatom_site_general_challenge'].value_counts().values
fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()


In more than half of the images in our dataset, the lesion is located on the torso.

Now let's look at the diagnoses provided by dermatologists. (I have removed cases marked "unknown".)

In [None]:
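# drop the "unknown" bucket by name rather than by its position in the counts
diagnosis_counts=train['diagnosis'].value_counts().drop('unknown',errors='ignore')
labels=diagnosis_counts.index
values=diagnosis_counts.values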
fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',
                             insidetextorientation='radial'
                            )])
fig.show()


A "nevus" is basically a visible, circumscribed, chronic lesion of the skin. Since nevi are also called moles and make up the majority of the data, I think this diagnosis corresponds to benign cases.

Let's check it out.

In [None]:
new=train.drop(labels=['image_name','patient_id','sex','age_approx','anatom_site_general_challenge','target'],axis=1)
pd.crosstab(new['diagnosis'].values,new['benign_malignant'])

> So my presumption was true and most benign cases are diagnosed as **nevus**. 

> All images diagnosed as "melanoma" are malignant, which is expected: melanoma is by definition a malignant tumor.



Before going any further with training let's take a look at sample photos from both classes.

In [None]:
df_0=train[train['target']==0]
df_1=train[train['target']==1]

In [None]:
print('Benign Cases')
benign=[]
df_benign=df_0.sample(40)
df_benign=df_benign.reset_index()
for i in range(40):
    img=cv2.imread(str(train_dir + df_benign['image_name'].iloc[i]+'.jpg'))
    img = cv2.resize(img, (224,224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)/255.
    benign.append(img)
f, ax = plt.subplots(5,8, figsize=(10,8))
for i, img in enumerate(benign):
    ax[i//8, i%8].imshow(img)
    ax[i//8, i%8].axis('off')

plt.show()

In [None]:
print('Malignant Cases')
malignant=[]
df_malignant=df_1.sample(40)
df_malignant=df_malignant.reset_index()
for i in range(40):
    img=cv2.imread(str(train_dir + df_malignant['image_name'].iloc[i]+'.jpg'))
    img = cv2.resize(img, (224,224))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)/255.
    malignant.append(img)
f, ax = plt.subplots(5,8, figsize=(10,8))
for i, img in enumerate(malignant):
    ax[i//8, i%8].imshow(img)
    ax[i//8, i%8].axis('off')

plt.show()

## Preparing the Datasets

For training I'm going to use an external [dataset](https://www.kaggle.com/shonenkov/melanoma-merged-external-data-512x512-jpeg) from [Alex Shonenkov](https://www.kaggle.com/shonenkov). It gives a boost over the original competition data: it adds images from previous melanoma competitions and, as shown in this [discussion](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/157701), the duplicate images found in the train set have been removed.

I'm also balancing the classes a bit by keeping every malignant image and sampling only 6000 of the benign ones.
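
As a rough sketch of this kind of undersampling, the snippet below keeps all positive rows and draws a fixed-size random sample of negatives. The helper name `undersample_majority`, the `seed` argument, and the toy DataFrame are my own illustration and not part of the original kernel; fixing the seed just makes the sample reproducible between runs.

```python
import pandas as pd

def undersample_majority(df, target_col="target", n_majority=6000, seed=42):
    """Keep every positive row and a random sample of the negative rows."""
    positives = df[df[target_col] == 1]
    negatives = df[df[target_col] == 0]
    # Never ask for more rows than actually exist
    n = min(n_majority, len(negatives))
    sampled = negatives.sample(n=n, random_state=seed)
    # Shuffle so the two classes are interleaved, then rebuild a clean index
    return (pd.concat([sampled, positives])
              .sample(frac=1, random_state=seed)
              .reset_index(drop=True))

# Toy frame standing in for marking.csv (hypothetical data, same column names)
toy = pd.DataFrame({
    "image_id": ["img_{}".format(i) for i in range(1000)],
    "target": [1] * 50 + [0] * 950,
})
balanced = undersample_majority(toy, n_majority=60)
print(balanced["target"].value_counts())
```

Using `reset_index(drop=True)` avoids carrying the old index along as an extra column.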

In [None]:
train_dir='../input/melanoma-merged-external-data-512x512-jpeg/512x512-dataset-melanoma/512x512-dataset-melanoma/'
marking=pd.read_csv('../input/melanoma-merged-external-data-512x512-jpeg/marking.csv')
df_1=marking[marking['target']==1]#5479 images
df_0=marking[marking['target']==0].sample(6000)
train=pd.concat([df_0,df_1])
train=train.reset_index()


In [None]:
# build the full image paths and keep the labels alongside them
df=pd.DataFrame({'images':train_dir+train['image_id']+'.jpg','target':train['target']})

In [None]:
# build the full image paths for the test set
df_test=pd.DataFrame({'images':test_dir+test['image_name']+'.jpg'})

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df['images'],df['target'], test_size=0.2, random_state=1234)

train=pd.DataFrame(X_train)
train.columns=['images']
train['target']=y_train

validation=pd.DataFrame(X_val)
validation.columns=['images']
validation['target']=y_val
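
The split above is purely random. A possible variation, sketched here on a toy DataFrame that is not the real data, is to pass `stratify` so that both parts keep the same class ratio. Passing the whole DataFrame to `train_test_split` also returns DataFrames directly, so the columns do not have to be reassembled afterwards.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: 20% positives (hypothetical data)
toy = pd.DataFrame({
    "images": ["img_{}.jpg".format(i) for i in range(200)],
    "target": [1] * 40 + [0] * 160,
})

# stratify keeps the positive/negative ratio the same in both splits
train_df, val_df = train_test_split(
    toy, test_size=0.2, random_state=1234, stratify=toy["target"]
)

print("train positives:", train_df["target"].mean())
print("val positives:  ", val_df["target"].mean())
```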

In [None]:
#let's initialize some things
IMG_SIZE=(224,224)
BATCH_SIZE=64
EPOCHS=2

I'll do some very basic preprocessing:
* normalizing (rescaling pixel values to [0, 1])
* resizing (to `IMG_SIZE`)
* augmentation (only for the training data)
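
To make the normalizing and augmentation items concrete, here is a small NumPy sketch of what rescaling by 1/255 and a horizontal flip do to an image array. It uses a random 4x4 array as a stand-in for a real photo and is only an illustration; in the notebook these steps are handled by `ImageDataGenerator`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake 4x4 RGB image with uint8 pixels, standing in for a decoded JPEG
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Normalizing: map pixel values from [0, 255] to [0, 1]
scaled = image.astype(np.float32) / 255.0

# Horizontal flip: reverse the width axis (axis 1 in height x width x channel)
flipped = scaled[:, ::-1, :]

print(scaled.min(), scaled.max())
print(flipped.shape)
```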

In [None]:
train_datagen = ImageDataGenerator(rescale=1./255,rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,horizontal_flip=True)
val_datagen=ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_dataframe(
    train,
    x_col='images',
    y_col='target',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    shuffle=True,
    class_mode='raw')

validation_generator = val_datagen.flow_from_dataframe(
    validation,
    x_col='images',
    y_col='target',
    target_size=IMG_SIZE,
    shuffle=False,
    batch_size=BATCH_SIZE,
    class_mode='raw')



## Modelling
I'm using a pretrained VGG-16 and adding a final dense layer on top.
> **I know VGG is not a common choice for these competitions but it's a fairly simple architecture to start with compared to a Resnet or EfficientNet, also it takes less time to train and gives a decent baseline score on the Leaderboard.**

The competition is evaluated on AUC scores, so we'll use that as a metric. Focal loss handles class imbalance better, so I'll be using it instead of binary cross-entropy. You can read more about it [here](https://arxiv.org/abs/1708.02002).
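
For intuition, the focal loss from the linked paper is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The NumPy sketch below is my own illustration, with helper names that are not from the original kernel. It compares focal loss with plain binary cross-entropy on four hand-picked predictions to show how confidently correct ("easy") examples get down-weighted.

```python
import numpy as np

def binary_crossentropy_np(y_true, y_pred, eps=1e-7):
    """Per-example binary cross-entropy, i.e. -log(p_t)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    return -np.log(p_t)

def focal_loss_np(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-example focal loss: alpha_t * (1 - p_t)**gamma * -log(p_t)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    return alpha_t * (1 - p_t) ** gamma * -np.log(p_t)

# easy positive, hard positive, easy negative, hard negative
y_true = np.array([1, 1, 0, 0])
y_pred = np.array([0.9, 0.1, 0.1, 0.9])

bce = binary_crossentropy_np(y_true, y_pred)
focal = focal_loss_np(y_true, y_pred)
for label, p, b, f in zip(y_true, y_pred, bce, focal):
    print("label={} p={:.1f} bce={:.4f} focal={:.4f}".format(label, p, b, f))
```

The easy examples keep only a small fraction of their cross-entropy, while the hard ones keep most of it, so training focuses on the examples the model gets wrong.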

In [None]:
def vgg16_model(num_classes=None):
    # num_classes is currently unused: the head is always a single sigmoid unit
    base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    x=Flatten()(base.output)
    output=Dense(1,activation='sigmoid')(x) # predicted probability of the malignant class
    model=Model(base.input,output)

    return model

vgg_conv=vgg16_model(1)

In [None]:
def focal_loss(alpha=0.25,gamma=2.0):
    def focal_crossentropy(y_true, y_pred):
        # cast labels to the prediction dtype (they may arrive as integers)
        y_true = K.cast(y_true, y_pred.dtype)
        bce = K.binary_crossentropy(y_true, y_pred)

        # p_t is the predicted probability of the true class
        y_pred = K.clip(y_pred, K.epsilon(), 1.- K.epsilon())
        p_t = (y_true*y_pred) + ((1-y_true)*(1-y_pred))

        # alpha weights the two classes, (1-p_t)^gamma down-weights easy examples
        alpha_factor = y_true*alpha + ((1-alpha)*(1-y_true))
        modulating_factor = K.pow((1-p_t), gamma)

        # compute the final loss and return
        return K.mean(alpha_factor*modulating_factor*bce, axis=-1)
    return focal_crossentropy

In [None]:
opt = Adam(learning_rate=1e-4)  # `lr` is the deprecated name of this argument
vgg_conv.compile(loss=focal_loss(), metrics=[tf.keras.metrics.AUC()],optimizer=opt)

In [None]:
nb_train_steps = train.shape[0]//BATCH_SIZE
nb_val_steps=validation.shape[0]//BATCH_SIZE
print("Number of training and validation steps: {} and {}".format(nb_train_steps,nb_val_steps))
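
The step counts above use floor division, so any final partial batch is skipped. A small sketch of the difference, using made-up sizes rather than the real dataset sizes:

```python
import math

n_samples = 1000   # hypothetical number of validation images
batch_size = 64

floor_steps = n_samples // batch_size           # drops the last partial batch
ceil_steps = math.ceil(n_samples / batch_size)  # includes the last partial batch

skipped = n_samples - floor_steps * batch_size
print(floor_steps, ceil_steps, skipped)
```

For training this rarely matters because the generator reshuffles every epoch, but for validation it means a few images are never scored.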

In [None]:
cb=[PlotLossesKeras()]
vgg_conv.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_steps,
    epochs=EPOCHS,
    validation_data=validation_generator,
    callbacks=cb,
    validation_steps=nb_val_steps)
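
Since the leaderboard metric is AUC, it can be useful to compute it directly on held-out predictions as a sanity check. `tf.keras.metrics.AUC` approximates the value using a fixed set of thresholds, while `roc_auc_score` from scikit-learn computes it exactly. The labels and scores below are toy values; in the notebook they would come from the validation labels and the model's predictions on the validation set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is the fraction of (positive, negative) pairs ranked correctly:
# here 3 of the 4 pairs are in the right order
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

Because AUC depends only on the ranking of the scores, rescaling the predictions does not change it.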

## Submission

In [None]:
target=[]
for path in df_test['images']:
    img=cv2.imread(str(path))
    img = cv2.resize(img, IMG_SIZE)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype(np.float32)/255.
    img=np.reshape(img,(1,IMG_SIZE[0],IMG_SIZE[1],3))
    prediction=vgg_conv.predict(img)
    target.append(prediction[0][0])

submission['target']=target
    
        

In [None]:
submission.to_csv('submission.csv', index=False)
submission.head()
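
The prediction loop above calls `predict` once per image, which can be slow on a large test set. Below is a hedged sketch of a batched alternative. The helper `predict_in_batches` and the fake loader and predictor are hypothetical stand-ins so the example runs on its own; in the notebook `load_fn` would wrap the same cv2 preprocessing and `predict_fn` would be `vgg_conv.predict`.

```python
import numpy as np

def predict_in_batches(paths, load_fn, predict_fn, batch_size=64):
    """Load images in chunks and call predict_fn once per chunk."""
    scores = []
    for start in range(0, len(paths), batch_size):
        chunk = paths[start:start + batch_size]
        # Stack the preprocessed images into one (n, H, W, 3) array
        batch = np.stack([load_fn(p) for p in chunk])
        preds = np.asarray(predict_fn(batch))
        scores.extend(preds.reshape(-1).tolist())
    return scores

# Stand-ins for the real loader and model (hypothetical)
def fake_load(path):
    return np.full((8, 8, 3), len(path) / 100.0, dtype=np.float32)

def fake_predict(batch):
    return batch.mean(axis=(1, 2, 3)).reshape(-1, 1)

paths = ["a.jpg", "bb.jpg", "ccc.jpg", "dddd.jpg", "eeeee.jpg"]
scores = predict_in_batches(paths, fake_load, fake_predict, batch_size=2)
print(scores)
```

The scores come back in the same order as `paths`, so they can be assigned straight to the submission column.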

I'll keep on updating this kernel with new experiments.

If you liked it please upvote the kernel.