# **Melanoma Detection Using EfficientNet and MetaData Ensemble** <br>
Author: TeYang, Lau<br>
Created: 1/5/2020<br>
Last update: 17/8/2020<br>

<img src = 'https://www.norrisderm.com/app/uploads/2013/06/portland-basal-cell-carcinoma-1200x600.jpg' width="700">
<br>

This notebook was created to try out the **tensor processing unit** (TPU) for deep learning. TPUs are accelerators developed by Google for neural network workloads, originally built around TensorFlow, and they offer very high training throughput. GPUs, on the other hand, have been around for a long time and excel at matrix multiplication and parallelization compared to CPUs. Although TPUs are more expensive than GPUs on a per-hour basis, the time saved when training the same model can make a TPU the more cost-efficient choice.

Here, I am using **EfficientNet** (EffNet), one of the latest deep learning architectures, via **transfer learning** to detect melanoma from images of skin lesions. The dataset consists of around 33,000 training images and 10,000 test images. By leveraging transfer learning, we 'transfer' the weights of a pretrained model, which already encode low-level features (e.g., lines, shapes, etc.). Doing so saves us from training a model from scratch, which can be challenging for those who do not have a 'big' enough dataset or enough computational resources. I will also be trying out **Test Time Augmentation** to reduce errors in the validation and test predictions. This is explained further below.

This notebook uses code from [Chris Deotte's](https://www.kaggle.com/cdeotte) excellent [notebook](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords) with some adjustments. All credits go to him!

The process is as follows:
1. [Data Loading and Structure](#Data_loading_structure)
2. [Set Up TPU](#TPU)
3. [Setting Hyperparameters](#Hyperparameters)
    - [Learning Rate Train Schedule](#TrainSchedule)
4. [Image Loading Functions](#ImageLoading)
5. [Train and Evaluate Model](#Train_model)
    - [EfficientNets](#EffNets)
    - [Test Time Augmentation](#TTA) 
    - [Out of Fold (OOF) in K-Fold Cross Validation](#OOF)
    - [Label Smoothing](#LabelSmooth)
    - [Calculate OOF AUC](#OOF_AUC)
    - [ROC and PR Curves](#AUROC_AUPRC)
6. [Predicting on Test Set](#Predict_test)
7. [Meta-Data Train and Prediction](#MetaData) 
    - [Clean Train and Test Sets](#Clean)
    - [Dummy Encoding](#Dummy)
    - [Prepare Train and Test Sets](#Prepare)
    - [Train, Predict and Blend](#TrainPredict)
8. [Ensemble Model](#Ensemble)
9. [Conclusion](#Conclusion)<br><br>


Before we begin, let's watch a short video to get an idea of how melanoma is diagnosed!

In [None]:
from IPython.display import YouTubeVideo,HTML
YouTubeVideo("hXYd0WRhzN4", width=700, height=500)

<a id='Data_loading_structure'></a>
# 1. Data Loading and Structure

We start by loading the dependencies and data, and exploring the dataset to look at its structure and distribution. We also plot some images to get a feel for them.

In [None]:
# import modules

import numpy as np 
import pandas as pd 
import math
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import PIL

from sklearn import model_selection
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.model_selection import cross_validate, cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

from random import choices
from functools import partial
import re
from kaggle_datasets import KaggleDatasets

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Dense 
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras import optimizers
from tensorflow.keras.utils import plot_model

!pip install -q efficientnet
import efficientnet.tfkeras as efn


The dataset that we will be using has been preprocessed for TensorFlow by [Chris Deotte](https://www.kaggle.com/cdeotte). The images have been resized and stored as TensorFlow records (TFRecords), which are more efficient to load when training on a TPU. Furthermore, the images have been triple stratified:
1. All images from one patient are fully contained in a single TFRecord
2. All TFRecords have equal data class proportion (i.e. each TFRecord now contains 1.8% malignant images)
3. Each TFRecord has an equal number of patients with 115 images, with 100 images, with 70 images, and so on.

Refer to Chris's [post](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/165526) for more information.

In [None]:
train = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/train.csv')

test = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/test.csv')

sample = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')

train.head()

In [None]:
# look at distribution

fig = plt.figure(figsize=(14,15))

ax1 = fig.add_subplot(321)
ax1 = train['benign_malignant'].value_counts().plot(kind='barh', color=['blue','red'], alpha=.5,
                                                   title='Melanoma Target Distribution')

ax2 = fig.add_subplot(322)
ax2 = train['sex'].value_counts().plot(kind='barh', color=['blue','red'], alpha=.5,
                                                  title='Sex Distribution')

ax3 = fig.add_subplot(323)
ax3 = train['diagnosis'].value_counts().plot(kind='barh', color=['blue','red'], alpha=.5,
                                                  title='Diagnosis Distribution')

ax4 = fig.add_subplot(324)
ax4 = train['anatom_site_general_challenge'].value_counts().plot(kind='barh', 
                                                            color=['blue','red'], alpha=.5, 
                                                            title='Anatomical Site Distribution')

ax5 = fig.add_subplot(325)
ax5 = sns.distplot(train['age_approx'], kde=False)


Notice that there are significantly more benign images than malignant images in the train dataset. Stratifying when splitting the train and validation sets, as well as during cross-validation, will thus be important. Another option is to apply class weights to the loss so that the benign images are weighted less.
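To make the class-weighting option concrete, here is a minimal sketch of the common "balanced" heuristic, where each class weight is `total / (n_classes * class_count)`. The helper name `balanced_class_weights` and the counts are illustrative assumptions (chosen to match the roughly 1.8% malignant rate mentioned above), not values read from `train.csv`.

```python
def balanced_class_weights(counts):
    """Map each class label to total / (n_classes * count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Hypothetical counts: 1,000 images of which 18 (1.8%) are malignant.
counts = {0: 982, 1: 18}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")

# The weighted number of examples stays equal to the real number of examples,
# which keeps the loss on a similar scale.
print(sum(weights[c] * counts[c] for c in counts))
```

A dictionary of this shape is what Keras accepts through the `class_weight` argument of `model.fit`.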

Let's look at some benign and malignant images. Can you identify the differences based on the video we just saw?

In [None]:
%matplotlib inline 
from matplotlib.image import imread
import random
import cv2

# Function for plotting samples
def plot_samples(samples):  
    fig, ax = plt.subplots(nrows=4, ncols=5, figsize=(30,16))
    for i in range(len(samples)):
        image = imread(samples[i])
        ax[i//5][i%5].imshow(image)
        if i<10:
            ax[i//5][i%5].set_title("Benign", fontsize=20)
        else:
            ax[i//5][i%5].set_title("Malignant", fontsize=20)
        ax[i//5][i%5].axis('off')

In [None]:
# sample images

dirname = '/kaggle/input/siim-isic-melanoma-classification/jpeg/train/'
# get benign filepaths 
sampleimg = [dirname + i + '.jpg' for i in train[train['benign_malignant'] == 'benign']['image_name'][:10]] + \
[dirname + i + '.jpg' for i in train[train['benign_malignant'] == 'malignant']['image_name'][:10]]

plot_samples(sampleimg)
plt.suptitle('Melanoma Samples', fontsize=30)
plt.show()

<a id='TPU'></a>
# 2. Set Up TPU

Now we set up the TPU so that training can be done faster. As the sizes of our models and datasets increase, we need to use either TPUs or GPUs to train our models within a reasonable amount of time. TPUs are application-specific integrated circuits (ASICs) designed by Google specifically for machine learning. A single TPU v2 device, built from four of these custom ASICs, offers up to 180 TFLOPS of performance with 64 GB of high-bandwidth memory. This makes TPUs well suited to both training and inference of machine learning models.

<img src="https://qph.fs.quoracdn.net/main-qimg-9ecde3fc6d69116db89aacd83bdf15e5">

In [None]:
AUTO = tf.data.experimental.AUTOTUNE

# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is set. 
    # On Kaggle this is always the case.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy() 

REPLICAS = strategy.num_replicas_in_sync
print("REPLICAS: ", strategy.num_replicas_in_sync)


<a id='Hyperparameters'></a>
# 3. Setting Hyperparameters

Some of the hyperparameters are lists with one value per fold so that different models can be trained and compared in each k-fold during experimentation. However, when training the final model, the hyperparameters should be the same for all folds.

In [None]:
# Hyperparameters

SEED = 42
FOLDS=3
EFF_NETS = [6]*FOLDS
BATCH_SIZES = [bs * strategy.num_replicas_in_sync for bs in [32]*FOLDS]
IMG_SIZES = [512]*FOLDS
EPOCHS = [10]*FOLDS
DROPOUT = 0.25
LR = 0.00004
WARMUP = 5
# CLASS WEIGHTS
# Each class is weighted by the frequency of the other class, so the rare
# malignant class (1) receives a much larger weight than the benign class (0).
CLASS_WEIGHT = {0: train['benign_malignant'].value_counts().malignant/len(train),
                1: train['benign_malignant'].value_counts().benign/len(train)}
# WEIGHTS FOR FOLD MODELS WHEN PREDICTING TEST
WGTS = [1/FOLDS]*FOLDS
LABEL_SMOOTHING = 0.05
TTA = 15 # test time augmentation

# INCLUDE OLD COMP DATA? YES=1 NO=0
INC2018 = [1]*FOLDS

# seed
np.random.seed(SEED)
tf.random.set_seed(SEED)

Here, we set the paths to the different datasets containing different image sizes.

In [None]:
# Get file paths for train, validation and test

DATASET = {512: '512x512-melanoma-tfrecords-70k-images',
           384: 'melanoma-384x384',
           192: 'melanoma-192x192'}

GCS_PATH = [None]*FOLDS; GCS_PATH2 = [None]*FOLDS
for i,k in enumerate(IMG_SIZES):
    GCS_PATH[i] = KaggleDatasets().get_gcs_path(DATASET[k])
    #GCS_PATH[i] = KaggleDatasets().get_gcs_path('melanoma-%ix%i'%(k,k))
    GCS_PATH2[i] = KaggleDatasets().get_gcs_path('isic2019-%ix%i'%(k,k))

train_filenames = tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')

test_filenames = tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')

<a id='TrainSchedule'></a>
### Learning Rate Train Schedule ###

This is a common train schedule for transfer learning. The learning rate starts near zero, increases to a maximum, and then decays over time. Consider changing the schedule and/or learning rates. Note how the maximum learning rate grows with the batch size, which is good practice.

The learning rate schedule we are using, built by `get_lr_callback`, is somewhat similar to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120). Refer to this [post](https://sgugger.github.io/the-1cycle-policy.html) to get a better understanding.
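Written out, the schedule computed by `lrfn` in this notebook is the piecewise function below. The symbols are shorthand introduced here for readability: $\eta_0$ stands for `lr_start`, $\eta_{\max}$ for `lr_max`, $\eta_{\min}$ for `lr_min`, $R$ for `lr_ramp_ep`, $S$ for `lr_sus_ep`, $\gamma$ for `lr_decay`, and $e$ for the epoch index.

$$
\eta(e) =
\begin{cases}
\eta_0 + \dfrac{\eta_{\max} - \eta_0}{R}\, e, & e < R \\[6pt]
\eta_{\max}, & R \le e < R + S \\[6pt]
\eta_{\min} + (\eta_{\max} - \eta_{\min})\, \gamma^{\,e - R - S}, & e \ge R + S
\end{cases}
$$

With $S = 0$, as configured here, the sustain phase is empty and the peak $\eta(R) = \eta_{\max}$ is reached by the first step of the decay branch.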

In [None]:
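def get_lr_callback(batch_size=8):
    # schedule parameters; the peak learning rate scales with the batch size
    lr_start   = 0.000005
    lr_max     = 0.000003 * batch_size
    lr_min     = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep  = 0
    lr_decay   = 0.3

    # lrfn is defined inside get_lr_callback so that it uses the parameters
    # above (and therefore the batch_size argument) rather than global variables
    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            # linear warm-up from lr_start to lr_max
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start

        elif epoch < lr_ramp_ep + lr_sus_ep:
            # hold the maximum learning rate
            lr = lr_max

        else:
            # exponential decay towards lr_min
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min

        return lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback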

In [None]:
lr_start   = 0.000005
lr_max     = 0.000003 * BATCH_SIZES[0]
lr_min     = 0.000001
lr_ramp_ep = 5
lr_sus_ep  = 0
lr_decay   = 0.3

def lrfn(epoch):
    if epoch < lr_ramp_ep:
        lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start

    elif epoch < lr_ramp_ep + lr_sus_ep:
        lr = lr_max

    else:
        lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min

    return lr

rng = [i for i in range(8 if EPOCHS[0]<8 else EPOCHS[0])]
y = [lrfn(x) for x in rng]
plt.plot(rng, y)
print("Learning rate schedule: {:.3g} to {:.3g} to {:.3g}".format(y[0], max(y), y[-1]))

<a id='ImageLoading'></a>
# 4. Image Loading Functions

Here we create some functions for loading TFRecords to get the images, target and image names.

* **decode_image:** transforms an image into a tensor, normalizes it, and reshapes it into the fixed shape required by the TPU
* **read_tfrecord:** reads a TFRecord and returns the image tensor together with, depending on the input arguments, the label, the image name, or a placeholder (0)
* **load_dataset:** reads data from the TFRecords. Here we can choose whether to shuffle the data or not. We will do that for the train dataset, but not for the validation and test datasets.
* **count_data_items:** counts the number of images in a list of files
* **data_augment:** performs data augmentation
* **plot_transform:** plots some examples of augmented images
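
As a quick illustration of how `count_data_items` works: it does not open the files, but reads the image count that is embedded in each file name. The sketch below reproduces that idea with the standard library only. The helper name `count_from_names` and the file names are hypothetical, assuming a `<prefix>-<count>.tfrec` naming convention.

```python
import re

def count_from_names(filenames):
    # The digits between a "-" and the following "." are taken as the image count.
    pattern = re.compile(r"-([0-9]*)\.")
    return sum(int(pattern.search(name).group(1)) for name in filenames)

# Hypothetical file names following the "<prefix>-<count>.tfrec" convention.
example_files = ["train00-2071.tfrec", "train01-2116.tfrec"]
print(count_from_names(example_files))  # 2071 + 2116 = 4187
```

If a file name does not follow this convention, `pattern.search` returns `None` and the function raises an error, so the convention needs to hold for every file.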

### Other augmentations
Here we apply some manual augmentations that cannot be done with `tf.image`, such as shearing, zooming and translation. Rotation can be done with `tf.image`, but only in multiples of 90 degrees, so we do it manually instead.
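
The idea behind these manual augmentations is to express each operation as a 3x3 matrix in homogeneous coordinates, multiply the matrices together, and apply the result to pixel coordinates. Below is a small pure-Python sketch of that idea for rotation and shift only. The helper names are illustrative and not part of the notebook; the rotation matrix uses the same layout as the one in `get_mat`.

```python
import math

def rotation_matrix(degrees):
    """3x3 homogeneous rotation matrix with the layout [[c, s, 0], [-s, c, 0], [0, 0, 1]]."""
    r = math.radians(degrees)
    c, s = math.cos(r), math.sin(r)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def shift_matrix(dx, dy):
    """3x3 homogeneous translation matrix."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, x, y):
    """Apply matrix m to the point (x, y, 1) and return the rounded (x, y) result."""
    vx, vy, _ = (m[i][0] * x + m[i][1] * y + m[i][2] for i in range(3))
    return round(vx, 6), round(vy, 6)

# The right-most matrix acts first: shift by (2, 0), then rotate by 90 degrees.
m = matmul(rotation_matrix(90), shift_matrix(2, 0))
print(apply(m, 1, 0))  # (1, 0) -> shifted to (3, 0) -> rotated to (0.0, -3.0)
```

In the notebook, the combined matrix maps every destination pixel index back to a source pixel index, which is why the image can be rotated by any angle rather than only by multiples of 90 degrees.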

In [None]:
ROT_ = 180.0
SHR_ = 2.0
HZOOM_ = 8.0
WZOOM_ = 8.0
HSHIFT_ = 8.0
WSHIFT_ = 8.0

def get_mat(rotation, shear, height_zoom, width_zoom, height_shift, width_shift):
    # returns a 3x3 transform matrix which transforms indices
        
    # CONVERT DEGREES TO RADIANS
    rotation = math.pi * rotation / 180.
    shear    = math.pi * shear    / 180.

    def get_3x3_mat(lst):
        return tf.reshape(tf.concat([lst],axis=0), [3,3])
    
    # ROTATION MATRIX
    c1   = tf.math.cos(rotation)
    s1   = tf.math.sin(rotation)
    one  = tf.constant([1],dtype='float32')
    zero = tf.constant([0],dtype='float32')
    
    rotation_matrix = get_3x3_mat([c1,   s1,   zero, 
                                   -s1,  c1,   zero, 
                                   zero, zero, one])    
    # SHEAR MATRIX
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)    
    
    shear_matrix = get_3x3_mat([one,  s2,   zero, 
                                zero, c2,   zero, 
                                zero, zero, one])        
    # ZOOM MATRIX
    zoom_matrix = get_3x3_mat([one/height_zoom, zero,           zero, 
                               zero,            one/width_zoom, zero, 
                               zero,            zero,           one])    
    # SHIFT MATRIX
    shift_matrix = get_3x3_mat([one,  zero, height_shift, 
                                zero, one,  width_shift, 
                                zero, zero, one])
    
    return K.dot(K.dot(rotation_matrix, shear_matrix), 
                 K.dot(zoom_matrix,     shift_matrix))


def transform(image, DIM=512):    
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image randomly rotated, sheared, zoomed, and shifted
    XDIM = DIM%2 #fix for size 331
    
    rot = ROT_ * tf.random.normal([1], dtype='float32')
    shr = SHR_ * tf.random.normal([1], dtype='float32') 
    h_zoom = 1.0 + tf.random.normal([1], dtype='float32') / HZOOM_
    w_zoom = 1.0 + tf.random.normal([1], dtype='float32') / WZOOM_
    h_shift = HSHIFT_ * tf.random.normal([1], dtype='float32') 
    w_shift = WSHIFT_ * tf.random.normal([1], dtype='float32') 

    # GET TRANSFORMATION MATRIX
    m = get_mat(rot,shr,h_zoom,w_zoom,h_shift,w_shift) 

    # LIST DESTINATION PIXEL INDICES
    x   = tf.repeat(tf.range(DIM//2, -DIM//2,-1), DIM)
    y   = tf.tile(tf.range(-DIM//2, DIM//2), [DIM])
    z   = tf.ones([DIM*DIM], dtype='int32')
    idx = tf.stack( [x,y,z] )
    
    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(m, tf.cast(idx, dtype='float32'))
    idx2 = K.cast(idx2, dtype='int32')
    idx2 = K.clip(idx2, -DIM//2+XDIM+1, DIM//2)
    
    # FIND ORIGIN PIXEL VALUES           
    idx3 = tf.stack([DIM//2-idx2[0,], DIM//2-1+idx2[1,]])
    d    = tf.gather_nd(image, tf.transpose(idx3))
        
    return tf.reshape(d,[DIM, DIM,3])

In [None]:
def decode_image(image):
    # decode a JPEG-encoded image to a uint8 tensor
    image = tf.image.decode_jpeg(image, channels=3) 
    # cast tensor to float32 and normalize to [0, 1] range
    image = tf.cast(image, tf.float32)/255.0 
    # explicit size needed for TPU
    image = tf.reshape(image, [IMG_SIZES[0], IMG_SIZES[0], 3]) 
    return image

def read_tfrecord(example, labeled, return_imgname=False):
    tfrecord_format = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64)
    } if labeled else {
        "image": tf.io.FixedLenFeature([], tf.string),
        "image_name": tf.io.FixedLenFeature([], tf.string)
    }
    example = tf.io.parse_single_example(example, tfrecord_format)
    image = decode_image(example['image'])
    # returns a dataset of (image, label) pairs if labeled=True
    if labeled:
        label = tf.cast(example['target'], tf.int32)
        return image, label
    idnum = example['image_name']
    # returns a dataset of (image, image_name) pairs if return_imgname=True
    if return_imgname:
        return image, idnum
    # else returns a dataset of (image, 0) pairs 
    return image, 0

def load_dataset(filenames, labeled=True, ordered=False, return_imgname=False):
    # Read from TFRecords. For optimal performance, reading from multiple files at once and
    # disregarding data order. Order does not matter since we will be shuffling the data anyway
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable order, increase speed
    # automatically interleaves reads from multiple files
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO) 
    # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.with_options(ignore_order) 
    dataset = dataset.map(partial(read_tfrecord, labeled=labeled, 
                                  return_imgname=return_imgname), num_parallel_calls=AUTO)
    # returns a dataset of (image, label) pairs if labeled=True, or (image, id) pairs if labeled=False
    return dataset

def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)

def data_augment(image, label=None, seed=SEED):
    # data augmentation. Thanks to the dataset.prefetch(AUTO) statement when 
    # loading dataset (below cell), this happens essentially for free on TPU. Data pipeline
    # code is executed on the "CPU" part of the TPU while the TPU itself is 
    # computing gradients.
    image = transform(image, IMG_SIZES[0])
    #image = tf.image.rot90(image,k=np.random.randint(4)) # rotate
    image = tf.image.random_flip_left_right(image, seed=seed) # flip horizontal
    image = tf.image.random_flip_up_down(image, seed=seed) # flip vertical
    image = tf.image.random_brightness(image, max_delta=0.2) # random brightness
    image = tf.image.random_contrast(image, 0.8, 1.2) # random contrast
    image = tf.image.random_saturation(image, 0.7, 1.3) # random saturation
    
    if label is None:
        return image
    else:
        return image, label

# plot augmented images sample
def plot_transform(num_images):
    fig, ax = plt.subplots(nrows=3, ncols=num_images, figsize=(12,5))
    x = (load_dataset(train_filenames, labeled=True)
                     .shuffle(SEED)
                     .batch(BATCH_SIZES[0],drop_remainder=True)                 
                     .prefetch(AUTO)
                     .unbatch().take(5))
    images = []
    imgs=[]
    for r in range(3):
        image,_ = next(iter(x))
        images.append(image)
        for i in range(0,num_images):
            image = data_augment(image=images[r])
            imgs.append(image)
    for img in range(len(imgs)):          
        ax[img//num_images][img%num_images].imshow(imgs[img])
        ax[img//num_images][img%num_images].axis('off')

Let's test out the functions to make sure they are working.

In [None]:
# Function for plotting images in grid
def show_dataset(thumb_size, cols, rows, ds):
    mosaic = PIL.Image.new(mode='RGB', size=(thumb_size*cols + (cols-1), 
                                             thumb_size*rows + (rows-1)))
   
    for idx, data in enumerate(iter(ds)):
        img, target_or_imgid = data
        ix  = idx % cols
        iy  = idx // cols
        img = np.clip(img.numpy() * 255, 0, 255).astype(np.uint8)
        img = PIL.Image.fromarray(img)
        img = img.resize((thumb_size, thumb_size), resample=PIL.Image.BILINEAR)
        mosaic.paste(img, (ix*thumb_size + ix, 
                           iy*thumb_size + iy))

    display(mosaic)
    
eg_ds = (load_dataset(train_filenames, labeled=True)
                     .batch(BATCH_SIZES[0],drop_remainder=True)                 
                     .prefetch(AUTO)
                     .unbatch().take(10*6))  

show_dataset(64, 10, 6, eg_ds)

Here are some examples of the image augmentations

In [None]:
plot_transform(7)

<a id='Train_model'></a>
# 5. Train and Evaluate Model

Before we train the model, let's take a moment to go through some of the concepts.

<a id='EffNets'></a>
### EfficientNets ###

Here is a quick explanation of **EfficientNets**. 

In training convolutional neural networks (CNNs), **scaling** is usually done to improve the model's accuracy, but it can also help improve the efficiency of a model (i.e., how quickly it trains). There are three scaling dimensions of a CNN: *depth*, *width*, and *resolution*. **Depth** refers to how deep the network is, which is equivalent to the number of layers in it. **Width** refers to how wide the network is; one measure of width, for example, is the number of channels in a Conv layer. **Resolution** is simply the resolution of the image that is passed to the CNN. 

Here is a picture showing the different scalings applied to a CNN:

<img src="https://miro.medium.com/max/700/1*xQCVt1tFWe7XNWVEmC6hGQ.png">

<br>

While depth scaling allows a network to learn richer and more complex features, the returns usually diminish as the depth gets too large because of vanishing gradients, even when using techniques such as skip-connections in **Residual Networks**. Width scaling, on the other hand, is commonly used for small models: wider networks tend to capture more fine-grained features and are easier to train. However, very wide but shallow networks struggle to capture higher-level features, so the accuracy gain again saturates as the width grows. Resolution scaling is quite intuitive: higher-resolution images let the network see finer patterns, but the accuracy gain also diminishes at very high resolutions. Therefore, ***scaling up any dimension of network (width, depth or resolution) improves accuracy, but the accuracy gain diminishes for bigger models.***

EfficientNet was created with the idea of using **Compound Scaling**. This uses a **compound coefficient ɸ** to uniformly scale the network width, depth and resolution. ɸ is a user-specified coefficient that controls how many more resources are available for scaling; starting from the baseline network **B0**, increasing ɸ produces **EfficientNets** ***B1-B7***. Refer to this [post](https://medium.com/@nainaakash012/efficientnet-rethinking-model-scaling-for-convolutional-neural-networks-92941c5bfb95) for a better intuition behind the scaling of EfficientNets.
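
As a rough numerical illustration of compound scaling, the sketch below uses the base coefficients reported in the EfficientNet paper (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution), which were chosen so that α·β²·γ² ≈ 2. The function names are hypothetical, and the released B1-B7 configurations do not necessarily follow these formulas exactly, so treat the numbers as an illustration rather than the actual model specifications.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases from the paper

def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_factor(phi):
    """Approximate growth in FLOPs: depth * width^2 * resolution^2, roughly 2^phi."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_factor(phi):.2f}")
```

Each unit increase of ɸ therefore roughly doubles the compute budget while growing all three dimensions together.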

<a id='TTA'></a>
### Test Time Augmentation ###
Similar to what data augmentation does to the training set, **Test Time Augmentation** (TTA) performs random modifications to the test images. Instead of using the trained model to make a single prediction on each *original* test image, we augment each image several times, make a prediction on each augmented copy, and average those predictions. This average is the final prediction for that image. TTA works because by averaging our predictions, we are also ***averaging the errors***. We can also apply TTA to the validation dataset to get better performance on the validation set.

Refer to this [post](https://towardsdatascience.com/test-time-augmentation-tta-and-how-to-perform-it-with-keras-4ac19b67fb4d) for a better intuition and examples of TTA.
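
The averaging step can be sketched without any deep learning library. The snippet assumes the predictions arrive as one flat list in which the whole image set is repeated pass after pass, always in the same order (image 0..N-1 for pass 1, then image 0..N-1 for pass 2, and so on), with possibly a few surplus predictions at the end. The function name `average_tta` and the numbers are made up for illustration.

```python
from statistics import mean

def average_tta(flat_preds, n_images, n_tta):
    """Average the n_tta predictions that belong to each of the n_images images."""
    # Drop surplus predictions produced by the last, partially needed batch.
    flat = flat_preds[: n_images * n_tta]
    # Predictions for image i sit at positions i, i + n_images, i + 2*n_images, ...
    return [mean(flat[i::n_images]) for i in range(n_images)]

# 3 images, 2 augmented passes, plus one surplus prediction (0.9) at the end.
flat_preds = [0.0, 1.0, 0.5,    # pass 1
              0.5, 0.5, 0.25,   # pass 2
              0.9]              # surplus, discarded
print(average_tta(flat_preds, n_images=3, n_tta=2))  # [0.25, 0.75, 0.375]
```

This is the same grouping that `pred.reshape((ct_valid, TTA), order='F')` followed by a mean over `axis=1` performs in the training cell, and it only works if the dataset is read in a fixed order on every pass.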

<a id='OOF'></a>
### Out of Fold (OOF) in K-Fold Cross Validation ###

It is common to evaluate the performance of a machine learning algorithm on a dataset using a resampling technique such as k-fold cross-validation.

The k-fold cross-validation procedure involves splitting a training dataset into k groups, then using each of the k groups in turn as a validation set (hold-out set/out-of-fold set) while the remaining examples are used as a training set. This means that k different models are trained and evaluated. The performance of the model is estimated using the predictions made by the models across all k folds.

An **out-of-fold (OOF)** prediction is a prediction by the model during the k-fold cross-validation procedure. That is, out-of-fold predictions are those predictions made on the holdout datasets during the resampling procedure. If performed correctly, there will be one prediction for each example in the training dataset.
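
Here is a toy sketch of how OOF predictions are collected. It is only meant to show the bookkeeping: the "model" is simply the mean of the training targets, the folds are assigned by index modulo k, and all names are hypothetical. In this notebook the folds are formed over TFRecord files instead of individual examples.

```python
def kfold_oof(targets, k, fit, predict):
    """Return one out-of-fold prediction per example."""
    n = len(targets)
    oof = [None] * n
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # Train only on the examples outside the current fold.
        model = fit([targets[i] for i in train_idx])
        # Predict only on the held-out examples of the current fold.
        for i in val_idx:
            oof[i] = predict(model, i)
    return oof

def fit_mean(train_targets):
    # Toy "model": remember the mean target of the training examples.
    return sum(train_targets) / len(train_targets)

def predict_mean(model, index):
    # Toy prediction: the stored mean, regardless of the example.
    return model

targets = [0, 0, 1, 0, 1, 0]
oof = kfold_oof(targets, k=3, fit=fit_mean, predict=predict_mean)
print(oof)  # [0.5, 0.25, 0.25, 0.5, 0.25, 0.25]
```

Every example receives exactly one prediction, and that prediction always comes from a model that never saw the example during training.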

<a id='LabelSmooth'></a>
### Label Smoothing ###

Label smoothing is a regularization technique for classification problems that prevents the model from predicting the training examples too confidently, which helps it generalize better. In classification, the softmax function is usually applied to the network's logit vector to compute the output probabilities. The problem with using hard targets (e.g., [1,0,0,0] for one-hot encoding) is that they encourage the gap between the largest logit and the others to keep growing, which results in a model that is ***too confident*** about its predictions. This can also lead to overfitting.

Label smoothing attempts to overcome this by applying a smoothing parameter `α` to modify the targets into softer targets, thus reducing the gap between the largest logit and the rest. Refer to this [post](https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06) and this [post](https://medium.com/@nainaakash012/when-does-label-smoothing-help-89654ec75326) for a more in-depth discussion.
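
For the binary case used in this notebook, a minimal sketch of label smoothing looks like this. It assumes the rule `y_smooth = y * (1 - α) + 0.5 * α`, which to my understanding is what Keras applies when `label_smoothing` is set on `BinaryCrossentropy`. The function names are illustrative.

```python
import math

def smooth_binary_label(y, alpha):
    """Pull a hard 0/1 label towards 0.5 by the smoothing factor alpha."""
    return y * (1.0 - alpha) + 0.5 * alpha

def binary_crossentropy(y, p):
    """Cross-entropy between a (possibly soft) target y and a predicted probability p."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

alpha = 0.05  # same value as LABEL_SMOOTHING in this notebook
soft_pos = smooth_binary_label(1, alpha)  # about 0.975
soft_neg = smooth_binary_label(0, alpha)  # about 0.025

# With a soft target, an extremely confident prediction costs more than a
# prediction that sits exactly at the soft target.
overconfident = binary_crossentropy(soft_pos, 0.999)
calibrated = binary_crossentropy(soft_pos, 0.975)
print(overconfident > calibrated)  # True
```

Because the loss is now minimized at 0.975 rather than at 1.0, the model has no incentive to push its logits towards infinity.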

Finally, let's train and evaluate the model. Notice that for the prediction on the test set, we are using each fold's best model to predict on the entire test set. The predictions are then averaged across the folds (through the `WGTS` hyperparameter, which is set to `1/FOLDS` for every fold). 
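
The fold-averaging step described above can be sketched as a weighted sum. The function name `blend_folds` and the prediction values are made up for illustration; only the weighting scheme mirrors `WGTS`.

```python
def blend_folds(fold_preds, weights):
    """Weighted sum of per-fold prediction lists (one list per fold, same image order)."""
    n_images = len(fold_preds[0])
    blended = [0.0] * n_images
    for preds, w in zip(fold_preds, weights):
        for i, p in enumerate(preds):
            blended[i] += w * p
    return blended

folds = 3
weights = [1 / folds] * folds   # mirrors WGTS = [1/FOLDS]*FOLDS
fold_preds = [[0.10, 0.80],     # made-up test predictions from the fold 1 model
              [0.20, 0.90],     # fold 2 model
              [0.30, 0.70]]     # fold 3 model
blended = blend_folds(fold_preds, weights)
print([round(p, 3) for p in blended])  # roughly [0.2, 0.8]
```

Using unequal weights that still sum to 1 would let a stronger fold model contribute more to the final prediction.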

In [None]:
# USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
VERBOSE = 2
DISPLAY_PLOT = True

skf = KFold(n_splits=FOLDS,shuffle=True,random_state=SEED)
# initialize for out of fold storage
oof_pred = []; oof_tar = []; oof_val = []; oof_names = []; oof_folds = [] 
# initialize for prediction storage
preds = np.zeros((count_data_items(test_filenames),1))                    

# idxT=train tfrec num, idxV=val tfrec num
# loop through the kfolds and get the stratified sets
for fold,(idxT,idxV) in enumerate(skf.split(np.arange(15))):
    
    if tpu: tf.tpu.experimental.initialize_tpu_system(tpu) # initialize TPU
          
    # CREATE TRAIN, VALIDATION & TEST SETS FOR CURRENT FOLD
    # get the files in the stratified fold
    files_train = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxT]) 
    # include 2018 comp data
    if INC2018[fold]:
        files_train += tf.io.gfile.glob([GCS_PATH2[fold] + '/train%.2i*.tfrec'%x for x in idxT*2])
    np.random.seed(SEED)
    np.random.shuffle(files_train); print('#'*25)   # shuffle training set
    train_dataset = (load_dataset(files_train, labeled=True)
                     .repeat()                      # repeat to continue getting data for aug
                     .map(data_augment, num_parallel_calls=AUTO)
                     .shuffle(SEED)
                     .batch(BATCH_SIZES[fold],drop_remainder=True)
                     # prefetch next batch while training (autotune prefetch buffer size)
                     .prefetch(AUTO)) 
    files_valid = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxV])
    files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[fold] + '/test*.tfrec')))

          
    # BUILD MODEL
    # when creating multiple models during CV, each model adds nodes to the graph,
    # best to clear the nodes to prevent memory hogging, which slows down training
    K.clear_session()
    with strategy.scope():
        model = tf.keras.Sequential([
            efn.EfficientNetB6(input_shape=(IMG_SIZES[fold],IMG_SIZES[fold], 3),
                               weights='imagenet',pooling='avg',include_top=False),
        # add fully connected layer, with sigmoid activation since only 2 categories
            Dense(1, activation='sigmoid') 
        ])
        model.compile(
            optimizer='adam',
            loss = tf.keras.losses.BinaryCrossentropy(label_smoothing = LABEL_SMOOTHING),
            metrics=[tf.keras.metrics.AUC(name='auc')])
    
    # for saving best model from the best epoch in each kfold
    sv = tf.keras.callbacks.ModelCheckpoint(
            'fold-%i.h5'%fold, monitor='val_loss', verbose=0, save_best_only=True,
            save_weights_only=True, mode='min', save_freq='epoch')

    # DISPLAY FOLD INFO
    print('#'*25); print('#### FOLD',fold+1)
    print('#### Image Size %i with EfficientNet B%i and batch_size %i'%
          (IMG_SIZES[fold],EFF_NETS[fold],BATCH_SIZES[fold]))
    
    # TRAIN
    print('Training...')         
    history = model.fit(
        train_dataset, 
        epochs=EPOCHS[fold], 
        callbacks=[sv,get_lr_callback(BATCH_SIZES[fold])],                      # lr schedule
        steps_per_epoch=count_data_items(files_train) // BATCH_SIZES[fold],
        validation_data=load_dataset(files_valid, labeled=True)                                        
                         .cache()
                         .batch(BATCH_SIZES[fold])
                         .prefetch(AUTO),
        #class_weight=CLASS_WEIGHT,                                             # class weights
        verbose=VERBOSE)
    
    # LOAD BEST MODEL
    print('Loading best model...')
    model.load_weights('fold-%i.h5'%fold)
    
    # PREDICT OOF USING TTA
    print('Predicting OOF with TTA...')
    ds_valid = (load_dataset(files_valid, labeled=True, ordered=True)                                        
                     .cache()     
                     .repeat()                              # repeat for data aug during oof val
                     .map(data_augment, num_parallel_calls=AUTO)  # data augmentation 
                     .batch(BATCH_SIZES[fold]*4)            # x4 batch size to speed up prediction
                     .prefetch(AUTO))
    ct_valid = count_data_items(files_valid)
    STEPS = TTA * ct_valid/BATCH_SIZES[fold]/4    #number of steps to go through all TTA images
    # slice to throw away images that pass the steps
    pred = model.predict(ds_valid,steps=STEPS,verbose=VERBOSE)[:TTA*ct_valid,] 
    # store the average of each valid image
    oof_pred.append( np.mean(pred.reshape((ct_valid,TTA),order='F'),axis=1) ) 
    
    # GET OOF TARGETS, FOLDS, AND NAMES
    # get targets
    # no .repeat() here as we only want each target value once
    ds_valid = (load_dataset(files_valid, labeled=True, ordered=True) # do not shuffle
                        .cache()
                        .batch(BATCH_SIZES[fold]*4)
                        .prefetch(AUTO))
    oof_tar.append( np.array([target.numpy() for img, target in iter(ds_valid.unbatch())]) )
    # track which fold data comes from
    oof_folds.append( np.ones_like(oof_tar[-1],dtype='int8')*fold ) 
    # get names
    ds = (load_dataset(files_valid, labeled=False, return_imgname=True, ordered=True)                                        
                .cache()     
                .batch(BATCH_SIZES[fold]*4)                   
                .prefetch(AUTO))
    oof_names.append( np.array([img_name.numpy().decode("utf-8") for img, img_name in iter(ds.unbatch())]))      
          
    # PREDICT TEST USING TTA
    print('Predicting Test with TTA...')
    ds_test = (load_dataset(files_test, labeled=False, ordered=True) # do not shuffle
              .repeat()                                              # repeat for TTA
              .map(data_augment, num_parallel_calls=AUTO)            # data augmentation 
              .batch(BATCH_SIZES[fold]*4)
              .prefetch(AUTO))
    ct_test = count_data_items(files_test)
    STEPS = TTA * ct_test/BATCH_SIZES[fold]/4 #number of steps to go through all TTA images
    # slice to throw away images that pass the steps
    pred = model.predict(ds_test,steps=STEPS,verbose=VERBOSE)[:TTA*ct_test,] 
    # store the average pred of each test image
    preds[:,0] += np.mean(pred.reshape((ct_test,TTA),order='F'),axis=1) * WGTS[fold] 
    
    # REPORT RESULTS
    auc = roc_auc_score(oof_tar[-1],oof_pred[-1])
    oof_val.append(np.max( history.history['val_auc'] ))
    print('#### FOLD %i OOF AUC without TTA = %.3f, with TTA = %.3f'%(fold+1,oof_val[-1],auc))
          
    # PLOT TRAINING AND VALIDATION
    if DISPLAY_PLOT:
        plt.figure(figsize=(15,5))
        plt.plot(np.arange(EPOCHS[fold]),history.history['auc'],'-o',
                 label='Train AUC',color='#ff7f0e')
        plt.plot(np.arange(EPOCHS[fold]),history.history['val_auc'],'-o',
                 label='Val AUC',color='#1f77b4')
        x = np.argmax( history.history['val_auc'] ); y = np.max( history.history['val_auc'] )
        xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x,y,s=200,color='#1f77b4')
        plt.text(x-0.03*xdist,y-0.13*ydist,'max auc\n%.2f'%y,size=14)
        plt.ylabel('AUC',size=14); plt.xlabel('Epoch',size=14)
        plt.legend(loc=2)
        plt2 = plt.gca().twinx()
        plt2.plot(np.arange(EPOCHS[fold]),history.history['loss'],'-o',
                  label='Train Loss',color='#2ca02c')
        plt2.plot(np.arange(EPOCHS[fold]),history.history['val_loss'],'-o',
                  label='Val Loss',color='#d62728')
        x = np.argmin( history.history['val_loss'] )
        y = np.min( history.history['val_loss'] )
        ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x,y,s=200,color='#d62728')
        plt.text(x-0.03*xdist,y+0.05*ydist,'min loss',size=14)
        plt.ylabel('Loss',size=14)
        plt.xticks(ticks=list(range(EPOCHS[fold])),labels=list(range(1, EPOCHS[fold]+1)))
        plt.title('FOLD %i - Image Size %i, EfficientNet B%i, inc2018=%i'%
                (fold+1,IMG_SIZES[fold],EFF_NETS[fold],INC2018[fold]),size=18)
        plt.legend(loc=3)
        plt.show()

We can see that the validation and train AUC stay quite close together, although on some folds they start to diverge in the later epochs, a sign that the model might be overfitting.

Nevertheless, max performance seems to be ~0.9 AUC.
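
The TTA averaging inside the fold loop depends on how the repeated dataset is laid out. Calling `.repeat()` on an ordered dataset yields every image once, then every image again, so prediction `k` belongs to image `k % ct_valid`. Reshaping with `order='F'` then places all TTA predictions of one image in the same row. Here is a minimal sketch with made-up numbers (the names `n_images` and `n_tta` and the toy values are illustrative, not taken from the notebook):

```python
import numpy as np

n_images, n_tta = 3, 4  # toy sizes; the notebook uses ct_valid and TTA

# Simulate predictions from a repeated, ordered dataset:
# pass 0 over images 0..2, then pass 1 over images 0..2, and so on.
# Image i gets the value i + 0.1 * pass_number so rows are easy to recognise.
pred = np.array([img + 0.1 * p for p in range(n_tta) for img in range(n_images)])

# Column-major reshape: column j holds pass j, row i holds image i
per_image = pred.reshape((n_images, n_tta), order='F')
tta_mean = per_image.mean(axis=1)  # one averaged prediction per image

print(per_image)
print(tta_mean)
```

With the default row-major order, each row would instead mix predictions from different images, so the per-image mean would be wrong.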

<a id='OOF_AUC'></a>
### Calculate OOF AUC ###

Here we calculate the out-of-fold AUC for all train data to get an idea of the overall validation performance.

In [None]:
# COMPUTE OVERALL OOF AUC
oof = np.concatenate(oof_pred); true = np.concatenate(oof_tar);
names = np.concatenate(oof_names); folds = np.concatenate(oof_folds)
auc = roc_auc_score(true,oof)
print('Overall OOF AUC with TTA = %.3f'%auc)

# SAVE OOF TO DISK
df_oof = pd.DataFrame(dict(
    image_name = names, target=true, pred = oof, fold=folds))
df_oof.to_csv('oof.csv',index=False)
df_oof.head()

<a id='AUROC_AUPRC'></a>
### ROC and PR Curves ###

Let's plot the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve, along with the areas under them (AUROC and AUPRC). When the classes are highly imbalanced, as they are here with benign images greatly outnumbering malignant ones, the AUROC may give an over-optimistic view of the model's performance, and the AUPRC might be the better measure. 

For example, a big improvement in the number of false positives only leads to a small change in the false positive rate when using ROC. Precision, on the other hand, by comparing false positives to true positives rather than true negatives, captures the effect of the large number of negative examples on the model’s performance. 

Refer to this paper titled [The Relationship Between Precision-Recall and ROC Curves](http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf) for a great explanation of the relationship.
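
To make the argument concrete, here is a small arithmetic sketch with made-up counts (10,000 benign images and 80 correctly flagged malignant ones; none of these numbers come from the competition data). Halving the false positives barely moves the false positive rate, but it changes precision considerably:

```python
def fpr_and_precision(tp, fp, n_negative):
    """Return (false positive rate, precision) from raw counts."""
    fpr = fp / n_negative          # false positives relative to ALL negatives
    precision = tp / (tp + fp)     # true positives relative to all flagged cases
    return fpr, precision

n_negative = 10_000   # hypothetical number of benign images
tp = 80               # hypothetical number of malignant images correctly flagged

for fp in (400, 200):
    fpr, precision = fpr_and_precision(tp, fp, n_negative)
    print('FP = %3i -> FPR = %.3f, precision = %.3f' % (fp, fpr, precision))
```

In this toy case the false positive rate moves from 0.04 to 0.02, a shift that is hard to see on a ROC plot, while precision rises from about 0.17 to about 0.29.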

**More on Precision and Recall:**<br>
**Accuracy** is not a good evaluation metric when there is a huge **data class imbalance**. Imagine we have 100 samples: 99 pneumonia and 1 normal. A model that predicts everything as pneumonia will get an accuracy of 99%. In this case, it's better to look at **precision and recall**, and their harmonic mean, the **F1 score**. 

This can be visualized using a confusion matrix as well.

<img src = 'https://bk9zeg.bn.files.1drv.com/y4mZtoVgcWgAYE59g3lpWQ3PaZWMqnDN7gz1ir2LIgyPjR6a26Ij1vDBmjsETpEmvAkebvyLjSVofcRVSjW8Ux62r8_tIyIK6AZJ7GQOz_sWtAj_hdQIA57pbJaEpHJEeY_pG7odhdU1osvM7jHXfFzpVsIOt76oqNe39j4KZIFRDOguHUr5jPtDe0TIzNTLQuehcuQdw-aIjt7FR9D6Ti9-A?width=618&height=419&cropmode=none' width="600">
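
The same effect can be reproduced in a few lines. This sketch mirrors the 99-to-1 example above but treats the rare class as the positive class, as melanoma is in this dataset; the labels are synthetic and the "model" is just a constant prediction of the majority class:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical sample: 99 benign (0) and 1 malignant (1)
y_true = np.array([0] * 99 + [1])
# A "model" that labels every image as the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print('accuracy = %.2f, precision = %.2f, recall = %.2f, F1 = %.2f' % (acc, prec, rec, f1))
print(cm)
```

Accuracy looks excellent, yet the single malignant case is missed, which only recall, precision, F1 and the confusion matrix reveal.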

<br>
Since this competition has a huge class imbalance, with far more benign than malignant images, AUPRC would be the more sensible metric, but the competition is scored on AUROC, so we will plot both. We can see that the PR curve shows there is still a lot of room for improvement, while the ROC curve gives an over-optimistic view of how our model is performing. <br>

<br>

More resources here: [The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432).


In [None]:
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

# AUROC
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(FOLDS):
    fpr[i], tpr[i], _ = roc_curve(df_oof[df_oof.fold==i].target, df_oof[df_oof.fold==i].pred)
    roc_auc[i] = auc(fpr[i], tpr[i])
fpr["mean"], tpr["mean"], _ = roc_curve(df_oof.target, df_oof.pred)
roc_auc['mean'] = auc(fpr['mean'], tpr['mean'])

# AUPRC
precision = dict()
recall = dict()
average_precision = dict()
for i in range(FOLDS):
    precision[i], recall[i], _ = precision_recall_curve(df_oof[df_oof.fold==i].target, 
                                                        df_oof[df_oof.fold==i].pred)
    average_precision[i] = average_precision_score(df_oof[df_oof.fold==i].target, 
                                                        df_oof[df_oof.fold==i].pred)
precision["mean"], recall["mean"], _ = precision_recall_curve(df_oof.target, df_oof.pred)
average_precision['mean'] = average_precision_score(df_oof.target, df_oof.pred)

# PLOT
# auroc
fig = plt.figure(figsize=(18,6))
ax1 = fig.add_subplot(121)

ax1.plot([0, 1], [0, 1], linestyle='--', lw=4, color='r',
        label='Chance', alpha=.8)

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
          '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
for i in range(FOLDS):
    ax1.plot(fpr[i], tpr[i], color=colors[i],
        label=r'ROC Fold %i (AUC = %0.2f)' % (i+1,roc_auc[i]),
        lw=3, alpha=.5)

ax1.plot(fpr['mean'], tpr['mean'], color='b',
        label=r'Mean ROC (AUC = %0.2f)' % (roc_auc['mean']),
        lw=4, alpha=.8)    
    
ax1.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05])
plt.title("Receiver Operating Characteristic Curve", size=25)
plt.xlabel('False Positive Rate',size=20); plt.xticks(size=15)
plt.ylabel('True Positive Rate',size=20); plt.yticks(size=15)
ax1.legend(loc="lower right",prop={"size":15})

# auprc
ax2 = fig.add_subplot(122)
for i in range(FOLDS):
    ax2.step(recall[i], precision[i], where='post', color=colors[i],
             label=r'AP Fold %i (AP = %0.2f)' % (i+1,average_precision[i]),
             lw=3, alpha=.5)

ax2.step(recall['mean'], precision['mean'], where='post', color='b',
        label=r'Mean AP (AP = %0.2f)' % (average_precision['mean']),
        lw=4, alpha=.8)    
    
ax2.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05])
plt.title("Precision Recall Curve", size=25)
plt.xlabel('Recall',size=20); plt.xticks(size=15)
plt.ylabel('Precision',size=20); plt.yticks(size=15)
ax2.legend(loc="lower left",prop={"size":15})

plt.show()

<a id='Predict_test'></a>
# 6. Test Set Predictions

Let's gather the test set predictions and organize them into a dataframe.

In [None]:
test_ds = (load_dataset(files_test, labeled=False, return_imgname=True, ordered=True)
              .batch(BATCH_SIZES[fold])
              .prefetch(AUTO))
image_names = np.array([img_name.numpy().decode("utf-8") 
                        for img, img_name in iter(test_ds.unbatch())])

submission = pd.DataFrame(dict(image_name=image_names, target=preds[:,0]))
submission = submission.sort_values('image_name').reset_index(drop=True) 
submission.to_csv('submission_noensemble.csv', index=False)
submission.head()

In [None]:
plt.hist(submission.target,bins=100)
plt.show()

<a id='MetaData'></a>
# 7. Meta-Data Train and Prediction

In addition to the images themselves, the dataset comes with some meta-data, including `age`, `sex` and `site of lesion`. We can train a weak model to predict melanoma from these features, then blend (ensemble) its predictions with the EffNet predictions. Because the meta-data predictions are weak, they should not contribute equally to the final predictions, so we will down-weight them in the blend.

Here we are using the 2020 metadata together with past years' data.

<a id='Clean'></a>
### Clean Train and Test Sets ###

Here we account for missing data. For `age`, we simply replace missing values with the mean. For the categorical features `sex` and `anatom_site_general_challenge`, we replace each missing value with a class level drawn with probability proportional to the class distribution of the feature. For example, if males make up 60% of the sample, a missing `sex` value will be assigned as male 60% of the time. The external data also have more detailed anatomical site classes than the 2020 data, so all the different sites on the torso are merged into a single `torso` level.

In [None]:
train = pd.read_csv('/kaggle/input/melanomaextendedtabular/external_upsampled_tabular.csv')
train.head()

In [None]:
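# merge the different torso sites into a single 'torso' level
# (assigning back avoids relying on in-place modification of a column view)
train['anatom_site_general_challenge'] = train['anatom_site_general_challenge'].replace(
    ['anterior torso','lateral torso','posterior torso'], 'torso')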

In [None]:
train.isnull().sum()

In [None]:
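from random import choices  # weighted sampling with replacement

def fillna(df, column):
    """Fill nulls in `column` in place, sampling levels in proportion to their frequency."""
    na_idx = df.index[df[column].isnull()]                       # index labels of nulls
    freq = df[column].value_counts(normalize=True).sort_index()  # observed class proportions
    if len(na_idx) > 0:
        # one weighted draw per missing value; .loc is safe for any index, unlike .iloc
        df.loc[na_idx, column] = choices(freq.index.tolist(), weights=freq.tolist(), k=len(na_idx))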

train['sex'] = train['sex'].replace('unknown', np.nan) # replace unknown sex with NaNs
fillna(train, 'sex')
fillna(train, 'anatom_site_general_challenge')   
fillna(test, 'anatom_site_general_challenge')
# replace null age with mean age
train['age_approx'] = train['age_approx'].fillna(round(train['age_approx'].mean())) 

# Check missing data cleaning
print('Train Missing Data Count: %i, Test Missing Data Count: %i'%(train.isnull().sum().sum(), test.isnull().sum().sum()))

<a id='Dummy'></a>
### Dummy Encoding ###

In [None]:
# dummy encode categorical features
train = pd.get_dummies(train, columns=["anatom_site_general_challenge"], prefix='site',
                       drop_first=True)
train.replace({'sex': {'female':0, 'male': 1}}, inplace=True)

train.head()

In [None]:
# dummy encode categorical features
test = pd.get_dummies(test, columns=["anatom_site_general_challenge"], prefix='site',
                      drop_first=True)
test.replace({'sex': {'female':0, 'male': 1}}, inplace=True)
test.head()

<a id='Prepare'></a>
### Prepare Train and Test Sets ###

In [None]:
y_train = train['target']
X_train = train.drop(['image_name','height','width','target'],
                     axis=1)
X_test = test.drop(['image_name','patient_id'],axis=1)

# Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = X_train.copy()
X_train_s[['age_approx']] = scaler.fit_transform(X_train_s[['age_approx']])
X_test_s = X_test.copy()
X_test_s[['age_approx']] = scaler.transform(X_test_s[['age_approx']])  # reuse the scaler fitted on train

X_train_s.head()

<a id='TrainPredict'></a>
### Train, Predict and Blend ###
We will train 3 models using cross-validation then average their predictions.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(penalty='l2', class_weight='balanced', random_state=SEED)
nb = GaussianNB()
rf = RandomForestClassifier(n_estimators=1000, max_depth=2, class_weight='balanced',
                            n_jobs=-1, random_state=SEED)

estimators = [lr, rf, nb]
cv = StratifiedKFold(5, shuffle=True, random_state=SEED)

def model_cv(X_train, y_train, estimators, cv):
    
    for est in estimators:
        model_name = est.__class__.__name__
        cv_results = cross_validate(est, X_train, y_train, cv=cv,
                                scoring='roc_auc',
                                return_train_score=True,
                                n_jobs=-1)
        train_score = cv_results['train_score'].mean()
        test_score = cv_results['test_score'].mean()
        
        print('%s - Train Score: %.3f, Val Score: %.3f'%(model_name,train_score,test_score))
        
        
model_cv(X_train_s, y_train, estimators, cv)

In [None]:
# Predict and Blend
def model_blend(X_train, y_train, X_test, estimators):
    mean_prob = 0
    
    for est in estimators:
        est.fit(X_train, y_train)
        mean_prob += est.predict_proba(X_test)[:,1]
        
    return mean_prob/len(estimators) # return the average

meta_df = pd.DataFrame(columns=['image_name', 'target'])
meta_df['image_name'] = sample['image_name']
# predict and add to meta df
meta_df['target'] = model_blend(X_train_s, y_train, X_test_s, estimators)

meta_df.to_csv('meta_pred.csv', header=True, index=False)
meta_df.head()

<a id='Ensemble'></a>
# 8. Ensemble Model

Here we assign the meta-data prediction a weight of 0.1 and give the remaining 0.9 to the EffNets. The EffNet ensemble loaded below was built outside of this notebook and consists of 6 models from EffNets B2, B4 and B6 trained on 384 and 512 images. Its predictions have already been given equal weights of 0.15 each (0.15 x 6 = 0.9).

In [None]:
submission_effnet_ensemble = pd.read_csv('../input/effnet-ensemble/submission_effnet_ensemble.csv')
submission.target = (submission_effnet_ensemble.target) + (meta_df.target * 0.1)
submission.to_csv('submission.csv', index=False)
submission.head()
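
The blend above adds the two `target` columns row by row (pandas matches them on the index), which is only correct if both frames list the images in the same order. Below is a hedged sketch that matches on `image_name` instead, assuming both files carry that column. The helper `blend_by_name` and the toy frames are illustrative and not part of the original notebook:

```python
import pandas as pd

def blend_by_name(effnet_df, meta_df, meta_weight=0.1):
    """Add weighted meta-data predictions to EffNet predictions, matched on image_name."""
    merged = effnet_df.merge(meta_df, on='image_name', how='left',
                             suffixes=('_effnet', '_meta'),
                             validate='one_to_one')  # fail loudly on duplicate names
    # the EffNet predictions are assumed to be pre-weighted already (0.9 in total)
    merged['target'] = merged['target_effnet'] + meta_weight * merged['target_meta']
    return merged[['image_name', 'target']]

# Toy frames with different row orders (values are made up)
effnet_toy = pd.DataFrame({'image_name': ['a', 'b', 'c'], 'target': [0.09, 0.45, 0.18]})
meta_toy = pd.DataFrame({'image_name': ['c', 'a', 'b'], 'target': [0.2, 0.1, 0.5]})

blended = blend_by_name(effnet_toy, meta_toy)
print(blended)
```

In the notebook this would be called as `blend_by_name(submission_effnet_ensemble, meta_df)`; if the row orders already agree, the result is identical to the direct sum.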

<a id='Conclusion'></a>
# 9. Conclusion

In conclusion, our model performs reasonably well based on the AUROC score, but the AUPRC score shows there is still vast room for improvement. In the diagnosis of melanoma (classifying lesions as malignant or benign), we usually want recall/sensitivity to be high, since we want to identify all true malignant cases. For our model, setting a low threshold to get a high recall makes precision suffer greatly, which means that during melanoma screening many people would receive a false diagnosis of melanoma (many false positives). Therefore, the current model is still not great for deployment.
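
The recall versus precision trade-off described above can be inspected directly by picking the highest threshold that still reaches a target recall. This is a minimal sketch on synthetic scores; the helper name `threshold_for_recall` and the toy arrays are assumptions made for illustration:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_score, min_recall=0.9):
    """Return (threshold, precision, recall) for the highest threshold reaching min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the final point
    precision, recall = precision[:-1], recall[:-1]
    ok = np.where(recall >= min_recall)[0]
    # thresholds increase while recall decreases, so the last qualifying
    # index corresponds to the highest usable threshold
    best = ok[-1]
    return thresholds[best], precision[best], recall[best]

# Synthetic labels and scores (3 positives among 10 samples)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
s_toy = np.array([0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

thr, prec, rec = threshold_for_recall(y_toy, s_toy, min_recall=1.0)
print('threshold = %.2f, precision = %.2f, recall = %.2f' % (thr, prec, rec))
```

On the out-of-fold predictions of this notebook, the equivalent call would be `threshold_for_recall(df_oof.target, df_oof.pred)`, which shows how much precision is given up for a chosen recall.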

**Below are some of the difficulties that I faced for this dataset:**

* **Data class imbalance:** This is quite prevalent in the real world and, as data scientists, we should all learn to embrace it and find ways to get around this problem. Usually, one way is to simply collect more data for the under-represented class. However, this is not possible for this dataset, and healthcare data in particular are very difficult to collect and share. In reality, malignant cases will also be less prevalent than benign cases. In this notebook, I used stratified sampling to get around this problem, but this does not totally solve the issue. Another way might be to apply a weighted loss that gives more weight to the loss on malignant images. However, I tried it and it seemed to hinder performance for this dataset. Here is a [post](https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#class_weights) on some ways to deal with imbalanced datasets.

* **Overfitting:** This dataset is prone to overfitting, especially if one uses a quickly increasing learning rate. Therefore, I only ran the model for ~8 epochs, as it starts to overfit after that.


**What I learnt:**
* **Learning rate scheduling using** [**One fit cycle:**](https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html#OneCycleLR) The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper [Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates](https://arxiv.org/abs/1708.07120). Refer to this [post](https://sgugger.github.io/the-1cycle-policy.html) to get a better understanding.
* **EfficientNets:** Who would have thought that such a simple scaling of the 3 dimensions of a CNN can lead to such efficiency in model training!
* **Using TPUs:** TPUs are really efficient for training large neural networks and data compared to CPU and GPU. The future of deep learning and AI is becoming more impressive.   
* **Test Time Augmentation:** TTA provides a way to reduce the errors in the predictions and is a really useful technique. Combined with TPU, it provides the opportunity to train and evaluate models with a lot more diversity in the data.
* **Label Smoothing:** A useful regularization technique for preventing the model from predicting too confidently by modifying hard labels into soft labels, which reduces the gap between the largest logit and the rest.
* **Kfold Cross Validation in Neural Networks:** For my past deep learning projects, I have always trained using separate train and validation sets. K-fold CV provides a more robust way to evaluate the performance of the model, and it has the advantage that every image is used for both training and validation.  
* **Using Ensemble models from different data:** Combining predictions from CNNs trained on images with predictions from models trained on metadata is a great way to improve the performance of the model.  
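
Two of the ideas in the list above, the one-cycle learning rate policy and label smoothing, are short enough to sketch in plain Python. This is an illustration only: it is not the `get_lr_callback` schedule used earlier in this notebook, and the function names and default values (`lr_start`, `lr_max`, `lr_min`, `pct_up`, `eps`) are assumptions chosen for the example.

```python
import math
import numpy as np

def one_cycle_lr(step, total_steps, lr_start=1e-4, lr_max=1e-3, lr_min=1e-6, pct_up=0.3):
    """Learning rate at `step` under a cosine-annealed one-cycle schedule."""
    up_steps = int(total_steps * pct_up)
    if step < up_steps:
        # warm-up phase: lr_start -> lr_max
        progress = step / up_steps
        return lr_start + (lr_max - lr_start) * 0.5 * (1 - math.cos(math.pi * progress))
    # annealing phase: lr_max -> lr_min (much lower than lr_start)
    progress = (step - up_steps) / max(1, total_steps - up_steps - 1)
    return lr_min + (lr_max - lr_min) * 0.5 * (1 + math.cos(math.pi * progress))

def smooth_binary_labels(y, eps=0.05):
    """Turn hard labels {0, 1} into soft labels {eps/2, 1 - eps/2}."""
    return y * (1 - eps) + 0.5 * eps

schedule = [one_cycle_lr(s, total_steps=10) for s in range(10)]
soft = smooth_binary_labels(np.array([0.0, 1.0]))

print(['%.6f' % lr for lr in schedule])
print(soft)
```

The schedule starts at `lr_start`, peaks at `lr_max` after the warm-up phase and ends at `lr_min`, matching the description of the 1cycle policy in the list. The smoothing formula pulls both labels towards 0.5 by `eps/2`.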

<font size="+3" color="steelblue"><b>My other works</b></font><br>

<div class="row">
    
  <div class="col-sm-4">
    <div class="card">
      <div class="card-body" style="width: 20rem;">
         <h5 class="card-title"><u>Pneumonia Detection with PyTorch</u></h5>
         <img style='height:170' src="https://raw.githubusercontent.com/teyang-lau/Pneumonia_Detection/master/Pictures/train_grid.png" class="card-img-top" alt="..."><br>
         <p class="card-text">Pneumonia Detection using Transfer Learning via ResNet in PyTorch</p>
         <a href="https://www.kaggle.com/teyang/pneumonia-detection-resnets-pytorch" class="btn btn-primary" style="color:white;">Go to Post</a>
      </div>
    </div>
  </div>   
    
  <div class="col-sm-4">
    <div class="card">
      <div class="card-body" style="width: 20rem;" style='background:red'>
        <h5 class="card-title"><u>Neural Style Transfer</u></h5> <br>
        <img style='width:300px' src="https://19qfnq.bn.files.1drv.com/y4moRovvJtBA5ccvsFv6iZadlO2ZxIOTXTof4On35sxnsKBLgN6bLHpfYsdl9CtqT4H8pcM_UjRNLqqtWKxf3FMhvAzdOg-I4-6UbA35rI6UMoHkCLxkpvc6iy-mqm_AoOdsWsbDy1ka8ek8a7kzHWjd-boIug4FyjCnAOtv4ipgLjP4XCXRgmv0qK1rKvB_I774lTJDqeQl_0v6jgZM3m-dA?width=829&height=469&cropmode=none" class="card-img-top" alt="..."><br>
        <p class="card-text">Neural Style Transfer Using VGG19 to Create Artistic Pictures.</p>
        <a href="https://www.kaggle.com/teyang/neural-style-transfer-for-unique-artistic-photos" class="btn btn-primary" style="color:white;">Go to Post</a>
      </div>
    </div>
  </div>
    
  <div class="col-sm-4">
    <div class="card">
      <div class="card-body" style="width: 20rem;">
         <h5 class="card-title"><u>Covid-19 & Google Trends</u></h5>
         <img style='height:135px' src="https://miro.medium.com/max/821/1*Fi6masemXJT3Q8YWekQCDQ.png" class="card-img-top" alt="..."><br>
         <p class="card-text">Covid-19-Google Trend Analysis and Data Visualization</p>
         <a href="https://www.kaggle.com/teyang/covid-19-google-trends-auto-arima-forecasting" class="btn btn-primary" style="color:white;">Go to Post</a>
      </div>
    </div>
  </div>  