# 1. OVERVIEW

The goal of this notebook is to demonstrate a technique that can improve classification performance by learning from both training and test data. This is done by pre-training a model on the complete data set. The approach can reduce the impact of sampling bias by exposing the model to the test data, and it lets the model benefit from a larger sample size while learning. We will demonstrate how this simple technique yields an AUC improvement on CV and private LB.

So, how can we make use of the test sample? The labels are only observed on the training data. Luckily, this competition also provides meta-data for each training and test image. What we can do is the following:
1. Pre-train a model on the complete train+test data using one of the meta-features as a surrogate label.
2. Initialize from the pre-trained weights when training a final melanoma classification model.

The intuition behind this approach is that by learning to classify images according to one of the meta variables such as `sex` or `age_approx`, the model can learn some of the visual features that might be useful for malignant lesion classification. For instance, the size of lesions and the color of the skin can be helpful in determining both patient age and lesion type. Exposing the model to the test data also allows it to take a sneak peek at test images, which may help it learn patterns that are more prevalent in the test distribution.

In this notebook, we will train a model to classify `anatom_site_general_challenge` on both training and test data and store the pre-trained weights of the backbone. Next, we will build a melanoma classification model that initializes from the pre-trained weights and check its performance on CV and on the leaderboard.

P.S. The notebook heavily relies on the [great pipeline](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords) developed by [Chris Deotte](https://www.kaggle.com/cdeotte) and reuses much of his original code. I know that many teams have been using this pipeline to train their models, so relying on it here should be familiar to you. Please allow me to thank Chris and give him credit for his hard work! Kindly refer to his notebook for general questions on the modeling pipeline, where he provides extensive comments and documentation.

# 2. INITIALIZATION

In [None]:
!pip install -q efficientnet >> /dev/null

In [None]:
import pandas as pd, numpy as np
from kaggle_datasets import KaggleDatasets
import tensorflow as tf, re, math
import tensorflow.keras.backend as K
import efficientnet.tfkeras as efn
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from scipy.stats import rankdata
import PIL, cv2

In addition to other training parameters, we introduce the `USE_PRETRAIN_WEIGHTS` variable, which controls whether we pre-train a model on the full data before training the final melanoma classification model.

For demonstration purposes, we use EfficientNet `B0` and `128x128` image size with no TTA and no external data from previous competitions. You can easily incorporate the external data by following Chris' notebook and experiment with larger architectures and image sizes.

In [None]:
# DEVICE
DEVICE = "TPU"

# USE DIFFERENT SEED FOR DIFFERENT STRATIFIED KFOLD
SEED = 42

# NUMBER OF FOLDS. USE 3, 5, OR 15 
FOLDS = 5

# WHICH IMAGE SIZES TO LOAD EACH FOLD
IMG_SIZES = [128]*FOLDS

# BATCH SIZE AND EPOCHS
BATCH_SIZES = [32]*FOLDS
EPOCHS      = [10]*FOLDS

# WHICH EFFICIENTNET TO USE
EFF_NETS = [0]*FOLDS

# WEIGHTS FOR FOLD MODELS WHEN PREDICTING TEST
WGTS = [1/FOLDS]*FOLDS

# PRETRAINED WEIGHTS
USE_PRETRAIN_WEIGHTS = True

In [None]:
# CONNECT TO DEVICE
if DEVICE == "TPU":
    print("connecting to TPU...")
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        print("Could not connect to TPU")
        tpu = None

    if tpu:
        try:
            print("initializing  TPU ...")
            tf.config.experimental_connect_to_cluster(tpu)
            tf.tpu.experimental.initialize_tpu_system(tpu)
            strategy = tf.distribute.experimental.TPUStrategy(tpu)
            print("TPU initialized")
        except Exception:
            print("failed to initialize TPU")
    else:
        DEVICE = "GPU"

if DEVICE != "TPU":
    print("Using default strategy for CPU and single GPU")
    strategy = tf.distribute.get_strategy()

if DEVICE == "GPU":
    print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

AUTO     = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')

# 3. IMAGE PROCESSING


In [None]:
# IMAGE PATHS
GCS_PATH = [None]*FOLDS

for i,k in enumerate(IMG_SIZES):
    GCS_PATH[i]  = KaggleDatasets().get_gcs_path('melanoma-%ix%i'%(k,k))
    
files_train = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')))
files_test  = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')))

The `read_labeled_tfrecord()` function is modified to provide two outputs: 

1. Image tensor.

2. Either `anatom_site_general_challenge` or `target` as a label. The former is one-hot-encoded since it is a categorical feature with six possible values. The selection of the label is controlled by the `pretraining` argument that is read from the `get_dataset()` function provided below. Setting `pretraining = True` implies reading `anatom_site_general_challenge` as a surrogate label.
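
To make the one-hot encoding of the surrogate label concrete, here is a minimal pure-Python sketch of what `tf.one_hot(index, 6)` produces for a single integer code. The helper `one_hot` is a hypothetical stand-in written for illustration, and the example index `2` is arbitrary; the notebook itself uses the TensorFlow function.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with 1.0 at position `index`, 0.0 elsewhere.

    Hypothetical stand-in for tf.one_hot on a scalar index. As with tf.one_hot,
    an index outside [0, depth) produces an all-zero vector.
    """
    return [1.0 if i == index else 0.0 for i in range(depth)]


N_SITES = 6                    # six categories, as stated in the text above
encoded = one_hot(2, N_SITES)  # arbitrary example code
out_of_range = one_hot(-1, N_SITES)

print(encoded)
print(out_of_range)
```

Each encoded label therefore has the same length as the six-node softmax output used during pre-training.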

In [None]:
def read_labeled_tfrecord(example, pretraining = False):
    if pretraining:
        tfrec_format = {
            'image'                        : tf.io.FixedLenFeature([], tf.string),
            'image_name'                   : tf.io.FixedLenFeature([], tf.string),
            'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
        }      
    else:
        tfrec_format = {
            'image'                        : tf.io.FixedLenFeature([], tf.string),
            'image_name'                   : tf.io.FixedLenFeature([], tf.string),
            'target'                       : tf.io.FixedLenFeature([], tf.int64)
        }   
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], tf.one_hot(example['anatom_site_general_challenge'], 6) if pretraining else example['target']


def read_unlabeled_tfrecord(example, return_image_name=True):
    tfrec_format = {
        'image'                        : tf.io.FixedLenFeature([], tf.string),
        'image_name'                   : tf.io.FixedLenFeature([], tf.string),
    }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['image_name'] if return_image_name else 0

 
def prepare_image(img, dim = 256):    
    img = tf.image.decode_jpeg(img, channels = 3)
    img = tf.cast(img, tf.float32) / 255.0
    img = img * circle_mask
    img = tf.reshape(img, [dim,dim, 3])
            
    return img

def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) 
         for filename in filenames]
    return np.sum(n)

In [None]:
def get_dataset(files, 
                shuffle            = False, 
                repeat             = False, 
                labeled            = True, 
                pretraining        = False,
                return_image_names = True, 
                batch_size         = 16, 
                dim                = 256):
    
    ds = tf.data.TFRecordDataset(files, num_parallel_reads = AUTO)
    ds = ds.cache()
    
    if repeat:
        ds = ds.repeat()
    
    if shuffle: 
        ds = ds.shuffle(1024*2) #if too large causes OOM in GPU CPU
        opt = tf.data.Options()
        opt.experimental_deterministic = False
        ds = ds.with_options(opt)
        
    if labeled: 
        ds = ds.map(lambda example: read_labeled_tfrecord(example, pretraining), 
                    num_parallel_calls = AUTO)
    else:
        ds = ds.map(lambda example: read_unlabeled_tfrecord(example, return_image_names), 
                    num_parallel_calls = AUTO)
    
    ds = ds.map(lambda img, imgname_or_label: (
                prepare_image(img, dim = dim), 
                imgname_or_label), 
                num_parallel_calls = AUTO)
    
    ds = ds.batch(batch_size * REPLICAS)
    ds = ds.prefetch(AUTO)
    return ds

We also use a circular crop (a.k.a. [microscope augmentation](https://www.kaggle.com/c/siim-isic-melanoma-classification/discussion/159476)) to improve image consistency. The snippet below creates a circular mask, which is applied in the `prepare_image()` function.

In [None]:
# CIRCLE CROP PREPARATIONS
circle_img  = np.zeros((IMG_SIZES[0], IMG_SIZES[0]), np.uint8)
circle_img  = cv2.circle(circle_img, (int(IMG_SIZES[0]/2), int(IMG_SIZES[0]/2)), int(IMG_SIZES[0]/2), 1, thickness = -1)
circle_img  = np.repeat(circle_img[:, :, np.newaxis], 3, axis = 2)
circle_mask = tf.cast(circle_img, tf.float32)
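
For intuition about what the mask looks like, the sketch below builds a comparable binary mask in pure Python on a tiny `8x8` grid and prints it as text. The helper name `make_circle_mask` and the grid size are assumptions made for illustration; `cv2.circle` may rasterize the boundary pixels slightly differently.

```python
def make_circle_mask(size):
    """Binary mask with 1 inside a filled circle centred in a size x size grid.

    Mirrors the parameters used above: centre (size/2, size/2), radius size/2.
    """
    center = size // 2
    radius = size // 2
    return [
        [1 if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else 0
         for x in range(size)]
        for y in range(size)
    ]


mask = make_circle_mask(8)

# Print the mask: '#' marks pixels that are kept, '.' marks pixels set to zero.
for row in mask:
    print("".join("#" if value else "." for value in row))
```

Multiplying an image by such a mask zeroes out the corners and keeps the central disc, which is what `img * circle_mask` does in `prepare_image()`.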

In [None]:
# LOAD DATA AND APPLY AUGMENTATIONS
def show_dataset(thumb_size, cols, rows, ds):
    mosaic = PIL.Image.new(mode='RGB', size=(thumb_size*cols + (cols-1), 
                                             thumb_size*rows + (rows-1)))
    for idx, data in enumerate(iter(ds)):
        img, target_or_imgid = data
        ix  = idx % cols
        iy  = idx // cols
        img = np.clip(img.numpy() * 255, 0, 255).astype(np.uint8)
        img = PIL.Image.fromarray(img)
        img = img.resize((thumb_size, thumb_size), resample = PIL.Image.BILINEAR)
        mosaic.paste(img, (ix*thumb_size + ix, 
                           iy*thumb_size + iy))
        nn = target_or_imgid.numpy().decode("utf-8")

    display(mosaic)
    return nn

files_train = tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')
ds = tf.data.TFRecordDataset(files_train, num_parallel_reads = AUTO).shuffle(1024)
ds = ds.take(10).cache()
ds = ds.map(read_unlabeled_tfrecord, num_parallel_calls = AUTO)
ds = ds.map(lambda img, target: (prepare_image(img, dim = IMG_SIZES[0]),
                                 target), num_parallel_calls = AUTO)
ds = ds.take(12*5)
ds = ds.prefetch(AUTO)

# DISPLAY IMAGES
name = show_dataset(128, 5, 2, ds)

# 4. PRE-TRAINED MODEL

The `build_model()` function incorporates three important features that depend on the training regime:
    
1. When building a model for pre-training, we use `CategoricalCrossentropy` as a loss because `anatom_site_general_challenge` is a categorical variable. When building a model that classifies lesions as benign/malignant, we use `BinaryCrossentropy` as a loss.

2. When training a final binary classification model, we load the saved pre-trained weights by using `base.load_weights('base_weights.h5')` if `use_pretrain_weights == True`.

3. We use a dense layer with six output nodes and softmax activation when doing pre-training and a dense layer with a single output node and sigmoid activation when training a final model.

In [None]:
EFNS = [efn.EfficientNetB0, efn.EfficientNetB1, efn.EfficientNetB2, efn.EfficientNetB3, 
        efn.EfficientNetB4, efn.EfficientNetB5, efn.EfficientNetB6, efn.EfficientNetB7]

def build_model(dim = 256, ef = 0, pretraining = False, use_pretrain_weights = False):
    
    # base
    inp  = tf.keras.layers.Input(shape = (dim,dim,3))
    base = EFNS[ef](input_shape = (dim,dim,3), weights = 'imagenet', include_top = False)
    
    # base weights
    if use_pretrain_weights:
        base.load_weights('base_weights.h5')
    
    x = base(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    
    if pretraining:
        x     = tf.keras.layers.Dense(6, activation = 'softmax')(x)
        model = tf.keras.Model(inputs = inp, outputs = x)
        opt   = tf.keras.optimizers.Adam(learning_rate = 0.001)
        loss  = tf.keras.losses.CategoricalCrossentropy()    
        model.compile(optimizer = opt, loss = loss)
    else:
        x     = tf.keras.layers.Dense(1, activation = 'sigmoid')(x)
        model = tf.keras.Model(inputs = inp, outputs = x)
        opt   = tf.keras.optimizers.Adam(learning_rate = 0.001)
        loss  = tf.keras.losses.BinaryCrossentropy(label_smoothing = 0.01)  
        model.compile(optimizer = opt, loss = loss, metrics = ['AUC'])
    
    return model

In [None]:
def get_lr_callback(batch_size=8):
    
    lr_start   = 0.000005
    lr_max     = 0.00000125 * REPLICAS * batch_size
    lr_min     = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep  = 0
    lr_decay   = 0.8
   
    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
            
        elif epoch < lr_ramp_ep + lr_sus_ep:
            lr = lr_max
            
        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min
            
        return lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback
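
To see the shape of this schedule, the following sketch re-implements `lrfn` in plain Python and prints the learning rate for ten epochs. The values `batch_size = 32` and `replicas = 8` are assumptions (eight replicas is typical for a TPU v3-8; a single GPU would give one replica and a proportionally lower peak).

```python
def make_lrfn(batch_size, replicas):
    """Return the epoch -> learning-rate function and its peak value."""
    lr_start = 0.000005
    lr_max = 0.00000125 * replicas * batch_size
    lr_min = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep = 0
    lr_decay = 0.8

    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            # Linear warm-up from lr_start towards lr_max.
            return (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
        if epoch < lr_ramp_ep + lr_sus_ep:
            # Optional plateau at the peak (empty here because lr_sus_ep = 0).
            return lr_max
        # Exponential decay from lr_max towards lr_min.
        return (lr_max - lr_min) * lr_decay ** (epoch - lr_ramp_ep - lr_sus_ep) + lr_min

    return lrfn, lr_max


lrfn, lr_max = make_lrfn(batch_size=32, replicas=8)  # assumed values
schedule = [lrfn(epoch) for epoch in range(10)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

With these settings, the rate ramps up linearly over the first five epochs, peaks at epoch 5, and decays by a factor of `0.8` per epoch afterwards.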

The pre-trained model is trained on both training and test data. Here, we use the original training data merged with the complete test set as a training sample. We fix the number of training epochs to `EPOCHS` and do not perform early stopping. You can also experiment with setting up a small validation sample from both training and test data to perform early stopping.

Here, we don't produce any predictions by the pre-trained model since we will only utilize it to extract the weights. 

In [None]:
### TRAIN MODEL
if USE_PRETRAIN_WEIGHTS:

    # USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
    VERBOSE = 2

    # DISPLAY INFO
    if DEVICE == 'TPU':
        if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)
    print('#### Image Size %i with EN B%i and batch_size %i'%
          (IMG_SIZES[0],EFF_NETS[0],BATCH_SIZES[0]*REPLICAS))

    # COLLECT TRAIN AND TEST FILES FOR PRE-TRAINING
    files_train = tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')
    print('#### Using 2020 train data')
    files_train += tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')
    print('#### Using 2020 test data')
    np.random.shuffle(files_train)

    # BUILD MODEL
    K.clear_session()
    tf.random.set_seed(SEED)
    with strategy.scope():
        model = build_model(dim         = IMG_SIZES[0],
                            ef          = EFF_NETS[0], 
                            pretraining = True)

    # SAVE WEIGHTS WITH THE LOWEST TRAINING LOSS
    sv = tf.keras.callbacks.ModelCheckpoint(
        'weights.h5', monitor='loss', verbose=0, save_best_only=True,
        save_weights_only=True, mode='min', save_freq='epoch')

    # TRAIN
    print('Training...')
    history = model.fit(
        get_dataset(files_train, 
                    dim         = IMG_SIZES[0], 
                    batch_size  = BATCH_SIZES[0],
                    shuffle     = True, 
                    repeat      = True, 
                    pretraining = True), 
        epochs          = EPOCHS[0], 
        callbacks       = [sv, get_lr_callback(BATCH_SIZES[0])], 
        steps_per_epoch = count_data_items(files_train)/BATCH_SIZES[0]//REPLICAS,
        verbose = VERBOSE)
    
else:
    
    print('#### NOT using a pre-trained model')
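
The `steps_per_epoch` argument passed to `model.fit()` is derived from the image counts embedded in the TFRecord file names. The sketch below works through that arithmetic in plain Python. The file names, batch size, and replica count are hypothetical values chosen for illustration, not the actual contents of the dataset.

```python
import re


def count_data_items(filenames):
    """Sum the image counts encoded between '-' and '.' in each file name."""
    pattern = re.compile(r"-([0-9]*)\.")
    return sum(int(pattern.search(name).group(1)) for name in filenames)


# Hypothetical file names following the "<split><index>-<count>.tfrec" pattern.
files = ["train00-2071.tfrec", "train01-2116.tfrec", "test00-687.tfrec"]

n_items = count_data_items(files)
batch_size = 32   # assumed per-replica batch size
replicas = 8      # assumed number of replicas

# Same expression as in the training cell; note that the result is a float.
steps_per_epoch = n_items / batch_size // replicas

print(n_items, steps_per_epoch, int(steps_per_epoch))
```

Because the dataset is batched with `batch_size * REPLICAS` images per step, this expression is the number of full global batches that fit into one pass over the data.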

The pre-training is complete! Now, we need to re-save the weights of our pre-trained model to make them easier to load later. We are not really interested in the classification head, so we only export the weights of the convolutional part of the network. Since `model.layers[0]` is the input layer, the EfficientNet backbone can be accessed as `model.layers[1]`.

In [None]:
# LOAD WEIGHTS AND CHECK MODEL
if USE_PRETRAIN_WEIGHTS:
    model.load_weights('weights.h5')
    model.summary()

In [None]:
# EXPORT BASE WEIGHTS
if USE_PRETRAIN_WEIGHTS:
    model.layers[1].save_weights('base_weights.h5')

# 5. FINAL MODEL

Now we can train a final classification model using a regular cross-validation framework on the training data! 

We need to take care of a few changes:
1. Make sure that we don't use test data in the training folds anymore.
2. Run the model on all fold combinations.
3. Set `use_pretrain_weights = True` and `pretraining = False` in the `build_model()` function to initialize from the pre-trained weights in the beginning of each fold.
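
The fold loop splits the indices of the 15 training TFRecord files rather than individual images, and then turns each index into a file pattern. Here is a small sketch of that pattern construction. The bucket path and the choice of validation indices are hypothetical; in the notebook the indices come from `KFold` and the path from `KaggleDatasets`.

```python
GCS_ROOT = "gs://hypothetical-bucket"  # placeholder, not a real path
N_FILES = 15                           # the loop splits np.arange(15)

# Hypothetical validation indices for one fold; the rest go to training.
idxV = [0, 5, 10]
idxT = [i for i in range(N_FILES) if i not in idxV]

# '%.2i' zero-pads the index to two digits, e.g. 5 -> '05'.
train_patterns = [GCS_ROOT + '/train%.2i*.tfrec' % x for x in idxT]
valid_patterns = [GCS_ROOT + '/train%.2i*.tfrec' % x for x in idxV]

print(valid_patterns)
print(len(train_patterns), "training patterns")
```

Because every pattern starts with `train`, no test file can match, which is how the first point in the list above is enforced.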

In [None]:
# USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
VERBOSE = 0

skf = KFold(n_splits = FOLDS, shuffle = True, random_state = SEED)
oof_pred = []; oof_tar = []; oof_val = []; oof_names = []; oof_folds = []
preds = np.zeros((count_data_items(files_test),1))

for fold,(idxT,idxV) in enumerate(skf.split(np.arange(15))):
    
    # DISPLAY FOLD INFO
    if DEVICE == 'TPU':
        if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)
    print('#'*25); print('#### FOLD',fold+1)
    print('#### Image Size %i with EfficientNet B%i and batch_size %i'%
          (IMG_SIZES[fold],EFF_NETS[fold],BATCH_SIZES[fold]*REPLICAS))
    
    # CREATE TRAIN AND VALIDATION SUBSETS
    files_train = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxT])      
    print('#### Using 2020 train data')
    np.random.shuffle(files_train); print('#'*25)
    
    files_valid = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxV])
    files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[fold] + '/test*.tfrec')))
    
    # BUILD MODEL
    K.clear_session()
    tf.random.set_seed(SEED)
    with strategy.scope():
        model = build_model(dim                  = IMG_SIZES[fold],
                            ef                   = EFF_NETS[fold],
                            use_pretrain_weights = USE_PRETRAIN_WEIGHTS, 
                            pretraining          = False)
        
    # SAVE BEST MODEL EACH FOLD
    sv = tf.keras.callbacks.ModelCheckpoint(
        'fold-%i.h5'%fold, monitor='val_auc', verbose=0, save_best_only=True,
        save_weights_only=True, mode='max', save_freq='epoch')
   
    # TRAIN
    print('Training...')
    history = model.fit(
        get_dataset(files_train, 
                    shuffle    = True, 
                    repeat     = True, 
                    dim        = IMG_SIZES[fold], 
                    batch_size = BATCH_SIZES[fold]), 
        epochs = EPOCHS[fold], 
        callbacks = [sv,get_lr_callback(BATCH_SIZES[fold])], 
        steps_per_epoch = count_data_items(files_train)/BATCH_SIZES[fold]//REPLICAS,
        validation_data = get_dataset(files_valid,
                                      shuffle = False,
                                      repeat  = False, 
                                      dim     = IMG_SIZES[fold]),
        verbose = VERBOSE
    )
    print('Loading best model...')
    model.load_weights('fold-%i.h5'%fold)
    
    # PREDICT OOF
    print('Predicting OOF...')
    ds_valid = get_dataset(files_valid,labeled=False,return_image_names=False,shuffle=False,dim=IMG_SIZES[fold],batch_size=BATCH_SIZES[fold]*4)
    ct_valid = count_data_items(files_valid); STEPS = ct_valid/BATCH_SIZES[fold]/4/REPLICAS
    pred     = model.predict(ds_valid,steps=STEPS,verbose=VERBOSE)[:ct_valid,] 
    oof_pred.append(pred)      

    # GET OOF TARGETS AND NAMES
    ds_valid = get_dataset(files_valid,dim=IMG_SIZES[fold],labeled=True, return_image_names=True)
    oof_tar.append(np.array([target.numpy() for img, target in iter(ds_valid.unbatch())]) )
    oof_folds.append(np.ones_like(oof_tar[-1],dtype='int8')*fold )
    ds = get_dataset(files_valid,dim=IMG_SIZES[fold],labeled=False,return_image_names=True)
    oof_names.append(np.array([img_name.numpy().decode("utf-8") for img, img_name in iter(ds.unbatch())]))
    
    # PREDICT TEST
    print('Predicting Test...')
    ds_test     = get_dataset(files_test,labeled=False,return_image_names=False,shuffle=False,dim=IMG_SIZES[fold],batch_size=BATCH_SIZES[fold]*4)
    ct_test     = count_data_items(files_test); STEPS = ct_test/BATCH_SIZES[fold]/4/REPLICAS
    pred        = model.predict(ds_test,steps=STEPS,verbose=VERBOSE)[:ct_test,]
    preds[:,0] += (pred * WGTS[fold]).reshape(-1)

    # REPORT RESULTS
    auc = roc_auc_score(oof_tar[-1],oof_pred[-1])
    print('#### FOLD %i OOF AUC = %.4f'%(fold+1,auc))
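
The test predictions in the loop above are accumulated as a weighted average over folds: each fold adds its predictions multiplied by `WGTS[fold]`. The sketch below reproduces that accumulation with plain lists. The three folds and the prediction values are made up for illustration.

```python
FOLDS = 3                       # assumed smaller than the notebook's 5 for brevity
WGTS = [1 / FOLDS] * FOLDS      # equal weights, as in the configuration cell

# Hypothetical per-fold predictions for two test images.
fold_preds = [
    [0.10, 0.80],
    [0.20, 0.70],
    [0.30, 0.90],
]

preds = [0.0, 0.0]
for fold in range(FOLDS):
    for i, p in enumerate(fold_preds[fold]):
        # Same idea as: preds[:,0] += (pred * WGTS[fold]).reshape(-1)
        preds[i] += p * WGTS[fold]

print([round(p, 4) for p in preds])
```

With equal weights that sum to one, the result is simply the mean prediction per image across folds.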

How does the OOF AUC compare to a model without the pre-training stage? To check this, we can simply set `USE_PRETRAIN_WEIGHTS = False` at the beginning of the notebook. This is done [in the previous version](https://www.kaggle.com/kozodoi/pre-training-on-full-data-with-surrogate-labels?scriptVersionId=41201266) of this notebook and yields a model with a lower OOF AUC.

Compared to a model initialized from the ImageNet weights, pre-training on a surrogate label brings a CV improvement, which also translates into an AUC gain on public and private LB. Great news!

In [None]:
# COMPUTE OVERALL OOF AUC
oof      = np.concatenate(oof_pred);  true  = np.concatenate(oof_tar);
names    = np.concatenate(oof_names); folds = np.concatenate(oof_folds)
auc      = roc_auc_score(true,oof)
print('Overall OOF AUC = %.4f'%auc)

# SAVE OOF TO DISK
df_oof = pd.DataFrame(dict(image_name = names, target = true, pred = oof.reshape(-1), fold = folds))
df_oof.to_csv('oof.csv', index = False)
df_oof.head()
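
As a reminder of what the AUC reported above measures, the sketch below computes it directly from its pairwise definition: the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The helper `pairwise_auc` and the toy labels and scores are assumptions for illustration; the notebook relies on `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: two benign (0) and two malignant (1) images.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(y_true, y_score)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This quadratic-time version is only meant for small examples; it also shows why AUC depends on the ranking of predictions rather than on their absolute values.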

In [None]:
# CREATE SUBMISSION
ds = get_dataset(files_test, 
                 dim                = IMG_SIZES[fold],
                 labeled            = False, 
                 return_image_names = True)

image_names = np.array([img_name.numpy().decode("utf-8") for img, img_name in iter(ds.unbatch())])

submission = pd.DataFrame(dict(image_name = image_names, target = preds[:,0]))
submission = submission.sort_values('image_name') 
submission.to_csv('submission.csv', index = False)
submission.head()

# 6. CONCLUSIONS

This is the end of this notebook. We demonstrated how to use meta-data to construct a surrogate label and pre-train a CNN model on both training and test data. This technique improved the resulting performance on both CV and LB. 

The pre-trained model can be further optimized to increase performance gains. Using a validation subset on the pre-training stage can help to tune the number of epochs and other learning parameters. Another idea could be to construct a surrogate label with more unique values (e.g., combination of `anatom_site_general_challenge` and `sex`) to make the pre-training task more challenging and motivate the model to learn better. On the other hand, further optimizing the final classification model may reduce the benefit of pre-training. I will leave these options to those who are interested to experiment :)
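
One possible way to build such a combined surrogate label is to map each (site, sex) pair to a single class index. The sketch below assumes that `anatom_site_general_challenge` is coded as an integer in `0..5` and `sex` as `0` or `1`, with no missing values; the actual encoding in the TFRecords may differ, so treat the helper names and constants as hypothetical.

```python
N_SITES = 6   # categories of anatom_site_general_challenge, as used above
N_SEX = 2     # assumed binary coding of sex


def combine_labels(site, sex):
    """Map a (site, sex) pair to one class index in range(N_SITES * N_SEX)."""
    return site * N_SEX + sex


def split_label(label):
    """Inverse of combine_labels: recover (site, sex) from the class index."""
    return divmod(label, N_SEX)


all_labels = [combine_labels(site, sex)
              for site in range(N_SITES)
              for sex in range(N_SEX)]

print(all_labels)
print(split_label(combine_labels(3, 1)))
```

With this encoding, the pre-training head would need `N_SITES * N_SEX = 12` output nodes instead of six.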

Please don't hesitate to ask questions in the comments section if something is not clear. Happy Kaggling!