# Pipeline for dual input CNN
#### This is a simple, modular notebook, so you can pick out pieces of code and adapt them to your own requirements.

#### The code is written so that you can try out many different things by changing only a few lines.

#### You can also build different ensembles from here to boost your score.

You can control which image sizes are loaded, which EfficientNets are used, and whether external data is used. You can experiment with different data augmentation, model architectures, losses, optimizers, and learning-rate schedules. The TFRecords contain metadata, so you can feed that into your CNN too.

# Kaggle's SIIM-ISIC Melanoma Classification
In this competition, we need to identify melanoma in images of skin lesions. Full description [here][1]. This is a very challenging image classification task, as the sample images below show. Can you recognize the differences between the images? Below are examples of skin images with and without melanoma.

[1]: https://www.kaggle.com/c/siim-isic-melanoma-classification




Special thanks to @cdeotte for the motivation behind this notebook. I have implemented many of his ideas!

In [None]:
import cv2, pandas as pd, matplotlib.pyplot as plt
train = pd.read_csv('../input/siim-isic-melanoma-classification/train.csv')
print('Examples WITH Melanoma')
imgs = train.loc[train.target==1].sample(10).image_name.values
plt.figure(figsize=(10,4))
for i,k in enumerate(imgs):
    img = cv2.imread('../input/jpeg-melanoma-128x128/train/%s.jpg'%k)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # cv2.imread returns BGR; matplotlib expects RGB
    plt.subplot(2,5,i+1); plt.axis('off')
    plt.imshow(img)
plt.show()
print('Examples WITHOUT Melanoma')
imgs = train.loc[train.target==0].sample(10).image_name.values
plt.figure(figsize=(10,4))
for i,k in enumerate(imgs):
    img = cv2.imread('../input/jpeg-melanoma-128x128/train/%s.jpg'%k)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # cv2.imread returns BGR; matplotlib expects RGB
    plt.subplot(2,5,i+1); plt.axis('off')
    plt.imshow(img)
plt.show()

# The purpose of this notebook

*   Here, we write code that trains a model on both **images and metadata**. The metadata gives the model extra information about each image.

*   Such ***dual-input models*** give the model an idea of what it is looking at. (For example, a melanoma on the palm can look completely different from one on the head.)

*   We also use ***test-time augmentation*** to improve model performance.
*   Coarse dropout has been added to reduce overfitting.
*   The notebook is clearly divided into segments for easy understanding.
*   If you are a beginner, some things may not be clear on the first read. Take them one at a time, and keep googling if you can't understand something.
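
To make the test-time augmentation bullet concrete, here is a minimal sketch in plain Python rather than TensorFlow. The names `tta_predict`, `flip_augment`, and `left_pixel_model` are hypothetical stand-ins and not functions from this notebook. The idea is only that each test image is predicted several times under random augmentation and the predictions are averaged.

```python
import random

def tta_predict(predict_once, augment, image, n_tta=3, seed=0):
    """Average predictions over n_tta randomly augmented copies of one image."""
    rng = random.Random(seed)
    preds = [predict_once(augment(image, rng)) for _ in range(n_tta)]
    return sum(preds) / len(preds)

# Hypothetical stand-ins: an "image" is a list of pixel values,
# the augmentation is a random horizontal flip, and the "model"
# simply returns the left-most pixel, so its output depends on the flip.
def flip_augment(image, rng):
    return image[::-1] if rng.random() < 0.5 else image

def left_pixel_model(image):
    return image[0]

image = [0.2, 0.4, 0.9]
score = tta_predict(left_pixel_model, flip_augment, image, n_tta=3)
always_flipped = tta_predict(left_pixel_model, lambda im, rng: im[::-1], image)
print(score, always_flipped)
```

In the notebook the number of passes is controlled by `TTA`; more passes smooth out the randomness of the augmentation at the cost of extra inference time.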

# 1. Setup environment

In [None]:
colab=0
show_files=0
tstamp=0



if colab:
    from google.colab import drive
    drive.mount('/content/gdrive')

if (not colab) and show_files:
    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output

### Loading libraries

!pip install -q efficientnet

import math
import pytz
import random
import numpy as np
import pandas as pd
import re, os, gc
import tensorflow as tf
from pathlib import Path
from datetime import datetime
from scipy.stats import rankdata
import efficientnet.tfkeras as efn
from matplotlib import pyplot as plt
from sklearn.utils import class_weight
from sklearn.metrics import roc_auc_score

print("Tensorflow version " + tf.__version__)
AUTO = tf.data.experimental.AUTOTUNE

if not colab:
    from kaggle_datasets import KaggleDatasets

### Loading data

try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

# 2. Input

## C O N F I G U R A T I O N

In [None]:
# Configuration
NAME='EffNB6_512'
NFOLDS=5
NBEST=1 # the number of best models to use for predictions, can set as 1 for simplicity
SEED=311 # random seed
ef = 3   # Index into EFNS: selects EfficientNet-B{ef} (0 to 6)


#For Coarse dropout 
DROPOUT = True # Whether to use coarse dropout technique or not
droprate=0.5 # Between 0 and 1
dropct= 4 # May slow training if CT>16
dropsize=0.2 # between 0 and 1



TTA = 3 # Test Time Augmentation Steps


EPOCHS = 10
BATCH_SIZE = 8 * strategy.num_replicas_in_sync
IMAGE_SIZE = [384,384]

ANATOM = 1 # Whether or not to use anatom_site_general_challenge feature in metadata


dim = IMAGE_SIZE[0] #image dimensions
DIM = dim
# IMAGE_SIZE = [256 , 256]

# For tf.dataset
AUTO = tf.data.experimental.AUTOTUNE


# GCS_PATH = [None]*FOLDS; GCS_PATH2 = [None]*FOLDS
# for i,k in enumerate(IMG_SIZES):
k = IMAGE_SIZE[0]

#Competition data
GCS_PATH = KaggleDatasets().get_gcs_path('melanoma-%ix%i'%(k,k))  

#External Data from 2019 competition 
GCS_PATH2 = KaggleDatasets().get_gcs_path('isic2019-%ix%i'%(k,k)) 
GCS_PATH3 = KaggleDatasets().get_gcs_path('malignant-v2-%ix%i'%(k,k))
 

USE_EXT_DATA = True # use ext data or not in training

files_train = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/train*.tfrec')))
files_test  = np.sort(np.array(tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')))

## About Coarse dropout technique
* Coarse dropout is a data augmentation technique to prevent your model from overfitting. We randomly remove squares from training images. (Discussion [here][1]).
![dropout](http://playagricola.com/Kaggle/drop-7-24.jpg)
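
As a rough illustration of the idea, the sketch below removes random squares from a small image in plain Python. It uses a nested list as a grayscale stand-in for a `[dim, dim, 3]` tensor, and the function name `coarse_dropout` and its `seed` argument are assumptions made for this example. The TPU-friendly TensorFlow version used for training appears later in the notebook.

```python
import random

def coarse_dropout(image, probability=0.75, count=8, size=0.2, seed=None):
    """Zero out `count` random squares of side int(size * dim) in a square 2-D image.

    `image` is a list of rows; the input is left untouched and a copy is returned.
    """
    rng = random.Random(seed)
    dim = len(image)
    out = [row[:] for row in image]
    if rng.random() >= probability or count == 0 or size == 0:
        return out                      # dropout is skipped for this image
    width = int(size * dim)
    for _ in range(count):
        # Pick a random centre, then clip the square to the image borders.
        x, y = rng.randrange(dim), rng.randrange(dim)
        ya, yb = max(0, y - width // 2), min(dim, y + width // 2)
        xa, xb = max(0, x - width // 2), min(dim, x + width // 2)
        for r in range(ya, yb):
            for c in range(xa, xb):
                out[r][c] = 0.0
    return out

image = [[1.0] * 10 for _ in range(10)]
dropped = coarse_dropout(image, probability=1.0, count=4, size=0.2, seed=0)
removed = sum(v == 0.0 for row in dropped for v in row)
print(f"{removed} of 100 pixels removed")
```

Squares may overlap or be clipped at the border, so the number of removed pixels varies from image to image.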


In [None]:
# files_train

In [None]:
def append_path(pre):
    return np.vectorize(lambda file: os.path.join(GCS_PATH, pre, file))

sub = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')

train = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/train.csv')

import seaborn as sns
sns.countplot(x=train['target'])

### Creating train and validation sets

In [None]:
%%time
ALL_TRAIN=tf.io.gfile.glob(GCS_PATH + '/train*.tfrec')
ALL_TRAIN1 = tf.io.gfile.glob(GCS_PATH2 + '/train*.tfrec')
ALL_TRAIN2 = tf.io.gfile.glob(GCS_PATH3 + '/train*.tfrec')
ALL_TRAIN

In [None]:
## generating Validation files
tot_train = len(ALL_TRAIN)
skip = tot_train//NFOLDS
VAL_FNAMES={}
x = 0


for n in range(1, NFOLDS+1):
    VAL_FNAMES[f"fold_{n}"] = ALL_TRAIN[x : x+skip]
    x += skip

if USE_EXT_DATA :    
    ALL_TRAIN += ALL_TRAIN1
    ALL_TRAIN += ALL_TRAIN2

    
### Train files    
TRAIN_FNAMES={f'fold_{i}': list(set(ALL_TRAIN)-set(VAL_FNAMES[f'fold_{i}']))
              for i in range(1, NFOLDS+1)}


### Test files
TEST_FILENAMES = tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')

In [None]:
TRAIN_FNAMES

In [None]:

print(f"total folds : {len(TRAIN_FNAMES)} Total per fold : {len(TRAIN_FNAMES['fold_1'])}")
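
The fold construction above can be sketched with plain lists. The file names below are hypothetical (the real ones come from `tf.io.gfile.glob`), and `make_folds` is a helper written only for this illustration. It shows two properties of the scheme: external files are only ever used for training, and when the number of competition files is not divisible by `NFOLDS`, the leftover files never appear in any validation fold.

```python
def make_folds(competition_files, external_files, n_folds):
    """Return (train, val) dicts keyed 'fold_1'..'fold_n', mirroring the cell above."""
    skip = len(competition_files) // n_folds
    val = {f"fold_{n}": competition_files[(n - 1) * skip : n * skip]
           for n in range(1, n_folds + 1)}
    all_train = competition_files + external_files
    # sorted() is used here only to make the output deterministic.
    train = {name: sorted(set(all_train) - set(files)) for name, files in val.items()}
    return train, val

# Hypothetical file names for illustration.
comp = [f"train{i:02d}.tfrec" for i in range(16)]
ext = [f"ext{i:02d}.tfrec" for i in range(4)]
train_files, val_files = make_folds(comp, ext, n_folds=5)

print(val_files["fold_1"])
print(len(train_files["fold_1"]))
never_validated = set(comp) - {f for files in val_files.values() for f in files}
print(never_validated)
```

If every competition file should be validated exactly once, choose `NFOLDS` so that it divides the number of TFRecord files.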

Seed every random number generator so that cross-validation runs are reproducible.

In [None]:
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    tf.random.set_seed(seed)

In [None]:

if colab:
    PATH=Path('/content/gdrive/My Drive/kaggle/input/siim-isic-melanoma-classification/') 
    train=pd.read_csv(PATH/'train.csv.zip')
else:
    PATH=Path('/kaggle/input/siim-isic-melanoma-classification/')
    train=pd.read_csv(PATH/'train.csv')

test=pd.read_csv(PATH/'test.csv')
sub=pd.read_csv(PATH/'sample_submission.csv')

seed_everything(SEED)

# 3. TPU-friendly functions


In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment 
    # variable is set. On Kaggle this is always the case.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy() 

print("REPLICAS: ", strategy.num_replicas_in_sync)

In [None]:
def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files, 
    # i.e. test10-687.tfrec = 687 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    
    return np.sum(n)


def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    # convert image to floats in [0, 1] range
    image = tf.cast(image, tf.float32) / 255.0 
    # explicit size needed for TPU
    image = tf.reshape(image, [*IMAGE_SIZE, 3])
    
    return image

def read_labeled_tfrecord(example) :
    LABELED_TFREC_FORMAT = {
      'image':  tf.io.FixedLenFeature([], tf.string),
      'image_name':tf.io.FixedLenFeature([], tf.string),
      'patient_id': tf.io.FixedLenFeature([], tf.int64),
      'sex': tf.io.FixedLenFeature([], tf.int64),
      'age_approx': tf.io.FixedLenFeature([], tf.int64),
      'anatom_site_general_challenge': tf.io.FixedLenFeature ([], tf.int64),
#       'diagnosis': tf.io.FixedLenFeature([], tf.int64),
      'target': tf.io.FixedLenFeature([], tf.int64)
#        'target': tf.io.FixedLenFeature([], tf.int64) 
    }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    label = tf.cast(example['target'], tf.int32)
    data = {}
    dumpp = {}
    data['age_approx'] = tf.cast(example['age_approx'], tf.int32)
    data['sex'] = tf.cast(example['sex'], tf.int32)
    data['image_name'] = tf.cast(example['image_name'], tf.string)
    if ANATOM :
        dumpp['anatom_site_general_challenge'] = tf.cast(tf.one_hot(example['anatom_site_general_challenge'], 7), tf.int32)
        for i in range(7) : 
            data[f'anatom_site_general_challenge{i}'] = dumpp['anatom_site_general_challenge'][i]

#         data['anatom_site_general_challenge'] = tf.cast(example['anatom_site_general_challenge'], tf.int64)
#     label_s = tf.cast(example['target'], tf.string)

    return image, label, data, label  # (image, label, metadata dict, label); setup_input() later reduces this to (inputs, label)


def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
      'image':  tf.io.FixedLenFeature([], tf.string),
      'image_name':tf.io.FixedLenFeature([], tf.string),
      'patient_id': tf.io.FixedLenFeature([], tf.int64),
      'sex': tf.io.FixedLenFeature([], tf.int64),
      'age_approx': tf.io.FixedLenFeature([], tf.int64),
      'anatom_site_general_challenge': tf.io.FixedLenFeature ([], tf.int64),
#       'diagnosis': tf.io.FixedLenFeature([], tf.int64),
#       'target': tf.io.FixedLenFeature([], tf.int64)
  }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
#     label = tf.cast(example['target'], tf.int32)
    data = {}
    dumpp = {}
    data['age_approx'] = tf.cast(example['age_approx'], tf.int32)
    data['sex'] = tf.cast(example['sex'], tf.int32)
    data['image_name'] = tf.cast(example['image_name'], tf.string)
    if ANATOM :
        dumpp['anatom_site_general_challenge'] = tf.cast(tf.one_hot(example['anatom_site_general_challenge'], 7), tf.int32)
        for i in range(7) : 
            data[f'anatom_site_general_challenge{i}'] = dumpp['anatom_site_general_challenge'][i]
#         data['0'] , data['1'] , data['2'] , data['3'] , data['4'] , data['5'] , data['6'] ,data['6'] , data[''] = tf.reshape(data, [-1, 9])  
#         data['anatom_site_general_challenge'] = tf.cast(example['anatom_site_general_challenge'], tf.int32)

#     data['anatom_site_general_challenge'] = tf.cast(tf.one_hot(example['anatom_site_general_challenge'], 7),
#                                                     tf.int32)
#     data['diagnosis'] = tf.cast(example['sex'], tf.int32)
    
    return image, data  # (image, metadata dict) pairs; test records have no label




def load_dataset(filenames, labeled=True, ordered=False):
    # Read from TFRecords. For optimal performance, reading from multiple files 
    # at once and disregarding data order. Order does not matter since we will 
    # be shuffling the data anyway.

    ignore_order = tf.data.Options()
    if not ordered:
        # disable order, increase speed
        ignore_order.experimental_deterministic = False

    # automatically interleaves reads from multiple files
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)
    # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.with_options(ignore_order)
    # returns (image, label, data, label) tuples if labeled=True
    # or (image, data) pairs if labeled=False
    dataset = dataset.map(read_labeled_tfrecord if labeled 
                          else read_unlabeled_tfrecord, num_parallel_calls=AUTO)
    
    return dataset

# def get_test_dataset(ordered=False):
#     dataset = load_dataset(TEST_FNAMES, labeled=False, ordered=ordered)
#     dataset = dataset.batch(BATCH_SIZE)
#     # prefetch next batch while training (autotune prefetch buffer size)
#     dataset = dataset.prefetch(AUTO)
    
#     return dataset

# ===============================================
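
The file-name convention that `count_data_items` relies on can be checked without TensorFlow. This is a small sketch with hypothetical file names and an assumed replica count of 8. It also shows how the item count turns into the number of steps per epoch used later during training.

```python
import re

def count_items(filenames):
    """Sum the item counts encoded as '-<count>.' in each TFRecord file name."""
    pattern = re.compile(r"-([0-9]*)\.")
    return sum(int(pattern.search(name).group(1)) for name in filenames)

# Hypothetical file names following the 'test10-687.tfrec' convention.
files = ["gs://bucket/train00-2071.tfrec", "gs://bucket/train01-2116.tfrec"]
n_items = count_items(files)

# Assumption: 8 replicas, so BATCH_SIZE = 8 * 8 as in the configuration cell.
batch_size = 8 * 8
steps_per_epoch = n_items // batch_size
print(n_items, steps_per_epoch)
```

The integer division drops the final partial batch, which is harmless here because the training dataset repeats indefinitely.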

And here are some quick examples of the test data:

In [None]:
# %%time
# TEST_FNAMES = TEST_FILENAMES
# test_dataset = get_test_dataset()

# print("Examples of the test data:")
# for image, data in test_dataset.take(1):
#     print("The image batch size:", image.numpy().shape)
#     print("Ages, 5 examples:", data['age_approx'].numpy()[:5])
# #     print("Age (scaled), 5 examples:", data['age_scaled'].numpy()[:5])
# #     print("Height, 5 examples:", data['height'].numpy()[:5])
# #     print("Width, 5 examples:", data['width'].numpy()[:5])

### Visualization utilities

In [None]:
# # numpy and matplotlib defaults
# np.set_printoptions(threshold=15, linewidth=80)

# def batch_to_numpy_images_and_labels(databatch):
#     if len(databatch)==4:
#         images, labels, _, _ = databatch
#         numpy_images = images.numpy()
#         numpy_labels = labels.numpy()
#     else:
#         images, _ = databatch
#         numpy_images = images.numpy()
#         numpy_labels = [None for _ in enumerate(numpy_images)]

#     # If no labels, only image IDs, return None for labels (this is the case for test data)
#     return numpy_images, numpy_labels

# def title_from_label_and_target(label, correct_label):
#     if correct_label is None:
#         return CLASSES[label], True
#     correct = (label == correct_label)
#     return "{} [{}{}{}]".format(CLASSES[label], 'OK' if correct else 'NO', u"\u2192" 
#                                 if not correct else '', 
#                                 CLASSES[correct_label] if not correct else ''), correct

# def display_one_image(image, title, subplot, red=False, titlesize=16):
#     plt.subplot(*subplot)
#     plt.axis('off')
#     plt.imshow(image)
# #     try : 
# #         if len(title) > 0:
# #             plt.title(title, fontsize=int(titlesize) if not red else int(titlesize/1.2), 
# #                       color='red' if red else 'black', fontdict={'verticalalignment':'center'}, 
# #                       pad=int(titlesize/1.5)
# #                      )
# #     except :
# #         plt.title(title, fontsize=int(titlesize) if not red else int(titlesize/1.2), 
# #                   color='red' if red else 'black', fontdict={'verticalalignment':'center'}, 
# #                   pad=int(titlesize/1.5)
# #                  )
#     return (subplot[0], subplot[1], subplot[2]+1)

# def display_batch_of_images(databatch, predictions=None):
#     """This will work with:
#     display_batch_of_images(images)
#     display_batch_of_images(images, predictions)
#     display_batch_of_images((images, labels))
#     display_batch_of_images((images, labels), predictions)
#     """
#     # data
#     images, labels = batch_to_numpy_images_and_labels(databatch)
#     if labels is None:
#         labels = [None for _ in enumerate(images)]
        
#     # auto-squaring: this will drop data that does  
#     # not fit into square or square-ish rectangle
#     rows = int(math.sqrt(len(images)))
#     cols = len(images)//rows
        
#     # size and spacing
#     FIGSIZE = 4.0
#     SPACING = 0.1
#     subplot=(rows,cols,1)
#     if rows < cols:
#         plt.figure(figsize=(FIGSIZE,FIGSIZE/cols*rows))
#     else:
#         plt.figure(figsize=(FIGSIZE/rows*cols,FIGSIZE))
    
#     # display
#     for i, (image, label) in enumerate(zip(images[:rows*cols], labels[:rows*cols])):
#         title = '' if label is None else CLASSES[label]
#         correct = True
#         if predictions is not None:
#             title, correct = title_from_label_and_target(predictions[i], label)
#         # magic formula tested to work from 1x1 to 10x10 images
#         dynamic_titlesize = FIGSIZE*SPACING/max(rows,cols)*40+3
#         subplot = display_one_image(image, title, subplot, 
#                                      not correct, titlesize=dynamic_titlesize)
    
#     #layout
#     plt.tight_layout()
#     if label is None and predictions is None:
#         plt.subplots_adjust(wspace=0, hspace=0)
#     else:
#         plt.subplots_adjust(wspace=SPACING, hspace=SPACING)
#     plt.show()

In [None]:
def dropout(image, DIM=256, PROBABILITY = 0.75, CT = 8, SZ = 0.2):
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image with CT squares of side size SZ*DIM removed
    
    # DO DROPOUT WITH PROBABILITY DEFINED ABOVE
    P = tf.cast( tf.random.uniform([],0,1)<PROBABILITY, tf.int32)
    if (P==0)|(CT==0)|(SZ==0): return image
    
    for k in range(CT):
        # CHOOSE RANDOM LOCATION
        x = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        y = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        # COMPUTE SQUARE 
        WIDTH = tf.cast( SZ*DIM,tf.int32) * P
        ya = tf.math.maximum(0,y-WIDTH//2)
        yb = tf.math.minimum(DIM,y+WIDTH//2)
        xa = tf.math.maximum(0,x-WIDTH//2)
        xb = tf.math.minimum(DIM,x+WIDTH//2)
        # DROPOUT IMAGE
        one = image[ya:yb,0:xa,:]
        two = tf.zeros([yb-ya,xb-xa,3]) 
        three = image[ya:yb,xb:DIM,:]
        middle = tf.concat([one,two,three],axis=1)
        image = tf.concat([image[0:ya,:,:],middle,image[yb:DIM,:,:]],axis=0)
            
    # RESHAPE HACK SO TPU COMPILER KNOWS SHAPE OF OUTPUT TENSOR 
    image = tf.reshape(image,[DIM,DIM,3])
    return image

## Data Augment

In [None]:
ROT_ = 180.0
SHR_ = 2.0
HZOOM_ = 8.0
WZOOM_ = 8.0
HSHIFT_ = 8.0
WSHIFT_ = 8.0
import tensorflow.keras.backend as K
def get_mat(rotation, shear, height_zoom, width_zoom, height_shift, width_shift):
    # returns 3x3 transform matrix which transforms indices
        
    # CONVERT DEGREES TO RADIANS
    rotation = math.pi * rotation / 180.
    shear    = math.pi * shear    / 180.

    def get_3x3_mat(lst):
        return tf.reshape(tf.concat([lst],axis=0), [3,3])
    
    # ROTATION MATRIX
    c1   = tf.math.cos(rotation)
    s1   = tf.math.sin(rotation)
    one  = tf.constant([1],dtype='float32')
    zero = tf.constant([0],dtype='float32')
    
    rotation_matrix = get_3x3_mat([c1,   s1,   zero, 
                                   -s1,  c1,   zero, 
                                   zero, zero, one])    
    # SHEAR MATRIX
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)    
    
    shear_matrix = get_3x3_mat([one,  s2,   zero, 
                                zero, c2,   zero, 
                                zero, zero, one])        
    # ZOOM MATRIX
    zoom_matrix = get_3x3_mat([one/height_zoom, zero,           zero, 
                               zero,            one/width_zoom, zero, 
                               zero,            zero,           one])    
    # SHIFT MATRIX
    shift_matrix = get_3x3_mat([one,  zero, height_shift, 
                                zero, one,  width_shift, 
                                zero, zero, one])
    
    return K.dot(K.dot(rotation_matrix, shear_matrix), 
                 K.dot(zoom_matrix,     shift_matrix))


def transform(image, DIM=256):    
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image randomly rotated, sheared, zoomed, and shifted
    XDIM = DIM%2 #fix for size 331
    
    rot = ROT_ * tf.random.normal([1], dtype='float32')
    shr = SHR_ * tf.random.normal([1], dtype='float32') 
    h_zoom = 1.0 + tf.random.normal([1], dtype='float32') / HZOOM_
    w_zoom = 1.0 + tf.random.normal([1], dtype='float32') / WZOOM_
    h_shift = HSHIFT_ * tf.random.normal([1], dtype='float32') 
    w_shift = WSHIFT_ * tf.random.normal([1], dtype='float32') 

    # GET TRANSFORMATION MATRIX
    m = get_mat(rot,shr,h_zoom,w_zoom,h_shift,w_shift) 

    # LIST DESTINATION PIXEL INDICES
    x   = tf.repeat(tf.range(DIM//2, -DIM//2,-1), DIM)
    y   = tf.tile(tf.range(-DIM//2, DIM//2), [DIM])
    z   = tf.ones([DIM*DIM], dtype='int32')
    idx = tf.stack( [x,y,z] )
    
    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(m, tf.cast(idx, dtype='float32'))
    idx2 = K.cast(idx2, dtype='int32')
    idx2 = K.clip(idx2, -DIM//2+XDIM+1, DIM//2)
    
    # FIND ORIGIN PIXEL VALUES           
    idx3 = tf.stack([DIM//2-idx2[0,], DIM//2-1+idx2[1,]])
    d    = tf.gather_nd(image, tf.transpose(idx3))
        
    return tf.reshape(d,[DIM, DIM,3])



def data_augment(data, label):
    # data augmentation. Thanks to the dataset.prefetch(AUTO) statement 
    # in the next function (below), this happens essentially for free on TPU. 
    # Data pipeline code is executed on the "CPU" part
    # of the TPU while the TPU itself is computing gradients.
    img = data['inp1']
    img = transform(img,DIM=dim)
    if DROPOUT :
        img = dropout(img, DIM=dim, PROBABILITY=droprate, CT=dropct, SZ=dropsize)
    img = tf.image.random_flip_left_right(img)
    #img = tf.image.random_hue(img, 0.01)
    img = tf.image.random_saturation(img, 0.7, 1.3)
    img = tf.image.random_contrast(img, 0.8, 1.2)
    img = tf.image.random_brightness(img, 0.1)
    img = tf.reshape(img, [dim,dim, 3])
    data['inp1'] = img
    
#     data['inp1'] = tf.image.random_flip_left_right(data['inp1'])
#     data['inp1'] = tf.image.random_flip_up_down(data['inp1'])
    #image = tf.image.random_saturation(image, 0, 2)
    
    return data, label

def data_augment_test(data , label = 0):
    # data augmentation. Thanks to the dataset.prefetch(AUTO) statement 
    # in the next function (below), this happens essentially for free on TPU. 
    # Data pipeline code is executed on the "CPU" part
    # of the TPU while the TPU itself is computing gradients.
    img = data['inp1']
    img = transform(img,DIM=dim)
    if DROPOUT :
        img = dropout(img, DIM=dim, PROBABILITY=droprate, CT=dropct, SZ=dropsize)
    img = tf.image.random_flip_left_right(img)
    #img = tf.image.random_hue(img, 0.01)
    img = tf.image.random_saturation(img, 0.7, 1.3)
    img = tf.image.random_contrast(img, 0.8, 1.2)
    img = tf.image.random_brightness(img, 0.1)
    img = tf.reshape(img, [dim,dim, 3])
    data['inp1'] = img
    
#     data['inp1'] = tf.image.random_flip_left_right(data['inp1'])
#     data['inp1'] = tf.image.random_flip_up_down(data['inp1'])
    #image = tf.image.random_saturation(image, 0, 2)
    
    return data , 0
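
To see what the matrix built by `get_mat` does, the sketch below mirrors its composition (rotation, then shear, zoom, and shift multiplied in the same order) in plain Python and applies it to single points. The helper names `matmul3`, `get_mat_py`, and `apply_mat` are made up for this illustration and are not part of the notebook.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def get_mat_py(rotation, shear, height_zoom, width_zoom, height_shift, width_shift):
    """Plain-Python mirror of get_mat: (rotation @ shear) @ (zoom @ shift), angles in degrees."""
    r, s = math.radians(rotation), math.radians(shear)
    rot = [[math.cos(r), math.sin(r), 0.0], [-math.sin(r), math.cos(r), 0.0], [0.0, 0.0, 1.0]]
    shr = [[1.0, math.sin(s), 0.0], [0.0, math.cos(s), 0.0], [0.0, 0.0, 1.0]]
    zoom = [[1.0 / height_zoom, 0.0, 0.0], [0.0, 1.0 / width_zoom, 0.0], [0.0, 0.0, 1.0]]
    shift = [[1.0, 0.0, height_shift], [0.0, 1.0, width_shift], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(rot, shr), matmul3(zoom, shift))

def apply_mat(m, x, y):
    """Map the homogeneous point (x, y, 1) through m and return (x', y')."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])

identity = get_mat_py(0, 0, 1, 1, 0, 0)
quarter_turn = get_mat_py(90, 0, 1, 1, 0, 0)
zoom_and_shift = get_mat_py(0, 0, 2, 2, 5, -2)

print(apply_mat(identity, 3, 4))
print(apply_mat(quarter_turn, 1, 0))
print(apply_mat(zoom_and_shift, 0, 0))
```

In `transform`, this matrix maps each destination pixel index back to a source pixel index, which is why the zoom entries are reciprocals.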

### Dataset visualizations

In [None]:
# # Peek at test data
# test_dataset = test_dataset.unbatch().batch(20)
# test_batch = iter(test_dataset)

In [None]:
# run this cell again for next set of images
# display_batch_of_images(next(test_batch))

In [None]:
# del test_dataset, test_batch
# gc.collect()

# 4. MODEL setup


In [None]:
# Learning rate schedule for TPU, GPU and CPU.
# Using an LR ramp up because fine-tuning a pre-trained model.
# Starting with a high LR would break the pre-trained weights.

LR_START = 0.000005#0.00001
LR_MAX = 0.00000725 * strategy.num_replicas_in_sync
LR_MIN = 0.000005
LR_RAMPUP_EPOCHS = 5
LR_SUSTAIN_EPOCHS = 4
LR_EXP_DECAY = .8

def lrfn(epoch):
    if epoch < LR_RAMPUP_EPOCHS:
        lr = (LR_MAX - LR_START) / LR_RAMPUP_EPOCHS * epoch + LR_START
    elif epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
        lr = LR_MAX
    else:
        lr = (LR_MAX - LR_MIN) * LR_EXP_DECAY**(epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS) + LR_MIN
    return lr
    
lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = True)

rng = [i for i in range(EPOCHS)]
y = [lrfn(x) for x in rng]
plt.plot(rng, y)
print("Learning rate schedule: {:.3g} to {:.3g} to {:.3g}".format(y[0], max(y), y[-1]))
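
The three phases of the schedule (linear ramp-up, plateau, exponential decay) are easier to see with concrete numbers. The sketch below re-implements `lrfn` as a closure so the replica count can be passed in explicitly. The value of 8 replicas is an assumption for illustration, and `make_lrfn` is not part of the notebook.

```python
def make_lrfn(replicas, lr_start=5e-6, lr_max_per_replica=7.25e-6, lr_min=5e-6,
              rampup_epochs=5, sustain_epochs=4, exp_decay=0.8):
    """Build a schedule with the same three phases as lrfn for a given replica count."""
    lr_max = lr_max_per_replica * replicas

    def lrfn(epoch):
        if epoch < rampup_epochs:
            # Linear ramp from lr_start to lr_max.
            return (lr_max - lr_start) / rampup_epochs * epoch + lr_start
        if epoch < rampup_epochs + sustain_epochs:
            # Plateau at lr_max.
            return lr_max
        # Exponential decay towards lr_min.
        return (lr_max - lr_min) * exp_decay ** (epoch - rampup_epochs - sustain_epochs) + lr_min

    return lrfn

lrfn_8 = make_lrfn(replicas=8)   # assumption: 8 replicas
for epoch in (0, 4, 5, 9, 10):
    print(epoch, f"{lrfn_8(epoch):.3g}")
```

One consequence worth noticing: with `EPOCHS = 10` the last epoch index is 9, where the decay exponent is still 0, so with these settings the learning rate never drops below `LR_MAX` during the run. Train for more epochs or shorten `LR_SUSTAIN_EPOCHS` if you want the decay phase to take effect.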

In [None]:
train['target'].values

class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(train['target'].values),
                                                  y=train['target'].values,
                                                 )

class_weights = {i : class_weights[i] for i in range(len(class_weights))}

print(class_weights)
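
The `'balanced'` option computes each weight as `n_samples / (n_classes * count_of_class)`. Here is a minimal pure-Python sketch of that formula on hypothetical labels (the real class ratio comes from `train['target']`). The helper name `balanced_class_weights` is made up for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical labels: 98 benign (0) and 2 malignant (1) samples.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives a much larger weight, so each malignant example contributes more to the loss than a benign one.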

In [None]:
if ANATOM :
    tab_feats=[
          'sex',
          'age_approx',
        'anatom_site_general_challenge0',
        'anatom_site_general_challenge1',
        'anatom_site_general_challenge2',
        'anatom_site_general_challenge3',
        'anatom_site_general_challenge4',
        'anatom_site_general_challenge5',
        'anatom_site_general_challenge6'
    ]
else :
    tab_feats=[
          'sex',
          'age_approx',
    ]

N_TAB_FEATS=len(tab_feats) 

print(f"The number of tabular features is {N_TAB_FEATS}.")

In [None]:
%%time

EFNS = [efn.EfficientNetB0, efn.EfficientNetB1, efn.EfficientNetB2, efn.EfficientNetB3, 
        efn.EfficientNetB4, efn.EfficientNetB5, efn.EfficientNetB6]

def get_model():
    with strategy.scope():
        pretrained_model = EFNS[ef](input_shape=(*IMAGE_SIZE, 3),
                                              weights='imagenet',
                                              include_top=False
                                             )
        # False = transfer learning, True = fine-tuning
        pretrained_model.trainable = True#False 

        inp1 = tf.keras.layers.Input(shape=(*IMAGE_SIZE, 3), name='inp1')
        inp2 = tf.keras.layers.Input(shape=(N_TAB_FEATS,), name='inp2')
        
        # BUILD MODEL HERE
        
        x=pretrained_model(inp1)
        x=tf.keras.layers.GlobalAveragePooling2D()(x)
        x=tf.keras.layers.Dense(512, 
                                kernel_regularizer=tf.keras.regularizers.l2(l=0.01),
                                activation='relu')(x)
        x=tf.keras.layers.Dropout(0.2)(x)
        x=tf.keras.layers.Dense(256, 
                                kernel_regularizer=tf.keras.regularizers.l2(l=0.01),
                                activation='relu')(x)
        x=tf.keras.layers.Dropout(0.2)(x)
        x=tf.keras.layers.Dense(128, 
                                kernel_regularizer=tf.keras.regularizers.l2(l=0.01),
                                activation='relu')(x)
        x=tf.keras.layers.Dropout(0.2)(x)
        x=tf.keras.layers.Dense(64, kernel_regularizer=tf.keras.regularizers.l2(l=0.01),
                                activation='relu')(x)
        x=tf.keras.layers.Dropout(0.2)(x)
        
        y=tf.keras.layers.Dense(100, 
                                kernel_regularizer=tf.keras.regularizers.l2(l=0.01),
                                activation='relu')(inp2)
        
        concat=tf.keras.layers.concatenate([y, x])
        
        output = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(concat)
        
        model = tf.keras.models.Model(inputs=[inp1,inp2], outputs=[output])
    
        model.compile(
        optimizer='adam',
        loss = 'binary_crossentropy',
        metrics=[tf.keras.metrics.AUC()],
        )
        
        return model

In [None]:
# %%time

# model=get_model()

# model.summary()

# del model
# gc.collect()

In [None]:
if colab:
    
    SAVE_FOLDER=NAME
    
    if tstamp:
        time_zone = pytz.timezone('America/Chicago')
        current_datetime = datetime.now(time_zone)
        ts=current_datetime.strftime("%m%d%H%M%S")
        SAVE_FOLDER+='_'+ts
        
    SAVE_FOLDER=PATH/SAVE_FOLDER
    if not os.path.exists(SAVE_FOLDER):
        os.mkdir(SAVE_FOLDER)

else:
    SAVE_FOLDER=Path('/kaggle/working')

In [None]:
class save_best_n(tf.keras.callbacks.Callback):
    def __init__(self, fn, model):
        self.fn = fn
        self.model = model

    def on_epoch_end(self, epoch, logs=None):
        
        if (epoch>0):
            score=logs.get("val_auc")
        else:
            score=-1
      
        if (score > best_score[fold_num].min()):
          
            idx_min=np.argmin(best_score[fold_num])

            best_score[fold_num][idx_min]=score
            best_epoch[fold_num][idx_min]=epoch+1

            path_best_model=f'best_model_fold_{self.fn}_{idx_min}.hdf5'
            self.model.save(SAVE_FOLDER/path_best_model)
            ############# WARNING: ##################################
            # Make sure you have enough space to store your models. 
            # Remember that Kaggle allows you to save no more than 5 GB
            # to disk. It should not be a problem for EfficientNet B0 
            # or B3 but it is not going to work for B7. I am saving my
            # models to Google Drive where I have plenty of space.
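
The bookkeeping inside `save_best_n` boils down to "replace the worst kept score whenever a better one arrives". The sketch below isolates that logic in plain Python with hypothetical validation AUCs. `update_best` is an illustrative helper, and it leaves out the callback's special handling of the first epoch.

```python
def update_best(best_scores, best_epochs, score, epoch):
    """Replace the current worst kept score if `score` beats it.

    Returns the slot index that was overwritten, or None if nothing changed.
    """
    worst = min(best_scores)
    if score <= worst:
        return None
    slot = best_scores.index(worst)   # first slot holding the worst score
    best_scores[slot] = score
    best_epochs[slot] = epoch
    return slot                       # the callback would overwrite the model file for this slot

nbest = 2
best_scores, best_epochs = [0.0] * nbest, [0] * nbest
for epoch, val_auc in enumerate([0.80, 0.85, 0.83, 0.90], start=1):   # hypothetical AUCs
    update_best(best_scores, best_epochs, val_auc, epoch)

print(best_scores, best_epochs)
```

Because each slot maps to one file name, at most `NBEST` model files per fold exist on disk at any time.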

# 5. Model Train


In [None]:
def setup_input(image, label, data, label_name):
    
    tab_data=[tf.cast(data[tfeat], dtype=tf.float32) for tfeat in tab_feats]
    
    tabular=tf.stack(tab_data)
    
    return {'inp1': image, 'inp2':  tabular}, label
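
To illustrate what ends up in `inp2`, the sketch below builds the same kind of tabular vector in plain Python: the anatomical site is one-hot encoded into 7 columns and placed after `sex` and `age_approx`, giving 9 features when `ANATOM` is enabled. The record values and the helpers `one_hot` and `build_tabular` are hypothetical; the actual integer encodings are whatever the TFRecords contain.

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]

def build_tabular(record, feature_names):
    """Flatten a parsed metadata dict into a float vector ordered by feature_names."""
    data = {"sex": record["sex"], "age_approx": record["age_approx"]}
    for i, bit in enumerate(one_hot(record["anatom_site_general_challenge"], 7)):
        data[f"anatom_site_general_challenge{i}"] = bit
    return [float(data[name]) for name in feature_names]

feature_names = ["sex", "age_approx"] + [
    f"anatom_site_general_challenge{i}" for i in range(7)
]
# Hypothetical record for illustration.
record = {"sex": 1, "age_approx": 45, "anatom_site_general_challenge": 2}
vector = build_tabular(record, feature_names)
print(len(vector), vector)
```

Note that `age_approx` is passed in unscaled here, as in the notebook; normalising it is one of the things you could experiment with.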

## Data Augment

In [None]:
def get_training_dataset(dataset):
    dataset = dataset.map(data_augment, num_parallel_calls=AUTO)
    # the training dataset must repeat for several epochs
    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    #dataset = dataset.repeat()
    dataset = dataset.batch(BATCH_SIZE)
    # prefetch next batch while training (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO)
    
    return dataset

In [None]:
def get_validation_dataset(dataset):
#     return get_training_dataset(dataset)
    dataset = dataset.map(data_augment, num_parallel_calls=AUTO)

    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    # prefetch next batch while training (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO)
    
    return dataset

In [None]:
# model = get_model()

In [None]:
train_dataset = load_dataset(TRAIN_FNAMES['fold_1'])
train_dataset

## ************************
## 5.1 Fitting model
## ************************

In [None]:
debug=0
# EPOCHS = 10
histories = []

best_epoch={fn: np.zeros(NBEST) for fn in range(1, NFOLDS+1)}
best_score={fn: np.zeros(NBEST) for fn in range(1, NFOLDS+1)}

for fold_num in range(1, NFOLDS+1):
# for fold_num in range(1, 2):    
    tf.keras.backend.clear_session()
    # clear tpu memory (otherwise can run into Resource Exhausted Error)
    # see https://www.kaggle.com/c/flower-classification-with-tpus/discussion/131045
    if tpu:
        tf.tpu.experimental.initialize_tpu_system(tpu)
    
    print("="*50)
    print(f"Starting fold {fold_num} out of {NFOLDS}...")
    
    files_trn=TRAIN_FNAMES[f"fold_{fold_num}"]
    files_val=VAL_FNAMES[f"fold_{fold_num}"]
    
    if debug:
#         TTA = 2
        NUMS = 2
        files_trn=files_trn[0:2]
        files_val=files_val[0:2]
        EPOCHS=2
       
    train_dataset = load_dataset(files_trn)
    train_dataset = train_dataset.map(setup_input, num_parallel_calls=AUTO)
    
    val_dataset = load_dataset(files_val, ordered = True)
    val_dataset = val_dataset.map(setup_input, num_parallel_calls=AUTO)
    
    model = get_model()
    
    STEPS_PER_EPOCH = count_data_items(files_trn) // BATCH_SIZE
#     STEPS_PER_EPOCH *= 1.5 
    
    print(f'STEPS_PER_EPOCH = {STEPS_PER_EPOCH}')

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = True)
    
    history = model.fit(get_training_dataset(train_dataset), 
                        steps_per_epoch=STEPS_PER_EPOCH, 
                        epochs=EPOCHS, 
                        callbacks=[lr_callback,
                                   save_best_n(fold_num, model),
                                   ],
                        validation_data=get_validation_dataset(val_dataset),
                        class_weight=class_weights,
                        verbose=2,
                       )
    
    idx_sorted=np.argsort(best_score[fold_num])
    best_score[fold_num]=np.array(best_score[fold_num])[idx_sorted]
    best_epoch[fold_num]=np.array(best_epoch[fold_num])[idx_sorted]

    print(f"\nFold {fold_num} is finished. The best epochs: {[int(e) for e in best_epoch[fold_num]]}")
    print(f"The corresponding scores: {[round(s, 5) for s in best_score[fold_num]]}")

    histories.append(history)

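Because `get_training_dataset` calls `repeat()`, the training stream never ends, and `steps_per_epoch` is what tells Keras where one epoch stops. A rough sketch of that arithmetic follows. The item count and batch size are made-up numbers, and `steps_for_epoch` is an illustrative helper, not a function from this notebook.

```python
def steps_for_epoch(num_items, batch_size, multiplier=1.0):
    """Whole batches per pass over the data (floor division)."""
    steps = num_items // batch_size
    # Scaling (as in the commented-out `*= 1.5`) yields a float in Python,
    # so truncate back to an int before using it as a step count.
    return max(1, int(steps * multiplier))


n_items, batch = 26000, 128  # hypothetical values, not the real dataset size
base_steps = steps_for_epoch(n_items, batch)
longer_steps = steps_for_epoch(n_items, batch, multiplier=1.5)
# Examples that do not fill a whole batch within one pass; with repeat()
# they simply roll over into the next epoch.
leftover = n_items - base_steps * batch
print(base_steps, longer_steps, leftover)
```

With a multiplier above 1, each "epoch" reported by Keras covers more than one pass over the fold, which also means the learning-rate schedule and the validation runs fire less often per image seen.
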
### 5.2 Visualization of training progress

In [None]:
train.to_csv('train.csv')

In [None]:
history.history

In [None]:
# model.save("2inp.h5")


In [None]:
def display_training_curves(fold_num, data):

    plt.figure(figsize=(10,5), facecolor='#F0F0F0')

    epochs=np.arange(1, EPOCHS+1)

    # AUC
    plt.plot(epochs, data['auc'], label='training auc', color='red')
    plt.plot(epochs, data['val_auc'], label='validation auc', color='orange')

    # Loss
    plt.plot(epochs, data['loss'], label='training loss', color='blue')    
    plt.plot(epochs, data['val_loss'], label='validation loss', color='green')

    # Best
    ls=['dotted', 'dashed', 'dashdot', 'solid'] # don't use more than 4 best epochs 
                                                # or make proper adjustments!
    for i in range(NBEST):
        plt.axvline(best_epoch[fold_num][i], 0, 
                    best_score[fold_num][i], linestyle=ls[i], 
                    color='black', label=f'AUC {best_score[fold_num][i]:.5f}')
    
    plt.title(f"Fold {fold_num}. The best epochs: {[int(e) for e in best_epoch[fold_num]]}; the best AUCs: {[round(s, 5) for s in best_score[fold_num]]}.", 
              fontsize=14)
    plt.ylabel('Loss/AUC', fontsize=12)
    plt.xlabel('Epoch', fontsize=12)
    plt.ylim((0, 1))
    plt.legend(loc='lower left')
    plt.tight_layout()
    plt.show()

# for fn in range(1, NFOLDS+1):
#     display_training_curves(fn, data=histories[fn-1].history)

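The `ls` list in `display_training_curves` holds four line styles, so indexing it with `ls[i]` raises an `IndexError` once `NBEST` exceeds 4. One possible adjustment, sketched here in isolation with a hypothetical `nbest` value and a hypothetical helper name, is to wrap the index with a modulo:

```python
LINE_STYLES = ['dotted', 'dashed', 'dashdot', 'solid']


def linestyle_for(i, styles=LINE_STYLES):
    """Return a line style for marker i, cycling when i runs past the list."""
    return styles[i % len(styles)]


nbest = 6  # hypothetical: more best epochs than available styles
chosen = [linestyle_for(i) for i in range(nbest)]
print(chosen)
```

Inside the plotting loop this would replace `linestyle=ls[i]` with `linestyle=linestyle_for(i)`. Styles repeat after the fourth marker, so the legend labels are what tell the markers apart.
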
### Predictions

In [None]:
def setup_test_image(image, data):    
    tab_data=[tf.cast(data[tfeat], dtype=tf.float32) for tfeat in tab_feats]
    tabular=tf.stack(tab_data)

    return {'inp1': image, 'inp2': tabular}

def setup_test_name(image, data):
    return data['image_name']

def get_test_dataset(dataset):
    dataset = dataset.map(data_augment_test, num_parallel_calls=AUTO)

    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    # prefetch the next batch while predicting (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO)
    
    return dataset
    
def get_test_ds_name(dataset) :
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(AUTO)
    
    return dataset

def average_predictions(X, fn):
    
    y_probas=[]
    
    for idx in range(NBEST):
        
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.experimental.TPUStrategy(tpu)
#         gc.collect()

        print(f"Predicting: fold {fn}, model {idx+1} out of {NBEST}...")

        with strategy.scope():
            path_best_model=f'best_model_fold_{fn}_{idx}.hdf5'
            model=tf.keras.models.load_model(SAVE_FOLDER/path_best_model)

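        y=model.predict(X)
        # convert raw outputs to normalised ranks in (0, 1] so that models
        # with different output scales contribute equally to the average
        y = rankdata(y)/len(y)
        y_probas.append(y)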
    
    y_probas=np.average(y_probas, axis=0)

    return y_probas

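`average_predictions` converts each model's output to ranks before averaging. A minimal pure-Python sketch of that idea follows. `rank_normalise` is a hypothetical stand-in that mirrors the default tie handling (average rank) of `scipy.stats.rankdata`, and the two score lists are invented for illustration.

```python
def rank_normalise(scores):
    """Map scores to (0, 1] by average rank, i.e. rank / number of scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Find the run of tied values that begins at position `start`.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Tied values share the mean of the 1-based ranks they span.
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return [r / len(scores) for r in ranks]


def average_ranked(predictions):
    """Average several rank-normalised prediction lists element-wise."""
    normalised = [rank_normalise(p) for p in predictions]
    return [sum(col) / len(col) for col in zip(*normalised)]


# Two hypothetical checkpoints whose raw outputs live on different scales.
model_a = [0.10, 0.90, 0.50, 0.30]
model_b = [0.01, 0.04, 0.03, 0.02]
blended = average_ranked([model_a, model_b])
print(blended)
```

Both lists order the four images identically, so the blend equals either set of normalised ranks even though the raw scales differ by an order of magnitude. The result is a ranking score rather than a calibrated probability, which suits a rank-based metric such as the AUC tracked in this notebook.
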
# 6. PREDICTIONS

In [None]:
model.history

# Select TTA here

In [None]:
lsti = []
TEST_FNAMES = TEST_FILENAMES
for i in range(TTA):
    lstj = []
    N_TEST_IMGS = count_data_items(TEST_FNAMES)
    preds = pd.DataFrame({'image_name': np.zeros(len(test)), 'target': np.zeros(len(test))})

    test_ds = load_dataset(TEST_FNAMES, labeled=False, ordered=True)
    test_images_ds = test_ds.map(setup_test_image, num_parallel_calls=AUTO)

    test_images_ds = get_test_dataset(test_images_ds)
    test_ds = get_test_ds_name(test_ds)

    test_ids_ds = test_ds.map(setup_test_name, num_parallel_calls=AUTO).unbatch()
    sub = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')

    preds['image_name'] = next(iter(test_ids_ds.batch(N_TEST_IMGS))).numpy().astype('U')
#     lstj = np.average([average_predictions(test_images_ds, fn) for fn in range(1, 3)], axis = 0)

    lstj = np.average([average_predictions(test_images_ds, fn) for fn in range(1, NFOLDS + 1)], axis = 0)
    lsti.append(lstj)
    
    del test_images_ds , test_ds , test_ids_ds
    gc.collect()
    
lsti
# y=model.predict(test_images_ds)


In [None]:
a = np.array(lsti)
a

In [None]:
a.shape

In [None]:
preds['target'] = a.mean(axis=0)

In [None]:
preds

In [None]:
%%time
# swap the placeholder targets of the sample submission for the averaged predictions
del sub['target']
sub = sub.merge(preds, on='image_name')
sub.head()

sub.to_csv('effB1_512_5x15epochs_3Stratif.csv', index=False)
