# Stacking model predictions and incorporating image data

One method for model stacking that I have not seen explored much on Kaggle is combining model predictions with some form of image encoding. Below is an attempt to build a model that combines the out-of-fold (OOF) predictions of my best models with image embeddings extracted from [this](https://www.kaggle.com/cdeotte/rapids-cuml-knn-find-duplicates) notebook.

I am curious why people do not pursue this method, and whether anyone has ideas for how it can be improved.

I have also experimented with this form of stacking using EfficientNet to learn the encodings and concatenating my predictions as though they were meta data, but my TPU hours were too valuable at the close of this competition, so I stopped pursuing that avenue. If there is any interest in that, I can post the kernel. At the end of this notebook I have included the way that predictions can be read into training when using tfrecords.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow.keras.layers as L

from kaggle_datasets import KaggleDatasets

from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold

import random
from functools import reduce

import scipy.stats # import the submodule explicitly so scipy.stats.gmean is always available


## Load the data

In [None]:
def get_df(file,name):
    try:
        df = pd.read_csv(file.format(name))
    except FileNotFoundError:
        # This is guaranteed to exist unless the filepath was entered incorrectly
        df = pd.read_csv(file.format(''))
    return df

def read_oof(data_ext,name=''):
    return [get_df('../input/'+file+'/{}oof.csv',name).set_index('image_name') for file in data_ext]

def read_pred(attach,mean,name=''):
    df = get_df('../input/'+attach + '/{}preds_all.csv',name)
    img_col = np.where(df.dtypes=='object')[0][0]
    df.columns.values[img_col]='0'
    df = pd.DataFrame(dict(image_name=df['0'], target=mean(df.drop(['0'],axis=1),axis=1)))
    return df.set_index('image_name')

def read_preds(data_ext,mean,name=''):
    return [read_pred(file,mean,name=name) for file in data_ext]

In [None]:
# List all of the data that you wish to include

data_ext = ['meta-128-assemble',
            'meta-192',
            'bs-384-combine',
            'sls-assemble',
            'meta-512-assemble',
            'meta-768-ls-0-05-assemble',
            'meta-1026-ls-0-05-assemble',
            'meta-data-xgb']

# Read data into lists
# Note that preds_all.csv contains all of the predictions from each of the models that I trained, this allows me to 
# use different methods for averaging them.
mn_func = scipy.stats.gmean
sub_list = read_preds(data_ext,mn_func)
oof_list = read_oof(data_ext,'oof')
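To make the choice of `mn_func` concrete, here is a minimal sketch comparing the arithmetic and geometric mean of per-fold predictions. The array `fold_preds` is made-up illustrative data, not output from my models, and the geometric mean is computed by hand with NumPy (it should agree with `scipy.stats.gmean` along `axis=1`).

```python
import numpy as np

# Hypothetical predictions: 3 images (rows) x 4 fold models (columns).
fold_preds = np.array([
    [0.90, 0.80, 0.85, 0.10],   # one model strongly disagrees
    [0.02, 0.03, 0.02, 0.04],
    [0.50, 0.50, 0.50, 0.50],   # all models agree
])

arith = fold_preds.mean(axis=1)
# Geometric mean = exp(mean(log(p))), the same quantity as gmean(fold_preds, axis=1).
geo = np.exp(np.log(fold_preds).mean(axis=1))

for a, g in zip(arith, geo):
    print('arithmetic = %.4f, geometric = %.4f' % (a, g))
```

The geometric mean is never larger than the arithmetic mean, and a single low prediction pulls it down more strongly, which is why the two choices can rank images differently.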

In [None]:
def collect(df_list):
    return pd.concat(df_list,axis=1)

# Collect all of the data
sub_df = collect(sub_list)
oof_df = collect(oof_list)

# Separate data into useful parts
oof_df = oof_df.dropna()
ids = oof_df.index
oof = np.array( oof_df['pred'] )
labels = np.array(oof_df['target'])[:,0]
print('There are {} values overlapping'.format(len(oof_df)))

idsO = sub_df.index
data_sub = np.array(sub_df)

In [None]:
# Load the embedded data
DIM = 256; EFFN = 0; BATCH_SIZE = 128
PATH_TO_EMBEDDINGS = '../input/embeddingsmelanoma/'
embed = np.load(PATH_TO_EMBEDDINGS+'embed_train_%i_%i.npy'%(DIM,EFFN))
names = np.load(PATH_TO_EMBEDDINGS+'names_train.npy')
embed_test = np.load(PATH_TO_EMBEDDINGS+'embed_test_%i_%i.npy'%(DIM,EFFN))
names_test = np.load(PATH_TO_EMBEDDINGS+'names_test.npy')

In [None]:
# LOAD TRAIN AND TEST CSV
test = pd.read_csv( '../input/siim-isic-melanoma-classification/test.csv' ).set_index('image_name',drop=True)
test = test.loc[names_test]
test = pd.concat((test,sub_df),axis=1).reset_index()
print('Test csv shape',test.shape)

train = pd.read_csv( '../input/melanoma-%ix%i/train.csv'%(DIM,DIM) ).set_index('image_name',drop=True)
train = train.loc[names]
train = pd.concat((train,oof_df['pred']),axis=1).reset_index()
train.target = train.target.astype('float32')
print('Train csv shape',train.shape)

print('Displaying train.csv below...')
train.head()

## Concatenate and preprocess the data

In [None]:
cat_enc = OneHotEncoder(drop='first')
# num_enc = MinMaxScaler()
num_enc = StandardScaler()
embed_enc = MinMaxScaler()
numeric_features = ['age_approx']
cat_features = ['sex','anatom_site_general_challenge']

cats = cat_enc.fit_transform(train[cat_features].fillna('0')).toarray()
nums = num_enc.fit_transform(train[numeric_features].fillna(0))
embed_ENC = embed_enc.fit_transform(embed)
Xtrain = np.concatenate((cats,nums,embed_ENC),axis=1)
Xpreds = np.array(train['pred'])

cats = cat_enc.transform(test[cat_features].fillna('0')).toarray()
nums = num_enc.transform(test[numeric_features].fillna(0))
embed_encT = embed_enc.transform(embed_test)
Xtest = np.concatenate((cats,nums,embed_encT),axis=1)
Xtargets = np.array(test['target'])

In [None]:
# Take a look at the embeddings
print(np.shape(embed_ENC))
for i in range(100):
    plt.hist(embed_ENC[i],alpha=0.5,bins=100)

In [None]:
# Look at the per-sample spread of the embeddings to estimate a noise scale for augmentation.
std_all = np.std(embed_ENC,axis=1)
plt.hist(std_all,bins=100)
plt.show()
embed_std = np.mean(std_all)
print('Mean standard deviation is {}'.format(embed_std))

## Set hyperparameters

In [None]:
weights = {0:1, 1:15}
FOLDS=5
SEED=42
DISPLAY_PLOT = 1
REPLICAS=1
EPOCHS=100
TTA=11
batch_size = 64
VERBOSE=0

In [None]:
tf.random.set_seed(5);

In [None]:
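def data_augment(data,mean=0.0,std1=embed_std/10,std2=0.03):
    # Use tf.random rather than np.random: inside Dataset.map the function is traced once,
    # so NumPy noise would be drawn a single time and reused for every element.
    gauss = tf.random.normal(tf.shape(data[0]), mean=mean, stddev=std1, dtype=data[0].dtype)
    new_data = data[0] + gauss
    gauss = tf.random.normal(tf.shape(data[1]), mean=mean, stddev=std2, dtype=data[1].dtype)
    new_preds = data[1] + gauss
    return (new_data,new_preds)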

How should meta data and image embeddings be augmented properly? Ideally you would take the noise scale for the predictions from the std of the base models, and also work out how the image embeddings change when you augment an image and then encode it.

Also, at present I am augmenting the categorical data with random noise, which isn't the best thing to do, but it serves for illustration purposes.
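One way to avoid perturbing the one-hot columns is to multiply the noise by a 0/1 mask. The sketch below is plain NumPy and is not what the notebook runs. The function `augment_continuous_only`, the column counts and the toy row `x` are all hypothetical, chosen only to mirror the `cats, nums, embed` column order used for `Xtrain`.

```python
import numpy as np

def augment_continuous_only(x, noise_mask, std, rng):
    """Add Gaussian noise to x only in the columns where noise_mask is 1."""
    noise = rng.normal(loc=0.0, scale=std, size=x.shape)
    return x + noise * noise_mask

# Hypothetical layout: 3 one-hot columns, 1 numeric column, 4 embedding columns.
n_cat, n_num, n_embed = 3, 1, 4
noise_mask = np.concatenate([np.zeros(n_cat), np.ones(n_num + n_embed)])

rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0, 0.0, 0.3, 0.1, 0.2, 0.3, 0.4]])
x_aug = augment_continuous_only(x, noise_mask, std=0.01, rng=rng)

print(x_aug[:, :n_cat])   # one-hot columns are unchanged
print(x_aug[:, n_cat:])   # numeric and embedding columns are perturbed
```

In the notebook the mask would have `cats.shape[1]` zeros at the front. Whether the scaled age column should also be left untouched is a separate judgment call.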

In [None]:
# helper function for loading data
def get_dataset(X,subs,y,augment=True,repeat=True,batch=batch_size):
    ds = tf.data.Dataset.from_tensor_slices(((X,subs), y))
    if repeat:
        ds = ds.repeat()
    if augment:
        ds = ds.map(lambda elem,label: (data_augment(elem),label))
        
    ds = ds.batch(batch)
    return ds

## Build a model that takes meta data, image embeddings and other model predictions as input

In [None]:
dim = Xtrain.shape[1]
meta_dim = Xpreds.shape[1]
def build_model(ls=0.05):
    inp = tf.keras.layers.Input(shape=(dim,))
    x = L.Dropout(0.2)(inp)
    x = L.Dense(int(1024), activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(int(512), activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(int(256), activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(int(128), activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(64, activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(32, activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(16, activation='relu')(x)
    x = L.Dropout(0.3)(x)
    x = L.Dense(12, activation='relu')(x)
    x = L.Dropout(0.3)(x)
    meta_inp = tf.keras.layers.Input(shape=(meta_dim,))
    xm = L.concatenate((x,meta_inp))
    xm = L.Dense(16, activation='relu')(xm)
    xm = L.Dropout(0.2)(xm)
    xm = L.Dense(12, activation='relu')(xm)
    xm = L.Dropout(0.2)(xm)
    xm = L.Dense(8, activation='relu')(xm)
    xm = L.Dropout(0.1)(xm)
    xm = L.Dense(1, activation='sigmoid')(xm)
    model = tf.keras.Model(inputs=(inp,meta_inp), outputs=xm)
    opt = tf.keras.optimizers.Adam(learning_rate=0.00000125* REPLICAS * batch_size)
    loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=ls) 
    model.compile(optimizer=opt,loss=loss,metrics=['AUC'])
    return model

## Train the model

This has been repurposed from [this](https://www.kaggle.com/cdeotte/triple-stratified-kfold-with-tfrecords) notebook.

In [None]:
def get_lr_callback(batch_size=8):
    lr_start   = 0.000005
    lr_max     = 0.00000125 * REPLICAS * batch_size
#     lr_max     = 0.0000125 * REPLICAS * batch_size
    lr_min     = 0.000001
    lr_ramp_ep = 50
    lr_sus_ep  = 0
    lr_decay   = 0.99
   
    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
            
        elif epoch < lr_ramp_ep + lr_sus_ep:
            lr = lr_max
            
        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min
            
        return lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback
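To see what this schedule does, here is a standalone copy of `lrfn` that can be evaluated without Keras. It is a sketch that assumes `REPLICAS=1` and `batch_size=64` (the values set in the hyperparameter cell), so `lr_max` is `0.00000125 * 64`. The learning rate ramps linearly for 50 epochs and then decays geometrically towards `lr_min`.

```python
def lrfn(epoch, lr_start=0.000005, lr_max=0.00000125 * 64, lr_min=0.000001,
         lr_ramp_ep=50, lr_sus_ep=0, lr_decay=0.99):
    # Linear warm-up from lr_start to lr_max.
    if epoch < lr_ramp_ep:
        return (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
    # Optional plateau at lr_max (empty here because lr_sus_ep is 0).
    if epoch < lr_ramp_ep + lr_sus_ep:
        return lr_max
    # Exponential decay towards lr_min.
    return (lr_max - lr_min) * lr_decay ** (epoch - lr_ramp_ep - lr_sus_ep) + lr_min

for epoch in (0, 25, 50, 75, 99):
    print('epoch %3i -> lr = %.3e' % (epoch, lrfn(epoch)))
```

Note that in the training cell the callback is commented out of `callbacks`, so as written the model trains with the constant learning rate set in `build_model`.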

In [None]:
ids = train['index']
labels = train.target.astype('int32')
idsO = test['index']
data_sub = Xtest
# Default strategy for single GPU
strategy = tf.distribute.get_strategy()

# skf = StratifiedKFold(n_splits=FOLDS,shuffle=True,random_state=SEED)
skf = KFold(n_splits=FOLDS,shuffle=True,random_state=SEED)

oof_pred = []; oof_tar = []; oof_val = []; oof_names = []; oof_folds = [] 
preds = np.zeros((test.shape[0],1))
preds_all = np.zeros((test.shape[0],FOLDS))

for fold,(idxT2,idxV2) in enumerate(skf.split(np.arange(15))):

    idxT = train.loc[train.tfrecord.isin(idxT2)].index.values #2020 train
    idxV = train.loc[train.tfrecord.isin(idxV2)].index.values #2020 valid
    
#     X = data_augment(data[idxT]); y = train.target[idxT]
    X = Xtrain[idxT]; y = labels[idxT]; XX=Xpreds[idxT]
    X_val = Xtrain[idxV]; y_val = labels[idxV]; XX_val=Xpreds[idxV]
    
    # BUILD MODEL
    tf.keras.backend.clear_session()
    with strategy.scope():
        model=build_model()

    # SAVE BEST MODEL EACH FOLD
    sv = tf.keras.callbacks.ModelCheckpoint(
        'fold-%i.h5'%fold, monitor='val_loss', verbose=0, save_best_only=True,
        save_weights_only=True, mode='min', save_freq='epoch')
#     sv = tf.keras.callbacks.ModelCheckpoint(
#         'fold-%i.h5'%fold, monitor='val_auc', verbose=0, save_best_only=True,
#         save_weights_only=True, mode='min', save_freq='epoch')

    # Train the model
    history = model.fit(get_dataset(X,XX,y),epochs=EPOCHS,steps_per_epoch=X.shape[0]/batch_size//REPLICAS,
                        verbose=VERBOSE,class_weight=weights, 
#                         callbacks=[sv,get_lr_callback(batch_size=batch_size)],
                        callbacks=[sv],
                       validation_data=get_dataset(X_val,XX_val, y_val,augment=False,repeat=False))

    model.load_weights('fold-%i.h5'%fold)

    # PREDICT OOF USING TTA
    STEPS = TTA*X_val.shape[0]/(batch_size-1)/REPLICAS
    pred = model.predict( get_dataset(X_val,XX_val,y_val), steps=STEPS )[:TTA*X_val.shape[0]]
    oof_pred.append( np.mean(pred.reshape((X_val.shape[0],TTA),order='F'),axis=1) )                

    # GET OOF TARGETS AND NAMES
    oof_tar.append( y_val )
    oof_names.append( ids[idxV] )
    oof_folds.append( np.ones_like(oof_tar[-1],dtype='int8')*fold )

    STEPS = TTA*Xtest.shape[0]/(batch_size-1)/REPLICAS
    psub = model.predict( get_dataset(Xtest,Xtargets,np.zeros(len(Xtest))), steps=STEPS  )[:TTA*Xtest.shape[0]]
    pstore = np.mean(psub.reshape((len(Xtest),TTA),order='F'),axis=1)
    preds[:,0] += pstore*1/FOLDS

    # REPORT RESULTS
    auc = roc_auc_score(oof_tar[-1],oof_pred[-1])
    oof_val.append( np.max( history.history['val_auc'] ) )
    print('#### FOLD %i OOF AUC without TTA = %.3f, with TTA = %.3f'%(fold+1,oof_val[-1],auc))

    # PLOT TRAINING
    if DISPLAY_PLOT:
        plt.figure(figsize=(15,5))
        plt.plot(np.arange(EPOCHS),history.history['auc'],'-o',label='Train AUC',color='#ff7f0e')
        plt.plot(np.arange(EPOCHS),history.history['val_auc'],'-o',label='Val AUC',color='#1f77b4')
        x = np.argmax( history.history['val_auc'] ); y = np.max( history.history['val_auc'] )
        xdist = plt.xlim()[1] - plt.xlim()[0]; ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x,y,s=200,color='#1f77b4'); plt.text(x-0.03*xdist,y-0.13*ydist,'max auc\n%.2f'%y,size=14)
        plt.ylabel('AUC',size=14); plt.xlabel('Epoch',size=14)
        plt.legend(loc=2)
        plt2 = plt.gca().twinx()
        plt2.plot(np.arange(EPOCHS),history.history['loss'],'-o',label='Train Loss',color='#2ca02c')
        plt2.plot(np.arange(EPOCHS),history.history['val_loss'],'-o',label='Val Loss',color='#d62728')
        x = np.argmin( history.history['val_loss'] ); y = np.min( history.history['val_loss'] )
        ydist = plt.ylim()[1] - plt.ylim()[0]
        plt.scatter(x,y,s=200,color='#d62728'); plt.text(x-0.03*xdist,y+0.05*ydist,'min loss',size=14)
        plt.ylabel('Loss',size=14)
        plt.legend(loc=3)
        plt.show()  

# COMPUTE OVERALL OOF AUC
oof_c = np.concatenate(oof_pred); true = np.concatenate(oof_tar);
names = np.concatenate(oof_names); folds = np.concatenate(oof_folds)
auc = roc_auc_score(true,oof_c)
print('Overall OOF AUC with TTA = %.5f'%auc)

dd_oof = pd.DataFrame(dict(image_name = names, target=true, pred = oof_c, fold=folds))

submission = pd.DataFrame(dict(image_name=idsO, target=preds[:,0]))
submission = submission.sort_values('image_name')
submission.to_csv('submission.csv', index=False)
submission.head()

For comparison, a linear blend of the predictions that are passed to this model gets a max OOF AUC of 0.9417.

I would be interested to know what other people think of this kernel, and would appreciate any suggestions for improvements. I only thought of trying this in the last days of the competition and did not have much time to spend on it, so I am sure I missed something.
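The TTA averaging in the training loop relies on `reshape(..., order='F')`, which is easy to misread. Here is a minimal NumPy sketch of why it works. The `pred` array is a made-up stand-in for `model.predict`, and it assumes the dataset is repeated without shuffling, so the predictions arrive one full pass after another.

```python
import numpy as np

n_samples, TTA = 4, 3

# Hypothetical stand-in for model.predict on a repeated, augmented dataset.
# Sample i in pass t "predicts" i + 0.1 * t so the layout is easy to read.
pred = np.array([[i + 0.1 * t] for t in range(TTA) for i in range(n_samples)])

# Column-major reshape puts sample i in row i and pass t in column t.
per_sample = pred.reshape((n_samples, TTA), order='F')
tta_mean = per_sample.mean(axis=1)

print(per_sample)
print(tta_mean)
```

With the default row-major order, each row would instead mix predictions from different samples, and the mean would no longer be a per-image TTA average.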

## Performing the same method with tfrecords

When using this method with a TPU, you don't want to rewrite the tfrecord files every time just to incorporate the model predictions, because you will be updating those predictions constantly. I think this is true more generally any time you have new data.

So when using tfrecords with the above method, you can do the following to read the predictions in during training (the tfrecords contain ```example['image_name']```, which can be used as a key to look up the external data).

```
names = pd.concat((oof_df['image_name'],sub_df['image_name']))
data = np.concatenate((oof_df['pred'],sub_df['target']))

#Make a lookup table for the data
with strategy.scope():
    get_index = tf.lookup.StaticHashTable(
      tf.lookup.KeyValueTensorInitializer(names, np.arange( len(names) )), -1
    )
    METADATA = tf.constant(data)
```
    
Then later you can add

```
query = get_index.lookup(example['image_name'])
meta_models = tf.gather(METADATA, query)
```

(into ```read_labeled_tfrecord``` and ```read_unlabeled_tfrecord``` from AgentAuers' notebook [here](https://www.kaggle.com/agentauers/incredible-tpus-finetune-effnetb0-b6-at-once)) and use ```meta_models``` in the same way that ```meta_inp``` is used in ```build_model``` in this notebook.