## Objective

#### Cross-validation is a resampling procedure used to evaluate machine learning models on a limited dataset. This tutorial shows how to cross-validate a transformer-based model when only a small number of training samples is available.

#### The classification problem is to identify the semantic relationship between two sentences. Given a premise-hypothesis sentence pair, the task is to determine whether the premise `entails` the hypothesis, `contradicts` it, or neither (`neutral`). For more information on the problem, see the [Contradictory, My Dear Watson competition](https://www.kaggle.com/c/contradictory-my-dear-watson/overview).

#### The [dataset](https://www.kaggle.com/c/contradictory-my-dear-watson/data) provided in the competition contains only 12,120 training examples. With so little data, a single train/validation split can give a noisy performance estimate, so *k*-fold cross-validation is used instead. The training data is divided into *k* random parts (folds), and in each of *k* rounds the model is trained on *k-1* parts and validated on the remaining part, so every example is used for validation exactly once.
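
As a rough illustration of the splitting scheme described above, here is a minimal pure-Python sketch. The helper `k_fold_indices` is hypothetical and does plain (unstratified) splitting; the notebook itself relies on scikit-learn's `StratifiedKFold`, which additionally keeps the label proportions similar in every fold.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for plain, unstratified k-fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k nearly equal folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Everything outside the current fold is used for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train on {len(train_idx)} samples, validate on {sorted(val_idx)}")
```

Each sample lands in exactly one validation fold and in the training set of every other round.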

#### **Note**: The model is trained on a Kaggle kernel with a TPU accelerator.

Let's get started!

## Load Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import os
from transformers import AutoTokenizer, AutoConfig, TFAutoModel
from transformers import XLMRobertaConfig, XLMRobertaTokenizer, TFXLMRobertaModel
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.layers import Input, Dropout, Dense, GlobalAveragePooling1D, LayerNormalization
from tqdm import tqdm
import time
import glob
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Resets all state generated by Keras
K.clear_session()

# For reproducibility
np.random.seed(0)

os.environ["WANDB_API_KEY"] = "0" # to silence warning

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Found TPU: ', tpu.master())
except ValueError:
    tpu = None # no TPU available
    strategy = tf.distribute.get_strategy() # for CPU and single GPU
print('Number of replicas:', strategy.num_replicas_in_sync)

## Read the Data Files

In [None]:
train_df = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test_df = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

# check the number of rows and columns in the datasets
print("Training data shape: {}".format(train_df.shape))
print("Test data shape: {}\n".format(test_df.shape))

train_df.head()

## Load and Process the Data into Batches

In [None]:
# Configuration Settings
EPOCHS = 5
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 120
PATIENCE = 1
LEARNING_RATE = 1e-5
AUTO = tf.data.experimental.AUTOTUNE

In [None]:
name = 'jplu/tf-xlm-roberta-large'
PRETRAINED_MODEL_TYPES = {
    'xlmroberta': (XLMRobertaConfig, TFXLMRobertaModel, XLMRobertaTokenizer, name)
}

config_class, model_class, tokenizer_class, model_name = PRETRAINED_MODEL_TYPES['xlmroberta']
# Download vocabulary from huggingface.co and cache.
# tokenizer = tokenizer_class.from_pretrained(model_name) 
tokenizer = AutoTokenizer.from_pretrained(model_name) #fast tokenizer
tokenizer

In [None]:
def encode(df, tokenizer, max_len=50, cross_val=False):
    """Tokenize premise/hypothesis pairs into fixed-length ids and attention masks."""
    pairs = df[['premise', 'hypothesis']].values.tolist() # list of [premise, hypothesis] pairs

    print("Encoding...")
    # padding='max_length' pads every pair to exactly max_len tokens so the result
    # matches the model's fixed input shape; longer pairs are truncated.
    encoded_dict = tokenizer.batch_encode_plus(pairs, max_length=max_len, padding='max_length', truncation=True,
                                               add_special_tokens=True, return_attention_mask=True)
    print("Complete")

    if cross_val:
        # NumPy arrays can be indexed directly with the fold indices
        input_word_ids = np.array(encoded_dict['input_ids']) # shape=[num_examples, max_len]
        input_mask = np.array(encoded_dict['attention_mask']) # shape=[num_examples, max_len]
    else:
        input_word_ids = tf.convert_to_tensor(encoded_dict['input_ids'], dtype=tf.int32) # shape=[num_examples, max_len]
        input_mask = tf.convert_to_tensor(encoded_dict['attention_mask'], dtype=tf.int32) # shape=[num_examples, max_len]

    inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask}

    return inputs
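
To show what the two returned arrays contain, here is a minimal pure-Python sketch of padding, truncation, and attention masks. The helper `pad_and_mask` and the token ids are made up for illustration, and `pad_id=1` assumes the XLM-RoBERTa `<pad>` id. Unlike the real tokenizer, this sketch truncates by simple slicing and does not preserve the closing special token.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = token_ids[:max_len]          # truncation: drop tokens beyond max_len
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_len - len(ids)
    # padding: fill up to max_len; 0 in the mask tells the model to ignore it
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_and_mask([0, 581, 2], max_len=5)
long_ids, long_mask = pad_and_mask([0, 11, 12, 13, 14, 15, 2], max_len=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out with the same length, which is what a model with a fixed input shape requires.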

In [None]:
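def create_dataset(features, labels, batch_size=BATCH_SIZE, validation=False):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    if validation:
        # Evaluation does not depend on order, so the validation set is not shuffled
        dataset = dataset.batch(batch_size).prefetch(AUTO)
    else:
        # features is an (ids, mask) tuple, so len(labels) gives the number of examples.
        # repeat() makes the stream endless, so fit() needs steps_per_epoch.
        dataset = dataset.shuffle(len(labels)).repeat().batch(batch_size).prefetch(AUTO)
    return dataset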
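
Because the training dataset is repeated indefinitely, Keras cannot tell where an epoch ends and has to be given a step count. The pure-Python sketch below mimics that behaviour with `itertools`; the `batches` helper is hypothetical, and shuffling is left out for brevity.

```python
from itertools import cycle, islice

def batches(examples, batch_size):
    """Yield fixed-size batches from an endlessly repeated list of examples."""
    stream = cycle(examples)  # plays the role of dataset.repeat()
    while True:
        yield list(islice(stream, batch_size))

examples = list(range(10))
batch_size = 4
steps_per_epoch = len(examples) // batch_size  # number of full batches per pass

# One "epoch" is simply the first steps_per_epoch batches of the endless stream
epoch = list(islice(batches(examples, batch_size), steps_per_epoch))
# Asking for one more batch wraps around to the start of the data
three_batches = list(islice(batches(examples, batch_size), steps_per_epoch + 1))
print(epoch)
print(three_batches[-1])
```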

In [None]:
train_input = encode(train_df, tokenizer=tokenizer, max_len=MAX_LEN, cross_val=True)
train_ids = train_input['input_word_ids'] # shape=[num_examples, max_len]
train_mask = train_input['input_mask'] # shape=[num_examples, max_len]
train_labels = train_df.label.values

## Build the Model

In [None]:
def build_model(model_name, max_len=50):
    
    tf.random.set_seed(1234)
    
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    
    # The bare XLM-RoBERTa transformer, without any task-specific head on top.
    base_model = model_class.from_pretrained(model_name)
    output = base_model(input_word_ids, attention_mask=input_mask) # output from xlmroberta model
    # Pooled representation of the whole sentence pair
    pooled_output = output.pooler_output # shape: [batch_size, hidden_size]
    
    # Add a classification layer with one unit per label
    output = Dense(units=3, activation="softmax")(pooled_output)
    
    model = tf.keras.Model(inputs=[input_word_ids, input_mask], outputs=output)
    model.compile(tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    
    return model
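
To illustrate what the softmax output and the `sparse_categorical_crossentropy` loss compute for a single example, here is a small pure-Python sketch. The logits are made-up numbers, and both functions are simplified stand-ins for the Keras implementations rather than copies of them.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true class (an integer label)."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]  # hypothetical scores for the three classes
probs = softmax(logits)
loss_correct = sparse_categorical_crossentropy(0, probs)  # true class has the top score
loss_wrong = sparse_categorical_crossentropy(2, probs)    # true class has the lowest score

print([round(p, 3) for p in probs])
print(round(loss_correct, 3), round(loss_wrong, 3))
```

The loss is small when the model puts high probability on the true label and grows as that probability shrinks. "Sparse" means the labels are plain integers rather than one-hot vectors.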

## Train and Cross-Validate the Model

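The number of folds, *k*, affects both how reliable the cross-validation estimate is and how long it takes to compute. A larger *k* leaves more data for training in each round but requires training the model more times; a smaller *k* is cheaper, but each model sees less data.

Note that *k* is chosen by the practitioner, typically through experimentation. This tutorial uses *k*=3.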
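
To see how *k* changes the amount of data per fold, the sketch below works out approximate fold sizes for the 12,120 training examples and the resulting number of training steps per epoch. The helper `fold_sizes` is hypothetical, and the global batch size assumes 8 TPU replicas, which is an assumption; the notebook prints the actual replica count.

```python
N_EXAMPLES = 12120     # training examples in the competition dataset
GLOBAL_BATCH = 16 * 8  # 16 per replica; 8 replicas is an assumption

def fold_sizes(n_examples, k):
    """Return approximate (train_size, val_size) for one round of k-fold."""
    val_size = n_examples // k
    return n_examples - val_size, val_size

for k in (3, 5, 10):
    train_size, val_size = fold_sizes(N_EXAMPLES, k)
    steps = train_size // GLOBAL_BATCH
    print(f"k={k}: train={train_size}, val={val_size}, steps per epoch={steps}, models to train={k}")
```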

In [None]:
# prepare cross validation
folds = 3
kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1)

# for plotting
val_loss_list = []
val_acc_list = []
train_hist_list = []

# enumerate splits
for k, (train_idx, val_idx) in enumerate(kfold.split(train_ids, train_labels)):
    print('Fold {} of {}'.format(k+1, folds))
    print('Train data shape: {}, Validation data shape: {}'.format(len(train_idx), len(val_idx)))
    print('Train indices: {}, Validation indices: {}'.format(train_idx, val_idx))
    
    # Re-initialize the TPU before each fold to free memory held by the previous model
    if tpu is not None:
        tf.tpu.experimental.initialize_tpu_system(tpu)
    
    # Create fresh callbacks for every fold: a reused ModelCheckpoint keeps its best
    # val_loss from earlier folds, and a shared file would be overwritten.
    checkpoint_filepath = 'best_checkpoint_fold{}.hdf5'.format(k+1)
    callbacks = [
        EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE),
        ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True,
                        monitor='val_loss', mode='min', verbose=1),
    ]
    
    training_data = create_dataset((train_ids[train_idx], train_mask[train_idx]), train_labels[train_idx], batch_size=BATCH_SIZE, validation=False)
    validation_data = create_dataset((train_ids[val_idx], train_mask[val_idx]), train_labels[val_idx], batch_size=BATCH_SIZE, validation=True)
    
    # instantiating the model in the strategy scope creates the model on the TPU
    with strategy.scope():
        model = build_model(model_name, MAX_LEN)
        
    # number of full batches in one pass over the training folds
    n_steps = len(train_idx) // BATCH_SIZE
    
    train_history = model.fit(x=training_data, validation_data=validation_data, epochs=EPOCHS, verbose=1, steps_per_epoch=n_steps, callbacks=callbacks)
    
    avg_val_loss = np.mean(train_history.history['val_loss'])
    avg_val_acc = np.mean(train_history.history['val_accuracy'])
    
    print ('Average Validation Loss: {}'.format(avg_val_loss))
    print ('Average Validation Accuracy: {}'.format(avg_val_acc))
    print('#############################################\n')
    
    val_loss_list.append(avg_val_loss)
    val_acc_list.append(avg_val_acc)
    train_hist_list.append(train_history)
    
    del model #free up space
            
    # Resets all state generated by Keras before training a new model in the next fold
    K.clear_session()
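
The training loop stops a fold early once the validation loss stops improving. The pure-Python sketch below shows the general idea of patience-based early stopping; `epochs_run` is a hypothetical helper that mirrors the concept rather than the exact Keras `EarlyStopping` implementation, and the loss values are made up.

```python
def epochs_run(val_losses, patience=1):
    """Return how many epochs are run before early stopping triggers."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after this epoch
    return len(val_losses)

val_losses = [0.80, 0.72, 0.75, 0.70, 0.69]  # hypothetical per-epoch validation losses
stopped_p1 = epochs_run(val_losses, patience=1)
stopped_p2 = epochs_run(val_losses, patience=2)
print(stopped_p1, stopped_p2)
```

With a patience of 1, a single non-improving epoch ends training, even if later epochs would have improved again. A larger patience tolerates such temporary setbacks at the cost of extra training time.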

## Performance Analysis

*k*-fold cross-validation is performed on the model, and the table below shows how it performs on each validation fold when *k*=3. Each value is the mean over that fold's training epochs. Validation performance is fairly stable: across the folds, the average validation loss varies by 0.085 (from 0.721 to 0.806) and the average validation accuracy by 0.074 (from 0.601 to 0.675). Hence we can say that the model behaves consistently across different splits of the training data.

<a id='table'></a>

| Fold | Avg Val Loss | Avg Val Accuracy |
| --- | --- | --- |
| 1 | 0.789 | 0.635 | 
| 2 | 0.721 | 0.675 | 
| 3 | 0.806 | 0.601 |

Finally, calculate the average of each metric over all folds:

In [None]:
mean_val_loss = round(np.mean(val_loss_list), 3)
mean_val_acc = round(np.mean(val_acc_list), 3)

print('Average validation loss over all folds: {}'.format(mean_val_loss))
print('Average validation accuracy over all folds: {}'.format(mean_val_acc))
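
Besides the mean, it can help to report how much the folds disagree. The sketch below summarises the per-fold values from the table above using only the standard library; `summarize` is a hypothetical helper, and with only three folds the standard deviation is a rough indication at best.

```python
from statistics import mean, stdev

# Per-fold values copied from the table above
fold_val_loss = [0.789, 0.721, 0.806]
fold_val_acc = [0.635, 0.675, 0.601]

def summarize(values):
    """Return (mean, sample standard deviation, max - min), rounded to 3 decimals."""
    return (round(mean(values), 3),
            round(stdev(values), 3),
            round(max(values) - min(values), 3))

loss_mean, loss_std, loss_range = summarize(fold_val_loss)
acc_mean, acc_std, acc_range = summarize(fold_val_acc)

print('Validation loss: mean {}, std {}, range {}'.format(loss_mean, loss_std, loss_range))
print('Validation accuracy: mean {}, std {}, range {}'.format(acc_mean, acc_std, acc_range))
```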

That's it!
