# Introduction

This notebook is a step-by-step tutorial on using Transformer models for Natural Language Inference (NLI). It covers how to load, fine-tune, and evaluate M-BERT and XLM-RoBERTa models with TensorFlow.

Natural Language Inference (NLI) is an NLP (Natural Language Processing) task that identifies the semantic relationship between two sentences. Given a premise-hypothesis sentence pair, the task is to determine whether the premise `entails` the hypothesis, `contradicts` it, or neither (`neutral`).

For more information on the problem, visit the [Contradictory, My Dear Watson competition](https://www.kaggle.com/c/contradictory-my-dear-watson/overview) page.

# Load Libraries and Dependencies

In [None]:
import numpy as np 
import pandas as pd 
import os
import sys
from transformers import BertTokenizer, TFBertModel
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer, AutoConfig, TFAutoModel    
from transformers import XLMRobertaConfig, XLMRobertaTokenizer, TFXLMRobertaModel  
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.utils import plot_model
from sklearn import metrics
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from itertools import product


# Handle Warnings: Optional
os.environ["WANDB_API_KEY"] = "0" # to silence warning
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices" # enable xla devices
os.environ["TOKENIZERS_PARALLELISM"] = "false" 

np.random.seed(0) # For reproducibility

# Make sure to install the right version of Python and Tensorflow for reproducible results
print("Python version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

# Configure TPU Settings

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU
print('Number of Replicas:', strategy.num_replicas_in_sync)

# Load CSV Data files with Pandas

The dataset contains train and test files that include premise-hypothesis pairs in fifteen different languages.

The classification of the relationship between the premise and hypothesis statements is as follows:

- label==`0` for `entailment`
- label==`1` for `neutral`
- label==`2` for `contradiction`

See the [competition website](https://www.kaggle.com/c/contradictory-my-dear-watson/data) for more details on the dataset.
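As a quick illustration of the label scheme, the sketch below maps the integer labels to class names. The `LABEL_NAMES` dictionary and the sentence pairs are made up for illustration and are not taken from the dataset.

```python
# Hypothetical mapping from integer labels to class names (follows the list above)
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Made-up (premise, hypothesis, label) triples, one per class
examples = [
    ("A man is playing a guitar on stage.", "A person is making music.", 0),
    ("A man is playing a guitar on stage.", "The man is a famous musician.", 1),
    ("A man is playing a guitar on stage.", "Nobody is on the stage.", 2),
]

for premise, hypothesis, label in examples:
    print("{} | {} -> {}".format(premise, hypothesis, LABEL_NAMES[label]))
```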

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

train.head()

# Data Exploration and Analysis

Explore the data and drop any incomplete rows.

First, count the missing values in each column of the provided training dataset.

In [None]:
# print out stats about data

missing_values_count = train.isnull().sum() # we get the number of missing data points per column
print("Number of missing data points per column:\n")
print (missing_values_count)

Identify any duplicate examples in the dataset.

In [None]:
train["is_duplicate"] = train.duplicated()
train[train["is_duplicate"]].count() # per-column count of rows flagged as duplicates

Drop the duplicated examples from the dataset before splitting the data into train-validation-test subsets. This prevents any accidental test or validation data leakage into the train subset.

In [None]:
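# Drop the helper column first; otherwise it makes duplicated rows look distinct
train.drop("is_duplicate", axis=1, inplace=True)
# Keep the first copy of each duplicated row and remove the repeats
train.drop_duplicates(keep="first", inplace=True, ignore_index=True)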
print("Number of data examples after dropping duplicates: {} \n".format(train.shape[0]))
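The `keep` argument of `drop_duplicates` decides which copies of a repeated row survive. A small sketch on a made-up frame (the `toy` data is hypothetical) shows the difference between `keep="first"` and `keep=False`:

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, row 2 is unique
toy = pd.DataFrame({
    "premise": ["a", "a", "b"],
    "hypothesis": ["x", "x", "y"],
    "label": [0, 0, 1],
})

kept_first = toy.drop_duplicates(keep="first", ignore_index=True)
kept_none = toy.drop_duplicates(keep=False, ignore_index=True)

print(len(kept_first))  # one copy of the repeated row survives
print(len(kept_none))   # every copy of the repeated row is removed
```

With `keep=False` the repeated example disappears from the data entirely, so `keep="first"` is usually the safer choice when the goal is only to remove redundancy.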

Let's look at the distribution of languages in the training set.

In [None]:
train.language.unique()
train.language.value_counts(normalize=True)

We can see that more than half of the training examples are in English, as data resources are abundant in this language. The rest of the data is spread fairly evenly across the other 14 languages.

Let's now visualize the distribution of class labels over the training data.

In [None]:
# check distribution of target classes in the training data
counts = train['label'].value_counts()

class_labels = ['Entailment', 'Neutral', 'Contradiction']

counts_per_class = [counts[0], counts[1], counts[2]]

plt.figure(figsize = (10,10))
plt.pie(counts_per_class, labels = class_labels, autopct = '%1.1f%%')
plt.title("Training Data")
plt.show()

From the chart above, we can see that the training data is fairly balanced over the 3 classes.

# Split the Training Data

We will be splitting the training dataset into two parts - the data we will train the model with and a validation set. We stratify data during train-valid split to preserve the original distribution of the target classes.

In [None]:
from sklearn.model_selection import train_test_split
train, validation = train_test_split(train, stratify=train.label.values,
                                     random_state=42,
                                     test_size=0.2, shuffle=True)


train.reset_index(drop=True, inplace=True)
validation.reset_index(drop=True, inplace=True)

In [None]:
# check the number of rows and columns after split
print("Train data: {} \n".format(train.shape))
print("Validation data: {} \n".format(validation.shape))
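To check that stratification preserved the class balance, one option is to compare the class proportions of the two subsets. The helper below is a minimal sketch; `class_proportions` and the toy label lists are illustrative names, not part of the original notebook.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of examples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Made-up labels standing in for the train and validation label columns
toy_train_labels = [0, 0, 1, 1, 2, 2, 0, 1]
toy_valid_labels = [0, 1, 2, 0]

print(class_proportions(toy_train_labels))
print(class_proportions(toy_valid_labels))
```

In this notebook the same function could be called as `class_proportions(train.label)` and `class_proportions(validation.label)`; with a stratified split the two results should be nearly identical.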

# Implement M-BERT Model

The Multilingual BERT or M-BERT is a single language model pre-trained on the concatenation of monolingual Wikipedia corpora from 104 languages. We will fine-tune this pretrained model on our training dataset to get the predictions for textual entailment recognition.

## Set up M-BERT Tokenizer

A pretrained model only performs properly if we feed it an input that was tokenized with the same rules that were used to tokenize its training data. The BERT multilingual model does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization). Hence we also follow the same rules when tokenizing input data for our task. For more information on data pre-processing, visit [M-BERT github](https://github.com/google-research/bert/blob/master/multilingual.md).

In [None]:
PRETRAINED_MODEL_TYPES = {}
PRETRAINED_MODEL_TYPES['bert'] = (TFBertModel, BertTokenizer, 'bert-base-multilingual-cased')
model_class, tokenizer_class, model_name = PRETRAINED_MODEL_TYPES['bert']

tokenizer = BertTokenizer.from_pretrained(model_name) # Load the slow pretrained tokenizer
save_path = '.'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.save_pretrained(save_path) # Save the tokenizer files (including vocab.txt) locally
tokenizer = BertWordPieceTokenizer(os.path.join(save_path, "vocab.txt"), lowercase=False, strip_accents=False) # Load the fast tokenizer from the saved vocabulary

tokenizer

Let's look at the sequence length distribution (i.e. the number of tokens in a sequence) for the input data. We will need this information later when setting the `max_len` value, since the model requires all the inputs in a batch to have the same length.

In [None]:
def plot(df, tokenizer):
    """
    Plot histogram of lengths of input sequences
    """
    all_text = df.premise.values.tolist() + df.hypothesis.values.tolist() # list of string texts
    all_text_tokenized = tokenizer.encode_batch(all_text) # list of encoding objects
    all_tokenized_len = [len(encoding.tokens) for encoding in all_text_tokenized] # list of token lengths
       
    plt.hist(all_tokenized_len, bins=30, alpha=0.5)
    plt.title(' Histogram of lengths of input sequences')
    plt.xlabel('Number of tokens')
    plt.ylabel('Count')

    plt.show()

plot(train, tokenizer)

From the histogram above, we can see that the majority of the input sequences have fewer than 50 tokens.

We can also calculate the mean and max input sequence lengths per language.

In [None]:
tokenized_premise = tokenizer.encode_batch(train.premise.values.tolist()) # list of encoding objects
train['premise_seq_length'] = [len(encoding.tokens) for encoding in tokenized_premise] # list of lengths
    
tokenized_hypothesis = tokenizer.encode_batch(train.hypothesis.values.tolist()) # list of encoding objects
train['hypothesis_seq_length'] = [len(encoding.tokens) for encoding in tokenized_hypothesis] # list of lengths

# Calculate max and avg sequence length per language
info_per_lang = train.groupby('language').agg({'premise_seq_length': ['mean', 'max', 'count'], 'hypothesis_seq_length': ['mean', 'max', 'count']})
print (info_per_lang)

From the above table, we can see that the premise sentences are longer than the hypothesis sentences for all languages.

Hence, let's visualize the mean sequence length over the languages for the premise inputs.

In [None]:
column_name = info_per_lang.columns.values[0] #premise mean column
info_per_lang[column_name].plot(kind='bar')

`MAX_LEN` should be large enough that few sequences are truncated and little information is lost. However, a very large value increases memory use and training time, because sequences are padded up to this length.

Since most of the inputs are shorter than 50 tokens, we allow about 50 tokens each for the premise and the hypothesis.

Hence we set `MAX_LEN`=100.

*Note*: The `MAX_LEN` hyperparameter can be taken as a parameter to be tuned to get optimal results.
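One way to reason about a candidate `MAX_LEN` is to measure what fraction of sequences would fit without truncation. The sketch below assumes a plain list of token counts; `coverage` and `toy_pair_lengths` are hypothetical names with made-up values.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose token count is at most max_len."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up token counts for premise + hypothesis pairs
toy_pair_lengths = [18, 25, 40, 62, 97, 130, 33, 51]

for candidate in (50, 100, 150):
    print(candidate, coverage(toy_pair_lengths, candidate))
```

A value that covers most of the data while staying small is usually a reasonable starting point for tuning.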

## Configure Hyperparameter Settings

In [None]:
# Configuration
EPOCHS = 3
BATCH_SIZE = 64 
MAX_LEN = 100
PATIENCE = 1
LEARNING_RATE = 1e-5

## Encode Input Sequences

For the BERT model, the input is represented in the following format:

`[CLS]` Premise `[SEP]` Hypothesis `[SEP]`

`[CLS]` and `[SEP]` are special tokens: `[CLS]` is placed at the beginning of a sequence and is used for sentence-level classification, while `[SEP]` separates the two sentences and marks the end of the pair.

We encode the training data by vectorizing the input strings and applying padding and truncation using the `MAX_LEN` value.

The encoded input includes input word IDs, input masks, and input type IDs.

In [None]:
def encode(df, tokenizer, max_len=50):
    """
    Encode the input sequences to feed into the MBERT model. 
    Note that encode_batch() is used as 'BertWordPieceTokenizer' object has no attribute 'batch_encode_plus'
    """
    pairs = df[['premise','hypothesis']].values.tolist()

    tokenizer.enable_truncation(max_len)
    tokenizer.enable_padding(length=max_len) # pad every pair to max_len so it matches the model's input shape
    
    print ("Encoding...")
    enc_list = tokenizer.encode_batch(pairs)
    print ("Complete")
    
    input_word_ids = tf.ragged.constant([enc.ids for enc in enc_list], dtype=tf.int32) # shape=[num_examples, max_len]
    input_mask = tf.ragged.constant([enc.attention_mask for enc in enc_list], dtype=tf.int32) # shape=[num_examples, max_len]
    input_type_ids = tf.ragged.constant([enc.type_ids for enc in enc_list], dtype=tf.int32) # shape=[num_examples, max_len]
   
    inputs = {
        'input_word_ids': input_word_ids.to_tensor(),
        'input_mask': input_mask.to_tensor(),
        'input_type_ids': input_type_ids.to_tensor()}
    
    return inputs 
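To make the three encoded arrays concrete, here is a simplified pure-Python sketch of how one sentence pair becomes word IDs, an attention mask, and type IDs. The tiny vocabulary, the pre-split tokens, and the naive right-side truncation are all assumptions for illustration; the real tokenizer uses WordPiece and its own truncation strategy.

```python
def encode_pair(premise_tokens, hypothesis_tokens, vocab, max_len):
    """Build ids, attention mask and type ids for one [CLS] A [SEP] B [SEP] pair."""
    first = ["[CLS]"] + premise_tokens + ["[SEP]"]
    second = hypothesis_tokens + ["[SEP]"]
    tokens = (first + second)[:max_len]  # naive truncation from the right
    type_ids = ([0] * len(first) + [1] * len(second))[:max_len]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    mask = [1] * len(ids)  # 1 marks a real token
    pad = max_len - len(ids)
    # Padding positions get the [PAD] id, mask 0 and type id 0
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad, type_ids + [0] * pad

# Hypothetical toy vocabulary
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "dogs": 4, "bark": 5, "animals": 6, "make": 7, "noise": 8}

ids, mask, type_ids = encode_pair(["dogs", "bark"], ["animals", "make", "noise"],
                                  toy_vocab, max_len=10)
print(ids)
print(mask)
print(type_ids)
```

The mask tells the model which positions are padding, and the type IDs tell it which tokens belong to the premise (0) and which to the hypothesis (1).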

In [None]:
train_input = encode(train, tokenizer=tokenizer, max_len=MAX_LEN)
train_ids = train_input['input_word_ids'] #[9696, max_len]
train_mask = train_input['input_mask'] #[9696, max_len]
train_type = train_input['input_type_ids'] #[9696, max_len]
train_labels = train.label.values

In [None]:
validation_input = encode(validation, tokenizer=tokenizer, max_len=MAX_LEN)
validation_ids = validation_input['input_word_ids'] #[num_examples, max_len]
validation_mask = validation_input['input_mask'] #[num_examples, max_len]
validation_type = validation_input['input_type_ids'] #[num_examples, max_len]
validation_labels = validation.label.values #[num_examples]

## Load and Process the Data in Batches

In [None]:
def create_dataset(features, labels, batch_size=BATCH_SIZE, validation=False):
    """
    Load and process input data into batches using TF Dataset 
    """
    AUTO = tf.data.experimental.AUTOTUNE
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).shuffle(len(features))
    if validation:
        dataset = dataset.batch(batch_size).prefetch(AUTO)
    else:
        dataset = dataset.repeat().batch(batch_size).prefetch(AUTO)
    return dataset

In [None]:
training_data = create_dataset((train_ids, train_mask, train_type), train_labels, batch_size=BATCH_SIZE)
validation_data = create_dataset((validation_ids, validation_mask, validation_type), validation_labels, batch_size=BATCH_SIZE, validation=True)
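Because the training dataset is repeated indefinitely with `repeat()`, an epoch has no natural end, so the number of batches per epoch has to be passed to `fit` explicitly. The sketch below shows the arithmetic; `steps_per_epoch` is a hypothetical helper, and 9696 is the training-set size noted in the encoding cell comments.

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_remainder=True):
    """Number of batches needed to pass over the data roughly once."""
    if drop_remainder:
        return num_examples // batch_size  # ignore the last partial batch
    return math.ceil(num_examples / batch_size)  # include the last partial batch

print(steps_per_epoch(9696, 64))
print(steps_per_epoch(9696, 64, drop_remainder=False))
```

Rounding down means a few examples are skipped in each epoch, but since the data is shuffled and repeated, different examples are skipped each time.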

## Create and Train Model

We extract the hidden state vector of the 'CLS' token in the final BERT layer and pass that as input to the classification layer for further training.

In [None]:
def build_model(model_name, model_class, max_len=50, add_input_type=False):
    """
    Define the model architecture
    """
    
    tf.random.set_seed(123) # For reproducibility
    
    # Load the bare pretrained encoder (M-BERT or XLM-RoBERTa); it outputs raw hidden states without any task-specific head on top.
    encoder = model_class.from_pretrained(model_name)
#     encoder = TFAutoModel.from_pretrained(model_name)
    
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_type_ids")
    
    # Extract final layer feature vectors
    if add_input_type:
        features = encoder([input_word_ids, input_mask, input_type_ids])[0] # shape=(batch_size, max_len, output_size)
    else:
        features = encoder([input_word_ids, input_mask])[0] # shape=(batch_size, max_len, output_size)
    
    # We pass the vector of only the [cls] token (at index=0) to the classification layer
    sequence = features[:,0,:] #shape=(batch_size, output_size)
   
    # Add a classification layer
    output = tf.keras.layers.Dense(3, activation="softmax")(sequence)  
    
    if add_input_type:
        model = tf.keras.Model(inputs=[input_word_ids, input_mask, input_type_ids], outputs=output)
    else:
        model = tf.keras.Model(inputs=[input_word_ids, input_mask], outputs=output)
        
    model.compile(tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
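The slicing step `features[:,0,:]` and the softmax classification layer can be illustrated with NumPy on a made-up tensor. The sizes and the random weight matrix below are assumptions for illustration only; they are not the real encoder outputs or trained weights.

```python
import numpy as np

batch_size, max_len, hidden_size = 2, 5, 4  # made-up sizes
features = np.arange(batch_size * max_len * hidden_size, dtype=np.float32)
features = features.reshape(batch_size, max_len, hidden_size)

# Keep only the vector at position 0 (the classification token) of every sequence
cls_vectors = features[:, 0, :]
print(cls_vectors.shape)

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Stand-in for the Dense(3) layer: random weights, no bias
rng = np.random.default_rng(0)
weights = rng.normal(size=(hidden_size, 3))
probs = softmax(cls_vectors @ weights)
print(probs.shape)
```

Each row of `probs` holds three class probabilities that sum to 1, one row per example in the batch.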

In [None]:
# instantiating the model in the strategy scope creates the model on the TPU
with strategy.scope():
    model = build_model(model_name, model_class, MAX_LEN, add_input_type=True)
    model.summary()

The model is trained on the training subset, and early stopping on the validation loss is applied to avoid overfitting. Training runs for at most `EPOCHS` epochs, and the weights with the lowest validation loss are saved as the best checkpoint.

In [None]:
checkpoint_filepath='bert_best_checkpoint.hdf5'
# callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE), ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True, monitor='val_accuracy', mode='max', verbose=1)]
callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE), ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True, monitor='val_loss', mode='min', verbose=1)]

n_steps = int(train_ids.shape[0]/BATCH_SIZE)
train_history = model.fit(x=training_data, validation_data=validation_data, epochs=EPOCHS, verbose=1, steps_per_epoch=n_steps, callbacks=callbacks)
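The effect of the `patience` setting can be sketched in plain Python. This is a simplified model of the patience logic, not the Keras implementation, and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best loss), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

toy_val_losses = [0.95, 0.80, 0.85, 0.78]  # made-up values
print(early_stopping_epoch(toy_val_losses, patience=1))
print(early_stopping_epoch(toy_val_losses, patience=2))
```

With `patience=1` training stops at the first epoch that fails to improve, which can miss a later improvement; a larger patience trades extra training time for a chance to recover.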

## Plot Training and Validation Losses over all Epochs

In [None]:
def plot_loss(history):
    ''' Plot loss history '''
    plt.plot(history.history['loss'], label='train loss')
    plt.plot(history.history['val_loss'], label='validation loss')
    plt.title('Average Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

In [None]:
plot_loss(train_history)

## Plot Training and Validation Accuracies over all Epochs

In [None]:
def plot_acc(history):
    ''' Plot accuracy history '''
    plt.plot(history.history['accuracy'], label='train accuracy')
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
    plt.title('Average Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

In [None]:
plot_acc(train_history)

From the above plots, we can see a large gap between the training and validation losses, which suggests that the M-BERT model does not generalize well to unseen data. The M-BERT model reaches a final validation accuracy of around 64-66%. In the next section, we'll look at another model, XLM-RoBERTa, which improves the validation accuracy and predicts better on new data.

In [None]:
del model #to free up space

In [None]:
# Resets all state generated by Keras
K.clear_session()

# Implement XLM-RoBERTa Model

XLM-RoBERTa is based on Facebook’s RoBERTa model, released in 2019. It is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data.

## Set up the Tokenizer

In [None]:
PRETRAINED_MODEL_TYPES['xlmroberta'] = (TFXLMRobertaModel, XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-large')
model_class, tokenizer_class, model_name = PRETRAINED_MODEL_TYPES['xlmroberta']

# Download vocabulary from huggingface.co and cache.
# tokenizer = tokenizer_class.from_pretrained(model_name) 
tokenizer = AutoTokenizer.from_pretrained(model_name) #fast tokenizer

tokenizer

## Encode Input Sequences

The encoded input will include - input word IDs and input masks, as required by the XLM-RoBERTa model.

In [None]:
def encode(df, tokenizer, max_len=50):
    """
    Encode the input sequences to feed into the XLM-Roberta model
    """
    pairs = df[['premise','hypothesis']].values.tolist() #shape=[num_examples]
    
    print ("Encoding...")
    # Pad every pair to max_len (not just to the longest pair) so the shape matches the model's input layer
    encoded_dict = tokenizer.batch_encode_plus(pairs, max_length=max_len, padding='max_length', truncation=True, 
                                               add_special_tokens=True, return_attention_mask=True)
    print ("Complete")
    
    input_word_ids = tf.convert_to_tensor(encoded_dict['input_ids'], dtype=tf.int32) #shape=[num_examples, max_len]
    input_mask = tf.convert_to_tensor(encoded_dict['attention_mask'], dtype=tf.int32) #shape=[num_examples, max_len]
    
    inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask}    
    
    return inputs

We will use the same train-validation split and hyperparameter settings as in the previous BERT model for results to be comparable.

In [None]:
train_input = encode(train, tokenizer=tokenizer, max_len=MAX_LEN)
train_ids = train_input['input_word_ids'] #[9696, max_len]
train_mask = train_input['input_mask'] #[9696, max_len]
train_labels = train.label.values

In [None]:
validation_input = encode(validation, tokenizer=tokenizer, max_len=MAX_LEN)
validation_ids = validation_input['input_word_ids'] #[num_examples, max_len]
validation_mask = validation_input['input_mask'] #[num_examples, max_len]
validation_labels = validation.label.values #[num_examples]

## Load and Process Data into Batches

In [None]:
training_data = create_dataset((train_ids, train_mask), train_labels, batch_size=BATCH_SIZE, validation=False)
validation_data = create_dataset((validation_ids, validation_mask), validation_labels, batch_size=BATCH_SIZE, validation=True)

## Create and Train Model

In [None]:
# instantiating the model in the strategy scope creates the model on the TPU
with strategy.scope():
    model = build_model(model_name, model_class, MAX_LEN)
    model.summary()

In [None]:
plot_model(model, to_file='model.png', show_shapes=True)

Let's now train the model on the training data and monitor its performance on the validation data.

In [None]:
checkpoint_filepath='xlmroberta_best_checkpoint.hdf5'

# callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE), ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True, monitor='val_accuracy', mode='max', verbose=1)]
callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE), ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True, monitor='val_loss', mode='min', verbose=1)]

n_steps = int(train_ids.shape[0]/BATCH_SIZE)
train_history = model.fit(x=training_data, validation_data=validation_data, epochs=EPOCHS, verbose=1, steps_per_epoch=n_steps, callbacks=callbacks)

It's always useful to visualize the loss and accuracy history for training and validation sets.

## Plot Loss against Epochs

In [None]:
plot_loss(train_history)

## Plot Accuracy against Epochs

In [None]:
plot_acc(train_history)

From the above plots, we can see that the gap between the training and validation losses is much smaller than with M-BERT, and the best validation accuracy with the XLM-RoBERTa model is ~79-80%.

To further evaluate the performance of the XLM-RoBERTa model, we'll generate the confusion matrix and classification report on the validation data.

In [None]:
# The function plot_confusion_matrix() is from scikit-learn’s website to plot the confusion matrix for multiclass. 
# link: https://scikit-learn.org/0.18/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                        normalize=False,
                        title='Confusion matrix',
                        cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = np.around(cm.astype('float') / cm.sum(axis=1)[:, np.newaxis], 2)
#         print("Normalized confusion matrix")
#     else:
#         print('Confusion matrix, without normalization')

#     print(cm)

    thresh = cm.max() / 2.
    for i, j in product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


validation_predictions = [np.argmax(i) for i in model.predict(validation_input)] #predictions
validation_labels = validation.label.values.tolist() #ground truth labels

cm_plot_labels = ['entailment','neutral', 'contradiction']
cm = confusion_matrix(y_true=validation_labels, y_pred=validation_predictions)
plot_confusion_matrix(cm=cm, classes=cm_plot_labels, title='Confusion Matrix Without Normalization')
# plot_confusion_matrix(cm=cm, classes=cm_plot_labels, title='Confusion Matrix With Normalization', normalize=True)

target_class = ['entailment' if label==0 else 'neutral' if label==1 else 'contradiction' for label in validation_labels]
prediction_class = ['entailment' if label==0 else 'neutral' if label==1 else 'contradiction' for label in validation_predictions]
print('\nClassification Report')
print(classification_report(y_true=target_class, y_pred=prediction_class))
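The precision and recall values in the classification report can be derived directly from the confusion matrix. The sketch below uses a made-up 3x3 matrix with the scikit-learn convention (rows are true classes, columns are predicted classes); `per_class_precision_recall` is a hypothetical helper.

```python
def per_class_precision_recall(cm):
    """cm[i][j] = number of examples with true class i predicted as class j."""
    n = len(cm)
    results = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results

# Made-up counts for entailment, neutral, contradiction
toy_cm = [[8, 1, 1],
          [2, 6, 2],
          [0, 3, 7]]

for name, (precision, recall) in zip(["entailment", "neutral", "contradiction"],
                                     per_class_precision_recall(toy_cm)):
    print(name, precision, recall)
```

Precision uses the column total (everything predicted as a class), while recall uses the row total (everything that truly belongs to it).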

It would also be interesting to look at the number of correct predictions per language.

In [None]:
def accuracy(x):
    """ Return the accuracy (in %) for a (language, total, correct) tuple """
    return round(100 * x[2] / x[1], 2)

validation['predictions'] = validation_predictions

# Calculate the total number of examples per language
lang_counts = validation.language.value_counts().sort_index()

# Calculate the number of correct predictions per language (0 for a language with no correct prediction)
tp_per_lang = validation[validation['label'] == validation['predictions']].language.value_counts().reindex(lang_counts.index, fill_value=0)

lang_names = lang_counts.index.tolist()
lang_tuples = list(zip(lang_names, lang_counts.values.tolist(), tp_per_lang.values.tolist()))
acc = map(accuracy, lang_tuples)
for i, score in enumerate(acc):
    print ("Accuracy of {} is {} ".format(lang_tuples[i][0], score))
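The same per-language accuracy can also be computed in a single pass over the rows. The version below is a plain-Python sketch on made-up data; `accuracy_per_group` is a hypothetical helper, and it reports a language even when none of its predictions are correct.

```python
from collections import defaultdict

def accuracy_per_group(groups, labels, predictions):
    """Return {group: accuracy in %} for parallel sequences of groups, labels and predictions."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, label, pred in zip(groups, labels, predictions):
        totals[group] += 1
        correct[group] += int(label == pred)
    return {g: round(100 * correct[g] / totals[g], 2) for g in sorted(totals)}

# Made-up languages, labels and predictions
toy_langs = ["English", "English", "French", "French", "Thai"]
toy_labels = [0, 1, 2, 0, 1]
toy_preds = [0, 2, 2, 0, 0]

print(accuracy_per_group(toy_langs, toy_labels, toy_preds))
```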

# Generate Predictions on Test Data

Once we are satisfied with our model's performance, we can get test-data predictions for submission.

In [None]:
# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)

In [None]:
#encode the test-input sequences and get predictions
test_input = encode(test, tokenizer=tokenizer, max_len=MAX_LEN)
predictions = [np.argmax(i) for i in model.predict(test_input)]
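The per-row `np.argmax` in the list comprehension can also be written as one vectorized call. A minimal sketch on made-up probabilities (the `probs` array stands in for the output of `model.predict`):

```python
import numpy as np

# Made-up class probabilities for three examples
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# Index of the highest probability in each row
predicted = np.argmax(probs, axis=1)
print(predicted.tolist())
```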

# Submit the Predictions

In [None]:
submission = test.id.copy().to_frame()
submission['prediction'] = predictions

submission.head()

In [None]:
submission.to_csv("submission.csv", index = False)

That's it! The submission file has been created. For more information on how to submit to the competition, please visit the following [link](https://www.kaggle.com/c/contradictory-my-dear-watson/overview/evaluation).


