I've been playing around with Kaggle's TPUs for a while now, but have only used them for computer vision tasks. This looks like a good opportunity to try them out on a language task. Let's begin by importing the libraries.

I looked at a number of notebooks to get started with transformers but this [one](https://www.kaggle.com/xhlulu/contradictory-watson-concise-keras-xlm-r-on-tpu) from xhlulu was the greatest source of inspiration for my notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel
from tqdm.notebook import tqdm

In [None]:
!pip install nlp
import nlp

Unlike a GPU, a TPU needs to be found and set up before the notebook can use it. Specifically, a "strategy" needs to be defined that controls how the model is replicated across the eight cores of the TPU board and how the gradients computed by these replicas are combined at each training step. This piece of code finds a TPU and sets up this strategy. If no TPU is available it falls back to the default strategy, which runs on a GPU or the CPU.

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy() 

print("REPLICAS: ", strategy.num_replicas_in_sync)

To easily tweak the parameters of the model I have put them as globals at the top of the notebook.

You'll notice that the batch size is multiplied by the number of replicas (8 on a TPU). The batch size given to the data pipeline is the global batch size, which is split evenly across the replicas. The multiplication makes sure each of the eight TPU cores processes the specified batch size and not one eighth of that number.

In [None]:
MODEL_NAME = 'jplu/tf-xlm-roberta-large'
EPOCHS = 10
MAX_LEN = 80
RATE = 1e-5

BATCH_SIZE = 64 * strategy.num_replicas_in_sync
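
The arithmetic behind the global batch size can be sketched in plain Python. This is only an illustration: `per_replica_batch_size` is a hypothetical helper, and the replica count of 8 is assumed to match what `strategy.num_replicas_in_sync` reports on a TPU.

```python
# Hypothetical helper showing how a global batch is divided across replicas.
def per_replica_batch_size(global_batch_size, num_replicas):
    if global_batch_size % num_replicas != 0:
        raise ValueError("global batch size should divide evenly across replicas")
    return global_batch_size // num_replicas


NUM_REPLICAS = 8                   # assumed value of strategy.num_replicas_in_sync
GLOBAL_BATCH = 64 * NUM_REPLICAS   # same formula as BATCH_SIZE above

print(GLOBAL_BATCH)                                        # 512
print(per_replica_batch_size(GLOBAL_BATCH, NUM_REPLICAS))  # 64
print(per_replica_batch_size(64, NUM_REPLICAS))            # 8
```

Without the multiplication, each core would only see 8 examples per step.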

## Load data

Next load the data and have a look at it.

In [None]:
train = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
test = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')
submission = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/sample_submission.csv')

In [None]:
train.info()

In [None]:
train.head()

So the data is quite straightforward. Each example contains two sentences (a premise and a hypothesis) and a class telling us whether the hypothesis follows from the premise (entailment), contradicts it (contradiction) or is neither supported nor contradicted by it (neutral). So the model needs to take in two inputs (the two sentences) and return one of three classes.

We don't need all the columns, so I'll keep only the premise, the hypothesis and the label.

In [None]:
train = train[['premise', 'hypothesis', 'label']]
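
To make the three classes concrete, here is a small sketch that maps label ids to names. The id order (0 = entailment, 1 = neutral, 2 = contradiction) follows the order given later in this notebook. The sentences are made up for illustration and are not taken from the dataset, and `LABEL_NAMES` and `describe` are hypothetical names.

```python
# Label ids and their meaning (order as listed later in this notebook).
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def describe(premise, hypothesis, label):
    """Return a readable one-line summary of a labelled sentence pair."""
    return f"{LABEL_NAMES[label]}: {premise!r} -> {hypothesis!r}"


# Made-up examples, one per class.
examples = [
    ("A man is playing a guitar.", "A person is making music.", 0),
    ("A man is playing a guitar.", "The man is a professional musician.", 1),
    ("A man is playing a guitar.", "Nobody is playing an instrument.", 2),
]

for premise, hypothesis, label in examples:
    print(describe(premise, hypothesis, label))
```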

### Augment dataset

I've learnt recently that many data scientists do well in Kaggle competitions because they augment the training data with extra data they have found. This competition has about 12k examples, which isn't a huge volume of training data. While a decent model can be trained with this, we could do better with more examples. Luckily [Hugging Face](https://huggingface.co/) (the people who produced the transformers library) have released a library called [nlp](https://huggingface.co/datasets) that contains a bunch of good datasets. Kudos to Yih-Dar SHIEH whose [notebook](https://www.kaggle.com/yihdarshieh/more-nli-datasets-hugging-face-nlp-library#Datasets) I used to learn how to use the nlp library.

I'll start by downloading the [Multi-Genre NLI Corpus](https://cims.nyu.edu/~sbowman/multinli/) dataset.

In [None]:
multigenre_data = nlp.load_dataset(path='glue', name='mnli')

nlp datasets can be converted into pandas dataframes using the code below.

In [None]:
# Collect each column of the MNLI training split into a plain Python list
premise = []
hypothesis = []
label = []

for example in multigenre_data['train']:
    premise.append(example['premise'])
    hypothesis.append(example['hypothesis'])
    label.append(example['label'])

In [None]:
multigenre_df = pd.DataFrame(data={
    'premise': premise,
    'hypothesis': hypothesis,
    'label': label
})

In [None]:
multigenre_df.head()

### Add another dataset

The nlp library has another dataset that could be added called the [Stanford Natural Language Inference Corpus](https://huggingface.co/datasets/snli). Let's load it in the same way.

In [None]:
stanford_data = nlp.load_dataset(path='snli')

In [None]:
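premise = []
hypothesis = []
label = []

for example in stanford_data['train']:
    # SNLI marks examples that have no gold label with -1. Skip them so
    # every label is one of the three classes (0, 1, 2).
    if example['label'] == -1:
        continue
    premise.append(example['premise'])
    hypothesis.append(example['hypothesis'])
    label.append(example['label'])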

In [None]:
stanford_df = pd.DataFrame(data={
    'premise': premise,
    'hypothesis': hypothesis,
    'label': label
})
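
Both conversions follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not the code this notebook ran: `to_columns` is a hypothetical function, the records are toy data, and filtering on `valid_labels` is an assumption added to guard against examples without a usable label.

```python
def to_columns(examples, keys=("premise", "hypothesis", "label"), valid_labels=(0, 1, 2)):
    """Turn an iterable of dict-like examples into a dict of column lists."""
    columns = {key: [] for key in keys}
    for example in examples:
        if example["label"] not in valid_labels:
            continue  # drop examples whose label is not one of the three classes
        for key in keys:
            columns[key].append(example[key])
    return columns


# Toy records standing in for a dataset split.
toy_split = [
    {"premise": "p1", "hypothesis": "h1", "label": 0},
    {"premise": "p2", "hypothesis": "h2", "label": -1},
    {"premise": "p3", "hypothesis": "h3", "label": 2},
]

columns = to_columns(toy_split)
print(columns["label"])  # [0, 2]
```

The resulting dict has the same shape as the `data=` argument passed to `pd.DataFrame` in the cells above.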

### Merge data into one dataframe

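Concatenate the datasets into one dataframe.

In [None]:
# ignore_index gives the combined frame a fresh 0..n-1 index instead of repeating each source's index
train = pd.concat([train, multigenre_df, stanford_df], ignore_index=True)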

And take a look at how many examples we now have.

In [None]:
train.info()

## Encode training data

Like all machine learning models, a language model works with numbers, not text. To prepare the sentences for training they need to be tokenised. The tokens are integer indexes that represent words or pieces of words. Each model has its own vocabulary of tokens. Let's get the tokeniser for this model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Then get all the sentences from the datasets and convert them from strings to arrays of tokens.

In [None]:
train_text = train[['premise', 'hypothesis']].values.tolist()
test_text = test[['premise', 'hypothesis']].values.tolist()

In [None]:
train_encoded = tokenizer.batch_encode_plus(
    train_text,
    pad_to_max_length=True,
    max_length=MAX_LEN
)

In [None]:
test_encoded = tokenizer.batch_encode_plus(
    test_text,
    pad_to_max_length=True,
    max_length=MAX_LEN
)

So let's have a look at what the tokeniser has done to the sentences. Let's pick one from the dataset and have a look at it in its textual form.

In [None]:
train.premise.values[0]

And now let's have a look at the first few tokens of the sentence in its tokenised form.

In [None]:
print(train_encoded.input_ids[0][0:14])

So the sentence has been split into an array where each word is represented by an integer index. The tokeniser even splits some words up into sub-words. The word "formulating", for example, is split into the sub-words "formula" and "ting". Let's have a look at some of the words that the tokeniser has a token for. The `get_vocab` method can be used for this.

Note that token 0 is `<s>`, which marks the start of the sequence.

In [None]:
vocab = tokenizer.get_vocab()

print(vocab['<s>'])
print(vocab['▁and'])
print(vocab['▁these'])
print(vocab['▁comments'])
print(vocab['▁were'])
print(vocab['▁considered'])
print(vocab['▁in'])
print(vocab['▁formula'])
print(vocab['ting'])
print(vocab['▁the'])
print(vocab['▁inter'])
print(vocab['im'])
print(vocab['▁rules'])
print(vocab['.'])
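
To give a feel for how a word ends up as several sub-word pieces, here is a toy splitter. It is only a sketch: it uses greedy longest-match against a tiny hand-written vocabulary, which is not the algorithm the real tokeniser uses, and `greedy_split` and `TOY_VOCAB` are hypothetical names. The "▁" prefix marks a piece that starts a new word, as in the vocabulary entries above.

```python
def greedy_split(word, vocab):
    """Split a word into vocabulary pieces, always taking the longest match first."""
    pieces = []
    rest = "▁" + word  # mark the start of a new word
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            raise ValueError(f"cannot split {rest!r} with this vocabulary")
    return pieces


# Tiny hand-written vocabulary, for illustration only.
TOY_VOCAB = {"▁formula", "ting", "▁inter", "im", "▁rules", "▁the"}

print(greedy_split("formulating", TOY_VOCAB))  # ['▁formula', 'ting']
print(greedy_split("interim", TOY_VOCAB))      # ['▁inter', 'im']
```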

When a model has two inputs (like a premise and a hypothesis) the tokeniser merges the tokens from the two sentences into one array. The `</s>` token is used to denote the end of the premise and the beginning of the hypothesis. In this example the premise tokens end at index 13, and from index 14 onwards come the `</s>` separator and the tokens of the hypothesis sentence.

In [None]:
train.hypothesis.values[0]

In [None]:
print(train_encoded.input_ids[0][14:32])

In [None]:
print(vocab['</s>'])
print(vocab['▁The'])
print(vocab['▁rules'])
print(vocab['▁developed'])
print(vocab['▁in'])
print(vocab['▁the'])
print(vocab['▁inter'])
print(vocab['im'])

You may have noticed that the tokeniser returns another output per example besides the input ids. The input ids are the token arrays that were explored above.

In [None]:
train_encoded.keys()

The attention mask shows where the real tokens are in each sequence (every sequence was padded to the same length, `MAX_LEN`). The ones mark real tokens while the zeros mark padding. The padding doesn't hold any meaningful information, so this mask lets a model focus only on the tokens that carry meaning. Note that this notebook only passes the input ids to the model, so the mask is shown here for illustration.

In [None]:
print(train_encoded.attention_mask[0][0:35])
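
The layout of a padded sentence pair and its attention mask can be sketched with plain lists. This is an illustration under assumptions: it uses token strings instead of ids, `layout_pair` is a hypothetical helper, and the `<s> A </s></s> B </s>` pattern reflects my understanding of how XLM-RoBERTa style tokenisers join a pair.

```python
def layout_pair(premise_tokens, hypothesis_tokens, max_len):
    """Join two token lists with special tokens, pad to max_len, build the mask."""
    tokens = ["<s>"] + premise_tokens + ["</s>", "</s>"] + hypothesis_tokens + ["</s>"]
    if len(tokens) > max_len:
        raise ValueError("pair is longer than max_len (truncation is not sketched here)")
    padding = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * padding  # 1 = real token, 0 = padding
    tokens = tokens + ["<pad>"] * padding
    return tokens, attention_mask


tokens, attention_mask = layout_pair(
    ["▁and", "▁these", "▁comments"],
    ["▁The", "▁rules"],
    max_len=12,
)
print(tokens)
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```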

## Train, validation, test split

Now split the training dataset into training and validation sets.

In [None]:
x_train, x_valid, y_train, y_valid = train_test_split(
    train_encoded['input_ids'], 
    train.label.values, 
    test_size=0.2, 
    random_state=2020
)

In [None]:
x_test = test_encoded['input_ids']

## Pipeline

When using TensorFlow and TPUs it is best to build a data pipeline using TensorFlow's `tf.data` API. This produces better performance during training.

The pipeline is reasonably straightforward. Insert the data using the `from_tensor_slices` method, shuffle it, batch it and prefetch the next batch while the model is training on the current batch.

In [None]:
auto = tf.data.experimental.AUTOTUNE

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(auto)
)

In [None]:
valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(auto)
)

In [None]:
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
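
What repeat, shuffle and batch do to the training stream can be approximated in plain Python. This is a rough sketch of the behaviour, not how `tf.data` is implemented: `shuffled_batches` is a hypothetical generator, and the buffer logic (fill a buffer, pick a random element, replace it with the next item from the stream) is my simplified reading of the shuffle buffer.

```python
import itertools
import random


def shuffled_batches(examples, batch_size, buffer_size, seed=0):
    """Yield batches forever from a repeated, buffer-shuffled stream."""
    rng = random.Random(seed)
    stream = itertools.cycle(examples)                    # like .repeat()
    buffer = list(itertools.islice(stream, buffer_size))  # fill the shuffle buffer
    while True:
        batch = []
        while len(batch) < batch_size:                    # like .batch(batch_size)
            i = rng.randrange(len(buffer))
            batch.append(buffer[i])                       # pick a random buffered item
            buffer[i] = next(stream)                      # refill from the stream
        yield batch


batches = shuffled_batches(list(range(10)), batch_size=4, buffer_size=5)
first_three = list(itertools.islice(batches, 3))
print(first_three)
```

Because the stream never ends, something else has to say where an epoch stops, which is the job of `steps_per_epoch` when the model is fitted.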

## Model

If you have trained computer vision models you may have built models with "backbones". These are pre-trained models whose weights can be generalised to a new task. Stick some extra layers (the head) on the end of the model to handle the new task and you have a model that benefits from cutting-edge pre-training but is still built for the task at hand.

I don't know if the terminology of backbones and heads applies to language models, but I'm going to build a model with this in mind. I'll be using XLM-RoBERTa as the backbone and a softmax layer on the end to assign the correct class (entailment, neutral, contradiction or 0, 1, 2).

BERT (the original language transformer model that models like RoBERTa are based on) is quite a complex model. If you'd like to understand how it works check out this [notebook](https://www.kaggle.com/abhinand05/bert-for-humans-tutorial-baseline). It does an awesome job of explaining BERT.

First I'll load the XLM-RoBERTa backbone. If you're new to TPUs in Kaggle, the strategy scope here relates to the TPU setup earlier in the notebook. As we load the model the strategy will replicate it across the eight cores of the TPU board.

In [None]:
with strategy.scope():
    backbone = TFAutoModel.from_pretrained(MODEL_NAME)

Then take the backbone's output for the first token (`<s>`) and apply the softmax layer that produces the class.

In [None]:
with strategy.scope():
    x_input = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")

    x = backbone(x_input)[0]

    x = x[:, 0, :]

    x = tf.keras.layers.Dense(3, activation='softmax')(x)

    model = tf.keras.models.Model(inputs=x_input, outputs=x)
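
The head can be traced by hand for a single example. The sketch below uses plain Python lists and made-up numbers: the hidden size of 2, the weights and the bias are all hypothetical and only stand in for what the `Dense(3, activation='softmax')` layer would learn.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Toy backbone output for one example: 4 token positions, hidden size 2.
sequence_output = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.3], [0.0, 0.9]]
first_token = sequence_output[0]  # what x[:, 0, :] selects for each example

# Hypothetical dense layer: weights of shape (hidden size, 3 classes) plus a bias.
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 0.1, 0.0]

logits = [
    sum(first_token[i] * weights[i][j] for i in range(len(first_token))) + bias[j]
    for j in range(3)
]
probabilities = softmax(logits)
print(probabilities)
```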

Compile the model.

In [None]:
model.compile(
    tf.keras.optimizers.Adam(learning_rate=RATE),  # `lr` is a deprecated alias
    loss='sparse_categorical_crossentropy', 
    metrics=['accuracy']
)

And take a look at what it looks like.

In [None]:
model.summary()

## Train

With the pipeline and model ready to go we can begin training.

In [None]:
steps = len(x_train) // BATCH_SIZE

history = model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=EPOCHS,
    steps_per_epoch=steps,
)
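
The number of steps per epoch is a simple floor division, which can be checked by hand. The example count below is hypothetical (it is not the size of the merged dataset), and `steps_per_epoch` is a name introduced for this sketch.

```python
def steps_per_epoch(num_examples, global_batch_size):
    """Number of full batches that fit into one pass over the data."""
    return num_examples // global_batch_size


num_examples = 100_000   # hypothetical training set size
global_batch = 64 * 8    # batch size per replica times eight replicas

steps = steps_per_epoch(num_examples, global_batch)
leftover = num_examples - steps * global_batch

print(steps)     # 195
print(leftover)  # 160
```

Since the training dataset repeats, the leftover examples are not lost. They simply end up in the batches of the following epoch.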

## Evaluate

Let's see a summary of how the model did.

In [None]:
fig, ax = plt.subplots(2, 2, figsize=(15, 5))

ax[0,0].set_title('Train Loss')
ax[0,0].plot(history.history['loss'])

ax[0,1].set_title('Train Accuracy')
ax[0,1].plot(history.history['accuracy'])

ax[1,0].set_title('Val Loss')
ax[1,0].plot(history.history['val_loss'])

ax[1,1].set_title('Val Accuracy')
ax[1,1].plot(history.history['val_accuracy'])
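
Besides plotting, the same history dictionary can be used to pick out the best epoch. This is a minimal sketch with made-up numbers standing in for `history.history`. The values are not results from this notebook.

```python
# Made-up metrics standing in for history.history.
history_dict = {
    "val_loss": [0.52, 0.41, 0.38, 0.40, 0.45],
    "val_accuracy": [0.80, 0.85, 0.87, 0.86, 0.85],
}

val_loss = history_dict["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)  # index of lowest val loss

print(best_epoch + 1)                            # epochs are 1-based in the Keras log
print(history_dict["val_accuracy"][best_epoch])  # accuracy at that epoch
```

A validation loss that starts rising while the training loss keeps falling is the usual sign of overfitting.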

## Make predictions

Finally use the model to make predictions against the test set.

In [None]:
test_preds = model.predict(test_dataset, verbose=1)
submission['prediction'] = test_preds.argmax(axis=1)
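
The `argmax(axis=1)` call picks, for each row of probabilities, the index of the largest value. A plain Python sketch of the same idea, using made-up probabilities and a hypothetical `argmax_rows` helper:

```python
def argmax_rows(rows):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Made-up softmax outputs for three test examples.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

print(argmax_rows(probabilities))  # [0, 2, 1]
```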

In [None]:
submission.head()

And write the test results to file.

In [None]:
submission.to_csv('submission.csv', index=False)