<a href="https://colab.research.google.com/github/tdisheng/Deep-Learning-Projects/blob/main/XLM_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Problem Description

If you have two sentences, there are three ways they could be related: one could entail the other, one could contradict the other, or they could be unrelated. Natural Language Inference (NLI) is a popular NLP problem that involves determining how a pair of sentences, a premise and a hypothesis, are related.

The task is to create an NLI model that assigns labels of 0, 1, or 2 (corresponding to entailment, neutral, and contradiction) to pairs of premises and hypotheses. To make things more interesting, the train and test sets include text in fifteen different languages!
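
The label convention described above can be written down as a small lookup table. This is a minimal sketch: the `LABEL_NAMES` dictionary, the `describe` helper, and the example sentences are illustrative and are not taken from the dataset.

```python
# Label ids as described above: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical premise/hypothesis pairs, one per class (not from the dataset).
examples = [
    ("A man is playing a guitar on stage.", "A person is making music.", 0),
    ("A man is playing a guitar on stage.", "The concert is sold out.", 1),
    ("A man is playing a guitar on stage.", "Nobody is on the stage.", 2),
]

def describe(premise, hypothesis, label):
    """Render one labelled pair as a readable string."""
    return f"{premise!r} -> {hypothesis!r}: {LABEL_NAMES[label]}"

for premise, hypothesis, label in examples:
    print(describe(premise, hypothesis, label))
```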

We will look at a model suited to this task, the XLM-RoBERTa model released by Facebook, and at how much difference the amount of available training data makes.

We chose XLM-RoBERTa over M-BERT because a single XLM-RoBERTa model performs better across multiple languages, as shown below:

![picture](https://miro.medium.com/max/650/1*7X3Ov4jasOA_OhzjDQkSkw.png)

In [None]:
!nvidia-smi

Fri Jul 16 13:29:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Import Modules, Models and Data

In [None]:
!pip install -q --no-cache-dir transformers sentencepiece

from transformers import AdamW, AutoTokenizer, TFAutoModelForSequenceClassification

import json
import pandas as pd
import numpy as np
import time
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, LearningRateScheduler, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import optimizers
from tensorflow.keras.models import Model
import keras

from sklearn.model_selection import train_test_split
!pip install -q datasets==1.2.1
from datasets import load_dataset

In [None]:
xnli = load_dataset('xnli', 'all_languages')

Downloading and preparing dataset xnli/all_languages (download: 461.54 MiB, generated: 1.50 GiB, post-processed: Unknown size, total: 1.95 GiB) to /root/.cache/huggingface/datasets/xnli/all_languages/1.1.0/51ba3a1091acf33fd7c2a54bcbeeee1b1df3ecb127fdca003d31968fa3a1e6a8...


Dataset xnli downloaded and prepared to /root/.cache/huggingface/datasets/xnli/all_languages/1.1.0/51ba3a1091acf33fd7c2a54bcbeeee1b1df3ecb127fdca003d31968fa3a1e6a8. Subsequent calls will reuse this data.


In [None]:
model_name = 'bert-base-multilingual-cased'

tokenizer = AutoTokenizer.from_pretrained(model_name)
bert_model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_name = "joeddav/xlm-roberta-large-xnli"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


In [None]:
!gdown --id "1t72Z6yEdPR0GtA668-THjLH7FnP7IRXu"
!unzip train.csv.zip
!rm train.csv.zip

Downloading...
From: https://drive.google.com/uc?id=1t72Z6yEdPR0GtA668-THjLH7FnP7IRXu
To: /content/train.csv.zip
  0% 0.00/1.29M [00:00<?, ?B/s]100% 1.29M/1.29M [00:00<00:00, 20.6MB/s]
Archive:  train.csv.zip
  inflating: train.csv               


### Prepare data

In [None]:
train_df = pd.read_csv('train.csv')
print(train_df["language"].value_counts())  # examples per language, before the column is dropped
train_df = train_df.drop(["id", "lang_abv", "language"], axis=1)
train_df.head()

English       6870
Chinese        411
Arabic         401
French         390
Swahili        385
Urdu           381
Vietnamese     379
Russian        376
Hindi          374
Greek          372
Thai           371
Spanish        366
Turkish        351
German         351
Bulgarian      342
Name: language, dtype: int64


| | premise | hypothesis | label |
|---|---|---|---|
| 0 | and these comments were considered in formulat... | The rules developed in the interim were put to... | 0 |
| 1 | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | 2 |
| 2 | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | 0 |
| 3 | you know they can't really defend themselves l... | They can't defend themselves because of their ... | 0 |
| 4 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | 1 |
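
The per-language counts printed above add up to 12,120 rows, of which a little over half are English. The short sketch below redoes that arithmetic; the counts are copied from the output above, and the variable names are only illustrative.

```python
# Per-language row counts copied from the output above.
language_counts = {
    "English": 6870, "Chinese": 411, "Arabic": 401, "French": 390,
    "Swahili": 385, "Urdu": 381, "Vietnamese": 379, "Russian": 376,
    "Hindi": 374, "Greek": 372, "Thai": 371, "Spanish": 366,
    "Turkish": 351, "German": 351, "Bulgarian": 342,
}

total_rows = sum(language_counts.values())
english_share = language_counts["English"] / total_rows

print(f"{len(language_counts)} languages, {total_rows} rows in total")
print(f"English share: {english_share:.1%}")
```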


In [None]:
xnli_train_df = pd.DataFrame(xnli["train"][:1200], columns=xnli["train"].features)
xnli_valid_df = pd.DataFrame(xnli["validation"][:100], columns=xnli["validation"].features)
xnli_test_df = pd.DataFrame(xnli["test"][:100], columns=xnli["test"].features)

def expand_xnli(df):
  # Each XNLI row holds all 15 translations: `premise` is a {language: text} dict and
  # `hypothesis` is a dict whose second value is the list of translations. Emit one row per language.
  rows = ([list(df.premise[i].values())[j], list(df.hypothesis[i].values())[1][j], df.label[i]]
          for i in range(len(df)) for j in range(15))
  return pd.DataFrame(rows, columns=["premise", "hypothesis", "label"])

expanded_xnli_train_df = expand_xnli(xnli_train_df)
expanded_xnli_valid_df = expand_xnli(xnli_valid_df)
expanded_xnli_test_df = expand_xnli(xnli_test_df)

train_df, valid_df = train_test_split(train_df, test_size=0.1)

combined_train_df = pd.concat([train_df, expanded_xnli_train_df]).sample(frac=1)
combined_valid_df = pd.concat([valid_df, expanded_xnli_valid_df]).sample(frac=1)
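
The expansion in the cell above turns each multilingual XNLI row into one premise/hypothesis pair per language, so the 1,200 training rows sliced there become 18,000 pairs across 15 languages. The sketch below shows the same idea on a toy row in plain Python. The row contents, the three-language subset, and the `expand_row` helper are made up for illustration; the field layout is assumed to mirror what the cell above indexes into.

```python
# Toy stand-in for one multilingual row (hypothetical sentences, 3 languages only).
row = {
    "premise": {"en": "The cat sleeps.", "fr": "Le chat dort.", "de": "Die Katze schläft."},
    "hypothesis": {
        "language": ["en", "fr", "de"],
        "translation": ["An animal rests.", "Un animal se repose.", "Ein Tier ruht."],
    },
    "label": 0,
}

def expand_row(row):
    """Turn one multilingual row into one (premise, hypothesis, label) tuple per language."""
    premises = list(row["premise"].values())
    hypotheses = row["hypothesis"]["translation"]
    # Pairing by position assumes both fields list the languages in the same order.
    return [(p, h, row["label"]) for p, h in zip(premises, hypotheses)]

expanded = expand_row(row)
for item in expanded:
    print(item)
```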

In [None]:
def get_encoding(dataframe):
  tokenized_data = tokenizer(text=list(dataframe["premise"]), text_pair=list(dataframe["hypothesis"]),
                             max_length=128,
                             padding="max_length",
                             truncation = True,
                             return_attention_mask = True, # 1 for real tokens, 0 for padding
                             add_special_tokens = True,
                             return_tensors='tf')
  
  return tokenized_data

tokenized_train = get_encoding(train_df)
tokenized_valid = get_encoding(valid_df)
tokenized_test = get_encoding(expanded_xnli_test_df)
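
To make the `max_length`, `padding`, `truncation`, and attention-mask settings concrete, here is a simplified sketch in plain Python. It is not the tokenizer's implementation: it ignores special tokens and sentence-pair handling, and `pad_or_truncate`, the token ids, and `pad_id=0` are placeholders chosen for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sequence comes out with the same length, which is what allows the encodings to be stacked into fixed-shape tensors.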

In [None]:
batch_size=16
auto = tf.data.experimental.AUTOTUNE

train_dataset = tf.data.Dataset.from_tensor_slices(((tokenized_train.input_ids,tokenized_train.attention_mask),train_df.label))
train_dataset = train_dataset.shuffle(2048).repeat().batch(batch_size).prefetch(auto)

valid_dataset = tf.data.Dataset.from_tensor_slices(((tokenized_valid.input_ids,tokenized_valid.attention_mask),valid_df.label))
valid_dataset = valid_dataset.shuffle(2048).repeat().batch(batch_size).prefetch(auto)

test_dataset = tf.data.Dataset.from_tensor_slices(((tokenized_test.input_ids,tokenized_test.attention_mask),expanded_xnli_test_df.label))
test_dataset = test_dataset.batch(batch_size)  # evaluate() expects batched (inputs, label) pairs

### Evaluate results with no training for M-BERT

In [None]:
opt = tf.keras.optimizers.Adam(1e-5)

bert_model.compile(optimizer=opt,
                   loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                   metrics=['accuracy'])

# M-BERT has its own vocabulary, so encode the test set with its own tokenizer rather than the XLM-RoBERTa one.
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
bert_test = bert_tokenizer(text=list(expanded_xnli_test_df["premise"]), text_pair=list(expanded_xnli_test_df["hypothesis"]), max_length=128, padding="max_length", truncation=True, return_tensors='tf')
bert_test_dataset = tf.data.Dataset.from_tensor_slices((dict(bert_test), expanded_xnli_test_df.label)).batch(batch_size)

evaluation = bert_model.evaluate(bert_test_dataset)

### Evaluate results with no training for XLM-RoBERTa

In [None]:
opt = tf.keras.optimizers.Adam(1e-5)

model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

evaluation = model.evaluate(test_dataset)
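
Evaluating a pretrained checkpoint without further training only gives a meaningful accuracy if the checkpoint's class ids mean the same thing as the dataset's (0 = entailment, 1 = neutral, 2 = contradiction). Hugging Face models record their own order in `model.config.id2label`, so it is worth comparing the two. The sketch below shows one way to remap predictions when the orders differ; the `checkpoint_id2label` mapping is a hypothetical example rather than a value read from the actual checkpoint, and `build_remap` is an illustrative helper.

```python
# Label order used by the dataset in this notebook (see the problem description).
DATASET_LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

# HYPOTHETICAL checkpoint mapping, for illustration only; read the real one
# from `model.config.id2label` before relying on it.
checkpoint_id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}

def build_remap(id2label, label2id):
    """Map each checkpoint class id to the dataset's id for the same class name."""
    return {ckpt_id: label2id[name.lower()] for ckpt_id, name in id2label.items()}

remap = build_remap(checkpoint_id2label, DATASET_LABEL2ID)
checkpoint_predictions = [2, 0, 1, 2]   # e.g. argmax over the model's logits
dataset_predictions = [remap[p] for p in checkpoint_predictions]
print(remap)
print(dataset_predictions)
```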



### Dataset with 10,000+ rows for training

In [None]:
!mkdir checkpoints/
!mkdir checkpoints/10000

In [None]:
opt = tf.keras.optimizers.Adam(1e-5)

model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

filepath="checkpoints/10000/epochs:{epoch:02d}-val_acc:{val_accuracy:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max', save_weights_only=True)
earlystop = EarlyStopping(monitor='val_accuracy', restore_best_weights=True, patience=3)

callbacks_list = [checkpoint, earlystop]
steps_per_epoch = len(train_df) // batch_size
validation_steps = len(valid_df) // batch_size
history = model.fit(train_dataset,
                    steps_per_epoch=steps_per_epoch,
                    epochs=10,
                    validation_data=valid_dataset,
                    validation_steps=validation_steps,
                    callbacks=callbacks_list)

Epoch 1/10
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.

Epoch 00001: val_accuracy improved from -inf to 0.92833, saving model to checkpoints/10000/epochs:01-val_acc:0.928.hdf5
Epoch 2/10

Epoch 00002: val_accuracy did not improve from 0.92833
Epoch 3/10

Epoch 00003: val_accuracy did not improve from 0.92833
Epoch 4/10

Epoch 00004: val_accuracy did not improve from 0.92833
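
Because both datasets call `.repeat()`, Keras needs `steps_per_epoch` and `validation_steps` to know where an epoch ends. The arithmetic below reproduces those values for this run as a rough check. It assumes the 12,120 rows counted earlier and that the 10% hold-out is rounded up; neither number is read from the actual dataframes.

```python
import math

total_rows = 12120        # sum of the per-language counts shown earlier
valid_fraction = 0.1      # test_size passed to train_test_split
batch_size = 16

# Assumption: the held-out size is the ceiling of fraction * rows.
n_valid = math.ceil(total_rows * valid_fraction)
n_train = total_rows - n_valid

steps_per_epoch = n_train // batch_size
validation_steps = n_valid // batch_size
validated_examples = validation_steps * batch_size

print(steps_per_epoch, validation_steps, validated_examples)
# 1114 correct out of 1200 would reproduce the logged 0.92833.
print(round(1114 / validated_examples, 5))
```

Under these assumptions each validation pass scores 1,200 examples, which is consistent with the logged accuracies being multiples of 1/1200.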


In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
!mv checkpoints/10000/"epochs:01-val_acc:0.928.hdf5" gdrive/MyDrive/"Colab Notebooks"/"epochs:01-val_acc:0.928.hdf5"

In [None]:
!ls checkpoints/10000

epochs:01-val_acc:0.928.hdf5
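
The checkpoint file name comes from the `filepath` template, which Keras fills in with the epoch number and the logged metrics using ordinary Python string formatting. A minimal sketch of that substitution, using the values from the first logged epoch:

```python
# Template copied from the training cell above.
filepath = "checkpoints/10000/epochs:{epoch:02d}-val_acc:{val_accuracy:.3f}.hdf5"

# Keyword names must match the placeholders in the template.
filled = filepath.format(epoch=1, val_accuracy=0.92833)
print(filled)
```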


### Evaluate results with 10,000+ rows for training

In [None]:
evaluation = model.evaluate(test_dataset)



### Dataset with 100,000+ rows for training

In [None]:
tokenized_train = get_encoding(expanded_xnli_train_df)
tokenized_valid = get_encoding(expanded_xnli_valid_df)

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices(((tokenized_train.input_ids,tokenized_train.attention_mask),expanded_xnli_train_df.label))
train_dataset = train_dataset.shuffle(2048).repeat().batch(batch_size).prefetch(auto)

valid_dataset = tf.data.Dataset.from_tensor_slices(((tokenized_valid.input_ids,tokenized_valid.attention_mask),expanded_xnli_valid_df.label))
valid_dataset = valid_dataset.shuffle(2048).repeat().batch(batch_size).prefetch(auto)

In [None]:
!mkdir checkpoints/100000/

In [None]:
opt = tf.keras.optimizers.Adam(1e-5)

model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

filepath="checkpoints/100000/epochs:{epoch:02d}-val_acc:{val_accuracy:.3f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max', save_weights_only=True)
earlystop = EarlyStopping(monitor='val_accuracy', restore_best_weights=True, patience=3)

callbacks_list = [checkpoint, earlystop]
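# Size the epochs from the XNLI-based frames that feed train_dataset and valid_dataset in this run.
steps_per_epoch = len(expanded_xnli_train_df) // batch_size
validation_steps = len(expanded_xnli_valid_df) // batch_size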
history = model.fit(train_dataset,
                    steps_per_epoch=steps_per_epoch,
                    epochs=10,
                    validation_data=valid_dataset,
                    validation_steps=validation_steps,
                    callbacks=callbacks_list)

Epoch 1/10

Epoch 00001: val_accuracy improved from -inf to 0.93083, saving model to checkpoints/100000/epochs:01-val_acc:0.931.hdf5
Epoch 2/10

Epoch 00002: val_accuracy improved from 0.93083 to 0.93917, saving model to checkpoints/100000/epochs:02-val_acc:0.939.hdf5
Epoch 3/10

Epoch 00003: val_accuracy did not improve from 0.93917
Epoch 4/10

Epoch 00004: val_accuracy did not improve from 0.93917
Epoch 5/10

Epoch 00005: val_accuracy did not improve from 0.93917
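
Both runs stop well before the 10 requested epochs because of `EarlyStopping(patience=3)`: training ends once three epochs in a row fail to beat the best `val_accuracy`. The sketch below is a simplified model of that rule, not Keras's implementation. The best values are taken from the logs, while the `0.90` entries are placeholders, since the logs only report that those epochs did not improve.

```python
def epochs_run(val_accuracies, patience):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, waited = acc, 0       # an improvement resets the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch            # stop after this epoch
    return len(val_accuracies)

# Placeholder 0.90 values stand in for the unreported non-improving epochs.
run_10k = [0.92833] + [0.90] * 9
run_100k = [0.93083, 0.93917] + [0.90] * 8

print(epochs_run(run_10k, patience=3))
print(epochs_run(run_100k, patience=3))
```

With these inputs the rule stops after epoch 4 and epoch 5 respectively, matching where the two logs end.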


In [None]:
evaluation = model.evaluate(test_dataset)



### Conclusion

Increasing the number of training examples for each language made little difference: the best validation accuracy was 0.928 for the 10,000+ row run and 0.939 for the larger run, and the two figures were measured on different validation sets. A likely reason is that the XLM-RoBERTa checkpoint used here had already been trained on the whole XNLI dataset, albeit for only one epoch, so observing a larger difference may require more training examples.