# Assignment 4: Multilingual BERT and Zero-Shot Transfer (10 Marks)

## Due: 31 March 2022

Welcome to the fourth and final assignment of the course. In this assignment we will learn how to fine-tune a multilingual BERT (mBERT) model on a Natural Language Inference task, [XNLI](https://arxiv.org/abs/1809.05053). We will fine-tune the model on English training data and then evaluate the fine-tuned model on different languages, demonstrating the zero-shot transfer capabilities of mBERT.

In [1]:
!nvidia-smi

Thu Mar 24 06:38:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    data_dir = "gdrive/MyDrive/PlakshaNLP/Assignment4/data/xnli"
except ImportError:
    # Not running on Colab: fall back to a local copy of the data
    data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/source/Assignment4/data/xnli"

Mounted at /content/gdrive


In [3]:
# Install required libraries
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install matplotlib
!pip install transformers
!pip install seaborn scikit-learn

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 8.3 MB/s 
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
     |████████████████████████████████| 6.5 MB 63.0 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 80.4 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 68.4 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 7.0 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found ex

In [4]:
# We start by importing libraries that we will be making use of in the assignment.
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import copy
import tqdm

## XNLI: Task Description

XNLI is a multilingual benchmark for Natural Language Inference (NLI). Its training data is available in English and was obtained from the popular [MNLI](https://cims.nyu.edu/~sbowman/multinli/) dataset, while its dev and test sets are available in 15 different languages. In NLI, we are given two sentences, a premise and a hypothesis, and the task is to predict whether the hypothesis i) is entailed by the premise, ii) contradicts the premise, or iii) is neutral to the premise.

<img src="https://i.ibb.co/bd4P20K/nli-examples.jpg" alt="nli-examples" border="0">

This makes NLI a multi-class classification task where we want to predict the correct label out of three possible classes. We start by loading the dataset into memory. The training set in XNLI is comparatively large, with around 400k examples, which can lead to long training times. Hence, for the purpose of this assignment, we will work with a fraction of the full data, i.e. ~40k examples.

In [5]:

def load_xnli_dataset(lang, split = "train"):
    filename = os.path.join(data_dir, f"{split}-{lang}.tsv")
    sentence1s = []
    sentence2s = []
    labels = []
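    # The files contain text in many scripts, so read them explicitly as UTF-8
    with open(filename, encoding="utf-8") as f:
        for i,line in enumerate(f):
            # Skip the first line (header row)
            if i == 0:
                continue
            row = line.split("\t")
            sentence1 = row[0]
            sentence2 = row[1]
            # Drop the trailing newline from the last column
            label = row[2].split("\n")[0]
            sentence1s.append(sentence1)
            sentence2s.append(sentence2)
            labels.append(label)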
    
    return pd.DataFrame({
        "premise": sentence1s,
        "hypothesis" : sentence2s,
        "label" : labels
    })
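
The loader above reads a file whose first line is skipped and whose remaining lines each hold three tab-separated fields: premise, hypothesis and label. A minimal sketch of that parsing on an in-memory string follows. The `parse_xnli_tsv` helper, the header names and the sample rows are hypothetical, made up for illustration, and are not taken from the actual data files.

```python
import io

def parse_xnli_tsv(handle):
    """Parse a header line followed by tab-separated (premise, hypothesis, label) rows."""
    rows = []
    for i, line in enumerate(handle):
        if i == 0:  # skip the header line
            continue
        premise, hypothesis, label = line.rstrip("\n").split("\t")
        rows.append((premise, hypothesis, label))
    return rows

# Hypothetical file content (assumed header names and made-up examples)
sample = io.StringIO(
    "premise\thypothesis\tlabel\n"
    "A man is asleep.\tA man is awake.\tcontradiction\n"
    "Two dogs run.\tAnimals are moving.\tentailment\n"
)

rows = parse_xnli_tsv(sample)
for premise, hypothesis, label in rows:
    print(f"{label:>13}: {premise} / {hypothesis}")
```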

In [6]:
# Load training data in English
train_en_data = load_xnli_dataset("en", "train")[:40000]

# Like in the last assignment, we will split the training data to get some validation examples as well
train_en_data, val_en_data = train_test_split(train_en_data, test_size=0.05)

print(f"Number of examples in training data: {len(train_en_data)}")
print(f"Number of examples in validation data: {len(val_en_data)}")

train_en_data.head()

Number of examples in training data: 38000
Number of examples in validation data: 2000


| | premise | hypothesis | label |
|---|---|---|---|
| 28951 | But the criticism of Java on performance groun... | It is not fair to criticize Java on performanc... | entailment |
| 33322 | In 1985 , for example , RCED was asked how the... | RCED was not asked about the Department of Int... | contradiction |
| 24194 | During the millennium before the Christian era... | Their civilization thrived before the Christia... | entailment |
| 712 | I 've always heard you Revolutionists held lif... | I 've constantly heard that you Revolutionists... | contradiction |
| 28791 | If audacity had successfully carried him so fa... | He had been courageous all the way , and he wa... | neutral |


In [7]:

# Load evaluation data (the "dev" split) in all test languages
test_langs = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"]

lang2test_df = {lang : load_xnli_dataset(lang, "dev") for lang in test_langs}

In [8]:

print(f"Number of Test examples: {len(lang2test_df['en'])}")
lang2test_df["en"].head()

Number of Test examples: 2489


| | premise | hypothesis | label |
|---|---|---|---|
| 0 | And he said, Mama, I'm home. | He didn't say a word. | contradiction |
| 1 | And he said, Mama, I'm home. | He told his mom he had gotten home. | entailment |
| 2 | I didn't know what I was going for or anything... | I have never been to Washington so when I was ... | neutral |
| 3 | I didn't know what I was going for or anything... | I knew exactly what I needed to do as I marche... | contradiction |
| 4 | I didn't know what I was going for or anything... | I was not quite certain what I was going to do... | entailment |


In [9]:
for lang, test_df in lang2test_df.items():
    print(f"{lang} test set:")
    print(test_df.head())
    print("***************************\n")

ar test set:
                                             premise  \
0                        وقال، ماما، لقد عدت للمنزل.   
1                        وقال، ماما، لقد عدت للمنزل.   
2  لم أعرف من أجل ماذا أنا ذاهب أو أي شىْ ، لذلك ...   
3  لم أعرف من أجل ماذا أنا ذاهب أو أي شىْ ، لذلك ...   
4  لم أعرف من أجل ماذا أنا ذاهب أو أي شىْ ، لذلك ...   

                                          hypothesis          label  
0                                  لم ينطق ببنت شفة.  contradiction  
1                        أخبر أمه أنه قد عاد للمنزل.     entailment  
2  لم أذهب إلى واشنطن من قبل، لذا عندما تم تكليفي...        neutral  
3  لقد عرفت بالضبط ما الذي احتجت أن أفعله عندما م...  contradiction  
4  لم أكن متأكدًا مما سأفعله لذلك ذهبت إلى واشنطن...     entailment  
***************************

bg test set:
                                             premise  \
0                      И той каза: Мамо, у дома съм.   
1                      И той каза: Мамо, у дома съм.   
2  Не знаех за какво

## mBERT using HuggingFace's transformers library

mBERT is a multilingual variant of BERT, which is trained on Wikipedia articles in around 100 languages. As with monolingual BERT, the transformers library provides pre-trained models and tokenizers for multilingual BERT. To create an instance of one, we only need to specify `"bert-base-multilingual-cased"` or `"bert-base-multilingual-uncased"` in the `BertTokenizer.from_pretrained` and `BertModel.from_pretrained` methods, and that's it! See the examples below for a demonstration:

In [10]:
from transformers import BertTokenizer, BertModel

In [11]:
mbert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")


In [12]:
mbert_tokenizer.tokenize("thinking machines")

['thinking', 'machines']

In [13]:
mbert_tokenizer.tokenize("maquinas de pensar")

['maquinas', 'de', 'pensar']

In [14]:
mbert_tokenizer.tokenize("सोच मशीन")

['स', '##ो', '##च', 'म', '##शी', '##न']

As you can see, mBERT's tokenizer works on different languages. We can similarly load a pretrained mBERT model and feed it data in different languages.
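
The `##` prefix in the Hindi output marks a piece that continues the previous piece of the same word. BERT tokenizers use WordPiece, which splits each word greedily by longest match against the vocabulary. A minimal sketch of that idea is below; the `wordpiece_tokenize` helper and the tiny `toy_vocab` are hypothetical and are not mBERT's actual vocabulary or the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"think", "##ing", "machine", "##s"}

print(wordpiece_tokenize("thinking", toy_vocab))  # ['think', '##ing']
print(wordpiece_tokenize("machines", toy_vocab))  # ['machine', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

A word that is rare in the training corpus tends to be broken into many short pieces, which is what happens to the Hindi example above.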

In [15]:
mbert_model = BertModel.from_pretrained("bert-base-multilingual-uncased")


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
mbert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(105879, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
         

As you can see, the architecture is identical to the original BERT model. The only difference is the shape of `word_embeddings`, which is 105879 × 768, meaning that mBERT (uncased) supports 105879 unique tokens. In contrast, BERT (uncased) supports 30522 tokens.
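
To get a feel for what the larger vocabulary costs, the short calculation below compares the number of parameters in the word-embedding matrix alone. It uses only the two vocabulary sizes and the hidden size quoted above; the variable names are ours.

```python
hidden_size = 768
mbert_vocab_size = 105879  # mBERT (uncased), from the printed model above
bert_vocab_size = 30522    # BERT (uncased), as stated in the text

# The embedding matrix has one row of `hidden_size` weights per token
mbert_embedding_params = mbert_vocab_size * hidden_size
bert_embedding_params = bert_vocab_size * hidden_size

print(f"mBERT word embeddings: {mbert_embedding_params:,}")  # 81,315,072
print(f"BERT word embeddings:  {bert_embedding_params:,}")   # 23,440,896
print(f"Ratio: {mbert_embedding_params / bert_embedding_params:.2f}x")  # 3.47x
```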

In [17]:
en_sent = "thinking machines"
tokenizer_output = mbert_tokenizer(en_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.0295,  0.0307,  0.0245,  ...,  0.0328, -0.0562,  0.0789],
                                                        [ 0.2827,  0.4737, -0.1378,  ..., -0.1892,  0.1408, -0.3370],
                                                        [ 0.1871,  0.6193,  0.1692,  ...,  0.0987, -0.0519, -0.0380],
                                                        [-0.0354,  0.4805, -0.2533,  ...,  0.7390,  0.1286, -0.6764]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
                                               tensor([[ 7.3179e-02,  1.0201e-01,  1.0094e-01,  9.1636e-02,  1.4111e-01,
                                                         3.3841e-01,  9.0549e-02, -6.1101e-02, -1.3902e-01,  2.4607e-01,
                                                        -1.7

In [18]:
es_sent = "maquinas de pensar"
tokenizer_output = mbert_tokenizer(es_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.0657, -0.0614,  0.0060,  ..., -0.0240, -0.0420, -0.0673],
                                                        [ 0.2430,  0.5252, -0.0694,  ..., -0.0108,  0.5518, -0.4204],
                                                        [-0.2587, -0.2086,  0.1115,  ..., -0.7734,  0.1412, -0.9133],
                                                        [-0.1032,  0.7808,  0.1022,  ...,  0.0870,  0.1456, -0.5586],
                                                        [-0.0765,  0.3234, -0.2513,  ...,  0.3950,  0.2862, -1.1907]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
                                               tensor([[ 1.4176e-01,  3.4401e-02,  1.5948e-01,  1.7631e-01,  1.8552e-01,
                                                         4.2113

In [19]:
hi_sent = "सोच मशीन"
tokenizer_output = mbert_tokenizer(hi_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

BaseModelOutputWithPoolingAndCrossAttentions([('last_hidden_state',
                                               tensor([[[-0.0183,  0.0577,  0.1038,  ..., -0.0349, -0.0897, -0.0034],
                                                        [-0.2319,  0.0178, -0.0228,  ..., -0.4928,  0.1712, -0.3489],
                                                        [-0.1214,  0.0804,  0.2790,  ..., -0.0647, -0.2073, -0.6718],
                                                        ...,
                                                        [-0.0835,  0.4992, -0.1798,  ...,  0.5467, -0.4407,  0.0024],
                                                        [ 0.1764,  0.7496,  0.2607,  ...,  0.2637, -0.2617, -0.2716],
                                                        [-0.3781,  0.5859,  0.2654,  ...,  0.3965, -0.5680, -0.8965]]],
                                                      grad_fn=<NativeLayerNormBackward0>)),
                                              ('pooler_output',
     

Hence, we can very easily use mBERT for generating predictions on texts written in different languages.

## Task 1: Fine-tune mBERT on XNLI

We can now start fine-tuning mBERT on this dataset. We will start by defining the custom `Dataset` class for the task and then define the model and training loop.

## Task 1.1: Custom Dataset Class (2 Marks)

As in the previous assignments, implement the `XNLImBertDataset` class below, which processes and stores the data and provides a way to iterate through the dataset. The details of the various methods in the class are given in their docstrings.

In [20]:
class XNLImBertDataset(Dataset):
    
    def __init__(self, premises,
                 hypotheses,
                 labels,
                 max_length,
                mbert_variant = "bert-base-multilingual-uncased"):
        
        """
        Constructor for the `XNLImBertDataset` class. Stores the `premises`, `hypotheses` and `labels`
        which can then be used by other methods. Also initializes the tokenizer.
        
        Inputs:
            - premises (list) : A list of sentences constituting the premise in each example
            - hypotheses (list) : A list of sentences constituting the hypothesis in each example
            - labels (list) : A list of labels, one for each premise-hypothesis pair.
            - max_length (int): Maximum length of the encoded sequence.
                                If the number of tokens is lower than `max_length`, add padding; otherwise truncate
            - mbert_variant (str): mBERT variant whose tokenizer should be loaded
        
        
        Note that labels are in the form of strings "entailment", "contradiction" and "neutral". For training the
        models we will want the labels in numeric form, so you should define a mapping from the text label
        to a numeric id. You should order the labels alphabetically while defining the mapping, i.e.
        contradiction -> 0, entailment -> 1, neutral -> 2 (so that the mapping is consistent for everyone)
        
        """
        
        self.premises = premises
        self.hypotheses = hypotheses
        self.labels = labels
        self.max_length = max_length
        self.tokenizer = BertTokenizer.from_pretrained(mbert_variant)
        self.label2id = {'contradiction':0, 'entailment':1, 'neutral' : 2}
        
        # self. labels = [self.label2id[label] for label in self.labels]
        
    def __len__(self):
        """
        Returns the length of the dataset
        """
        length = len(self.premises)
        
        return length
    
    def __getitem__(self, idx):
        """
        Returns the features and label corresponding to the `idx`-th entry in the dataset.
        
        Inputs:
            - idx (int): Index of the (sentence pair, label) example to be returned
        
        Returns:
            - input_ids (torch.tensor): Indices of the tokens in the sentence pair.
                                        Shape of the tensor should be (`seq_len`,)
            - mask (torch.tensor): Attention mask with 1 for real tokens and 0 for padding.
                                   Shape of the tensor should be (`seq_len`,)
            - label (int): Label for the premise-hypothesis pair
            
        Hint: We have 2 sentences in a pair which must be concatenated using the [SEP] token before we tokenize and encode them
        
        """
        # Join premise and hypothesis with [SEP], then pad or truncate to `max_length`
        text = self.premises[idx] + "[SEP]" + self.hypotheses[idx]
        tokenized = self.tokenizer(
            text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = tokenized["input_ids"]
        mask = tokenized["attention_mask"]
        label = self.label2id[self.labels[idx]]
        
        # The tokenizer returns tensors of shape (1, seq_len); drop the batch dimension
        return input_ids.squeeze(0), mask.squeeze(0), label
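
The `padding` and `truncation` arguments passed to the tokenizer in `__getitem__` guarantee that every example comes back with exactly `max_length` token ids, together with an attention mask that is 1 for real tokens and 0 for padding. A simplified sketch of that behaviour on plain lists is below. The `pad_or_truncate` helper and the id sequences are hypothetical, and unlike this sketch the real tokenizer keeps the closing `[SEP]` (id 102) when it truncates, as the Hindi sample test case in this notebook shows.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each of length `max_length`."""
    ids = list(token_ids[:max_length])   # truncate if the sequence is too long
    n_pad = max_length - len(ids)        # number of padding positions needed
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask

# Hypothetical id sequences: 101 = [CLS], 102 = [SEP], 0 = [PAD], as in the sample test cases
short_ids, short_mask = pad_or_truncate([101, 143, 10564, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 143, 10564, 119, 102, 10103, 102], max_length=6)

print(short_ids, short_mask)  # [101, 143, 10564, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 143, 10564, 119, 102, 10103] [1, 1, 1, 1, 1, 1]
```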

In [21]:
print("Running Sample Test Cases")
sample_premises = ["A man inspects the uniform of a figure in some East Asian country.",
                    "An older and younger man smiling.",
                   "A soccer game with multiple males playing."
                    ]
sample_hypotheses = ["The man is sleeping.",
                     "Two men are smiling and laughing at the cats playing on the floor.",
                    "Some men are playing a sport."]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 32
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_hypotheses,
    sample_labels,
    sample_max_len
)
print(f"Sample Test Case 1: Checking if `__len__` is implemented correctly")
dataset_len= len(sample_dataset)
expected_len = len(sample_labels)
print(f"Dataset Length: {dataset_len}")
print(f"Expected Length: {expected_len}")
assert len(sample_dataset) == len(sample_premises)
print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 2: Checking if `__getitem__` is implemented correctly for `idx= 0`")
sample_idx = 0
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids =  torch.tensor([  101,   143, 10564, 15450, 84789, 10107, 10103, 38884, 10108,   143,
        16745, 10104, 10970, 11344, 17147, 11913,   119,   102, 10103, 10564,
        10127, 55860,   119,   102,     0,     0,     0,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0])
expected_label = 0
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 3: Checking if `__getitem__` is implemented correctly for `idx= 1`")
sample_idx = 1
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([  101, 10144, 18585, 10110, 24392, 10564, 14965, 64581,   119,   102,
        10536, 10562, 10320, 14965, 64581, 10110, 18418, 82863, 10160, 10103,
        45670, 14734, 10125, 10103, 21005,   119,   102,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0])
expected_label = 2
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")


print(f"Sample Test Case 4: Checking if `__getitem__` is implemented correctly for `idx= 2`")
sample_idx = 2
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([  101,   143, 20071, 11336, 10171, 18248, 19592, 14734,   119,   102,
        10970, 10562, 10320, 14734,   143, 13148,   119,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
expected_label = 1
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")



sample_premises = ["एक आदमी किसी पूर्वी एशियाई देश में एक आकृति की वर्दी का निरीक्षण करता है।",
                    "एक बूढ़ा और छोटा आदमी मुस्कुरा रहा है।",
                   "एक फ़ुटबॉल खेल जिसमें कई पुरुष खेल रहे हैं।"
                    ]
sample_sentence2s = ["आदमी सो रहा है।",
                     "फर्श पर खेल रही बिल्लियों को देखकर दो आदमी मुस्कुरा रहे हैं और हंस रहे हैं।",
                    "कुछ पुरुष कोई खेल खेल रहे हैं।"
                    ]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 36
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_sentence2s,
    sample_labels,
    sample_max_len
)

print(f"Sample Test Case 5: Checking for hindi")
sample_idx = 1
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids =  torch.tensor([  101, 11384,   569, 30119, 10949, 11142, 74535, 10949,   533, 13764,
        25695,   571, 12114, 19086, 10949, 36335,   580,   591,   102,   568,
        11551, 17109, 12334, 56426, 52061,   569, 28393, 41790, 20106, 11483,
        91329, 19086, 29931,   533, 13764,   102])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
expected_label = 2
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")




Running Sample Test Cases
Sample Test Case 1: Checking if `__len__` is implemented correctly
Dataset Length: 3
Expected Length: 3
Sample Test Case Passed!
****************************************

Sample Test Case 2: Checking if `__getitem__` is implemented correctly for `idx= 0`
input_ids:
 tensor([  101,   143, 10564, 15450, 84789, 10107, 10103, 38884, 10108,   143,
        16745, 10104, 10970, 11344, 17147, 11913,   119,   102, 10103, 10564,
        10127, 55860,   119,   102,     0,     0,     0,     0,     0,     0,
            0,     0])
Expected input_ids:
 tensor([  101,   143, 10564, 15450, 84789, 10107, 10103, 38884, 10108,   143,
        16745, 10104, 10970, 11344, 17147, 11913,   119,   102, 10103, 10564,
        10127, 55860,   119,   102,     0,     0,     0,     0,     0,     0,
            0,     0])
mask:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0])
Expected mask:
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Initialize the datasets and dataloaders for the English training and validation sets.

In [22]:
max_seq_len = 128
batch_size = 8

train_en_premises, train_en_hypotheses = train_en_data["premise"].values, train_en_data["hypothesis"].values
train_en_labels = train_en_data["label"].values

val_en_premises, val_en_hypotheses = val_en_data["premise"].values, val_en_data["hypothesis"].values
val_en_labels = val_en_data["label"].values

train_en_dataset = XNLImBertDataset(train_en_premises, train_en_hypotheses, train_en_labels, max_seq_len)
val_en_dataset = XNLImBertDataset(val_en_premises, val_en_hypotheses, val_en_labels, max_seq_len)

# Reshuffle the training examples at the start of every epoch
train_en_dataloader = DataLoader(train_en_dataset, batch_size = batch_size, shuffle = True)
val_en_dataloader = DataLoader(val_en_dataset, batch_size = batch_size)
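
As a rough guide to how long an epoch takes, the number of batches each dataloader yields can be worked out from the dataset sizes printed earlier and the batch size chosen above. This is plain arithmetic, assuming the `DataLoader` default of `drop_last=False`, so a final partial batch would still be counted.

```python
import math

train_examples = 38000  # size of the training split printed earlier
val_examples = 2000     # size of the validation split printed earlier
batch_size = 8

# With drop_last=False, a partial final batch counts as a batch
train_batches = math.ceil(train_examples / batch_size)
val_batches = math.ceil(val_examples / batch_size)

print(f"Training batches per epoch: {train_batches}")  # 4750
print(f"Validation batches:         {val_batches}")    # 250
```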

## Task 1.2: Implement mBERT Based Classifier for NLI (2 Marks)

Similar to the last assignment, implement a classifier with an mBERT module followed by a classification layer. Note that unlike last time we now have 3 classes, so we can no longer use the Sigmoid function in the output layer and will need to use the Softmax function instead. You can refer [here](https://cs231n.github.io/linear-classify/#softmax-classifier) if you need a primer on how the softmax function works. Hence, instead of getting a single output from the model for each input, denoting the probability of the positive class, we will get 3 numbers per input, denoting the probability of each of the 3 classes. It is also common to use the log of the Softmax function instead of plain softmax to obtain log-probabilities. Log-Softmax is numerically more stable and is therefore often preferred in practice. You can read about its usage in PyTorch [here](https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html). Implement the `mBERTNLIClassifierModel` below.
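
To make the stability point concrete, here is a minimal pure-Python sketch of log-softmax for a single example, using the usual trick of subtracting the maximum logit before exponentiating. It is an illustration of the idea and not PyTorch's implementation; the `log_softmax` helper and the logit values are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    # log(sum(exp(x))) computed as m + log(sum(exp(x - m))) to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for the 3 classes
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]

print([round(lp, 4) for lp in log_probs])
print(round(sum(probs), 6))  # the probabilities sum to 1

# A naive softmax would need math.exp(1000.0), which overflows;
# the shifted version handles these logits without trouble.
big_log_probs = log_softmax([1000.0, 1001.0, 1002.0])
print([round(lp, 4) for lp in big_log_probs])
```

Because only differences between logits matter, shifting all of them by the same constant leaves the result unchanged.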

In [23]:

class mBERTNLIClassifierModel(nn.Module):
    
    def __init__(self, d_hidden = 768, mbert_variant = "bert-base-multilingual-uncased"):
        
        """
        Constructor for the `mBERTNLIClassifierModel` class. Use this to define  the network architecture
        which should be: Input -> mBERT -> Linear Layer -> Log-Softmax
        
        Inputs:
            - d_hidden (int): Size of the hidden representations of mbert
            - mbert_variant (str): mBERT variant to use
        
        """
        super(mBERTNLIClassifierModel, self).__init__()
        
        self.mbert_layer = BertModel.from_pretrained(mbert_variant)
        self.output_layer = nn.Linear(d_hidden,3)
        # Normalize over the class dimension (the last one); an implicit dim is deprecated
        self.log_softmax_layer = nn.LogSoftmax(dim=-1)
        

    def forward(self, input_ids, attn_mask):
        
        """
        Forward Passes the inputs through the network and obtains the prediction
        
        Inputs:
            - input_ids (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the sequence of token ids
            - attn_mask (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the attention mask such that padded tokens are 0 and rest 1
                                        
        Returns:
          - output (torch.tensor): A torch tensor of shape [batch_size, 3] containing (log) probabilities
          of each class 
                                                
        """
        # A single forward pass handles the whole batch; `pooled` is the pooled [CLS] representation
        _, pooled = self.mbert_layer(input_ids, attention_mask = attn_mask, return_dict = False)
        output = self.output_layer(pooled)
        output = self.log_softmax_layer(output)
        return output

In [24]:
print(f"Running Sample Test Cases!")
torch.manual_seed(42)
model = mBERTNLIClassifierModel()

sample_premises = ["A man inspects the uniform of a figure in some East Asian country.",
                    "An older and younger man smiling.",
                   "A soccer game with multiple males playing."
                    ]
sample_hypotheses = ["The man is sleeping.",
                     "Two men are smiling and laughing at the cats playing on the floor.",
                    "Some men are playing a sport."]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 32
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_hypotheses,
    sample_labels,
    sample_max_len
)


print("Sample Test Case 1")
sample_idx = 0
input_ids, attn_mask, label = sample_dataset.__getitem__(sample_idx)
mbert_cls_out = model(input_ids.unsqueeze(0), attn_mask.unsqueeze(0)).detach().numpy()
expected_mbert_cls_out = np.array([[-0.9885041, -1.479876,  -0.915788 ]])
print(f"Model Output: {mbert_cls_out }")
print(f"Expected Output: {expected_mbert_cls_out}")

assert mbert_cls_out .shape == expected_mbert_cls_out.shape
assert np.allclose(mbert_cls_out, expected_mbert_cls_out, 1e-4)
print("Test Case Passed! :)")
print("******************************\n")

print("Sample Test Case 2")
sample_idx = 1
input_ids, attn_mask, label = sample_dataset.__getitem__(sample_idx)
mbert_cls_out = model(input_ids.unsqueeze(0), attn_mask.unsqueeze(0)).detach().numpy()
expected_mbert_cls_out = np.array([[-0.97441876, -1.4775381,  -0.9304163 ]])
print(f"Model Output: {mbert_cls_out }")
print(f"Expected Output: {expected_mbert_cls_out}")

assert mbert_cls_out .shape == expected_mbert_cls_out.shape
assert np.allclose(mbert_cls_out, expected_mbert_cls_out, 1e-4)
print("Test Case Passed! :)")
print("******************************\n")


Running Sample Test Cases!


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Sample Test Case 1
Model Output: [[-0.9885042 -1.479876  -0.9157878]]
Expected Output: [[-0.9885041 -1.479876  -0.915788 ]]
Test Case Passed! :)
******************************

Sample Test Case 2
Model Output: [[-0.97441864 -1.477538   -0.9304163 ]]
Expected Output: [[-0.97441876 -1.4775381  -0.9304163 ]]
Test Case Passed! :)
******************************





## Task 1.3: Training and Evaluating the Model (4 Marks)

As in previous assignments, implement the `train` and `evaluate` functions below. There are a couple of differences this time, though. First, we can no longer train the model with Binary Cross Entropy Loss because, as the name suggests, it applies only to binary classification problems. We will use the [Negative Log-Likelihood Loss function](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) instead. Second, when computing accuracy we can no longer use a threshold to convert probabilities into labels, since we now have 3 probability values, one for each class, instead of a single one. In such cases it is common to predict the class with the highest probability (or, equivalently, the highest log probability).
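
To make these two changes concrete, here is a minimal pure-Python sketch (no PyTorch required) of what the negative log-likelihood and the argmax prediction compute on a batch of log-probabilities. The helper names `nll_loss` and `predict` are illustrative and not part of the assignment code. The two rows of log-probabilities are copied from the expected outputs of the sample test cases above, while the gold class indices in `targets` are made up for illustration.

```python
import math

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood: average of -log p(correct class)."""
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

def predict(log_probs):
    """Predicted class = index of the largest (log-)probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in log_probs]

# Log-probabilities over the 3 classes, copied from the sample test cases above.
log_probs = [
    [-0.9885041, -1.479876, -0.915788],
    [-0.97441876, -1.4775381, -0.9304163],
]
# Hypothetical gold class indices, for illustration only.
targets = [2, 0]

# Sanity check: exponentiating each row gives probabilities that sum to 1.
for row in log_probs:
    assert abs(sum(math.exp(v) for v in row) - 1.0) < 1e-4

loss = nll_loss(log_probs, targets)
preds = predict(log_probs)
print(f"loss = {loss:.6f}, predictions = {preds}")
```

`nn.NLLLoss` with its default `reduction='mean'` computes this same batch mean, which is why it expects the classifier to output log-probabilities rather than raw scores.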


In [25]:
def evaluate(model, test_dataloader, device = "cuda"):
    
    """
    Evaluates `model` on test dataset

    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model to be evaluated
        - test_dataloader (torch.utils.data.DataLoader): A dataloader defined over the test dataset
        - device (str): Device to evaluate the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - accuracy (float): Accuracy averaged over the batches of the test dataset 
    """
    
    accuracy = 0
    
    model.eval()
    model = model.to(device)
    
    with torch.no_grad():
        for test_batch in test_dataloader:
            features, masks, labels = test_batch

            features = features.to(device).long()
            masks = masks.to(device).long()
            labels = labels.to(device).long()

            # Class (log-)probabilities from the model, one row per example
            pred_probs = model(features, masks)

            # Predicted class = index of the highest (log-)probability
            predictions = np.argmax(pred_probs.detach().cpu().numpy(), axis = 1)
            labels = labels.detach().cpu().numpy().reshape(-1)

            # Fraction of correct predictions in this batch
            batch_accuracy = float((predictions == labels).mean())
            accuracy += batch_accuracy

        # Divide by number of batches to get average accuracy
        accuracy = accuracy / len(test_dataloader)

    return accuracy
    
    
    
def train(model, train_dataloader, val_dataloader,
          lr = 1e-5, num_epochs = 3,
          device = "cpu"):
    
    """
    Runs the training loop. Define the loss function as NLLLoss
    and optimizer as Adam and train for `num_epochs` epochs.

    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model to be trained
        - train_dataloader (torch.utils.data.DataLoader): A dataloader defined over the training dataset
        - val_dataloader (torch.utils.data.DataLoader): A dataloader defined over the validation dataset
        - lr (float): The learning rate for the optimizer
        - num_epochs (int): Number of epochs to train the model for.
        - device (str): Device to train the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - best_model (mBERTNLIClassifierModel): model corresponding to the highest validation accuracy (checked at the end of each epoch)
        - best_val_accuracy (float): Validation accuracy corresponding to the best epoch
    """
        
    best_val_accuracy = float("-inf")
    best_model = None
    model = model.to(device)

    loss_fn = nn.NLLLoss()
    optimizer = Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        model.train() 
        epoch_loss = 0
        for train_batch in tqdm.tqdm(train_dataloader):
            optimizer.zero_grad()

            features, masks, labels = train_batch

            # Token ids and class labels must be integer (int64) tensors
            features = features.to(device).long()
            masks = masks.to(device).float()
            labels = labels.to(device).long()

            preds = model(features, masks)
            loss = loss_fn(preds, labels)
            loss.backward()

            optimizer.step()
            epoch_loss += loss.item()

        # Average training loss over the batches of this epoch
        epoch_loss = epoch_loss / len(train_dataloader)
        
        print('Evaluating')
        # Evaluate on the same device the model is being trained on
        val_accuracy = evaluate(model, val_dataloader, device = device)
 
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = copy.deepcopy(model) 
            # Only the weights of the copy are needed, not its gradients
            best_model.zero_grad()
        
        print(f"Epoch {epoch} completed | Average Training Loss: {epoch_loss} | Validation Accuracy: {val_accuracy}")

    return best_model, best_val_accuracy
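
Note that `evaluate` averages per-batch accuracies, which weights every batch equally. When the last batch is smaller than the others (100 examples with a batch size of 8 give 12 full batches and one batch of 4), this average can differ slightly from the plain fraction of correctly classified examples. The following pure-Python sketch illustrates the difference on made-up predictions; the helper names `mean_of_batch_accuracies` and `overall_accuracy` are hypothetical and not part of the assignment code.

```python
def mean_of_batch_accuracies(correct_flags, batch_size):
    """Average of per-batch accuracies (every batch gets the same weight)."""
    batches = [correct_flags[i:i + batch_size]
               for i in range(0, len(correct_flags), batch_size)]
    return sum(sum(batch) / len(batch) for batch in batches) / len(batches)

def overall_accuracy(correct_flags):
    """Fraction of correctly classified examples (every example gets the same weight)."""
    return sum(correct_flags) / len(correct_flags)

# Made-up outcome: the first 96 examples are classified correctly (1),
# the 4 examples in the short final batch are all wrong (0).
correct_flags = [1] * 96 + [0] * 4

batch_mean = mean_of_batch_accuracies(correct_flags, batch_size=8)
overall = overall_accuracy(correct_flags)
print(f"mean of batch accuracies: {batch_mean:.4f}")
print(f"overall accuracy:         {overall:.4f}")
```

Here the 4 errors in the short final batch count as a whole batch of errors, so the batch average (12/13) comes out lower than the overall accuracy (96/100). The two measures coincide whenever all batches have the same size.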

In [26]:
torch.manual_seed(42)
print("Training on 100 data points for sanity check")

max_seq_len = 128
batch_size = 8

sample_premises, sample_hypotheses = train_en_data["premise"].values[:100], train_en_data["hypothesis"].values[:100]
sample_labels = train_en_data["label"].values[:100]

sample_dataset = XNLImBertDataset(sample_premises, sample_hypotheses, sample_labels, max_seq_len)
sample_dataloader = DataLoader(sample_dataset, batch_size = batch_size)


model = mBERTNLIClassifierModel()
best_model, best_val_acc = train(model, sample_dataloader, sample_dataloader, lr = 5e-5, num_epochs = 10, device = "cuda")
print(f"Best Validation Accuracy: {best_val_acc}")
print(f"Expected Best Validation Accuracy: {0.99}")

Training on 100 data points for sanity check


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 13/13 [00:04<00:00,  2.63it/s]


Evaluating
Epoch 0 completed | Average Training Loss: 1.1242104401955237 | Validation Accuracy: 0.3557692307692308


100%|██████████| 13/13 [00:04<00:00,  2.70it/s]


Evaluating
Epoch 1 completed | Average Training Loss: 1.0847587218651404 | Validation Accuracy: 0.5576923076923077


100%|██████████| 13/13 [00:04<00:00,  2.69it/s]


Evaluating
Epoch 2 completed | Average Training Loss: 1.0078262870128338 | Validation Accuracy: 0.6730769230769231


100%|██████████| 13/13 [00:04<00:00,  2.70it/s]


Evaluating
Epoch 3 completed | Average Training Loss: 0.9527593484291663 | Validation Accuracy: 0.7115384615384616


100%|██████████| 13/13 [00:04<00:00,  2.69it/s]


Evaluating
Epoch 4 completed | Average Training Loss: 0.7570186532460726 | Validation Accuracy: 0.6346153846153846


100%|██████████| 13/13 [00:04<00:00,  2.67it/s]


Evaluating
Epoch 5 completed | Average Training Loss: 0.7018076869157645 | Validation Accuracy: 0.6346153846153846


100%|██████████| 13/13 [00:04<00:00,  2.70it/s]


Evaluating
Epoch 6 completed | Average Training Loss: 0.7862338102780856 | Validation Accuracy: 0.875


100%|██████████| 13/13 [00:04<00:00,  2.70it/s]


Evaluating
Epoch 7 completed | Average Training Loss: 0.600664672943262 | Validation Accuracy: 0.9711538461538461


100%|██████████| 13/13 [00:04<00:00,  2.69it/s]


Evaluating
Epoch 8 completed | Average Training Loss: 0.3131307191573657 | Validation Accuracy: 0.9230769230769231


100%|██████████| 13/13 [00:04<00:00,  2.69it/s]


Evaluating
Epoch 9 completed | Average Training Loss: 0.1509376414693319 | Validation Accuracy: 1.0
Best Validation Accuracy: 1.0
Expected Best Validation Accuracy: 0.99


Since we trained and evaluated on the same 100 examples, you should expect near-perfect accuracy (around 99%). Now let's train on the entire dataset.

In [26]:
model = mBERTNLIClassifierModel()
best_model, best_val_acc = train(model, train_en_dataloader, val_en_dataloader, lr = 1e-5, num_epochs = 2, device = "cuda")
print(f"Best Validation Accuracy: {best_val_acc}")
print(f"Expected Best Validation Accuracy: {0.7675}")

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 4750/4750 [30:48<00:00,  2.57it/s]


Evaluating
Epoch 0 completed | Average Training Loss: 0.7769625133307356 | Validation Accuracy: 0.7145


100%|██████████| 4750/4750 [30:45<00:00,  2.57it/s]


Evaluating
Epoch 1 completed | Average Training Loss: 0.5599559139703449 | Validation Accuracy: 0.7355
Best Validation Accuracy: 0.7355
Expected Best Validation Accuracy: 0.7675


## Task 1.4: Zero-Shot Transfer (2 Marks)

Pre-trained multilingual models like mBERT have been shown to exhibit zero-shot transfer to new languages on which the model was never fine-tuned. You can read more about zero-shot transfer in mBERT in this [paper](https://arxiv.org/abs/1906.01502). We now test this phenomenon for ourselves by evaluating the mBERT classifier that we just trained on English data on the test sets of 15 different languages. Implement the `evaluate_on_diff_langs` function below to do that.

In [28]:
def evaluate_on_diff_langs(model, lang2test_df, max_length = 128, batch_size = 8, device = "cpu"):
    
    """
    Evaluates the accuracy of the fine-tuned model on test data in different languages.
    
    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model fine-tuned on English data
        - lang2test_df (dict): A dictionary with languages as keys and
                                their corresponding test sets (in form of pandas dataframe)
                                as values
        - max_length (int): Maximum sequence length passed to the dataset
        - batch_size (int): Batch size used for evaluation
        - device (str): Device to evaluate the model on. Can be either 'cuda' (for using gpu) or 'cpu'
                                
    Returns:
        - lang2acc (dict): A dictionary with language ids as keys and the accuracy on its test set as values
                            eg: {"en" : 0.8, "fr" : 0.77, "hi": 0.72, ...}
    
    """
    from tqdm import tqdm
    lang2acc = {}

    for lang, test_df in tqdm(lang2test_df.items()):
        # NOTE: only the first 100 examples of each test set are evaluated here
        sample_premises, sample_hypotheses = test_df["premise"].values[:100], test_df["hypothesis"].values[:100]
        sample_labels = test_df["label"].values[:100]

        # Use the `max_length` argument rather than the global `max_seq_len`
        sample_dataset = XNLImBertDataset(sample_premises, sample_hypotheses, sample_labels, max_length)
        sample_dataloader = DataLoader(sample_dataset, batch_size = batch_size)

        accuracy = evaluate(model, sample_dataloader, device = device)
        lang2acc[lang] = accuracy

    return lang2acc
    

In [29]:
lang2acc = evaluate_on_diff_langs(best_model, lang2test_df, max_length = 128, batch_size = 8, device = "cuda")
expected_vals = {'ar': 0.5989583333333334,
 'bg': 0.6454326923076923,
 'de': 0.6698717948717948,
 'el': 0.6402243589743589,
 'en': 0.7263621794871795,
 'es': 0.6923076923076923,
 'fr': 0.6802884615384616,
 'hi': 0.5893429487179487,
 'ru': 0.6478365384615384,
 'sw': 0.53125,
 'th': 0.35136217948717946,
 'tr': 0.610176282051282,
 'ur': 0.5637019230769231,
 'vi': 0.6193910256410257,
 'zh': 0.6073717948717948}
print(f"Language to Accuracy:\n {lang2acc}")
print(f"Expected Values:\n {expected_vals}")

100%|██████████| 15/15 [01:14<00:00,  4.98s/it]

Language to Accuracy:
 {'ar': 0.6442307692307693, 'bg': 0.7019230769230769, 'de': 0.7211538461538461, 'el': 0.7403846153846154, 'en': 0.8269230769230769, 'es': 0.75, 'fr': 0.7884615384615384, 'hi': 0.5961538461538461, 'ru': 0.6826923076923077, 'sw': 0.5673076923076923, 'th': 0.3076923076923077, 'tr': 0.6730769230769231, 'ur': 0.6057692307692307, 'vi': 0.6538461538461539, 'zh': 0.6538461538461539}
Expected Values:
 {'ar': 0.5989583333333334, 'bg': 0.6454326923076923, 'de': 0.6698717948717948, 'el': 0.6402243589743589, 'en': 0.7263621794871795, 'es': 0.6923076923076923, 'fr': 0.6802884615384616, 'hi': 0.5893429487179487, 'ru': 0.6478365384615384, 'sw': 0.53125, 'th': 0.35136217948717946, 'tr': 0.610176282051282, 'ur': 0.5637019230769231, 'vi': 0.6193910256410257, 'zh': 0.6073717948717948}





Don't worry if the values do not match exactly, but you can expect similar patterns, i.e. the model fine-tuned on English data performs reasonably well on other languages too, compared to its performance on the English test data. Performance on languages like German, French and Spanish is much closer to the performance on English. However, it is on the lower side for languages like Swahili, Urdu and Thai. Most of the values are still surprisingly high, considering that random guessing yields an accuracy of about 33%; Thai is the exception and stays close to that chance level.
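
One way to see these patterns at a glance is to rank the languages by accuracy and report each one's gap to English and its margin over chance. The sketch below does this in plain Python using the `expected_vals` dictionary from the cell above (the same code works on `lang2acc`). The names `chance` and `ranked` are introduced here for illustration only.

```python
# Expected accuracies copied from the cell above.
expected_vals = {'ar': 0.5989583333333334, 'bg': 0.6454326923076923,
                 'de': 0.6698717948717948, 'el': 0.6402243589743589,
                 'en': 0.7263621794871795, 'es': 0.6923076923076923,
                 'fr': 0.6802884615384616, 'hi': 0.5893429487179487,
                 'ru': 0.6478365384615384, 'sw': 0.53125,
                 'th': 0.35136217948717946, 'tr': 0.610176282051282,
                 'ur': 0.5637019230769231, 'vi': 0.6193910256410257,
                 'zh': 0.6073717948717948}

chance = 1 / 3                    # accuracy of a uniform random guess over 3 classes
english = expected_vals["en"]     # reference point: the fine-tuning language

# Sort languages from highest to lowest accuracy.
ranked = sorted(expected_vals.items(), key=lambda kv: kv[1], reverse=True)

for lang, acc in ranked:
    print(f"{lang}: accuracy={acc:.3f} | gap to en={english - acc:.3f} | above chance={acc - chance:.3f}")
```

On the expected values, Spanish, French and German follow English most closely, while Thai ends up last and only marginally above chance.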