# A5: Natural Language Inference using Neural Networks

by Adam Ek, Bill Noble, and others.

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.


In this lab we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H definitely describe something true given P then it is an **entailment**. If H describe something that's *maybe* true given P, it's **neutral**, and if H describe something that's definitely *false* given P it's a **contradiction**. 

## 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. 

There are two options for loading and working with the data.

1. Download the data directly from the [SNLI website](https://nlp.stanford.edu/projects/snli/) and write a dataloader based on your dataloader from **A3: Distributed Representations and Language Models**.
2. Use the `datasets` library to load the version on the [HuggingFace hub](https://huggingface.co/datasets/stanfordnlp/snli). Follow the steps in [the documentation](https://huggingface.co/docs/datasets/v2.19.0/loading#hugging-face-hub) for loading the dataset.

[you can remove the template for whatever code you don't use]

The data is organized as follows:

* Column 1: Premise (sentence1)
* Column 2: Hypothesis (sentence2)
* Column 3: Relation (gold_label)

**[3 marks]**

In [1]:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/snli")

# filter out dataset instances which don't have any gold label and which are marked with -1 
dataset = dataset.filter(lambda example: example['label'] in [0, 1, 2])

ex = dataset['train'][0]
print(dataset)
print(ex)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9824
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9842
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
})
{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}


Notice that the dataset comes as a dictionary-like object with three splits: `'test'`, `'train'`, and `'validation'`. Each item is a dictionary containing a `'premise'`, `'hypothesis'`, and `'label'`.

## 2. Tokenization

This data does not come pre-tokenized. Instead of training our own tokenizer, we can use the BERT tokenizer like in the preivous assignment. Even though we aren't using BERT the tokenizer works with any model. See the documentation on [using a pretrained tokenizer](https://huggingface.co/docs/tokenizers/en/quicktour#using-a-pretrained-tokenizer). **[1 mark]**

In [2]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
ex_encoding = tokenizer.encode(ex['premise'])
ex_encoding_ids = tokenizer.encode(ex['premise']).ids
print(ex_encoding)
print(ex_encoding_ids)

def encode(example):
    # tokenize the premise
    tokens_pre = tokenizer.encode(example['premise'])
    # tokenize the hypothesis
    tokens_hyp = tokenizer.encode(example['hypothesis'])
    # add the tokenized versions to the example row
    example['premise_ids'] = tokens_pre.ids
    example['hypothesis_ids'] = tokens_hyp.ids
    return example

# map the encoding function to the full dataset
dataset_encoded = dataset.map(encode)
print(dataset_encoded['train'][0])

Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
[101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102]
{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1, 'premise_ids': [101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102], 'hypothesis_ids': [101, 1037, 2711, 2003, 2731, 2010, 3586, 2005, 1037, 2971, 1012, 102]}


## 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMS)
    2) Perform max over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform pooling. There is a builtin function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case, what we want to do for these methods is apply some function $f$ along dimension $i$, and we want to do this for all $i$'s. As an example we consider the matrix S with size ``(N, D)`` where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement the max pooling method. When performing max-pooling, $max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}

This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the representation at each token position. 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ```(batch_size, dimensions)```). [**4 Marks**]

In [3]:
import torch

def max_pooling(input_tensor):
    output_tensor, _ = torch.max(input_tensor, 1)
    return output_tensor

test_unpooled = torch.rand(32, 100, 512)
test_pooled = max_pooling(test_unpooled)
print(test_pooled.size()) # should be torch.Size([32, 512])

torch.Size([32, 512])


### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

Implement the function. **[4 marks]**

In [4]:
def combine_premise_and_hypothesis(premise, hypothesis):
    output = torch.cat((premise, hypothesis, abs(premise - hypothesis), premise * hypothesis), 1)
    return output

test_hypothesis = test_pooled.clone()
test_premise = test_pooled.clone()
test_combined = combine_premise_and_hypothesis(test_hypothesis, test_premise)
print(test_combined.size()) # should be torch.Size([32, 400]) not it's supposed to be torch.Size([32, 2048]), 100 is the num_words dim

torch.Size([32, 2048])


### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionaly, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**8 marks**]

In [5]:
vocab_size = tokenizer.get_vocab_size()
print(vocab_size)

30522


In [6]:
import torch.nn as nn

class SNLIModel(nn.Module):
    def __init__(self, vocab_size=vocab_size, embeddings_dim=128, hidden_dim=32, num_labels=3):
        # your code goes here
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embeddings_dim, padding_idx=0)
        self.dropout1 = nn.Dropout(p=0.2)
        self.rnn = nn.LSTM(embeddings_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.dropout2 = nn.Dropout(p=0.2)
        self.classifier = nn.Linear(hidden_dim *8, num_labels)
        
    def forward(self, premise, hypothesis):
        # two dropout layers after embeddings and LSTM
        p_in = self.embeddings(premise)
        p_in = self.dropout1(p_in)
        h_in = self.embeddings(hypothesis)
        h_in = self.dropout1(h_in)

        p_out, _ = self.rnn(p_in)
        h_out, _ = self.rnn(h_in)
        p_out = self.dropout2(p_out)
        h_out = self.dropout2(h_out)

        p_pooled = max_pooling(p_out)
        h_pooled = max_pooling(h_out)
        
        ph_representation = combine_premise_and_hypothesis(p_pooled, h_pooled)
        predictions = self.classifier(ph_representation)
        
        return predictions
    

## 3. Training

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[10 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [7]:
device = torch.device('cuda:2')

import torch.optim as optim

epochs = 2
batch_size = 8

train_iter = dataset_encoded['train'].iter(batch_size=batch_size)

model = SNLIModel(vocab_size=vocab_size, embeddings_dim=128, hidden_dim=32, num_labels=3).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)


for _ in range(epochs):
    model.train()
    
    # train model
    total_loss = 0
    for i, batch in enumerate(train_iter):
        premise = batch['premise_ids']
        hypothesis = batch['hypothesis_ids']
        # first pad both premises and hypothesis to max_len for each of them
        max_len_p = max([len(prem) for prem in premise])
        for prem in premise:
            if len(prem) < max_len_p:
                prem += [0] *(max_len_p - len(prem))
        max_len_h = max([len(hyp) for hyp in hypothesis])
        for hyp in hypothesis:
            if len(hyp) < max_len_h:
                hyp += [0] *(max_len_h - len(hyp))
        # convert to tensors
        premise = torch.tensor(premise).to(device)    
        hypothesis = torch.tensor(hypothesis).to(device)
        label = torch.tensor(batch['label']).to(device)
        # reset the gradients
        optimizer.zero_grad()
        # predict
        prediction = model(premise, hypothesis)
        # calculate loss
        loss = loss_function(prediction, label)
        # backpropagation
        loss.backward()
        # optimization / update weigts
        optimizer.step()
        # update loss for this epoch
        total_loss += loss.item()
        print(total_loss/(i+1), end='\r')
    print()

0.9071731670871357



## 4. Testing

Test the model on the testset. For each example in the test set, compute a prediction from the model (`entailment`, `contradiction` or `neutral`). Compute precision, recall, and F1 score for each label. **[10 marks]**

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

test_iter = dataset_encoded['test'].iter(batch_size=batch_size)

model.eval()
all_labels = []
all_predictions = []

with torch.no_grad():
    for batch in test_iter:
        premise = batch['premise_ids']
        max_len_p = max([len(prem) for prem in premise])
        for prem in premise:
            if len(prem) < max_len_p:
                prem += [0] *(max_len_p - len(prem))
        hypothesis = batch['hypothesis_ids']
        max_len_h = max([len(hyp) for hyp in hypothesis])
        for hyp in hypothesis:
            if len(hyp) < max_len_h:
                hyp += [0] *(max_len_h - len(hyp))
        # convert to tensors
        premise = torch.tensor(premise).to(device)    
        hypothesis = torch.tensor(hypothesis).to(device)
        label = batch['label']
        
        output = model(premise, hypothesis)
        prediction = torch.argmax(output, dim =1)

        # update the lists
        all_labels += label
        all_predictions += prediction.tolist()

accuracy = accuracy_score(all_labels, all_predictions)
precision = precision_score(all_labels, all_predictions, average='weighted')
recall = recall_score(all_labels, all_predictions, average='weighted')
f1 = f1_score(all_labels, all_predictions, average='weighted')

print('accuracy: ', round(accuracy, 2))
print('precision: ', round(precision, 2))
print('recall: ', round(recall, 2))
print('f1: ', round(f1, 2))

labels = [0, 1, 2]
target_names = ['entailment', 'neutral', 'contradiction'] # looking at the dataset on https://huggingface.co/datasets/stanfordnlp/snli
report = classification_report(all_labels, all_predictions, labels=labels, target_names=target_names)
print(report)

accuracy:  0.67
precision:  0.67
recall:  0.67
f1:  0.66
               precision    recall  f1-score   support

   entailment       0.65      0.80      0.72      3368
      neutral       0.67      0.59      0.63      3219
contradiction       0.69      0.61      0.65      3237

     accuracy                           0.67      9824
    macro avg       0.67      0.66      0.66      9824
 weighted avg       0.67      0.67      0.66      9824



Suggest a _baseline_ that we can compare our model against **[2 marks]**

A baseline model could be a 'dumb' model only predicting a label randomly. Since there are 3 labels to predict from, this baseline model would get an accuracy of about 33%.

Our model got the accuracy of about 67%, which doubles the baseline accuracy. We can say that it performs well.

Suggest some ways (other than using a baseline) in which we can analyse the models performance **[3 marks]**.

- We could compare the model's performance to that of some other SNLI models.
- We could use existing dataset as benchmark (e.g. RTE) to test the performance of the model.
- We can try simple transfer learning and compare results with some exising other NLI models.

Suggest some ways to improve the model **[3 marks]**.

- Check the balance between the labels
- If we got imbalanced data, we use stratified sampling method on data splitting.
- Shuffle the dataset to get more balanced data
- Optimization: We can try tuning the hyper parameter
    - (e.g.) embedding size, batch size, hidden dimension, epochs, learning rate, number of layers, dropout

## Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.
Before the questions session (5/6) we met one (3 on site and one online), 3 hours in total. 
After the question session, we met twice for 2 hours to finalize the model.

## Marks

This assignment has a total of 23 marks.