# A5: Natural Language Inference using Neural Networks

by Adam Ek, Bill Noble, and others.

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.


In this lab we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a premise P, if H describes something that is definitely true given P, the relation is an **entailment**. If H describes something that *may* be true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

## 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/), but unfortunately that download is unavailable at the moment. Instead, we will use the version uploaded to [HuggingFace](https://huggingface.co/datasets/stanfordnlp/snli), available through the `datasets` library. See the [documentation](https://huggingface.co/docs/datasets/v2.19.0/loading#hugging-face-hub) for loading a dataset from the HuggingFace hub.

The (simplified) data is organized as follows:

* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll need to build a dataloader. You can adapt your code from the previous lab to the new dataset. **[1 mark]**

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cpu')
# device = torch.device('cuda:0')
## MLT gpu is down this week, so we switch to pc to complete the assignment.

In [2]:
#!pip install datasets

In [3]:
from datasets import load_dataset
dataset = load_dataset('snli')

## filter out data with label `-1`
def filter_labels(example):
    return example['label'] != -1

filtered_dataset = dataset.filter(filter_labels)
# ex = dataset['train'][0]
# print(dataset)
# print(ex)

In [4]:
train =  filtered_dataset['train']
# validation = dataset['validation']
test = filtered_dataset['test']

In [5]:
# from collections import Counter

# print("summary labels:",Counter(train['label']))

Notice that the dataset comes as a dictionary-like object with three splits: `'test'`, `'train'`, and `'validation'`. Each item is a dictionary containing a `'premise'`, `'hypothesis'`, and `'label'`.
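
As a small illustration of that structure, the sketch below uses a few made-up items (not taken from SNLI) in the same `premise`/`hypothesis`/`label` layout, drops the items labelled `-1`, and maps the integer labels to names. The mapping 0 = entailment, 1 = neutral, 2 = contradiction follows the convention used later in this notebook; `toy_items` and `LABEL_NAMES` are names introduced only for this example.

```python
# Hypothetical items in the same layout as the SNLI splits (not real SNLI data).
toy_items = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A person is making music.", "label": 0},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a famous musician.", "label": 1},
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is asleep at home.", "label": 2},
    {"premise": "Two dogs run through a field.",
     "hypothesis": "The animals are outside.", "label": -1},  # no gold label
]

# Assumed label convention: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Keep only items that have a gold label, then translate ids to names.
kept = [item for item in toy_items if item["label"] != -1]
named = [LABEL_NAMES[item["label"]] for item in kept]
print(len(kept), named)
```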

## 2. Tokenization

This data does not come pre-tokenized. Instead of training our own tokenizer, we can use the BERT tokenizer as in the previous assignment. Even though we aren't using BERT, its tokenizer can be used with any model. See the documentation on [using a pretrained tokenizer](https://huggingface.co/docs/tokenizers/en/quicktour#using-a-pretrained-tokenizer). **[1 mark]**

In [6]:
from tokenizers import Tokenizer

# initialize tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# print(tokenizer.encode(ex['premise']).ids)

In [7]:
# print(tokenizer.encode(train[0]['premise']))
# output: Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [8]:
# import torch.utils.data as data_utils

# merged_data = data_utils.ConcatDataset([train, test])

# ## for padding, to compute the max_padding_length
# premise_padding_length =  max(len(example['premise']) for example in merged_data)
# hypothesis_padding_length =  max(len(example['hypothesis']) for example in merged_data)
# print(premise_padding_length,hypothesis_padding_length) # 402 295

In [9]:
# # clarify
# longest_premise_example = max(merged_data, key=lambda x: len(x['premise']))

# print("Longest premise text:")
# print(longest_premise_example['premise'])

# ## Why max_length so large => A sentence is treated as one long word, and each character is treated as a token.

In [10]:
# define encode function
def encoded_data(data, tokenizer): 
    processed_data = []
    for pair in data:
        premise_encoded = tokenizer.encode(pair['premise']).ids
        hypothesis_encoded = tokenizer.encode(pair['hypothesis']).ids

        processed_data.append({'premise': premise_encoded, 
                             'hypothesis': hypothesis_encoded, 
                             'label': pair['label']})
    return processed_data

train_encoded = encoded_data(train, tokenizer)
# validation_encoded = encoded_data(validation)
test_encoded = encoded_data(test, tokenizer)

In [11]:
# tokenizer.token_to_id("[PAD]") # padding_value = 0

In [12]:
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # Convert each list of token ids to a tensor
    premises = [torch.tensor(item['premise'], dtype=torch.long) for item in batch]
    hypotheses = [torch.tensor(item['hypothesis'], dtype=torch.long) for item in batch]
    labels = [item['label'] for item in batch]

    # Pad every sequence in the batch to the longest one (0 is the id of [PAD])
    premises_padded = pad_sequence(premises, batch_first=True, padding_value=0)
    hypotheses_padded = pad_sequence(hypotheses, batch_first=True, padding_value=0)

    # Convert the labels to a single tensor of shape (batch_size,)
    labels_tensor = torch.tensor(labels, dtype=torch.long)

    return {
        'premise': premises_padded,
        'hypothesis': hypotheses_padded,
        'label': labels_tensor
    }
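
To see what the padding step produces, here is a minimal sketch with made-up token ids (not real tokenizer output). It assumes, as in the cell above, that id 0 is the padding id; the `mask` variable is an addition for illustration and is not used elsewhere in the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Made-up token id sequences of different lengths.
sequences = [torch.tensor([101, 7, 8, 102]), torch.tensor([101, 9, 102])]

# Shorter sequences are filled with the padding id up to the longest length.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded.shape)  # (batch_size, longest_sequence_length)

# A boolean mask marking real tokens can be recovered from the padding id.
mask = padded != 0
print(mask)
```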

In [13]:
# train_encoded[0]
# label: 0-entailment, 1-neutral, 2-contradiction

In [14]:
from torch.utils.data import DataLoader

batch_size = 32

train_loader = DataLoader(train_encoded, batch_size = batch_size, shuffle = True, collate_fn = collate_fn)
test_loader = DataLoader(test_encoded, batch_size = batch_size, shuffle = False, collate_fn = collate_fn)


In [15]:
# train_iter = iter(train_loader)
# train_batch = next(train_iter)

# print("train_loader sample:",train_batch) # the first batch
# print("Shape of 'premise':",train_batch['premise'].shape) 
# print("Shape of 'hypothesis':",train_batch['hypothesis'].shape) # [batch_size, max_length(num_words)]

## 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

1. Encode the hypothesis and the premise using one shared bidirectional LSTM (or two different LSTMs)
2. Perform max pooling over the tokens in the premise and the hypothesis
3. Combine the encoded premise and encoded hypothesis into one representation
4. Predict the relationship

### Creating a representation of a sentence

Let's first consider step 2, where we perform pooling. There is a built-in function in PyTorch for this, but we'll implement it from scratch.

Consider the general case first: we want to apply some function $f$ along each dimension $i$ of the representation. As an example, take the matrix $S$ of size ``(N, D)``, where N is the number of words and D is the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement the max pooling method. When performing max-pooling, $\max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = \max(s_{1d}, s_{2d}, \dots, s_{nd})
\end{equation}

This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the representation at each token position. 
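
As a worked example of the equation for a single sentence, the sketch below takes the maximum of each column of a small made-up matrix $S$ using plain Python lists. The helper name `column_max` is introduced only for this illustration; the batched PyTorch version is what the exercise asks for.

```python
def column_max(matrix):
    """Return the maximum of each column of a list-of-lists matrix."""
    n_dims = len(matrix[0])
    return [max(row[d] for row in matrix) for d in range(n_dims)]

# N = 3 words, D = 4 dimensions (made-up values).
S = [
    [0.1, 0.9, -0.3, 0.0],
    [0.5, 0.2, -0.1, 0.4],
    [0.3, 0.7, -0.8, 0.2],
]

# One value per dimension: the sentence representation x.
x = column_max(S)
print(x)
```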

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ``(batch_size, dimensions)``). [**4 Marks**]

In [16]:
def max_pooling(input_tensor):
    output_tensor,_ = torch.max(input_tensor, dim=1) # return a tuple: maximum value and corresponding index
    return output_tensor

# test_unpooled = torch.rand(32, 100, 512)
# test_pooled = max_pooling(test_unpooled)
# print(test_pooled.size()) # should be torch.Size([32, 512]) 
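
One caveat: the batches in this notebook are padded with id 0, so the LSTM also produces states for padded positions, and an unmasked max can pick those up. The sketch below is one possible variant, not part of the assignment: it assumes a boolean mask of shape ``(batch_size, num_words)`` that is `True` for real tokens, and that every sequence has at least one real token. The name `masked_max_pooling` is hypothetical.

```python
import torch

def masked_max_pooling(states, mask):
    """Max over the word dimension, ignoring positions where mask is False.

    states: (batch_size, num_words, dimensions)
    mask:   (batch_size, num_words), True for real tokens
    """
    # Padded positions are set to -inf so they can never be the maximum.
    filled = states.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    return filled.max(dim=1).values

# One sentence with three positions; the last position is padding.
states = torch.tensor([[[1.0, 2.0], [3.0, 0.0], [9.0, 9.0]]])
mask = torch.tensor([[True, True, False]])

pooled = masked_max_pooling(states, mask)
print(pooled)
```

Without the mask, the padded position would dominate both dimensions in this toy input.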

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here we concatenate P, H, the absolute value of P minus H, and the element-wise product of P and H, then return the result.

Implement the function. **[4 marks]**

In [17]:
def combine_premise_and_hypothesis(premise, hypothesis):
    # Concatenate premise, hypothesis, absolute difference, and element-wise product
    output = torch.cat([
        premise,
        hypothesis,
        torch.abs(premise - hypothesis),
        premise * hypothesis
    ], dim=1)
    
    return output

# test_hypothesis = test_pooled.clone()
# test_premise = test_pooled.clone()
# test_combined = combine_premise_and_hypothesis(test_hypothesis, test_premise)
# print(test_combined.size()) # should be torch.Size([32, 2048]), i.e. 4 * 512
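
For a concrete check of the formula for $X$, the sketch below applies it to two made-up vectors with `d = 2`, so the result has `4d = 8` entries. The values of `P` and `H` are arbitrary and chosen only so that each part of the output is easy to read off.

```python
import torch

# One premise and one hypothesis vector, each with d = 2 dimensions.
P = torch.tensor([[1.0, 2.0]])
H = torch.tensor([[3.0, 1.0]])

# X = [P; H; |P - H|; P * H], concatenated along the feature dimension.
X = torch.cat([P, H, torch.abs(P - H), P * H], dim=1)
print(X.shape)
print(X)
```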

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**8 marks**]

In [18]:
class SNLIModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_classes, dropout_prob):
        # your code goes here
        super(SNLIModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 8, hidden_size),  # input: 4 concatenated vectors, each of size 2 * hidden_size (BiLSTM)
            nn.ReLU(),
            nn.Dropout(dropout_prob),
            nn.Linear(hidden_size, num_classes)
        )
        
    def forward(self, premise, hypothesis):
        # Compute embeddings for premise and hypothesis
        p_embedded = self.embeddings(premise)
        h_embedded = self.embeddings(hypothesis)
        # print("Shape of p_embedded",p_embedded.shape) # (batch_size, sequence_length, embedding_dim)

        # Pass through RNN
        p_rnn_output, (_, _) = self.rnn(p_embedded)
        h_rnn_output, (_, _) = self.rnn(h_embedded) # extract all hidden states

        # print("Shape of p_rnn_output",p_rnn_output.shape) # (batch_size, sequence_length, hidden_size*2)

        # Perform pooling (max pooling)
        p_pooled = max_pooling(p_rnn_output)  
        h_pooled = max_pooling(h_rnn_output)

        # Combine premise and hypothesis representations
        ph_representation = combine_premise_and_hypothesis(p_pooled, h_pooled)

        # Pass through classifier
        predictions = self.classifier(ph_representation)
        
        return predictions
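
To see why the first linear layer takes `hidden_size * 8` inputs, the sketch below traces the tensor shapes through the same kinds of layers with small made-up sizes. It is a shape check only: the weights are random, and the same pooled vector stands in for both premise and hypothesis.

```python
import torch
import torch.nn as nn

batch_size, num_words = 4, 7                       # made-up sizes
vocab_size, embedding_dim, hidden_size = 50, 16, 8

embeddings = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (batch_size, num_words))

# A bidirectional LSTM concatenates forward and backward states: 2 * hidden_size.
states, _ = rnn(embeddings(token_ids))             # (4, 7, 16)

# Max pooling removes the word dimension.
pooled = states.max(dim=1).values                  # (4, 16)

# Four vectors of size 2 * hidden_size are concatenated: 8 * hidden_size.
combined = torch.cat(
    [pooled, pooled, (pooled - pooled).abs(), pooled * pooled], dim=1
)                                                  # (4, 64)
print(states.shape, pooled.shape, combined.shape)
```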

In [19]:
# print(dir(tokenizer))

In [20]:
vocab_size = tokenizer.get_vocab_size() # 30522

## 3. Training

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[10 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
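
One way to follow this tip without rewriting the training loop is to cap the number of batches drawn from the loader. The sketch below is a minimal illustration using a plain list as a stand-in for a `DataLoader`; `limited_batches` and `fake_loader` are hypothetical names.

```python
from itertools import islice

def limited_batches(loader, max_batches=None):
    """Return an iterator over at most max_batches batches (all if None)."""
    return islice(loader, max_batches)

# Stand-in for a DataLoader: any iterable of batches works the same way.
fake_loader = [f"batch_{i}" for i in range(5)]

debug_run = list(limited_batches(fake_loader, max_batches=1))  # quick smoke test
full_run = list(limited_batches(fake_loader))                  # normal training
print(debug_run, len(full_run))
```

In the training loop, `for batch in limited_batches(train_loader, max_batches=1):` would then run a single batch per epoch while debugging.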

In [21]:
epochs = 2
embedding_dim = 64
hidden_size = 32
dropout_prob = 0.1
lr = 0.0001

# train_iter = dataset['train'].iter(batch_size=batch_size)
print("build model...")
model = SNLIModel(vocab_size=vocab_size, embedding_dim=embedding_dim, hidden_size=hidden_size, num_classes=3, dropout_prob=dropout_prob)
model.to(device)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

print("training model 1...")
for epoch in range(epochs):
    
    total_loss = 0.0 # reset

    # train model
    model.train()  # train mode
    for batch in train_loader:
        
        # reset gradient
        optimizer.zero_grad()
        
        # forward
        premise = batch['premise']
        hypothesis = batch['hypothesis']
        labels = batch['label'] 
        
        premise = premise.to(device)
        hypothesis = hypothesis.to(device)
        labels = labels.to(device)

        # print("premises:",premise)
        # print("labels:",labels) # (batch_size,)
        
        outputs = model(premise, hypothesis)
        # print("outputs:",outputs)

        # calculate loss
        loss = loss_function(outputs, labels)
        
        # Backpropagation
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        # accumulate losses
        total_loss += loss.item()
    
    # calculate average loss
    avg_loss = total_loss / len(train_loader)
    print(f"epoch {epoch + 1}, average loss: {avg_loss:.4f}")
    

build model...
training model 1...
epoch 1, average loss: 0.9153
epoch 2, average loss: 0.7668


## 4. Testing

Test the model on the test set. For each example in the test set, compute a prediction from the model (`entailment`, `contradiction` or `neutral`). Compute precision, recall, and F1 score for each label. **[10 marks]**

In [22]:
from sklearn.metrics import precision_recall_fscore_support

def test_model(model, test_loader):
    
    model.eval()
    correct = 0
    total = 0
    all_labels = []
    all_predictions = []

    with torch.no_grad():
        for batch in test_loader:
            premise = batch['premise']
            hypothesis = batch['hypothesis']
            labels = batch['label']

            premise = premise.to(device)
            hypothesis = hypothesis.to(device)
            labels = labels.to(device)
      
            outputs = model(premise, hypothesis)
            predicted = outputs.argmax(dim=1)  # index of the highest-scoring label
            
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

            # Store labels and predictions for further evaluation
            all_labels.extend(labels.cpu().numpy())
            all_predictions.extend(predicted.cpu().numpy())

    accuracy = 100 * correct / total

    # Calculate precision, recall, and F1 score
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_predictions, average=None, labels=[0, 1, 2])
    label_names = ['entailment', 'neutral', 'contradiction']
    
    # Report the scores for each label
    print("Compute precision, recall, and F1 score for each label:")
    for i, label in enumerate(label_names):
        print(f"Label: {label}")
        print(f"Precision: {precision[i]}")
        print(f"Recall: {recall[i]}")
        print(f"F1 Score: {f1[i]}")
        print()

    return accuracy, precision, recall, f1


In [23]:
# test model after all epochs are completed
accuracy, precision, recall, f1 = test_model(model, test_loader)
print(f"Total accuracy on test set: {accuracy:.2f}%")

Compute precision, recall, and F1 score for each label:
Label: entailment
Precision: 0.7432311811960726
Recall: 0.7416864608076009
F1 Score: 0.742458017536038

Label: neutral
Precision: 0.621773288439955
Recall: 0.688412550481516
F1 Score: 0.6533982013858174

Label: contradiction
Precision: 0.7150741635046568
Recall: 0.6404077849860983
F1 Score: 0.675684485006519

Total accuracy on test set: 69.09%
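
To make the per-label metrics concrete, the sketch below computes precision, recall, and F1 from scratch on a small made-up list of gold and predicted labels (not the model's output). The helper `per_label_scores` is hypothetical and is meant only to show what the scikit-learn function reports for each label.

```python
def per_label_scores(gold, predicted, labels):
    """Return {label: (precision, recall, f1)} computed one label at a time."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

# Made-up labels: 0 = entailment, 1 = neutral, 2 = contradiction.
gold      = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

scores = per_label_scores(gold, predicted, labels=[0, 1, 2])
for label, (p, r, f) in scores.items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
```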


Suggest a _baseline_ that we can compare our model against **[2 marks]**

We suggest a **random guessing** baseline.

In [24]:
import random
from sklearn.metrics import accuracy_score

def random_guessing(test_loader):
    all_labels = []
    all_predictions = []

    with torch.no_grad():
        for batch in test_loader:
            labels = batch['label']

            # Generate random predictions for each label
            batch_size = len(labels)
            random_predictions = [random.randint(0, 2) for _ in range(batch_size)]

            # Store labels and predictions for further evaluation
            all_labels.extend(labels.cpu().numpy())
            all_predictions.extend(random_predictions)

    # Calculate accuracy using sklearn
    accuracy = accuracy_score(all_labels, all_predictions)

    return accuracy

# Computation
random_accuracy = random_guessing(test_loader)
print(f"Random Guessing Accuracy: {random_accuracy}")


Random Guessing Accuracy: 0.3377442996742671


From the results above, it is apparent that our SNLI model performs better than this baseline (69.09% vs. about 33.8% accuracy).  
Since the label distribution of the test set is close to a *uniform distribution*, the accuracy of this baseline is predictable: roughly *1/n_classes*.  
So if a model's accuracy is close to that of random guessing, it has probably not learned useful patterns from the data.

In [25]:
from collections import Counter

print("summary labels:",Counter(test['label']))

summary labels: Counter({0: 3368, 2: 3237, 1: 3219})
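
Using the label counts printed above, the sketch below works out what two simple baselines would score without running them: always predicting the most frequent label, and guessing uniformly at random. The majority-class baseline is an additional suggestion for comparison and was not run in this notebook.

```python
from collections import Counter

# Test-set label counts as printed above.
label_counts = Counter({0: 3368, 2: 3237, 1: 3219})
total = sum(label_counts.values())

# Majority-class baseline: always predict the most frequent label.
majority_label, majority_count = label_counts.most_common(1)[0]
majority_accuracy = majority_count / total

# Expected accuracy of uniform random guessing over the three labels.
uniform_accuracy = sum(
    count / total * (1 / 3) for count in label_counts.values()
)

print(total, majority_label)
print(round(majority_accuracy, 4), round(uniform_accuracy, 4))
```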


Suggest some ways (other than using a baseline) in which we can analyse the model's performance **[3 marks]**.

- **Confusion Matrix**  
The confusion matrix compares the true labels with the labels predicted by the model and organizes the counts into a matrix, showing how often each true label is predicted as each other label. By analyzing it, we can see which categories the model performs well on, which it performs poorly on, and which types of errors it tends to make.
- **ROC Curve**  
The ROC curve shows the relationship between the true positive rate and the false positive rate. For a three-class task such as this one, it can be computed per label in a one-vs-rest fashion. By analyzing the curve, we can better understand the model's performance at different thresholds and choose the most appropriate threshold.
- **Cross-Validation**  
Cross-validation can assess the model's performance more reliably and test its ability to generalize across different subsets of the data. By using different training and validation sets, we can reduce the bias caused by an uneven or random distribution of the data.
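
As an illustration of the first suggestion, the sketch below builds a confusion matrix from scratch for a small made-up list of gold and predicted labels (not the model's output). The function name `build_confusion_matrix` is hypothetical; rows are gold labels and columns are predicted labels.

```python
def build_confusion_matrix(gold, predicted, num_labels):
    """matrix[g][p] counts the examples with gold label g predicted as p."""
    matrix = [[0] * num_labels for _ in range(num_labels)]
    for g, p in zip(gold, predicted):
        matrix[g][p] += 1
    return matrix

# Made-up labels: 0 = entailment, 1 = neutral, 2 = contradiction.
gold      = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

matrix = build_confusion_matrix(gold, predicted, num_labels=3)
for row in matrix:
    print(row)
```

Off-diagonal cells show which pairs of labels get confused; in this toy input, one contradiction example is predicted as entailment.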

Suggest some ways to improve the model **[3 marks]**.

We can try to improve the model from these perspectives:  
**Model Structure Adjustment**:  
Here we use only a one-layer BiLSTM with max pooling; we could try more complex architectures such as the *HBMP* model, or other architectures such as the *Transformer*.  
**Hyperparameter Tuning**:  
Limited by computational resources, we tried only a few hyperparameters this time. A better combination of hyperparameters could be searched for using grid search.   
**Early stopping**:  
Although we did not use a validation set here, a suggested approach is to monitor performance on it after each epoch and stop training when further parameter updates no longer yield improvement.
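
The early stopping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not something that was run in this notebook: the validation losses are made up, and `early_stopping_epoch` and `patience` are hypothetical names for the stopping rule and the number of non-improving epochs tolerated.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training runs through the last epoch.
    return len(val_losses) - 1

# Made-up validation losses, one per epoch.
val_losses = [0.92, 0.80, 0.78, 0.79, 0.81, 0.77]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

In practice one would also keep a copy of the model weights from the best epoch, since the stopping epoch itself is by definition not the best one.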

## Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

This assignment has a total of 23 marks.