# A6: Natural Language Inference using Neural Networks

by Adam Ek, Bill Noble, and others.

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.


In this lab we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a premise P, if H describes something that is definitely true given P, it is an **entailment**. If H describes something that is *maybe* true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

## 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1].

There are two options for loading and working with the data.

1. Download the data directly from the [SNLI website](https://nlp.stanford.edu/projects/snli/) and write a dataloader based on your dataloader from **A3: Distributed Representations and Language Models**.
2. Use the `datasets` library to load the version on the [HuggingFace hub](https://huggingface.co/datasets/stanfordnlp/snli). Follow the steps in [the documentation](https://huggingface.co/docs/datasets/v2.19.0/loading#hugging-face-hub) for loading the dataset.

[you can remove the template for whatever code you don't use]

The data is organized as follows:

* Column 1: Premise (sentence1)
* Column 2: Hypothesis (sentence2)
* Column 3: Relation (gold_label)

**[3 marks]**

In [None]:
!pip install --upgrade datasets fsspec pyarrow


In [None]:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/snli")

ex = dataset['train'][0]
print(dataset)
print(ex)

## OR ##




DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
})
{'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}


Notice that the dataset comes as a dictionary-like object with three splits: `'test'`, `'train'`, and `'validation'`. Each item is a dictionary containing a `'premise'`, `'hypothesis'`, and `'label'`.
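
The integer labels can be mapped back to names. The sketch below assumes the mapping 0 = entailment, 1 = neutral, 2 = contradiction (the order used for `target_names` in the evaluation cell later in this notebook) and that -1 marks examples without a gold label. The `examples` list is a hypothetical stand-in for dataset rows, so the snippet runs without downloading anything; its second row is made up for illustration.

```python
# Assumed label mapping; -1 marks examples without a gold label.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Hypothetical stand-in rows with the same keys as the dataset items.
examples = [
    {"premise": "A person on a horse jumps over a broken down airplane.",
     "hypothesis": "A person is training his horse for a competition.",
     "label": 1},
    {"premise": "A man is sleeping.",
     "hypothesis": "A man is awake.",
     "label": -1},  # invented row illustrating a missing gold label
]

def keep_labelled(rows):
    """Drop rows whose label is not one of the three known classes."""
    return [row for row in rows if row["label"] in LABEL_NAMES]

labelled = keep_labelled(examples)
for row in labelled:
    print(LABEL_NAMES[row["label"]], "|", row["premise"], "->", row["hypothesis"])
```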

## 2. Tokenization

This data does not come pre-tokenized. Instead of training our own tokenizer, we can use the BERT tokenizer as in the previous assignment. Even though we aren't using BERT itself, its tokenizer can be used with any model, since it only maps text to token ids. See the documentation on [using a pretrained tokenizer](https://huggingface.co/docs/tokenizers/en/quicktour#using-a-pretrained-tokenizer). **[1 mark]**

In [None]:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained('bert-base-uncased')

print(tokenizer.encode(ex['premise']).ids)

[101, 1037, 2711, 2006, 1037, 3586, 14523, 2058, 1037, 3714, 2091, 13297, 1012, 102]


## 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMs)
    2) Perform max pooling over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship

### Creating a representation of a sentence

Let's first consider step 2, where we perform pooling. There is a built-in function in PyTorch for this, but we'll implement it from scratch.

Consider the general case: we want to apply some function $f$ along each dimension $d$ of the token representations. As an example, take the matrix $S$ of size ``(N, D)``, where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$.

You will implement the max pooling method. When performing max-pooling, $\max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = \max(s_{1d}, s_{2d}, \dots, s_{nd})
\end{equation}

This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the representation at each token position.
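
As a small worked example of this reduction, the sketch below pools a made-up batch with explicit loops and cross-checks the result against PyTorch's built-in reduction. The helper name `max_pool_loop` and the values in `S` are illustrative assumptions, not part of the assignment.

```python
import torch

# Tiny made-up batch: 1 sentence, 3 tokens, 2 dimensions.
S = torch.tensor([[[1.0, 5.0],
                   [4.0, 2.0],
                   [3.0, 0.0]]])

def max_pool_loop(batch):
    """Reference implementation: for every sentence and every dimension d,
    take the largest value over the token positions."""
    batch_size, num_words, dims = batch.shape
    out = torch.empty(batch_size, dims)
    for b in range(batch_size):
        for d in range(dims):
            out[b, d] = max(batch[b, n, d].item() for n in range(num_words))
    return out

pooled = max_pool_loop(S)
print(pooled.tolist())  # [[4.0, 5.0]]
print(pooled.shape)     # torch.Size([1, 2])

# Cross-check against the built-in reduction over the token dimension.
assert torch.equal(pooled, S.amax(dim=1))
```

The loops are only meant to make the definition explicit; on real batches a vectorized reduction over `dim=1` is much faster.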

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ```(batch_size, dimensions)```). [**4 Marks**]

In [None]:
import torch

def max_pooling(input_tensor):
    # Tensor.max(dim=...) returns a (values, indices) pair; keep only the values.
    output_tensor = input_tensor.max(dim=1).values
    return output_tensor

test_unpooled = torch.rand(32, 100, 512)
test_pooled = max_pooling(test_unpooled)
#print(test_pooled.size()) # should be torch.Size([32, 512])

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use):

$$X = [P; H; |P-H|; P \cdot H]$$

Here we concatenate P, H, the absolute value of P minus H, and the element-wise product of P and H, in that order, and return the result.
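
As a small worked example of this concatenation (the values of `P` and `H` are made up for illustration, with `d = 2`):

```python
import torch

P = torch.tensor([[1.0, 2.0]])   # pooled premise, shape (1, 2)
H = torch.tensor([[3.0, -1.0]])  # pooled hypothesis, shape (1, 2)

# [P; H; |P - H|; P * H] along the feature dimension
X = torch.cat([P, H, torch.abs(P - H), P * H], dim=1)

print(X.tolist())  # [[1.0, 2.0, 3.0, -1.0, 2.0, 3.0, 3.0, -2.0]]
print(X.shape)     # torch.Size([1, 8])
```

The output has `4d = 8` features per example, matching the size given above.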

Implement the function. **[4 marks]**

In [None]:
def combine_premise_and_hypothesis(hypothesis, premise):
    difference = torch.abs(premise - hypothesis)
    product = premise * hypothesis
    output = torch.cat([premise, hypothesis, difference, product], dim=1)
    return output

#test_hypothesis = test_pooled.clone()
#test_premise = test_pooled.clone()
#test_combined = combine_premise_and_hypothesis(test_hypothesis, test_premise)
#print(test_combined.size()) # should be torch.Size([32, 2048]), i.e. 4 * 512

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling.

Implement the model [**8 marks**]

In [None]:
import torch.nn as nn

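class SNLIModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout_rate):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # One shared bidirectional LSTM encodes both premise and hypothesis.
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        # The BiLSTM output has 2 * hidden_dim features, and the combined
        # representation concatenates four such vectors.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 4 * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, premise, hypothesis):
        # (batch_size, num_words) -> (batch_size, num_words, 2 * hidden_dim)
        p, _ = self.rnn(self.embeddings(premise))
        h, _ = self.rnn(self.embeddings(hypothesis))

        # Max pooling over the token dimension -> (batch_size, 2 * hidden_dim)
        p_pooled = torch.max(p, dim=1)[0]
        h_pooled = torch.max(h, dim=1)[0]
        # Combined representation -> (batch_size, 8 * hidden_dim)
        ph_representation = combine_premise_and_hypothesis(h_pooled, p_pooled)
        # Unnormalized class scores (logits) -> (batch_size, output_dim)
        predictions = self.classifier(ph_representation)

        return predictions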
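
One thing to keep in mind: with right-padded batches, taking the max over the token dimension also considers the LSTM outputs at padding positions. A possible refinement, sketched below and not part of the model above, sets padded positions to negative infinity before pooling. The helper name `masked_max_pooling`, the default `pad_id=0`, and the example tensors are assumptions for illustration; the training cell in this notebook looks up the real id with `tokenizer.token_to_id("[PAD]")`.

```python
import torch

def masked_max_pooling(states, token_ids, pad_id=0):
    """Max-pool over tokens while ignoring padding positions.

    states:    (batch_size, num_words, dimensions) encoder outputs
    token_ids: (batch_size, num_words) ids used to locate the padding
    """
    pad_mask = (token_ids == pad_id).unsqueeze(-1)        # (B, N, 1)
    masked = states.masked_fill(pad_mask, float("-inf"))  # pads can never win
    return masked.max(dim=1).values                       # (B, D)

# Made-up batch: the second sentence has one real token followed by two pads.
token_ids = torch.tensor([[5, 6, 7],
                          [8, 0, 0]])
states = torch.tensor([[[1.0, 2.0], [3.0, 0.0], [2.0, 5.0]],
                       [[-1.0, -2.0], [9.0, 9.0], [9.0, 9.0]]])

print(masked_max_pooling(states, token_ids).tolist())
# [[3.0, 5.0], [-1.0, -2.0]]
```

This only changes the pooling step. The backward direction of the LSTM would still read the padding first; packing the sequences with `torch.nn.utils.rnn.pack_padded_sequence` is one way to address that as well.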

## 3. Training

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[10 marks]**

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
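
A minimal sketch of such a one-batch smoke test is shown below. It uses a deliberately tiny stand-in model (`TinySmokeModel`, a hypothetical name) and randomly generated token ids rather than real SNLI data, so it only checks that the training plumbing runs and that the loss on a single fixed batch goes down.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in classifier: mean of embeddings -> linear layer.
class TinySmokeModel(nn.Module):
    def __init__(self, vocab_size=50, dim=16, num_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(2 * dim, num_classes)

    def forward(self, premise, hypothesis):
        p = self.emb(premise).mean(dim=1)
        h = self.emb(hypothesis).mean(dim=1)
        return self.out(torch.cat([p, h], dim=1))

# One fixed, randomly generated batch (not real SNLI data).
smoke_premise = torch.randint(0, 50, (8, 10))
smoke_hypothesis = torch.randint(0, 50, (8, 7))
smoke_labels = torch.randint(0, 3, (8,))

smoke_model = TinySmokeModel()
smoke_optimizer = torch.optim.Adam(smoke_model.parameters(), lr=1e-2)
smoke_loss_fn = nn.CrossEntropyLoss()

# Train repeatedly on the same batch; the loss should shrink if the
# forward pass, loss, and optimizer step are wired up correctly.
losses = []
smoke_model.train()
for step in range(200):
    smoke_optimizer.zero_grad()
    loss = smoke_loss_fn(smoke_model(smoke_premise, smoke_hypothesis), smoke_labels)
    loss.backward()
    smoke_optimizer.step()
    losses.append(loss.item())

# Also exercise the evaluation path once.
smoke_model.eval()
with torch.no_grad():
    smoke_preds = smoke_model(smoke_premise, smoke_hypothesis).argmax(dim=1)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The same idea carries over to the real model: swap in `SNLIModel` and one tokenized batch, and check that both the training and the evaluation code paths run before starting a full epoch.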

In [None]:
epochs = 3
batch_size = 32

pad_id = tokenizer.token_to_id("[PAD]")
tokenizer.enable_padding(
    pad_id   = pad_id,
    pad_token= "[PAD]",
    direction= "right"
)

vocab_size = tokenizer.get_vocab_size()
emb_dim = 64
hid_dim = 128
dropout_rate = 0.2
lr = 1e-3

loss_function = nn.CrossEntropyLoss(ignore_index=-1)
model = SNLIModel(vocab_size=vocab_size,
                  embedding_dim=emb_dim,
                  hidden_dim=hid_dim,
                  output_dim=3,
                  dropout_rate=dropout_rate)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    model.train()
    train_iter = dataset['train'].iter(batch_size=batch_size)
    total_loss = 0.0
    for step, batch in enumerate(train_iter):
        # train model
        encoded_premise = tokenizer.encode_batch(batch['premise'])
        encoded_hypothesis = tokenizer.encode_batch(batch['hypothesis'])

        tensor_p = torch.tensor([p.ids for p in encoded_premise], dtype=torch.long, device=device)
        tensor_h = torch.tensor([h.ids for h in encoded_hypothesis], dtype=torch.long, device=device)
        y = torch.tensor(batch['label'], dtype=torch.long, device=device)

        optimizer.zero_grad()
        logits = model(tensor_p, tensor_h)
        loss = loss_function(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # step is zero-based, so the number of batches is step + 1
    print(f'Epoch {epoch} Average loss: {total_loss / (step + 1)}')


# test model after all epochs are completed

Epoch 0 Average loss: 0.670036061803419
Epoch 1 Average loss: 0.5630880122958379
Epoch 2 Average loss: 0.5190930039079737


In [None]:
import numpy as np
from sklearn.metrics import classification_report

test_iter = dataset["test"].iter(batch_size=batch_size)

model.eval()
actual, predicted = [], []

with torch.no_grad():
    for batch in test_iter:
        labels = np.array(batch["label"])
        keep   = labels != -1                 # -1 marks examples without a gold label
        if keep.sum() == 0:
            continue                          # nothing left to evaluate in this batch

        # filter premises/hypotheses with the mask
        prem  = [p for p, k in zip(batch["premise"],     keep) if k]
        hypo  = [h for h, k in zip(batch["hypothesis"],  keep) if k]
        y_true = labels[keep]

        enc_p = tokenizer.encode_batch(prem)
        enc_h = tokenizer.encode_batch(hypo)

        tensor_p = torch.tensor([e.ids for e in enc_p], dtype=torch.long, device=device)
        tensor_h = torch.tensor([e.ids for e in enc_h], dtype=torch.long, device=device)

        y_pred = torch.argmax(model(tensor_p, tensor_h), dim=1)

        actual.extend(y_true.tolist())
        predicted.extend(y_pred.cpu().tolist())

target_names = ["entailment", "neutral", "contradiction"]
print(classification_report(actual, predicted, target_names=target_names))



               precision    recall  f1-score   support

   entailment       0.83      0.83      0.83      3368
      neutral       0.74      0.74      0.74      3219
contradiction       0.81      0.80      0.81      3237

     accuracy                           0.79      9824
    macro avg       0.79      0.79      0.79      9824
 weighted avg       0.79      0.79      0.79      9824



## 4. Testing

**Test the model on the testset. For each example in the test set, compute a prediction from the model (`entailment`, `contradiction` or `neutral`). Compute precision, recall, and F1 score for each label. [10 marks]**

After testing the model on the SNLI test set, we can better understand how well it handles each type of relationship: entailment, neutral, and contradiction. The numbers below refer to the classification report printed above.

The model performed best on entailment, with precision, recall, and F1 all at 0.83. These scores are far from perfect, but entailment is the class the model recognizes most reliably.

The weakest results came from the neutral label, with precision, recall, and F1 all at 0.74. Precision and recall are equal, so the model neither clearly over-predicts nor under-predicts neutral; it simply confuses it with the other two classes more often. A plausible reason is that neutral sits between entailment and contradiction, so uncertain examples from either side tend to end up there.

Contradiction fell in between, with a precision of 0.81, a recall of 0.80, and an F1 score of 0.81. Class imbalance does not explain the differences between the classes: the 9,824 labelled test examples are almost evenly split between entailment (3,368), neutral (3,219), and contradiction (3,237).

To sum up, the model reaches 79% accuracy overall. It is most reliable on entailment, slightly weaker on contradiction, and has the most difficulty with neutral cases.

**Suggest a _baseline_ that we can compare our model against [2 marks]**



A suitable baseline for this task is a random classifier that assigns one of the three labels—entailment, neutral, or contradiction—uniformly at random. Since each class has an equal chance of being selected, the expected accuracy of this baseline is approximately 33.3%.

This random baseline is useful because it sets a clear lower bound for performance: any trained model should ideally do better than random guessing. Although it does not account for class distribution or linguistic features, it provides a simple and reproducible point of comparison.

Our model achieves an accuracy of 79%, which is well above the random baseline. This indicates that the model has learned patterns in the data rather than relying on chance.
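
The random baseline can be put next to a second simple option, a majority-class baseline that always predicts the most frequent label. The sketch below uses the per-class support counts from the classification report above; picking the majority class from the test counts is an assumption made for illustration (in practice it would be chosen from the training set).

```python
# Support counts per class, copied from the classification report above.
support = {"entailment": 3368, "neutral": 3219, "contradiction": 3237}
total = sum(support.values())

# Uniform random guessing is correct with probability 1/3 for any class mix.
random_accuracy = 1 / len(support)

# A majority-class baseline always predicts the most frequent label.
majority_label = max(support, key=support.get)
majority_accuracy = support[majority_label] / total

print(f"random:   {random_accuracy:.3f}")
print(f"majority: {majority_accuracy:.3f} (always '{majority_label}')")
# random:   0.333
# majority: 0.343 (always 'entailment')
```

Because the classes are nearly balanced, the two baselines are almost identical here, and both are far below the model's 79%.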

**Suggest some ways (other than using a baseline) in which we can analyse the model's performance [3 marks]**.



Beyond comparing our model to a baseline, there are several ways we can better understand its performance. One useful method is to look at the confusion matrix, which shows where the model tends to make mistakes—for example, whether it often confuses contradictions with neutral statements. This helps identify specific weaknesses.

Another important approach is to examine precision, recall, and F1 scores for each class separately, instead of relying only on overall accuracy. This gives us a clearer picture of how well the model handles each type of inference.

Finally, doing a manual error analysis—by looking at examples the model got wrong—can be very insightful. It may reveal patterns in the types of mistakes it makes, such as missing negation or being confused by certain sentence structures. These strategies help us go beyond just numbers and understand how the model is reasoning.



In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(actual, predicted, labels=[0, 1, 2])
print(cm)

[[2787  399  182]
 [ 386 2389  444]
 [ 178  456 2603]]
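
The per-class scores can be recomputed directly from this matrix, which makes the link between the confusion matrix and the classification report explicit. The sketch below hard-codes the matrix printed above and assumes scikit-learn's convention that rows are gold labels and columns are predictions.

```python
import numpy as np

# Rows are gold labels, columns are predictions,
# in the order entailment, neutral, contradiction.
cm = np.array([[2787,  399,  182],
               [ 386, 2389,  444],
               [ 178,  456, 2603]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # column sums = predicted counts
recall = true_positives / cm.sum(axis=1)      # row sums = gold counts
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

names = ["entailment", "neutral", "contradiction"]
for name, p, r, f in zip(names, precision, recall, f1):
    print(f"{name:>13}  P={p:.2f}  R={r:.2f}  F1={f:.2f}")
print(f"accuracy = {accuracy:.2f}")
```

Reading the off-diagonal cells, the largest confusions are between neutral and contradiction (444 and 456 examples), followed by neutral and entailment (386 and 399), while entailment and contradiction are rarely mixed up (182 and 178).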


**Suggest some ways to improve the model [3 marks]**.

One way to improve the model would be to modify the classifier head on top of the LSTM encoder by adding more layers, which could allow it to capture more complex patterns in the sentence representations. One could also experiment with regularization techniques other than dropout, train for more epochs to see where the model starts overfitting, and keep the regularization technique that yields the best result.

Another improvement would be to perform systematic hyperparameter optimization, testing different values for dropout rate, batch size, hidden layer size, and embedding dimension to find the best combination. Beyond the architecture itself, we could explore alternative ways of combining sentence representations, such as concatenation, element-wise multiplication, or attention mechanisms.

Finally, instead of relying only on max pooling for dimensionality reduction, trying other strategies like mean pooling or even learned attention-based pooling might lead to better performance.
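
As one concrete variant, mean pooling could replace the max in the model. The sketch below averages only over real tokens so that padding does not dilute the result; the helper name `masked_mean_pooling`, the default `pad_id=0`, and the example tensors are assumptions for illustration, and whether this actually improves accuracy would have to be tested.

```python
import torch

def masked_mean_pooling(states, token_ids, pad_id=0):
    """Average encoder outputs over real (non-padding) tokens.

    states:    (batch_size, num_words, dimensions)
    token_ids: (batch_size, num_words)
    """
    real = (token_ids != pad_id).unsqueeze(-1).to(states.dtype)  # (B, N, 1)
    summed = (states * real).sum(dim=1)                          # (B, D)
    counts = real.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

# Made-up batch: the second sentence has one real token followed by two pads.
token_ids = torch.tensor([[5, 6, 7],
                          [8, 0, 0]])
states = torch.tensor([[[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]],
                       [[-1.0, -2.0], [9.0, 9.0], [9.0, 9.0]]])

print(masked_mean_pooling(states, token_ids).tolist())
# [[2.0, 2.0], [-1.0, -2.0]]
```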

## Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[2] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

## Statement of contribution

**Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.**

We met twice, for a whole afternoon each time. All members were actively involved and contributed equally to the project.

## Marks

This assignment has a total of 23 marks.