# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In the VG part of problem set 3, we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H definitely describe something true given P then it is an **entailment**. If H describe something that's *maybe* true given P, it's **neutral**, and if H describe something that's definitely *false* given P it's a **contradiction**. 

This definition of inference, and the method we use to solve it, is diffrent from what you've previously worked with. Briefly discuss strengths and weaknesses of using formal semantics versus using statistical methods for natural language inference. **[4 marks]**

**Your answer should go here**

| formal semantics                                              | Deep Learning |
|---------------------------------------------------------------|-----------------------------------------------|
| Sentences must be parsable.                                   | Can deal with more natural language      |
| According to the design, it can deal with word sense ambiguity and othe problems | Word sense ambiguity may affect the performance if not included in the training phase|
| Do not need a large data to train, but need a relatively complex world model  | Need large data                               |
| Transperant, the output can be linked to the input            | Model is a black box

# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [1]:

import random

import numpy as np
from torchtext.data import Field, TabularDataset, BucketIterator
import torch
import torch.nn as nn
import torch.nn.functional as F

from sklearn.metrics import accuracy_score

seed = 123
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(seed)
torch.cuda.manual_seed(seed)

In [2]:
def mytokenizer(text):
    return text.split()


def dataloader(path_to_snli, batch_size=8, device=None):

    print("Loading data...\n")

    if not device:
        device = torch.device('cpu')

    text = Field(tokenize=mytokenizer,
                 sequential=True,
                 batch_first=True)

    label = Field(batch_first=True,
                  sequential=False,
                  is_target=True)

    fields = [("premises", text),
              ("hypothesis", text),
              ("relation", label)]

    # create tabular datasets
    train_ds, dev_ds, test_ds = TabularDataset.splits(
        path=path_to_snli,
        train="simple_snli_1.0_train.csv",
        validation="simple_snli_1.0_dev.csv",
        test="simple_snli_1.0_test.csv",
        format="csv",
        fields=fields,
        csv_reader_params={"delimiter": "\t"})

    text.build_vocab(train_ds, dev_ds, test_ds, min_freq=3)
    label.build_vocab(train_ds, dev_ds, test_ds, min_freq=3)

    train_iter, dev_iter, test_iter = BucketIterator.splits(
        (train_ds, dev_ds, test_ds),
        batch_sizes=batch_size,
        sort_within_batch=False,
        shuffle=True,
        repeat=False,
        sort_key=lambda x: (len(x.premises), len(x.hypothesis)),
        device=device)

    print("Loading data done.\n")

    return train_iter, dev_iter, test_iter, label, text


# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with mean/max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using a bidirectional LSTM
    2) Perform max or mean pooling over the premise and hypothesis
    3) Combine the premise and hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case, what we want to do for these methods is apply some function $f$ along dimension $i$, and we want to do this for all $i$'s. As an example we consider the matrix S with size ``(N, D)`` where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement both the max and mean pooling methods. When performing mean-pooling, $f$ will be the mean function and $x$ is the output, thus for each dimension $d$ we calculate:

\begin{equation}
x_d = \frac{1}{N}\sum_{j=1}^N x_{jd}
\end{equation}

When performing max-pooling we do the same thing, but let $f$ be the ``argmax`` function:

\begin{equation}
    x_d = f(s_{1d}, s_{2d}, ..., s_{nd}) = argmax(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


Both of these operations reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, 1, dimensions)`` meaning that we now have created a sentence representation based on the content of the words representations in the sentence (by applying some function $f$ along a dimension). 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max or mean pooling and return it. [**6 Marks**]

In [3]:
def pooling(input_tensor):
    # sum word feature in each sent and each batch
    # get the index of the max word vector summation
    # select based on the max id

    batch_size = input_tensor.size(0)
    mean_pool = torch.max(input_tensor, 1)[0]
    output_tensor = mean_pool.view(batch_size, 1, -1)
    return output_tensor


### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ will be ``(batch_size, 1, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; P \cdot H; P-H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

Implement the function. **[4 marks]**

In [4]:
def combine_premise_and_hypothesis(hypothesis, premise):
    p_minus_h = torch.abs(premise - hypothesis)
    p_dot_h = hypothesis * premise
    output = torch.cat((premise, hypothesis, p_dot_h, p_minus_h), dim=2)
    return output

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionaly, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**6 marks**]

In [5]:
class SNLIModel(nn.Module):
    def __init__(self,
                 vocab_size,
                 embedding_dim,
                 hidden_dim,
                 output_size,
                 bidirectional=True,
                 batch_first=True,
                 num_layers=1,
                 dropout=0.5):

        super().__init__()
        # your code goes here
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           bidirectional=bidirectional,
                           batch_first=batch_first,
                           num_layers=num_layers,
                           dropout=dropout)
        self.classifier = nn.Linear(hidden_dim*8, output_size)

    def forward(self, premise, hypothesis):
        p_emb = self.embeddings(premise)
        h_emb = self.embeddings(hypothesis)
        p_encode, _ = self.rnn(p_emb)
        h_encode, _ = self.rnn(h_emb)
        p_max = pooling(p_encode)
        h_max = pooling(h_encode)
        x = combine_premise_and_hypothesis(p_max, h_max)
        predictions = F.relu(self.classifier(x))

        return predictions

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [6]:
def train_epoch(model, iterator, optimizer, loss_function):
    running_loss = 0
    model.train()

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()
        p = batch.premises
        h = batch.hypothesis
        outputs = model(p, h)

        # predictions = model(batch)
        batch_size = batch.relation.size(0)
        loss = loss_function(outputs.view(batch_size, -1), batch.relation)
        loss.backward()
        optimizer.step()

        # loss
        running_loss += loss.item()

        if i % 1000 == 999 or (i % 1000 != 999 and i == len(iterator)-1):
            print(f"  Batch-{i+1:<6d} Loss: {running_loss/(i+1):.3f}")

    return running_loss/len(iterator)


def evaluate_epoch(model, iterator, loss_function):
    running_loss = 0

    model.eval()
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            p = batch.premises
            h = batch.hypothesis
            outputs = model(p, h)

            # predictions = model(batch)
            batch_size = batch.relation.size(0)
            loss = loss_function(outputs.view(batch_size, -1), batch.relation)

            # loss
            running_loss += loss.item()

            if i % 1000 == 999 or (i % 1000 != 999 and i == len(iterator)-1):
                print(f"  Batch-{i+1:<6d} Loss: {running_loss/(i+1):.3f}")

    return running_loss/len(iterator)


def train_eval(model, train_iter, eval_iter, epochs, lr=0.001):
    loss_function = nn.CrossEntropyLoss()
    loss_function = loss_function.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        # train, evaluate
        print("Start training...")
        print(f"\nepoch: {epoch:2d}")
        _ = train_epoch(model, train_iter, optimizer, loss_function)

        print("\nStart Eval...")
        print(f"\nepoch: {epoch:2d}\n")
        _ = evaluate_epoch(model, eval_iter, loss_function)

        # TODO: save model that have the best loss or eval metrics ...

    return model


def test(model, iterator):

    model.eval()
    predictions = []
    labels = []
    with torch.no_grad():
        for batch in iterator:
            p = batch.premises
            h = batch.hypothesis
            outputs = model(p, h)

            relation = batch.relation.tolist()
            batch_size = batch.relation.size(0)
            prob_dist = F.softmax(outputs.view(batch_size, -1), 0)
            prediction = torch.max(prob_dist, 1)[
                1].view(-1).tolist()

            predictions += prediction
            labels += relation

            accuracy = accuracy_score(predictions, labels)

    return accuracy


In [7]:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
batch_size = batch_size = (64, 32, 32)
dataset_path = \
    "/home/guszarzmo@GU.GU.SE/MLT_Temp/lt2213-v20/dataset/simple_snli_1.0"
train_iter, eval_iter, test_iter, label, text = dataloader(
    dataset_path, batch_size, device)

Loading data...

Loading data done.



In [8]:
vocab_size = len(text.vocab)
embedding_size = 100
hidden_size = 33
epochs = 2
output_size = len(label.vocab)

model = SNLIModel(vocab_size, embedding_size, hidden_size, output_size)
model = model.to(device)
model = train_eval(model, train_iter, eval_iter, epochs, lr=0.001)
metrics = test(model, test_iter)
print(f"\nmodel accuracy:{metrics:.2f}")

Start training...

epoch:  0
  Batch-1000   Loss: 1.009
  Batch-2000   Loss: 0.945
  Batch-3000   Loss: 0.907
  Batch-4000   Loss: 0.881
  Batch-5000   Loss: 0.863
  Batch-6000   Loss: 0.849
  Batch-7000   Loss: 0.837
  Batch-8000   Loss: 0.826
  Batch-8597   Loss: 0.820

Start Eval...

epoch:  0

  Batch-313    Loss: 0.879
Start training...

epoch:  1
  Batch-1000   Loss: 0.703
  Batch-2000   Loss: 0.703
  Batch-3000   Loss: 0.701
  Batch-4000   Loss: 0.701
  Batch-5000   Loss: 0.700
  Batch-6000   Loss: 0.698
  Batch-7000   Loss: 0.696
  Batch-8000   Loss: 0.695
  Batch-8597   Loss: 0.694

Start Eval...

epoch:  1

  Batch-313    Loss: 0.813

model accuracy:0.51


Suggest a baseline that we can compare our model against **[2 marks]**

**Your answer should go here**

Suggest some ways (other than using a baseline) in which we can analyse the models performance **[6 marks]**.

1. Using Accuracy, Precision and F1 score in the testing/inference phase
2. Using Confusion Matrix

**Your answer should go here**

Suggest some ways to improve the model **[4 marks]**.

1. Model hyperparameter tuning
2. Using of pre-trained word vectors and c/c encodeing
3. We could use a feedforward layer to encode the sentences instead of pooling, not sure if this will improve the perfromance
4. Adding regularization term to the loss function and/or applying dropout to the embedding layer.


**Your answer should go here**

### Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.