# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In this lab we'll work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H definitely describe something true given P then it is an **entailment**. If H describe something that's *maybe* true given P, it's **neutral**, and if H describe something that's definitely *false* given P it's a **contradiction**. 

In [30]:
# first we import some packages that we need
import torch
import torch.nn as nn
import torchtext
import torch.optim as optim
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset
import numpy as np

# our hyperparameters (add more when/if you need them)
device = torch.device('cuda:0')

batch_size = 512 #8
learning_rate = 0.001
epochs = 3

# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [28]:
def dataloader(path_to_snli):
        
    whitespacer = lambda x: x.split(' ')

    # "fields" that process the different columns in our CSV files
    WORDS = Field(tokenize   = whitespacer,
                   lower       = True,
                   batch_first = True)

    RELATION = Field(tokenize   = whitespacer,
                      lower       = True,
                      batch_first = True)
    
    # read the csv files
    train, val, test = TabularDataset.splits(path   = path_to_snli,
                                        train  = 'simple_snli_1.0_train.csv',
                                        validation = 'simple_snli_1.0_dev.csv',
                                        test   = 'simple_snli_1.0_test.csv',
                                        format = 'csv',
                                        fields = [('premise', WORDS),
                                                  ('hypothesis', WORDS),
                                                 ('relation', RELATION)],
                                        skip_header       = True,
                                        csv_reader_params = {'delimiter':'\t',
                                                             'quotechar':'½'})
    
    # build vocabularies based on what our csv files contained and create word2id mapping
    WORDS.build_vocab(train)
    RELATION.build_vocab(train)

    # create batches from our data, and shuffle them for each epoch
    train_iter, val_iter, test_iter = BucketIterator.splits((train, val, test),
                                                  batch_size        = batch_size,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.premise),
                                                  shuffle           = True,
                                                  device            = device)

    return train_iter, val_iter, test_iter, WORDS.vocab, RELATION.vocab

# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMS)
    2) Perform max over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case, what we want to do for these methods is apply some function $f$ along dimension $i$, and we want to do this for all $i$'s. As an example we consider the matrix S with size ``(N, D)`` where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement both the max pooling method. When performing max-pooling, $max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the words representations in the sentence. 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ```(batch_size, dimensions)```). [**4 Marks**]

In [2]:
# dumb option that works
def pooling2(input_tensor):
    l = []
    for x in torch.transpose(input_tensor,2,1):
        m = []
        for y in x:
            n = float(y[0])
            for z in y:
                if float(z)>n:
                    n = float(z)
            m.append(n)
        l.append(m)
    output_tensor = torch.tensor(l)
    return output_tensor

In [8]:
# fancy option
import numpy
def pooling(input_tensor): #input size: (B, W, D)
#     x[0,:,0]
    output_tensor = []
    for batchie in range(input_tensor.shape[0]): 
        batchies = []
        for dim in range(input_tensor.shape[-1]): 
            max_val = numpy.max(input_tensor[batchie, :, dim].detach().cpu().numpy()) #fancy indexing hell yeah ☜(ﾟヮﾟ☜)
            batchies.append(max_val)
        output_tensor.append(batchies)
        
    return torch.tensor(output_tensor).to(device) #output size: (B, D)

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

Implement the function. **[2 marks]**

In [21]:
def combine_premise_and_hypothesis(hypothesis, premise):
    
    x = [premise, hypothesis, abs(premise-hypothesis), premise*hypothesis]

    output = torch.cat(x, dim=1)
    
    return output

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionaly, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**6 marks**]

In [19]:
class SNLIModel(nn.Module):
    def __init__(self, vocab, num_relations, i_dim):
        super(SNLIModel, self).__init__()
        self.embeddings = nn.Embedding(vocab, i_dim)
        self.rnn = nn.LSTM(i_dim, i_dim, bidirectional=True, batch_first=True) #idk about i_dim twice
        self.classifier = nn.Linear(i_dim*8, num_relations)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, premise, hypothesis):
        # Encode the Hypothesis and the Premise using one shared bidirectional LSTM
        p = self.embeddings(premise)
        h = self.embeddings(hypothesis)
        
        p,(_, _) = self.rnn(p)
        h,(_, _) = self.rnn(h)
        
        # Perform max over the tokens in the premise and the hypothesis
        p_pooled = pooling(p)
        h_pooled = pooling(h)
        
        # Combine the encoded premise and encoded hypothesis into one representation
        ph_representation = combine_premise_and_hypothesis(h_pooled, p_pooled)
        ph_representation = self.dropout(ph_representation)
        
        predictions = self.classifier(ph_representation)
        
        return predictions

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [31]:
path_to_snli = "data"
train_iter, dev_iter, test_iter, words, relation = dataloader(path_to_snli)

vocab_len = len(words)
rel_len = len(relation)

model = SNLIModel(vocab_len, rel_len, 128).to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

model.train()

for _ in range(epochs):
    total_loss = 0
    for i, batch in enumerate(train_iter):
        p = batch.premise
        h = batch.hypothesis
        r = batch.relation
        
        out = model(p, h)
        
        #calculate the loss
        loss = loss_function(out, r.view(-1))
        total_loss += loss.item()
        
        # print average loss for the epoch
        print(total_loss/(i+1), end='\r')
        
        # backpropagation
        loss.backward()
        
        # optimizing
        optimizer.step()
            
        # clear gradients
        optimizer.zero_grad()
    
    print()
    
torch.save(model, 'snli_model.pt')

1.0436860337368277
1.0164621060393577
1.0139082385772882


  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


In [35]:
# printing accuracy

model.eval()

test_loss = 0

correct_guesses = 0

for i, batch in enumerate(test_iter):
    p = batch.premise
    h = batch.hypothesis
    r = batch.relation
        
    with torch.no_grad():
        output = model(p, h)
        
    loss = loss_function(output, r.view(-1))
    test_loss += loss.item()
    
    # finding accuracy
    correct_guesses += torch.sum(torch.eq(torch.argmax(output, dim=1), r.view(-1)).long())
    
    # print average loss for the epoch
    print(total_loss/(i+1), end='\r')

accuracy = int(correct_guesses) / ((i+1) * batch_size)

print('>', np.round(test_loss/(i+1), 4))
print('accuracy: ', accuracy)

> 1.07837823529245
accuracy:  0.50546875


Suggest a _baseline_ that we can compare our model against **[2 marks]**

    We have seen that many models are biased when checking NLI: entailment is usually guessed among sentences that have the same words, and contradiction is predicted when the hypothesis contains a negation. to create a baseline and get similar results to previous models, then, would be to "copy this bias" and check for shared words and negative words, the latter being checked first.

Suggest some ways (other than using a baseline) in which we can analyse the models performance **[4 marks]**.

    We could compare the performance of this model to the performance of another model in a paper, training and testing it with the same dataset that they used, if it is available. Otherwise, we could try to replicate the model and use our dataset with it.  We could also use different measures, like precision, F1_score or recall, to see the model's performance.

Suggest some ways to improve the model **[3 marks]**.

    As per usual, we can always improve the model with more data. We would suggest having human annotators that check the premises, but especially the hypotheses. We would suggest the following, when writing hypotheses:
        - Use different words from the premise.
        - Write sentences of different lengths.
        - Use negation in both entailment, neutral and contradiction sentences.
    
    We could also try to use a pre-trained model for the sentence embeddings, for instance, BERT. 

### Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.