<a href="https://colab.research.google.com/github/yala/deeplearning_bootcamp/blob/master/lab2/nli_excercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Natural Language Inference Classifier (with PyTorch)

Natural language inference is the task of determining whether or not a given statement (the "hypothesis") is entailed by another given statement (the "premise").

The label is "entailment" if the hypothesis must be true given the premise, "contradiction" if the hypothesis must be false given the premise, and "neutral" if the premise does not determine whether the hypothesis is true or false.

An example is:

| Premise | Label | Hypothesis |
| --- | --- | --- |
| The Golden State Warriors scored 100 points last night. | Entailment | Someone scored a basket in the game. |
| The Golden State Warriors scored 100 points last night. | Neutral | The Warriors won the game. |
| The Golden State Warriors scored 100 points last night. | Contradiction | The Warriors struggled to make baskets. |


## Dataset

For this exercise we'll use a portion of the [MNLI](https://arxiv.org/abs/1704.05426) dataset, a natural language inference dataset that spans multiple genres and writing styles. To keep things simple, we only keep the "Entailment" and "Contradiction" classes, which makes this a binary classification task.

The data is provided to you as a list of examples, where each `example` has the following structure:

```
example.x1 = ["the", "tokenized", "premise"]
example.x2 = ["the", "tokenized", "hypothesis"]
example.y = 0 or 1  # 0 = contradiction, 1 = entailment
```
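
To make the structure concrete, here is a minimal sketch of how one raw line could turn into an example. It assumes each line of the data files is a JSON object with `x1`, `x2`, and `y` keys, which is what the loader in the next cell expects. The three raw lines are made up for illustration and are not taken from the dataset.

```python
import collections
import json

LABELS = ["contradiction", "entailment"]
Example = collections.namedtuple("Example", ["x1", "x2", "y"])

# Hypothetical raw lines, made up for illustration (not from the dataset).
raw_lines = [
    '{"x1": ["a", "man", "sleeps"], "x2": ["a", "man", "is", "awake"], "y": "contradiction"}',
    '{"x1": ["a", "man", "sleeps"], "x2": ["a", "person", "rests"], "y": "entailment"}',
    '{"x1": ["a", "man", "sleeps"], "x2": ["he", "is", "tired"], "y": "neutral"}',
]

examples = []
for line in raw_lines:
    fields = json.loads(line)
    # Drop any label outside the two classes we keep (here: "neutral").
    if fields["y"] not in LABELS:
        continue
    # Map the string label to its integer index in LABELS.
    examples.append(Example(fields["x1"], fields["x2"], LABELS.index(fields["y"])))

for ex in examples:
    print(ex.y, ex.x1, ex.x2)
```

The neutral line is skipped, so only two examples remain, with labels 0 and 1.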

In [0]:
# Load the data.
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/train.txt
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/valid.txt
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/test.txt

import collections
import json
import numpy as np

LABELS = ["contradiction", "entailment"]

Example = collections.namedtuple("Example", ["x1", "x2", "y"])

def load_data(filename):
  examples = []
  with open(filename, "r") as f:
    for line in f:
      fields = json.loads(line)
      x1 = fields["x1"]
      x2 = fields["x2"]
      # Skip examples whose label is not in LABELS (e.g. "neutral").
      if fields["y"] not in LABELS:
        continue
      y = LABELS.index(fields["y"])
      examples.append(Example(x1, x2, y))
  return examples

train_examples = load_data("train.txt")
valid_examples = load_data("valid.txt")
test_examples = load_data("test.txt")

In [0]:
import torch
import torch.nn.functional as F
from sklearn.feature_extraction.text import CountVectorizer

# Set vocab using train text.
min_df = 5
max_features = 3000
vectorizer = CountVectorizer(min_df=min_df, max_features=max_features)
vectorizer.fit([" ".join(ex.x1) for ex in train_examples] +
               [" ".join(ex.x2) for ex in train_examples])


def prepare_dataset(examples, batch_size=4):
  # Convert each token list to a bag-of-words count vector over the vocab.
  x1 = vectorizer.transform([" ".join(ex.x1) for ex in examples]).toarray()
  x2 = vectorizer.transform([" ".join(ex.x2) for ex in examples]).toarray()
  
  # Isolate the labels.
  y = np.array([ex.y for ex in examples])

  # Convert to torch tensors.
  x1 = torch.from_numpy(x1)
  x2 = torch.from_numpy(x2)
  y = torch.from_numpy(y)

  # Wrap in a pytorch tensor dataset.
  dataset = torch.utils.data.TensorDataset(x1, x2, y)

  # Load the dataset with a data loader.
  loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

  return loader


train_data = prepare_dataset(train_examples)
valid_data = prepare_dataset(valid_examples)
test_data = prepare_dataset(test_examples)
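
To see what the vectorizer produces, here is a small sketch on toy sentences (made up for illustration). It uses `CountVectorizer` with default settings rather than the `min_df` and `max_features` values above, so that the tiny vocabulary is not filtered away. Each text becomes a vector of word counts, and words outside the fitted vocabulary are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy "training" texts, made up for illustration.
train_texts = ["the warriors scored", "the warriors won"]

vec = CountVectorizer()
vec.fit(train_texts)

# Vocabulary words ordered by their column index.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)

# "game" was never seen during fit, so it is dropped; "the" appears twice.
counts = vec.transform(["the warriors won the game"]).toarray()
print(counts)
```

Note that word order is lost in this representation: the model only sees how often each vocabulary word occurs in the premise and in the hypothesis.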

## Boilerplate training code

Standard helper functions for training and evaluating your networks.

In [0]:
def train_epoch(model, train_loader, optimizer, epoch):
  model.train() # Set the nn.Module to train mode. 
  total_loss = 0
  correct = 0
  num_samples = len(train_loader.dataset)
  for batch_idx, (x1, x2, target) in enumerate(train_loader): #1) get batch
    x1 = x1.float()
    x2 = x2.float()
    target = target.float()
    # Reset gradient data to 0
    optimizer.zero_grad()
  
    # Get prediction for batch
    output = model(x1, x2).squeeze(1)
  
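    # 2) Compute loss
    # YOUR CODE HERE. Define `loss`, summed over the batch, since
    # total_loss is divided by the number of samples below.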
  
    #3) Do backprop
    loss.backward()
  
    #4) Update model
    optimizer.step()

    total_loss += loss.detach() # Don't keep computation graph 

  print('Train Epoch: {} \tXENT: {:.4f}\n'.format(
        epoch, total_loss / num_samples))
  


def eval_epoch(model, test_loader, name):
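  model.eval()  # Set the nn.Module to eval mode.
  test_loss = 0
  correct = 0.0
  with torch.no_grad():  # No gradients are needed for evaluation.
    for x1, x2, target in test_loader:
      x1 = x1.float()
      x2 = x2.float()
      target = target.float()
      output = model(x1, x2).squeeze(-1)
      # YOUR CODE HERE. Get accuracy and loss!
      # Accumulate the summed batch loss in test_loss and the number of
      # correct predictions in correct; both are averaged below.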

  test_loss /= len(test_loader.dataset)
  correct /= len(test_loader.dataset)
  print('\n{} set: Average XENT: {:.4f}\n'.format(name, test_loss))
  print('\n{} set: Average Acc: {:.4f}\n'.format(name, correct))
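
One possible way to fill in the loss and accuracy blanks is sketched below. This is not the only valid answer. It assumes the model returns raw logits (no final sigmoid), and the helper name `batch_loss_and_correct` is hypothetical. The loss is summed rather than averaged over the batch, so that dividing by the dataset size in the functions above yields a per-example average.

```python
import torch
import torch.nn.functional as F


def batch_loss_and_correct(logits, target):
    """Return the summed binary cross-entropy and the number of correct predictions.

    Assumes `logits` are raw (pre-sigmoid) scores of shape (batch,) and
    `target` holds float labels in {0., 1.} with the same shape.
    """
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
    # sigmoid(logit) > 0.5 is equivalent to logit > 0.
    pred = (logits > 0).float()
    correct = (pred == target).sum().item()
    return loss, correct


# Toy batch, made up for illustration.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
target = torch.tensor([1.0, 0.0, 0.0, 0.0])
loss, correct = batch_loss_and_correct(logits, target)
print(loss.item(), correct)
```

In `train_epoch` the returned `loss` could be used directly for `loss.backward()`; in `eval_epoch` the two values could be added to `test_loss` and `correct`.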

## Modeling

Try different model choices!

In [0]:
import torch.nn as nn
import torch.optim as optim

# Training settings
epochs = 10
lr = .01
momentum = 0.5
hidden_dim = 100

class Model(nn.Module):
  def __init__(self):
    super().__init__()  # Required before registering any layers.
    # YOUR CODE HERE

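  def forward(self, x1, x2):
    # YOUR CODE HERE
    # Return a tensor of shape (batch, 1); the training and evaluation
    # loops squeeze the last dimension.
    pass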


model = Model()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
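
As a starting point, here is a minimal sketch of one possible model: a small MLP over the concatenated premise and hypothesis count vectors. The class name `BowMLP` and its `vocab_size` argument are hypothetical, and this is only one of many reasonable choices. It outputs one raw logit per example, which matches the `(batch, 1)` shape the training loop expects.

```python
import torch
import torch.nn as nn


class BowMLP(nn.Module):
    """MLP over concatenated bag-of-words vectors (illustrative sketch)."""

    def __init__(self, vocab_size, hidden_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * vocab_size, hidden_dim),  # premise + hypothesis features
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one raw logit per example
        )

    def forward(self, x1, x2):
        # x1, x2: float tensors of shape (batch, vocab_size).
        return self.net(torch.cat([x1, x2], dim=-1))


# Random inputs, just to check shapes.
model = BowMLP(vocab_size=3000)
x1 = torch.rand(4, 3000)
x2 = torch.rand(4, 3000)
out = model(x1, x2)
print(out.shape)
```

In the notebook, `vocab_size` would likely be `len(vectorizer.vocabulary_)`, since the fitted vocabulary can be smaller than `max_features`.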

In [0]:

for epoch in range(1, epochs + 1):
  train_epoch(model, train_data, optimizer, epoch)
  eval_epoch(model, valid_data, "Dev")
  print("---")
eval_epoch(model, test_data, "Test")