<a href="https://colab.research.google.com/github/yala/deeplearning_bootcamp/blob/master/lab1/nli_excercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Natural Language Inference Classifier

Natural language inference is the task of determining whether or not a given statement (the "hypothesis") is entailed by another given statement (the "premise").

Given the premise, the hypothesis is labeled *entailment* if it must be true, *contradiction* if it must be false, and *neutral* if its truth cannot be determined from the premise alone.

For example, the same premise can be paired with hypotheses of all three types:

| Premise | Label | Hypothesis |
| ---  | --- | --- |
|The Golden State Warriors scored 100 points last night.| Entailment | Someone scored a basket in the game. |
|The Golden State Warriors scored 100 points last night. | Neutral | The Warriors won the game. |
| The Golden State Warriors scored 100 points last night. | Contradiction | The Warriors struggled to make baskets. |


## Dataset

For this exercise we'll use a portion of the [MNLI](https://arxiv.org/abs/1704.05426) dataset, a natural language inference dataset that spans multiple genres and writing styles. To keep things simple, we will only deal with the "Entailment" and "Contradiction" classes, which makes this a binary classification task.

The data is provided to you as a list of examples, where each `example` has the following structure:

```
example.x1 = ["the", "tokenized", "premise"]
example.x2 = ["the", "tokenized", "hypothesis"]
example.y = 0 (contradiction) or 1 (entailment)
```
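As a rough illustration of that structure, the sketch below builds one example by hand and reads its fields. The token lists and the label are made up for illustration; the label encoding (1 = entailment, 0 = contradiction) is the one implied by the `LABELS` list used when loading the data.

```python
import collections

# Same three fields as described above; this particular example is made up.
Example = collections.namedtuple("Example", ["x1", "x2", "y"])

example = Example(
    x1=["the", "warriors", "scored", "100", "points"],
    x2=["someone", "scored", "a", "basket"],
    y=1,  # 1 = entailment, 0 = contradiction
)

# Fields are plain Python lists of tokens, so they can be joined back into text.
premise_text = " ".join(example.x1)
hypothesis_text = " ".join(example.x2)
print(premise_text, "=>", hypothesis_text, "| label:", example.y)
```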

```python
# Load the data.
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/train.txt
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/valid.txt
!wget https://people.csail.mit.edu/fisch/assets/data/bootcamp/nli/test.txt

import collections
import json
import numpy as np

# Label strings used in the data files; the list index is the integer class id.
LABELS = ["contradiction", "entailment"]

Example = collections.namedtuple("Example", ["x1", "x2", "y"])

def load_data(filename):
  """Read one JSON object per line and keep only examples labeled in LABELS."""
  examples = []
  with open(filename, "r") as f:
    for line in f:
      fields = json.loads(line)
      x1 = fields["x1"]
      x2 = fields["x2"]
      # Skip labels outside LABELS, which keeps the task binary.
      if fields["y"] not in LABELS:
        continue
      y = LABELS.index(fields["y"])
      examples.append(Example(x1, x2, y))
  return examples

train_examples = load_data("train.txt")
valid_examples = load_data("valid.txt")
test_examples = load_data("test.txt")
```

## Feature Engineering

As you can see from the example, this task takes **two** inputs $x_1$ and $x_2$. We'll experiment with some basic featurization options.

Feel free to explore additional feature engineering approaches if you have time!






### Majority baseline

It's always good to start simple when approaching a new task. Naïve baselines are often good at uncovering biases in the data that you might not have noticed otherwise.

A good one to start with is the majority baseline. What is the prior for entailment? This model ignores the input entirely and always predicts the most common class in the training data.

Using the [DummyClassifier](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators) from `sklearn`, create a majority baseline and record the accuracy. See how much higher you can get in the next sections!
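A minimal sketch of a majority baseline with `DummyClassifier`, assuming the labels have been collected into arrays. The toy labels below are made up and stand in for something like `[ex.y for ex in train_examples]`; they are not drawn from the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (made up) standing in for the real training and validation labels.
y_train = np.array([1, 1, 1, 0, 0])
y_valid = np.array([1, 0, 1, 1])

# DummyClassifier ignores the features, so a placeholder column of zeros is enough.
X_train = np.zeros((len(y_train), 1))
X_valid = np.zeros((len(y_valid), 1))

majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_train, y_train)

prior_entailment = y_train.mean()  # fraction of label 1 in the training split
valid_accuracy = majority.score(X_valid, y_valid)
print(f"P(entailment) = {prior_entailment:.2f}, accuracy = {valid_accuracy:.2f}")
```

The validation accuracy of this model is simply the share of validation examples that carry the majority training label.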

```python
# TODO: YOUR CODE HERE

majority_baseline = ...
```

### Hypothesis- and premise-only baselines

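Two other simple baselines classify the data using only the hypothesis (ignoring the premise) or only the premise (ignoring the hypothesis). You may use a bag-of-words representation. If either one clearly beats the majority baseline, the data contains cues that do not require comparing the two sentences.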
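One possible shape for these single-input baselines is sketched below, assuming a bag-of-words built with `CountVectorizer` and a default `LogisticRegression`. The helper `fit_single_input`, the `identity` analyzer, and the toy sentences and labels are all hypothetical and only illustrate the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy tokenized data (made up) standing in for train_examples.
premises = [["a", "man", "sleeps"], ["a", "man", "sleeps"],
            ["a", "dog", "runs"], ["a", "dog", "runs"]]
hypotheses = [["someone", "rests"], ["nobody", "rests"],
              ["an", "animal", "moves"], ["nothing", "moves"]]
labels = [1, 0, 1, 0]  # 1 = entailment, 0 = contradiction


def identity(tokens):
    """Inputs are already tokenized, so the analyzer returns them unchanged."""
    return tokens


def fit_single_input(token_lists, labels):
    """Hypothetical helper: bag-of-words plus logistic regression on one input."""
    vectorizer = CountVectorizer(analyzer=identity)
    features = vectorizer.fit_transform(token_lists)
    model = LogisticRegression().fit(features, labels)
    return vectorizer, model, features


hyp_vectorizer, hypothesis_only, X_hyp = fit_single_input(hypotheses, labels)
prem_vectorizer, premise_only, X_prem = fit_single_input(premises, labels)

print("hypothesis-only train accuracy:", hypothesis_only.score(X_hyp, labels))
print("premise-only train accuracy:", premise_only.score(X_prem, labels))
```

In this toy set every premise appears once with each label, so the premise-only model cannot do better than chance on it. On the real data, remember to call `transform` (not `fit_transform`) on the validation and test splits.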

```python
# TODO: YOUR CODE HERE

hypothesis_only = ...

premise_only = ...
```

### Independent features

Let's now create a featurization that includes both $x_1$ and $x_2$. A simple one to begin with is the concatenation of their bag-of-words vectors: $[\texttt{BoW}(x_1); \texttt{BoW}(x_2)]$.
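The concatenation $[\texttt{BoW}(x_1); \texttt{BoW}(x_2)]$ could be built roughly as follows, assuming one `CountVectorizer` per input and `scipy.sparse.hstack` to join the two sparse matrices column-wise. The toy sentences and the `identity` analyzer are made up for illustration.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy tokenized data (made up).
premises = [["a", "man", "sleeps"], ["a", "dog", "runs"]]
hypotheses = [["a", "man", "rests"], ["a", "cat", "runs"]]


def identity(tokens):
    """Inputs are already tokenized, so the analyzer returns them unchanged."""
    return tokens


# Separate vocabularies, so "man" in the premise and "man" in the hypothesis
# end up as two different feature columns.
premise_vectorizer = CountVectorizer(analyzer=identity)
hypothesis_vectorizer = CountVectorizer(analyzer=identity)

X1 = premise_vectorizer.fit_transform(premises)
X2 = hypothesis_vectorizer.fit_transform(hypotheses)

# Stack the two sparse count matrices side by side: [BoW(x1); BoW(x2)].
X_concat = hstack([X1, X2]).tocsr()
print(X1.shape, X2.shape, X_concat.shape)
```

Because the two blocks occupy disjoint columns, a linear model on these features weights premise words and hypothesis words independently.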

```python
# TODO: YOUR CODE HERE

concatenated = ...
```

### Interaction features

The concatenated features don't capture any interaction between the premise and the hypothesis. We can add features that count the terms the two sentences share, for example the elementwise minimum of their bag-of-words count vectors, computed over a shared vocabulary: $\min(\texttt{BoW}(x_1), \texttt{BoW}(x_2))$.
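One way to count shared terms is an elementwise minimum of the two count vectors. This only makes sense if both are built over a single shared vocabulary, so that column $j$ refers to the same word in each matrix. The sketch below assumes that setup; the toy sentences and the `identity` analyzer are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy tokenized data (made up).
premises = [["a", "man", "sleeps"], ["a", "dog", "runs"]]
hypotheses = [["a", "man", "rests"], ["the", "cat", "sleeps"]]


def identity(tokens):
    """Inputs are already tokenized, so the analyzer returns them unchanged."""
    return tokens


# One vocabulary fitted on both inputs so the columns of X1 and X2 line up.
vectorizer = CountVectorizer(analyzer=identity)
vectorizer.fit(premises + hypotheses)
X1 = vectorizer.transform(premises)
X2 = vectorizer.transform(hypotheses)

# Elementwise minimum of counts: nonzero only for words present in both sentences.
X_overlap = X1.minimum(X2)

# Number of shared tokens per (premise, hypothesis) pair.
shared_counts = np.asarray(X_overlap.sum(axis=1)).ravel()
print(shared_counts)
```

The overlap matrix can be stacked next to the concatenated features, or the per-pair shared count can be used as a single extra column.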

```python
# TODO: YOUR CODE HERE

overlap = ...
```

## Modeling

Using your featurizations as inputs, experiment with different modeling choices using `sklearn`.

Try a logistic regression model first, with different regularization strategies. Then you may move on to non-linear classifiers, such as decision trees. Try to get as high an accuracy as you can!
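A comparison loop along these lines might look like the sketch below. It assumes a feature matrix and labels are already available; here they are synthetic stand-ins generated with a fixed seed, and the candidate names and hyperparameter values are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic count-like features (made up) standing in for any featurization.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(60, 6)).astype(float)
X_valid = rng.integers(0, 3, size=(30, 6)).astype(float)
# Synthetic rule: label 1 when feature 0 exceeds feature 1.
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)
y_valid = (X_valid[:, 0] > X_valid[:, 1]).astype(int)

# In scikit-learn, a smaller C means stronger regularization.
candidates = {
    "logreg_C=0.01": LogisticRegression(C=0.01),
    "logreg_C=1": LogisticRegression(C=1.0),
    "logreg_C=100": LogisticRegression(C=100.0),
    "tree_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)  # validation accuracy

best_name = max(scores, key=scores.get)
print(scores, "best:", best_name)
```

Selecting on the validation split and reporting test accuracy only once, for the final model, keeps the test number an honest estimate.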

```python
# TODO: YOUR CODE HERE
```