# Using NLI to Encode Sentences - Demo Notebook

This notebook assumes that you have already followed the instructions in `README.md`. That is, you have either downloaded the pre-trained models together with their results on SNLI & SentEval, or trained your own models and evaluated them on SNLI & SentEval.

In either case, the `pretrained` folder should contain trained models, while the `senteval` and `tb_logs` folders should contain the evaluations on SentEval and SNLI respectively.

We start by importing the necessary libraries. The rest of the notebook is divided into two parts:
- in the first part we evaluate all four pretrained models on SNLI & SentEval
- in the second part we use some of the pretrained models to do inference on custom sentence pairs

In [1]:
import numpy as np
import pandas as pd
import pickle

from collections import defaultdict
from data import SNLIDataModule
from model import NLI
from tensorboard.backend.event_processing import event_accumulator

pd.options.display.precision = 2

## Evaluation

### Performance on SNLI

Logs of the performance of all models on SNLI are stored in `tb_logs`. You can of course re-train all the models by uncommenting the lines below:

In [2]:
# !python train.py --encoder AWE
# !python train.py --encoder LSTM
# !python train.py --encoder BiLSTM
# !python train.py --encoder BiLSTM-MaxPool

Let's look at the validation and test accuracies our models achieved on SNLI.

In [3]:
tb_paths = {
    'AWE': {
        'validation': 'tb_logs/AWE/version_0/events.out.tfevents.1650439854.r30n1.lisa.surfsara.nl.12563.0', 
        'test': 'tb_logs/AWE/version_0/events.out.tfevents.1650441682.r30n1.lisa.surfsara.nl.12563.1',
    },
    'LSTM': {
        'validation': 'tb_logs/LSTM/version_0/events.out.tfevents.1650440601.r30n1.lisa.surfsara.nl.14847.0',
        'test': 'tb_logs/LSTM/version_0/events.out.tfevents.1650447205.r30n1.lisa.surfsara.nl.14847.1',
    },
    'BiLSTM': {
        'validation': 'tb_logs/BiLSTM/version_0/events.out.tfevents.1650441709.r30n2.lisa.surfsara.nl.22540.0',
        'test': 'tb_logs/BiLSTM/version_0/events.out.tfevents.1650450412.r30n2.lisa.surfsara.nl.22540.1',
    },
    'BiLSTM-MaxPool': {
        'validation': 'tb_logs/BiLSTM-MaxPool/version_0/events.out.tfevents.1650447234.r30n6.lisa.surfsara.nl.25797.0',
        'test': 'tb_logs/BiLSTM-MaxPool/version_0/events.out.tfevents.1650454387.r30n6.lisa.surfsara.nl.25797.1',
    }
}

snli_results = defaultdict(list)

# Iterate over all encoders
for encoder in tb_paths.keys():
    # Iterate over validation, test and possibly other splits
    for split, path in tb_paths[encoder].items():
        # Open the TensorBoard event file
        ea = event_accumulator.EventAccumulator(
            path,
            size_guidance={event_accumulator.SCALARS: 0},
        )
        
        _absorb_print = ea.Reload()
        
        # Record the best accuracy (maximum over epochs) for this split, in percent
        snli_results[encoder].append(100 * max(event.value for event in ea.Scalars(f'accuracy/{split}')))

# Convert the results to a dataframe    
snli_results = pd.DataFrame(snli_results, index=['SNLI Val Acc', 'SNLI Test Acc']).T

# Print dataframe
snli_results

| Encoder | SNLI Val Acc | SNLI Test Acc |
|---|---|---|
| AWE | 65.57 | 65.73 |
| LSTM | 80.4 | 80.22 |
| BiLSTM | 79.38 | 79.44 |
| BiLSTM-MaxPool | 83.61 | 80.56 |
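
Each number in this table is the highest accuracy logged for that split. The selection can be sketched on its own, without any event files. In the snippet below, `ScalarEvent` is a local stand-in that mirrors the `wall_time`, `step` and `value` fields of TensorBoard's scalar events, `best_accuracy` is a hypothetical helper that is not part of this repository, and the curve is made up for illustration:

```python
from collections import namedtuple

# Local stand-in mirroring the fields of a TensorBoard scalar event
ScalarEvent = namedtuple('ScalarEvent', ['wall_time', 'step', 'value'])


def best_accuracy(events):
    """Return (best accuracy in percent, step at which it was logged)."""
    best = max(events, key=lambda event: event.value)
    return 100 * best.value, best.step


# Made-up validation curve, for illustration only
toy_events = [
    ScalarEvent(wall_time=0.0, step=0, value=0.5),
    ScalarEvent(wall_time=1.0, step=1, value=0.75),
    ScalarEvent(wall_time=2.0, step=2, value=0.625),
]

best_acc, best_step = best_accuracy(toy_events)
print(f'best accuracy {best_acc:.2f}% at step {best_step}')
```

Returning the step alongside the value makes it easy to see whether the best score came early or late in training, which the plain maximum hides.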


Not bad, right? We can go further and inspect the training & evaluation performance on SNLI using the interactive TensorBoard extension.

In [4]:
%load_ext tensorboard
%tensorboard --logdir=tb_logs/

Reusing TensorBoard on port 6006 (pid 45197), started 7:51:12 ago. (Use '!kill 45197' to kill it.)

### Performance on SentEval

Pickled results of the evaluation of all models on SentEval are provided in the `senteval` folder. If you want to re-evaluate on SentEval, uncomment the cell below (and replace each checkpoint with the checkpoint of your choice).

In [5]:
# !python -W ignore eval.py --checkpoint pretrained/AWE/version_0/AWE-epoch=0.ckpt
# !python -W ignore eval.py --checkpoint pretrained/LSTM/version_0/LSTM-epoch=0.ckpt
# !python -W ignore eval.py --checkpoint pretrained/BiLSTM/version_0/BiLSTM-epoch=0.ckpt
# !python -W ignore eval.py --checkpoint pretrained/BiLSTM-MaxPool/version_0/BiLSTM-MaxPool-epoch=0.ckpt

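Let's see the test scores our models achieved on SentEval: accuracy for the classification tasks, Pearson correlation for SICK-R, and Pearson / Spearman correlation for STS14. This table is directly comparable to Table 4 of the original paper.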

In [6]:
senteval_results = defaultdict(dict)

# Iterate over all encoders
for encoder in ['AWE', 'LSTM', 'BiLSTM', 'BiLSTM-MaxPool']:
    # Load the pickled SentEval results, closing the file afterwards
    with open(f'senteval/{encoder}.pkl', 'rb') as f:
        results = pickle.load(f)
    
    # Iterate over all datasets
    for dataset in ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKRelatedness', 'SICKEntailment', 'STS14']:
        if dataset == 'SICKRelatedness':
            senteval_results[encoder][dataset] = results[dataset]['pearson']
        elif dataset == 'STS14':
            senteval_results[encoder][dataset] = '{:.2f} / {:.2f}'.format(
                results['STS14']['all']['pearson']['wmean'],
                results['STS14']['all']['spearman']['wmean']
            )
        else:
            senteval_results[encoder][dataset] = results[dataset]['acc']

# Convert the results to a dataframe
senteval_results = pd.DataFrame(senteval_results).T
senteval_results = senteval_results.rename(columns={'SICKRelatedness': 'SICK-R', 'SICKEntailment': 'SICK-E'})

# Print dataframe
senteval_results

| Encoder | MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC | SICK-R | SICK-E | STS14 |
|---|---|---|---|---|---|---|---|---|---|---|
| AWE | 75.16 | 79.31 | 90.63 | 84.66 | 77.76 | 80.6 | 71.36 | 0.8 | 78.57 | 0.47 / 0.50 |
| LSTM | 71.54 | 77.03 | 86.55 | 85.06 | 74.96 | 78.2 | 71.88 | 0.86 | 84.45 | 0.56 / 0.54 |
| BiLSTM | 72.89 | 79.15 | 89.61 | 85.13 | 76.94 | 86.2 | 71.36 | 0.86 | 84.72 | 0.55 / 0.52 |
| BiLSTM-MaxPool | 75.29 | 81.88 | 91.17 | 85.92 | 78.91 | 88.2 | 73.68 | 0.88 | 85.69 | 0.58 / 0.56 |


### Transfer Performance

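To measure the transfer performance, we replicate Table 3 from the original paper. For each encoder we report two averages of the dev accuracies over the SentEval classification tasks: the micro average, which weights each task by its number of dev examples, and the macro average, which weights all tasks equally.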

In [7]:
transfer_results = defaultdict(dict)

# Iterate over all encoders
for encoder in ['AWE', 'LSTM', 'BiLSTM', 'BiLSTM-MaxPool']:
    # Load the pickled SentEval results, closing the file afterwards
    with open(f'senteval/{encoder}.pkl', 'rb') as f:
        results = pickle.load(f)
    
    accuracies = [
        results[dataset]['devacc'] for dataset in ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment']
    ]
    
    examples = [
        results[dataset]['ndev'] for dataset in ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment']
    ]

    transfer_results[encoder]['micro'] = np.average(accuracies, weights=examples)    
    transfer_results[encoder]['macro'] = np.average(accuracies)

# Convert the results to a dataframe
transfer_results = pd.DataFrame(transfer_results).T

# Print dataframe concatenated with the SNLI results
pd.concat([snli_results, transfer_results], axis=1)

| Encoder | SNLI Val Acc | SNLI Test Acc | micro | macro |
|---|---|---|---|---|
| AWE | 65.57 | 65.73 | 80.76 | 79.11 |
| LSTM | 80.4 | 80.22 | 78.21 | 77.56 |
| BiLSTM | 79.38 | 79.44 | 80.8 | 80.16 |
| BiLSTM-MaxPool | 83.61 | 80.56 | 82.4 | 81.65 |
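
The `micro` and `macro` columns differ only in how the per-task dev accuracies are averaged. As a minimal sketch, the arithmetic is written out below in plain Python instead of `np.average`. The helper names, accuracies and dataset sizes are made up for illustration and are not taken from the pickled results:

```python
def macro_average(accuracies):
    """Unweighted mean: every task counts equally."""
    return sum(accuracies) / len(accuracies)


def micro_average(accuracies, n_examples):
    """Mean weighted by the number of examples in each task."""
    weighted_sum = sum(acc * n for acc, n in zip(accuracies, n_examples))
    return weighted_sum / sum(n_examples)


# Made-up numbers: a large task at 80% and a small task at 60%
toy_accuracies = [80.0, 60.0]
toy_sizes = [900, 100]

macro = macro_average(toy_accuracies)             # (80 + 60) / 2
micro = micro_average(toy_accuracies, toy_sizes)  # (80 * 900 + 60 * 100) / 1000
print(f'macro = {macro:.1f}, micro = {micro:.1f}')
```

In this toy case the micro average sits closer to the large task's accuracy, which is why the two columns can disagree when dataset sizes differ a lot.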


## Inference

### Loading pre-trained models and evaluating on custom sentences

The pretrained models are all instances of the `NLI` class, which implements the following methods:
- `encode(self, sentence: str)`: encodes a sentence into a dense vector representation
- `classify(self, sentence_A: str, sentence_B: str)`: classifies a pair of sentences (`sentence_A`, `sentence_B`) as _entailment_, _contradiction_ or _neutral_

For example, we can load a pretrained model as shown below, and see the 2048-dimensional encoding of our BiLSTM model with MaxPooling:

In [8]:
checkpoint_paths = {
    'AWE': 'pretrained/AWE/version_0/AWE-epoch=6.ckpt',
    'LSTM': 'pretrained/LSTM/version_0/LSTM-epoch=8.ckpt',
    'BiLSTM': 'pretrained/BiLSTM/version_0/BiLSTM-epoch=4.ckpt',
    'BiLSTM-MaxPool': 'pretrained/BiLSTM-MaxPool/version_0/BiLSTM-MaxPool-epoch=2.ckpt',
}

dm = SNLIDataModule()
dm.setup()
model = NLI.load_from_checkpoint(checkpoint_paths['BiLSTM-MaxPool'], vocab=dm.vocab, data_dir='data/')

model.encode('The cat is not the mat')

tensor([ 0.1061, -0.0645, -0.0265,  ..., -0.1058, -0.0288, -0.0211])
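
One common use of such encodings is to compare two sentences by the cosine similarity of their vectors. The sketch below is an illustration only: `cosine_similarity` is a hypothetical helper that is not part of the `NLI` class, and the short toy vectors stand in for real encodings, which could be converted to plain lists with `.tolist()`:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for sentence encodings
enc_a = [1.0, 0.0, 1.0]
enc_b = [1.0, 0.0, 1.0]  # same direction as enc_a
enc_c = [0.0, 1.0, 0.0]  # orthogonal to enc_a

print(cosine_similarity(enc_a, enc_b))
print(cosine_similarity(enc_a, enc_c))
```

Identical directions give a similarity close to 1 and orthogonal vectors give 0. With real encodings the values fall somewhere in between.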

We can also use our model to classify our own sentences:

In [9]:
pairs = [
    ("Jim rides a bike to school every morning.", "Jim can ride a bike."),
    ("The restaurant opens at five o'clock", "The restaurant begins serving between four and nine."),
    ("I liked the TV show.", "It looks like it's gonna rain.")
]

for sentence_A, sentence_B in pairs:
    print(f'{sentence_A:45s}| {sentence_B:55s}| {model.classify(sentence_A, sentence_B)}')

Jim rides a bike to school every morning.    | Jim can ride a bike.                                   | entailment
The restaurant opens at five o'clock         | The restaurant begins serving between four and nine.   | contradiction
I liked the TV show.                         | It looks like it's gonna rain.                         | neutral


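The samples above look encouraging, but the model also makes some obvious mistakes. In the pairs below, the sentences of the first pair are unrelated (so we would expect _neutral_), while the second pair is a direct contradiction: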

In [10]:
pairs = [
    ("I like ice-cream.", "It looks like it's gonna rain."),
    ('Butch is married to Barb.', 'Barb is not married to Butch.')
]

for sentence_A, sentence_B in pairs:
    print(f'{sentence_A:45s}| {sentence_B:55s}| {model.classify(sentence_A, sentence_B)}')

I like ice-cream.                            | It looks like it's gonna rain.                         | contradiction
Butch is married to Barb.                    | Barb is not married to Butch.                          | neutral
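
Spot checks like these can be made repeatable by attaching to each pair the label a human reader would expect and listing the disagreements. The sketch below assumes only that the classifier is a callable taking two sentences and returning a label string, as `model.classify` does. `spot_check` and `stub_classify` are hypothetical names, and the expected labels are our own judgement, not part of SNLI:

```python
def spot_check(classify, labelled_pairs):
    """Return the pairs whose predicted label differs from the expected one."""
    mistakes = []
    for sentence_a, sentence_b, expected in labelled_pairs:
        predicted = classify(sentence_a, sentence_b)
        if predicted != expected:
            mistakes.append((sentence_a, sentence_b, expected, predicted))
    return mistakes


# Hypothetical stand-in for model.classify: always answers 'neutral'
def stub_classify(sentence_a, sentence_b):
    return 'neutral'


# Expected labels are assumptions based on a human reading of each pair
labelled_pairs = [
    ("I like ice-cream.", "It looks like it's gonna rain.", 'neutral'),
    ('Butch is married to Barb.', 'Barb is not married to Butch.', 'contradiction'),
]

mistakes = spot_check(stub_classify, labelled_pairs)
for sentence_a, sentence_b, expected, predicted in mistakes:
    print(f'{sentence_a} | {sentence_b} | expected {expected}, got {predicted}')
```

Passing `model.classify` instead of `stub_classify` would run the same check against the loaded model.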
