# sentence embeddings via NLI from scratch
Rather than using multiple tasks to induce generality in its sentence embeddings, InferSent takes on one very complex task, popularised by the SNLI dataset. The algorithm learns to categorise a pair of sentences (one following the other) as a contradiction, an entailment, or neutral with respect to one another. It's an absurdly subtle task, but InferSent and many other surprisingly simple deep learning systems have managed to achieve good classification performance on it, and consequently produce meaningful sentence embeddings.  
However, the sentences in SNLI are of quite predictable length, so a model trained on it alone struggles to interpret single- or double-word queries effectively. We need to embed queries, not just sentences, so some modifications to the training data are necessary. We can supplement SNLI with MultiNLI (providing a broader range of language and context), COCO (a natural choice when working with image captions/search; all combinatorial pairs of captions for the same image are treated as entailments), and nouns and adjective-noun pairs extracted from the sequences, each paired with its source sequence (again, all pairs treated as entailments). This should increase the granularity and flexibility of the embeddings.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (20, 14)

import os
import json
import nltk
import spacy
import itertools
import numpy as np 
import pandas as pd
from PIL import Image
from scipy.spatial.distance import cdist
from tqdm import tqdm_notebook as tqdm
from tqdm import tqdm as tqdm_
tqdm_.pandas()

import io
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader, random_split
from torchvision import models, transforms

nlp = spacy.load('en')
nltk.download('punkt')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# assemble SNLI, MultiNLI and COCO dataframes
Loading in these datasets is fairly simple. The only complication is finding all of the combinatorial pairs of COCO captions, but `itertools.combinations` makes this process much simpler.

In [None]:
multinli = (pd.read_json('/mnt/efs/nlp/natural_language_inference/multinli_1.0/multinli_1.0_train.jsonl', 
                         lines=True)
            [['gold_label', 'sentence1', 'sentence2']]
           )

In [None]:
snli = (pd.read_json('/mnt/efs/nlp/natural_language_inference/snli_1.0/snli_1.0_train.jsonl', 
                     lines=True)
        [['gold_label', 'sentence1', 'sentence2']]
       )

In [None]:
with open('/mnt/efs/nlp/natural_language_inference/coco2014/captions_train2014.json') as f:
    df = pd.DataFrame(json.load(f)['annotations'])

coco, i = {}, 0
for image_id in tqdm(df['image_id'].unique()):
    captions = df[df['image_id'] == image_id]['caption'].values
    for s1, s2 in list(itertools.combinations(captions, 2)):
        coco[i] = {'gold_label': 'entailment',
                   'sentence1': s1, 'sentence2': s2}
        i += 1

coco = pd.DataFrame(coco).T
del df
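
To make the pairing concrete, here's a minimal sketch on a handful of made-up annotations (the ids and captions below are invented for illustration, they are not taken from COCO). Grouping the captions by image first means each image's captions only need to be visited once, and an image with n captions yields n(n-1)/2 entailment pairs.

```python
import itertools
from collections import defaultdict

# made-up annotations, in the same shape as COCO's 'annotations' list
annotations = [
    {'image_id': 1, 'caption': 'a dog runs across a field'},
    {'image_id': 1, 'caption': 'a brown dog running on grass'},
    {'image_id': 1, 'caption': 'a dog playing outside'},
    {'image_id': 2, 'caption': 'a plate of food on a table'},
    {'image_id': 2, 'caption': 'a meal served on a white plate'},
]

# group captions by the image they describe
captions_by_image = defaultdict(list)
for annotation in annotations:
    captions_by_image[annotation['image_id']].append(annotation['caption'])

# every unordered pair of captions for the same image becomes an entailment
pairs = [
    {'gold_label': 'entailment', 'sentence1': s1, 'sentence2': s2}
    for captions in captions_by_image.values()
    for s1, s2 in itertools.combinations(captions, 2)
]

print(len(pairs))  # 3 pairs for image 1, plus 1 pair for image 2
```

Captions from different images are never paired, so no cross-image entailments sneak in.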

# find nouns and adjective-noun pairs in sentences
The hope is that by pairing the individual nouns and adjective-noun pairs in our source data with their full sentence forms, the network will learn to represent them as essentially the same thing. We want our network to be as meaningful for single-word queries as the simple 300d word-vector space would be, and this is the most straightforward way of doing that which I can imagine, without branching off again into multi-task learning.

First we'll grab 20,000 random sequences from the original datasets.

In [None]:
sentences = (pd.concat([multinli, snli, coco])
             .fillna('')
             ['sentence1']
             .sample(20000)
             .values)

subjects = {}
i = 0

We extract the nouns from each sequence (using spaCy's POS tagger) and add them to a dictionary, paired with their original sequence and an `'entailment'` label.

In [None]:
for sentence in tqdm(sentences):
    for word in nlp(sentence):
        if word.pos_ == 'NOUN':
            subjects[i] = {'sentence1': word.text,
                           'sentence2': sentence,
                           'gold_label': 'entailment'}
            i += 1

Now we'll grab the adjective-noun pairs and add them to the same dictionary.

In [None]:
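for sentence in tqdm(sentences):
    words = nlp(sentence)
    # use a separate position variable so that the running counter `i`
    # (the key into `subjects`) isn't overwritten by the loop
    for position in range(len(words) - 1):
        word_1, word_2 = words[position], words[position + 1]
        if word_1.pos_ == 'ADJ' and word_2.pos_ == 'NOUN':
            # store the adjective-noun pair itself as the short sequence
            subjects[i] = {'sentence1': words[position:position + 2].text,
                           'sentence2': sentence,
                           'gold_label': 'entailment'}
            i += 1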
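
The pair-finding logic can be tried out without loading spaCy at all. This is only a sketch: the `tagged` list below is a hypothetical stand-in for a parsed sentence, with hand-written part-of-speech tags, and `adjective_noun_pairs` is a helper name made up for this example.

```python
# hypothetical (text, part-of-speech) tuples standing in for spaCy tokens
tagged = [('a', 'DET'), ('small', 'ADJ'), ('dog', 'NOUN'),
          ('chases', 'VERB'), ('a', 'DET'), ('red', 'ADJ'), ('ball', 'NOUN')]
sentence = ' '.join(text for text, _ in tagged)


def adjective_noun_pairs(tagged_tokens):
    """Return 'adjective noun' strings for each ADJ directly followed by a NOUN."""
    return [
        first_text + ' ' + second_text
        for (first_text, first_pos), (second_text, second_pos)
        in zip(tagged_tokens, tagged_tokens[1:])
        if first_pos == 'ADJ' and second_pos == 'NOUN'
    ]


# each pair is matched with its source sentence and labelled as an entailment
examples = [
    {'sentence1': pair, 'sentence2': sentence, 'gold_label': 'entailment'}
    for pair in adjective_noun_pairs(tagged)
]

print([example['sentence1'] for example in examples])
# → ['small dog', 'red ball']
```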
            

We'll now transform that dictionary into a dataframe so that it can be combined with the ones we loaded in before.

In [None]:
subjects = pd.DataFrame(subjects).T

# the base dataframe
Here's the combined dataframe with all four datasets. As usual, pandas makes manipulation of the data at this stage super simple.

In [None]:
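df = pd.concat([multinli, snli, coco, subjects], ignore_index=True).fillna('')
# '-' marks pairs with no consensus label. We filter with a boolean mask,
# because the source dataframes share index labels and dropping by label
# would also remove unrelated rows
df = df[df['gold_label'] != '-'].reset_index(drop=True)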

# word vectors, vocabulary and text preprocessing
For our text to be meaningfully interpretable by the neural network, we'll represent each word by its fasttext word vector.

First we load in the fasttext vectors, and then process all of our sentences so that they are stored as lists of indexes (mapped to their corresponding word vectors), rather than as raw strings.

In [None]:
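wv_path = '/mnt/efs/nlp/word_vectors/fasttext/wiki-news-300d-1M.vec'
wv_file = io.open(wv_path, 'r', encoding='utf-8', newline='\n', errors='ignore')

# the first line of a .vec file is a header (vocabulary size, dimension), not a word
next(wv_file)

fasttext = {}
for line in tqdm(wv_file):
    word, *values = line.split()
    fasttext[word] = np.array(values, dtype=np.float32)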

pad_value, start_value, end_value = 0.25, 0.5, 0.75
fasttext['<p>'] = np.full(shape=(300,), fill_value=pad_value)
fasttext['<s>'] = np.full(shape=(300,), fill_value=start_value)
fasttext['</s>'] = np.full(shape=(300,), fill_value=end_value)

In [None]:
def preprocess(sentence):
    index_list = ([word_to_index['<s>']] + 
                  [word_to_index[w] for w in word_tokenize(sentence) if w in fasttext] + 
                  [word_to_index['</s>']])
    return index_list

In [None]:
word_to_index = {word: index for index, word in enumerate(list(fasttext.keys()))}
index_to_word = {index: word for index, word in enumerate(list(fasttext.keys()))}

In [None]:
index_to_wordvec = np.zeros((len(fasttext), 300))
for word in tqdm(list(fasttext.keys())):
    index_to_wordvec[word_to_index[word]] = fasttext[word]

In [None]:
df['sentence1'] = df['sentence1'].apply(str.lower)
df['sentence2'] = df['sentence2'].apply(str.lower)

In [None]:
df['sentence1'] = df['sentence1'].progress_apply(preprocess)
df['sentence2'] = df['sentence2'].progress_apply(preprocess)
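
Here's what that preprocessing does to a single sentence, sketched with a toy six-word vocabulary. The names `toy_word_to_index` and `toy_preprocess` are made up for this example, and a plain whitespace split stands in for nltk's `word_tokenize`. Note that out-of-vocabulary words are silently dropped rather than mapped to an unknown token.

```python
# toy vocabulary; the real one is built from the fasttext keys
toy_word_to_index = {'<p>': 0, '<s>': 1, '</s>': 2, 'a': 3, 'dog': 4, 'runs': 5}


def toy_preprocess(sentence, word_to_index):
    """Lowercase, tokenise, and map in-vocabulary words to their indexes."""
    tokens = sentence.lower().split()  # stand-in for word_tokenize
    return ([word_to_index['<s>']] +
            [word_to_index[token] for token in tokens if token in word_to_index] +
            [word_to_index['</s>']])


print(toy_preprocess('A dog runs quickly', toy_word_to_index))
# → [1, 3, 4, 5, 2]  ('quickly' is not in the vocabulary, so it is dropped)
```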

In [None]:
le = LabelEncoder()
df['gold_label'] = le.fit_transform(df['gold_label'].values)
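
It's worth knowing which integer ends up meaning what. `LabelEncoder` assigns codes in sorted order of the class names, so the plain-Python sketch below (an equivalent written out by hand, not sklearn's own code) reproduces the mapping we should expect for the three NLI labels.

```python
# a few example labels, in arbitrary order
labels = ['neutral', 'entailment', 'contradiction', 'entailment']

# sorted, de-duplicated class names define the integer codes
classes = sorted(set(labels))
label_to_index = {label: index for index, label in enumerate(classes)}
encoded = [label_to_index[label] for label in labels]

print(label_to_index)
# → {'contradiction': 0, 'entailment': 1, 'neutral': 2}
print(encoded)
# → [2, 1, 0, 1]
```

This ordering matters later on, because the class weights passed to the loss function need to line up with these codes.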

# dataset and dataloader

In [None]:
class NLIDataset(Dataset):
    def __init__(self, dataframe):
        self.sentence1s = dataframe['sentence1'].values
        self.sentence2s = dataframe['sentence2'].values
        self.labels = dataframe['gold_label'].values

    def __getitem__(self, index):
        s1 = self.sentence1s[index]
        s2 = self.sentence2s[index]
        label = self.labels[index]
        return s1, s2, label

    def __len__(self):
        return len(self.labels)

## shuffle and split dataset

In [None]:
shuffled_df = df.sample(frac=1).reset_index(drop=True)

In [None]:
split_ratio = 0.8
train_size = int(split_ratio * len(df))

In [None]:
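# use positional slicing: label-based `.loc` slices include their end point,
# which would put the row at `train_size` in both sets
train_df = shuffled_df.iloc[:train_size]
test_df  = shuffled_df.iloc[train_size:]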

In [None]:
train_dataset = NLIDataset(train_df)
test_dataset = NLIDataset(test_df)

We've vastly imbalanced the original dataset's classes by adding so many entailments, so we calculate the dataset's class weights to rebalance the training.

In [None]:
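class_frequencies = (train_df['gold_label']
                     .value_counts(normalize=True)
                     .sort_index()
                     .values)

# weight each class by its inverse frequency, so that the over-represented
# entailment class contributes less to the loss per example
class_weights = torch.Tensor(1 / class_frequencies).to(device)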
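
As a worked example of the rebalancing idea, here's a sketch of inverse-frequency weighting on ten made-up labels. I'm assuming inverse frequency is the scheme we want; other choices (e.g. square-root scaling) would be just as easy to swap in.

```python
from collections import Counter

# made-up encoded labels: 0 = contradiction, 1 = entailment, 2 = neutral
toy_labels = [1] * 6 + [0] * 2 + [2] * 2

counts = Counter(toy_labels)
n_examples = len(toy_labels)

# frequency of each class, ordered by class index
frequencies = [counts[c] / n_examples for c in sorted(counts)]

# inverse frequency: the rarer the class, the larger its weight
weights = [1 / frequency for frequency in frequencies]

print(frequencies)  # entailment makes up 60% of the toy data
print(weights)      # so it gets the smallest weight of the three
```

With these weights, each class contributes roughly the same total amount to the loss, however many examples it has.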

# dataloader with custom `collate_fn()`
The custom collate function pads the network's inputs within each batch, ensuring that each batch is rectangular. This could also be done with `pack_padded_sequence()` and friends, but they're strange beasts without many parallels in other frameworks, and documentation/examples are lacking at the moment, so I'd rather write something I understand than misuse something that doesn't make sense to me yet. In all other ways, this dataloader is the same as the ones we've used in the previous notebooks.

In [None]:
def sentence_to_indexes(sentence):
    tokenised = word_tokenize(sentence)
    indexes = [word_to_index[word] 
               for word in tokenised 
               if word in word_to_index]
    return indexes

def pad_sequence(sentences, pad_length=None):
    if pad_length is None:
        pad_length = max([len(sent) for sent in sentences])

    padded = np.full((len(sentences), pad_length), word_to_index['<p>'])
    for i, sentence in enumerate(sentences):
        padded[i][pad_length - len(sentence):] = sentence
    return padded


def custom_collate_fn(batch):
    s1, s2, labels = zip(*batch)
    
    batch_size = len(labels)
    seq_length = max([len(s) for s in (s1 + s2)])

    padded_s1 = pad_sequence(s1, pad_length=seq_length)
    padded_s2 = pad_sequence(s2, pad_length=seq_length)
    
    # index the embedding matrix directly: (batch, seq) -> (batch, seq, 300)
    wv_s1 = index_to_wordvec[padded_s1]
    wv_s2 = index_to_wordvec[padded_s2]
    
    return wv_s1, wv_s2, labels
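
To see what the padding produces, here's a small self-contained sketch. `left_pad` and `toy_vectors` are names invented for this example; the logic mirrors the padding above, with the pad index placed on the left so that the real tokens sit at the end of each row.

```python
import numpy as np


def left_pad(sequences, pad_index, pad_length=None):
    """Left-pad lists of indexes with pad_index so they all share one length."""
    if pad_length is None:
        pad_length = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), pad_length), pad_index)
    for row, seq in enumerate(sequences):
        padded[row, pad_length - len(seq):] = seq
    return padded


batch = [[1, 3, 4, 2], [1, 5, 2]]
padded = left_pad(batch, pad_index=0)
print(padded)  # rows: [1, 3, 4, 2] and [0, 1, 5, 2]

# looking up vectors for a whole padded batch is a single indexing operation
toy_vectors = np.arange(6 * 3, dtype=float).reshape(6, 3)  # 6 words, 3 dimensions
wv_batch = toy_vectors[padded]
print(wv_batch.shape)  # (batch, seq_length, dimensions)
```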

In [None]:
batch_size = 64

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          num_workers=5,
                          shuffle=True,
                          collate_fn=custom_collate_fn)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         num_workers=5,
                         collate_fn=custom_collate_fn)

# build models
We're replicating InferSent's architecture, with a 1-layer, bidirectional LSTM (2048 hidden units in each direction) providing the brains of the network, followed by a simple compression down to the 3-dimensional output (the softmax itself is applied inside the loss function). The sentence embedding and NLI-task networks are kept separate (with one nested inside the other at train-time) for simplicity's sake later on.

In [None]:
hidden_size = 2048

class SentenceEncoder(nn.Module):
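    def __init__(self):
        super(SentenceEncoder, self).__init__()
        # batch_first=True because the collate function yields batches
        # shaped (batch, seq_length, 300)
        self.enc_lstm = nn.LSTM(input_size=300, 
                                hidden_size=hidden_size, 
                                num_layers=1,
                                batch_first=True,
                                bidirectional=True)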
        
    def forward(self, wv_batch):
        embedded, _ = self.enc_lstm(wv_batch)
        max_pooled = torch.max(embedded, 1)[0] 
        return max_pooled


class NLINet(nn.Module):
    def __init__(self, index_to_wordvec):
        super(NLINet, self).__init__()
        self.index_to_wordvec = index_to_wordvec
        self.encoder = SentenceEncoder()
        self.classifier = nn.Sequential(nn.Dropout(0.2),
                                        nn.Linear(hidden_size*8, 128),
                                        nn.ReLU(),
                                        nn.Dropout(0.2),
                                        nn.Linear(128, 3),
                                       )

    def forward(self, s1, s2):
        u, v = self.encoder(s1), self.encoder(s2)
        features = torch.cat((u, v, torch.abs(u - v), u * v), 1)
        return self.classifier(features)

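    def encode(self, sentence):
        # embed a single raw sentence as a batch of one
        indexes = sentence_to_indexes(sentence)
        wvs = torch.Tensor(np.stack([self.index_to_wordvec[i] for i in indexes]))
        wvs = wvs.unsqueeze(0).to(next(self.parameters()).device)
        return self.encoder(wvs)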
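
The shapes involved are easy to lose track of, so here's a numpy sketch of the pooling and feature-combination steps with a deliberately tiny hidden size. The random arrays are stand-ins for the BiLSTM outputs (nothing here is learned), and `toy_hidden_size` is a made-up value chosen to keep the printout small.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_length, toy_hidden_size = 2, 5, 4

# stand-ins for BiLSTM outputs: one vector of size 2 * hidden per token,
# because the forward and backward states are concatenated
outputs_1 = rng.normal(size=(batch_size, seq_length, 2 * toy_hidden_size))
outputs_2 = rng.normal(size=(batch_size, seq_length, 2 * toy_hidden_size))

# max-pool over the sequence axis to get one fixed-size vector per sentence
u = outputs_1.max(axis=1)
v = outputs_2.max(axis=1)

# combine the two sentence vectors: both vectors, their absolute
# difference, and their element-wise product
features = np.concatenate([u, v, np.abs(u - v), u * v], axis=1)

print(u.shape)         # (2, 8)
print(features.shape)  # (2, 32), i.e. (batch_size, 8 * toy_hidden_size)
```

This is where the `hidden_size*8` input size of the classifier comes from: four feature blocks, each of width `2 * hidden_size`.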

# train loop

In [None]:
losses = []

In [None]:
def train(model, train_loader, loss_function, optimiser, n_epochs):
    model.train()
    for epoch in range(n_epochs):
        loop = tqdm(train_loader)
        for s1, s2, target in loop:
            s1 = torch.FloatTensor(s1).cuda(non_blocking=True)
            s2 = torch.FloatTensor(s2).cuda(non_blocking=True)
            target = torch.LongTensor(target).cuda(non_blocking=True)

            optimiser.zero_grad()
            preds = model(s1, s2)

            loss = loss_function(preds, target)
            loss.backward()
            optimiser.step()

            n_correct = target.eq(preds.max(1)[1]).sum().item()
            # divide by the actual batch length, as the final batch may be smaller
            accuracy = 100 * n_correct / len(target)

            loop.set_description('Epoch {}/{}'.format(epoch + 1, n_epochs))
            loop.set_postfix(loss=loss.item(), acc=accuracy)
            losses.append([loss.item(), accuracy])

In [None]:
torch.backends.cudnn.benchmark = True

model = NLINet(index_to_wordvec).to(device)
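# resume from an earlier checkpoint if one exists
checkpoint_path = '/mnt/efs/models/nlinet-2018-10-08.pt'
if os.path.exists(checkpoint_path):
    model.load_state_dict(torch.load(checkpoint_path))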

trainable_parameters = filter(lambda p: p.requires_grad, model.parameters())
optimiser = optim.Adam(trainable_parameters, lr=0.001)
loss_function = nn.CrossEntropyLoss(weight=class_weights)

In [None]:
train(model=model,
      train_loader=train_loader,
      loss_function=loss_function,
      optimiser=optimiser,
      n_epochs=3)

In [None]:
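# `losses` holds [loss, accuracy] pairs, so pick out the loss column before smoothing
loss_data = (pd.DataFrame(losses, columns=['loss', 'accuracy'])
             ['loss']
             .rolling(window=50)
             .mean())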
ax = loss_data.plot();
ax.set_xlim(0,);
ax.set_ylim(0, 1.1);

In [None]:
torch.save(model.state_dict(), '/mnt/efs/models/nlinet-2018-10-08.pt')

# evaluate
We can inspect and evaluate the model by having a direct look at the similarity of a few query sentences.

In [None]:
def embed(sentence):
    # lowercase to match the preprocessing applied to the training data
    indexes = ([word_to_index['<s>']] + 
               sentence_to_indexes(sentence.lower()) +
               [word_to_index['</s>']])
    wvs = np.stack([index_to_wordvec[i] for i in indexes])
    with torch.no_grad():
        embedding = model.encoder(torch.Tensor(wvs).unsqueeze(0).to(device))
    return embedding.cpu().numpy().squeeze()

In [None]:
sentences = (pd.concat([multinli, snli, coco])
             .fillna('')
             ['sentence1']
             .sample(20)
             .values)

embeddings = [embed(sentence) for sentence in sentences]

for i, sentence in enumerate(sentences):
    print(i, sentence)

In [None]:
from scipy.spatial.distance import cdist
distance_matrix = cdist(embeddings, embeddings, metric='cosine')
sns.heatmap(distance_matrix);
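
For reference, the cosine distance that `cdist` computes here is one minus the cosine similarity. Below is a numpy sketch of the same quantity on three hand-picked 2d vectors; `cosine_distance_matrix` is a name made up for this example, not a scipy function.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1 - a @ b.T


toy_embeddings = np.array([[1.0, 0.0],
                           [2.0, 0.0],
                           [0.0, 3.0]])

toy_distance_matrix = cosine_distance_matrix(toy_embeddings, toy_embeddings)
print(np.round(toy_distance_matrix, 3))
```

The first two vectors point the same way, so their distance is 0 despite the different lengths; the third is orthogonal to both, giving a distance of 1. In the heatmap, dark and light cells can be read in exactly the same way.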

It's important to check that we haven't begun overfitting, by comparing the loss on the train set with the loss on the test set.

In [None]:
test_losses = []

# switch off dropout for evaluation; no gradients or optimiser steps are needed
model.eval()
with torch.no_grad():
    loop = tqdm(test_loader)
    for s1, s2, target in loop:
        s1 = torch.FloatTensor(s1).cuda(non_blocking=True)
        s2 = torch.FloatTensor(s2).cuda(non_blocking=True)
        target = torch.LongTensor(target).cuda(non_blocking=True)

        preds = model(s1, s2)

        test_loss = loss_function(preds, target)
        loop.set_postfix(loss=test_loss.item())
        test_losses.append(test_loss.item())

In [None]:
print(np.mean(test_losses))

# test on wellcome titles
We can also have a look at how our model does on the titles of works in the Wellcome Collection catalogue.

In [None]:
meta = pd.read_json('/mnt/efs/other/works.json', lines=True)
meta.index = meta['identifiers'].apply(lambda x: x[0]['value']).rename()

In [None]:
titles = meta['title'].values
title_embeddings = np.array([embed(sentence) for sentence in tqdm(titles)])

In [None]:
query_sentence = 'table'
query_embedding = embed(query_sentence).reshape(1, -1)
distances = cdist(query_embedding, title_embeddings, metric='cosine')
print(query_sentence)

In [None]:
titles[np.argsort(distances)][0][:10]
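
The ranking step is just an `argsort` over one row of distances. Here's a sketch with invented titles and invented distances (none of these come from the catalogue), showing how the closest matches are picked out.

```python
import numpy as np

# invented titles and distances, shaped like the real (1, n_titles) output of cdist
toy_titles = np.array(['a wooden table',
                       'portrait of a man',
                       'kitchen table and chairs',
                       'a map of london'])
toy_distances = np.array([[0.12, 0.81, 0.25, 0.93]])

# argsort returns the positions of the titles from nearest to furthest
ranking = np.argsort(toy_distances[0])
top_2 = toy_titles[ranking][:2]

print(list(top_2))
# → ['a wooden table', 'kitchen table and chairs']
```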

# save model
We'll continue to use the model we've trained here, so let's save all the necessary files.

In [None]:
import pickle

sentence_encoder = model.encoder
torch.save(sentence_encoder.state_dict(), '/mnt/efs/models/sentence-encoder-2018-10-08.pt')

np.save('/mnt/efs/models/index_to_wordvec.npy', index_to_wordvec)
# use context managers so that the files are flushed and closed
with open('/mnt/efs/models/word_to_index.pkl', 'wb') as f:
    pickle.dump(word_to_index, f)
with open('/mnt/efs/models/index_to_word.pkl', 'wb') as f:
    pickle.dump(index_to_word, f)