# sentence embeddings with infersent
[InferSent](https://github.com/facebookresearch/InferSent) is a sentence embedding model created by Facebook Research and trained on the [SNLI](https://nlp.stanford.edu/projects/snli/) dataset. It has been released under a [non-commercial license](https://github.com/facebookresearch/InferSent/blob/master/LICENSE) and is gaining traction as it is used in a growing range of contexts.
Sentence embeddings are the sentence-level counterpart of word embeddings. When a sentence is passed through the network, it is assigned a position in sentence space, close to other sentences with similar meanings. InferSent represents each sentence as a 4096-dimensional feature vector.
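
To make "nearby in sentence space" concrete, here is a minimal sketch that uses made-up 4-dimensional vectors in place of real 4096-dimensional InferSent outputs. The vectors, the example sentences and the `cosine_similarity` helper are all assumptions for illustration only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors invented for this sketch, not real model outputs
toy_space = {
    'a dog runs across the grass': [0.9, 0.1, 0.0, 0.2],
    'a puppy sprints over a lawn': [0.8, 0.2, 0.1, 0.3],
    'a plate of pasta on a table': [0.0, 0.9, 0.8, 0.1],
}

query = toy_space['a dog runs across the grass']
similarities = {s: cosine_similarity(query, v) for s, v in toy_space.items()}
```

In this toy space the two dog sentences score much closer to each other than either does to the pasta sentence, which is the behaviour we hope a real sentence embedding model will show.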

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (20, 20)

import os
import json
import nltk
import numpy as np 
import pandas as pd
from PIL import Image
from scipy.spatial.distance import cdist
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms

nltk.download('punkt')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load InferSent model
We've stored the relevant InferSent code locally in `InferSent.py` so that it can be imported directly (as below); the original is `models.py` in the source repo. We also need the model weights in `infersent2.pkl` and the fastText word vectors on which the model was trained, `crawl-300d-2M.vec`. The InferSent API is simple to use, and a few lines of code give us a working sentence embedding model.

Note that this _is_ a model: we're not loading a dictionary and looking up known keys, as we do with most word vectors. Each time we call `infersent_model.encode()`, the text is passed through a neural network to produce an embedding, even for sentences the model never saw during training.
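
The difference between a lookup table and an encoder can be sketched with a toy example. Everything below is hypothetical: the 3-dimensional `word_vectors` stand in for the 300-dimensional fastText vectors, and `toy_encode` simply max-pools word vectors. It is not InferSent, which runs a BiLSTM before pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-d word vectors standing in for real 300-d fastText vectors
word_vectors = {w: rng.normal(size=3) for w in ['a', 'dog', 'cat', 'sleeps', 'runs']}


def lookup(key):
    """Dictionary-style lookup: only works for keys that were stored."""
    return word_vectors[key]


def toy_encode(sentence):
    """Toy encoder: element-wise max over the sentence's known word vectors."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.max(vectors, axis=0)


# The sentence itself was never stored, yet the encoder still returns a vector
novel = toy_encode('a cat runs')

try:
    lookup('a cat runs')
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

The lookup fails because the whole sentence is not a stored key, while the encoder computes a fresh vector from its parts.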

In [None]:
from InferSent import InferSent

In [None]:
MODEL_PATH =  '/mnt/efs/models/infersent2.pkl'

params_model = {'bsize': 1024, 
                'word_emb_dim': 300, 
                'enc_lstm_dim': 2048,
                'pool_type': 'max', 
                'dpout_model': 0.0, 
                'version': 2}

infersent_model = InferSent(params_model)
# map_location lets the weights load on CPU-only machines as well
infersent_model.load_state_dict(torch.load(MODEL_PATH, map_location=device))

In [None]:
W2V_PATH = '/mnt/efs/nlp/word_vectors/fasttext/crawl-300d-2M.vec'
infersent_model.set_w2v_path(W2V_PATH)

In [None]:
infersent_model.build_vocab_k_words(K=100000)

In [None]:
infersent_model = infersent_model.to(device)

# load coco captions
We'll use the captions from the well-known [COCO dataset](http://cocodataset.org/) to demonstrate InferSent's effectiveness.

In [None]:
with open('/mnt/efs/images/coco/annotations/captions_val2014.json') as f:
    meta = json.load(f)
    
captions = pd.DataFrame(meta['annotations']).set_index('image_id')['caption'].values
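
For readers without the dataset to hand, the sketch below mimics the layout of the `annotations` list with three made-up records and applies the same pandas steps. The records and the `toy_` names are invented for illustration.

```python
import pandas as pd

# Made-up records mimicking the layout of the COCO caption annotations
toy_meta = {
    'annotations': [
        {'image_id': 1, 'id': 101, 'caption': 'A dog runs across the grass.'},
        {'image_id': 1, 'id': 102, 'caption': 'A brown dog playing outside.'},
        {'image_id': 2, 'id': 103, 'caption': 'A plate of pasta on a table.'},
    ]
}

toy_frame = pd.DataFrame(toy_meta['annotations']).set_index('image_id')
toy_captions = toy_frame['caption'].values  # NumPy array of caption strings

# Several captions can share one image_id, so the index is not unique
captions_for_image_1 = toy_frame.loc[1, 'caption'].tolist()
```

Because `.values` discards the index, the resulting array is a flat list of captions with no link back to the images, which is all we need for the text-only experiments in this notebook.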

# embed captions with infersent

In [None]:
embeddings = infersent_model.encode(captions, tokenize=True)

In [None]:
index = np.random.choice(len(captions))

embedding = embeddings[index].reshape(1, -1)
query_caption = captions[index]
query_caption

In [None]:
distances = cdist(embedding, embeddings, 'cosine').squeeze()
closest_captions = captions[np.argsort(distances)]
closest_captions[:10]

The example above shows the advantage of modern sentence embedding models, which build on the semantic meaning encoded in word vectors, over traditional retrieval methods like TF-IDF or BM25, which rely on matching terms.

A great example is the query `'a rainbow is in the sky over an empty stretch of road'`.  
The fourth result (following a few about rainbows) is `'there is a green street light hanging over this empty intersection'`.
Very few of the most significant words in those sentences are exact matches, but the scenes they describe are extremely similar.
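
As a rough check on how little these two captions share lexically, the sketch below computes their word overlap after removing a few common words. This is a plain Jaccard overlap, not TF-IDF or BM25, and the stopword list is a small hand-picked assumption rather than a standard one.

```python
def content_words(sentence, stopwords):
    """Lower-case, split on whitespace, and drop stopwords."""
    return {w for w in sentence.lower().split() if w not in stopwords}


# Small hand-picked stopword list (an assumption for this sketch)
STOPWORDS = {'a', 'an', 'the', 'is', 'in', 'of', 'over', 'there', 'this'}

query = 'a rainbow is in the sky over an empty stretch of road'
result = 'there is a green street light hanging over this empty intersection'

query_words = content_words(query, STOPWORDS)
result_words = content_words(result, STOPWORDS)

shared = query_words & result_words
jaccard = len(shared) / len(query_words | result_words)
```

The only content word the two captions have in common is `empty`, so a method that relies on term matching has very little to work with here.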


# where infersent breaks
While InferSent captures a great deal of subtlety in medium-length sequences, it struggles to encode the same level of meaning in short ones.

In [None]:
single_word_embedding = infersent_model.encode(['doctor'])
distances = cdist(single_word_embedding, embeddings, 'cosine').squeeze()
closest_captions = captions[np.argsort(distances)]
closest_captions[:10]

This is the reverse of the problem posed at the start of this notebook. While word-vector space can only meaningfully encode single-word queries, InferSent can only meaningfully encode longer queries.  
One might suggest pairing the models: at query time, a one-word search is sent to the word-vector model and a multi-word search to the sentence-embedding model. This might solve the problem of encoding arbitrary-length sequences, but the two models _must_ share a space in order to return consistent results.

In other words, we're eventually going to have to create our own custom sentence embedding model if we're going to DeViSE our images into a meaningful search space. Nevertheless, in the next notebook we'll check that applying the DeViSE principle to sentence embedding space still works.
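
The shared-space requirement can be seen in a minimal sketch of the routing idea discussed above. The `route_query` function and the two placeholder encoders are hypothetical; they return zero vectors of the sizes used in this notebook (300 for word vectors, 4096 for InferSent) rather than calling real models.

```python
import numpy as np

WORD_DIM = 300       # matches 'word_emb_dim' used earlier in this notebook
SENTENCE_DIM = 4096  # size of the InferSent sentence embeddings


def embed_word(query):
    """Placeholder for a word-vector model (returns zeros, not a real vector)."""
    return np.zeros(WORD_DIM)


def embed_sentence(query):
    """Placeholder for a sentence embedding model (returns zeros)."""
    return np.zeros(SENTENCE_DIM)


def route_query(query):
    """Send one-word queries to the word model, longer ones to the sentence model."""
    if len(query.split()) == 1:
        return embed_word(query)
    return embed_sentence(query)


short_query = route_query('doctor')
long_query = route_query('a doctor talks to a patient')

# The two outputs live in different spaces, so they cannot be compared directly
comparable = short_query.shape == long_query.shape
```

Even if the two vectors happened to have the same length, distances would only be meaningful if both models had been trained to embed into the same space.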