# Sentence Embeddings

### Outline:

1. Evaluation
1. Unsupervised
1. Strong Baselines
1. Universal Sentence Embeddings

# Evaluation

SentEval, from Facebook Research, is a framework for evaluating sentence embeddings on a range of transfer learning tasks.  

<img src="images/sent_eval_results.png">
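
To make the evaluation setup concrete, here is a minimal sketch of the two callbacks SentEval expects from a sentence encoder: `prepare`, called once per task, and `batcher`, which maps a batch of tokenized sentences to an array of embeddings. The toy word vectors, the dimension `DIM`, and the averaging encoder are assumptions made purely for illustration.

```python
import numpy as np

DIM = 4  # toy embedding size (assumption)
rng = np.random.default_rng(0)
# Hypothetical word vectors; a real setup would load pretrained embeddings.
TOY_VECTORS = {w: rng.normal(size=DIM) for w in ["a", "good", "bad", "movie"]}

def prepare(params, samples):
    """Called once per task with all sentences; nothing to precompute here."""
    return None

def batcher(params, batch):
    """Map a batch of tokenized sentences to a (batch_size, DIM) array."""
    embeddings = []
    for tokens in batch:
        vectors = [TOY_VECTORS[w] for w in tokens if w in TOY_VECTORS]
        # Average known word vectors; fall back to zeros if none are known.
        embeddings.append(np.mean(vectors, axis=0) if vectors else np.zeros(DIM))
    return np.vstack(embeddings)

batch_embeddings = batcher(None, [["a", "good", "movie"], ["unknown"]])

# With SentEval installed and its data downloaded, the two functions would be
# handed to the framework roughly like this:
#   se = senteval.engine.SE(params, batcher, prepare)
#   results = se.eval(["MR", "SST2"])
```

Any encoder discussed below can be plugged in by changing only the body of `batcher`.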

# 1 Unsupervised Embeddings

## 1.1 Skip-Thought

An extension of the skip-gram model to sentences: given the current sentence, predict the adjacent sentences. 

Problem: the decoder is too heavy, which makes training slow.  

<img src="images/skip_thought.png" style="height:300px">

## 1.2 Quick Thought

An optimized version of the Skip-Thought model: drop the decoder and instead classify which candidate is the true next sentence among random alternatives.

<img src="images/quick_thought.png" style="height:300px">
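
The classification objective can be sketched in a few lines. This is only an illustration under stated assumptions: the random vectors stand in for encoder outputs, candidates are scored by inner product with the context vector, and the function name `quick_thought_loss` is hypothetical.

```python
import numpy as np

def quick_thought_loss(context_vec, candidate_vecs, target_index):
    """Cross-entropy of picking the true next sentence among the candidates."""
    scores = candidate_vecs @ context_vec        # one score per candidate
    scores = scores - scores.max()               # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target_index], np.exp(log_probs)

rng = np.random.default_rng(0)
dim = 8
context = rng.normal(size=dim)                   # stand-in for the encoded current sentence
candidates = rng.normal(size=(5, dim))           # stand-ins for encoded candidate sentences
# Make candidate 2 resemble the context so it plays the role of the true next sentence.
candidates[2] = context + 0.1 * rng.normal(size=dim)

loss, probs = quick_thought_loss(context, candidates, target_index=2)
```

Training would minimize this loss over many (context, candidates) pairs; no word-by-word decoding is needed.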

## 1.3 InferSent

Sentence encoder: a bidirectional LSTM with max pooling.
<img src="images/infersent_1.png"  style="height:400px">

Trained on the Stanford Natural Language Inference (SNLI) dataset.
Given 2 sentences (premise and hypothesis), predict whether:
1. the 2nd sentence is an entailment of the 1st
2. the 2nd sentence is a contradiction of the 1st
3. the sentences are neutral (neither entailment nor contradiction)

<img src="images/infersent_2.png"  style="height:300px">
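
A minimal sketch of the two pieces above: max pooling over the encoder's hidden states, and a 3-way prediction from a premise/hypothesis pair. Random matrices stand in for the BiLSTM outputs, the feature combination `[u, v, |u - v|, u * v]` follows the InferSent paper, and the single linear layer is a simplifying assumption in place of the real classifier.

```python
import numpy as np

def max_pool_over_time(hidden_states):
    """Collapse (num_tokens, hidden_dim) states into one sentence vector."""
    return hidden_states.max(axis=0)

def nli_features(u, v):
    """Concatenate both sentence vectors with their interaction terms."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

rng = np.random.default_rng(0)
hidden_dim = 6                                        # stand-in for 2 * LSTM hidden size
premise_states = rng.normal(size=(7, hidden_dim))     # 7 tokens, random stand-in
hypothesis_states = rng.normal(size=(4, hidden_dim))  # 4 tokens, random stand-in

u = max_pool_over_time(premise_states)
v = max_pool_over_time(hypothesis_states)
features = nli_features(u, v)

# Hypothetical linear classifier over: entailment, contradiction, neutral.
W = rng.normal(size=(3, features.size))
logits = W @ features
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

After training, only the sentence encoder is kept; the classifier exists to provide the supervision signal.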

# 2 Strong Baselines

## 2.1 SIF

### In Short

1. Compute sentence embedding as a weighted sum of word embeddings
1. Form a matrix of sentence embeddings X
1. Subtract from each sentence embedding its projection onto the first singular vector of X.

### Theory behind

Idea:  
Treat corpus generation as a random walk of a discourse vector $c_t$. Then the probability of emitting word $w$ at time $t$ is
$$ P(w | c_t) \propto \exp(\langle c_t, v_w \rangle), $$
where $v_w$ is the embedding of word $w$.

Suppose the discourse vector does not change much within a sentence. Then, for a given sentence $s$, all the $c_t$ can be replaced by a single discourse vector $c_s \in \mathbb{R}^d$.  

Let $p(w)$ be the unigram probability of word $w$.  
Let $\alpha, \beta \in \mathbb{R}$ be scalar hyperparameters.  
Let $c_0$ be a common discourse vector: a correction term for the most frequent discourse, which is often related to syntax.  
Let $\hat c_s = \beta c_s + (1 - \beta) c_0$ be the smoothed discourse vector.


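Probability of emitting word $w$ in sentence $s$:
$$ P(w | c_s) = \alpha p(w) + (1 - \alpha) \frac{\exp(\langle \hat c_s, v_w \rangle)}{Z_{\hat c_s}}, $$
where $Z_{\hat c_s} = \sum_{w'} \exp(\langle \hat c_s, v_{w'} \rangle)$ is the normalizing constant.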

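Algorithm (the weight parameter is $a = \frac{1 - \alpha}{\alpha Z}$, which comes out of the maximum likelihood estimate of $c_s$; it is not the mixing weight $\alpha$ itself):

1. $v_s \leftarrow \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} v_w$
1. form the matrix $X$ whose columns are $\{v_s : s \in S\}$
1. $v_s \leftarrow v_s - u u^T v_s$, where $u$ is the first singular vector of $X$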
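
The three steps can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the toy vocabulary, the random word vectors, the unigram probabilities, and the default weight parameter `a=1e-3` are all assumptions. Sentences are stored as rows here, so the first singular vector of the column matrix $X$ corresponds to the first right singular vector `vt[0]`.

```python
import numpy as np

def weighted_averages(sentences, word_vectors, word_probs, a=1e-3):
    """Step 1: frequency-weighted average of word vectors, one row per sentence."""
    dim = len(next(iter(word_vectors.values())))
    X = np.zeros((len(sentences), dim))
    for row, tokens in enumerate(sentences):
        known = [w for w in tokens if w in word_vectors]
        if not known:
            continue  # sentences without known words keep an all-zero row
        weights = np.array([a / (a + word_probs[w]) for w in known])
        vectors = np.array([word_vectors[w] for w in known])
        X[row] = weights @ vectors / len(known)
    return X

def first_singular_vector(X):
    """Step 2: dominant direction shared by all sentence vectors."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

def sif_embeddings(sentences, word_vectors, word_probs, a=1e-3):
    """Step 3: remove each sentence vector's projection onto that direction."""
    X = weighted_averages(sentences, word_vectors, word_probs, a)
    u = first_singular_vector(X)
    return X - np.outer(X @ u, u)

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
word_vectors = {w: rng.normal(size=4) for w in vocab}   # toy embeddings
word_probs = {"the": 0.5, "cat": 0.1, "sat": 0.15, "dog": 0.1, "ran": 0.15}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "ran"]]

embeddings = sif_embeddings(sentences, word_vectors, word_probs)
```

Frequent words such as "the" get a weight close to zero, while rare words get a weight close to one, which is what makes the plain average competitive.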

## 2.2 Random Methods

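1. **Bag of random embedding projections**
$$ h = f_{\text{pool}}(U e_1, \dots, U e_n), $$
where $h \in \mathbb{R}^D$ is the sentence embedding,    
$e_i \in \mathbb{R}^d$ is the embedding of the $i$-th word,  
$U \in \mathbb{R}^{D \times d}$ is a random matrix that is never trained,  
$f_{\text{pool}}$ is a pooling function such as max or mean.  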


1. **Random LSTM**
$$ h = f_{\text{pool}}(\text{LSTM}(e_1, \dots, e_n)), $$
where $h \in \mathbb{R}^D$ is the sentence embedding,   
$e_i \in \mathbb{R}^d$ is the embedding of the $i$-th word,  
$\text{LSTM}$ is a randomly initialized LSTM that is never trained.  
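
Both random encoders fit in a short NumPy sketch. The dimensions, the uniform initialization range $[-1/\sqrt{d}, 1/\sqrt{d}]$, the unidirectional LSTM, and the function names are assumptions made for illustration; the word embeddings are random stand-ins for pretrained ones.

```python
import numpy as np

def borep(word_embeddings, projection, pool="max"):
    """Project every word with a fixed random matrix, then pool over words."""
    projected = word_embeddings @ projection.T          # (n_words, D)
    return projected.max(axis=0) if pool == "max" else projected.mean(axis=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def random_lstm_states(word_embeddings, W, U_rec, b):
    """Run an untrained LSTM; W: (4H, d), U_rec: (4H, H), b: (4H,)."""
    H = U_rec.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for e in word_embeddings:
        z = W @ e + U_rec @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g            # update the cell state
        h = o * np.tanh(c)           # hidden state for this word
        states.append(h)
    return np.stack(states)          # (n_words, H)

rng = np.random.default_rng(0)
d, D, n = 5, 16, 7                   # word dim, sentence dim, sentence length
E = rng.normal(size=(n, d))          # stand-in word embeddings
bound = 1.0 / np.sqrt(d)

projection = rng.uniform(-bound, bound, size=(D, d))
h_borep = borep(E, projection)

W = rng.uniform(-bound, bound, size=(4 * D, d))
U_rec = rng.uniform(-bound, bound, size=(4 * D, D))
b = np.zeros(4 * D)
h_lstm = random_lstm_states(E, W, U_rec, b).max(axis=0)   # max pooling over words
```

Only the classifier trained on top of `h_borep` or `h_lstm` has learned parameters; the encoders themselves stay at their random initialization.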



# 3 Universal Sentence Embeddings

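Sentence embeddings can benefit greatly from multitask learning: one encoder trained on several tasks at once yields more general-purpose representations [GenSen].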
<img src="images/multitask.png"  style="height:300px">

The Universal Sentence Encoder from Google has 2 encoder variants:
1. Transformer encoder  
2. Deep Averaging Network (DAN): an average of word embeddings followed by an MLP


<img src="images/google.png"  style="height:500px">
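
The DAN variant is simple enough to sketch directly. The layer sizes, the ReLU nonlinearity, and the random weights below are assumptions for illustration only; a trained model would learn the weights and the word embeddings.

```python
import numpy as np

def dan_encode(word_embeddings, layers):
    """Average the word embeddings, then pass the result through an MLP."""
    h = word_embeddings.mean(axis=0)
    for W, b in layers:
        h = np.maximum(0.0, W @ h + b)   # ReLU (assumed nonlinearity)
    return h

rng = np.random.default_rng(0)
d, hidden, out = 5, 16, 8                # hypothetical layer sizes
E = rng.normal(size=(6, d))              # stand-in embeddings for a 6-word sentence
layers = [
    (rng.normal(size=(hidden, d)), np.zeros(hidden)),
    (rng.normal(size=(out, hidden)), np.zeros(out)),
]

sentence_vec = dan_encode(E, layers)
# Averaging discards word order, so a shuffled sentence maps to the same vector.
shuffled_vec = dan_encode(E[::-1], layers)
```

The cost grows linearly with sentence length, which is the main appeal of DAN compared with the Transformer encoder.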

# Papers

1. [SIF] A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
1. [GenSen] Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, ICLR 2018
1. [Google] Universal Sentence Encoder
1. No Training Required: Exploring Random Encoders for Sentence Classification, ICLR 2019
1. [Skip Thought] Skip-Thought Vectors, Kiros et al. 2015
1. [Quick Thought] An Efficient Framework for Learning Sentence Representations, ICLR 2018
1. [InferSent] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, EMNLP 2017