# Skip-Thought Vectors
    Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler
    2015

http://papers.nips.cc/paper/5950-skip-thought-vectors.pdf

## Summary
An unsupervised method for learning a sentence-level encoder.

##  Introduction
An approach for unsupervised learning of a generic, distributed sentence encoder.

![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/31481124.jpg)

One difficulty that arises with such an experimental setup is being able to construct a large enough word vocabulary to encode arbitrary sentences. For example, a sentence from a Wikipedia article might contain nouns that are highly unlikely to appear in our book vocabulary. We solve this problem by learning a mapping that transfers word representations from one model to another. Using pre-trained word2vec representations learned with a continuous bag-of-words model, we learn a linear mapping from a word in word2vec space to a word in the encoder’s vocabulary space. The mapping is learned using all words that are shared between vocabularies. After training, any word that appears in word2vec can then get a vector in the encoder word embedding space.

##  Approach
###  Inducing skip-thought vectors
Encoder-decoder models:
- encoder: maps words to a sentence vector
- decoder: generates the surrounding sentences

Define:
- given a sentence tuple $(s_{i-1}, s_i, s_{i+1})$
- $w_i^t$ : the t-th word for sentence $s_i$
- $x_i^t$ : its word embedding

Encoder:
- Let $w_i^1,\cdots,w_i^N$ be the words in sentence $s_i$, where $N$ is the number of words in the sentence.
- At each time step, the encoder produces a hidden state $h_i^t$, which can be interpreted as the representation of the sequence $w_i^1,\cdots,w_i^t$. The hidden state $h_i^N$ thus represents the full sentence.

$r^t=\sigma(W_r x^t + U_r h^{t-1})$ (1)

$z^t=\sigma(W_z x^t + U_z h^{t-1})$ (2)

$\bar{h}^t = tanh(Wx^t + U(r^t \odot h^{t-1}))$ (3)

$h^t=(1-z^t) \odot h^{t-1} + z^t \odot \bar{h}^t$ (4)

- $\bar{h}^t$ is the proposed state update at time t
- $z^t$ is the update gate
- $r^t$ is the reset gate
- $\odot$ denotes a component-wise product
- Both gates take values between zero and one
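
Equations (1)-(4) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the helper names (`init_gru_params`, `gru_encode`), the random initialization, the toy dimensions and the zero initial state $h^0 = 0$ are all assumptions made for the example.

```python
import numpy as np


def sigmoid(a):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + np.exp(-a))


def init_gru_params(d_word, d_hid, rng, scale=0.1):
    """Hypothetical helper: small random weights, for illustration only."""
    p = {}
    for name in ("W_r", "W_z", "W"):
        p[name] = scale * rng.standard_normal((d_hid, d_word))
    for name in ("U_r", "U_z", "U"):
        p[name] = scale * rng.standard_normal((d_hid, d_hid))
    return p


def gru_encode(X, p):
    """Apply equations (1)-(4) to each word embedding in X (shape N x d_word).

    Returns the last hidden state h^N, used as the sentence vector.
    """
    h = np.zeros(p["U"].shape[0])  # assumed initial state h^0 = 0
    for x in X:
        r = sigmoid(p["W_r"] @ x + p["U_r"] @ h)        # (1) reset gate
        z = sigmoid(p["W_z"] @ x + p["U_z"] @ h)        # (2) update gate
        h_bar = np.tanh(p["W"] @ x + p["U"] @ (r * h))  # (3) proposed state
        h = (1.0 - z) * h + z * h_bar                   # (4) new hidden state
    return h


rng = np.random.default_rng(0)
params = init_gru_params(d_word=3, d_hid=4, rng=rng)
sentence = rng.standard_normal((5, 3))  # 5 words, 3-dimensional embeddings
h_N = gru_encode(sentence, params)
print(h_N.shape)
```

Because each new state is a gated mix of the previous state and a `tanh` output, every component of the sentence vector stays within $[-1, 1]$ when starting from zero.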

Decoder:
- $C_z$, $C_r$ and $C$ are matrices used to bias the update gate, reset gate and hidden state computation by the sentence vector $h_i$
- One decoder is used for the next sentence $s_{i+1}$ while a second decoder is used for the previous sentence $s_{i-1}$
- Separate parameters are used for each decoder with the exception of the vocabulary matrix V, which is the weight matrix connecting the decoder’s hidden state for computing a distribution over words
- $h_{i+1}^t$: the hidden state of the decoder at time t

$r^t=\sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_rh_i)$ (5)

$z^t=\sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_zh_i)$ (6)

$\bar{h}^t = tanh(W^dx^{t-1} + U^d(r^t \odot h^{t-1}) + Ch_i)$ (7)

$h_{i+1}^t=(1-z^t) \odot h^{t-1} + z^t \odot \bar{h}^t$ (8)

Given $h_{i+1}^t$, the probability of word $w_{i+1}^t$ given the previous t − 1 words and the encoder vector is

$P(w_{i+1}^t | w_{i+1}^{<t},h_i) \propto exp(v_{w_{i+1}^t}h_{i+1}^t)$ (9)
- $v_{w_{i+1}^t}$: the row of $V$ corresponding to the word $w_{i+1}^t$

An analogous computation is performed for the previous sentence $s_{i-1}$.
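
One decoder step, equations (5)-(8), followed by the normalized form of (9), might look like the sketch below. The function names, the random weights and the toy dimensions are assumptions for illustration; the zero decoder start state and zero start-of-sentence embedding in the usage lines are also assumed, since the notes above do not specify them.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_decoder_params(d_word, d_dec, d_enc, vocab, rng, scale=0.1):
    """Hypothetical helper: random decoder weights, for illustration only."""
    shapes = {
        "W_r": (d_dec, d_word), "W_z": (d_dec, d_word), "W": (d_dec, d_word),
        "U_r": (d_dec, d_dec), "U_z": (d_dec, d_dec), "U": (d_dec, d_dec),
        "C_r": (d_dec, d_enc), "C_z": (d_dec, d_enc), "C": (d_dec, d_enc),
        "V": (vocab, d_dec),  # vocabulary matrix, one row per word
    }
    return {k: scale * rng.standard_normal(s) for k, s in shapes.items()}


def decoder_step(x_prev, h_prev, h_i, p):
    """One conditional GRU step; h_i is the encoder's sentence vector."""
    r = sigmoid(p["W_r"] @ x_prev + p["U_r"] @ h_prev + p["C_r"] @ h_i)  # (5)
    z = sigmoid(p["W_z"] @ x_prev + p["U_z"] @ h_prev + p["C_z"] @ h_i)  # (6)
    h_bar = np.tanh(
        p["W"] @ x_prev + p["U"] @ (r * h_prev) + p["C"] @ h_i
    )                                                                    # (7)
    return (1.0 - z) * h_prev + z * h_bar                                # (8)


def word_distribution(h, V):
    """Normalize exp(v_w . h) over the vocabulary, as in (9)."""
    logits = V @ h
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


rng = np.random.default_rng(0)
p = init_decoder_params(d_word=3, d_dec=4, d_enc=5, vocab=6, rng=rng)
h_i = rng.standard_normal(5)  # stand-in for an encoder output
h = decoder_step(np.zeros(3), np.zeros(4), h_i, p)
probs = word_distribution(h, p["V"])
print(probs.sum())
```

The decoder for $s_{i-1}$ would reuse `decoder_step` with its own copy of every parameter except `V`, which the two decoders share.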

Objective:

Given a tuple $(s_{i-1}, s_i, s_{i+1})$, the objective optimized is the sum of the log-probabilities
for the forward and backward sentences conditioned on the encoder representation:

$\sum_t log P(w_{i+1}^t | w_{i+1}^{<t},h_i) + \sum_t log P(w_{i-1}^t | w_{i-1}^{<t},h_i)$ (10)

The total objective is the above summed over all such training tuples.
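
As a sketch of how objective (10) could be evaluated for one tuple, the code below sums the log-probabilities of the target words under a softmax over decoder scores. The function names are hypothetical, and the inputs are assumed to be precomputed score matrices whose row $t$ holds $V h^t$ for the corresponding decoder.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def sentence_log_prob(logits, targets):
    """Sum over t of log P(w^t | w^{<t}, h_i) for one decoded sentence.

    logits: (T, vocab) array of decoder scores, one row per time step.
    targets: (T,) array with the index of the true word at each step.
    """
    lp = log_softmax(logits)
    return lp[np.arange(len(targets)), targets].sum()


def skip_thought_objective(logits_next, targets_next, logits_prev, targets_prev):
    """Objective (10): next-sentence term plus previous-sentence term."""
    return (sentence_log_prob(logits_next, targets_next)
            + sentence_log_prob(logits_prev, targets_prev))


# Toy check: with all-zero scores, every word has probability 1 / vocab.
vocab = 4
obj = skip_thought_objective(
    np.zeros((3, vocab)), np.array([0, 1, 2]),
    np.zeros((2, vocab)), np.array([3, 0]),
)
print(np.isclose(obj, 5 * np.log(1.0 / vocab)))
```

Training would maximize this quantity summed over all tuples, or equivalently minimize its negative.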

### Vocabulary expansion
Our goal is to construct a linear mapping $f : V_{w2v} \to V_{rnn}$, learned from the words shared by both vocabularies.
- $V_{w2v}$: the word embedding space of the pre-trained word2vec representations
- $V_{rnn}$: the RNN word embedding space
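
One way to fit such a linear mapping is ordinary least squares over the shared words, sketched below. The function names and the synthetic embeddings are assumptions for illustration, and plain unregularized least squares is used here as one reasonable choice of fitting procedure.

```python
import numpy as np


def fit_linear_map(E_w2v, E_rnn):
    """Fit W so that W @ v approximates v' for each shared word.

    E_w2v: (n_shared, d_w2v) word2vec vectors of the shared words.
    E_rnn: (n_shared, d_rnn) encoder embeddings of the same words, same order.
    Returns W with shape (d_rnn, d_w2v).
    """
    # lstsq solves min_M ||E_w2v @ M - E_rnn||^2, where M is W transposed.
    M, *_ = np.linalg.lstsq(E_w2v, E_rnn, rcond=None)
    return M.T


def expand_word(W, v_w2v):
    """Map a word2vec vector into the encoder's embedding space."""
    return W @ v_w2v


# Synthetic example: embeddings generated from a known linear map.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 5))   # d_rnn = 4, d_w2v = 5
E_w2v = rng.standard_normal((50, 5))   # 50 shared words
E_rnn = E_w2v @ W_true.T
W = fit_linear_map(E_w2v, E_rnn)
print(np.allclose(W, W_true))
```

After fitting, any word that has a word2vec vector but is missing from the encoder's vocabulary can be given an embedding with `expand_word`.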