# A Neural Probabilistic Language Model

    Yoshua Bengio
    Réjean Ducharme
    Pascal Vincent
    Christian Jauvin

    Journal of Machine Learning Research 3 (2003) 1137–1155 Submitted 4/02; Published 2/03

http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

- Language model
    - $\hat{P}(w_1^T)=\prod_{t=1}^T \hat{P}(w_t | w_1^{t-1})$
    - Simplified by conditioning on only the last n-1 words: $\hat{P}(w_t | w_1^{t-1}) \approx \hat{P}(w_t | w_{t-n+1}^{t-1})$
- Distributed Representations
    - Represent each word in the vocabulary as a `distributed word feature vector`
    - Express the language model in terms of these `word feature vectors`
    - Learn the `word feature vectors` and the parameters of the language model simultaneously
- A Neural Model
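    - A mapping C represents each word in the vocabulary as a distributed feature vector, i.e. $C(i) \in \mathbb{R}^m$
    - To predict the next word $w_t$, g outputs a vector whose i-th element estimates $\hat{P}(w_t = i|w_1^{t-1})$; the word at the index with the largest value is the predicted word
    - In one formula: $f(i,w_{t-1},\cdots,w_{t-n+1}) = g(i,C(w_{t-1}),\cdots,C(w_{t-n+1}))$
        - The function f is the composition of two mappings: C and g
        - Matrix C: |V| rows and m columns, one row per word
        - g can be a feed-forward or recurrent neural network, with parameters ω
        - The overall parameter set: θ = (C, ω)
    - Find θ through the penalized log-likelihood, such that
        - Maximize: $L=\frac{1}{T}\sum_t \log f(w_t,w_{t-1},\cdots,w_{t-n+1};\theta)+R(\theta)$
        - R(θ) is a regularization term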
- A softmax is applied to the output: $\hat{P}(w_t|w_{t-1},\cdots,w_{t-n+1})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
- The $y_i$ are the unnormalized log-probabilities for each output word i
- $y=b+Wx+U\tanh(d+Hx)$
    - W may be zero (no direct connections)
    - $x=(C(w_{t-1}),C(w_{t-2}),\cdots,C(w_{t-n+1}))$

## Introduction
A statistical model of language can be represented by the conditional probability of the next word given all the previous ones, since

$\hat{P}(w_1^T)=\prod_{t=1}^T \hat{P}(w_t | w_1^{t-1})$

where $w_t$ is the t-th word, and $w_i^j = (w_i, w_{i+1}, \cdots, w_{j-1}, w_j)$ denotes the sub-sequence from word i to word j.
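
As a small worked instance of this factorization (an illustration, not taken from the paper), a three-word sequence with T = 3 expands to one factor per word, where the first factor has an empty history:

$\hat{P}(w_1^3)=\hat{P}(w_1)\,\hat{P}(w_2 | w_1)\,\hat{P}(w_3 | w_1^2)$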

`n-gram` models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e. combinations of the last n − 1 words:

$\hat{P}(w_t | w_1^{t-1}) \approx \hat{P}(w_t | w_{t-n+1}^{t-1})$
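
To make the table-based view concrete, here is a minimal sketch of a count-based n-gram estimate in plain Python. The function name `ngram_probability`, the toy corpus, and the choice to return 0 for an unseen context are assumptions made for illustration; practical n-gram models add smoothing or back-off, which this sketch leaves out.

```python
from collections import Counter


def ngram_probability(tokens, n):
    """Build a count table and return a function estimating
    P(w_t | last n-1 words) by relative frequency."""
    context_counts = Counter()
    ngram_counts = Counter()
    for t in range(n - 1, len(tokens)):
        context = tuple(tokens[t - n + 1:t])  # the last n-1 words before position t
        context_counts[context] += 1
        ngram_counts[context + (tokens[t],)] += 1

    def prob(word, context):
        # Keep only the last n-1 words of whatever history is passed in.
        context = tuple(context)[-(n - 1):] if n > 1 else ()
        total = context_counts[context]
        if total == 0:
            return 0.0  # unseen context: the table has no entry (no smoothing here)
        return ngram_counts[context + (word,)] / total

    return prob


# Toy corpus (hypothetical), bigram model: n = 2.
corpus = "the cat sat on the mat the cat ran".split()
p = ngram_probability(corpus, n=2)
print(p("cat", ["the"]))  # "the" occurs 3 times as a context, followed by "cat" twice
print(p("dog", ["the"]))  # never observed, so the raw estimate is zero
```

The zero for an unseen pair is the problem the distributed representation is meant to address: a count table cannot generalize to word combinations it has never seen.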

### Fighting the Curse of Dimensionality with Distributed Representations
In a nutshell, the idea of the proposed approach can be summarized as follows:

1. associate with each word in the vocabulary a distributed `word feature vector` (a real-valued vector in $\mathbb{R}^m$),
1. express the joint `probability function` of word sequences in terms of the feature vectors of these words in the sequence, and
1. learn simultaneously the `word feature vectors` and the parameters of that `probability function`.


## A Neural Model
1. The training set is a sequence $w_1 \cdots w_T$ of words $w_t \in V$, where the vocabulary V is a large but finite set.
1. The objective is to learn a good model $f(w_t, \cdots,w_{t-n+1})=\hat{P}(w_t | w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood.

We decompose the function $f(w_t,\cdots , w_{t-n+1})=\hat{P}(w_t|w_1^{t-1})$ into two parts:

1. A mapping C from any element i of V to a real vector $C(i) \in \mathbb{R}^m$. It represents the `distributed feature vectors` associated with each word in the vocabulary. In practice, C is represented by a |V| × m matrix of free parameters.
1. The probability function over words, expressed with C: a function g maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}),\cdots,C(w_{t-1}))$, to a conditional probability distribution over words in V for the next word $w_t$. The output of g is a vector whose i-th element estimates the probability $\hat{P}(w_t = i|w_1^{t-1})$ as in Figure 1.

$f(i,w_{t-1},\cdots,w_{t-n+1}) = g(i,C(w_{t-1}),\cdots,C(w_{t-n+1}))$

- The function f is a composition of these two mappings (C and g), with C being shared across all the words in the context.
- The parameters of the mapping C are simply the feature vectors themselves, represented by a |V| × m matrix C whose row i is the feature vector C(i) for word i.
- The function g may be implemented by a feed-forward or recurrent neural network or another parametrized function, with parameters ω.
- The overall parameter set is θ = (C, ω).
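
The composition of C and g can be sketched in NumPy as follows. The sizes `V`, `m`, `n`, the random initialization, and in particular the stand-in `g` (a single linear layer followed by a softmax, with a hypothetical weight matrix `A`) are assumptions chosen only to show how the shared lookup table feeds the probability function; they are not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 10, 4, 3  # hypothetical vocabulary size, feature dimension, n-gram order

# C is a |V| x m matrix of free parameters; row i is the feature vector C(i).
C = rng.normal(scale=0.1, size=(V, m))


def context_features(C, context):
    """Look up C(w) for each context word (ordered w_{t-1}, ..., w_{t-n+1})
    and concatenate. The same matrix C is used for every position."""
    return np.concatenate([C[w] for w in context])


# Stand-in for g with parameters omega = (A,): linear scores, then softmax.
A = rng.normal(scale=0.1, size=(V, (n - 1) * m))


def g(x):
    scores = A @ x
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()


def f(i, context, C, g):
    """f(i, w_{t-1}, ..., w_{t-n+1}) = g(i, C(w_{t-1}), ..., C(w_{t-n+1}))."""
    return g(context_features(C, context))[i]


context = [3, 7]  # word indices for w_{t-1}, w_{t-2}
print(f(5, context, C, g))                         # estimated P(w_t = 5 | context)
print(sum(f(i, context, C, g) for i in range(V)))  # sums to 1 up to rounding
```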

Training is achieved by looking for θ that maximizes the training corpus penalized log-likelihood:

$L=\frac{1}{T}\sum_t \log f(w_t,w_{t-1},\cdots,w_{t-n+1};\theta)+R(\theta)$

- R(θ) is a regularization term
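
A minimal sketch of evaluating this objective in NumPy is shown below. The notes do not specify the form of R(θ); this example assumes a weight-decay penalty entering with a negative sign (so that maximizing L discourages large weights), and the names `penalized_log_likelihood` and `weight_decay` are hypothetical.

```python
import numpy as np


def penalized_log_likelihood(probs, weights, weight_decay=1e-4):
    """L = (1/T) * sum_t log f(w_t, ...; theta) + R(theta).

    probs   : the model's probability of the observed word at each of the T positions
    weights : list of parameter arrays to penalize
    R(theta) is ASSUMED here to be -weight_decay * (sum of squared weights).
    """
    avg_log_lik = np.mean(np.log(probs))
    R = -weight_decay * sum(np.sum(w ** 2) for w in weights)
    return avg_log_lik + R


# Toy numbers (hypothetical): three positions, one 2x2 weight matrix of ones.
probs = np.array([0.5, 0.25, 0.125])
weights = [np.ones((2, 2))]
L = penalized_log_likelihood(probs, weights, weight_decay=0.01)
print(L)  # mean log-probability is -2*ln(2); the penalty subtracts 0.01 * 4
```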

![Figure 1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/5849358.jpg)

More precisely, the neural network computes the following function, with a `softmax` output layer, which guarantees positive probabilities summing to 1:

$\hat{P}(w_t|w_{t-1},\cdots,w_{t-n+1})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$

The $y_i$ are the unnormalized log-probabilities for each output word i, computed as follows, with parameters b,W,U,d and H:

$y=b+Wx+U\tanh(d+Hx)$

- the hyperbolic tangent tanh is applied element by element
- W is optionally zero (no direct connections)
- x is the word features layer activation vector, which is the concatenation of the input word features from the matrix C: $x=(C(w_{t-1}),C(w_{t-2}),\cdots,C(w_{t-n+1}))$
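
The forward computation above can be sketched in NumPy as follows. The number of hidden units `h`, the toy sizes, and the random initialization are assumptions for illustration; the parameter shapes are simply the ones that make $y=b+Wx+U\tanh(d+Hx)$ well-defined when x has length (n-1)m. This is a sketch of the forward pass only, with no training.

```python
import numpy as np


def nplm_forward(context, C, b, W, U, d, H):
    """Return the softmax distribution over the next word.

    context : word indices ordered w_{t-1}, w_{t-2}, ..., w_{t-n+1}
    C       : |V| x m feature matrix, row i is C(i)
    """
    x = np.concatenate([C[w] for w in context])  # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    a = np.tanh(d + H @ x)                       # tanh applied element by element
    y = b + W @ x + U @ a                        # unnormalized log-probabilities
    y = y - y.max()                              # shift for numerical stability
    p = np.exp(y)
    return p / p.sum()                           # positive, sums to 1


rng = np.random.default_rng(0)
V, m, n, h = 12, 5, 4, 8  # hypothetical sizes

C = rng.normal(scale=0.1, size=(V, m))
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))
d = np.zeros(h)
U = rng.normal(scale=0.1, size=(V, h))
W = np.zeros((V, (n - 1) * m))  # W = 0: no direct connections
b = np.zeros(V)

probs = nplm_forward([3, 7, 1], C, b, W, U, d, H)
print(probs.sum())            # 1 up to rounding
print(int(np.argmax(probs)))  # index of the most probable next word
```

Setting `W` to a nonzero matrix adds the direct connections from the word features to the output; the rest of the code is unchanged.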