# Neural Variational Inference for Text Processing
In this notebook we'll be looking at a neural network alternative to LDA.

Information from [Miao et al. 2016](https://arxiv.org/pdf/1511.06038.pdf)

The basic idea of variational inference (VI) is simple: we want to estimate the true posterior $p(\mathbf{X}|\mathcal{D})$, which is in general intractable. We pick an approximation $q(\mathbf{X})$ from some tractable family and then try to make it as close as possible to the true posterior, usually by minimising the KL divergence $\mathit{D}_{KL}[q(\mathbf{X})||p(\mathbf{X}|\mathcal{D})]$. This reduces inference to an optimisation problem.

Neural variational inference uses an inference network to model the posterior probability, which endows the model with strong generalisation abilities. With the re-parameterisation method (Rezende et al. 2014), the inference network is trained by back-propagating unbiased, low-variance gradients with respect to the latent variables. Within this framework, Miao et al. propose the Neural Variational Document Model (NVDM) for document modelling.
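To make the re-parameterisation idea concrete, here is a minimal NumPy sketch. It assumes a diagonal Gaussian over the latent variable; the function name `reparameterise` and the toy values of `mu` and `log_sigma` are illustrative and not taken from the paper. The sample is written as a deterministic function of the parameters plus independent noise, which is what lets gradients flow back to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.

```python
import numpy as np

def reparameterise(mu, log_sigma, rng):
    """Draw h ~ N(mu, diag(sigma^2)) as h = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)  # noise that does not depend on the parameters
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])         # hypothetical mean
log_sigma = np.array([0.0, -1.0])  # hypothetical log standard deviation

# Empirical check: the samples should have (approximately) the requested moments.
samples = np.stack([reparameterise(mu, log_sigma, rng) for _ in range(20000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
```

In an automatic differentiation framework the same expression would be differentiated with respect to `mu` and `log_sigma` while `eps` is treated as a constant.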

<img src='figures/NVDM.png' width='30%'>

NVDM is an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. This model can be interpreted as a variational auto-encoder: a multi-layer perceptron (MLP) encoder (inference network) compresses the bag of words document representation into a continuous latent distribution, and a softmax decoder (generative model) reconstructs the document by generating the words independently.

A primary feature of NVDM is that each word is generated directly from a **dense continuous document representation** instead of the more common **binary semantic vector**. 

## Neural Variational Inference Framework
- We define a generative model with a latent variable $\boldsymbol{h}$ which can be considered as the stochastic units in deep neural networks.
- Let $\boldsymbol{x}$ and $\boldsymbol{y}$ be the observed parent and child nodes of $\boldsymbol{h}$, respectively.
- The joint distribution of the generative model is 
    $$p_\theta(\boldsymbol{x}, \boldsymbol{y}) = \sum_{\boldsymbol{h}}p_\theta(\boldsymbol{y}|\boldsymbol{h})p_\theta(\boldsymbol{h}|\boldsymbol{x})p(\boldsymbol{x})$$
    
The variational lower bound $\mathcal{L}$ is derived as:
$$
\begin{aligned}
\mathcal{L} & = \mathbb{E}_{q(\boldsymbol{h})}[\log p_\theta(\boldsymbol{y}|\boldsymbol{h})p_\theta(\boldsymbol{h}|\boldsymbol{x})p(\boldsymbol{x}) - \log q(\boldsymbol{h})] \\
 & \le \log\int \frac{q(\boldsymbol{h})}{q(\boldsymbol{h})} p_\theta(\boldsymbol{y}|\boldsymbol{h})p_\theta(\boldsymbol{h}|\boldsymbol{x})p(\boldsymbol{x})d\boldsymbol{h} = \log p_\theta(\boldsymbol{x}, \boldsymbol{y})
\end{aligned}
$$
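The gap between $\log p_\theta(\boldsymbol{x}, \boldsymbol{y})$ and $\mathcal{L}$ is the KL divergence between the variational approximation $q(\boldsymbol{h})$ and the true posterior $p_\theta(\boldsymbol{h}|\boldsymbol{x}, \boldsymbol{y})$, so the bound is tight exactly when $q$ matches the true posterior.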
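The bound can be checked numerically on a toy problem. The sketch below assumes a discrete latent variable with `K` states so that every quantity can be computed exactly; the probabilities are random placeholders and the helper names `elbo` and `kl` are hypothetical. It illustrates that $\mathcal{L} \le \log p_\theta(\boldsymbol{x}, \boldsymbol{y})$ for an arbitrary $q$, that the gap equals the KL divergence from $q$ to the true posterior, and that the bound is attained when $q$ is the true posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                          # number of latent states (toy choice)
p_x = 0.3                                      # p(x) for the observed x
p_h_given_x = rng.dirichlet(np.ones(K))        # p(h | x) for each state of h
p_y_given_h = rng.uniform(0.05, 0.95, size=K)  # p(y | h) for the observed y

joint = p_y_given_h * p_h_given_x * p_x        # p(y|h) p(h|x) p(x), one entry per h
log_evidence = np.log(joint.sum())             # log p(x, y)

def elbo(q):
    """E_q[log p(y|h) p(h|x) p(x) - log q(h)] for a discrete q."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL divergence D_KL[q || p] between two discrete distributions."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

posterior = joint / joint.sum()                # true posterior p(h | x, y)
q_arbitrary = rng.dirichlet(np.ones(K))        # some other distribution over h

gap = log_evidence - elbo(q_arbitrary)         # equals kl(q_arbitrary, posterior)
```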

- $\theta$ parameterises the generative distributions $p_\theta(\boldsymbol{y}|\boldsymbol{h})$ and $p_\theta(\boldsymbol{h}|\boldsymbol{x})$.
- $\phi$ parameterises the variational distribution: we employ a parameterised diagonal Gaussian $\mathcal{N}(\boldsymbol{h}|\boldsymbol{\mu}(\boldsymbol{x}, \boldsymbol{y}), \mathrm{diag}(\boldsymbol{\sigma}^2(\boldsymbol{x}, \boldsymbol{y})))$ as $q_\phi(\boldsymbol{h}|\boldsymbol{x}, \boldsymbol{y})$.

### Three steps to construct the inference network
1. Construct vector representations of the observed variables: $\boldsymbol{u} = f_x(\boldsymbol{x}), \boldsymbol{v} = f_y(\boldsymbol{y})$.
2. Assemble a joint representation: $\boldsymbol{\pi} = g(\boldsymbol{u}, \boldsymbol{v})$.
3. Parameterise the variational distribution over the latent variable: $\boldsymbol{\mu} = l_1(\boldsymbol{\pi})$, $\log\boldsymbol{\sigma} = l_2(\boldsymbol{\pi})$.

$f_x(\cdot)$ and $f_y(\cdot)$ can be any type of deep neural network suitable for the observed data; $g(\cdot)$ is an MLP that concatenates the vector representations of the conditioning variables; $l_1(\cdot)$ and $l_2(\cdot)$ are linear transformations which output the parameters of the Gaussian distribution. By sampling from the variational distribution $\boldsymbol{h}\sim q_\phi(\boldsymbol{h}|\boldsymbol{x}, \boldsymbol{y})$, we are able to carry out stochastic back-propagation to optimise the lower bound.
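The three steps can be sketched as a forward pass in NumPy. Everything concrete here is an assumption made for illustration: single `tanh` layers stand in for $f_x$, $f_y$ and $g$, the layer sizes are arbitrary, the weights are random rather than trained, and the helper names `dense` and `inference_network` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_x, dim_y, dim_u, dim_v, dim_pi, K = 6, 4, 5, 3, 8, 2  # toy sizes

def dense(n_in, n_out):
    """Random weights and zero bias for one affine layer (untrained placeholder)."""
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

W_fx, b_fx = dense(dim_x, dim_u)
W_fy, b_fy = dense(dim_y, dim_v)
W_g, b_g = dense(dim_u + dim_v, dim_pi)
W_mu, b_mu = dense(dim_pi, K)
W_ls, b_ls = dense(dim_pi, K)

def inference_network(x, y):
    # Step 1: vector representations of the observed variables.
    u = np.tanh(W_fx @ x + b_fx)                      # u = f_x(x)
    v = np.tanh(W_fy @ y + b_fy)                      # v = f_y(y)
    # Step 2: joint representation from the concatenated vectors.
    pi = np.tanh(W_g @ np.concatenate([u, v]) + b_g)  # pi = g(u, v)
    # Step 3: linear maps to the Gaussian parameters.
    mu = W_mu @ pi + b_mu                             # mu = l_1(pi)
    log_sigma = W_ls @ pi + b_ls                      # log sigma = l_2(pi)
    return mu, log_sigma

x = rng.normal(size=dim_x)
y = rng.normal(size=dim_y)
mu, log_sigma = inference_network(x, y)

# Sample h ~ q(h | x, y) via re-parameterisation.
h = mu + np.exp(log_sigma) * rng.standard_normal(K)
```

Predicting $\log\boldsymbol{\sigma}$ rather than $\boldsymbol{\sigma}$ keeps the standard deviation positive without any constraint on the linear layer's output.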

## Neural Variational Document Model
- Simple instance of unsupervised learning 
- A continuous hidden variable $\boldsymbol{h}\in \mathbb{R}^K$, which generates all the words in a document independently, is introduced to represent its semantic content.
- Let $\boldsymbol{X} \in \mathbb{R}^{|V|}$ be the bag-of-words representation of a document and $\boldsymbol{x}_i\in\mathbb{R}^{|V|}$ be the one-hot representation of the word at position $i$.

Interpret NVDM as a variational autoencoder: 
- An MLP encoder $q(\boldsymbol{h}|\boldsymbol{X})$ compresses document representations into continuous hidden vectors ($\boldsymbol{X}\rightarrow\boldsymbol{h}$)
- A softmax decoder reconstructs the documents by independently generating the words ($\boldsymbol{h}\rightarrow \{\boldsymbol{x}_i\}$):
    $$p(\boldsymbol{X}|\boldsymbol{h}) = \prod_{i=1}^N p(\boldsymbol{x}_i|\boldsymbol{h})$$
- To maximise the log-likelihood $\log\sum_{\boldsymbol{h}}p(\boldsymbol{X}|\boldsymbol{h})p(\boldsymbol{h})$ of documents, we derive the lower bound:
$$
\mathcal{L} = \mathbb{E}_{q_\phi(\boldsymbol{h}|\boldsymbol{X})}\left[\sum_{i=1}^N\log p_\theta(\boldsymbol{x}_i|\boldsymbol{h})\right]-\mathit{D}_{KL}[q_\phi(\boldsymbol{h}|\boldsymbol{X})||p(\boldsymbol{h})]
$$
where $N$ is the number of words in the document and $p(\boldsymbol{h})$ is a Gaussian prior for $\boldsymbol{h}$. 
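A one-sample Monte Carlo estimate of this lower bound can be sketched in NumPy as follows. Several details are assumptions, since they are not specified above: the prior is taken to be a standard Gaussian $\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})$, for which the KL term has the closed form $\frac{1}{2}\sum_k(\mu_k^2 + \sigma_k^2 - 1 - 2\log\sigma_k)$; the decoder is taken to be a softmax over `R @ h + b`; the encoder is a single `tanh` layer; and all weights are random placeholders. The names `nvdm_elbo`, `R`, `b` and the toy document `X` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 10, 8, 3  # vocabulary size, encoder hidden units, latent dimension (toy values)

# Untrained placeholder parameters.
W_enc, b_enc = rng.normal(scale=0.1, size=(H, V)), np.zeros(H)
W_mu, b_mu = rng.normal(scale=0.1, size=(K, H)), np.zeros(K)
W_ls, b_ls = rng.normal(scale=0.1, size=(K, H)), np.zeros(K)
R, b = rng.normal(scale=0.1, size=(V, K)), np.zeros(V)

def log_softmax(z):
    """Numerically stable log of the softmax of z."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def nvdm_elbo(X, eps):
    """One-sample estimate of the lower bound for a bag-of-words count vector X."""
    # Encoder q(h | X): diagonal Gaussian parameters.
    pi = np.tanh(W_enc @ X + b_enc)
    mu = W_mu @ pi + b_mu
    log_sigma = W_ls @ pi + b_ls
    # Re-parameterised sample h ~ q(h | X).
    h = mu + np.exp(log_sigma) * eps
    # Decoder: log-probability of every vocabulary word given h.
    log_p_words = log_softmax(R @ h + b)
    # sum_i log p(x_i | h): each word's log-probability weighted by its count.
    recon = float(X @ log_p_words)
    # Closed-form KL between the diagonal Gaussian and the N(0, I) prior.
    kl = 0.5 * float(np.sum(mu**2 + np.exp(2.0 * log_sigma) - 1.0 - 2.0 * log_sigma))
    return recon - kl, recon, kl

X = np.array([2, 0, 1, 0, 0, 3, 0, 0, 1, 0], dtype=float)  # toy document with N = 7 words
elbo_value, recon, kl = nvdm_elbo(X, rng.standard_normal(K))
```

Because the words are generated independently given $\boldsymbol{h}$, the reconstruction term only needs the word counts in $\boldsymbol{X}$, not the word order.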