## Probabilistic latent semantic indexing

This model posits that a document label $d$ and a word $w_n$ are conditionally independent given an unobserved topic $z$:

$$
p(d, w_n) = p(d)\sum_{z}p(w_n|z)p(z|d).
$$

The pLSI model attempts to relax the simplifying assumption made in the mixture of unigrams model that each document is generated from only one topic. In a sense, it does capture the possibility that a document may contain multiple topics since $p(z|d)$ serves as the mixture weights of the topics for a particular document $d$. 
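
As a minimal sketch of the factorization above, the joint $p(d, w_n)$ can be computed for a toy model with NumPy. The sizes and probability tables below are hypothetical values chosen only for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical toy sizes: M training documents, k topics, V vocabulary words.
M, k, V = 3, 2, 4

# p(d): distribution over the training-document index (assumed uniform here).
p_d = np.full(M, 1.0 / M)

# p(z|d): one row of topic mixture weights per training document.
p_z_given_d = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.2, 0.8]])

# p(w|z): one word distribution per topic.
p_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],
                        [0.1, 0.1, 0.4, 0.4]])

# p(d, w) = p(d) * sum_z p(w|z) p(z|d); the matrix product does the sum over z.
p_dw = p_d[:, None] * (p_z_given_d @ p_w_given_z)  # shape (M, V)
```

Each row of `p_z_given_d` belongs to one specific training document, which is exactly what makes $d$ an index into the training set.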

However, two problems should be noted:

1. $d$ is a dummy index into the list of documents in the *training set*. Thus, $d$ is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures $p(z|d)$ only for those documents on which it is trained.

2. A second problem, which also stems from indexing the distribution by training documents, is that the number of parameters to be estimated grows linearly with the number of training documents. The parameters for a $k$-topic pLSI model are $k$ multinomial distributions of size $V$ and $M$ mixtures over the $k$ hidden topics. This gives $kV + kM$ parameters and therefore linear growth in $M$.
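
The $kV + kM$ count can be checked with a few lines of arithmetic. The function name `plsi_param_count` and the sizes used below are hypothetical, chosen only to illustrate the linear growth in $M$.

```python
def plsi_param_count(k: int, V: int, M: int) -> int:
    """Count pLSI parameters as in the text: kV word parameters plus kM mixture weights."""
    return k * V + k * M


# Hypothetical sizes: 10 topics, 1000-word vocabulary, growing training set.
counts = [plsi_param_count(k=10, V=1000, M=M) for M in (100, 200, 300)]
```

Every additional training document adds another $k$ parameters, regardless of how large the corpus already is.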

$\therefore$ pLSI is not a well-defined generative model of documents; there is no natural way to use it to assign probability to a previously unseen document. Also, the linear growth in parameters suggests that the model is prone to overfitting.

## LDA
LDA overcomes both problems by treating the topic mixture weights as a $k$-parameter hidden *random variable* rather than a large set of individual parameters which are explicitly linked to the training set.

The graphical model representation of LDA is given below:
<img src="figures/LDA.png">

The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

LDA assumes the following generative process for each document $\mathbf{w}$ in a corpus $\mathcal{D}$:
1. Choose $N\sim$ Poisson($\xi$).
2. Choose $\theta \sim$ Dir($\alpha$).
3. For each of the $N$ words $w_n$:
   - (a) Choose a topic $z_n\sim$ Multinomial($\theta$).
   - (b) Choose a word $w_n$ from $p(w_n|z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
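
The generative process above can be sketched with NumPy's random generator. The function name `generate_document` and its arguments are hypothetical; `beta` is assumed to be a $k \times V$ array whose rows each sum to one, and words and topics are represented as integer indices rather than one-hot vectors.

```python
import numpy as np


def generate_document(alpha, beta, xi, rng):
    """Sample one document following the three steps of the generative process.

    Returns the word indices w, the per-word topic indices z, and the mixture theta.
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    k, V = beta.shape

    N = rng.poisson(xi)                 # Step 1: document length
    theta = rng.dirichlet(alpha)        # Step 2: topic mixture for this document
    z = rng.choice(k, size=N, p=theta)  # Step 3(a): one topic per word position
    # Step 3(b): draw each word from the row of beta selected by its topic.
    w = np.array([rng.choice(V, p=beta[t]) for t in z], dtype=int)
    return w, z, theta


# Usage with made-up parameters: two topics with disjoint vocabularies.
rng = np.random.default_rng(0)
toy_beta = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
w, z, theta = generate_document(alpha=[0.5, 0.5], beta=toy_beta, xi=20, rng=rng)
```

Because $\theta$ is drawn afresh for every document, the same function generates unseen documents as readily as training ones.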

Several simplifying assumptions are made in this basic model:
1. The dimensionality $k$ of the Dirichlet distribution (and thus the dimensionality of the topic variable $z$) is assumed known and fixed.
2. The word probabilities are parameterized by a $k\times V$ matrix $\beta$ where $\beta_{ij} = p(w^j=1 | z^i=1)$, which for now we treat as a fixed quantity that is to be estimated.
3. The Poisson assumption is not critical to anything that follows and more realistic document length distributions can be used as needed.

A $k$-dimensional Dirichlet random variable $\theta$ can take values in the $(k-1)$-simplex (a $k$-vector $\theta$ lies in the $(k-1)$-simplex if $\theta_i\ge 0, \sum_{i=1}^k\theta_i=1$), and has the following probability density on this simplex:

$$
p(\theta|\alpha) = \frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1},
$$
where the parameter $\alpha$ is a $k$-vector with components $\alpha_i > 0$, and where $\Gamma(x)$ is the Gamma function.
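
As a small sanity check on this density, the sketch below evaluates its logarithm using only the standard library. The function name `dirichlet_log_pdf` is hypothetical, and the code assumes every $\theta_i$ is strictly positive so that the logarithms are defined.

```python
from math import exp, lgamma, log


def dirichlet_log_pdf(theta, alpha):
    """Log of the Dirichlet density; assumes theta is on the simplex with theta_i > 0."""
    # log of Gamma(sum_i alpha_i) / prod_i Gamma(alpha_i)
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    # sum_i (alpha_i - 1) * log(theta_i)
    log_kernel = sum((a - 1.0) * log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel


# With alpha = (1, 1, 1) the density is constant and equal to Gamma(3) = 2.
uniform_density = exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
```

Working in log space avoids overflow in the Gamma functions when the components of $\alpha$ are large.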

Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $\mathbf{z}$, and a set of $N$ words $\mathbf{w}$ is given by:
$$
p(\theta, \mathbf{z}, \mathbf{w}|\alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^N p(z_n|\theta)p(w_n|z_n,\beta),
$$
where $p(z_n|\theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:
$$
p(\mathbf{w}|\alpha,\beta) = \int p(\theta|\alpha)\left(\prod_{n=1}^N\sum_{z_n}p(z_n|\theta) p(w_n|z_n,\beta)\right)d\theta.
$$
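
One way to get a feel for this integral is a naive Monte Carlo estimate: draw $\theta$ from the Dirichlet prior and average the bracketed product. This is only an illustrative sketch under stated assumptions, not an inference method prescribed by these notes; the function name `marginal_likelihood_mc`, the sample count, and the seed are hypothetical choices, and words are given as integer indices into the columns of `beta`.

```python
import numpy as np


def marginal_likelihood_mc(w, alpha, beta, num_samples=10_000, seed=0):
    """Estimate p(w | alpha, beta) by averaging over theta ~ Dir(alpha)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=int)
    beta = np.asarray(beta, dtype=float)            # shape (k, V), rows sum to one
    thetas = rng.dirichlet(alpha, size=num_samples)  # shape (S, k)
    # For each sample and word: sum_z p(z|theta) p(w_n|z, beta) = theta @ beta[:, w_n]
    per_word = thetas @ beta[:, w]                   # shape (S, N)
    # Product over words, then average over the sampled thetas.
    return per_word.prod(axis=1).mean()


# If both topics share the same word distribution, theta drops out of the integrand.
same_topics = [[0.7, 0.3], [0.7, 0.3]]
estimate = marginal_likelihood_mc([0, 0, 1], alpha=[1.0, 1.0], beta=same_topics)
```

For long documents the plain product underflows, so a practical version would accumulate log probabilities instead.
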
Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:
$$
p(\mathcal{D}|\alpha,\beta) = \prod_{d=1}^M\int p(\theta_d|\alpha)\left(\prod_{n=1}^{N_d}\sum_{z_{dn}}p(z_{dn}|\theta_d)p(w_{dn}| z_{dn},\beta)\right)d\theta_d.
$$

### Inference
The key inference problem that we need to solve in order to use LDA is that of computing the posterior distribution of the hidden variables given a document:
$$
p(\theta, \mathbf{z}|\mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w}| \alpha, \beta)}{p(\mathbf{w}|\alpha, \beta)}.
$$
Unfortunately, this distribution is intractable to compute in general. 
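
To see where the difficulty arises, it can help to write the normalizer $p(\mathbf{w}|\alpha,\beta)$ out in terms of the model parameters. The following is a sketch obtained by substituting the Dirichlet density and the definition of $\beta_{ij}$ into the marginal distribution of a document, using the one-hot notation $w_n^j$ from the definition of $\beta$:

$$
p(\mathbf{w}|\alpha,\beta) = \frac{\Gamma(\sum_{i=1}^k\alpha_i)}{\prod_{i=1}^k\Gamma(\alpha_i)}\int\left(\prod_{i=1}^k\theta_i^{\alpha_i-1}\right)\left(\prod_{n=1}^N\sum_{i=1}^k\prod_{j=1}^V(\theta_i\beta_{ij})^{w_n^j}\right)d\theta.
$$

Inside the integral, $\theta$ and $\beta$ are coupled through the sum over topics, and expanding the product over the $N$ words yields $k^N$ terms, one for each possible assignment of topics to words.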