## Code written and working on simulated data set

In [83]:
import numpy as np
import scipy as sp
import scipy.special  # make sure sp.special (used for the digamma function psi) is loaded

In [117]:
def PhiGamma(alpha, beta, w, T):
    k = len(alpha)
    N = len(w)
    alpha = np.array(alpha)
    beta = np.array(beta)
    gamma = np.tile(alpha + N/k, ((T+1),1))
    phi = np.array([[[1/k]*k]*N]*(T+1))
    for t in range(T):
        for n in range(N):
            for i in range(k):
                # Column of beta is the vocabulary index of the n-th word, w[n], not the position n.
                phi[t+1,n,i] = beta[i,w[n]] * np.exp(sp.special.psi(gamma[t,i]))
            phi_const = np.sum(phi[t+1,n,])
            phi[t+1,n,] = phi[t+1,n,]/phi_const
        gamma[t+1,] = alpha + np.sum(phi[t+1,], axis=0)
    return(gamma[T,],phi[T,])

In [121]:
#Test function
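T=5
alpha = np.array([1/3,1/3,1/3])
#Rows of beta are not normalized here; the phi update renormalizes, so the result is unaffected.
beta = np.array([[1/2,1/2,1/2,1/2,1/2,1/2], [1/2,1/2,1/2,1/2,1/2,1/2], [1/2,1/2,1/2,1/2,1/2,1/2]])
w = [0,1,2,3] #a document of N=4 words, given as vocabulary indices
PhiGamma(alpha, beta, w, T)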

(array([ 1.66666667,  1.66666667,  1.66666667]),
 array([[ 0.33333333,  0.33333333,  0.33333333],
        [ 0.33333333,  0.33333333,  0.33333333],
        [ 0.33333333,  0.33333333,  0.33333333],
        [ 0.33333333,  0.33333333,  0.33333333]]))
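
The same update can also be written with array operations instead of explicit loops over words and topics. The following is a minimal sketch, not a replacement for `PhiGamma`: the name `e_step` is hypothetical, and it assumes `w` holds 0-based vocabulary indices and that only the final iterate is needed.

```python
import numpy as np
from scipy.special import psi  # digamma function

def e_step(alpha, beta, w, T):
    """Run T iterations of the phi/gamma updates for one document.

    alpha: length-k Dirichlet parameter
    beta:  (k, V) topic-word matrix
    w:     length-N sequence of 0-based vocabulary indices (assumption)
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    w = np.asarray(w, dtype=int)
    k, N = len(alpha), len(w)
    gamma = alpha + N / k              # same initialization as PhiGamma
    phi = np.full((N, k), 1.0 / k)
    for _ in range(T):
        # beta[:, w].T has shape (N, k): row n holds beta[i, w[n]] for each topic i.
        phi = beta[:, w].T * np.exp(psi(gamma))
        phi /= phi.sum(axis=1, keepdims=True)   # normalize over topics
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

gamma, phi = e_step([1/3, 1/3, 1/3], np.full((3, 6), 0.5), [0, 1, 2, 3], T=5)
```

With the test inputs above, every topic is equally likely for every word, so `phi` stays uniform and `gamma` stays at its initial value.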

## 1. Background Introduction
### A. Question of interest
The growth of the Internet has made information retrieval from text far more important. People often want to compare and contrast two documents by structure, text, words, and topics, and more generally to retrieve information from a large text corpus. One approach is to describe each document through its topics: “Thus each word is generated from a single topic, and different words in one document can be generated by different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the ‘reduced description’ associated with the document”.
### B. Basic thoughts and procedure
A significant step forward in this regard was made by modeling each word in a document as a sample from a mixture model, where the mixture components are random variables that can be viewed as representations of “topics”. The Latent Dirichlet Allocation (LDA) model builds on this idea. It is a three-level probabilistic model, used in natural language processing, for collections of discrete data such as text corpora. It provides a short description of each document that preserves the statistical relationships needed to compare and contrast documents in large data sets. In LDA, each document is a mixture of topics whose mixing proportions have a Dirichlet prior distribution. Each topic assigns a probability to every word in the vocabulary, and a word that is not characteristic of any particular topic has a roughly equal probability of being generated by each of them.



## 2. Algorithm 
### A. Models
#### (1). Notation
Formally, we define the following terms:
- A _word_ is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \dots, V\}$. We represent words using unit-basis vectors that have a single component equal to one and all other components equal to zero. Thus, using superscripts to denote components, the $v$th word in the vocabulary is represented by a $V$-vector $w$ such that $w^v = 1$ and $w^u = 0$ for $u\neq v$.
- A _document_ is a sequence of $N$ words denoted by $\textbf{w} = (w_1,w_2,\dots,w_N)$, where $w_n$ is the $n$th word in the sequence.
- A _corpus_ is a collection of M documents denoted by $D = \{\textbf{w}_1,\textbf{w}_2,... ,\textbf{w}_M \}$.
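
To make the notation concrete, here is a small sketch of the unit-basis representation. The helper `one_hot` is a hypothetical name, and the code uses 0-based indices (0, ..., V-1) rather than the 1-based indexing in the definitions above.

```python
import numpy as np

def one_hot(v, V):
    """Return the V-vector with a one in component v and zeros elsewhere."""
    w = np.zeros(V, dtype=int)
    w[v] = 1
    return w

V = 5
# A document of N = 3 words, stored as a sequence of unit-basis vectors.
document = [one_hot(v, V) for v in (2, 0, 2)]
# The same document stored compactly as vocabulary indices.
indices = [int(np.argmax(w)) for w in document]
```

In practice it is usually more convenient to store a document as the list of indices, which is the form the code at the top of this notebook expects for `w`.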

#### (2). Latent Dirichlet allocation
LDA assumes the following generative process for each document $\textbf{w}$ in a corpus D:
1. Choose N $\sim$ Poisson(ξ).
2. Choose θ $\sim$ Dir(α).
3. For each of the N words $w_n$:
 - Choose a topic $z_n$ $\sim$ Multinomial(θ).
 - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
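
The generative process above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the function name `generate_document`, the example values of `alpha`, `beta`, and `xi`, and the use of 0-based vocabulary indices are all choices made here, not part of the model definition.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document from the LDA generative process.

    alpha: length-k Dirichlet parameter
    beta:  (k, V) matrix whose row i is the word distribution of topic i
    xi:    Poisson mean for the document length
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    k, V = beta.shape
    N = rng.poisson(xi)                    # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. theta ~ Dir(alpha)
    z = rng.choice(k, size=N, p=theta)     # 3a. z_n ~ Multinomial(theta)
    # 3b. w_n ~ p(w_n | z_n, beta): draw from the row of beta for topic z_n.
    w = np.array([rng.choice(V, p=beta[z_n]) for z_n in z], dtype=int)
    return w, z, theta

rng = np.random.default_rng(0)
alpha = [0.5, 0.5, 0.5]                    # illustrative values only
beta = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.7, 0.1, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
w, z, theta = generate_document(alpha, beta, xi=20, rng=rng)
```

Each call returns the word indices `w`, the latent topic assignments `z`, and the document's topic proportions `theta`; only `w` would be observed in real data.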

### B. Techniques for inference
#### (1). Variational inference
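
As a summary of what the `PhiGamma` function at the top of this notebook iterates, the updates for a single document can be written as follows. Here $\Psi$ is the digamma function, $k$ is the number of topics, and $\beta_{i w_n}$ is the probability of word $w_n$ under topic $i$. The initialization is the one used in the code; the derivation of these updates is not reproduced here.

$$\phi_{ni}^{0} = \frac{1}{k}, \qquad \gamma_i^{0} = \alpha_i + \frac{N}{k}$$

$$\phi_{ni}^{t+1} \propto \beta_{i w_n} \exp\left(\Psi(\gamma_i^{t})\right), \qquad \sum_{i=1}^{k} \phi_{ni}^{t+1} = 1$$

$$\gamma_i^{t+1} = \alpha_i + \sum_{n=1}^{N} \phi_{ni}^{t+1}$$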

#### (2). Parameter estimation
The derivation of the variational EM algorithm for LDA yields the following iterative algorithm:
  1. (E-step) For each document, find the optimizing values of the variational parameters $\{\gamma^*_d, \phi^*_d : d \in D\}$. This is done as described in the previous section.
  2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β. This corresponds to finding maximum likelihood estimates with expected sufficient statistics for each document under the approximate posterior, which is computed in the E-step.
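
For the M-step, the update for β has a closed form: each entry $\beta_{ij}$ is proportional to the total weight $\phi$ that topic $i$ receives from occurrences of word $j$ across all documents. The sketch below illustrates only that part. The name `update_beta` is hypothetical, documents are assumed to be lists of 0-based vocabulary indices, no smoothing is applied, and the update for α (which needs an iterative method) is left out.

```python
import numpy as np

def update_beta(docs, phis, V):
    """M-step update of beta from the E-step output.

    docs: list of documents, each a sequence of 0-based word indices
    phis: list of (N_d, k) arrays, one per document, from the E-step
    V:    vocabulary size
    """
    k = phis[0].shape[1]
    counts = np.zeros((V, k))
    for w, phi in zip(docs, phis):
        # Accumulate each word's topic weights into the row of its vocabulary index.
        # np.add.at handles repeated indices, unlike counts[w] += phi.
        np.add.at(counts, np.asarray(w, dtype=int), np.asarray(phi, dtype=float))
    beta = counts.T                                   # shape (k, V)
    return beta / beta.sum(axis=1, keepdims=True)     # each topic's row sums to one

docs = [[0, 0, 1]]
phis = [np.array([[0.8, 0.2],
                  [0.8, 0.2],
                  [0.4, 0.6]])]
beta_new = update_beta(docs, phis, V=3)
```

Words that never occur in the corpus get probability zero under every topic in this sketch, which is one reason a smoothed variant is often preferred.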

## 3. Application
### A. Simulated data
### B. Real dataset
## 4. Optimization and Analysis
## 5. Discussion