# Topic Modeling
* [Introduction to Probabilistic Topic Models](http://menome.com/wp/wp-content/uploads/2014/12/Blei2011.pdf)
* [Empirical Study of Topic Modeling in Twitter](http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf)
## Latent Dirichlet Allocation (LDA)
![Topic Modeling Demo Figure](http://www.scottbot.net/HIAL/wp-content/uploads/2011/11/IntroToLDA.png)
### Definitions
We define a **topic** to be a distribution over a fixed vocabulary. Each document is assumed to be generated by the following two-stage process:
1. Randomly choose a distribution over topics
2. For each word in the document:
    * Randomly choose a topic from the distribution over topics
    * Randomly choose a word from the corresponding distribution over the vocabulary
We use the following notation:
* topics $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary (left blocks in the figure)
* topic proportions for the $d$th document: $\theta_d$, where $\theta_{d,k}$ is the proportion of topic $k$ in document $d$
* topic assignments for document $d$: $z_{d}$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$
* the observed $n$th word in document $d$: $w_{d,n}$
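The two-stage process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy vocabulary, the two hand-written topics in `beta`, the Dirichlet hyperparameter `alpha`, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical and not taken from the papers above.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(beta, alpha, n_words, rng):
    """Generate one document; returns (theta_d, z_d, w_d) as in the notation above."""
    num_topics = len(beta)
    # Stage 1: randomly choose a distribution over topics (assumed Dirichlet prior).
    theta_d = sample_dirichlet(alpha, rng)
    z_d, w_d = [], []
    # Stage 2: for each word, choose a topic, then a word from that topic.
    for _ in range(n_words):
        k = rng.choices(range(num_topics), weights=theta_d)[0]
        w = rng.choices(range(len(beta[k])), weights=beta[k])[0]
        z_d.append(k)
        w_d.append(w)
    return theta_d, z_d, w_d

rng = random.Random(0)
vocab = ["gene", "dna", "data", "computer"]   # toy vocabulary (assumption)
beta = [
    [0.5, 0.5, 0.0, 0.0],   # topic 0: a "genetics" topic
    [0.0, 0.0, 0.5, 0.5],   # topic 1: a "computing" topic
]
theta_d, z_d, w_d = generate_document(beta, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([vocab[w] for w in w_d])
```

Each generated word index in `w_d` is drawn from the topic recorded at the same position in `z_d`, so the pair mirrors $z_{d,n}$ and $w_{d,n}$.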
### Probabilistic model of LDA
$p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})=\prod_{i=1}^{K}p(\beta_i)\prod_{d=1}^{D}p(\theta_d)\big(\prod_{n=1}^{N}p(z_{d,n}|\theta_d)p(w_{d,n}|\beta_{1:K},z_{d,n})\big)$  
The graphical model is shown below:  
![Topic Modeling Graphical Model](https://filebox.ece.vt.edu/~s14ece6504/projects/alfadda_topic/main_figure_3.png)
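
The per-word term $p(z_{d,n}|\theta_d)p(w_{d,n}|\beta_{1:K},z_{d,n})$ is easiest to evaluate in log space. Below is a minimal sketch of the log-probability of one document's topic assignments and words given $\theta_d$ and the topics. The priors $p(\beta_i)$ and $p(\theta_d)$ are deliberately left out, and the function name `log_prob_document` is hypothetical.

```python
import math

def log_prob_document(theta_d, beta, z_d, w_d):
    """Sum over words of log p(z_dn | theta_d) + log p(w_dn | beta, z_dn)."""
    total = 0.0
    for k, w in zip(z_d, w_d):
        total += math.log(theta_d[k])   # probability of choosing topic k
        total += math.log(beta[k][w])   # probability of word w under topic k
    return total

theta_d = [0.5, 0.5]
beta = [[0.5, 0.5], [0.25, 0.75]]
value = log_prob_document(theta_d, beta, z_d=[0, 1], w_d=[1, 0])
print(math.isclose(value, math.log(1 / 32)))  # True
```

In the toy call the four factors are 0.5, 0.5, 0.5 and 0.25, whose product is 1/32.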
### Posterior computation
$p(\beta_{1:K},\theta_{1:D},z_{1:D}|w_{1:D})=\frac{p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})}{p(w_{1:D})}$
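
The denominator is the marginal probability of the observed words. As a sketch in the notation above, it is obtained by integrating and summing the joint distribution over all hidden variables:

$p(w_{1:D})=\int\int\sum_{z_{1:D}}p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})\,d\theta_{1:D}\,d\beta_{1:K}$

The sum ranges over every possible assignment of topics to words, so its size grows exponentially with the number of words. The posterior is therefore approximated in practice, for example with Gibbs sampling or variational inference.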
## Topic Modeling Schemes
### MSG
1. Train LDA on **all training messages**
2. Aggregate training messages from the **same user**
3. Aggregate testing messages from the **same user**
4. Use the trained model to infer topic mixtures of each testing message
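
The per-user aggregation steps can be sketched as follows. The `(user, text)` tuple layout and the helper name `aggregate_by_user` are assumptions made for illustration. The notes do not specify how messages are stored.

```python
from collections import defaultdict

def aggregate_by_user(messages):
    """Concatenate all messages written by the same user into one document."""
    grouped = defaultdict(list)
    for user, text in messages:
        grouped[user].append(text)
    return {user: " ".join(texts) for user, texts in grouped.items()}

# Toy data (hypothetical).
train_messages = [
    ("alice", "lda topic models"),
    ("bob", "twitter data"),
    ("alice", "gibbs sampling"),
]
train_profiles = aggregate_by_user(train_messages)
print(train_profiles["alice"])  # lda topic models gibbs sampling
```

The same function can be applied separately to the training and testing messages, so that the two sets are never mixed.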
### USER
1. Train LDA on **aggregated user profiles**
2. Aggregate testing messages from the **same user**
3. Use the trained model to infer topic mixtures of each testing message
### TERM
1. For each term in the training set, aggregate the messages containing that term
2. Train LDA on **training term profiles**
3. Build user profiles for the training and testing sets respectively
4. Use the trained model to infer topic mixtures of each testing message
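
Building term profiles (step 1) can be sketched as below. Whitespace tokenization and the helper name `aggregate_by_term` are assumptions. A real pipeline would likely use a proper tokenizer and stop-word filtering.

```python
from collections import defaultdict

def aggregate_by_term(messages):
    """Map each term to the concatenation of all messages that contain it."""
    grouped = defaultdict(list)
    for text in messages:
        # Use a set so a message is added once per term, even if the term repeats.
        for term in set(text.split()):
            grouped[term].append(text)
    return {term: " ".join(texts) for term, texts in grouped.items()}

# Toy data (hypothetical).
train_messages = ["topic models", "topic topic inference", "twitter data"]
term_profiles = aggregate_by_term(train_messages)
print(term_profiles["topic"])  # topic models topic topic inference
```

Each value in `term_profiles` is then treated as one training document for LDA.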
## Measure Similarity of Topics
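The Jensen-Shannon divergence between two topics $P$ and $Q$ is:  
$D_{JS}(P||Q)=\frac{1}{2}D_{KL}(P||R)+\frac{1}{2}D_{KL}(Q||R)$, with $R=\frac{1}{2}(P+Q)$  
where $D_{KL}(P||R)=\sum_{n=1}^{M}\beta_{P,n}\log\frac{\beta_{P,n}}{\beta_{R,n}}$ and $\beta_{P,n}$ is the probability that $P$ assigns to the $n$th of $M$ vocabulary terms.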
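
A minimal sketch of this computation in plain Python, assuming each topic is given as a list of probabilities over the same vocabulary. The function names are hypothetical, and terms with zero probability are skipped by the usual convention $0\log 0=0$.

```python
import math

def kl_divergence(p, r):
    """D_KL(P || R) with natural log; terms where p_n = 0 contribute 0."""
    return sum(p_n * math.log(p_n / r_n) for p_n, r_n in zip(p, r) if p_n > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    r = [0.5 * (p_n + q_n) for p_n, q_n in zip(p, q)]   # mixture R = (P + Q) / 2
    return 0.5 * kl_divergence(p, r) + 0.5 * kl_divergence(q, r)

identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print(identical)                               # 0.0
print(math.isclose(disjoint, math.log(2)))     # True
```

With the natural logarithm the divergence is 0 for identical topics and reaches its maximum, $\log 2$, for topics with no terms in common.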