# Topic Modeling

## Reading
* [Topic Coherence](https://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf)

## 1 Problem statement
Topic modeling is equivalent to soft clustering:   
given a collection of documents $\{d_j\}$, assign each document a vector of cluster probabilities $v_j = [p(c_1), ... , p(c_K)]$,   
$\sum_k p(c_k) = 1$

Usually,  
* the number of clusters $K$ is fixed in advance and selected by cross-validation.
* each document is described by its word counts in the term-document matrix (bag-of-words model).
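
As a minimal sketch of the bag-of-words representation, the snippet below builds a term-document count matrix from whitespace-tokenized text. The function name `term_document_matrix`, the toy sentences and the rows-are-terms orientation are illustrative assumptions, not something fixed by these notes.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, counts) where counts[w][d] is the number of times
    vocab[w] occurs in docs[d] (rows = terms, columns = documents)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = [[0] * len(docs) for _ in vocab]
    for d, toks in enumerate(tokenized):
        for tok, n in Counter(toks).items():
            counts[index[tok]][d] = n
    return vocab, counts

# Toy corpus (hypothetical)
docs = ["cat sat on the mat", "dog sat on the log", "cat chased the dog"]
vocab, n_wd = term_document_matrix(docs)
```

Word order is discarded: only the counts per document survive, which is all the models below use.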
    


## 2 Latent Semantic Analysis

LSA = term-document matrix + Truncated SVD

<img src=images/lsa.png style="height:200px"/>
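
A minimal sketch of LSA with NumPy, assuming a dense (terms x documents) count matrix; the function name `lsa` and the toy matrix are hypothetical. For large sparse corpora a sparse truncated SVD routine would normally be used instead of the full decomposition shown here.

```python
import numpy as np

def lsa(term_doc, k):
    """Rank-k truncated SVD of a (terms x documents) matrix."""
    U, s, Vt = np.linalg.svd(np.asarray(term_doc, dtype=float),
                             full_matrices=False)
    term_topics = U[:, :k]               # terms x k
    doc_topics = s[:k, None] * Vt[:k]    # k x documents
    return term_topics, s[:k], doc_topics

# Two blocks of words that never co-occur (hypothetical counts)
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]], dtype=float)
term_topics, sing, doc_topics = lsa(X, k=2)
approx = term_topics @ doc_topics        # best rank-2 approximation of X
```

Unlike the probabilistic models that follow, the factors may contain negative values, so they cannot be read as probabilities.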

## 3 PLSA

<img src=images/plsa.png height=400/>

Let  
$w \in W$ - words (terms)  
$d \in D$ - documents  
$t \in T$ - latent topics  

Joint probability of a term and a document, assuming $p(w|t,d) = p(w | t)$ (a word depends on the document only through its topic):
$$p(w,d) = \sum_t p(t) p(w|t) p(d|t) = \sum_t p(d) p(w|t) p(t|d) = \sum_t p(w) p(t|w) p(d|t) $$

Likelihood

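$$ \ln \prod_{d,w}p(w,d)^{n_{dw}} = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln p(w|d) + \sum_{d \in D} n_d \ln p(d) $$

where $n_{dw}$ is the number of occurrences of word $w$ in document $d$. The second term does not depend on the model parameters, so only the first one is maximized:

$$ \sum_{d \in D} \sum_{w \in d} n_{dw} \ln p(w|d) = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} p(w|t) p(t|d) = 
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} $$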


$\sum_w \phi_{wt} = 1$, $\phi_{wt} \geq 0$  
$\sum_t \theta_{td} = 1$, $\theta_{td} \geq 0$

$p(d) = \frac {n_d} {\sum_d {n_d}}$ is the prior distribution over documents, where $n_d$ is the document length  

### Training

E-step

$$p(t|d,w) = \frac {p(w,t|d)} {p(w|d)} = \frac {p(w|t)p(t|d)} {p(w|d)} = \frac {\phi_{wt} \theta_{td}} {\sum_{s \in T} \phi_{ws} \theta_{sd} }$$

$$ n_{dwt} = n_{dw}p(t |d,w) = n_{dw} \frac {\phi_{wt} \theta_{td}} {\sum_{s \in T} \phi_{ws} \theta_{sd} } $$

M-step

$$ \phi_{wt} = \frac {n_{wt}} {n_t}$$
$$ \theta_{td} = \frac {n_{dt}} {n_d}$$

$ n_{dt} = \sum_{w} n_{dwt}$  
$ n_{wt} = \sum_{d} n_{dwt}$  
$ n_{t} = \sum_{w} n_{wt}$  
$ n_{d} = \sum_{t} n_{dt}$  
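
The E-step and M-step above can be sketched in NumPy as follows. This is a minimal dense implementation for small corpora, assuming a (words x documents) count matrix; the names `plsa` and `log_likelihood`, the random initialization and the toy counts are illustrative assumptions.

```python
import numpy as np

def plsa(n_wd, n_topics, n_iter=50, seed=0):
    """EM for PLSA. n_wd: (W x D) counts. Returns phi (W x T), theta (T x D)."""
    rng = np.random.default_rng(seed)
    W, D = n_wd.shape
    phi = rng.random((W, n_topics))
    phi /= phi.sum(axis=0, keepdims=True)
    theta = rng.random((n_topics, D))
    theta /= theta.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(t|d,w) is proportional to phi_wt * theta_td; shape (W, T, D)
        joint = phi[:, :, None] * theta[None, :, :]
        p_tdw = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        n_dwt = n_wd[:, None, :] * p_tdw
        # M-step: aggregate expected counts and renormalise
        n_wt = n_dwt.sum(axis=2)             # W x T
        n_td = n_dwt.sum(axis=0)             # T x D
        phi = n_wt / np.maximum(n_wt.sum(axis=0, keepdims=True), 1e-12)
        theta = n_td / np.maximum(n_td.sum(axis=0, keepdims=True), 1e-12)
    return phi, theta

def log_likelihood(n_wd, phi, theta):
    """sum_{d,w} n_dw ln p(w|d), skipping zero counts."""
    p_wd = phi @ theta
    mask = n_wd > 0
    return float((n_wd[mask] * np.log(p_wd[mask])).sum())

# Toy counts (hypothetical): two loosely separated word groups
n_wd = np.array([[4, 3, 0, 0],
                 [3, 4, 0, 1],
                 [0, 0, 5, 3],
                 [0, 1, 3, 5]], dtype=float)
phi, theta = plsa(n_wd, n_topics=2)
```

The explicit (W x T x D) tensor keeps the code close to the formulas; a practical implementation would avoid materializing it.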





### Properties

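* $\Phi$, $\Theta$ are unstable: the factorization is not unique, so different initializations give different solutions
* easily overfits, since nothing constrains the parameters beyond normalization
* does not scale well: the number of parameters in $\Theta$ grows linearly with the number of documents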

## 4 Latent Dirichlet Allocation

We require that the columns $\phi_t = p(w|t)$ and $\theta_d = p(t|d)$ follow Dirichlet distributions.



A **simplex** is the set of points $\{z |  \sum_{i=1}^k z_i = 1  \wedge z_i \geq0 \}$


<img src=images/dir.jpeg style='height:200px'/>


**Dirichlet distribution** on simplex

$$Dir(z | \alpha) = \frac {\Gamma(\alpha_0)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^k z_i^{\alpha_i-1}$$
, where  
$\alpha_i > 0$  
$\alpha_0 = \sum_i \alpha_i$
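
To get a feel for how $\alpha$ shapes the distribution, here is a small sketch that evaluates the log-density from the formula above and draws samples with NumPy. The helper name `dirichlet_logpdf` and the chosen parameter values are illustrative assumptions.

```python
import math
import numpy as np

def dirichlet_logpdf(z, alpha):
    """ln Dir(z | alpha) for a point z strictly inside the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(zi) for a, zi in zip(alpha, z))

rng = np.random.default_rng(0)
# alpha_i < 1 favours sparse vectors near the corners of the simplex,
# alpha_i > 1 favours vectors near its centre.
sparse_draws = rng.dirichlet([0.1, 0.1, 0.1], size=1000)
flat_draws = rng.dirichlet([10.0, 10.0, 10.0], size=1000)
```

With $\alpha_i = 1$ for all $i$ the density is constant on the simplex, which gives a quick sanity check for the normalizing constant.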


<img src=images/lda.png style='height:200px'/>

where   
$$\theta_d \sim Dir(\theta_d | \alpha) = \frac {\Gamma(\alpha_0)}{\prod_t \Gamma(\alpha_t)}  \prod_t \theta_{td}^{\alpha_t-1}$$

$$\phi_t \sim Dir(\phi_t | \beta) = \frac {\Gamma(\beta_0)}{\prod_w \Gamma(\beta_w)}  \prod_w \phi_{wt}^{\beta_w-1}$$


Likelihood

$$L = \ln \left[ \left( \prod_{i} p(d_i, w_i | \Theta, \Phi) \right) p(\Theta | \alpha) \, p(\Phi | \beta) \right]$$
$$L = \ln \left[ \prod_{d,w} p(d, w | \Theta, \Phi)^{n_{dw}} \prod_d p(\theta_d | \alpha) \prod_t p(\phi_t | \beta) \right]$$
$$L = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} + \sum_{d,t} \ln \theta_{td}^{\alpha_t-1} + \sum_{t,w} \ln \phi_{wt}^{\beta_w - 1} + const $$
$$L = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} + \sum_{d,t} (\alpha_t-1)\ln \theta_{td} + \sum_{t,w} (\beta_w - 1) \ln \phi_{wt} + const $$

### Training

E-step

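$$p(t|dw) = {norm}_{t}(\phi_{wt}\theta_{td}) $$

where ${norm}_t(x_t) = \frac {\max(x_t, 0)} {\sum_s \max(x_s, 0)}$ rescales a vector over the index $t$ so that it is non-negative and sums to one.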

M-step

$$ \phi_{wt} = {norm}_w (n_{wt} + \beta_w - 1) $$
$$ \theta_{td} = {norm}_t (n_{td} + \alpha_t - 1) $$

$n_{wt} = \sum_d n_{dw} p(t|dw)$  
$n_{td} = \sum_w n_{dw} p(t|dw)$ 

### Generation from LDA model

* sample $\theta_d \sim Dir(\alpha)$ for each document and $\phi_t \sim Dir(\beta)$ for each topic
* for each term position $i$ in document $d$
    * sample a topic $t_i \sim p(t|d)$
    * sample a word $w_i \sim p(w|t_i)$
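
The generative process can be sketched directly with NumPy's Dirichlet and categorical sampling. The function name `generate_corpus`, the fixed document length and the hyperparameter values are illustrative assumptions; words and topics are represented by integer ids.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, alpha, beta, seed=0):
    """Sample documents (lists of word ids) from the LDA generative model."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)      # length T
    beta = np.asarray(beta, dtype=float)        # length W
    T, W = len(alpha), len(beta)
    phi = rng.dirichlet(beta, size=T)           # T x W, row t is p(w|t)
    theta = rng.dirichlet(alpha, size=n_docs)   # D x T, row d is p(t|d)
    docs = []
    for d in range(n_docs):
        # one topic per term position, then one word per sampled topic
        topics = rng.choice(T, size=doc_len, p=theta[d])
        words = [int(rng.choice(W, p=phi[t])) for t in topics]
        docs.append(words)
    return docs, phi, theta

docs, phi, theta = generate_corpus(n_docs=5, doc_len=20,
                                   alpha=[0.5] * 3, beta=[0.5] * 10)
```

Note that `phi` and `theta` are stored row-wise here (one distribution per row), i.e. transposed relative to $\Phi$ and $\Theta$ in the formulas.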

## 5 Additive Regularized Topic Models

$$L(\Phi, \Theta) + \sum_{i=1}^r \tau_i R_i(\Phi, \Theta) \rightarrow max$$

$R_i$ - continuously differentiable regularizers, $\tau_i$ - regularization coefficients. 

Example for LDA

$$ \sum_d KL_t(\alpha_t || \theta_{td}) \rightarrow min $$
$$ \sum_t KL_w(\beta_w || \phi_{wt}) \rightarrow min $$

$$ R(\Phi, \Theta) = \beta_0 \sum_t \sum_w \beta_w \ln \phi_{wt} + \alpha_0 \sum_d \sum_t \alpha_t \ln \theta_{td} $$

Example of a sparsing regularizer

$$ R(\Phi, \Theta) = -\beta_0 \sum_t \sum_w \beta_w \ln \phi_{wt} - \alpha_0 \sum_d \sum_t \alpha_t \ln \theta_{td} $$

### Training

E-step

$$p(t|dw) = {norm}_{t}(\phi_{wt}\theta_{td}) $$

M-step

$$ \phi_{wt} = {norm}_w (n_{wt} + \phi_{wt} \frac {\partial R} {\partial \phi_{wt}}) $$
$$ \theta_{td} = {norm}_t (n_{td} + \theta_{td} \frac {\partial R} {\partial \theta_{td}}) $$

$n_{wt} = \sum_d n_{dw} p(t|dw)$  
$n_{td} = \sum_w n_{dw} p(t|dw)$  
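
A minimal NumPy sketch of one regularized EM iteration, under two assumptions that are not spelled out above: the regularizer is passed as callbacks returning the products $\phi_{wt} \frac {\partial R} {\partial \phi_{wt}}$ and $\theta_{td} \frac {\partial R} {\partial \theta_{td}}$ (which avoids dividing by zero entries), and negative values are clipped to zero before normalization. The names `norm_pos`, `artm_step` and the toy data are hypothetical.

```python
import numpy as np

def norm_pos(x, axis=0):
    """Clip negatives to zero, then normalise along `axis` to sum to one."""
    x = np.maximum(x, 0.0)
    return x / np.maximum(x.sum(axis=axis, keepdims=True), 1e-12)

def artm_step(n_wd, phi, theta, phi_reg, theta_reg):
    """One EM iteration. phi_reg(phi) returns phi * dR/dphi (W x T),
    theta_reg(theta) returns theta * dR/dtheta (T x D)."""
    p_wd = np.maximum(phi @ theta, 1e-12)     # p(w|d), W x D
    ratio = n_wd / p_wd
    # E-step folded into the counts: n_wt = sum_d n_dw p(t|d,w), etc.
    n_wt = phi * (ratio @ theta.T)            # W x T
    n_td = theta * (phi.T @ ratio)            # T x D
    phi_new = norm_pos(n_wt + phi_reg(phi), axis=0)
    theta_new = norm_pos(n_td + theta_reg(theta), axis=0)
    return phi_new, theta_new

# LDA-style regularizer: phi * dR/dphi = beta - 1, theta * dR/dtheta = alpha - 1
beta, alpha = 1.1, 0.5
lda_phi_reg = lambda phi: np.full_like(phi, beta - 1.0)
lda_theta_reg = lambda theta: np.full_like(theta, alpha - 1.0)

rng = np.random.default_rng(0)
n_wd = np.array([[4, 3, 0, 0],
                 [3, 4, 0, 1],
                 [0, 0, 5, 3],
                 [0, 1, 3, 5]], dtype=float)
phi = norm_pos(rng.random((4, 2)))
theta = norm_pos(rng.random((2, 4)))
for _ in range(30):
    phi, theta = artm_step(n_wd, phi, theta, lda_phi_reg, lda_theta_reg)
```

Passing callbacks that return zeros recovers the plain PLSA update, and several regularizers can be combined by summing their weighted contributions inside one callback.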

## 6 Evaluation

### Likelihood

$$ L(\Phi, \Theta) = \sum_d \sum_w n_{dw} \ln p(w|d)  $$

### Perplexity

$$ PL(D) = \exp(-\frac 1 n \sum_d \sum_w n_{dw} \ln p(w|d))  $$

$ n = \sum_d \sum_w n_{dw}$
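
A short sketch of the perplexity formula, assuming `phi` is (W x T), `theta` is (T x D) and `n_wd` is a (W x D) count matrix; the function name and the toy model are illustrative. A model that spreads probability uniformly over $W$ words has perplexity exactly $W$, which makes a handy sanity check.

```python
import numpy as np

def perplexity(n_wd, phi, theta):
    """exp(-(1/n) * sum_{d,w} n_dw ln p(w|d)), skipping zero counts."""
    p_wd = phi @ theta
    mask = n_wd > 0
    log_lik = (n_wd[mask] * np.log(p_wd[mask])).sum()
    return float(np.exp(-log_lik / n_wd.sum()))

# Uniform model over 4 words, 2 topics, 3 documents (hypothetical)
phi_uniform = np.full((4, 2), 0.25)
theta_any = np.array([[0.2, 0.5, 0.9],
                      [0.8, 0.5, 0.1]])
counts = np.array([[1, 0, 2],
                   [0, 3, 1],
                   [2, 1, 0],
                   [1, 1, 1]], dtype=float)
pp = perplexity(counts, phi_uniform, theta_any)
```

Lower is better; in practice perplexity is usually reported on held-out documents rather than on the training set.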


### Topic Coherence

Algorithm sketch:
1. Describe each topic by its top-n most probable words.
1. Introduce a similarity measure between words, e.g. based on a co-occurrence matrix or on cosine similarity between word embeddings.
1. Compute the average pairwise similarity of the top-n words for each topic.
1. Average scores over topics
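
The sketch below follows these four steps using cosine similarity between word embeddings. It is one possible instantiation, not a specific published coherence measure; the function name `topic_coherence` and the toy `phi` and `embeddings` matrices are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def topic_coherence(phi, embeddings, top_n=3):
    """phi: (W x T) word-topic matrix; embeddings: (W x dim) word vectors.
    Returns (mean coherence over topics, list of per-topic scores)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = []
    for t in range(phi.shape[1]):
        top = np.argsort(phi[:, t])[::-1][:top_n]     # top-n word ids
        sims = [unit[i] @ unit[j] for i, j in combinations(top, 2)]
        scores.append(float(np.mean(sims)))
    return float(np.mean(scores)), scores

# Words 0,1 share one direction, words 2,3 another (hypothetical embeddings)
embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
phi = np.array([[0.60, 0.5],
                [0.30, 0.1],
                [0.05, 0.3],
                [0.05, 0.1]])
mean_score, per_topic = topic_coherence(phi, embeddings, top_n=2)
```

Here the first topic groups two similar words and scores 1, while the second mixes unrelated words and scores 0.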

## 7 NN for Aspect Learning

<img src=images/aspect.png style="height:300px"/>

where  
sentence embedding 
$$z_s = \sum_{i=1}^n a_i e_{w_i}$$

attention weight
$$a_i = softmax(d_i)$$  
$$d_i = e_{w_i}^T M y_s$$

sentence context
$$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}$$


reconstructed sentence embedding
$$p_t = softmax(W z_s + b)$$  
$$r_s = T^T p_t$$   
$T$ - matrix of aspect embeddings, one row per aspect  


**Training objective**:
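$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T T^T - I ||  $$
where   
$m$ random sentences are sampled as negative examples for each sentence $s$  
$n_i$ is the average of word embeddings in the $i$-th negative sentence  
the last term pushes the aspect embeddings (rows of $T$) towards orthogonality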
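
A forward pass and the per-sentence loss can be sketched in NumPy as below. This is only an illustration of the formulas in this section: the function names `aspect_forward` and `aspect_loss`, the random parameters and all shapes are assumptions, $T$ is taken to hold one aspect embedding per row (so the orthogonality penalty is computed on the $K \times K$ matrix $T T^T$), and no training loop or gradient computation is included.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aspect_forward(E, M, W, b, T):
    """E: (n, dim) word embeddings of one sentence; M: (dim, dim);
    W: (K, dim); b: (K,); T: (K, dim) aspect embeddings."""
    y_s = E.mean(axis=0)             # sentence context
    d = E @ M @ y_s                  # d_i = e_i^T M y_s
    a = softmax(d)                   # attention weights
    z_s = a @ E                      # attention-weighted sentence embedding
    p_t = softmax(W @ z_s + b)       # distribution over aspects
    r_s = T.T @ p_t                  # reconstructed sentence embedding
    return z_s, p_t, r_s

def aspect_loss(z_s, r_s, negatives, T, lam=1.0):
    """Hinge loss over negative samples plus the orthogonality penalty."""
    hinge = sum(max(0.0, 1.0 - r_s @ z_s + r_s @ n_i) for n_i in negatives)
    ortho = np.linalg.norm(T @ T.T - np.eye(T.shape[0]))
    return float(hinge + lam * ortho)

# Random parameters, purely for illustration
rng = np.random.default_rng(0)
n, dim, K, m = 5, 4, 3, 2
E = rng.normal(size=(n, dim))
M = rng.normal(size=(dim, dim))
W, b = rng.normal(size=(K, dim)), np.zeros(K)
T = rng.normal(size=(K, dim))
negatives = [rng.normal(size=(3, dim)).mean(axis=0) for _ in range(m)]
z_s, p_t, r_s = aspect_forward(E, M, W, b, T)
loss = aspect_loss(z_s, r_s, negatives, T, lam=0.1)
```

The vector `p_t` plays the role of the topic distribution $p(t|d)$ from the earlier sections, computed here per sentence.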
