## Topic Modeling on GTEx data (objective: generalized KL)

### Data
GTEx data are prepared using https://github.com/stephenslab/topics/blob/master/analysis/prepare_gtex.Rmd.
The resulting matrix has size `n=11688; p=55863`.


### Results:
```txt
## K = 30
model     time/s  multinomll            poissonll (per count)  poissonll (add eps)
skd.nmf   90950   -5507149657753.165    -Inf
nnmf      21426   -5505629590840.375    -82.00236
betanmf   5810    -5505838194582.725    -82.32185
ccd       62175   -5505895039558.204    -82.41003
flashier  8973    (prior: pn pn)
flashier  6040    (prior: nn pn)
flashier  12580   (prior: pn nn)

## K = 20
model     time/s  multinomll            poissonll (per count)  poissonll (add eps)
skd.nmf   96137   -5529924942649.528    -Inf                   -119.5197
nnmf      15406   -5527555982054.788    -115.5841
betanmf   4429    -5527308834557.994    -115.2055
ccd       43338   -5526679093542.194    -114.2419
maptpx    44775   -5527136017322.085    -114.9409
skd.lda   28504   -5538743477750.327    -132.7184                                   (perplexity = 1)
rsvd      28
```
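
The `-Inf` entries in the table are what a Poisson log-likelihood gives whenever a fitted rate is exactly zero for a non-zero count. Below is a minimal sketch of how such columns could be computed, assuming that "per count" means the mean over all entries and that "add eps" means adding a small constant to the fitted rates. Both readings, the helper `mean_pois_ll`, the toy data, and the value of `eps` are assumptions for illustration; the actual scripts may differ.

```r
set.seed(1)
p <- 6; n <- 4
# Toy positive rates and counts (not GTEx)
Lam <- matrix(rexp(p * n, rate = 0.2), nrow = p)
X <- matrix(rpois(p * n, Lam), nrow = p)

# Hypothetical helper: mean Poisson log-likelihood over all entries,
# optionally adding eps to the fitted rates
mean_pois_ll <- function(X, Lam, eps = 0) {
  mean(dpois(X, Lam + eps, log = TRUE))
}

# Force one non-zero count whose fitted rate is exactly zero
X[1, 1] <- 3
Lam_zero <- Lam
Lam_zero[1, 1] <- 0

mean_pois_ll(X, Lam_zero)              # -Inf
mean_pois_ll(X, Lam_zero, eps = 1e-8)  # finite
```

A single such entry is enough to make the mean `-Inf`, so the "add eps" value is the one that stays comparable across methods.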

### Computation:
See https://github.com/stephenslab/topics/tree/zihao/scripts
* `K = 30`: `fit_gtex_{nnmf,maptpx,ccd,betanmf,skdnmf,skdlda,rsvd}_K30.sbatch`
* `K = 20`: `fit_gtex_{nnmf,maptpx,ccd,betanmf,skdnmf,skdlda,rsvd}.sbatch`



## Topic Modeling on GTEx-simulated data (objective: generalized KL)

### Generate data
I fit `NNLM::nnmf` on GTEx data (K = 20) and use the fitted factor and loading to generate new data, using the function below:

```{r}
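## Simulate a count matrix from fitted Poisson NMF factors.
## Input
##   A: factor  [p,k]
##   W: loading [k,n]
##   seed: random seed, for reproducibility
## Output
##   X: simulated data [p,n], with X[i,j] ~ Poisson((A %*% W)[i,j])
generateForacle <- function(A, W, seed = 0){
    set.seed(seed)
    Lam <- A %*% W   # [p,n] matrix of Poisson rates
    p <- nrow(Lam)
    n <- ncol(Lam)
    ## rpois() reads Lam column by column and matrix() fills column by
    ## column, so X[i,j] is drawn with rate Lam[i,j]
    X <- matrix(rpois(n * p, Lam), nrow = p)
    return(X)
}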
```
See details in
https://github.com/zihao12/topics-simulation-experiments/script/generate_gtex_nnlm.R
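
As a quick sanity check of this sampling scheme, the toy sketch below draws a matrix the same way and confirms that each entry is paired with its own rate: entries whose rate is zero must come out as zero counts. The dimensions and factors are made up for illustration and are not the GTEx fit.

```r
set.seed(0)
p <- 5; n <- 4; k <- 2

# Made-up non-negative factors; row 1 of A is zero, so row 1 of Lam is zero
A <- matrix(runif(p * k), p, k)
A[1, ] <- 0
W <- matrix(runif(k * n, 10, 20), k, n)

# Same sampling step as in the function above
Lam <- A %*% W
X <- matrix(rpois(n * p, Lam), nrow = p)

dim(X)            # 5 4
all(X[1, ] == 0)  # TRUE: zero rates give zero counts
```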

### Results:

```txt
model    time/s  multinomll             poissonll (per count)  poissonll (add eps)
oracle           -5535274871489.7959    -2.108910154212
maptpx   46273   -5535547915113.8789    -2.527085324929
nnmf     15581   -5535274339301.6719    -2.108086114508
ccd      45358   -5.535653e+12          -Inf                   -2.689571
betanmf  5364    -5.535566e+12          -2.554909
skdnmf   34695   -5539070585721.606     -Inf                   -8.588246
```

See details in
https://github.com/zihao12/topics-simulation-experiments/script.
The batch files are `fit_gtex_simulation_{nnmf,maptpx,ccd,betanmf,skdnmf,skdlda}.sbatch`.

## Topic Modeling on GTEx/GTEx-simulated data (objective: Frobenius norm)

### Paper and main idea
I found an interesting method called `randomized nmf` that solves `NMF` with the Frobenius norm as its objective function (https://arxiv.org/abs/1711.02037). It shares a similar idea with `rsvd`: project the original data onto a smaller subspace,
$$ X \approx Q B, $$
where $X \in \mathbb{R}^{m \times n}$ is the data, $Q \in \mathbb{R}^{m \times k}$ has orthonormal columns, and $B \in \mathbb{R}^{k \times n}$.

Then
$$\|X - W H\|_F \approx \|QB - WH\|_F = \|QB - Q\hat{W}H\|_F = \|B - \hat{W}H\|_F,$$
where $W = Q\hat{W}$.

So we only need to solve the small problem (the data become $k \times n$ in dimension), then project $\hat{W}$ back to $W$. The small problem is solved using HALS (Hierarchical Alternating Least Squares).

(Software: https://github.com/erichson/ristretto)
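
The compression step and the norm identity above can be checked on toy data. This is only a sketch in base R under stated assumptions: `Q` comes from a QR decomposition of `X %*% Omega` with a Gaussian test matrix `Omega`, and `B = t(Q) %*% X`. The data, sizes, and the helper `frob` are made up for illustration, and this is not the `ristretto` implementation (no HALS step and no non-negativity handling).

```r
set.seed(1)
m <- 200; n <- 50; k <- 5

# Hypothetical rank-k non-negative test data (not GTEx)
X <- matrix(runif(m * k), m, k) %*% matrix(runif(k * n), k, n)

# Range finder: random test matrix, then an orthonormal basis via QR
Omega <- matrix(rnorm(n * k), n, k)
Q <- qr.Q(qr(X %*% Omega))   # m x k, orthonormal columns
B <- crossprod(Q, X)         # k x n compressed data, B = t(Q) %*% X

frob <- function(M) sqrt(sum(M^2))

# X is exactly rank k here, so Q %*% B recovers it up to rounding
rel_err <- frob(X - Q %*% B) / frob(X)

# Any factor of the form W = Q %*% W_hat gives the same residual norm
# in the large and in the compressed problem
W_hat <- matrix(rnorm(k * k), k, k)
H <- matrix(runif(k * n), k, n)
lhs <- frob(Q %*% B - Q %*% W_hat %*% H)
rhs <- frob(B - W_hat %*% H)
c(rel_err = rel_err, lhs = lhs, rhs = rhs)
```

The equality of `lhs` and `rhs` relies only on the columns of `Q` being orthonormal, which is why the small `k x n` problem can stand in for the large one under the Frobenius norm.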

### Experiments:

#### Small scale experiment:
https://zihao12.github.io/topics-simulation-experiments/Investigate_rnmf_least_square.html

https://zihao12.github.io/topics-simulation-experiments/Investigate_rnmf_poisson.html

#### GTEx and GTEx-simulated data
I call the randomized NMF method `rnmf` and compare it with `nnmf`.

##### GTEx data
```txt
method  (rel)tol  time  n_iter  mse            multinom        poisson (add eps)
rnmf    1e-04     115   724     22756690.7955  -5583491994252  -218.4602
nnmf    9e-07     7409  200     22744071.4794  -5583603643351  -218.9064
```

##### GTEx-simulated data
```txt
method  (rel)tol  time  n_iter  mse       multinom        poisson (add eps)
rnmf    1e-04     106   323     1178.5    -5535304227626  -2.161012
nnmf    0.02      7494  200     176312.8  -5541676793230  -12.78057
```
(`nnmf` in the second case is probably not initialized properly)

## Some questions

### How real is the simulated data?
* The mean Poisson log-likelihood differs greatly between GTEx and GTEx-simulated data, but the medians are close. I looked at the Poisson log-likelihood more closely here:
https://zihao12.github.io/topics-simulation-experiments/Investigate_poisson_ll_matrix.html

It seems that bad fits are widespread across all features and samples. Are the bad fits due to the model or to the optimization?

### Is it a coincidence that `nnmf` fits best when data is simulated from its `F,L`?

The `nnmf` fit on GTEx-simulated data is the only one that beats the oracle (-2.108086 vs. -2.108910 Poisson log-likelihood per count), while the other methods lag behind. This advantage is not seen on the real GTEx data. It is quite suspicious, since the data are generated from the `F,L` of an `nnmf` fit.

### Can we use `rnmf` in generalized KL objective?

* `rnmf` seems to be significantly faster than most NMF methods. Maybe we can adapt it to optimize generalized KL directly, but we may need $Q$ to be slightly different. Multiplying by an orthonormal $Q$ does not change the Frobenius norm, but it does change the generalized KL divergence. To reduce to a small problem, we would want the equation above to hold under generalized KL.

* On small generated data, `rnmf` decreases generalized KL as it optimizes the Frobenius norm when the tolerance is high (say 1e-04). Maybe we can use it to initialize other algorithms?
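
The contrast between the two objectives can be checked numerically. The sketch below uses made-up toy data and a hypothetical helper `gen_kl`. It shows that multiplying by an orthogonal matrix leaves the Frobenius residual unchanged, while the rotated matrices contain negative entries, so generalized KL is not even defined for them. It also checks the standard identity that generalized KL equals the Poisson log-likelihood gap between the saturated fit (rates equal to `X`) and the model.

```r
set.seed(1)
p <- 30; n <- 20; k <- 3

# Toy positive rates and Poisson counts (not GTEx)
Lam <- matrix(runif(p * k), p, k) %*% matrix(runif(k * n, 1, 5), k, n)
X <- matrix(rpois(p * n, Lam), nrow = p)

# Generalized KL divergence D(X || Lam), with the convention 0 * log(0) = 0
gen_kl <- function(X, Lam) {
  sum(ifelse(X > 0, X * log(X / Lam), 0) - X + Lam)
}
frob <- function(M) sqrt(sum(M^2))

# Random p x p orthogonal matrix
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))

# The Frobenius residual is unchanged by the rotation ...
frob_before <- frob(X - Lam)
frob_after <- frob(Q %*% X - Q %*% Lam)

# ... but the rotated rates are no longer all positive
has_negative <- any(Q %*% Lam < 0)

# Generalized KL as a Poisson log-likelihood gap
ll_gap <- sum(dpois(X, X, log = TRUE)) - sum(dpois(X, Lam, log = TRUE))
c(frob_before = frob_before, frob_after = frob_after,
  gen_kl = gen_kl(X, Lam), ll_gap = ll_gap)
```

Because of the log-likelihood identity, minimizing generalized KL is the same as maximizing the Poisson log-likelihood, so any compressed problem for KL would have to keep the data and rates non-negative.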