## Goal and Methods

* I want to investigate how well current topic modeling methods optimize their objective function (here the negative Poisson log-likelihood, or equivalently the beta divergence with beta = 1, up to a constant).

* To do so, I need to apply the methods to realistic data whose underlying distribution is also known.

* To achieve this, I first fit a model to some realistic data (simulated here), then generate new data from this fitted "oracle" model. I then apply topic modeling methods to the new data and compare their performance against the oracle model.
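
For reference, the two objectives named above differ only by a term that does not depend on the fitted rates. The notation is introduced here for illustration: $X$ is the $p \times n$ count matrix, $\Lambda$ is the matrix of fitted Poisson means, and $0 \log 0$ is taken as $0$. A sketch of the relation:

```latex
\begin{aligned}
-\log p(X \mid \Lambda) &= \sum_{i,j} \left( \Lambda_{ij} - X_{ij}\log\Lambda_{ij} + \log X_{ij}! \right), \\
D_{\mathrm{KL}}(X \,\|\, \Lambda) &= \sum_{i,j} \left( X_{ij}\log\frac{X_{ij}}{\Lambda_{ij}} - X_{ij} + \Lambda_{ij} \right), \\
-\log p(X \mid \Lambda) - D_{\mathrm{KL}}(X \,\|\, \Lambda) &= \sum_{i,j} \left( \log X_{ij}! - X_{ij}\log X_{ij} + X_{ij} \right).
\end{aligned}
```

The right-hand side of the last line depends only on the data, so minimizing either objective over $\Lambda$ gives the same solution. The generalized KL divergence is the $\beta = 1$ case of the beta divergence.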

In [5]:
library(NNLM)
library(maptpx)
library(Matrix)
set.seed(12345)


## Utility functions

In [13]:
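## simulate a p x n Poisson count matrix whose mean has rank-k structure
simulate_pois <- function(n,p,k, seed = 0){
    set.seed(seed)
    A = matrix(rnorm(p*k, 0,1), nrow = p)   # p x k loadings (log scale)
    W = matrix(rnorm(k*n, 0,1), nrow = k)   # k x n factors (log scale)
    lam = exp(A) %*% exp(W)                 # p x n Poisson means
    X = matrix(rpois(n*p,lam), nrow = p)
    return(list(X = X, lam = lam))
}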

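## Poisson log-likelihood of counts X under means lam
pois_lk <- function(X,lam){
    return(sum(dpois(X,lam, log= TRUE)))
}

## draw a new p x n count matrix from fitted factors A (p x k) and W (k x n)
generateForacle <- function(A,W, seed = 0){
    set.seed(seed)
    Lam = A %*% W
    p = nrow(Lam)
    n = ncol(Lam)
    X = matrix(rpois(n*p,Lam), nrow = p)
    return(X)
}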

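## turn fit from multinomial model to Poisson model:
## scale each column of the fitted probabilities A %*% W (p x n)
## by the total count of the corresponding sample
multinom2poisson_ll <- function(X,A,W){
    return(A %*% W %*% diag(colSums(X)))
}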
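
As a quick sanity check on the `pois_lk` utility, the sketch below verifies numerically that the negative Poisson log-likelihood and the generalized KL divergence differ by an amount that does not depend on the fitted means. The helper `kl_div` and the toy matrices are introduced only for this illustration and use base R alone.

```r
## Poisson log-likelihood, same definition as pois_lk above
pois_lk <- function(X, lam) sum(dpois(X, lam, log = TRUE))

## generalized KL divergence D(X || lam), with 0 * log(0) taken as 0
## (kl_div is a helper name introduced for this sketch)
kl_div <- function(X, lam) {
    xlogx <- ifelse(X > 0, X * log(X / lam), 0)
    sum(xlogx - X + lam)
}

## toy data and two arbitrary sets of candidate means
set.seed(1)
X    <- matrix(rpois(20, 3), nrow = 4)
lam1 <- matrix(runif(20, 1, 5), nrow = 4)
lam2 <- matrix(runif(20, 1, 5), nrow = 4)

## the quantity -loglik - KL should not depend on the means
gap1 <- -pois_lk(X, lam1) - kl_div(X, lam1)
gap2 <- -pois_lk(X, lam2) - kl_div(X, lam2)
print(all.equal(gap1, gap2))
```

Because the gap is constant, ranking fits by `pois_lk` is the same as ranking them by KL divergence on the same data.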

## Generate new data from true data

In [3]:
## "true" data
n = 1000
p = 5000
k = 5
X0 = simulate_pois(n,p,k)$X

## fit model to "true" data
oracle = nnmf(X0,k,method = "scd", loss = "mkl", rel.tol = 1e-3, 
           n.threads = 0, max.iter = 200, 
           inner.max.iter = 4,trace = 1,verbose = 0)

## generate data from oracle
Xnew = generateForacle(oracle$W,oracle$H)

In [6]:
start = proc.time()
fit_nnlm = nnmf(Xnew,k,method = "scd", loss = "mkl", rel.tol = 1e-8, 
           n.threads = 0, max.iter = 200, 
           inner.max.iter = 4,trace = 1,verbose = 0)
print(paste0("time elapsed: ", (proc.time() - start)[["elapsed"]]))

“Target tolerance not reached. Try a larger max.iter.”

[1] "time elapsed: 84.066"


In [9]:
start = proc.time()
fit_maptpx <- topics(t(Xnew),k,shape = 0.1,tol = 1e-4,
                tmax = 100,verb = 2)
print(paste0("time elapsed: ", (proc.time() - start)[["elapsed"]]))


Estimating on a 1000 samples collection.
Fitting the 5 clusters/topics model.
log posterior increase: 272113.9498, 308768.4128, 374645.0363, 38180.3242, 6532.8724, 2192.0793, 965.7739, 492.71, 276.3672, 167.0913, 106.3126, 69.6902, 5.0034, 3.2136, 0.9079, 2.6829, 0.5513, 1.9407, 0.317, 1.9692, done. (L = -586333697.6143)
[1] "time elapsed: 78.175"


In [15]:
print(paste0("oracle loglikelihood: ", pois_lk(Xnew,oracle$W %*% oracle$H)))
print(paste0("nnlm loglikelihood  : ", pois_lk(Xnew,fit_nnlm$W %*% fit_nnlm$H)))



[1] "oracle loglikelihood: -12815132.3336176"
[1] "nnlm loglikelihood: -12799974.747101"
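
The maptpx fit is not scored above. A minimal sketch of how a multinomial topic fit could be put on the same Poisson log-likelihood scale follows. It assumes the fit provides a p x k matrix of topic-word probabilities (`theta`) and an n x k matrix of document-topic weights (`omega`), which is my understanding of what `maptpx::topics` returns. The helper `topic_fit_to_lam` and the toy matrices are hypothetical stand-ins, not real maptpx output.

```r
## Poisson log-likelihood, same definition as pois_lk in the utility functions
pois_lk <- function(X, lam) sum(dpois(X, lam, log = TRUE))

## Convert a multinomial topic fit into Poisson means for a p x n count matrix X.
## theta: p x k topic-word probabilities (columns sum to 1)
## omega: n x k document-topic weights (rows sum to 1)
## This argument layout is an assumption about the maptpx output.
topic_fit_to_lam <- function(X, theta, omega) {
    prob <- theta %*% t(omega)           # p x n, each column sums to 1
    sweep(prob, 2, colSums(X), "*")      # scale by each sample's total count
}

## toy stand-in for a fitted model
set.seed(1)
p <- 6; n <- 4; k <- 2
theta <- matrix(rgamma(p * k, 1), nrow = p)
theta <- sweep(theta, 2, colSums(theta), "/")
omega <- matrix(rgamma(n * k, 1), nrow = n)
omega <- omega / rowSums(omega)
X <- matrix(rpois(p * n, 5), nrow = p)

lam <- topic_fit_to_lam(X, theta, omega)
print(pois_lk(X, lam))
```

With the objects in this notebook, the corresponding call would presumably be `pois_lk(Xnew, topic_fit_to_lam(Xnew, fit_maptpx$theta, fit_maptpx$omega))`; that call has not been run here, so no value is reported.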
