## Goal and Methods

* I want to investigate how well current topic modeling methods optimize their objective function (here the negative Poisson log-likelihood, which is a beta divergence up to an additive constant).

* To do this, I need to apply topic modeling methods to realistic data whose underlying distribution is known.

* To achieve this, I first fit a model to some realistic data (simulated here) and treat that fit as the "oracle" model. I then generate new data from the oracle, apply topic modeling methods to the new data, and compare their fits against the oracle.
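
The link between the two objectives named above can be made explicit. As a sketch, write $\Lambda$ for the matrix of Poisson rates (the symbol is chosen here for illustration) and use the convention $0 \log 0 = 0$:

$$
-\log p(X \mid \Lambda) = \sum_{i,j} \left( \Lambda_{ij} - X_{ij} \log \Lambda_{ij} + \log X_{ij}! \right)
$$

$$
D_{KL}(X \,\|\, \Lambda) = \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{\Lambda_{ij}} - X_{ij} + \Lambda_{ij} \right)
$$

$$
-\log p(X \mid \Lambda) = D_{KL}(X \,\|\, \Lambda) + \sum_{i,j} \left( \log X_{ij}! - X_{ij} \log X_{ij} + X_{ij} \right)
$$

The last sum does not depend on $\Lambda$, so minimizing the negative Poisson log-likelihood and minimizing the generalized KL divergence (the beta divergence with $\beta = 1$) select the same rates.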

In [1]:
library(NNLM)
library(maptpx)
library(Matrix)
set.seed(12345)


## Utility functions

In [2]:
simulate_pois <- function(n,p,k, seed = 0){
    set.seed(seed)
    A = matrix(rnorm(p*k, 0,1), nrow = p)
    W = matrix(rnorm(k*n, 0,1), nrow = k)
    lam = exp(A) %*% exp(W)
    X = matrix(rpois(n*p,lam), nrow = p)
    return(list(X = X, lam = lam))
}

pois_lk <- function(X,lam){
    return(sum(dpois(X,lam, log= TRUE)))
}

generateForacle <- function(A,W, seed = 0){
    set.seed(seed)
    Lam = A %*% W
    p = nrow(Lam)
    n = ncol(Lam)
    X = matrix(rpois(n*p,Lam), nrow = p)
    return(X)
}

## compute Poisson log-likelihood from the fit of a multinomial model
multinom2poisson_ll <- function(X,A,W){
    ## scale each column of multinomial probabilities by that sample's total count
    Lam  = A %*% W %*% diag(colSums(X))
    return(pois_lk(X,Lam))
}
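
The rescaling inside multinom2poisson_ll can be checked on a small example. The sketch below uses made-up matrices (A_toy, W_toy and X_toy are hypothetical names, not objects from this notebook) and assumes that the columns of A and the columns of W each sum to 1, as one would expect from a multinomial topic model. It also shows that sweep() gives the same rates without building the n x n diagonal matrix.

```r
set.seed(1)
p <- 6; n <- 4; k <- 2

## hypothetical multinomial parameters: columns of A_toy and W_toy sum to 1
A_toy <- matrix(runif(p * k), nrow = p)
A_toy <- sweep(A_toy, 2, colSums(A_toy), "/")
W_toy <- matrix(runif(k * n), nrow = k)
W_toy <- sweep(W_toy, 2, colSums(W_toy), "/")

## toy count matrix, p features by n samples
X_toy <- matrix(rpois(p * n, 10), nrow = p)

## rates as in multinom2poisson_ll: probabilities times per-sample totals
Lam_diag  <- A_toy %*% W_toy %*% diag(colSums(X_toy))
## same rates, scaling column j by the j-th total directly
Lam_sweep <- sweep(A_toy %*% W_toy, 2, colSums(X_toy), "*")

## both constructions agree, and each column of rates sums to the sample total
print(all.equal(Lam_diag, Lam_sweep))
print(all.equal(colSums(Lam_sweep), colSums(X_toy)))

## Poisson log-likelihood under the rescaled rates
print(sum(dpois(X_toy, Lam_sweep, log = TRUE)))
```

For large n the sweep() form avoids allocating a dense n x n matrix, which may matter more than it does at n = 1000.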

## Generate new data from true data

In [3]:
## "true" data
n = 1000
p = 5000
k = 5
X0 = simulate_pois(n,p,k)$X

## fit model to "true" data
oracle = nnmf(X0,k,method = "scd", loss = "mkl", rel.tol = 1e-3, 
           n.threads = 0, max.iter = 200, 
           inner.max.iter = 4,trace = 1,verbose = 0)

## generate data from oracle
Xnew = generateForacle(oracle$W,oracle$H)

## Fit NNLM

In [4]:
start = proc.time()
fit_nnlm = nnmf(Xnew,k,method = "scd", loss = "mkl", rel.tol = 1e-8, 
           n.threads = 0, max.iter = 200, 
           inner.max.iter = 4,trace = 1,verbose = 0)
print(paste0("time elapsed: ", proc.time() - start))

“Target tolerance not reached. Try a larger max.iter.”

[1] "time elapsed: 321.727" "time elapsed: 2.728"   "time elapsed: 82.303" 
[4] "time elapsed: 0"       "time elapsed: 0"      
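
The five strings printed above arise because proc.time() returns a vector with several components (user, system and elapsed time, plus two child-process times), and paste0() is applied to each of them. As a minimal sketch, the elapsed wall-clock time alone can be extracted by name; the Sys.sleep() call is only a stand-in for the model fit.

```r
start <- proc.time()
Sys.sleep(0.1)  ## placeholder for the expensive call, e.g. a model fit
elapsed <- (proc.time() - start)[["elapsed"]]
print(paste0("time elapsed (s): ", round(elapsed, 3)))

## system.time() wraps the same measurement around a single expression
timing <- system.time(Sys.sleep(0.1))
print(timing[["elapsed"]])
```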


## Fit maptpx

In [5]:
start = proc.time()
fit_maptpx <- topics(t(Xnew),k,shape = 0.1,tol = 1e-4,
                tmax = 100,verb = 2)
print(paste0("time elapsed: ", proc.time() - start))


Estimating on a 1000 samples collection.
Fitting the 5 clusters/topics model.
log posterior increase: 116808.7747, 389378.9005, 194624.9315, 132295.758, 16751.1029, 4341.4211, 1612.0923, 739.7867, 391.4155, 228.671, 141.735, 93.1668, 64.0424, 45.9141, 34.0384, 25.8792, 20.0704, 15.8548, 12.7583, 10.4535, done. (L = -586334149.7618)
[1] "time elapsed: 72.682" "time elapsed: 7.84"   "time elapsed: 80.448"
[4] "time elapsed: 0"      "time elapsed: 0"     


## Compare Poisson log-likelihood

In [6]:
print(paste0("oracle loglikelihood  : ", pois_lk(Xnew,oracle$W %*% oracle$H)))
print(paste0("nnlm loglikelihood    : ", pois_lk(Xnew,fit_nnlm$W %*% fit_nnlm$H)))
print(paste0("maptpx loglikelihood  : ", 
             multinom2poisson_ll(Xnew,fit_maptpx$theta,t(fit_maptpx$omega))))

[1] "oracle loglikelihood  : -12815132.3336176"
[1] "nnlm loglikelihood    : -12799974.3919854"
[1] "maptpx loglikelihood  : -12800401.0190685"
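
To make the comparison easier to read, the gaps relative to the oracle can be computed directly from the values printed above. This is plain arithmetic on the reported numbers; the variable names are chosen here for illustration.

```r
## log-likelihoods copied from the output above
ll <- c(oracle = -12815132.3336176,
        nnlm   = -12799974.3919854,
        maptpx = -12800401.0190685)

## gain over the oracle in nats (positive means a higher likelihood than the oracle)
gain <- ll - ll[["oracle"]]
print(gain)

## the same gain averaged over the n * p = 1000 * 5000 entries of Xnew
print(gain / (1000 * 5000))
```

NNLM exceeds the oracle by roughly 15,158 nats and maptpx by roughly 14,731, which is about 0.003 nats per matrix entry in both cases.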


## Summary
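* This shows the pipeline of the simulation experiments.
* On this toy "true" data, both NNLM and maptpx attain a higher Poisson log-likelihood on Xnew than the oracle that generated it (about -12,799,974 and -12,800,401 versus -12,815,132). Exceeding the generating model's likelihood on the data used for fitting means the methods are also fitting sampling noise, that is, overfitting, so the optimization itself may not be the issue.
* It remains to be seen how they perform when the data come from more complicated real-world sources.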
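
The overfitting reading above can be illustrated on a case where the maximum-likelihood fit is available in closed form. The sketch below is a toy stand-in, not the NNLM or maptpx fit: it assumes a rank-1 Poisson model, whose maximum-likelihood rates are the outer product of the row and column totals divided by the grand total. All names (Lam_true, X_train, X_test, Lam_hat, toy_ll) are hypothetical. On the data used for fitting, the fit can never score below the true rates, because the true rates belong to the model class being maximized over; a fresh draw from the same rates gives a fairer comparison.

```r
set.seed(1)
p <- 50; n <- 40

## hypothetical rank-1 "oracle" rates, bounded away from zero
Lam_true <- outer(runif(p, 0.5, 2), runif(n, 1, 5))

## two independent datasets drawn from the same oracle
X_train <- matrix(rpois(p * n, Lam_true), nrow = p)
X_test  <- matrix(rpois(p * n, Lam_true), nrow = p)

toy_ll <- function(X, lam) sum(dpois(X, lam, log = TRUE))

## closed-form maximum-likelihood fit of the rank-1 Poisson model
Lam_hat <- outer(rowSums(X_train), colSums(X_train)) / sum(X_train)

## on the training data the fit is at least as good as the oracle
ll_train <- c(oracle = toy_ll(X_train, Lam_true), fit = toy_ll(X_train, Lam_hat))
## on fresh data the ordering is not guaranteed and typically favors the oracle
ll_test  <- c(oracle = toy_ll(X_test, Lam_true),  fit = toy_ll(X_test, Lam_hat))

print(ll_train)
print(ll_test)
```

Applied to this notebook, the analogous check would be to draw a second dataset from the oracle with a different seed and evaluate all three models on it.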

In [12]:
write.csv(as.data.frame(t(Xnew)), "../../topics-simulation-bigdata/output/test2.csv")