Implementation of the paper "Decomposed Normalized Maximum Likelihood Codelength Criterion for Selecting Hierarchical Latent Variable Models".
Parameters:
K: topic size, i.e. the complexity (model order) of the model.
V: vocabulary size, the number of unique words in the given documents.
D: document size, the number of documents.
alpha: the hyperparameter of the Dirichlet prior on the topic distribution of each document.
beta: the hyperparameter of the Dirichlet prior on the word distribution of each topic.
All artificial data is generated by the generator in model/topic_model/ArtificialDataGenerator.py. Example usage:
dd = LDAArtificialDataGenerator(K, V, alpha, beta, k_noise=0, noise_alpha=1, noise_beta=0.1, random_state=None)
X = dd.generate_artificial_data(D, N, noise_threshold=0.0)
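For readers unfamiliar with the generative process such a generator implements, here is a minimal standalone sketch of LDA data generation using only the standard library. The function and variable names are illustrative assumptions, not the repo's API; the repo's generator additionally supports noise topics, which are omitted here.

```python
import random

def sample_dirichlet(alpha_vec, rng):
    # Draw from a Dirichlet distribution via normalized Gamma samples.
    g = [rng.gammavariate(a, 1.0) for a in alpha_vec]
    s = sum(g)
    return [x / s for x in g]

def generate_lda_corpus(K, V, D, N, alpha, beta, seed=0):
    """Generate D documents of N words each from the LDA generative process.

    Returns a D x V matrix of word counts (list of lists).
    """
    rng = random.Random(seed)
    # Per-topic word distributions phi_k ~ Dirichlet(beta).
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    X = []
    for _ in range(D):
        # Per-document topic distribution theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * K, rng)
        counts = [0] * V
        for _ in range(N):
            z = rng.choices(range(K), weights=theta)[0]   # latent topic Z
            w = rng.choices(range(V), weights=phi[z])[0]  # observed word
            counts[w] += 1
        X.append(counts)
    return X

X = generate_lda_corpus(K=3, V=10, D=5, N=20, alpha=0.5, beta=0.1)
```

Each row of X is a bag-of-words count vector for one document, which is the input format the learner below expects.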
The actual learner is LatentDirichletAllocation in model/topic_model/LDA/VB/online_lda.py.
LatentDirichletAllocationWithSample is a thin wrapper around LatentDirichletAllocation that samples the latent variable Z in order to compute the decomposed NML (DNML) code length.
LatentDirichletAllocationWithScore is a thin wrapper around LatentDirichletAllocationWithSample that computes several model selection criteria, such as DNML, a-NML, AIC, BIC, and the VB criterion, based on the sampled Z.
learner = LatentDirichletAllocationWithScore(verbose=0, learning_method='batch', evaluate_every=evaluate_every, perp_tol=perp_tol, max_iter=max_iter)
learner.fit(X)
criterion_scores = learner.score_new(X)
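Model selection then amounts to fitting one learner per candidate K and picking the K that minimizes the chosen criterion (code lengths and penalized likelihoods are minimized). The sketch below assumes the scores for each K have been collected into a dict mapping criterion names to values; that layout, the criterion names, and the numbers are illustrative assumptions, not the repo's API.

```python
def select_model_order(scores_by_K, criterion="DNML"):
    """Return the candidate K whose criterion value is smallest.

    scores_by_K maps K -> {criterion_name: value}; smaller is better
    for code-length and penalized-likelihood criteria.
    """
    return min(scores_by_K, key=lambda K: scores_by_K[K][criterion])

# Hypothetical criterion values for candidate K = 2..4 (numbers made up).
scores = {
    2: {"DNML": 1523.4, "AIC": 1601.2, "BIC": 1650.8},
    3: {"DNML": 1490.1, "AIC": 1612.7, "BIC": 1698.3},
    4: {"DNML": 1511.9, "AIC": 1640.0, "BIC": 1755.2},
}
best_K = select_model_order(scores, criterion="DNML")  # -> 3
```

Note that different criteria can disagree: on these made-up numbers, AIC would pick K=2 while DNML picks K=3.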
model/topic_model/LDA/LDA_R.py
model/relation_model/DataGenerater.py
model/relation_model/SBM/SBM_R.py: a thin wrapper around the R package blockmodels.
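For the relation-model side, the stochastic block model (SBM) generative process can be sketched in a few lines of standard-library Python. This is an illustration of the model the files above work with, not the repo's DataGenerater.py API; all names and arguments are assumptions.

```python
import random

def generate_sbm(block_probs, block_sizes, seed=0):
    """Sample an undirected adjacency matrix from a stochastic block model.

    block_probs[a][b] is the edge probability between blocks a and b;
    block_sizes gives the number of nodes in each block.
    Returns the adjacency matrix and the node-to-block labels.
    """
    rng = random.Random(seed)
    # Assign consecutive nodes to each block.
    labels = []
    for b, size in enumerate(block_sizes):
        labels.extend([b] * size)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edge (i, j) is Bernoulli with the blocks' connection probability.
            if rng.random() < block_probs[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A, labels

A, labels = generate_sbm(
    block_probs=[[0.9, 0.1], [0.1, 0.9]],  # dense within blocks, sparse between
    block_sizes=[3, 3],
)
```

Here the block labels play the role of the latent variable Z, so the same DNML decomposition applies as in the LDA case.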