
[MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes #3659

Merged
merged 30 commits into scikit-learn:master from chyikwei:onlineldavb on Aug 9, 2015

Conversation

chyikwei
Contributor

This PR is an implementation of Matt Hoffman's topic modeling algorithm, LDA with online variational Bayes.

Based on previous discussion in this email thread, I asked Matt if he could relicense his onlineldavb code to BSD. His code is now relicensed, so I created this PR.

I use the name OnlineLDA for the model and put it in the decomposition folder. Since the model supports both online and batch updates, I implemented both fit and partial_fit methods.
The algorithm and unit tests are done and ready for review. I will work on an example next.
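
For illustration, a minimal usage sketch of the two update modes, assuming the class is exposed as sklearn.decomposition.OnlineLDA as described above (the constructor signature and parameter names such as n_topics are assumptions and may change during review):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import OnlineLDA  # assumed import path for this PR

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today"]
X = CountVectorizer().fit_transform(docs)    # bag-of-words count matrix

# Batch update: a single fit call iterates over the whole corpus.
lda = OnlineLDA(n_topics=2, random_state=0)
lda.fit(X)

# Online update: feed mini-batches to partial_fit as documents arrive.
lda_online = OnlineLDA(n_topics=2, random_state=0)
for start in range(0, X.shape[0], 2):
    lda_online.partial_fit(X[start:start + 2])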

Check List:
  • algorithm implementation
  • unit test
  • profiling
  • optimization
  • example
  • documentation
Reference:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach
[2] original onlineldavb code (with BSD license)

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling e872e80 on chyikwei:onlineldavb into 78fbd25 on scikit-learn:master.

@mblondel
Member

For the record, here's a fast scikit-learn compatible LDA implementation:
https://github.com/ariddell/lda/

CC @ariddell

@ariddell
Contributor

It would be great to see LDA in sklearn in any form!

On the subject of online algorithms, apparently onlinehdp has very good results, and I think it has the same order of operations requirements as online LDA: I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012).

@mblondel
Member

@ariddell What parameter inference method does your implementation use? Would you consider relicensing to BSD?

@ariddell
Contributor

My implementation uses collapsed Gibbs sampling, which is rather different from online LDA. I'd be willing to do a one-off relicense to BSD for scikit-learn if there were interest.

@chyikwei
Contributor Author

@ariddell yeah, onlineHDP has similar operations to onlineLDA. But I am not sure if the E-step can be executed in parallel, since the number of topics changes over time. (I haven't gone through the details of its source code yet.)

btw, after seeing your implementation, I think I should do some profiling first and see if I can optimize my current implementation.

@chyikwei
Contributor Author

Profiling results for the important functions:
https://gist.github.com/chyikwei/59c3f024ff3148efe1df

@amueller
Member

amueller commented Jan 9, 2015

Can we please rename this to LatentDirichletAllocation even though that is long? We have a (badly named) LDA class in scikit-learn.

@amueller
Member

amueller commented Jan 9, 2015

Not sure decomposition is the right folder, but I don't have a better idea ^^

@amueller
Member

amueller commented Jan 9, 2015

How does this compare against the gensim implementation? Is that the same approach?

@chyikwei
Contributor Author

chyikwei commented Jan 9, 2015

@amueller

  1. ok. I will rename it.
  2. I am not sure about the folder either, but "decomposition" is the best one I can find.
  3. yes. based on gensim's LDA page, I think its approach is also M. Hoffman's online LDA.

@amueller
Member

amueller commented Jan 9, 2015

For 3) it would be cool if you could give a performance comparison (and maybe also a comparison of how well it fits the data?) as a sanity check.

@chyikwei
Contributor Author

chyikwei commented Jan 9, 2015

ok. will add a performance comparison with gensim's implementation. For "how well it fits the data", I will compare perplexity.

@amueller
Member

amueller commented Jan 9, 2015

Thanks :)

@ariddell
Contributor

I've almost got the transform method working for LDA in https://github.com/ariddell/lda (fit and fit_transform work fine); I would imagine Gibbs sampling beats online LDA in perplexity and is reasonably fast for small to medium datasets -- and I'd be very curious to see how things play out with large datasets.

@chyikwei I'd be happy to help add Gibbs sampling to the benchmarks once you settle on them.

@chyikwei
Contributor Author

quick update:

  1. renamed the model to LatentDirichletAllocation
  2. added some Cython optimizations
  3. will work on the performance comparison with gensim next (not familiar with its interface yet)

@ariddell thx! I will start with the 20 newsgroups dataset first, and we can try a larger one later.

@amueller
Member

It would be very interesting to see how the collapsed gibbs sampler compares to this, indeed.

@ariddell
Contributor

For larger datasets, there's Enron and PubMed: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

@chyikwei
Contributor Author

Hi,
I put the comparison between my LDA implementation and gensim in this spreadsheet. (My script is here.)

I used the 20 newsgroups dataset and compared both online and batch updates.
In batch mode, the performance is close.
In online mode, the speed difference is caused by how often perplexity is computed during training.
My partial_fit method doesn't compute perplexity at all, so it is much faster than gensim. (For details, check the notes in the result spreadsheet.)

Note: one thing I haven't figured out is why perplexity goes up in gensim as the number of workers increases. I will double-check that.

@ariddell it would be cool if you could add the Gibbs sampler's results. I will check the larger-datasets link you posted next.
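
For readers without the linked script, a rough sketch of the kind of side-by-side run (the scikit-learn class is the one proposed in this PR, so exact method and parameter names may still change; the gensim part uses its standard LdaModel API):

import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation  # as proposed in this PR
from gensim import matutils
from gensim.models import LdaModel

data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
vectorizer = CountVectorizer(max_features=2000, stop_words='english')
X = vectorizer.fit_transform(data)

# scikit-learn, batch variational Bayes
t0 = time.time()
lda_skl = LatentDirichletAllocation(n_topics=10, random_state=0).fit(X)
print("sklearn: %.1fs, perplexity %.1f" % (time.time() - t0, lda_skl.perplexity(X)))

# gensim, same corpus converted to its streaming format
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
t0 = time.time()
lda_gen = LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=1)
bound = lda_gen.log_perplexity(corpus)   # per-word bound; gensim itself reports perplexity as 2 ** (-bound)
print("gensim: %.1fs, per-word bound %.3f" % (time.time() - t0, bound))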

@GaelVaroquaux
Member

Is there a reason that perplexity is computed by default at every step?
It seems to me that we could make model convergence faster by computing
it every 2 or 4 steps.

@GaelVaroquaux
Member

Well, the spreadsheet is overall in favor of your implementation. Good work!

@amueller
Member

yeah that looks promising :)
It might be nice to have a plot_ example that generates an image. Then the output will be visible in the example gallery on the website.

@chyikwei
Contributor Author

@GaelVaroquaux There is no reason to compute perplexity at every step. I will add a parameter for this (similar to gensim's eval_every). Also, I think I need to check gensim's source code and settings to make sure I am doing an apples-to-apples comparison. (If we both use Matt Hoffman's code, the results should be similar.)

@amueller not sure what the best way to visualize topic models is (usually I just check the top words in each topic). Any ideas?
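
Regarding the evaluation-frequency parameter mentioned above, a one-line sketch of the idea (the parameter name evaluate_every is an assumption here, mirroring gensim's eval_every):

# Perplexity would be computed only every `evaluate_every` iterations during fit;
# a non-positive value would disable evaluation entirely (names are illustrative).
lda = LatentDirichletAllocation(n_topics=10, evaluate_every=4, random_state=0)
lda.fit(X)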

@amueller
Member

Maybe pick the top three words in each topic and then do a bar graph of how likely they are under each of the topics?
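
A hedged sketch of that visualization, assuming a fitted model that exposes a topic-word matrix (the attribute name components_, the fitted vectorizer, and feature_names are illustrative, not necessarily the final API):

import numpy as np
import matplotlib.pyplot as plt

feature_names = vectorizer.get_feature_names()  # from the fitted CountVectorizer
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]  # normalize rows to probabilities
n_top = 3

fig, axes = plt.subplots(1, topic_word.shape[0], figsize=(12, 3), sharey=True)
for k, (topic, ax) in enumerate(zip(topic_word, axes)):
    top_idx = topic.argsort()[-n_top:][::-1]   # indices of the 3 most likely words
    ax.bar(range(n_top), topic[top_idx])
    ax.set_xticks(range(n_top))
    ax.set_xticklabels([feature_names[i] for i in top_idx], rotation=45)
    ax.set_title("Topic %d" % k)
plt.tight_layout()
plt.show()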


return score

def perplexity(self, X, gamma, sub_sampling=False):
Member

It would make sense to have a score method based on transform and perplexity, right?

Contributor Author

yes, that makes sense. will add it.
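
A rough sketch of that idea (not the final implementation; the helper name _approx_bound is hypothetical): score would reuse the transform/perplexity machinery and return an approximate log-likelihood bound, so higher is better and it plugs into the usual model-selection tools.

def score(self, X, y=None):
    """Approximate log-likelihood (variational bound) of X under the fitted model."""
    gamma = self.transform(X)              # per-document topic distributions
    return self._approx_bound(X, gamma)    # hypothetical helper computing the variational bound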

@amueller
Member

This might be a stupid question, but if we wanted to add the collapsed Gibbs sampler version, say the one by @ariddell, could we use the same public interface and branch using an algorithm='collapsed_gibbs' parameter?

@chyikwei
Contributor Author

Yes, I think we can share the interface between different implementations. The online variational Bayes algorithm just has a few more parameters used for the online update, which a Gibbs sampler could ignore. (A rough sketch is below.)
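
A hedged sketch of how such a shared interface could branch on an algorithm parameter (names are illustrative; this PR only implements the online variational Bayes path):

from sklearn.base import BaseEstimator, TransformerMixin

class LatentDirichletAllocation(BaseEstimator, TransformerMixin):
    def __init__(self, n_topics=10, algorithm='online_vb',
                 learning_decay=0.7, learning_offset=10.0):   # last two are online-only knobs
        self.n_topics = n_topics
        self.algorithm = algorithm
        self.learning_decay = learning_decay
        self.learning_offset = learning_offset

    def fit(self, X, y=None):
        if self.algorithm == 'online_vb':
            return self._fit_online_vb(X)           # hypothetical helper
        elif self.algorithm == 'collapsed_gibbs':
            return self._fit_collapsed_gibbs(X)     # would simply ignore the online-only knobs
        raise ValueError("Unknown algorithm: %r" % self.algorithm)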

@amueller
Member

Ok cool :)

@ogrisel
Member

ogrisel commented Jun 25, 2015

Please also rename self.rng_ to self.random_state_ to be consistent with other probabilistic models.

@ogrisel
Member

ogrisel commented Jun 25, 2015

Also I think we should make sure that the score method is in line with the ongoing work by @xuewei4d on (Bayesian) GMMs. It would be great, @xuewei4d, if you could review this PR to check for possible sources of inconsistency between your GSoC models and this one.

@xuewei4d
Contributor

xuewei4d commented Jul 4, 2015

I did a quick review of the score method. Because I didn't consider BayesianGaussianMixture when coding DensityMixin and GaussianMixture, the current DensityMixin is not compatible with variational models: it would have to add an extra term to the score, i.e. the contribution coming from the variational distributions. Anyway, the lower bound (approximation bound) is the right thing to compute for variational methods.
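
(For context, the bound in question is the standard evidence lower bound: with a variational posterior $q$ over the latent variables $Z$,

$$\log p(X) \;\ge\; \mathbb{E}_q[\log p(X, Z)] - \mathbb{E}_q[\log q(Z)],$$

so for variational models the score should report this bound rather than an exact log-likelihood.)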

@ogrisel
Member

ogrisel commented Jul 7, 2015

@chyikwei could you please answer or address the comments of @amueller on the _online_lda.pyx file: https://github.com/scikit-learn/scikit-learn/pull/3659/files ?

@chyikwei
Contributor Author

chyikwei commented Jul 7, 2015

Sure. I will benchmark the Cython code again, since there have been some code changes since I last ran line_profiler.

@chyikwei
Contributor Author

chyikwei commented Jul 8, 2015

Here is the Cython code benchmark (on the _update_doc_distribution function):

  1. mean_change improvement: from 32.8% to 6%. (complete result)
  2. _log_dirichlet_expectation improvement: from 38.4% to 28.9%. (complete result)

For reference, here is my profiling code.

@chyikwei
Contributor Author

chyikwei commented Jul 8, 2015

And should I move the Cython directives to the top of the file?

I see some files use # cython: boundscheck=False (example) and some use @cython.boundscheck(False) (example). A small sketch of the two forms is below.
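
For reference, a sketch comparing the two forms (illustrative only, not the exact code in this PR):

# Form 1: a file-level directive comment, which must sit near the top of the .pyx file
# and applies to everything in it:
# cython: boundscheck=False, wraparound=False

cimport cython
from libc.math cimport fabs

# Form 2: per-function decorators, which apply only to the decorated function:
@cython.boundscheck(False)
@cython.wraparound(False)
def mean_change(double[:] arr_1, double[:] arr_2):
    """Mean absolute difference between two 1-D arrays."""
    cdef double total = 0.0
    cdef Py_ssize_t i
    cdef Py_ssize_t size = arr_1.shape[0]
    for i in range(size):
        total += fabs(arr_1[i] - arr_2[i])
    return total / size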

cnts = X[idx_d, ids]
temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
tmax = temp.max(axis=0)
norm_phi = np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax
Contributor

scikit-learn has a logsumexp function in utils.extmath

Contributor Author

cool. will use it. thanks!
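
A sketch of the suggested simplification of the excerpt above, using the helper the reviewer mentions (variable names follow the excerpt; the exact signature should be checked against sklearn.utils.extmath):

from sklearn.utils.extmath import logsumexp  # helper mentioned in the review

temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
# logsumexp performs the same max-shift trick internally, i.e.
# log(sum(exp(temp - tmax), axis=0)) + tmax
norm_phi = logsumexp(temp, axis=0)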

@amueller
Member

I haven't checked the documentation in detail but I'd be happy to merge this now. @ogrisel what do you think?

@amueller changed the title from "[MRG] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes" to "[MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes" on Aug 3, 2015
@amueller
Member

amueller commented Aug 3, 2015

ping @ogrisel again ;)

@amueller
Member

amueller commented Aug 3, 2015

@larsmans are you interested in this? I think it is in pretty good shape.

larsmans added a commit that referenced this pull request on Aug 9, 2015:
ENH Latent Dirichlet Allocation (LDA) with online variational Bayes
@larsmans merged commit a6c6e73 into scikit-learn:master on Aug 9, 2015
@larsmans
Member

larsmans commented Aug 9, 2015

Merged this. Let's finish any nitpicking in master.

@amueller
Member

amueller commented Aug 9, 2015

Thanks @larsmans, I agree :) 🍻 Thank you so much @chyikwei !

@mblondel
Member

It has been suggested to mention more clearly in the code that Matt Hoffman allowed us to license the code as BSD even though it's derived from his GPL implementation:
https://twitter.com/EdwardRaffML/status/631123381212082176

@larsmans
Member

Yeah, it'd be good to reproduce the email or something. @chyikweiyau, could you do that?

@chyikwei
Contributor Author

@larsmans sure. Here is the email I sent to Matt Hoffman before.
And he emailed me back the relicensed code; here are the files.

@chyikwei deleted the onlineldavb branch on August 12, 2015 04:06
@amueller
Member

Maybe just add to the license header: "relicensed as BSD with the kind permission of Matt Hoffman"?
