
[MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes #3659

Merged
merged 30 commits into scikit-learn:master from chyikwei:onlineldavb on Aug 9, 2015

Conversation

chyikwei
Contributor

This PR is an implementation of Matt Hoffman's topic modeling algorithm, LDA with online variational Bayes.

Based on previous discussion in this email thread, I asked Matt if he could relicense his onlineldavb code to BSD. His code is now relicensed, so I created this PR.

I use the name OnlineLDA for the model and put it in the decomposition folder. Since the model supports both online and batch updates, I implemented both fit and partial_fit methods.
The algorithm and unit tests are done and ready for review. I will work on an example next.
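
For illustration, a minimal usage sketch of the two update modes, assuming the class is exposed as sklearn.decomposition.OnlineLDA as described above (the constructor signature and parameter names such as n_topics are assumptions and may change during review):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import OnlineLDA  # assumed import path for this PR

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today"]
X = CountVectorizer().fit_transform(docs)    # bag-of-words count matrix

# Batch update: a single fit call iterates over the whole corpus.
lda = OnlineLDA(n_topics=2, random_state=0)
lda.fit(X)

# Online update: feed mini-batches to partial_fit as documents arrive.
lda_online = OnlineLDA(n_topics=2, random_state=0)
for start in range(0, X.shape[0], 2):
    lda_online.partial_fit(X[start:start + 2])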

Check List:
  • algorithm implementation
  • unit test
  • profiling
  • optimization
  • example
  • documentation
Reference:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach
[2] original onlineldavb code (with BSD license)

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling e872e80 on chyikwei:onlineldavb into 78fbd25 on scikit-learn:master.

@mblondel
Member

For the record, here's a fast scikit-learn compatible LDA implementation:
https://github.com/ariddell/lda/

CC @ariddell

@ariddell
Contributor

It would be great to see LDA in sklearn in any form!

On the subject of online algorithms, apparently onlinehdp has very good results, and I think it has the same order of operations requirements as online LDA: I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2012).

@mblondel
Member

@ariddell What parameter inference method does your implementation use? Would you consider relicensing to BSD?

@ariddell
Contributor

My implementation uses collapsed Gibbs sampling, which is rather different from online LDA. I'd be willing to do a one-off relicense to BSD for scikit-learn if there were interest.

@chyikwei
Contributor Author

@ariddell yeah, onlineHDP has similar operations to onlineLDA. But I am not sure if the E-step can be executed in parallel, since the number of topics changes over time. (I haven't gone through the details of its source code yet.)

btw, after seeing your implementation, I think I should do some profiling first and see if I can optimize my current implementation.

@chyikwei
Contributor Author

Profiling results for the important functions:
https://gist.github.com/chyikwei/59c3f024ff3148efe1df

@amueller
Member

amueller commented Jan 9, 2015

Can we please rename this to LatentDirichletAllocation even though that is long? We have a (badly named) LDA class in scikit-learn.

@amueller
Member

amueller commented Jan 9, 2015

Not sure decomposition is the right folder, but I don't have a better idea ^^

@amueller
Member

amueller commented Jan 9, 2015

How does this compare against the gensim implementation? Is that the same approach?

@chyikwei
Contributor Author

chyikwei commented Jan 9, 2015

@amueller

  1. ok. I will rename it.
  2. I am not sure about the folder either, but "decomposition" is the best one I can find.
  3. yes. based on gensim's LDA page, I think its approach is also M. Hoffman's online LDA.

@amueller
Member

amueller commented Jan 9, 2015

For 3) it would be cool if you could give a performance comparison (and maybe also a comparison of how well it fits the data?) as a sanity check.

@chyikwei
Contributor Author

chyikwei commented Jan 9, 2015

ok. will add a performance comparison with gensim's implementation. For "how well it fits the data", I will compare perplexity.

@amueller
Member

amueller commented Jan 9, 2015

Thanks :)

@ariddell
Contributor

I've almost got the transform method working for LDA in https://github.com/ariddell/lda (fit and fit_transform work fine); I would imagine Gibbs sampling beats online LDA in perplexity and is reasonably fast for small to medium datasets -- and I'd be very curious to see how things play out with large datasets.

@chyikwei I'd be happy to help add Gibbs sampling to the benchmarks once you settle on them.

@chyikwei
Contributor Author

quick update:

  1. renamed the model to LatentDirichletAllocation
  2. added some Cython optimizations
  3. will work on the performance comparison with gensim next (not familiar with its interface yet)

@ariddell thx! I will start with the 20 newsgroups dataset first, and we can try a larger one later.

@amueller
Member

It would be very interesting to see how the collapsed gibbs sampler compares to this, indeed.

@ariddell
Contributor

For larger datasets, there's Enron and PubMed: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

@chyikwei
Contributor Author

Hi,
I put the comparison between my LDA implementation and gensim in this spreadsheet. (My script is here.)

I used the 20 newsgroups dataset and compared both online and batch updates.
In batch mode, the performance is close.
In online mode, the speed difference is caused by how often perplexity is computed during training.
My partial_fit method doesn't compute perplexity at all, so it is much faster than gensim. (For details, check the notes in the result spreadsheet.)

Note: one thing I haven't figured out is why perplexity goes up in gensim as the number of workers increases. I will double-check that.

@ariddell it would be cool if you could add the Gibbs sampler's results. I will check the larger-datasets link you posted next.
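
For readers without the linked script, a rough sketch of the kind of side-by-side run (the scikit-learn class is the one proposed in this PR, so exact method and parameter names may still change; the gensim part uses its standard LdaModel API):

import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation  # as proposed in this PR
from gensim import matutils
from gensim.models import LdaModel

data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
vectorizer = CountVectorizer(max_features=2000, stop_words='english')
X = vectorizer.fit_transform(data)

# scikit-learn, batch variational Bayes
t0 = time.time()
lda_skl = LatentDirichletAllocation(n_topics=10, random_state=0).fit(X)
print("sklearn: %.1fs, perplexity %.1f" % (time.time() - t0, lda_skl.perplexity(X)))

# gensim, same corpus converted to its streaming format
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
id2word = {v: k for k, v in vectorizer.vocabulary_.items()}
t0 = time.time()
lda_gen = LdaModel(corpus=corpus, id2word=id2word, num_topics=10, passes=1)
bound = lda_gen.log_perplexity(corpus)   # per-word bound; gensim itself reports perplexity as 2 ** (-bound)
print("gensim: %.1fs, per-word bound %.3f" % (time.time() - t0, bound))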

@GaelVaroquaux
Member

Is there a reason that perplexity is computed by default at every step?
It seems to me that we could make model convergence faster by computing
it every 2 or 4 steps.

@GaelVaroquaux
Member

Well, the spreadsheet is overall in favor of your implementation. Good work!

@amueller
Member

yeah that looks promising :)
It might be nice to have a plot_ example that generates an image. Then the output will be visible in the example gallery on the website.

@chyikwei
Contributor Author

@GaelVaroquaux There is no reason to compute perplexity at every step. I will add a parameter for this (similar to gensim's eval_every). Also, I think I need to check gensim's source code and settings to make sure I am doing an apples-to-apples comparison. (If we both use Matt Hoffman's code, the results should be similar.)

@amueller not sure what the best way to visualize topic models is (usually I just check the top words in each topic). Any ideas?
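
Regarding the evaluation-frequency parameter mentioned above, a one-line sketch of the idea (the parameter name evaluate_every is an assumption here, mirroring gensim's eval_every):

# Perplexity would be computed only every `evaluate_every` iterations during fit;
# a non-positive value would disable evaluation entirely (names are illustrative).
lda = LatentDirichletAllocation(n_topics=10, evaluate_every=4, random_state=0)
lda.fit(X)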

@amueller
Member

Maybe pick the top three words in each topic and then do a bar graph of how likely they are under each of the topics?
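
A hedged sketch of that visualization, assuming a fitted model that exposes a topic-word matrix (the attribute name components_, the fitted vectorizer, and feature_names are illustrative, not necessarily the final API):

import numpy as np
import matplotlib.pyplot as plt

feature_names = vectorizer.get_feature_names()  # from the fitted CountVectorizer
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]  # normalize rows to probabilities
n_top = 3

fig, axes = plt.subplots(1, topic_word.shape[0], figsize=(12, 3), sharey=True)
for k, (topic, ax) in enumerate(zip(topic_word, axes)):
    top_idx = topic.argsort()[-n_top:][::-1]   # indices of the 3 most likely words
    ax.bar(range(n_top), topic[top_idx])
    ax.set_xticks(range(n_top))
    ax.set_xticklabels([feature_names[i] for i in top_idx], rotation=45)
    ax.set_title("Topic %d" % k)
plt.tight_layout()
plt.show()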


return score

def perplexity(self, X, gamma, sub_sampling=False):
Member

It would make sense to have a score method based on transform and perplexity, right?

Contributor Author

yes, that makes sense. will add it.
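
A rough sketch of that idea (not the final implementation; the helper name _approx_bound is hypothetical): score would reuse the transform/perplexity machinery and return an approximate log-likelihood bound, so higher is better and it plugs into the usual model-selection tools.

def score(self, X, y=None):
    """Approximate log-likelihood (variational bound) of X under the fitted model."""
    gamma = self.transform(X)              # per-document topic distributions
    return self._approx_bound(X, gamma)    # hypothetical helper computing the variational bound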

@amueller
Member

This might be a stupid question, but if we wanted to add the collapsed Gibbs sampler version, say the one by @ariddell, could we use the same public interface and branch using an algorithm='collapsed_gibbs' parameter?

@chyikwei
Contributor Author

Yes, I think we can share the interface between different implementations. The online variational Bayes algorithm just has a few more parameters used for the online update, which a Gibbs sampler could ignore. (A rough sketch is below.)
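
A hedged sketch of how such a shared interface could branch on an algorithm parameter (names are illustrative; this PR only implements the online variational Bayes path):

from sklearn.base import BaseEstimator, TransformerMixin

class LatentDirichletAllocation(BaseEstimator, TransformerMixin):
    def __init__(self, n_topics=10, algorithm='online_vb',
                 learning_decay=0.7, learning_offset=10.0):   # last two are online-only knobs
        self.n_topics = n_topics
        self.algorithm = algorithm
        self.learning_decay = learning_decay
        self.learning_offset = learning_offset

    def fit(self, X, y=None):
        if self.algorithm == 'online_vb':
            return self._fit_online_vb(X)           # hypothetical helper
        elif self.algorithm == 'collapsed_gibbs':
            return self._fit_collapsed_gibbs(X)     # would simply ignore the online-only knobs
        raise ValueError("Unknown algorithm: %r" % self.algorithm)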

@amueller
Member

Ok cool :)

@ogrisel
Member

ogrisel commented Jun 25, 2015

Please also rename self.rng_ to self.random_state_ to be consistent with other probabilistic models.

@ogrisel
Member

ogrisel commented Jun 25, 2015

Also I think we should make sure that the score method is in line with the ongoing work by @xuewei4d on (Bayesian) GMMs. It would be great, @xuewei4d, if you could review this PR to check for possible sources of inconsistency between your GSoC models and this one.

@xuewei4d
Contributor

xuewei4d commented Jul 4, 2015

I did a quick review of the score method. Because I didn't consider BayesianGaussianMixture when coding DensityMixin and GaussianMixture, the current DensityMixin is not compatible with variational models: it would have to add an extra term to the score, i.e. the contribution coming from the variational distributions. Anyway, the lower bound (approximation bound) is the right thing to compute for variational methods.
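
(For context, the bound in question is the standard evidence lower bound: with a variational posterior $q$ over the latent variables $Z$,

$$\log p(X) \;\ge\; \mathbb{E}_q[\log p(X, Z)] - \mathbb{E}_q[\log q(Z)],$$

so for variational models the score should report this bound rather than an exact log-likelihood.)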

@ogrisel
Member

ogrisel commented Jul 7, 2015

@chyikwei could you please answer or address the comments of @amueller on the _online_lda.pyx file: https://github.com/scikit-learn/scikit-learn/pull/3659/files ?

@chyikwei
Contributor Author

chyikwei commented Jul 7, 2015

Sure. I will benchmark the Cython code again, since there have been some code changes since I last ran line_profiler.

@chyikwei
Contributor Author

chyikwei commented Jul 8, 2015

Here is the Cython code benchmark (on the _update_doc_distribution function):

  1. mean_change improvement: from 32.8% to 6%. (complete result)
  2. _log_dirichlet_expectation improvement: from 38.4% to 28.9%. (complete result)

For reference, here is my profiling code.

@chyikwei
Contributor Author

chyikwei commented Jul 8, 2015

And should I move the Cython directives to the top of the file?

I see some files use # cython: boundscheck=False (example) and some use @cython.boundscheck(False) (example). A small sketch of the two forms is below.
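
For reference, a sketch comparing the two forms (illustrative only, not the exact code in this PR):

# Form 1: a file-level directive comment, which must sit near the top of the .pyx file
# and applies to everything in it:
# cython: boundscheck=False, wraparound=False

cimport cython
from libc.math cimport fabs

# Form 2: per-function decorators, which apply only to the decorated function:
@cython.boundscheck(False)
@cython.wraparound(False)
def mean_change(double[:] arr_1, double[:] arr_2):
    """Mean absolute difference between two 1-D arrays."""
    cdef double total = 0.0
    cdef Py_ssize_t i
    cdef Py_ssize_t size = arr_1.shape[0]
    for i in range(size):
        total += fabs(arr_1[i] - arr_2[i])
    return total / size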

cnts = X[idx_d, ids]
temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
tmax = temp.max(axis=0)
norm_phi = np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax
Contributor

scikit-learn has a logsumexp function in utils.extmath

Contributor Author

cool. will use it. thanks!
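
A sketch of the suggested simplification of the excerpt above, using the helper the reviewer mentions (variable names follow the excerpt; the exact signature should be checked against sklearn.utils.extmath):

from sklearn.utils.extmath import logsumexp  # helper mentioned in the review

temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
# logsumexp performs the same max-shift trick internally, i.e.
# log(sum(exp(temp - tmax), axis=0)) + tmax
norm_phi = logsumexp(temp, axis=0)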

@amueller
Member

I haven't checked the documentation in detail but I'd be happy to merge this now. @ogrisel what do you think?

@amueller changed the title from "[MRG] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes" to "[MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes" on Aug 3, 2015
@amueller
Member

amueller commented Aug 3, 2015

ping @ogrisel again ;)

@amueller
Member

amueller commented Aug 3, 2015

@larsmans are you interested in this? I think it is in pretty good shape.

larsmans added a commit that referenced this pull request on Aug 9, 2015:
ENH Latent Dirichlet Allocation (LDA) with online variational Bayes
@larsmans merged commit a6c6e73 into scikit-learn:master on Aug 9, 2015
@larsmans
Member

larsmans commented Aug 9, 2015

Merged this. Let's finish any nitpicking in master.

@amueller
Member

amueller commented Aug 9, 2015

Thanks @larsmans, I agree :) 🍻 Thank you so much @chyikwei !

@mblondel
Member

It has been suggested to mention more clearly in the code that Matt Hoffman allowed us to license the code as BSD even though it's derived from his GPL implementation:
https://twitter.com/EdwardRaffML/status/631123381212082176

@larsmans
Member

Yeah, it'd be good to reproduce the email or something. @chyikweiyau, could you do that?

@chyikwei
Contributor Author

@larsmans sure. Here is the email I sent to Matt Hoffman before.
And he emailed me back the relicensed code; here are the files.

@chyikwei deleted the onlineldavb branch on August 12, 2015 04:06
@amueller
Member

Maybe just add to the license header: "relicensed as BSD with the kind permission of Matt Hoffman"?
