[MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes #3659

Merged
merged 30 commits into scikit-learn:master from chyikwei:onlineldavb on Aug 9, 2015

Conversation

@chyikwei
Contributor

chyikwei commented Sep 13, 2014

This PR is an implementation of Matt Hoffman's topic modeling algorithm LDA with online variational Bayes.

Based on a previous discussion in this email thread, I asked Matt if he could relicense his onlineldavb code to BSD. Now that his code has been relicensed, I have created a PR for it.

I use the name OnlineLDA for this model and put it in the decomposition folder. Since the model supports both online and batch updates, I implemented both fit and partial_fit methods.
The algorithm and unit tests are done and ready for review. I will work on an example next.
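For context, the core difference between the batch and online updates in [1] is the step size used to blend each minibatch's estimate into the topic-word variational parameters. A minimal numpy sketch of that schedule (variable names are mine, not the PR's; the minibatch estimate is stubbed out):

```python
import numpy as np

def learning_rate(t, tau0=1.0, kappa=0.7):
    # Robbins-Monro step size from Hoffman et al.: rho_t = (tau0 + t)^(-kappa)
    return (tau0 + t) ** (-kappa)

rng = np.random.default_rng(0)
lam = rng.gamma(100.0, 0.01, size=(5, 20))   # topic-word variational parameters
for t in range(10):
    # lam_hat stands in for the estimate computed from one minibatch's E-step
    lam_hat = rng.gamma(100.0, 0.01, size=lam.shape)
    rho = learning_rate(t)
    lam = (1.0 - rho) * lam + rho * lam_hat  # partial_fit blends; batch fit uses rho = 1
```

With rho = 1 every pass, the blend reduces to the plain batch update, which is why the two modes can share one code path.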

Check List:
  • algorithm implementation
  • unit test
  • profiling
  • optimization
  • example
  • documentation
Reference:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach
[2] original onlineldavb code (with BSD license)

@coveralls

coveralls commented Sep 13, 2014

Coverage increased (+0.02%) when pulling e872e80 on chyikwei:onlineldavb into 78fbd25 on scikit-learn:master.

@mblondel

Member

mblondel commented Sep 20, 2014

For the record, here's a fast scikit-learn compatible LDA implementation:
https://github.com/ariddell/lda/

CC @ariddell

@ariddell

Contributor

ariddell commented Sep 20, 2014

It would be great to see LDA in sklearn in any form!

On the subject of online algorithms, apparently onlinehdp has very good results, and I think it has the same order of operations requirements as online LDA: I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD.

@mblondel

Member

mblondel commented Sep 20, 2014

@ariddell What parameter inference method does your implementation use? Would you consider relicensing to BSD?

@ariddell

Contributor

ariddell commented Sep 20, 2014

My implementation uses collapsed Gibbs sampling, which is rather different from online LDA. I'd be willing to do a one-off relicense to BSD for scikit-learn if there is interest.

@chyikwei

Contributor

chyikwei commented Sep 20, 2014

@ariddell yeah, onlineHDP has similar operations to onlineLDA. But I am not sure if the E-step can be executed in parallel, since the number of topics changes over time. (I haven't gone through the details of its source code yet.)

btw, after seeing your implementation, I think I should do some profiling first and see if I can optimize my current implementation.

@chyikwei

Contributor

chyikwei commented Oct 14, 2014

profiling results for the important functions:
https://gist.github.com/chyikwei/59c3f024ff3148efe1df

@amueller

Member

amueller commented Jan 9, 2015

Can we please rename this to LatentDirichletAllocation even though that is long? We have a (badly named) LDA class in scikit-learn.

@amueller

Member

amueller commented Jan 9, 2015

Not sure decomposition is the right folder, but I don't have a better idea ^^

@amueller

Member

amueller commented Jan 9, 2015

How does this compare against the gensim implementation? Is that the same approach?

@chyikwei

Contributor

chyikwei commented Jan 9, 2015

@amueller

  1. ok. I will rename it.
  2. I am not sure about the folder either, but "decomposition" is the best one I can find.
  3. yes. based on gensim's LDA page, I think its approach is also M. Hoffman's online LDA.
@amueller

Member

amueller commented Jan 9, 2015

For 3) it would be cool if you could give a performance comparison (and maybe also a comparison of how well it fits the data) as a sanity check.

@chyikwei

Contributor

chyikwei commented Jan 9, 2015

ok, I will add a performance comparison with gensim's implementation. For "how well it fits the data", I will compare perplexity.
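For readers following along: perplexity here is the standard held-out measure, the exponential of the negative log-likelihood per word (lower is better). A minimal sketch:

```python
import numpy as np

def perplexity(total_log_likelihood, total_word_count):
    # perplexity = exp(-log-likelihood / number of tokens); lower is better
    return np.exp(-total_log_likelihood / total_word_count)

# sanity check: a model assigning every token probability 1/2 has perplexity 2
perplexity(1000.0 * np.log(0.5), 1000.0)  # -> 2.0
```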

@amueller

Member

amueller commented Jan 9, 2015

Thanks :)

@ariddell

Contributor

ariddell commented Jan 10, 2015

I've almost got the transform method working for LDA in https://github.com/ariddell/lda (fit and fit_transform work fine). I would imagine Gibbs sampling beats online LDA in perplexity and is reasonably fast for small to medium datasets -- and I'd be very curious to see how things play out with large datasets.

@chyikwei I'd be happy to help add Gibbs sampling to the benchmarks once you settle on them.

@chyikwei

Contributor

chyikwei commented Jan 12, 2015

quick update:

  1. renamed the model to LatentDirichletAllocation
  2. added some Cython optimizations
  3. will work on a performance comparison with gensim next (not familiar with its interface yet)

@ariddell thx! I will start with the 20 newsgroups dataset first, and we can try a larger one later.

@amueller

Member

amueller commented Jan 13, 2015

It would be very interesting to see how the collapsed Gibbs sampler compares to this, indeed.

@ariddell

Contributor

ariddell commented Jan 13, 2015

For larger datasets, there's Enron and PubMed: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words

@chyikwei

Contributor

chyikwei commented Jan 14, 2015

Hi,
I put the comparison between my LDA implementation and gensim in this spreadsheet. (My script is here.)

I used the 20 newsgroups dataset and compared both online and batch updates.
In batch mode, the performance is close.
In online mode, the speed difference comes from how often perplexity is computed during training.
My partial_fit method doesn't compute perplexity at all, so it is much faster than gensim's. (For details, check the notes in the result spreadsheet.)

Note: one thing I haven't figured out is why perplexity goes up in gensim as the number of workers increases. I will double-check that.

@ariddell It would be cool if you could add the Gibbs sampler's results. I will check the larger-datasets link you posted next.

@GaelVaroquaux

Member

GaelVaroquaux commented Jan 14, 2015

Is there a reason that perplexity is computed by default at every step? It seems to me that we could make model convergence faster by computing it every 2 or 4 steps.

@GaelVaroquaux

Member

GaelVaroquaux commented Jan 14, 2015

Well, the spreadsheet is overall in favor of your implementation. Good work!

@amueller

Member

amueller commented Jan 14, 2015

yeah that looks promising :)
It might be nice to have a plot_ example that generates an image. Then the output will be visible in the example gallery on the website.

@chyikwei

Contributor

chyikwei commented Jan 14, 2015

@GaelVaroquaux There is no reason to compute perplexity at every step. I will add a parameter for this (similar to gensim's eval_every). Also, I think I need to check gensim's source code and settings to make sure I am doing an apples-to-apples comparison. (If we both use Matt Hoffman's code, the results should be similar.)

@amueller I'm not sure what the best way to visualize topic models is (usually I just check the top words in each topic). Any ideas?

@amueller

Member

amueller commented Jan 14, 2015

Maybe pick the top three words in each topic and then do a bar graph of how likely they are under each of the topics?
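The data for such a plot is easy to extract; a small numpy sketch with a toy topic-word matrix (the matrix and vocabulary are made up for illustration):

```python
import numpy as np

topic_word = np.array([[0.5, 0.3, 0.15, 0.05],   # each row: one topic's word distribution
                       [0.1, 0.1, 0.2, 0.6]])
vocab = np.array(["apple", "bear", "car", "dog"])
n_top = 3

for k, dist in enumerate(topic_word):
    top = np.argsort(dist)[::-1][:n_top]  # indices of the most likely words
    print("topic %d:" % k, list(zip(vocab[top], dist[top])))
```

The resulting (word, probability) pairs could feed matplotlib's bar directly for the gallery example.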

return score
def preplexity(self, X, gamma, sub_sampling=False):

@amueller

amueller Jan 14, 2015

Member

It would make sense to have a score method based on transform and perplexity, right?

@chyikwei

chyikwei Jan 14, 2015

Contributor

yes, that makes sense. will add it.

Parameters
----------
n_topics: int, optional (default: 10)

@amueller

amueller Jan 14, 2015

Member

There needs to be a space before : for the numpy doc format (yes, everywhere, sorry).

@chyikwei

chyikwei Jan 14, 2015

Contributor

k. will fix it

@amueller

Member

amueller commented Jan 14, 2015

This might be a stupid question, but if we wanted to add the collapsed Gibbs sampler version, say the one by @ariddell, could we use the same public interface and branch using an algorithm='collapsed_gibbs' parameter?

@chyikwei

Contributor

chyikwei commented Jan 14, 2015

yes, I think we can share the interface across different implementations. The online variational Bayes algorithm just has a few extra parameters used for the online update, which can be ignored by the Gibbs sampler.
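Such a shared interface could look something like the following hypothetical sketch; the class name, parameter values, and solver stubs are all illustrative, not the PR's actual API:

```python
class LDAModel:
    """Hypothetical sketch: one public estimator dispatching to two solvers."""

    def __init__(self, algorithm="online_vb"):
        if algorithm not in ("online_vb", "collapsed_gibbs"):
            raise ValueError("unknown algorithm: %r" % (algorithm,))
        self.algorithm = algorithm

    def fit(self, X):
        # online-only parameters (decay, batch size, ...) would simply be
        # ignored by the Gibbs branch
        if self.algorithm == "online_vb":
            return self._fit_online_vb(X)
        return self._fit_collapsed_gibbs(X)

    def _fit_online_vb(self, X):
        return "online_vb"        # stand-in for the variational updates

    def _fit_collapsed_gibbs(self, X):
        return "collapsed_gibbs"  # stand-in for the sampler
```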

@amueller

Member

amueller commented Jan 14, 2015

Ok cool :)

@chyikwei

Contributor

chyikwei commented Jan 15, 2015

quick update:

  1. added an evaluate_every parameter to control how often we calculate perplexity.

    In my test, training is about 2x faster when evaluation is disabled. (58s vs 101s)
  2. added a score method and fixed the docstring format.
  3. updated the performance comparison spreadsheet.

    I found gensim uses chunksize * workers documents for each update (link), so the performance gets worse as the number of workers increases. The updated results are here. (Check the blue rows; the online performance for different worker counts is similar now.)
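The evaluate_every behavior is essentially a modulus check inside the training loop; a schematic sketch (the function name and shape are mine, not the PR's):

```python
def training_iterations_evaluated(n_iter, evaluate_every):
    # returns the iterations at which perplexity would be computed;
    # evaluate_every <= 0 disables evaluation entirely (the fast path)
    evaluated = []
    for it in range(1, n_iter + 1):
        # ... one EM pass over the data would go here ...
        if evaluate_every > 0 and it % evaluate_every == 0:
            evaluated.append(it)
    return evaluated

training_iterations_evaluated(10, 4)  # -> [4, 8]
training_iterations_evaluated(10, 0)  # -> [] (no evaluation)
```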
@ogrisel

Member

ogrisel commented Jun 25, 2015

Please also rename self.rng_ to self.random_state_ to be consistent with other probabilistic models.

@ogrisel

Member

ogrisel commented Jun 25, 2015

Also, I think we should make sure that the score method is in line with the ongoing work by @xuewei4d on (Bayesian) GMMs. It would be great, @xuewei4d, if you could review this PR to check for possible inconsistencies between your GSoC models and this one.

@xuewei4d

Contributor

xuewei4d commented Jul 4, 2015

I did a quick review of the score method. Because I didn't consider BayesianGaussianMixture when coding DensityMixin and GaussianMixture, the current DensityMixin is not compatible with variational models: it would have to add an extra term to the score, i.e. the part that comes from the variational distributions. In any case, the lower bound (approximation bound) is the right thing to compute for variational methods.

@ogrisel

Member

ogrisel commented Jul 7, 2015

@chyikwei could you please answer or address the comments of @amueller on the _online_lda.pyx file: https://github.com/scikit-learn/scikit-learn/pull/3659/files ?

@chyikwei

Contributor

chyikwei commented Jul 7, 2015

sure. I will benchmark the Cython code again, since there have been some code changes since I last ran line_profiler.

@chyikwei

Contributor

chyikwei commented Jul 8, 2015

Here is the Cython code benchmark (on the _update_doc_distribution function):

  1. mean_change improvement: from 32.8% to 6%. (complete result)
  2. _log_dirichlet_expectation improvement: from 38.4% to 28.9%. (complete result)

For reference, here is my profiling code.
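For readers following along: mean_change is, as I read the code, the mean absolute difference between successive parameter vectors, used as the convergence check inside the document E-step (hence why it dominated the profile before the Cython rewrite). A pure-numpy equivalent:

```python
import numpy as np

def mean_change(arr1, arr2):
    # mean absolute elementwise difference between two parameter vectors;
    # the E-step iteration stops once this falls below a tolerance
    return np.abs(arr1 - arr2).mean()

mean_change(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.5, 3.5]))  # -> 0.5
```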

@chyikwei

Contributor

chyikwei commented Jul 8, 2015

Also, should I move the Cython decorators to the top of the file?

I see some files use # cython: boundscheck=False (example) and some use @cython.boundscheck(False) (example).

cnts = X[idx_d, ids]
temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
tmax = temp.max(axis=0)
norm_phi = np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax

@xuewei4d

xuewei4d Jul 8, 2015

Contributor

scikit-learn has a logsumexp function in utils.extmath
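The quoted snippet is the standard log-sum-exp "max trick"; factoring it into a helper (or using a library logsumexp, as suggested) keeps it numerically stable where the naive form overflows. A self-contained numpy illustration:

```python
import numpy as np

def logsumexp_cols(temp):
    # stable log(sum(exp(temp), axis=0)): subtract each column's max before exp
    tmax = temp.max(axis=0)
    return np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax

temp = np.array([[1000.0, -2.0],
                 [ 999.0, -3.0]])
stable = logsumexp_cols(temp)  # finite for both columns
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(temp), axis=0))  # first column overflows to inf
```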

@chyikwei

chyikwei Jul 9, 2015

Contributor

cool. will use it. thanks!

@amueller

Member

amueller commented Jul 11, 2015

I haven't checked the documentation in detail but I'd be happy to merge this now. @ogrisel what do you think?

@amueller amueller changed the title from [MRG] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes to [MRG + 1] implementing LDA(Latent Dirichlet Allocation) with online variational Bayes Aug 3, 2015

@amueller

Member

amueller commented Aug 3, 2015

ping @ogrisel again ;)

@amueller

Member

amueller commented Aug 3, 2015

@larsmans are you interested in this? I think it is in pretty good shape.

larsmans added a commit that referenced this pull request Aug 9, 2015

Merge pull request #3659 from chyikwei/onlineldavb
ENH Latent Dirichlet Allocation (LDA) with online variational Bayes

@larsmans larsmans merged commit a6c6e73 into scikit-learn:master Aug 9, 2015

2 checks passed

continuous-integration/appveyor: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@larsmans

Member

larsmans commented Aug 9, 2015

Merged this. Let's finish any nitpicking in master.

@amueller

Member

amueller commented Aug 9, 2015

Thanks @larsmans, I agree :) 🍻 Thank you so much @chyikwei !

@mblondel

Member

mblondel commented Aug 11, 2015

It has been suggested that we mention more clearly in the code that Matt Hoffman allowed us to license the code as BSD even though it is derived from his GPL implementation:
https://twitter.com/EdwardRaffML/status/631123381212082176

@larsmans

Member

larsmans commented Aug 11, 2015

Yeah, it'd be good to reproduce the email or something. @chyikweiyau, could you do that?

@chyikwei

Contributor

chyikwei commented Aug 11, 2015

@larsmans sure. Here is the email I sent to Matt Hoffman before, and he emailed me back the relicensed code. Here are the files.

@chyikwei chyikwei deleted the chyikwei:onlineldavb branch Aug 12, 2015

@amueller

Member

amueller commented Aug 13, 2015

Maybe just add to the license header "relicensed as BSD with the kind permission of Matt Hoffman"?
