[MRG + 1] implementing LDA (Latent Dirichlet Allocation) with online variational Bayes #3659
Conversation
For the record, here's a fast scikit-learn compatible LDA implementation: CC @ariddell
It would be great to see LDA in sklearn in any form! On the subject of online algorithms, apparently onlinehdp has very good results and I think it has the same order-of-operations requirements as online LDA: I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD.
@ariddell What parameter inference method does your implementation use? Would you consider relicensing to BSD?
My implementation uses collapsed Gibbs sampling, rather different from online LDA. I'd be willing to do a one-off relicense to BSD for scikit-learn if there was interest.
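For readers unfamiliar with the alternative being discussed: this is not the code from @ariddell's package, just a minimal illustrative sketch of what collapsed Gibbs sampling for LDA looks like (the sampler repeatedly resamples each token's topic from its full conditional, after removing the token from the count matrices).

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_vocab, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative only).

    docs: list of token-id lists. Returns (doc_topic_counts, topic_word_counts).
    """
    rng = np.random.default_rng(seed)
    # z[d][i]: topic currently assigned to the i-th token of document d
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    ndk = np.zeros((len(docs), n_topics))  # doc-topic counts
    nkw = np.zeros((n_topics, n_vocab))    # topic-word counts
    nk = np.zeros(n_topics)                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # remove the token, then resample its topic from the
                # collapsed full conditional p(z_di = k | rest)
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw
```

Unlike online variational Bayes, each sweep touches every token of the corpus, which is why the comparison on large datasets below is interesting.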
@ariddell Yeah, onlineHDP has similar operations to onlineLDA, but I am not sure if the E-step can be executed in parallel, since the number of topics will change over time. (I haven't gone through the details of its source code yet.) BTW, after seeing your implementation, I think I should do some profiling first and see if I can optimize my current implementation.
Profiling results for the important functions:
Can we please rename this to
Not sure decomposition is the right folder, but I don't have a better idea ^^
How does this compare against the gensim implementation? Is that the same approach?
For 3) it would be cool if you could give a performance comparison (and maybe also a comparison of how well it fits the data) as a sanity check?
OK, I will add a performance comparison with gensim's implementation. For "how well it fits the data", I will compare perplexity.
Thanks :)
I've almost got the transform method working for LDA in https://github.com/ariddell/lda (fit and fit_transform work fine); I would imagine Gibbs sampling beats online LDA in perplexity and is reasonably fast for small to medium datasets -- and I'd be very curious to see how things play out with large datasets. @chyikwei I'd be happy to help add Gibbs sampling to the benchmarks once you settle on them.
Quick update:
@ariddell Thanks! I will start with the 20 newsgroups dataset first, and we can try a larger one later.
It would be very interesting to see how the collapsed Gibbs sampler compares to this, indeed.
For larger datasets, there's Enron and PubMed: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Hi, I used the 20 newsgroups dataset and compared both online and batch updates. Note: one thing I haven't figured out is why perplexity goes up in gensim as the number of workers increases. I will double-check that. @ariddell It would be cool if you could add the Gibbs sampler's results. I will check the larger-datasets link you posted next.
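The benchmark script itself isn't reproduced in the thread, but the batch-vs-online comparison can be sketched with the estimator as it was eventually merged into scikit-learn (`LatentDirichletAllocation`), using small synthetic counts in place of the 20 newsgroups matrix so it runs in seconds:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# synthetic document-term counts stand in for the 20 newsgroups data
rng = np.random.RandomState(0)
X = rng.poisson(0.5, size=(100, 50))

perplexities = {}
for method in ("batch", "online"):
    lda = LatentDirichletAllocation(n_components=5, learning_method=method,
                                    max_iter=5, random_state=0)
    lda.fit(X)
    # held-out data would normally go here; we score the training set
    # just to illustrate the comparison
    perplexities[method] = lda.perplexity(X)
```

Lower perplexity means a better fit, which is the number being compared against gensim in the spreadsheet.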
Is there a reason that perplexity is computed by default at every step?
Well, the spreadsheet is overall in favor of your implementation. Good work!
Yeah, that looks promising :)
@GaelVaroquaux There is no reason to compute perplexity at every step. I will add a parameter for this (similar to gensim's). @amueller I'm not sure what the best way to visualize topic models is (usually, I just check the top words in each topic). Any ideas?
Maybe pick the top three words in each topic and then do a bar graph of how likely they are under each of the topics?
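A minimal sketch of the data preparation behind that suggestion, assuming a fitted topic-word matrix (rows are topics, columns are vocabulary counts or weights); the names here are illustrative, not from the PR:

```python
import numpy as np

def top_words_per_topic(topic_word, vocab, n_top=3):
    """Return, per topic, a list of (word, probability) pairs for the
    n_top most likely words, after row-normalizing topic_word."""
    probs = topic_word / topic_word.sum(axis=1, keepdims=True)
    tops = []
    for row in probs:
        idx = np.argsort(row)[::-1][:n_top]
        tops.append([(vocab[i], float(row[i])) for i in idx])
    return tops

# toy two-topic example
vocab = ["ball", "game", "vote", "law", "team"]
topic_word = np.array([[8., 6., 1., 1., 5.],   # sports-ish topic
                       [1., 1., 9., 7., 1.]])  # politics-ish topic
tops = top_words_per_topic(topic_word, vocab)
# a grouped matplotlib bar chart (plt.bar) over these probabilities,
# one group per top word, gives the comparison @amueller describes
```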
        return score

    def preplexity(self, X, gamma, sub_sampling=False):
It would make sense to have a `score` method based on `transform` and `perplexity`, right?
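A hypothetical sketch of that suggestion, shown here on the estimator as it was eventually merged (`LatentDirichletAllocation`); `perplexity` already runs the transform-style E-step internally, so `score` only needs to flip it into a higher-is-better value:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

class PerplexityScoredLDA(LatentDirichletAllocation):
    """Illustrative only: tie score() to perplexity so that a higher
    score means a better fit (lower perplexity)."""
    def score(self, X, y=None):
        # perplexity() infers the doc-topic distribution for X (the
        # E-step), so this is "score based on transform and perplexity"
        return -np.log(self.perplexity(X))

rng = np.random.RandomState(0)
X = rng.poisson(0.5, size=(50, 30))
model = PerplexityScoredLDA(n_components=3, max_iter=5, random_state=0).fit(X)
```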
Yes, that makes sense. Will add it.
This might be a stupid question, but if we wanted to add the collapsed Gibbs sampler version, say the one by @ariddell, could we use the same public interface and branch using an
Yes, I think we can share the interface between different implementations.
Ok cool :)
Please also rename
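The shape of that shared interface can be sketched as follows; the class name, parameter name, and both backend strings are placeholders for illustration, not scikit-learn code:

```python
class LDA:
    """Illustrative: one public estimator, with an `algorithm` switch
    choosing the inference backend at fit time."""
    def __init__(self, n_topics=10, algorithm="online_vb"):
        self.n_topics = n_topics
        self.algorithm = algorithm

    def fit(self, X):
        # placeholders standing in for the two real fitting routines
        if self.algorithm == "online_vb":
            self.backend_ = "online variational Bayes"
        elif self.algorithm == "gibbs":
            self.backend_ = "collapsed Gibbs sampling"
        else:
            raise ValueError(f"unknown algorithm {self.algorithm!r}")
        return self
```

Both backends would then share `transform`, `perplexity`, and the rest of the public API.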
I did a quick review on the
@chyikwei could you please answer or address the comments of @amueller on the
Sure. I will benchmark the Cython code again, since there have been some code changes since I ran
Here is the Cython code benchmark (on
For reference, here is my profiling code.
    cnts = X[idx_d, ids]
    temp = dirichlet_doc_topic[idx_d, :, np.newaxis] + self.dirichlet_component_[:, ids]
    tmax = temp.max(axis=0)
    norm_phi = np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax
scikit-learn has a `logsumexp` function in `utils.extmath`
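The manual max-shift in the diff above is exactly the stabilized log-sum-exp trick; it matches a library `logsumexp` call. (Note: `sklearn.utils.extmath.logsumexp` has since been removed from scikit-learn; `scipy.special.logsumexp` is the current equivalent, used here.)

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.RandomState(0)
temp = rng.randn(5, 7)  # stand-in for the (n_topics, n_words) array

# the diff's manual max-shift trick, stable for large-magnitude inputs
tmax = temp.max(axis=0)
norm_phi = np.log(np.sum(np.exp(temp - tmax), axis=0)) + tmax

# identical to the one-liner from the library
assert np.allclose(norm_phi, logsumexp(temp, axis=0))
```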
Cool, will use it. Thanks!
I haven't checked the documentation in detail but I'd be happy to merge this now. @ogrisel what do you think?
ping @ogrisel again ;)
@larsmans are you interested in this? I think it is in pretty good shape.
ENH Latent Dirichlet Allocation (LDA) with online variational Bayes
Merged this. Let's finish any nitpicking in master.
It has been suggested to mention more clearly in the code that Matt Hoffman allowed us to license the code as BSD even though it's derived from his GPL implementation:
Yeah, it'd be good to reproduce the email or something. @chyikweiyau, could you do that?
Maybe just add to the license header: "relicensed as BSD with the kind permission of Matt Hoffman"?
This PR is an implementation of Matt Hoffman's topic modeling algorithm, LDA with online variational Bayes. Based on previous discussion in this email thread, I asked Matt if he could relicense his `onlineldavb` code to BSD. His code is now relicensed, so I created a PR for it.

I use the name `OnlineLDA` for this model and put it in the `decomposition` folder. And since the model can run both online and batch updates, I implemented both `fit` and `partial_fit` methods.

The algorithm part and unit tests are done and ready for review. Will work on an example next.
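The two update modes described above can be sketched with the estimator as it was eventually merged into scikit-learn (`LatentDirichletAllocation`, which grew out of this PR); synthetic counts stand in for a real document-term matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.poisson(0.5, size=(100, 50))  # stand-in document-term counts

# batch mode: one call to fit over the whole matrix
batch = LatentDirichletAllocation(n_components=5, learning_method="batch",
                                  max_iter=5, random_state=0).fit(X)

# online mode: stream mini-batches through partial_fit
online = LatentDirichletAllocation(n_components=5, random_state=0)
for chunk in np.array_split(X, 4):
    online.partial_fit(chunk)
```

Both paths end with a fitted `components_` matrix of topic-word weights.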
Check List:
References:
[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach
[2] original onlineldavb code (with BSD license)