
[MRG + 1] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 #7992

Merged: 24 commits, Dec 20, 2016

Conversation

garyForeman
Contributor

Reference Issue

Fixes #7954

What does this implement/fix? Explain your changes.

This fixes the broken perplexity method of the LatentDirichletAllocation class. This method was broken in the 0.18 release when the default behavior of the transform method switched to returning normalized document topic distributions. However, the perplexity calculation uses likelihoods rather than probabilities.

To fix the issue, I have added an optional argument, normalize, to the transform method. The default is set to True such that if the argument is not specified, the behavior of the transform method remains unchanged from the version 0.18 release. When the transform method is called from the perplexity method, normalize=False is passed so that the likelihoods rather than the probabilities are returned.
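Roughly, the proposed change amounts to the following sketch (a toy paraphrase, not the actual scikit-learn code; _e_step_doc_topic is a hypothetical stand-in for the variational E-step):

import numpy as np

def transform(self, X, normalize=True):
    # Likelihood-scale document-topic values from the (stand-in) E-step.
    doc_topic_distr = self._e_step_doc_topic(X)
    if normalize:
        # Default: keep the 0.18 behavior, each row normalized to sum to 1.
        doc_topic_distr /= doc_topic_distr.sum(axis=1)[:, np.newaxis]
    # With normalize=False, perplexity gets the likelihoods it needs.
    return doc_topic_distr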

Any other comments?

While I believe the fix proposed here is the most elegant, I would understand if we didn't want to add an additional argument to the transform method. If this is the case, I am happy to pivot the implementation to use a "private" class attribute that stores the normalization information.

@amueller amueller added the Bug label Dec 6, 2016
@amueller amueller added this to the 0.19 milestone Dec 6, 2016
@amueller
Member

amueller commented Dec 6, 2016

Thank you for the fix.

Can you check on the failing tests?
We also need a regression test for this bug.

I think we don't want to add to the public API for this (imho), and I don't think we want to store the state in a private variable.

I think my preferred solution would be to add a private method _transform or _unnormalized_transform that is called both in transform and perplexity.
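Sketched out, that structure might look like the following toy class (illustrative only; the E-step is a random stand-in and the bound is a toy placeholder, not the real computation):

import numpy as np

class LDASketch:
    # Toy stand-in for LatentDirichletAllocation, for illustration only.

    def _e_step_doc_topic(self, X):
        # Hypothetical stand-in for the variational E-step: unnormalized,
        # likelihood-scale document-topic values.
        rng = np.random.RandomState(0)
        return rng.gamma(2.0, size=(X.shape[0], 3))

    def _unnormalized_transform(self, X):
        # Single source of truth, shared by transform() and perplexity().
        return self._e_step_doc_topic(X)

    def transform(self, X):
        # Public API keeps the 0.18 behavior: rows normalized to sum to 1.
        doc_topic_distr = self._unnormalized_transform(X)
        return doc_topic_distr / doc_topic_distr.sum(axis=1)[:, np.newaxis]

    def perplexity(self, X):
        # Recomputes the unnormalized values internally rather than
        # trusting a user-supplied (possibly normalized) array.
        doc_topic_distr = self._unnormalized_transform(X)
        word_cnt = X.sum()
        bound = np.log(doc_topic_distr).sum()  # toy placeholder bound
        return np.exp(-1.0 * bound / word_cnt)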

@garyForeman
Contributor Author

@amueller I really like your solution, which is not something I would have come up with on my own. I'll work on implementing your suggestion. Thanks for the advice!

FYI, the failing tests are due to the fact that I've changed the API, which, as you've suggested, is not ideal. Specifically, the tests failed because the fit_transform method implemented in TransformerMixin has no way to pass the normalize argument when it calls transform.
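For context, TransformerMixin.fit_transform is essentially the following (a simplified paraphrase of the real method): fit_params are forwarded to fit, but transform is called with X alone, leaving no channel for a new normalize argument.

def fit_transform(self, X, y=None, **fit_params):
    # Simplified: the real method also handles the y is None case
    # separately. Note transform(X) takes no extra keyword arguments.
    return self.fit(X, y, **fit_params).transform(X)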

@amueller
Member

amueller commented Dec 6, 2016

Ok, let's see how it goes when you implement the suggestion :)

@raghavrv raghavrv changed the title Fix perplexity method by adding normalize argument to transform method, ISSUE #7954 [WIP] Fix perplexity method by adding normalize argument to transform method, Issue #7954 Dec 6, 2016
@garyForeman
Contributor Author

I've implemented an _unnormalized_transform method as you've suggested, which technically passes the test module and fixes the perplexity method when passed doc_topic_distr=None. However, if a user saves the output of transform and passes the result to perplexity, the result will be incorrect because transform will only provide a normalized document topic distribution. How would you like to handle this situation?
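Concretely, a user could still write something like this (0.18-era API, where n_topics was the parameter name):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.random.RandomState(0).randint(0, 5, size=(20, 10))
lda = LatentDirichletAllocation(n_topics=3, random_state=0).fit(X)

doc_topic_distr = lda.transform(X)  # normalized: each row sums to 1
# Passing this back in yields an incorrect perplexity, because the
# calculation expects likelihood-scale (unnormalized) values:
print(lda.perplexity(X, doc_topic_distr))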

@amueller
Member

amueller commented Dec 6, 2016

That's a good question and I don't really have a straightforward answer.
I find the API for perplexity with the optional doc_topic_distr a bit odd, because we have no way of checking whether it was computed correctly.
Also, that name seems a bit confusing, because what perplexity needs is not the normalized distribution, though I'm not that familiar with the language of LDA and might get it wrong.

We could deprecate passing the precomputed likelihood in the public interface. It might have been that it was only added there for the use in the training algorithm, but I'm not sure.

Alternatively we could add a check that throws a warning if someone passes normalized probabilities to perplexity, though that seems to be a bit of a hack to me.
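That check would be easy enough to sketch, for what it's worth (purely illustrative; this approach was not adopted):

import warnings
import numpy as np

def _warn_if_normalized(doc_topic_distr):
    # Heuristic: rows of a normalized distribution sum to 1, which
    # likelihood-scale values generally do not.
    if np.allclose(doc_topic_distr.sum(axis=1), 1.0):
        warnings.warn("doc_topic_distr appears to be normalized; "
                      "perplexity expects unnormalized values.")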

@garyForeman
Contributor Author

garyForeman commented Dec 7, 2016

I'm inclined to agree with deprecating the optional doc_topic_distr argument, especially considering that, as of sklearn v0.18, the user no longer has any way to obtain an appropriately unnormalized distribution. Would you like me to resubmit what I currently have written?

@amueller
Member

amueller commented Dec 7, 2016

Would you like me to resubmit what I currently have written?

Not sure what you mean. You can update the branch with the appropriate changes.
You can go ahead and make the changes but I'd like someone else to chime in if they think that's the best solution. @jnothman ?

@garyForeman garyForeman changed the title [WIP] Fix perplexity method by adding normalize argument to transform method, Issue #7954 [WIP] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 Dec 7, 2016
@amueller
Member

amueller commented Dec 8, 2016

So that is ok as a fix, but now the API is a bit weird, as you agreed, so I think you should go ahead and deprecate doc_topic_distr.

@garyForeman
Contributor Author

Sounds good, I'll get to it in the next day or so.

@jnothman
Member

Should this now be MRG?

@garyForeman
Contributor Author

I wasn't planning to make any more updates unless you or @amueller had further suggestions. So in my biased opinion, it's ready to merge :)

@amueller amueller changed the title [WIP] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 [MRG] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 Dec 12, 2016
@@ -719,3 +739,21 @@ def perplexity(self, X, doc_topic_distr=None, sub_sampling=False):
        perword_bound = bound / word_cnt

        return np.exp(-1.0 * perword_bound)

    def perplexity(self, X, sub_sampling=False):
Member

You need to deprecate the doc_topic_distr parameter; you can't just remove it. It should raise a deprecation warning and be ignored, I think (because the results will likely be incorrect).
All public API must remain the same between releases unless there was a deprecation before.
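The usual pattern is a sentinel default plus a warning, roughly like this (continuing the toy LDASketch class sketched earlier; not the merged code):

import warnings

class LDASketchDeprecating(LDASketch):
    def perplexity(self, X, doc_topic_distr='deprecated',
                   sub_sampling=False):
        # A sentinel default distinguishes "not passed" from any real
        # array the user might supply.
        if doc_topic_distr != 'deprecated':
            warnings.warn("Argument 'doc_topic_distr' is deprecated and "
                          "is being ignored.", DeprecationWarning)
        # The passed value is discarded; recompute internally.
        return super().perplexity(X)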

Contributor Author

Ok, makes sense. Sorry for the confusion.

"""

if doc_topic_distr is not None:
DeprecationWarning("Argument 'doc_topic_distr' is deprecated as "
Member

Maybe say explicitly that it is ignored.

@amueller
Member

Please add a regression test that checks that the perplexity is now computed correctly. I think the best way would be to check against the one computed during fit, but I'm not sure.

Also, please add an entry to whatsnew, and add a versionchanged entry to transform that mentions the deprecation of the argument.
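One plausible shape for such a test (a hedged sketch, not the test that was actually merged):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def test_lda_perplexity_regression():
    # After the fix, perplexity() recomputes the unnormalized
    # distribution itself, so it should return a sane, finite value.
    X = np.random.RandomState(0).randint(0, 5, size=(20, 10))
    lda = LatentDirichletAllocation(n_topics=3, random_state=0).fit(X)
    assert np.isfinite(lda.perplexity(X))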

        doc_topic_distr : None or array, shape=(n_samples, n_topics)
            Document topic distribution.
            If it is None, it will be generated by applying transform on X.

Member

You can actually use a .. deprecated:: note here.

@jnothman
Member

@amueller, we're not going to get someone more familiar with this code to comment, are we?

@amueller
Member

amueller commented Dec 16, 2016

We could try to summon @ogrisel or @larsmans, but I'm not even sure how familiar they are with the code (and I don't know if I'm allowed to light incense in my office). @chyikwei is the original author and might have some input.

@chyikwei
Contributor

The fix looks good to me.

We need to pass an unnormalized doc_topic_distr to estimate the lower bound for the perplexity, which is why the transform method returned unnormalized values before (Eq. 16 in the reference paper).

For the perplexity function, it makes sense to deprecate doc_topic_distr since transform doesn't return it anymore. Maybe we should also deprecate sub_sampling (it is only used in the online method).
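As a reminder of the quantity involved (my paraphrase of the computation visible in the diff above; the exact bound is Eq. 16 of the referenced paper):

\[
\mathrm{perplexity}(X) \approx
  \exp\!\left( -\, \frac{\mathcal{L}(X)}{\sum_{d,w} n_{dw}} \right)
\]

where \(\mathcal{L}(X)\) is the variational lower bound on the log likelihood, evaluated with the unnormalized document-topic values, and \(n_{dw}\) is the count of word \(w\) in document \(d\).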

@jnothman
Member

Ideally, we'd also check that the warning is raised when doc_topic_distr is passed (None or otherwise). Otherwise, we should merge this ASAP.

@garyForeman
Contributor Author

garyForeman commented Dec 19, 2016

I'll go ahead and write a regression test for the deprecated doc_topic_distr warning message.

IMHO, deprecating the sub_sampling argument sounds like an issue for a separate pull request. After a quick read through the code, it looked as though sub_sampling is set to False everywhere, but I wouldn't say I have a great understanding of the implications behind deprecating this argument.

@chyikwei
Contributor

Yeah, deprecating sub_sampling should be a separate issue.

@jnothman jnothman merged commit 6a01e89 into scikit-learn:master Dec 20, 2016
@jnothman
Member

Thanks a lot @garyForeman, and @chyikwei, your continued advice is much appreciated.

@naoyak
Contributor

naoyak commented Jan 2, 2017

@amueller @garyForeman

I was discussing a similar PR (#8137) with @jnothman, and it came up that this change doesn't really follow the standard deprecation procedure.

I think I understand the justification that the parameter is pretty much useless if the user isn't able to obtain and supply a valid unnormalized distribution, but maybe the functionality should be restored, or at least a loud and clear notice added to the docs and the DeprecationWarning that passing doc_topic_distr does nothing?

Sorry to butt in on a topic I know little about; I just wanted to relay that input.

@garyForeman
Contributor Author

@naoyak @jnothman

I'm not sure I would say that passing doc_topic_distr does nothing. The code executes as if the argument had not been passed, computing the properly unnormalized topic distribution that is needed to produce correct results.

#8137 appears to be focused on tidying up the API, whereas this is a bug fix. It makes sense to maintain the ridge_alpha parameter for the near future; otherwise, you will break users' code. This change does not break users' code, but will admittedly increase run time.

As @amueller and I discussed above, the ideas we had for maintaining working functionality of the doc_topic_distr argument were to either change the API or store normalization constants in a private attribute. Neither idea was palatable.

@jnothman
Member

jnothman commented Jan 2, 2017

Sorry for not reviewing carefully. If doc_topic_distr is not being passed on to _perplexity_precomp_distr, the warning should say that it is already being ignored.

@garyForeman
Contributor Author

garyForeman commented Jan 2, 2017

Ok, I can add that in. I'll open a new pull request shortly.

"""
if doc_topic_distr != 'deprecated':
warnings.warn("Argument 'doc_topic_distr' is deprecated and will "
"be ignored as of 0.19. Support for this argument "
Contributor Author

So I'll change this from future to present tense, i.e. "...is deprecated and is being ignored as of 0.19." Does that address the issue?

@jnothman
Member

jnothman commented Jan 2, 2017 via email

        X : array-like or sparse matrix, [n_samples, n_features]
            Document word matrix.

        doc_topic_distr : None or array, shape=(n_samples, n_topics)
Contributor

The docstring should probably note clearly that doc_topic_distr is currently discarded as well, since that's where users look first.

@jnothman
Member

jnothman commented Jan 2, 2017 via email

Successfully merging this pull request may close these issues.

LatentDirichletAllocation perplexity method broken in version 0.18.1