
[MRG + 1] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 #7992

Merged: 24 commits, Dec 20, 2016

Conversation

garyForeman
Contributor

Reference Issue

Fixes #7954

What does this implement/fix? Explain your changes.

This fixes the broken perplexity method of the LatentDirichletAllocation class. This method was broken in the 0.18 release when the default behavior of the transform method switched to returning normalized document topic distributions. However, the perplexity calculation uses likelihoods rather than probabilities.

To fix the issue, I have added an optional argument, normalize, to the transform method. The default is set to True such that if the argument is not specified, the behavior of the transform method remains unchanged from the version 0.18 release. When the transform method is called from the perplexity method, normalize=False is passed so that the likelihoods rather than the probabilities are returned.
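Roughly, the proposed change amounts to the following sketch (a toy paraphrase, not the actual scikit-learn code; _e_step_doc_topic is a hypothetical stand-in for the variational E-step):

import numpy as np

def transform(self, X, normalize=True):
    # Likelihood-scale document-topic values from the (stand-in) E-step.
    doc_topic_distr = self._e_step_doc_topic(X)
    if normalize:
        # Default: keep the 0.18 behavior, each row normalized to sum to 1.
        doc_topic_distr /= doc_topic_distr.sum(axis=1)[:, np.newaxis]
    # With normalize=False, perplexity gets the likelihoods it needs.
    return doc_topic_distr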

Any other comments?

While I believe the fix proposed here is the most elegant, I would understand if we didn't want to add an additional argument to the transform method. If this is the case, I am happy to pivot the implementation to use a "private" class attribute that stores the normalization information.

@amueller amueller added the Bug label Dec 6, 2016
@amueller amueller added this to the 0.19 milestone Dec 6, 2016
@amueller
Member

amueller commented Dec 6, 2016

Thank you for the fix.

Can you check on the failing tests?
We also need a regression test for this bug.

I think we don't want to add to the public API for this (imho), and I don't think we want to store the state in a private variable.

I think my preferred solution would be to add a private method _transform or _unnormalized_transform that is called both in transform and perplexity.
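Sketched out, that structure might look like the following toy class (illustrative only; the E-step is a random stand-in and the bound is a toy placeholder, not the real computation):

import numpy as np

class LDASketch:
    # Toy stand-in for LatentDirichletAllocation, for illustration only.

    def _e_step_doc_topic(self, X):
        # Hypothetical stand-in for the variational E-step: unnormalized,
        # likelihood-scale document-topic values.
        rng = np.random.RandomState(0)
        return rng.gamma(2.0, size=(X.shape[0], 3))

    def _unnormalized_transform(self, X):
        # Single source of truth, shared by transform() and perplexity().
        return self._e_step_doc_topic(X)

    def transform(self, X):
        # Public API keeps the 0.18 behavior: rows normalized to sum to 1.
        doc_topic_distr = self._unnormalized_transform(X)
        return doc_topic_distr / doc_topic_distr.sum(axis=1)[:, np.newaxis]

    def perplexity(self, X):
        # Recomputes the unnormalized values internally rather than
        # trusting a user-supplied (possibly normalized) array.
        doc_topic_distr = self._unnormalized_transform(X)
        word_cnt = X.sum()
        bound = np.log(doc_topic_distr).sum()  # toy placeholder bound
        return np.exp(-1.0 * bound / word_cnt)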

@garyForeman
Contributor Author

@amueller I really like your solution, which is not something I would have come up with on my own. I'll work on implementing your suggestion. Thanks for the advice!

FYI, the failing tests are due to the fact that I've changed the API, which, as you've suggested, is not ideal. Specifically, the tests failed because the fit_transform method implemented in TransformerMixin has no way to pass the normalize argument when it calls transform.
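For context, TransformerMixin.fit_transform is essentially the following (a simplified paraphrase of the real method): fit_params are forwarded to fit, but transform is called with X alone, leaving no channel for a new normalize argument.

def fit_transform(self, X, y=None, **fit_params):
    # Simplified: the real method also handles the y is None case
    # separately. Note transform(X) takes no extra keyword arguments.
    return self.fit(X, y, **fit_params).transform(X)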

@amueller
Member

amueller commented Dec 6, 2016

Ok, let's see how it goes when you implement the suggestion :)

@raghavrv raghavrv changed the title Fix perplexity method by adding normalize argument to transform method, ISSUE #7954 [WIP] Fix perplexity method by adding normalize argument to transform method, Issue #7954 Dec 6, 2016
@garyForeman
Contributor Author

I've implemented an _unnormalized_transform method as you've suggested, which technically passes the test module and fixes the perplexity method when passed doc_topic_distr=None. However, if a user saves the output of transform and passes the result to perplexity, the result will be incorrect because transform will only provide a normalized document topic distribution. How would you like to handle this situation?
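Concretely, a user could still write something like this (0.18-era API, where n_topics was the parameter name):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.random.RandomState(0).randint(0, 5, size=(20, 10))
lda = LatentDirichletAllocation(n_topics=3, random_state=0).fit(X)

doc_topic_distr = lda.transform(X)  # normalized: each row sums to 1
# Passing this back in yields an incorrect perplexity, because the
# calculation expects likelihood-scale (unnormalized) values:
print(lda.perplexity(X, doc_topic_distr))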

@amueller
Member

amueller commented Dec 6, 2016

That's a good question and I don't really have a straightforward answer.
I find the API for perplexity with the optional doc_topic_distr a bit odd, because we have no way of checking whether it was computed correctly.
Also, that name seems a bit confusing, because what perplexity needs is not the normalized distribution, though I'm not that familiar with the language of LDA and might get it wrong.

We could deprecate passing the precomputed likelihood in the public interface. It might have been that it was only added there for the use in the training algorithm, but I'm not sure.

Alternatively we could add a check that throws a warning if someone passes normalized probabilities to perplexity, though that seems to be a bit of a hack to me.
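That check would be easy enough to sketch, for what it's worth (purely illustrative; this approach was not adopted):

import warnings
import numpy as np

def _warn_if_normalized(doc_topic_distr):
    # Heuristic: rows of a normalized distribution sum to 1, which
    # likelihood-scale values generally do not.
    if np.allclose(doc_topic_distr.sum(axis=1), 1.0):
        warnings.warn("doc_topic_distr appears to be normalized; "
                      "perplexity expects unnormalized values.")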

@garyForeman
Contributor Author

garyForeman commented Dec 7, 2016

I'm inclined to agree with deprecating the optional doc_topic_distr argument, especially considering that, as of sklearn v0.18, the user no longer has any way to obtain an appropriately unnormalized distribution. Would you like me to resubmit what I currently have written?

@amueller
Member

amueller commented Dec 7, 2016

Would you like me to resubmit what I currently have written?

Not sure what you mean. You can update the branch with the appropriate changes.
You can go ahead and make the changes but I'd like someone else to chime in if they think that's the best solution. @jnothman ?

@garyForeman garyForeman changed the title [WIP] Fix perplexity method by adding normalize argument to transform method, Issue #7954 [WIP] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 Dec 7, 2016
@amueller
Member

amueller commented Dec 8, 2016

So that is ok as a fix, but now the API is a bit weird, as you agreed, so I think you should go ahead and deprecate doc_topic_distr.

@garyForeman
Contributor Author

Sounds good, I'll get to it in the next day or so.

@jnothman
Member

Should this now be MRG?

@garyForeman
Contributor Author

I wasn't planning to make any more updates unless you or @amueller had further suggestions. So in my biased opinion, it's ready to merge :)

@amueller amueller changed the title [WIP] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 [MRG] Fix perplexity method by adding _unnormalized_transform method, Issue #7954 Dec 12, 2016
@@ -719,3 +739,21 @@ def perplexity(self, X, doc_topic_distr=None, sub_sampling=False):
        perword_bound = bound / word_cnt

        return np.exp(-1.0 * perword_bound)

    def perplexity(self, X, sub_sampling=False):
Member

You need to deprecate the doc_topic_distr parameter; you can't just remove it. It should raise a deprecation warning and be ignored, I think (because the results will likely be incorrect).
All public API must remain the same between releases unless there was a deprecation before.
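The usual pattern is a sentinel default plus a warning, roughly like this (continuing the toy LDASketch class sketched earlier; not the merged code):

import warnings

class LDASketchDeprecating(LDASketch):
    def perplexity(self, X, doc_topic_distr='deprecated',
                   sub_sampling=False):
        # A sentinel default distinguishes "not passed" from any real
        # array the user might supply.
        if doc_topic_distr != 'deprecated':
            warnings.warn("Argument 'doc_topic_distr' is deprecated and "
                          "is being ignored.", DeprecationWarning)
        # The passed value is discarded; recompute internally.
        return super().perplexity(X)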

Contributor Author

Ok, makes sense. Sorry for the confusion.

"""

if doc_topic_distr is not None:
DeprecationWarning("Argument 'doc_topic_distr' is deprecated as "
Member

Maybe say explicitly that it is ignored.

@amueller
Member

Please add a regression test that checks that the perplexity is now computed correctly. I think the best way would be to check against the one computed during fit, but I'm not sure.

Also, please add an entry to whatsnew, and add a versionchanged entry to transform that mentions the deprecation of the argument.
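One plausible shape for such a test (a hedged sketch, not the test that was actually merged):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def test_lda_perplexity_regression():
    # After the fix, perplexity() recomputes the unnormalized
    # distribution itself, so it should return a sane, finite value.
    X = np.random.RandomState(0).randint(0, 5, size=(20, 10))
    lda = LatentDirichletAllocation(n_topics=3, random_state=0).fit(X)
    assert np.isfinite(lda.perplexity(X))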

        doc_topic_distr : None or array, shape=(n_samples, n_topics)
            Document topic distribution.
            If it is None, it will be generated by applying transform on X.

Member

You can actually use a .. deprecated:: note here.

@jnothman
Member

@amueller, we're not going to get someone more familiar with this code to comment, are we?

@amueller
Member

amueller commented Dec 16, 2016

We could try to summon @ogrisel or @larsmans, but I'm not even sure how familiar they are with the code (and I don't know if I'm allowed to light incense in my office). @chyikwei is the original author and might have some input.

@chyikwei
Contributor

The fix looks good to me.

We need to pass an unnormalized doc_topic_distr to estimate the lower bound for the perplexity, which is why the transform method returned unnormalized values before (Eq. 16 in the reference paper).

For the perplexity function, it makes sense to deprecate doc_topic_distr since transform doesn't return it anymore. Maybe we should also deprecate sub_sampling (it is only used in the online method).
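As a reminder of the quantity involved (my paraphrase of the computation visible in the diff above; the exact bound is Eq. 16 of the referenced paper):

\[
\mathrm{perplexity}(X) \approx
  \exp\!\left( -\, \frac{\mathcal{L}(X)}{\sum_{d,w} n_{dw}} \right)
\]

where \(\mathcal{L}(X)\) is the variational lower bound on the log likelihood, evaluated with the unnormalized document-topic values, and \(n_{dw}\) is the count of word \(w\) in document \(d\).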

@jnothman
Member

Ideally, we'd also check that the warning is raised when doc_topic_distr is passed (None or otherwise). Otherwise, we should merge this ASAP.

@garyForeman
Contributor Author

garyForeman commented Dec 19, 2016

I'll go ahead and write a regression test for the deprecated doc_topic_distr warning message.

IMHO, deprecating the sub_sampling argument sounds like an issue for a separate pull request. After a quick read through the code, it looked as though sub_sampling is set to False everywhere, but I wouldn't say I have a great understanding of the implications behind deprecating this argument.

@chyikwei
Contributor

Yeah, deprecating sub_sampling should be a separate issue.

@jnothman jnothman merged commit 6a01e89 into scikit-learn:master Dec 20, 2016
@jnothman
Member

Thanks a lot @garyForeman, and @chyikwei, your continued advice is much appreciated.

@naoyak
Contributor

naoyak commented Jan 2, 2017

@amueller @garyForeman

I was discussing a similar PR (#8137) with @jnothman, and it came up that this change doesn't really follow the standard deprecation procedure.

I think I understand the justification that the parameter is pretty much useless if the user isn't able to obtain and supply a valid unnormalized distribution, but maybe the functionality should be restored, or at least a loud and clear notice added to the docs and the DeprecationWarning that passing doc_topic_distr does nothing?

Sorry to butt in on a topic I know little about; I just wanted to relay that input.

@garyForeman
Contributor Author

@naoyak @jnothman

I'm not sure I would say that passing doc_topic_distr does nothing. The code executes as if the argument had not been passed, computing the properly unnormalized topic distribution that is needed to produce correct results.

#8137 appears to be focused on tidying up the API, whereas this is a bug fix. It makes sense to maintain the ridge_alpha parameter for the near future; otherwise, you will break users' code. This change does not break users' code, but will admittedly increase run time.

As @amueller and I discussed above, the ideas we had for maintaining working functionality of the doc_topic_distr argument were to either change the API or store normalization constants in a private attribute. Neither idea was palatable.

@jnothman
Member

jnothman commented Jan 2, 2017

Sorry for not reviewing carefully. If doc_topic_distr is not being passed on to _perplexity_precomp_distr, the warning should say that it is already being ignored.

@garyForeman
Contributor Author

garyForeman commented Jan 2, 2017

Ok, I can add that in. I'll open a new pull request shortly.

"""
if doc_topic_distr != 'deprecated':
warnings.warn("Argument 'doc_topic_distr' is deprecated and will "
"be ignored as of 0.19. Support for this argument "
Contributor Author

So I'll change this from future to present tense, i.e. "...is deprecated and is being ignored as of 0.19." Does that address the issue?

@jnothman
Member

jnothman commented Jan 2, 2017 via email

        X : array-like or sparse matrix, [n_samples, n_features]
            Document word matrix.

        doc_topic_distr : None or array, shape=(n_samples, n_topics)
Contributor

The docstring should probably note clearly that doc_topic_distr is currently discarded as well, since that's where users look first.

@jnothman
Member

jnothman commented Jan 2, 2017 via email

Successfully merging this pull request may close these issues.

LatentDirichletAllocation perplexity method broken in version 0.18.1