
[MRG+1] Fix lower_bound_ not equal to max lower bound in mixture models when n_init > 1 #10870

Merged

Conversation


@ageron ageron commented Mar 25, 2018

Reference Issues/PRs

Fixes #10869

What does this implement/fix? Explain your changes.

Just set the lower_bound_ to be equal to the max_lower_bound at the end of the loop over the initializations (at the end of BaseMixture.fit()).
Also fix the test_init() function that was supposed to catch this bug. I do this by looping over multiple random states rather than just trying one (which had a 50% chance of wrongly succeeding).
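
For reference, a minimal sketch of the intended behavior (hypothetical helper names, not the actual BaseMixture code): the best lower bound is tracked across all initializations and reported at the end of fit().

import numpy as np

def fit_over_inits_sketch(run_one_init, n_init):
    # run_one_init() is a hypothetical helper: it runs EM from a fresh
    # initialization and returns (final_lower_bound, fitted_params).
    max_lower_bound = -np.inf
    best_params = None
    for _ in range(n_init):
        lower_bound, params = run_one_init()  # local, per-initialization value
        if lower_bound > max_lower_bound:
            max_lower_bound, best_params = lower_bound, params
    # The bug: the attribute previously kept the last initialization's lower
    # bound; the fix reports the best one found across all initializations.
    return max_lower_bound, best_params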

Any other comments?

Thanks for making such a great library! :)


@jnothman jnothman left a comment


I think this is right, but it makes the repeated setting of lower_bound_ look inexplicable (unless it were intended for debugging a crash!). Can we use a local lower_bound instead?

@ageron ageron force-pushed the gaussian_mixture_lower_bound_fix branch from a0bffd9 to 1b9f54e on March 26, 2018 12:34

ageron commented Mar 26, 2018

Hi @jnothman ,
Good point. I just made lower_bound a local variable instead.

@ageron ageron force-pushed the gaussian_mixture_lower_bound_fix branch from 5957ea2 to 1b9f54e on March 26, 2018 14:22

ageron commented Mar 26, 2018

There were a couple of traps, but I finally got this working. Here's the new behavior:

  • When warm_start is False or it is the first time we fit the model, then:
    • self.lower_bound_ is the max lower bound found across all n_init initializations.
  • When warm_start is True and it is not the first call to the fit() method, then:
    • self.lower_bound_ is the max lower bound found during this call to the fit() method (even if it is lower than the max lower bound found earlier).
      • This is useful if you train a model on a first dataset, then you train it on another dataset: the lower bound may be lower after the second call to fit(), and that's expected.
    • Moreover, self.converged_ is True if and only if the lower bound at some iteration is within tol of the previous lower bound. At the first iteration, prev_lower_bound is initialized to self.lower_bound_ (the max lower bound from the previous run); see the sketch below.
      • I had to implement this logic so that test_monotonic_likelihood() still passes: it uses warm_start=True and max_iter=1, so the only way it can detect convergence is by looking at self.lower_bound_ from the previous iteration. There's a tiny risk that the algorithm has in fact not converged, but just happened to stumble upon a very close lower bound estimate. Maybe prev_lower_bound should be set to self.lower_bound_ only when max_iter=1, and to -np.infty otherwise?
        Wdyt?
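
A rough sketch of the convergence bookkeeping described above, using hypothetical helper names (this is not the actual base.py code):

import numpy as np

def em_run_sketch(one_em_iteration, previous_fit_lower_bound,
                  warm_start, max_iter, tol):
    # one_em_iteration() is a hypothetical callback returning the lower bound
    # after one E-step/M-step pair; previous_fit_lower_bound stands for
    # self.lower_bound_ from the previous call to fit() (-inf on the first).
    converged = False
    # A warm-started run compares its first iteration against the lower bound
    # reached by the previous fit(), so that a user-driven loop with
    # max_iter=1 can still detect convergence.
    lower_bound = previous_fit_lower_bound if warm_start else -np.inf
    for n_iter in range(1, max_iter + 1):
        prev_lower_bound = lower_bound
        lower_bound = one_em_iteration()
        if abs(lower_bound - prev_lower_bound) < tol:
            converged = True
            break
    return lower_bound, converged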


@jnothman jnothman left a comment


I think this logic looks reasonable. Do attribute docstrings need improvement?

rand_data = RandomData(np.random.RandomState(random_state), scale=1)
n_components = rand_data.n_components
X = rand_data.X['full']
for random_state in range(100):
Member


Could you double check that this runs quite quickly?

Contributor Author


It runs in 4 seconds on my laptop. Is that sufficiently quick? If not, we could reduce this to less than a second by iterating just 25 times. The probability for the unfixed code to pass this test would be 1/2^25, which is roughly 3e-8. Tell me what you prefer.

Member


Yes, 4s for this kind of test is excessively long, IMO. And if the unfixed code only has a chance of 3e-8 of passing (did you get that the right way around?), then we can run it for fewer iterations than that. Alternatively, we can reduce the size of X if that's not actually related to the property being tested.

Contributor Author


Okay, reducing to 25 iterations. I think I got the calculation right: the unfixed code sets lower_bound_ to the lower bound of the last initialization, which has a 50% chance of being higher than the lower bound of the first initialization. So in order to pass the updated test, the unfixed code would need to win that 50% chance 25 times in a row, and 0.5^25 ≈ 3e-8. In other words, the new test would catch the bug in the unfixed code.
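
For illustration, a hedged sketch of this kind of looping test (not the PR's exact test_init code; the test suite's RandomData helper is replaced here by a plain random array):

import numpy as np
from sklearn.mixture import GaussianMixture

def test_init_sketch():
    # With the fix, more initializations can only match or improve the
    # reported lower bound. Repeating over 25 seeds means the unfixed code
    # (which reports the last init's bound) passes with probability ~0.5^25.
    for random_state in range(25):
        rng = np.random.RandomState(random_state)
        X = rng.rand(50, 5)
        # max_iter=1 keeps the test fast; ConvergenceWarning is expected
        # and harmless here.
        gm1 = GaussianMixture(n_components=2, n_init=1, max_iter=1,
                              random_state=random_state).fit(X)
        gm2 = GaussianMixture(n_components=2, n_init=10, max_iter=1,
                              random_state=random_state).fit(X)
        assert gm2.lower_bound_ >= gm1.lower_bound_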

Member


Sure, thanks. A smaller dataset would also be fine.

@@ -191,6 +191,7 @@ def fit(self, X, y=None):
X = _check_X(X, self.n_components, ensure_min_samples=2)
self._check_initial_parameters(X)


Member


Rm blank line

Contributor Author


Oops, indeed. Fixing this now.


ageron commented Mar 27, 2018

I just updated the documentation; I hope it's clear.

I also changed the logic slightly: the previous lower_bound_ is now only used to check for convergence when warm_start is True and max_iter is 1. Indeed, if max_iter > 1, it's safer not to compare against the previous fit's lower bound at the first iteration, in case the dataset has changed.


ageron commented Apr 9, 2018

Hi @jnothman, is there anything else you need from me to fix this bug, or are we okay to merge?


@jnothman jnothman left a comment


Can that new logic be tested?

This requires a second review before merge

@jnothman
Member

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:


ageron commented Apr 11, 2018

Thanks for your feedback, @jnothman. I just updated doc/whats_new/v0.20.rst, as requested.

@jnothman
Member

I also asked: Can that new logic be tested?


ageron commented Apr 11, 2018

Hi @jnothman, I just added the tests for the new logic. In short:

  1. I added a test that convergence is properly detected when warm_start=True, with different values of max_iter (1, 2, and 50); a sketch of this kind of test is shown after this list.
  2. I added a test that convergence is never detected at the first iteration when warm_start=True and max_iter > 1. This ensures that no convergence will be detected by mistake at the first iteration if the dataset is changed.
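
For illustration, a sketch of the first kind of test (this mirrors the idea; the dataset and exact parameters are made up and need not match the PR's test):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
for max_iter in (1, 2, 50):
    gmm = GaussianMixture(n_components=2, warm_start=True,
                          max_iter=max_iter, random_state=0)
    # Keep refitting on the same data until convergence is flagged;
    # ConvergenceWarning may be emitted on the early, non-converged fits.
    for _ in range(200):
        gmm.fit(X)
        if gmm.converged_:
            break
    assert gmm.converged_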


ageron commented May 4, 2018

Hi @jnothman ,
I've updated this PR because there were some conflicting changes in sklearn/mixture/base.py regarding n_iter_.
Could you please confirm that everything looks good to you and merge this PR if it does?
Thanks!


@jnothman jnothman left a comment


This needs a review by someone else still.

@@ -26,6 +26,8 @@ random sampling procedures.
- :class:`linear_model.OrthogonalMatchingPursuit` (bug fix)
- :class:`metrics.roc_auc_score` (bug fix)
- :class:`metrics.roc_curve` (bug fix)
- :class:`mixture.BayesianGaussianMixture` (bug fix)
Member


Remind me: does this PR actually change the prediction? If it's only affecting the attribute value, I think we should leave it out of here.

Contributor Author


Yes, it may change the prediction in some very rare cases. Consider:

gm = GaussianMixture(warm_start=True, max_iter=10)
gm.fit(X1)
gm.fit(X2)

If X2 is different from X1, it does not make sense to use the previous value of lower_bound_ as a starting point. The previous implementation would do that, and therefore it might wrongly detect convergence after the first iteration (this is unlikely, but possible, in particular if X2 is very similar to X1, or if tol is large). This PR fixes this. Thus, the fixed algorithm might converge to a better solution and produce different (better) predictions.

You could consider these as separate issues: (1) wrong lower_bound_ value, and (2) wrong convergence detection logic. However, they're both caused by the same few lines of code, so I fixed them both in this one PR.

Member


Thanks for the explanation

@@ -212,8 +214,8 @@ Model evaluation and meta-estimators
:issue:`9304` by :user:`Breno Freitas <brenolf>`.

- Add `return_estimator` parameter in :func:`model_selection.cross_validate` to
return estimators fitted on each split. :issue:`9686` by :user:`Aurélien Bellet
<bellet>`.
return estimators fitted on each split. :issue:`9686` by
Member


Argh! Please revert all changes unrelated to the present fix!

Please do not change unrelated things. It makes your contribution harder to review and may introduce merge conflicts to other pull requests.

Contributor Author


Okay, sorry about that; makes sense (FYI, I was trying to make the file fit within 80 characters per line).

    gmm.fit(X)
    if gmm.converged_:
        break
assert_true(gmm.converged_)
Member


Minor: since moving to pytest, we're trying to avoid such assert_* functions. Use a bare assert instead.

Contributor Author


Okay, fixing this now.

@jnothman jnothman changed the title Fix lower_bound_ not equal to max lower bound in mixture models when n_init > 1 [MRG+1] Fix lower_bound_ not equal to max lower bound in mixture models when n_init > 1 May 6, 2018
which the model has the largest likelihood or lower bound. Within each
trial, the method iterates between E-step and M-step for `max_iter`
times until the change of likelihood or lower bound is less than
`tol`, otherwise, a `ConvergenceWarning` is raised.
If `warm_start` is `True`, then `n_init` is ignored and a single
Member


should be double backticks everywhere....

Contributor Author


Oh right, just fixed this, thanks.

# lower_bound_ is very close to the lower_bound_ after the previous call
# to the fit method.
# Unlikely, but possible and problematic, so we might as well avoid it.
rng = np.random.RandomState(0)
Member


Sorry I'm being slow. I don't see where this case is handled and I don't understand the test. Why do we always reset with max_iter > 1 now?

Contributor Author


No worries, let me explain. The scenario I'm trying to avoid, which could happen today (without this PR), is: you use warm_start=True and run gm.fit(X1) followed by gm.fit(X2), where X1 and X2 are different datasets, and if you are unlucky the first iteration of gm.fit(X2) happens to compute a lower bound very close to the final lower bound of gm.fit(X1), so the algorithm thinks it has converged when in fact it should have continued to iterate. Sure, this is quite unlikely with the default tol and if X1 and X2 are very different, but it might be dangerously likely if X1 and X2 are very similar but not identical (or if tol is high).
Since it is hard to find two datasets X1 and X2 where this scenario occurs, I test this scenario by setting tol to infinity.
This case is handled on line 217 of base.py: we start the iterations with lower_bound = (-np.infty if do_init or self.max_iter > 1 else self.lower_bound_). So we only continue from the final lower bound of the last call to fit() if warm_start is True and max_iter == 1. The assumption is that people who use warm_start=True and max_iter = 1 are certainly doing this to manually run the training loop themselves on a single dataset, but if they are using max_iter > 1, it is unclear whether they are running consecutive calls to fit() on the same dataset or not, so we should err on the safe side.
So in short there are two issues in the current implementation: (1) if n_init > 1, the lower bound is the one from the last initialization, not from the best initialization, and (2) there is a risk of false convergence if warm_start is True. Since they are both due to the same few lines of code, I fixed them both in this one PR.
Hope this helps.
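
To make the scenario concrete, a hedged sketch (illustrative only; X1, X2 and the parameter values are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X1 = rng.rand(100, 2)
X2 = rng.rand(100, 2) + 10.0  # a clearly different dataset

gm = GaussianMixture(n_components=2, warm_start=True,
                     max_iter=5, tol=np.inf, random_state=0)
gm.fit(X1)
gm.fit(X2)
# With the safeguard described above (compare against the previous fit's
# lower bound only when max_iter == 1), the second fit does not stop at its
# first iteration just because tol is huge; without it, it may report
# convergence immediately even though X2 differs from X1.
print(gm.converged_, gm.n_iter_)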


jnothman commented May 29, 2018 via email


ageron commented May 30, 2018

@jnothman, thanks for your feedback. I did not know that switching datasets between warm_start fits was an abnormal use; in fact, I have often done this after updating a dataset with new data, to avoid starting training from scratch. How else can this be done? In any case, I feel that this will require more discussion, so if you like I can split this PR in two: one part will just focus on fixing the original bug (incorrect lower_bound_ when n_init > 1) and the other on the risk when switching datasets. Sounds good?


jnothman commented May 31, 2018 via email


@jnothman jnothman left a comment


I think the only concerning part here is that max_iter controls multiple things now. If it's not hard to split that from this PR, it would help us ensure that at least the bug is fixed for release.


ageron commented May 31, 2018

Sure @jnothman, it shouldn't be too hard; I'll split the PR (probably this weekend).


ageron commented Jun 5, 2018

Hi @jnothman, I updated this PR to keep only the fix for the original bug, i.e., lower_bound_ was set to the lower bound of the last initialization (when n_init > 1) rather than to the max lower bound across all initializations.
I'll file a separate issue and a separate PR for the wrong convergence detection when consecutive fits with warm_start=True use different datasets.


@jnothman jnothman left a comment


Thanks. LGTM.


ageron commented Jun 19, 2018

Hi there @amueller, this PR needs a second review, whenever you have the chance.

@jnothman jnothman added this to the 0.20 milestone Jun 20, 2018
@GaelVaroquaux
Member

I resolved the conflicts. I will merge once the tests pass.

@GaelVaroquaux
Member

All tests pass aside from AppVeyor, which is lagging behind as an effect of the sprint.

Merging!

@GaelVaroquaux GaelVaroquaux merged commit beb2aa0 into scikit-learn:master Jul 16, 2018
@GaelVaroquaux
Member

Thank you!!!


ageron commented Jul 19, 2018

Thanks to all the reviewers! 👍


Successfully merging this pull request may close these issues.

In Gaussian mixtures, when n_init > 1, the lower_bound_ is not always the max