
[MRG+2] #7908 Addressed issue in first iteration of RANSAC regression #7914

Merged
merged 16 commits into scikit-learn:master on Jan 5, 2017

Conversation

@mthorrell (Contributor) commented on Nov 20, 2016

Reference Issue

Fixes #7908

What does this implement/fix? Explain your changes.

On the first iteration of RANSAC regression, if no inliers are found, an error is raised and execution stops. Ideally the procedure would simply skip that iteration and continue to the next one, where a different random sample could produce valid inliers.

This error is generally produced when `n_inliers_subset` and `n_inliers_best` are both zero. My fix sets the initial value of `n_inliers_best` to 1: if `n_inliers_subset >= 1`, the code follows the normal path, and if `n_inliers_subset == 0`, the code moves on to the next iteration. This fixes the bug as described in the issue.

However, setting `n_inliers_best = 1` reintroduces an issue in the first iteration of the loop when `n_inliers_subset == 1`. The comparison `score_subset < score_best` is then made, and since `score_best` is initialized to `np.inf`, the code incorrectly skips to the next iteration, ignoring the fact that valid inliers were found in the first iteration. The fix is to initialize `score_best` to `-np.inf`. In general, I think this is better practice when initializing these kinds of variables anyway.
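To make the two initializations concrete, here is a minimal sketch of the selection loop as described above. Variable names follow the discussion, but `fit_random_subset` is a hypothetical stand-in, and this is not scikit-learn's actual implementation:

```python
import numpy as np

rng = np.random.RandomState(0)
max_trials = 100

def fit_random_subset():
    # Hypothetical stand-in for fitting the base estimator on a random
    # sub-sample and scoring it; returns (n_inliers, score).
    return rng.randint(0, 10), rng.rand()

n_inliers_best = 1     # was 0: a 0-inlier trial is now simply skipped
score_best = -np.inf   # was np.inf: the first valid sample is never rejected

for trial in range(max_trials):
    n_inliers_subset, score_subset = fit_random_subset()
    # Skip this trial if it is no better than the best consensus set so far.
    if (n_inliers_subset < n_inliers_best or
            (n_inliers_subset == n_inliers_best and score_subset < score_best)):
        continue
    n_inliers_best = n_inliers_subset
    score_best = score_subset
```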

Any other comments?

@mthorrell (Contributor, Author) commented on Nov 27, 2016

In this small update, I adjusted the offending test so that it no longer requires a specific output; it still requires that an error be raised in the `residual_threshold = 0` case.
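For illustration, a sketch of the kind of test being described, assuming the raised error is a `ValueError` (this is not the PR's exact test code): with `residual_threshold=0.0`, no trial can find inliers, so fitting should raise.

```python
import numpy as np
from numpy.testing import assert_raises
from sklearn.linear_model import RANSACRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))
y = 3 * X.ravel() + rng.normal(size=100)

# With a zero residual threshold, every trial finds zero inliers,
# so fit() should fail rather than return a degenerate model.
ransac = RANSACRegressor(residual_threshold=0.0, random_state=0)
assert_raises(ValueError, ransac.fit, X, y)
```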

@amueller (Member) commented on Nov 28, 2016

Thanks for the PR. I think changing the tests was the way to go! (I don't have time for a review right now, unfortunately.)

@amueller amueller added this to the 0.19 milestone Nov 28, 2016
@amueller amueller added the Bug label Nov 28, 2016
@mthorrell (Contributor, Author) commented on Nov 30, 2016

Thanks.

I made a couple of minor changes:

  • The test now more specifically checks the error message since I think this is better practice.
  • I also cleaned up the formatting of that error message.
    msg = (
        "RANSAC could not find valid consensus set, because"
        " either the `residual_threshold` rejected all the samples or"
        " `is_data_valid` and `is_model_valid` returned False for all"


@amueller (Member) commented on Nov 30, 2016

Referring to internal functions is not very helpful for users, IMHO.


@mthorrell (Contributor, Author) commented on Dec 4, 2016
I believe you are referring to `is_data_valid` and `is_model_valid`. These are user-defined functions, much like `residual_threshold` is a user-defined parameter. When a user does not define these functions, RANSAC does not call them (they are initialized to `None` and evaluation is skipped). Unless you disagree, I think they are still valid to refer to.

Perhaps a better message might read:

> RANSAC could not find a valid consensus set. All `max_trials` iterations were skipped because each randomly chosen sub-sample either didn't produce a valid set of inliers due to a small `residual_threshold`, or was invalidated by `is_data_valid` or `is_model_valid` if either of these functions has been defined by the user.


@amueller (Member) commented on Dec 6, 2016

Oh, sorry, my fault. The current message is ok, though your suggestion is even better!

@amueller (Member) commented on Nov 30, 2016

maybe @ahojnnes wants to review?

@jnothman (Member) left a comment

I am not certain this is the right fix. This still fails if a 0-inlier sample is drawn at any point after the first >0-inlier sample is drawn.

We could introduce a parameter to control how many 0-inlier samples are allowed through, or just get rid of this control (the main risk being lots of time wasted).

A couple of more cosmetic things could improve this behaviour:

  • If we continue to raise an error when no inliers are found, we should report the minimum residual found (and perhaps other summary statistics of the residual distribution) as well as the current threshold, so that the user has a means to tune the parameter.
  • Introduce a verbose option that reports things like the median residual, number of inliers, etc., at each iteration.

Ping @CSchoel for your opinion.

" iterations were skipped because each randomly chosen"
" sub-sample either: didn't produce a valid set of inliers"
" due to a small `residual_threshold` or was invalidated by"
" `is_data_valid` or `is_model_valid` if either of these"


@jnothman (Member) commented on Dec 7, 2016

I think the "if either ..." can be removed

@jnothman (Member) commented on Dec 7, 2016

Sorry, silly me. This patch does indeed fix the issue, because the check for `n_inliers_subset == 0` only applies when it is >= the best subset (defaulting to 1 in this PR). But that seems to mean that the "No inliers found" error is unreachable; as a result, we may draw 0-inlier samples for all trials, and this is expensive.

If this is the right patch, you need to write a test. But we may want a more configurable parameter to control the number of 0-inlier samples. Certainly, we should give the user more information on how to set the threshold appropriately.

@mthorrell (Contributor, Author) commented on Dec 8, 2016

The `if` statement containing `n_inliers_subset == 0` was unreachable and hence was removed, so this PR shouldn't leave any unreachable code.

If this is the fix, a test already exists that generates an appropriate error. Unless I'm missing something, there would be nothing else to do here except potentially edit the error message (again assuming we go with this fix).

However, I do think it makes sense to put a limit on the number of iterations that end in skips. Also, `is_data_valid` or `is_model_valid` could themselves be computationally expensive, so I would count skips due to data/model failure the same as skips due to a strict `residual_threshold`. One could explicitly track the causes of different skips separately (`is_data_valid` vs. `is_model_valid` vs. `residual_threshold`), but this seems excessive to me.

I'll submit a commit that allows a user to set a `max_skips` parameter. It will default to `max_trials`.
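A minimal sketch of that `max_skips` idea (hedged pseudologic; `try_random_subsample` is a hypothetical helper, and the merged implementation may differ):

```python
import random

max_trials = 100
max_skips = max_trials   # proposed default: max_skips == max_trials

def try_random_subsample():
    # Hypothetical stand-in: returns (passed_validity_checks, n_inliers).
    return random.random() > 0.1, random.randint(0, 5)

n_skips = 0
for trial in range(max_trials):
    if n_skips > max_skips:
        break  # too many skipped iterations: stop and raise or warn
    valid, n_inliers_subset = try_random_subsample()
    if not valid or n_inliers_subset == 0:
        # Skips from is_data_valid/is_model_valid failures and from
        # 0-inlier draws are counted the same, as argued above.
        n_skips += 1
        continue
    # ... evaluate the candidate consensus set as usual ...
```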

@jnothman (Member) commented on Dec 8, 2016

(Sorry for my confusion; I was trying to grok the overall algorithm.)

@jnothman (Member) commented on Dec 8, 2016

> If this is the fix, a test already exists that generates an appropriate error.

A non-regression test, which allows 0 inliers to be drawn in the first iteration, would be ideal.

@jnothman (Member) commented on Dec 8, 2016

Yes, I don't mind `max_skips`. I also don't think there's any harm in reporting diagnostics of various kinds, summarising residuals or the reason for skipping / not updating.

Michael Horrell added 2 commits Dec 17, 2016
@mthorrell (Contributor, Author) commented on Dec 17, 2016

err... need to finish the tests, so this is still WIP

Michael Horrell added 2 commits Dec 22, 2016
@jnothman (Member) commented on Dec 27, 2016

Still still WIP?

@jnothman (Member) left a comment

It would be useful to set these diagnostics as attributes of the model, i.e. `n_skips_no_inliers_`, etc.

We may also then choose to remove them from the error message (and certainly from the warning, to help the warnings module ignore duplicates), pointing the user to these attributes instead.
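A hedged usage sketch of the attributes idea: `n_skips_no_inliers_` follows the review comment, while `n_skips_invalid_data_` and `n_skips_invalid_model_` are plausible companions inferred from the `max_skips` docstring below (scikit-learn >= 0.19 is assumed):

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = 2 * X.ravel() + rng.normal(size=200)

ransac = RANSACRegressor(random_state=0).fit(X, y)
print(ransac.n_skips_no_inliers_,     # trials that found zero inliers
      ransac.n_skips_invalid_data_,   # trials rejected by is_data_valid
      ransac.n_skips_invalid_model_)  # trials rejected by is_model_valid
```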

    @@ -111,6 +111,11 @@ class RANSACRegressor(BaseEstimator, MetaEstimatorMixin, RegressorMixin):
        max_trials : int, optional
            Maximum number of iterations for random sample selection.

        max_skips : int, optional
            Maximum number of iterations that can be skipped due to finding zero
            inliers or invalid data defined by `is_data_valid` or invalid models


@jnothman (Member) commented on Dec 27, 2016

need double-backticks

@jnothman (Member) left a comment

I'd still consider including a diagnostic about the stringency of the threshold, either reporting some statistic of the residuals (e.g. the average minimum) or some statistic of the inlier subset size.
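For illustration, a hedged sketch of such a stringency diagnostic (hypothetical names and synthetic residuals; not part of the PR):

```python
import numpy as np

# Synthetic stand-in for the absolute residuals of one trial's model.
rng = np.random.RandomState(0)
residuals = np.abs(rng.normal(scale=2.0, size=100))
residual_threshold = 0.5   # hypothetical user-chosen threshold

print("min residual: %.3f  median: %.3f  threshold: %.3f"
      % (residuals.min(), np.median(residuals), residual_threshold))
```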

@mthorrell (Contributor, Author) commented on Dec 28, 2016

I like the attributes idea. The error and warning messages seemed a bit clunky. Let me see if I can do something about the stringency of the threshold as well.

Michael Horrell added 2 commits Dec 28, 2016
@mthorrell (Contributor, Author) commented on Dec 28, 2016

I added the `n_skips_no_inliers_` etc. attributes and adjusted the error messages and tests.

I did not add anything indicating the stringency of the threshold. Should that be a different PR? This one is getting rather large, IMO.

@jnothman (Member) commented on Dec 29, 2016

@jnothman (Member) left a comment

Otherwise LGTM!

    max_skips : int, optional
        Maximum number of iterations that can be skipped due to finding zero
        inliers or invalid data defined by ``is_data_valid`` or invalid models
        defined by ``is_data_valid``.


@jnothman (Member) commented on Dec 29, 2016

should be is_model_valid

"RANSAC skipped more iterations than `max_skips` without"
" finding a valid consensus set. Iterations were skipped"
" because each randomly chosen sub-sample failed the"
" passing criteria. The object attributes"


@jnothman (Member) commented on Dec 29, 2016

object -> estimator. But perhaps "See estimator attributes for diagnostics (n_skips*)." is sufficient

@jnothman jnothman changed the title [MRG] #7908 Addressed issue in first iteration of RANSAC regression [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression Dec 29, 2016
@agramfort (Member) commented on Dec 31, 2016

+1 for merge

@jnothman merge if you're happy

@agramfort agramfort changed the title [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression [MRG+2] #7908 Addressed issue in first iteration of RANSAC regression Dec 31, 2016
        Maximum number of iterations that can be skipped due to finding zero
        inliers or invalid data defined by ``is_data_valid`` or invalid models
        defined by ``is_model_valid``.


@jnothman (Member) commented on Dec 31, 2016

I suppose nowadays we're meant to have ``.. versionadded::`` annotations in docstrings.
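For reference, a sketch of what such an annotation could look like in numpydoc style, assuming the 0.19 milestone set earlier (not the PR's exact docstring):

```python
class RANSACRegressor:
    """Sketch of the docstring annotation only, not the real class.

    Parameters
    ----------
    max_skips : int, optional
        Maximum number of iterations that can be skipped due to finding
        zero inliers or invalid data/models.

        .. versionadded:: 0.19
    """
```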


@mthorrell (Contributor, Author) commented on Jan 2, 2017

Is it accurate to say ``.. versionadded:: 0.19``, or should it still be 0.18?

@jnothman (Member) commented on Jan 2, 2017
@mthorrell (Contributor, Author) commented on Jan 4, 2017

Made the ``versionadded`` change. I believe this is finished, assuming the tests pass.

@jnothman (Member) left a comment

Sorry, I'd forgotten to ask you to add an entry to `whats_new.rst` (under Enhancements, I think? Bug fixes if you'd rather?).

@mthorrell (Contributor, Author) commented on Jan 5, 2017

I made the entry in whats_new.rst. Is there anything else needed?

@jnothman jnothman merged commit d0ce0d9 into scikit-learn:master Jan 5, 2017
0 of 3 checks passed at merge time: the CircleCI tests were still running, the AppVeyor build was waiting to complete, and the Travis CI build was in progress.
@jnothman (Member) commented on Jan 5, 2017

All good, thanks!

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017
sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017