[MRG+2] #7908 Addressed issue in first iteration of RANSAC regression #7914

Merged
merged 16 commits into scikit-learn:master on Jan 5, 2017

Conversation

@mthorrell
Contributor

mthorrell commented Nov 20, 2016

Reference Issue

Fixes #7908

What does this implement/fix? Explain your changes.

On the first iteration of RANSAC regression, if no inliers are found, an error is raised and the procedure stops. Ideally, RANSAC would simply skip that iteration and continue to the next one, where a different random sample could produce valid inliers.

Generally this error occurs when n_inliers_subset and n_inliers_best are both zero. My fix sets the initial value of n_inliers_best to 1. If n_inliers_subset >= 1, the code follows the normal path; if n_inliers_subset == 0, it moves on to the next iteration. This fixes the bug as described in the issue.

However, setting n_inliers_best = 1 reintroduces a problem on the first iteration when n_inliers_subset == 1. In that case the comparison score_subset < score_best is made, and since score_best is initialized to np.inf, the code incorrectly skips to the next iteration even though valid inliers were found. The fix is to also initialize score_best to -np.inf, which I think is better practice for this kind of variable anyway.
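To make the fix concrete, here is a minimal, self-contained sketch of the selection loop after both changes. This is not the verbatim scikit-learn source; run_trial is a hypothetical stand-in for fitting the base estimator on a random sub-sample.

    import numpy as np

    def run_trial(rng):
        # Hypothetical stand-in: a real trial fits the base estimator on a
        # random sub-sample and returns its inlier count and score.
        return int(rng.integers(0, 3)), float(rng.random())

    rng = np.random.default_rng(0)
    n_inliers_best = 1     # was 0; a 0-inlier trial now fails the first check and is skipped
    score_best = -np.inf   # was np.inf; a valid first trial is no longer discarded

    for _ in range(100):   # max_trials
        n_inliers_subset, score_subset = run_trial(rng)
        if n_inliers_subset < n_inliers_best:
            continue       # zero-inlier trials land here and the loop moves on
        if n_inliers_subset == n_inliers_best and score_subset < score_best:
            continue
        n_inliers_best = n_inliers_subset
        score_best = score_subset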

Any other comments?

@mthorrell

Contributor

mthorrell commented Nov 27, 2016

In this small update, I adjusted the offending test so that it does not require a specific output; however, an error must still be raised in the residual_threshold = 0 case.

@amueller

Member

amueller commented Nov 28, 2016

Thanks for the PR. I think changing the tests was the way to go! (I don't have time for a review right now, unfortunately.)

amueller added this to the 0.19 milestone Nov 28, 2016

amueller added the Bug label Nov 28, 2016

@mthorrell

Contributor

mthorrell commented Nov 30, 2016

Thanks.

I made a couple of minor changes:

  • The test now checks the error message more specifically, since I think this is better practice (see the sketch below).
  • I also cleaned up the formatting of that error message.
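A sketch in the spirit of the message-checking test described above (the actual test lives in sklearn/linear_model/tests/test_ransac.py). It uses pytest rather than the test helpers scikit-learn used at the time, and it assumes a post-fix scikit-learn where the "consensus set" error message quoted below is raised:

    import numpy as np
    import pytest
    from sklearn.linear_model import RANSACRegressor

    def test_ransac_no_valid_consensus_message():
        # With residual_threshold=0 no point can qualify as an inlier,
        # so fit() must raise with an informative message.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 1))
        y = rng.normal(size=20)
        ransac = RANSACRegressor(residual_threshold=0.0, random_state=0)
        with pytest.raises(ValueError, match="consensus set"):
            ransac.fit(X, y)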
sklearn/linear_model/tests/test_ransac.py (outdated diff)
msg = (
"RANSAC could not find valid consensus set, because"
" either the `residual_threshold` rejected all the samples or"
" `is_data_valid` and `is_model_valid` returned False for all"

@amueller

amueller Nov 30, 2016

Member

Referring to internal functions is not very helpful for the users, IMHO.

@mthorrell

mthorrell Dec 4, 2016

Contributor

I believe you are referring to is_data_valid and is_model_valid. These are user-defined functions, much as residual_threshold is a user-defined parameter. When a user does not define these functions, RANSAC does not call them: they default to None and evaluation is skipped. Unless you disagree, I think it is still valid to refer to them.

Perhaps a better message might read:

RANSAC could not find a valid consensus set. All max_trials iterations were skipped because each randomly chosen sub-sample either didn't produce a valid set of inliers due to a small residual_threshold, or was invalidated by is_data_valid or is_model_valid if either of these functions has been defined by the user.
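For readers unfamiliar with these hooks, here is a minimal sketch of supplying is_data_valid to RANSACRegressor. The callback signature matches the scikit-learn API; the rejection rule itself is just an illustration:

    import numpy as np
    from sklearn.linear_model import RANSACRegressor

    def is_data_valid(X_subset, y_subset):
        # Illustrative rule: reject degenerate sub-samples with duplicated x values.
        return len(np.unique(X_subset)) == len(X_subset)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1))
    y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=50)

    # When is_data_valid / is_model_valid are left at their default of None,
    # RANSAC never calls them.
    ransac = RANSACRegressor(is_data_valid=is_data_valid, random_state=0)
    ransac.fit(X, y)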

@amueller

amueller Dec 6, 2016

Member

Oh, sorry, my fault. The current message is ok, though your suggestion is even better!

@amueller

Member

amueller commented Nov 30, 2016

maybe @ahojnnes wants to review?

@jnothman

I am not certain this is the right fix. This still fails if a 0-inlier sample is drawn at any point after the first >0-inlier sample is drawn.

We could introduce a parameter to control how many 0-inlier samples are allowed through, or just get rid of this control (the main risk being lots of time wasted).

A couple more cosmetic things we can do to improve this behaviour:

  • If we continue to raise an error when no inliers are found, we should report the minimum residual found (and perhaps other summary statistics of the residual distribution) as well as the current threshold, so that the user has a means to tune the parameter.
  • Introduce a verbose option that reports things like the median residual, number of inliers, etc., at each iteration.

Ping @CSchoel for your opinion.

sklearn/linear_model/tests/test_ransac.py (outdated diff)
" iterations were skipped because each randomly chosen"
" sub-sample either: didn't produce a valid set of inliers"
" due to a small `residual_threshold` or was invalidated by"
" `is_data_valid` or `is_model_valid` if either of these"

@jnothman

jnothman Dec 7, 2016

Member

I think the "if either ..." can be removed

@jnothman

Member

jnothman commented Dec 7, 2016

Sorry, silly me. This patch does indeed fix the issue, because the check for n_inliers_subset == 0 only applies when it is >= the best subset (defaulting to 1 in this PR). But that seems to mean that the "No inliers found" error is unreachable; and as a result we may draw 0-inlier samples for all trials, which is expensive.

If this is the right patch, you need to write a test. But we may want a more configurable parameter to control the number of 0-inlier samples. Certainly, we should be giving the user more information on how to set the threshold appropriately.

@mthorrell

Contributor

mthorrell commented Dec 8, 2016

The if statement containing n_inliers_subset == 0 was unreachable and hence was removed, so there shouldn't be any unreachable code generated by this PR.

If this is the fix, a test already exists that generates an appropriate error. Unless I'm missing something, there would be nothing else to do here except potentially edit the error message, again assuming we go with this fix.

However, I do think it makes sense to put a limit on the number of iterations that end in skips. Also, is_data_valid and is_model_valid could themselves be computationally expensive, so I would count skips due to data/model failure the same as skips due to a strict residual_threshold. One could explicitly track the causes of the different skips separately (is_data_valid vs. is_model_valid vs. residual_threshold), but that seems excessive to me.

I'll submit a commit that allows a user to set a max_skips parameter. It will default to max_trials.
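For reference, a minimal usage sketch of the parameter as it eventually shipped (max_skips was added to RANSACRegressor in 0.19; note that in released scikit-learn it defaults to np.inf rather than max_trials):

    import numpy as np
    from sklearn.linear_model import RANSACRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=100)
    y[::10] += 50.0  # inject some outliers

    # Give up after 20 skipped trials, even if max_trials has not been reached.
    ransac = RANSACRegressor(max_trials=100, max_skips=20, random_state=0)
    ransac.fit(X, y)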

@jnothman

Member

jnothman commented Dec 8, 2016

(Sorry for my confusion. I was trying to grok the overall algorithm.)

@jnothman

Member

jnothman commented Dec 8, 2016

> If this is the fix, a test already exists that generates an appropriate error.

A non-regression test, which allows 0 inliers to be drawn in the first iteration, would be ideal.

@jnothman

Member

jnothman commented Dec 8, 2016

Yes, I don't mind max_skips. I also don't think there's any harm in reporting diagnostics of various kinds, summarising residuals or reasons for skipping / not updating.

Michael Horrell added some commits Dec 17, 2016

@mthorrell

Contributor

mthorrell commented Dec 17, 2016

err... need to finish the tests, so this is still WIP

Michael Horrell added some commits Dec 22, 2016

@jnothman

Member

jnothman commented Dec 27, 2016

Still still WIP?

@jnothman

It would be useful to set these diagnostics as attributes of the model, i.e. n_skips_no_inliers_, etc.

We may also then choose to remove them from the error message (and certainly from the warning, to help the warnings module ignore duplicates), pointing the user to these attributes instead.
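Those attributes were indeed added in this PR; a minimal sketch of inspecting them on a fitted estimator (toy data for illustration; the two is_*_valid counters named here are assumptions beyond the n_skips_no_inliers_ example given above):

    import numpy as np
    from sklearn.linear_model import RANSACRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1))
    y = 3.0 * X.ravel() + rng.normal(scale=0.1, size=50)

    ransac = RANSACRegressor(random_state=0).fit(X, y)
    print(ransac.n_skips_no_inliers_,     # trials skipped for finding zero inliers
          ransac.n_skips_invalid_data_,   # trials rejected by is_data_valid
          ransac.n_skips_invalid_model_)  # trials rejected by is_model_valid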

sklearn/linear_model/ransac.py (outdated diff)
@@ -111,6 +111,11 @@ class RANSACRegressor(BaseEstimator, MetaEstimatorMixin, RegressorMixin):
max_trials : int, optional
Maximum number of iterations for random sample selection.
max_skips : int, optional
Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by `is_data_valid` or invalid models

@jnothman

jnothman Dec 27, 2016

Member

need double-backticks

@jnothman

I'd still consider including a diagnostic about the stringency of the threshold, either reporting some statistic of the residuals (e.g. average min) or some statistic of the inlier subset size.

@mthorrell

Contributor

mthorrell commented Dec 28, 2016

I like the attributes idea. The error and warning messages seemed a bit clunky. Let me see if I can do something about the stringency of the threshold as well.

Michael Horrell added some commits Dec 28, 2016

@mthorrell

Contributor

mthorrell commented Dec 28, 2016

I added the n_skips_no_inliers_ etc. attributes and adjusted the error messages and tests.

I did not add anything indicating the stringency of the threshold. Should that be a different PR? This one is getting rather large, IMO.

@jnothman

Member

jnothman commented Dec 29, 2016

Otherwise LGTM!

sklearn/linear_model/ransac.py (outdated diff)
max_skips : int, optional
Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by ``is_data_valid`` or invalid models
defined by ``is_data_valid``.

@jnothman

jnothman Dec 29, 2016

Member

should be is_model_valid

sklearn/linear_model/ransac.py (outdated diff)
"RANSAC skipped more iterations than `max_skips` without"
" finding a valid consensus set. Iterations were skipped"
" because each randomly chosen sub-sample failed the"
" passing criteria. The object attributes"

@jnothman

jnothman Dec 29, 2016

Member

object -> estimator. But perhaps "See estimator attributes for diagnostics (n_skips*)." is sufficient

jnothman changed the title from [MRG] #7908 Addressed issue in first iteration of RANSAC regression to [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression Dec 29, 2016

@agramfort

Member

agramfort commented Dec 31, 2016

+1 for merge

@jnothman merge if you're happy

agramfort changed the title from [MRG+1] #7908 Addressed issue in first iteration of RANSAC regression to [MRG+2] #7908 Addressed issue in first iteration of RANSAC regression Dec 31, 2016

Maximum number of iterations that can be skipped due to finding zero
inliers or invalid data defined by ``is_data_valid`` or invalid models
defined by ``is_model_valid``.

@jnothman

jnothman Dec 31, 2016

Member

I suppose nowadays we're meant to have .. versionadded annotations in docstrings.

@mthorrell

mthorrell Jan 2, 2017

Contributor

Is it accurate to say .. versionadded:: 0.19, or should it still be 0.18?
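(The PR was milestoned for 0.19, per the label above, so 0.19 is the right tag.) Assembled from the docstring text quoted earlier, the annotated parameter entry would look like this in numpydoc style:

    max_skips : int, optional
        Maximum number of iterations that can be skipped due to finding zero
        inliers or invalid data defined by ``is_data_valid`` or invalid models
        defined by ``is_model_valid``.

        .. versionadded:: 0.19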

@jnothman

Member

jnothman commented Jan 2, 2017
@mthorrell

Contributor

mthorrell commented Jan 4, 2017

Made the versionadded change. I believe this is finished, assuming the tests pass.

@jnothman

Sorry, I'd forgotten to ask you to add an entry to whats_new.rst (under Enhancements, I think? Or Bug fixes, if you'd rather?).

@mthorrell

Contributor

mthorrell commented Jan 5, 2017

I made the entry in whats_new.rst. Is there anything else needed?

jnothman merged commit d0ce0d9 into scikit-learn:master Jan 5, 2017

@jnothman

Member

jnothman commented Jan 5, 2017

All good, thanks!

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics

Przemo10 referenced this pull request Mar 17, 2017

Closed: update fork (#1) #8606

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+2] Avoid failure in first iteration of RANSAC regression (#7914)
Fixes #7908 

Adds RANSACRegressor attributes n_skips_* for diagnostics