[MRG+1] ransac is running too many iterations #8271
Conversation
@ahojnnes you seem to be the one who implemented RANSAC, do you have an opinion on this change? @aivision2020 a first step to convince me as someone who is not familiar at all with this code: run the examples using RANSAC and make sure that the output is similar to the one with master. You can find the latter on scikit-learn.org/dev/auto_examples.
yhat_ransac = clt_ransac.predict(X)
inlier_mask = clt_ransac.inlier_mask_
print('*************************')
print('actual trials:', clt_ransac.n_trials_)
use assert_equal(clt_ransac.n_trials_, _dynamic_max_trials(np.sum(inlier_mask), X.shape[0],
X.shape[1] + 1, clt_ransac.stop_probability))
It's a statistical algorithm, so it can't be guaranteed that n_trials_ won't be higher. I'll check <= 21; that gives us a 1e-9 probability of failure.
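For reference, the bound being tested follows the standard RANSAC trial-count formula. A minimal sketch of that rule (the function name and edge-case handling here are illustrative, not sklearn's exact `_dynamic_max_trials`):

```python
import numpy as np

def dynamic_max_trials(n_inliers, n_samples, min_samples, probability):
    # Standard RANSAC bound: smallest N such that N random subsets of size
    # min_samples contain at least one all-inlier subset with the requested
    # probability: N = log(1 - p) / log(1 - w**m), w = inlier ratio.
    inlier_ratio = n_inliers / float(n_samples)
    nom = 1 - probability                     # allowed failure probability
    denom = 1 - inlier_ratio ** min_samples   # P(a subset contains an outlier)
    if denom == 0:                            # every point is an inlier
        return 1
    return int(np.ceil(np.log(nom) / np.log(denom)))

# e.g. 80% inliers, subsets of size 2, 99% success probability
trials = dynamic_max_trials(80, 100, 2, 0.99)
```

With a higher success probability (such as the 1 - 1e-9 discussed below) the bound grows, which is why the verification threshold in the test is looser than the estimator's own stop_probability.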
I see you have a test for this (test_ransac_max_trials), but there are only 3 outliers and you run the test once. If you raise the outliers to 40 (~10%) and don't set random_state to zero, it will run 4 iterations roughly once every 100 runs. But that's OK.
Y[:80] += np.random.rand(80, 1)

for i in range(10):
    clt_ransac = linear_model.RANSACRegressor(linear_model.LinearRegression(), max_trials=100)
We'd like to use a fixed random state, but I realise this test is to ensure the property is invariant to randomness. So add random_state=i here. I don't know what clt is here; maybe just call it reg.
@@ -0,0 +1,20 @@
from sklearn import linear_model
The test should be in sklearn/linear_model/tests/test_ransac.py
and should be structured as a function with an assertion.
sklearn/linear_model/ransac.py
Outdated
@@ -415,13 +418,12 @@ def fit(self, X, y, sample_weight=None):
    X_inlier_best = X_inlier_subset
    y_inlier_best = y_inlier_subset

    self.max_trials = min(self.max_trials, _dynamic_max_trials(n_inliers_best, n_samples,
                          min_samples, self.stop_probability))
This should have aligned indentation as described by PEP8
some pep8 stuff in ransac.py
The CI errors seem genuine; please have a look at them.
some pep8 stuff in ransac.py fix data generation (some outliers were not)
I've done my best.
thank you. someone will take another look soon
The PEP8 failures: this is often a bit nasty, but you can change the continuation indentation, for instance:

if (...
        or score_best >= self.stop_score):
    # do stuff

Or the long call can be fixed as:

max_trials = _dynamic_max_trials(len(X) - len(outliers), X.shape[0],
                                 X.shape[1] + 1, 1 - 1e9)
The AppVeyor issue may be a heisenbug, but is certainly not due to your code; the Circle issue isn't your problem either. Sorry I've not yet reviewed the substance of the PR.
Apart from these nitpicks and the PEP8 fixes, this LGTM
data[outliers[1], :] = (-1000, -1000)
data[outliers[2], :] = (-100, -50)

np.random.seed(1)
global seeding isn't appropriate. You can use rng = np.random.RandomState(1)
OK, but it's important to seed with a different seed than the one I use in the test, or I will "randomly" sample only outliers. That was fun. I used rng = np.random.RandomState(1000)
data[outliers[2], :] = (-100, -50)

np.random.seed(1)
outliers = (np.random.rand(80) * len(X)).astype(np.uint32)
this is np.random.randint(len(X), size=80), but more obfuscated?
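Both points from this exchange can be illustrated together; a small sketch (variable names and the value of `len(X)` are stand-ins, not the test's actual data):

```python
import numpy as np

# A local RandomState keeps the test reproducible without touching the
# global np.random state that other tests may rely on.
rng = np.random.RandomState(1000)
n = 100  # stand-in for len(X)

obfuscated = (rng.rand(80) * n).astype(np.uint32)  # the original spelling
clear = rng.randint(n, size=80)                    # same idea, clearer intent
```

Both arrays hold 80 indices drawn uniformly from range(n); the second form states that directly.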
assert getattr(ransac_estimator, 'n_trials_', None) is None
ransac_estimator.fit(X, y)
assert_equal(ransac_estimator.n_trials_, 2)
for i in range(400):
how long does this take to run? I think we'd rather a smaller number of runs.
It doesn't take long. I chose 400 based on the failure probability of the previous code.
How long is not long?
1.6 seconds
I thought I had already said this is too long. Can you find some way to cut it back, using different probabilities, etc.?
                                   random_state=i)
assert getattr(ransac_estimator, 'n_trials_', None) is None
ransac_estimator.fit(X, y)
max_trials = _dynamic_max_trials(len(X) - len(outliers), X.shape[0], X.shape[1] + 1, 1 - 1e9)
for clarity, replace 1 - 1e9 with 1 - ransac_estimator.stop_probability
1e-9 is a much smaller number than 1 - ransac_estimator.stop_probability. Using the same stop_probability for estimation and verification would cause a non-trivial chance of failure even for well-behaved code. I could use (1 - ransac_estimator.stop_probability) * 1e3. The point is that it should be much smaller.
Okay. Please add a comment for what this is intending.
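The point about needing a much stricter verification probability can be made concrete; a sketch with illustrative numbers (0.99 is sklearn's default stop_probability, 400 is the run count used in this test):

```python
# If the test verified against the same stop_probability the estimator
# uses (default 0.99), each run would legitimately exceed the verified
# bound about 1% of the time. Over many runs that compounds badly.
stop_probability = 0.99
failure_per_run = 1 - stop_probability   # 0.01 with the default

runs = 400
# Chance that at least one of 400 independent runs exceeds the 0.99 bound:
p_any_failure = 1 - (1 - failure_per_run) ** runs
```

With these numbers the test would almost certainly fail spuriously at least once, whereas verifying against 1 - 1e-9 makes a spurious failure astronomically unlikely.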
@jnothman so? Are you waiting for me?
travis reports flake8 failures:
./sklearn/linear_model/ransac.py:427:55: E502 the backslash is redundant between brackets
if (n_inliers_best >= self.stop_n_inliers \
^
./sklearn/linear_model/ransac.py:428:25: E127 continuation line over-indented for visual indent
or score_best >= self.stop_score):
^
./sklearn/linear_model/tests/test_ransac.py:96:9: E265 block comment should start with '# '
#there is a 1e9 chance it will take thise many trials. No good reason
^
./sklearn/linear_model/tests/test_ransac.py:97:9: E265 block comment should start with '# '
#1e2 isn't enough, can still happen
^
./sklearn/linear_model/tests/test_ransac.py:529:1: W391 blank line at end of file
^
Also note that once I've approved your PR, we will still await another reviewer.
for i in range(400):
    ransac_estimator = RANSACRegressor(base_estimator, min_samples=2,
                                       random_state=i)
    assert getattr(ransac_estimator, 'n_trials_', None) is None
we avoid raw assert. Please use assert_true
Codecov Report
@@ Coverage Diff @@
## master #8271 +/- ##
=========================================
Coverage ? 94.44%
=========================================
Files ? 342
Lines ? 61423
Branches ? 0
=========================================
Hits ? 58013
Misses ? 3410
Partials ? 0
Continue to review full report at Codecov.
@jnothman I made the PEP8 changes. Where does this stand?
Where does this stand? In a large muddled mass of Pull Requests awaiting review, unfortunately. I'm happy to approve it, and would like you to add an entry in the changelog (whats_new.rst), but again, it will await another review.
    min_samples,
    self.stop_probability)):
if n_inliers_best >= self.stop_n_inliers or \
        score_best >= self.stop_score:
This indentation is failing flake8. One option:
if n_inliers_best >= self.stop_n_inliers or \
score_best >= self.stop_score:
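A self-contained illustration of the two flake8-clean spellings of this condition (the loop-state values below are made-up stand-ins, not RANSAC's real state):

```python
# Hypothetical stand-ins for RANSAC's loop state.
n_inliers_best, stop_n_inliers = 90, 80
score_best, stop_score = 0.5, 0.9

# Option 1: backslash continuation with a hanging indent.
if n_inliers_best >= stop_n_inliers or \
        score_best >= stop_score:
    stop = True

# Option 2: parentheses, which avoid the redundant backslash (E502)
# and make the continuation indent unambiguous (E127).
if (n_inliers_best >= stop_n_inliers
        or score_best >= stop_score):
    stop_with_parens = True
```

Both forms evaluate identically; parentheses are usually preferred because the backslash is easy to lose in later edits.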
For such a small fix this is a really arduous process. I hope it will make it in at some point.
I'm sorry that you feel the process is arduous, but a second reviewer often finds things the first didn't. Review also means that any individual is more likely to be familiar enough with some part of the code to maintain it. It's very possible that the PR will get merged without more input from you, but it may be slow on our part. We are a group of volunteers and lately have very little time between us to review.
# there is a 1e9 chance it will take these many trials. No good reason
# 1e2 isn't enough, can still happen
max_trials = _dynamic_max_trials(
    len(X) - len(outliers), X.shape[0], X.shape[1] + 1, 1 - 1e9)
Isn't this supposed to be 1 - 1e-9?!
Can you change X.shape[1] + 1 explicitly to 3. It is not obvious that the default number of samples is set to number of features + 1 just by looking at the test.
Also you can move this computation outside the for loop.
I like shape[1] + 1 since that's what ransac uses. I don't really mind though; I'll put it in a comment.
and X.shape[1] + 1 = 2 (which is the min number of samples needed for a linear estimator in 2d).
ransac_estimator.fit(X, y)
assert_equal(ransac_estimator.n_trials_, 2)
for i in range(50):
    ransac_estimator = RANSACRegressor(base_estimator, min_samples=2,
nitpick: You can move this to the outside and use set_params to avoid creating many objects.
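The set_params suggestion looks roughly like this; a sketch, assuming the test's estimator setup (the fit call is elided since the real X and y live in the test):

```python
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Build the estimator once, then only swap the seed on each iteration
# instead of constructing a new RANSACRegressor per run.
ransac = RANSACRegressor(LinearRegression(), min_samples=2)
for i in range(3):
    ransac.set_params(random_state=i)
    # ransac.fit(X, y) would go here in the real test
```

set_params returns the estimator itself, so the object identity (and any parameters not mentioned) is preserved across iterations.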
Actually can you please move this to a separate test?
This new test is in the spirit of the original test, except for the extra iterations.
Thanks for your contribution! Can you please document this change under whatsnew and fix the pending flake error?
The 1 - 1e-9 is correct: 1 - p for some small probability. I also have a preference for n_features + 1 over the constant 3 so that the code is more literate, but I don't mind. Perhaps comments in the test are deserved.
yeah, I know that. But in the code it is (1 - 1e9), not (1 - 1e-9)
oh. whoops.
Move computation out of loop, fix 1e-9 (from 1e9)...
where do I edit this whatsnew?
doc/whats_new.rst, under bug fixes
ransac_estimator = RANSACRegressor(base_estimator, min_samples=2)
for i in range(50):
    ransac_estimator.set_params(min_samples=2, random_state=i)
    ransac_estimator.n_trials_ = None
You can remove this line. It will be overwritten by fit.
Can you please fix the couple of PEP8 errors? LGTM apart from that.
@@ -415,13 +418,14 @@ def fit(self, X, y, sample_weight=None):
    X_inlier_best = X_inlier_subset
    y_inlier_best = y_inlier_subset

    max_trials = min(
        max_trials,
max_trials = min(max_trials,
_dynamic_max_trials(n_inliers_best, n_samples,
min_samples, self.stop_probability))
IMO, this is more readable.
# 1e-2 isn't enough, can still happen
# 2 is what ransac defines as min_samples = X.shape[1] + 1
max_trials = _dynamic_max_trials(
    len(X) - len(outliers), X.shape[0], 2, 1 - 1e-9)
It would be easier to read without the need to refer to the docstring:
max_trials = _dynamic_max_trials(n_inliers=len(X) - len(outliers),
n_samples=len(X),
min_samples=2,
probability=1 - 1e-9)
you guys are killing me
Isn't the only difference here that len(X) is used instead of X.shape[0]?
Anyway, yes, I agree we need to try less nitpicking on this one.
Oh, I see, visual indent stuff. No matter.
Thanks, @aivision2020. Merging
And thanks for raising the issue and persisting with it.
Fixing number of iterations in ransac
Fixes #8251.
#8251 (comment)