
[MRG+1] ransac is running too many iterations #8271

Merged 17 commits into scikit-learn:master on Apr 29, 2017

Conversation

@aivision2020 (Contributor) commented Feb 2, 2017

Fixing the number of iterations in RANSAC.

Fixes #8251.

#8251 (comment)

@lesteve (Member) commented Feb 2, 2017

@ahojnnes you seem to be the one who implemented RANSAC, do you have an opinion on this change?

@aivision2020 a first step to convince me, as someone who is not familiar at all with this code: run the examples using RANSAC and make sure that the output is similar to the one with master. You can find the latter at scikit-learn.org/dev/auto_examples.

yhat_ransac = clt_ransac.predict(X)
inlier_mask = clt_ransac.inlier_mask_
print('*************************')
print('actual trials:', clt_ransac.n_trials_)
Member:

use:

assert_equal(clt_ransac.n_trials_,
             _dynamic_max_trials(np.sum(inlier_mask), X.shape[0],
                                 X.shape[1] + 1, clt_ransac.stop_probability))

Contributor Author:

It's a statistical algorithm, so it can't be guaranteed that n_trials_ won't be higher; I'll check <= 21. That will give us a 1e-9 probability of failure.
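For context, the trial bound being discussed follows the standard RANSAC formula (the same one implemented by scikit-learn's private `_dynamic_max_trials` helper): the number of trials needed so that, with the requested probability, at least one sampled subset is outlier-free. A standalone sketch, not the library's code:

```python
import math

def dynamic_max_trials(n_inliers, n_samples, min_samples, probability):
    """Number of RANSAC trials needed so that, with `probability`,
    at least one random `min_samples`-subset contains only inliers."""
    inlier_ratio = n_inliers / n_samples
    nom = 1 - probability                     # allowed failure probability
    denom = 1 - inlier_ratio ** min_samples   # P(a subset contains an outlier)
    if nom == 0:
        return float('inf')                   # certainty is unreachable
    if denom == 0:
        return 0                              # every point is an inlier
    return int(math.ceil(math.log(nom) / math.log(denom)))

# Textbook case: 50% inliers, 2-point samples, 99% confidence -> 17 trials
print(dynamic_max_trials(50, 100, 2, 0.99))
```

The bound grows quickly as the requested probability approaches 1, which is why a very small failure probability such as 1e-9 still yields a modest, testable trial count.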

Contributor Author:

I see you have a test for this (test_ransac_max_trials), but there are only 3 outliers and you run the test once. If you up the outliers to 40 (~10%) and don't set random_state to zero, it will run 4 iterations once every 100-ish runs. But that's OK.

Y[:80] += np.random.rand(80, 1)

for i in range(10):
    clt_ransac = linear_model.RANSACRegressor(linear_model.LinearRegression(),
                                              max_trials=100)
Member:

We'd like to use a fixed random state, but I realise this test is to ensure the property is invariant to randomness, so add random_state=i here.

I don't know what clt means here; maybe just call it reg.

@@ -0,0 +1,20 @@
from sklearn import linear_model
Member:

The test should be in sklearn/linear_model/tests/test_ransac.py and should be structured as a function with an assertion.

@@ -415,13 +418,12 @@ def fit(self, X, y, sample_weight=None):
X_inlier_best = X_inlier_subset
y_inlier_best = y_inlier_subset

self.max_trials = min(self.max_trials, _dynamic_max_trials(n_inliers_best, n_samples,
min_samples, self.stop_probability))
Member:

This should have aligned indentation as described by PEP8

@aivision2020 pushed a commit to master: "some pep8 stuff in ransac.py"
@aivision2020 aivision2020 reopened this Feb 3, 2017
@lesteve (Member) commented Feb 8, 2017

The CI errors seem genuine; please have a look at them.

@aivision2020 (Contributor Author):

I've done my best.

@jnothman (Member) commented Feb 11, 2017 via email

@jnothman (Member):

The PEP8 failures:

```
./sklearn/linear_model/ransac.py:428:17: W503 line break before binary operator
                or score_best >= self.stop_score):
                ^
```

this is often a bit nasty, but you can change:

```python
        if (...
            or score_best >= self.stop_score):
            # do stuff
```

to

```python
            if ... \
               or score_best >= self.stop_score:
                # do stuff
```

for instance. Or:

```python
            if (...
                    or score_best >= self.stop_score):
                # do stuff
```

```
./sklearn/linear_model/tests/test_ransac.py:95:80: E501 line too long (101 > 79 characters)
        max_trials = _dynamic_max_trials(len(X) - len(outliers), X.shape[0], X.shape[1] + 1, 1 - 1e9)
```

can be fixed as

```python
        max_trials = _dynamic_max_trials(len(X) - len(outliers), X.shape[0],
                                         X.shape[1] + 1, 1 - 1e9)
```

@jnothman (Member):

The AppVeyor issue may be a heisenbug, but is certainly not due to your code; the Circle issue isn't your problem either.

Sorry I've not yet reviewed the substance of the PR.

@jnothman (Member) left a review:

Apart from these nitpicks and the PEP8 fixes, this LGTM

data[outliers[1], :] = (-1000, -1000)
data[outliers[2], :] = (-100, -50)

np.random.seed(1)
Member:

global seeding isn't appropriate. You can use rng = np.random.RandomState(1)

Contributor Author:

OK, but it's important to seed with a different seed than the one I use in the test, or I will "randomly" sample only outliers. That was fun.

I used rng = np.random.RandomState(1000)

data[outliers[2], :] = (-100, -50)

np.random.seed(1)
outliers = (np.random.rand(80) * len(X)).astype(np.uint32)
Member:

this is np.random.randint(len(X), size=80), but more obfuscated?
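For reference, the two idioms draw integers from the same range; a small sketch of the equivalence the reviewer is pointing at (the value of `n` here is illustrative, standing in for `len(X)`):

```python
import numpy as np

rng = np.random.RandomState(1000)
n = 1000  # illustrative stand-in for len(X)

# Obfuscated version from the diff: scale uniform floats and truncate.
obfuscated = (rng.rand(80) * n).astype(np.uint32)

# Clearer equivalent suggested by the reviewer.
clear = rng.randint(n, size=80)

# Both produce 80 indices in [0, n); only the readability differs.
assert obfuscated.min() >= 0 and obfuscated.max() < n
assert clear.min() >= 0 and clear.max() < n
```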

assert getattr(ransac_estimator, 'n_trials_', None) is None
ransac_estimator.fit(X, y)
assert_equal(ransac_estimator.n_trials_, 2)
for i in range(400):
Member:

how long does this take to run? I think we'd rather a smaller number of runs.

Contributor Author:

It doesn't take long. I chose 400 based on the failure probability of the previous code.

Member:

How long is not long?

Contributor Author:

1.6 seconds

Member:

I thought I had already said this is too long. Can you find some way to cut it back, using different probabilities, etc.?

random_state=i)
assert getattr(ransac_estimator, 'n_trials_', None) is None
ransac_estimator.fit(X, y)
max_trials = _dynamic_max_trials(len(X) - len(outliers), X.shape[0], X.shape[1] + 1, 1 - 1e9)
Member:

for clarity, replace 1 - 1e9 with 1 - ransac_estimator.stop_probability

Contributor Author:

1e-9 is a much smaller number than 1 - ransac_estimator.stop_probability. Using the same stop_probability for estimation and verification would cause a non-trivial chance of failure even for well-behaved code. I could use (1 - ransac_estimator.stop_probability) * 1e-3. The point is that it should be much smaller.

Member:

Okay. Please add a comment for what this is intending.

@aivision2020 (Contributor Author):

@jnothman so? Are you waiting for me?

@jnothman (Member) left a review:

travis reports flake8 failures:

```
./sklearn/linear_model/ransac.py:427:55: E502 the backslash is redundant between brackets
            if (n_inliers_best >= self.stop_n_inliers \
                                                      ^
./sklearn/linear_model/ransac.py:428:25: E127 continuation line over-indented for visual indent
                        or score_best >= self.stop_score):
                        ^
./sklearn/linear_model/tests/test_ransac.py:96:9: E265 block comment should start with '# '
        #there is a 1e9 chance it will take thise many trials. No good reason
        ^
./sklearn/linear_model/tests/test_ransac.py:97:9: E265 block comment should start with '# '
        #1e2 isn't enough, can still happen
        ^
./sklearn/linear_model/tests/test_ransac.py:529:1: W391 blank line at end of file
^
```

Also note that once I've approved your PR, we will still await another reviewer.


for i in range(400):
ransac_estimator = RANSACRegressor(base_estimator, min_samples=2,
random_state=i)
assert getattr(ransac_estimator, 'n_trials_', None) is None
Member:

we avoid raw assert. Please use assert_true

codecov bot commented Feb 27, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@53f8082). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #8271   +/-   ##
=========================================
  Coverage          ?   94.44%           
=========================================
  Files             ?      342           
  Lines             ?    61423           
  Branches          ?        0           
=========================================
  Hits              ?    58013           
  Misses            ?     3410           
  Partials          ?        0
Impacted Files Coverage Δ
sklearn/linear_model/tests/test_ransac.py 100% <100%> (ø)
sklearn/linear_model/ransac.py 98.67% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53f8082...8b63d25. Read the comment docs.

@aivision2020 (Contributor Author):

@jnothman I made the PEP8 changes. Where does this stand?

@jnothman (Member) left a review:

Where does this stand? In a large muddled mass of Pull Requests awaiting review, unfortunately. I'm happy to approve it, and would like you to add an entry in the changelog (whats_new.rst), but again, it will await another review.

min_samples,
self.stop_probability)):
if n_inliers_best >= self.stop_n_inliers or \
score_best >= self.stop_score:
Member:

This indentation is failing flake8. One option:

            if n_inliers_best >= self.stop_n_inliers or \
               score_best >= self.stop_score:

@jnothman jnothman changed the title ransac is running too many iterations [MRG+1] ransac is running too many iterations Mar 13, 2017
@aivision2020 (Contributor Author) commented Mar 13, 2017 via email

@jnothman (Member) commented Mar 13, 2017 via email

# there is a 1e9 chance it will take these many trials. No good reason
# 1e2 isn't enough, can still happen
max_trials = _dynamic_max_trials(
len(X) - len(outliers), X.shape[0], X.shape[1] + 1, 1 - 1e9)
@MechCoder (Member), Apr 3, 2017:

Isn't this supposed to be 1 - 1e-9?!

@MechCoder (Member), Apr 3, 2017:

Can you change X.shape[1] + 1 explicitly to 3? It is not obvious, just by looking at the test, that the default number of samples is set to number of features + 1.

Member:

Also you can move this computation outside the for loop.

Contributor Author:

I like shape[1] + 1 since that's what ransac uses. I don't really mind, though; I'll put it in a comment.

Contributor Author:

And X.shape[1] + 1 = 2, which is the minimum number of samples needed for a linear estimator in 2D.

ransac_estimator.fit(X, y)
assert_equal(ransac_estimator.n_trials_, 2)
for i in range(50):
ransac_estimator = RANSACRegressor(base_estimator, min_samples=2,
Member:

nitpick: You can move this to the outside and use set_params to avoid creating many objects.

Member:

Actually can you please move this to a separate test?

Contributor Author:

This new test is in the spirit of the original test, except for the extra iterations.

@MechCoder (Member):

Thanks for your contribution! Can you please document this change under whatsnew and fix the pending flake error?

@jnothman (Member) commented Apr 3, 2017 via email

@MechCoder (Member):

Yeah, I know that. But in the code it is (1 - 1e9), not (1 - 1e-9).

@jnothman (Member) commented Apr 3, 2017 via email

Move computation out of loop, fix 1e-9 (from 1e9)...
@aivision2020 (Contributor Author) commented Apr 22, 2017 via email

@jnothman (Member) commented Apr 22, 2017 via email

ransac_estimator = RANSACRegressor(base_estimator, min_samples=2)
for i in range(50):
ransac_estimator.set_params(min_samples=2,random_state=i)
ransac_estimator.n_trials_ = None
Member:

You can remove this line. It will be overwritten by fit.

@MechCoder (Member):

Can you please fix the couple of pep8 errors? LGTM apart from that.

@@ -415,13 +418,14 @@ def fit(self, X, y, sample_weight=None):
X_inlier_best = X_inlier_subset
y_inlier_best = y_inlier_subset

max_trials = min(
max_trials,
Member:

max_trials = min(max_trials,
                 _dynamic_max_trials(n_inliers_best, n_samples,
                                     min_samples, self.stop_probability))

IMO, this is more readable.

# 1e-2 isn't enough, can still happen
# 2 is the what ransac defines as min_samples = X.shape[1] + 1
max_trials = _dynamic_max_trials(
len(X) - len(outliers), X.shape[0], 2, 1 - 1e-9)
Member:

It would be easier to read without the need to refer to the docstring:

max_trials = _dynamic_max_trials(n_inliers=len(X) - len(outliers),
                                 n_samples=len(X),
                                 min_samples=2,
                                 probability=1 - 1e-9)

Contributor Author:

you guys are killing me

Member:

Isn't the only difference here that len(X) is used instead of X.shape[0]?

Anyway, yes, I agree we need to try less nitpicking on this one.

Member:

Oh, I see, visual indent stuff. No matter.

@jnothman (Member):

Thanks, @aivision2020. Merging

@jnothman jnothman merged commit 6e8ebda into scikit-learn:master Apr 29, 2017
@jnothman (Member):

And thanks for raising the issue and persisting with it.

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

5 participants