
BUG Fix instability issue of ARDRegression (with speedup) #16849

Merged 5 commits into scikit-learn:master on Apr 23, 2020

Conversation

@NicolasHug (Member) commented Apr 5, 2020

This should fix #15186
Closes #16102

This comes with a significant speedup: on master the estimator only scales to a few hundred samples, while this PR comfortably fits 100k samples in about one second.
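A rough sketch of how one might check that claim (illustrative data and sizes, not from the PR; timings are machine dependent and assume this branch is installed):

import time
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(100_000, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(100_000)

tic = time.perf_counter()
ARDRegression().fit(X, y)
print(f"fit took {time.perf_counter() - tic:.2f} s")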

@NicolasHug NicolasHug changed the title [WIP] Fix instability issue of ARDRegression (with speedup) [MRG] Fix instability issue of ARDRegression (with speedup) Apr 5, 2020
@NicolasHug NicolasHug added this to the 0.23 milestone Apr 5, 2020
@NicolasHug NicolasHug added the Bug label Apr 5, 2020
Comment on lines +186 to +188
# With the inputs above, ARDRegression prunes both of the two coefficients
# in the first iteration. Hence, the expected shape of `sigma_` is (0, 0).
assert clf.sigma_.shape == (0, 0)
@NicolasHug (Member, Author):
If the target is constant, it makes sense to me that both coefficients should be set to zero, so I think this is a bug fix.
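A hypothetical illustration of the constant-target case (the test's actual inputs differ):

import numpy as np
from sklearn.linear_model import ARDRegression

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.ones(3)  # constant target
clf = ARDRegression().fit(X, y)
print(clf.coef_)         # expected: both coefficients pruned to zero
print(clf.sigma_.shape)  # expected after this fix: (0, 0)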

@thomasjpfan (Member):

By setting clf.sigma_ = np.array([]), the later call to clf.predict(X, return_std=True) would return a different result for the std when compared to master. Is this okay?

@NicolasHug (Member, Author):

Yes, it's a bug fix: sigma_ should be empty when keep_lambda is all False.

You might want to slightly delay your review @thomasjpfan , I'm going to update this for the n_features > n_samples case

Comment on lines 205 to 206
@pytest.mark.parametrize('seed', range(100))
def test_ard_accuracy_on_easy_problem(seed):
@NicolasHug (Member, Author) commented Apr 5, 2020:

The test should now pass on many different seeds (there was a 7% failure rate before), and with much higher precision.
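A sketch of what such a seeded accuracy test might look like (hypothetical data and tolerance, not the PR's exact test):

import numpy as np
import pytest
from sklearn.linear_model import ARDRegression

@pytest.mark.parametrize('seed', range(100))
def test_ard_accuracy_on_easy_problem(seed):
    rng = np.random.RandomState(seed)
    X = rng.randn(250, 3)
    y = X[:, 1]  # noiseless target driven by a single feature
    clf = ARDRegression().fit(X, y)
    # the informative coefficient should be recovered accurately
    np.testing.assert_allclose(clf.coef_[1], 1.0, rtol=1e-3)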

gram = np.dot(X_keep.T, X_keep)
eye = np.eye(gram.shape[0])
sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram
sigma_ = pinvh(sigma_inv)
@NicolasHug (Member, Author):

So that's basically the main fix. Instead of relying on the Woodbury formula, we just directly invert the matrix (called S_N in the reference). This seems to be much more stable and also much faster, because here we invert a matrix that is n_features x n_features, whereas on master the inversion is of an n_samples x n_samples matrix.

Strangely enough, this somewhat reverts a commit from 10 years ago (b2d0a77), whose goal was to make things faster, but I'm not sure it did. @vmichel, if you're still around, your input would be greatly appreciated!
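For readers following along, a minimal sketch contrasting the two formulations (illustrative alpha_ and lambda_, not the PR's exact code); both compute the same sigma_, but invert matrices of very different sizes:

import numpy as np
from scipy.linalg import pinvh

rng = np.random.RandomState(0)
X = rng.randn(200, 5)  # n_samples > n_features
lambda_ = rng.uniform(0.5, 2.0, size=5)
alpha_ = 1.0

# This PR: directly invert the (n_features, n_features) matrix S_N^{-1}.
sigma_direct = pinvh(np.diag(lambda_) + alpha_ * X.T @ X)

# master: Woodbury identity, which inverts an (n_samples, n_samples) matrix.
inv_lambda = 1.0 / lambda_
inner = pinvh(np.eye(X.shape[0]) / alpha_ + (X * inv_lambda) @ X.T)
sigma_woodbury = (np.diag(inv_lambda)
                  - (inv_lambda[:, None] * X.T) @ inner @ (X * inv_lambda))

np.testing.assert_allclose(sigma_direct, sigma_woodbury, rtol=1e-6, atol=1e-8)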

A reviewer (Member):

could it be that numpy operations are now much faster?

@NicolasHug (Member, Author):

I doubt it; the bottleneck is the matrix inversion here.
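A quick, machine-dependent way to see where the time goes (illustrative sizes):

import time
import numpy as np
from scipy.linalg import pinvh

rng = np.random.RandomState(0)
X = rng.randn(5000, 50)

for M, label in [(X.T @ X, "(n_features, n_features)"),
                 (X @ X.T, "(n_samples, n_samples)")]:
    tic = time.perf_counter()
    pinvh(M)  # the inversion dominates; the matrix products are cheap here
    print(label, f"{time.perf_counter() - tic:.3f} s")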

@@ -12,7 +12,7 @@
from ._base import LinearModel, _rescale_data
from ..base import RegressorMixin
from ..utils.extmath import fast_logdet
-from ..utils.fixes import pinvh
+from scipy.linalg import pinvh
@NicolasHug (Member, Author):

We don't need scipy's backport anymore, so we can remove it now.

@NicolasHug (Member, Author):

Pinging @rth @jnothman @thomasjpfan @glemaitre since you took a look at the previous PR. Thanks!

@adrinjalali (Member) left a comment:

It'd be nice to get this one in. It looks good to me, I think.

@@ -290,6 +290,10 @@ Changelog
of strictly inferior for maximum of `absgrad` and `tol` in `utils.optimize._newton_cg`.
:pr:`16266` by :user:`Rushabh Vasani <rushabh-v>`.

- |Fix| |Efficiency| :class:`linear_model.ARDRegression` is more stable and
much faster. It can now scale to hundreds of thousands of samples.

@NicolasHug (Member, Author):

I've updated the PR so that the Woodbury formula is still used when n_features > n_samples, since it is faster in that case.

CC @amueller
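A hypothetical sketch of that dispatch (update_sigma here is illustrative, not the PR's actual helper):

import numpy as np
from scipy.linalg import pinvh

def update_sigma(X, alpha_, lambda_):
    # Pick whichever formulation inverts the smaller matrix.
    n_samples, n_features = X.shape
    if n_samples >= n_features:
        # direct inversion of the (n_features, n_features) matrix
        return pinvh(np.diag(lambda_) + alpha_ * X.T @ X)
    # Woodbury identity: the inner inverse is (n_samples, n_samples)
    inv_lambda = 1.0 / lambda_
    inner = pinvh(np.eye(n_samples) / alpha_ + (X * inv_lambda) @ X.T)
    return (np.diag(inv_lambda)
            - (inv_lambda[:, None] * X.T) @ inner @ (X * inv_lambda))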

@adrinjalali (Member):

Since the results of the two methods are not the same, should it be exposed as kind of an engine or solver or something?

@NicolasHug (Member, Author):

They are the same; we have a test for that.

@adrinjalali (Member):

Here's where I'm confused:

You changed the solver block and added the keep_lambda logic, and that fixed the test issue and the coefficients issue. I thought the solver was the main issue there. Now you added the old one back, and the tests are still passing. So was the issue with the keep_lambda logic, and the solver only a performance improvement?

@NicolasHug (Member, Author):

The only thing I added about keep_lambda is `if keep_lambda.any():`, and it's only a consequence of the solver being more stable now. Before, that case was never reached.

> I thought the solver was the main issue there

Yes (it's not, strictly speaking, a solver; it's more a way of computing the inverse of a specific matrix. In the end, the same pinvh function is used).

> So the issue was with the keep_lambda and the solver is only a performance improvement?

It's not just a performance improvement, it's also a stability improvement. It's more stable to invert dot(X.T, X) when n_samples > n_features than to invert dot(X, X.T), because dot(X, X.T) has shape n_samples x n_samples, which makes it poorly conditioned (its rank is still only n_features).
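A small illustration of the conditioning argument (synthetic data, not PR code):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)  # n_samples >> n_features

# (n_features, n_features) Gram matrix: full rank, modest condition number.
print(np.linalg.cond(X.T @ X))
# (n_samples, n_samples) matrix: rank at most n_features, so it is
# numerically singular and its condition number explodes.
print(np.linalg.cond(X @ X.T))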

@adrinjalali (Member):

I'm happy with this one, another review maybe? @thomasjpfan wanna have another look?

@thomasjpfan thomasjpfan changed the title [MRG] Fix instability issue of ARDRegression (with speedup) BUG Fix instability issue of ARDRegression (with speedup) Apr 23, 2020
@thomasjpfan thomasjpfan merged commit 946fdde into scikit-learn:master Apr 23, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
@woctezuma commented Aug 13, 2021:

Sorry to bother you if I am wrong. However, I have noticed that the plot in the ARD documentation is different with sklearn 0.23.X compared to version 0.22.X, and I wonder whether this commit introduced the issue (maybe it is the expected behavior).

cf. #20740, or directly the last plot shown in this example in the documentation and embedded in my post below.
[Image: the "wrong" plot from the ARD documentation example]

@glemaitre (Member):

I opened a new issue so that this problem can be investigated.

Successfully merging this pull request may close these issues: test_bayes.py fails