BUG Fix instability issue of ARDRegression (with speedup) #16849
Conversation
# With the inputs above, ARDRegression prunes both of the two coefficients
# in the first iteration. Hence, the expected shape of `sigma_` is (0, 0).
assert clf.sigma_.shape == (0, 0)
If the target is constant, it makes sense to me that both coefficients should be set to zero. So I think this is a bug fix.
By setting `clf.sigma_ = np.array([])`, the later call to `clf.predict(X, return_std=True)` would return a different result for the std when compared to master. Is this okay?
Yes, it's a bug fix: `sigma_` should be empty when `keep_lambda` is all False.
You might want to slightly delay your review @thomasjpfan , I'm going to update this for the n_features > n_samples case
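To make the corner case concrete, here is a minimal hedged sketch (the data below is made up and is not the input used in the actual test): with a constant target, all coefficients end up pruned, `sigma_` becomes empty, and `predict(..., return_std=True)` should still work.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Hypothetical inputs, not the ones from the test suite.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.full(3, 7.0)  # constant target: the intercept explains everything

clf = ARDRegression().fit(X, y)
# If both coefficients are pruned, sigma_ is an empty (0, 0) matrix after this PR.
print(clf.sigma_.shape)
# predict with return_std=True should still work; with no kept features the
# predictive std reduces to the noise std sqrt(1 / alpha_).
mean, std = clf.predict(X, return_std=True)
```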
@pytest.mark.parametrize('seed', range(100))
def test_ard_accuracy_on_easy_problem(seed):
The test should now run properly on many different seeds (there was a 7% failure rate before), and with a much higher precision
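For context, here is a sketch of what such a seeded accuracy test can look like (the exact data, sizes, and tolerance used in the PR may differ): the target is exactly one of the features, so ARD should recover that coefficient almost perfectly for every seed.

```python
import numpy as np
import pytest
from sklearn.linear_model import ARDRegression


@pytest.mark.parametrize('seed', range(100))
def test_ard_accuracy_on_easy_problem(seed):
    # Easy problem: the target is exactly the second of three features.
    X = np.random.RandomState(seed).normal(size=(250, 3))
    y = X[:, 1]
    regressor = ARDRegression()
    regressor.fit(X, y)
    # The coefficient of the informative feature should be recovered very precisely.
    assert np.abs(1 - regressor.coef_[1]) < 1e-10
```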
sklearn/linear_model/_bayes.py
Outdated
gram = np.dot(X_keep.T, X_keep)
eye = np.eye(gram.shape[0])
sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram
sigma_ = pinvh(sigma_inv)
So that's basically the main fix. Instead of relying on the Woodbury formula, we just directly invert the matrix (called S_N in the ref). This seems to be much more stable and also much faster, because here we invert a matrix that is n_features x n_features, whereas in master the inversion is on an n_samples x n_samples matrix.
Strangely enough, this somewhat reverts a commit from 10 years ago (b2d0a77), whose goal was to make things faster, but I'm not sure it did. @vmichel, if you're still around, your input would be greatly appreciated!
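To spell out the equivalence, here is a small hedged numpy sketch (made-up shapes and precision values) of the two routes to the posterior covariance S_N = (diag(lambda) + alpha * X.T @ X)^-1: the direct n_features x n_features inversion used in this PR versus the Woodbury identity used on master, which goes through an n_samples x n_samples inverse.

```python
import numpy as np
from scipy.linalg import pinvh

rng = np.random.RandomState(0)
n_samples, n_features = 200, 5                     # made-up sizes
X = rng.normal(size=(n_samples, n_features))
alpha_ = 2.0                                       # noise precision (hypothetical)
lambda_ = rng.uniform(0.5, 2.0, size=n_features)   # weight precisions (hypothetical)

# Direct route (this PR): invert the n_features x n_features matrix S_N^-1.
sigma_direct = pinvh(np.diag(lambda_) + alpha_ * X.T @ X)

# Woodbury route (master): the inversion happens on an n_samples x n_samples matrix.
A_inv = np.diag(1.0 / lambda_)
inner = pinvh(np.eye(n_samples) / alpha_ + X @ A_inv @ X.T)
sigma_woodbury = A_inv - A_inv @ X.T @ inner @ X @ A_inv

# Both routes compute the same covariance (up to numerical error).
np.testing.assert_allclose(sigma_direct, sigma_woodbury, rtol=1e-6, atol=1e-8)
```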
could it be that numpy operations are now much faster?
I doubt it, the bottleneck is the matrix inversion here
@@ -12,7 +12,7 @@
 from ._base import LinearModel, _rescale_data
 from ..base import RegressorMixin
 from ..utils.extmath import fast_logdet
-from ..utils.fixes import pinvh
+from scipy.linalg import pinvh
We don't need scipy's backport anymore, so we can remove it now.
Pinging @rth @jnothman @thomasjpfan @glemaitre since you took a look at the previous PR. Thanks!
It'd be nice to get this one in. It looks good to me, I think.
doc/whats_new/v0.23.rst
Outdated
@@ -290,6 +290,10 @@ Changelog
   of strictly inferior for maximum of `absgrad` and `tol` in `utils.optimize._newton_cg`.
   :pr:`16266` by :user:`Rushabh Vasani <rushabh-v>`.

+- |Fix| |Efficiency| :class:`linear_model.ARDRegression` is more stable and
+  much faster. It can now scale to hundreds of thousands of samples.
you could also mention https://github.com/scikit-learn/scikit-learn/pull/16849/files#r403760030 here maybe?
I've updated the PR so that the Woodbury formula is still used when n_samples < n_features. CC @amueller
Since the results of the two methods are not the same, should it be exposed as kind of an option?
They are the same, we have a test for that.
Here's where I'm confused: you changed the solver block, and added the
The only thing I added about
Yes (it's not strictly speaking a solver, it's more of a way of computing the inverse of a specific matrix). In the end, the same result is computed either way.
It's not just a performance improvement, it's also a stability improvement. It's more stable to invert the n_features x n_features matrix directly.
I'm happy with this one, another review maybe? @thomasjpfan wanna have another look?
Sorry to bother if I am wrong, but I have noticed that the plot in the ARD documentation is different in sklearn 0.23.X compared to version 0.22.X, and I wonder if this commit introduced the issue (maybe it is the expected behavior). Cf. #20740, or directly the last plot shown in this example in the documentation.
I opened a new issue so that this problem can be investigated.
This should fix #15186.
Closes #16102.
This comes with a significant speed-up: on master, the estimator can only scale to a few hundred samples. This PR can comfortably fit 100k samples in about 1 second.
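A rough, hedged way to sanity-check that claim on a synthetic problem (timings obviously depend on hardware and BLAS, and the data below is made up):

```python
import time

import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
n_samples, n_features = 100_000, 20
X = rng.normal(size=(n_samples, n_features))
y = X @ rng.normal(size=n_features) + 0.1 * rng.normal(size=n_samples)

tic = time.perf_counter()
ARDRegression().fit(X, y)
print(f"fit took {time.perf_counter() - tic:.2f} s")  # roughly a second with this PR
```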