MRG - Add SVD-based solver to ridge regression. #1914
Conversation
I added a solver based on the thin SVD of X to compute the ridge coefficients. This is much more stable than the Cholesky decomposition for singular matrices. While in ridge regression the Gram matrix becomes nonsingular because of the regularization, I've come across problems when the regularization parameter is very small. This pull request remedies that, making the algorithm defined and stable even for the border case alpha=0. I took the liberty of changing the default algorithm to this one (when there are no sample weights).
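For context, the SVD route described above can be sketched as follows. This is a minimal illustration, not the code from this PR; `ridge_svd` is a made-up helper name, and the `1e-15` cutoff for filtering singular values is an assumption for the sketch:

```python
import numpy as np

def ridge_svd(X, y, alpha):
    """Ridge coefficients via the thin SVD of X (illustrative sketch).

    Stable even when X is rank deficient or alpha == 0: tiny singular
    values are filtered out instead of being inverted.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Per-component shrinkage factor s / (s**2 + alpha); near-zero
    # singular values contribute nothing rather than blowing up.
    d = np.zeros_like(s)
    mask = s > 1e-15
    d[mask] = s[mask] / (s[mask] ** 2 + alpha)
    return Vt.T @ (d * (U.T @ y))
```

Unlike a Cholesky solve of `X.T @ X + alpha * I`, this never forms the Gram matrix, which is why it stays well defined when X is singular and alpha goes to zero.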
Very nice 👍
@@ -82,10 +85,10 @@ def ridge_regression(X, y, alpha, sample_weight=1.0, solver='auto',
     has_sw = isinstance(sample_weight, np.ndarray) or sample_weight != 1.0

     if solver == 'auto':
-        # cholesky if it's a dense array and cg in
+        # svd if it's a dense array and cg in
So svd will be the default solver now?
Oh, okay, I see you mention that in the PR description :) I'm +1 for this: although it may be a tad slower, the increased stability is a great win, especially for new people who don't want to worry about border-case problems.
Yes, my opinion is that having a correct solution is the most important thing. I'm going to do some benchmarks to see how significant the performance change is.
Nice :)
Could you benchmark with the "lsqr" solver? It is super fast in my experience.
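For anyone benchmarking along: scipy's `lsqr` maps directly onto ridge through its `damp` parameter, since it minimizes `||X w - y||^2 + damp^2 ||w||^2`. A minimal sketch (`ridge_lsqr` is a made-up helper name, not an API in this PR):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def ridge_lsqr(X, y, alpha):
    """Ridge via scipy's iterative LSQR solver (illustrative sketch).

    lsqr solves min ||X w - y||^2 + damp^2 ||w||^2, so the ridge
    penalty alpha corresponds to damp = sqrt(alpha).
    """
    return lsqr(X, y, damp=np.sqrt(alpha))[0]
```

Note that `lsqr` handles a single right-hand side, so multiple targets require one call per column of Y.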
I was benchmarking and I observed that the results were identical for all solvers ... then I realized there was a bug in the Ridge object: the solver parameter was not passed on to the computational method! This is fixed in the last commit.
I was sure that it used to work so I had a look. The bug was introduced in 07c56d7. This shows that we should add one test per solver (and we need a way to make sure that the right solver is called).
One easy way to test that the solver is correctly passed is to check that an exception is correctly issued when the solver doesn't exist.
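That check could look something like the sketch below. It assumes (as current scikit-learn does) that `Ridge.fit` raises a `ValueError` for an unrecognized solver string; if the parameter were silently dropped before reaching the computational method, no error would surface:

```python
import numpy as np
from sklearn.linear_model import Ridge

def unknown_solver_raises():
    """Return True iff Ridge rejects a bogus solver name at fit time."""
    X = np.ones((5, 3))
    y = np.ones(5)
    try:
        Ridge(solver='no-such-solver').fit(X, y)
    except ValueError:
        return True  # the solver string reached validation, as it should
    return False
```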
Just remembered why lsqr is not the default solver: it's not available in scipy 0.7. |
I created some benchmarks (code: https://gist.github.com/fabianp/5533111). There are two plots: in the first one I fixed the number of samples and increased the number of features; in the second one I did the opposite, fixing the number of features and increasing the number of samples. What strikes me is that the SVD is really slow, so it is out of the question to make it the default. The iterative solvers have good properties even for dense data, so that's an option. However, they are not very efficient for problems with multiple targets, so that's definitely going to hurt performance in some applications. What I propose and implemented in the last commit is to keep Cholesky as the default, as it was before, and jump to the SVD solver whenever the Cholesky one breaks down. This way the behavior doesn't change for most usages, but the algorithm doesn't break down when X is not full rank.
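The fallback strategy proposed above can be sketched roughly as follows. This is a simplified illustration of the idea, not the PR's code; the function name and the `1e-15` cutoff are made up for the sketch:

```python
import numpy as np
from scipy import linalg

def ridge_cholesky_with_svd_fallback(X, y, alpha):
    """Solve (X^T X + alpha*I) w = X^T y by Cholesky, falling back to an
    SVD-based solve when the factorization fails (illustrative sketch)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    b = X.T @ y
    try:
        # cho_factor raises LinAlgError when A is not positive definite,
        # e.g. rank-deficient X with alpha == 0.
        c, low = linalg.cho_factor(A)
        return linalg.cho_solve((c, low), b)
    except linalg.LinAlgError:
        # Slower but always defined: filter tiny singular values.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        d = np.zeros_like(s)
        mask = s > 1e-15
        d[mask] = s[mask] / (s[mask] ** 2 + alpha)
        return Vt.T @ (d * (U.T @ y))
```

The fast path is unchanged for well-conditioned problems; the expensive SVD only runs when Cholesky actually fails.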
Wow, slow SVD... The Cholesky idea sounds good to me. 👍
The error intervals in the plot look really great!
Can you explain again why we don't want the iterative solvers as the default?
Hi Andy,

It's a good question, one I would like to answer at some point, but benchmarking iterative solvers vs direct solvers is outside the scope of this PR, and I fear that if I dig into it deeper here it will never get finished. I just wanted an SVD-based solver that would solve the exceptions I was getting from the Cholesky-based one. I thought it would be a good idea to make it the default and then realized what a terrible mistake that was, hence the benchmarks.

In the benchmarks above the data I used was Gaussian i.i.d., which usually leads to well-conditioned systems. For testing Cholesky vs SVD this doesn't matter much since they have a fixed cost that depends only on n_samples and n_features (not strictly true for SVD, but gimme that), but for CG and LSQR this can make a huge difference. I suspect this is also the reason why CG is faster than LSQR in my benchmarks.

Another reason why the choice of iterative methods should be discussed with care is the issue of n_targets. The matrix Y in ridge is of size (n_samples, n_targets), and both SVD and Cholesky make one matrix decomposition and solve for all n_targets using the same decomposition (I use this extensively; in my application we have very small matrices (100x100) but n_targets ~ 50000). The iterative solvers, on the other hand, need to solve n_targets different systems one after the other. I suspect a reasonable rule of thumb would be to use direct methods for n_targets > min{n_features, n_samples} and iterative ones otherwise. Another idea would be to use preconditioning techniques for the sparse system in those cases. However, both require some testing, and careful benchmarking :-)
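The multi-target point above is worth seeing concretely: a direct solver pays for one factorization and then back-substitutes for every target column at once, whereas an iterative solver would loop over the columns of Y. A minimal sketch (the function name is made up for illustration):

```python
import numpy as np
from scipy import linalg

def ridge_multi_target(X, Y, alpha):
    """Ridge for Y of shape (n_samples, n_targets): one Cholesky
    factorization, one back-substitution for all targets together."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    c, low = linalg.cho_factor(A)            # one O(p^3) factorization
    return linalg.cho_solve((c, low), X.T @ Y)  # O(p^2 * n_targets) solve
```

With n_targets in the tens of thousands and a small design matrix, the amortized factorization is what makes the direct methods attractive.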
Thanks for the explanation. Definitely, more testing is needed, and that is clearly out of scope here :)
I'll blog about it soon. I promise :-) |
Any new development on this, Fabian? Concerning the different decompositions, there is a distinction in flexibility. While the Cholesky implementation decomposes the already penalized matrix, the SVD and eigendecompositions work with the design/Gram matrix itself, and penalties can be added at liberty after this decomposition step. This decoupling makes efficient cross-validation possible, because everything reuses the one decomposition. I was planning to extend this approach to individual penalties per target, as I had proposed in another pull request. Ideally I would build on your contribution, so I am really looking forward to your blog post ... :)
Hey Michael,

Thanks for your input. This PR is only concerned with the SVD solver and does not tackle the multiple-target problem. As such, I consider it finished and think it's ready to be merged (@mblodel do you agree?). As you say, once this is merged we can start thinking about an efficient ridge_path method based on the SVD. That would also be very useful for me :-)
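The ridge_path idea floated here follows directly from the decoupling Michael describes: the SVD does not depend on alpha, so one decomposition serves an entire grid of penalties. A rough sketch, assuming strictly positive alphas (the function name is hypothetical, not an existing API):

```python
import numpy as np

def ridge_path_svd(X, y, alphas):
    """Coefficients for a whole grid of penalties from a single SVD
    (illustrative sketch; assumes all alphas > 0 or X full rank).

    The decomposition is alpha-independent, so each extra penalty
    value costs only a couple of small matrix-vector products.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y  # computed once, shared across all alphas
    return np.array([Vt.T @ ((s / (s ** 2 + a)) * UTy) for a in alphas])
```

The same structure would extend naturally to one penalty per target, since each target column just gets its own shrinkage factors.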
Yes, +1 for merge.
Issue #582 is tracking the ridge path feature request. |
Thanks Mathieu, I'll merge it in 48h if there are no further comments. |
Cool! Yup, ridge path sounds pretty neat as well: I may have a few tweaks to offer to make that fast in CV with any fold type (not analytically formulated as in RidgeLOOV, but that doesn't matter), unless you have already optimized it. The way you wrote the code permits an easy extension to individual penalties for each target, which I will implement once this PR has passed, based on your code. |
MRG - Add SVD-based solver to ridge regression.
Merged, thanks
I added a solver based on the thin SVD of X to compute the ridge
coefficients. This is much more stable than the Cholesky decomposition
for singular matrices.
While in the Ridge regression the Gram matrix becomes nonsingular
because of the regularization, I've come across problems when the
regularization parameter is very small. This pull request remedies it,
making the algorithm defined and stable even for the border case
alpha=0 (useful for cross validating over a grid of parameters)
I took the liberty of changing the default algorithm to this one
(when there are no sample weights). I haven't done any benchmarks.
It should be slower than dense_cholesky but not much ...