
MRG - Add SVD-based solver to ridge regression. #1914

Merged: 5 commits into scikit-learn:master on Jun 6, 2013
Conversation

@fabianp (Member) commented May 2, 2013

I added a solver based on the thin SVD of X to compute the ridge
coefficients. This is much more stable than the Cholesky decomposition
for singular matrices.

While in ridge regression the Gram matrix is nonsingular thanks to the
regularization, I've come across problems when the regularization
parameter is very small. This pull request remedies that, making the
algorithm well defined and stable even for the border case
alpha=0 (useful for cross-validating over a grid of parameters).

I took the liberty of changing the default solver to this one
(when there are no sample weights). I haven't done any benchmarks;
it should be slower than dense_cholesky, but not by much ...
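A simplified sketch of the SVD-based solve (my names, not necessarily the exact code in this PR): with the thin SVD X = U diag(s) Vᵀ, the ridge coefficients are V diag(s / (s² + alpha)) Uᵀ y, and near-zero singular values can simply be dropped, which is what keeps alpha=0 well defined.

```python
import numpy as np

def ridge_svd(X, y, alpha):
    """Ridge coefficients via the thin SVD of X (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U diag(s) Vt
    # Drop directions with negligible singular values, so that even
    # alpha == 0 on a rank-deficient X stays well defined.
    d = np.zeros_like(s)
    nonzero = s > 1e-15
    d[nonzero] = s[nonzero] / (s[nonzero] ** 2 + alpha)
    UTy = U.T @ y
    if y.ndim == 1:
        return Vt.T @ (d * UTy)
    return Vt.T @ (d[:, np.newaxis] * UTy)  # multi-target y: (n_samples, n_targets)
```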

Fabian Pedregosa added 2 commits May 2, 2013 (commit messages repeat the PR description above).
@jaquesgrobler (Member):

Very nice 👍

```diff
@@ -82,10 +85,10 @@ def ridge_regression(X, y, alpha, sample_weight=1.0, solver='auto',
     has_sw = isinstance(sample_weight, np.ndarray) or sample_weight != 1.0

     if solver == 'auto':
-        # cholesky if it's a dense array and cg in
+        # svd if it's a dense array and cg in
```
Member:

So svd will be the default solver now?

Member:

Oh, okay, I see you mention that in the PR description :) I'm +1 for this: although it may be a tad slower,
the increased stability is a great win, especially for new people who don't want to worry
about border-case problems etc.

Member Author:

Yes, my opinion is that having a correct solution is the most important thing. I'm going to do some benchmarks to see how significant the performance change is.

Member:

Nice :)

@mblondel (Member) commented May 2, 2013

Could you benchmark with the "lsqr" solver? It is super fast in my experience.

@fabianp (Member, Author) commented May 3, 2013

I was benchmarking and observed that the results were identical for all solvers ... then I realized there was a bug in the Ridge object: the solver parameter was not being passed on to the computational method! This is fixed in the last commit.

@mblondel (Member) commented May 3, 2013

I was sure that it used to work so I had a look. The bug was introduced in 07c56d7.

This shows that we should add one test per solver (and we need a way to make sure that the right solver is called).

@mblondel (Member) commented May 4, 2013

One easy way to test that the solver is correctly passed is to check that an exception is raised when the solver doesn't exist.
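For instance, something along these lines (a sketch, assuming an unknown solver name raises a ValueError):

```python
import numpy as np
from numpy.testing import assert_raises
from sklearn.linear_model import Ridge

def test_unknown_solver_is_rejected():
    # If the solver string is really forwarded, a bogus name must fail.
    rng = np.random.RandomState(0)
    X, y = rng.randn(5, 3), rng.randn(5)
    assert_raises(ValueError, Ridge(solver="no_such_solver").fit, X, y)
```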

@mblondel (Member) commented May 4, 2013

Just remembered why lsqr is not the default solver: it's not available in scipy 0.7.

@fabianp (Member, Author) commented May 7, 2013

I created some benchmarks (code: https://gist.github.com/fabianp/5533111). There are two plots: in the first I fixed the number of samples and increased the number of features; in the second I did the opposite, fixing the number of features and increasing the number of samples.

[plot: benchmark, increasing number of features]
[plot: benchmark, increasing number of samples]

What strikes me is that the SVD is really slow, so making it the default is out of the question. The iterative solvers have good properties even for dense data, so that's an option. However, they are not very efficient for problems with multiple targets, so that would definitely hurt performance in some applications.

What I propose, and implemented in the last commit, is to keep Cholesky as the default, as before, and fall back to the SVD solver whenever the Cholesky one breaks down. This way the behavior doesn't change for most usages, but the algorithm no longer breaks down when X is not full rank.
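In rough outline (an illustrative sketch with made-up names, reusing the ridge_svd sketch from the description above; the actual commit differs in details):

```python
import numpy as np
from scipy import linalg

def ridge_auto(X, y, alpha):
    """Cholesky first, SVD fallback (illustrative sketch)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])  # regularized normal equations
    Xy = X.T @ y
    try:
        # Fast path: Cholesky-based solve of the symmetric positive definite system.
        return linalg.solve(A, Xy, assume_a="pos")
    except linalg.LinAlgError:
        # Cholesky broke down (rank-deficient X with tiny alpha): fall back to SVD.
        return ridge_svd(X, y, alpha)
```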

@jaquesgrobler (Member):

Wow, slow SVD... The Cholesky fallback idea sounds good to me. 👍

@mblondel (Member) commented May 7, 2013

The error intervals in the plots render really great!

@amueller (Member) commented May 7, 2013

Can you explain again why we don't want sparse_cg? How stable is it for small alpha?
If the iterative solvers are only slow for multiple targets, why not switch on that?

@fabianp (Member, Author) commented May 8, 2013

Hi Andy,

It's a good question, one I would like to answer at some point, but benchmarking iterative solvers vs direct solvers is outside the scope of this PR, and I fear that if I dig deeper into it here it will never get finished. I just wanted an SVD-based solver that would fix the exceptions I was getting from the Cholesky-based one. I thought it would be a good idea to make it the default, then realized what a terrible mistake that was, hence the benchmarks.

In the benchmarks above the data I used was i.i.d. Gaussian, which usually leads to well-conditioned systems. For comparing Cholesky vs SVD this doesn't matter much, since they have a fixed cost that depends only on n_samples and n_features (not strictly true for SVD, but close enough), but for CG and LSQR it can make a huge difference. I suspect this is also the reason why CG is faster than LSQR in my benchmarks.

Another reason why the choice of iterative methods should be discussed with care is the issue of n_targets. The target matrix Y in ridge has shape (n_samples, n_targets), and both SVD and Cholesky perform one matrix decomposition and solve for all n_targets using that same decomposition (I rely on this extensively: in my application we have very small matrices (100x100) but n_targets ~ 50000). The iterative solvers, on the other hand, need to solve n_targets different systems one after the other. I suspect a reasonable rule of thumb would be to use direct methods when n_targets > min{n_features, n_samples} and iterative ones otherwise. Another idea would be to use preconditioning techniques for the sparse system in those cases. In both cases, though, it requires some testing and careful benchmarking :-)
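To make the n_targets point concrete, a rough sketch (illustrative only, assuming X of shape (n_samples, n_features), Y of shape (n_samples, n_targets), and a scalar alpha > 0 are already defined):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Direct: one thin SVD of X, then all target columns at once.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
d = s / (s ** 2 + alpha)
W_direct = Vt.T @ (d[:, np.newaxis] * (U.T @ Y))

# Iterative: one damped least-squares solve per target column;
# lsqr with damp=sqrt(alpha) minimizes ||X w - y||^2 + alpha * ||w||^2.
W_iter = np.column_stack([lsqr(X, Y[:, j], damp=np.sqrt(alpha))[0]
                          for j in range(Y.shape[1])])
```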

@amueller (Member) commented May 8, 2013

Thanks for the explanation. Definitely, more testing is needed, and that is clearly out of scope here :)

@fabianp (Member, Author) commented May 8, 2013

I'll blog about it soon. I promise :-)

@eickenberg (Contributor):

Any new developments on this, Fabian?
The point about multiple targets in your last comment is the most important one with respect to performance. As soon as there is a large number of targets, decomposing the design matrix becomes useful. For a single target it is definitely more efficient to solve the problem directly instead of decomposing the matrix beforehand.

Concerning the different decompositions, there is a distinction in flexibility. While the Cholesky implementation decomposes the already-penalized matrix, SVD and eigendecompositions work with the design/Gram matrix itself, and penalties can be added at liberty after the decomposition step. This decoupling makes efficient cross-validation possible, because every penalty works with the one decomposition.
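A minimal illustration of that decoupling (hypothetical single-target sketch, assuming X and y are defined; the decomposition is computed once, then reused across penalties):

```python
import numpy as np

# One thin SVD of the design matrix; the penalty never enters here.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
UTy = U.T @ y

# Sweeping penalties afterwards is cheap: only a diagonal rescaling per alpha.
coefs = {alpha: Vt.T @ ((s / (s ** 2 + alpha)) * UTy)
         for alpha in (0.1, 1.0, 10.0)}
```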

I was planning to extend this approach to individual penalties per target, as I had proposed in another pull request. Ideally I would build on your contribution, so I am really looking forward to your blog post ... :)

@fabianp (Member, Author) commented Jun 4, 2013

Hey Michael,

Thanks for your input. This PR is only concerned with the SVD solver and does not tackle the multiple-target problem. As such, I consider it finished and think it's ready to be merged (@mblondel, do you agree?)

As you say, once this is merged we can start thinking about an efficient ridge_path method based on the SVD. That would also be very useful for me :-)

@mblondel (Member) commented Jun 4, 2013

Yes +1 for merge.

@mblondel (Member) commented Jun 4, 2013

Issue #582 is tracking the ridge path feature request.

@fabianp (Member, Author) commented Jun 4, 2013

Thanks Mathieu, I'll merge it in 48h if there are no further comments.

@eickenberg (Contributor):

Cool!

Yup, ridge path sounds pretty neat as well. I may have a few tweaks to offer to make that fast in CV with any fold type (not analytically formulated as in RidgeLOOV, but that doesn't matter), unless you have already optimized it.

The way you wrote the code permits an easy extension to individual penalties for each target, which I will implement once this PR is merged, building on your code.

fabianp added a commit that referenced this pull request Jun 6, 2013
MRG - Add SVD-based solver to ridge regression.
@fabianp merged commit 9ad829b into scikit-learn:master on Jun 6, 2013
@fabianp (Member, Author) commented Jun 6, 2013

Merged, thanks
