MRG - Add SVD-based solver to ridge regression. #1914
Conversation
I added a solver based on the thin SVD of X to compute the ridge coefficients. This is much more stable than the Cholesky decomposition for singular matrices. While in ridge regression the Gram matrix becomes nonsingular because of the regularization, I've come across problems when the regularization parameter is very small. This pull request remedies that, making the algorithm defined and stable even for the border case alpha=0. I took the liberty of changing the default algorithm to this one (when there are no sample weights).
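For context, the SVD route described above can be sketched as follows. This is a minimal illustration, not the code from this PR; `ridge_svd` is a made-up helper name, and the `1e-15` cutoff for filtering singular values is an assumption for the sketch:

```python
import numpy as np

def ridge_svd(X, y, alpha):
    """Ridge coefficients via the thin SVD of X (illustrative sketch).

    Stable even when X is rank deficient or alpha == 0: tiny singular
    values are filtered out instead of being inverted.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Per-component shrinkage factor s / (s**2 + alpha); near-zero
    # singular values contribute nothing rather than blowing up.
    d = np.zeros_like(s)
    mask = s > 1e-15
    d[mask] = s[mask] / (s[mask] ** 2 + alpha)
    return Vt.T @ (d * (U.T @ y))
```

Unlike a Cholesky solve of `X.T @ X + alpha * I`, this never forms the Gram matrix, which is why it stays well defined when X is singular and alpha goes to zero.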
Very nice 👍
@@ -82,10 +85,10 @@ def ridge_regression(X, y, alpha, sample_weight=1.0, solver='auto',
     has_sw = isinstance(sample_weight, np.ndarray) or sample_weight != 1.0

     if solver == 'auto':
-        # cholesky if it's a dense array and cg in
+        # svd if it's a dense array and cg in
So svd will be the default solver now?
Oh, okay, I see you mention that in the PR description :) I'm +1 for this: although it may be a tad slower, the increased stability is a great win, especially for new people who don't want to worry about border-case problems.
Yes, my opinion is that having a correct solution is the most important thing. I'm going to do some benchmarks to see how significant the performance change is.
Nice :)
Could you benchmark with the "lsqr" solver? It is super fast in my experience.
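For anyone benchmarking along: scipy's `lsqr` maps directly onto ridge through its `damp` parameter, since it minimizes `||X w - y||^2 + damp^2 ||w||^2`. A minimal sketch (`ridge_lsqr` is a made-up helper name, not an API in this PR):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def ridge_lsqr(X, y, alpha):
    """Ridge via scipy's iterative LSQR solver (illustrative sketch).

    lsqr solves min ||X w - y||^2 + damp^2 ||w||^2, so the ridge
    penalty alpha corresponds to damp = sqrt(alpha).
    """
    return lsqr(X, y, damp=np.sqrt(alpha))[0]
```

Note that `lsqr` handles a single right-hand side, so multiple targets require one call per column of Y.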
I was benchmarking and I observed that the results were identical for all solvers ... then I realized there was a bug in the Ridge object: the solver parameter was not passed on to the computational method! This is fixed in the last commit.
I was sure that it used to work so I had a look. The bug was introduced in 07c56d7. This shows that we should add one test per solver (and we need a way to make sure that the right solver is called).
One easy way to test that the solver is correctly passed is to check that an exception is correctly issued when the solver doesn't exist.
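That check could look something like the sketch below. It assumes (as current scikit-learn does) that `Ridge.fit` raises a `ValueError` for an unrecognized solver string; if the parameter were silently dropped before reaching the computational method, no error would surface:

```python
import numpy as np
from sklearn.linear_model import Ridge

def unknown_solver_raises():
    """Return True iff Ridge rejects a bogus solver name at fit time."""
    X = np.ones((5, 3))
    y = np.ones(5)
    try:
        Ridge(solver='no-such-solver').fit(X, y)
    except ValueError:
        return True  # the solver string reached validation, as it should
    return False
```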
Just remembered why lsqr is not the default solver: it's not available in scipy 0.7. |
I created some benchmarks (code: https://gist.github.com/fabianp/5533111). There are two plots: in the first one I fixed the number of samples and increased the number of features; in the second one I did the opposite, fixing the number of features and increasing the number of samples. What strikes me is that the SVD is really slow, so it is out of the question to make it the default. The iterative solvers have good properties even for dense data, so that's an option. However, they are not very efficient for problems with multiple targets, so that's definitely going to hurt performance in some applications. What I propose and implemented in the last commit is to keep Cholesky as the default, as it was before, and jump to the SVD solver whenever the Cholesky one breaks down. This way the behavior doesn't change for most usages, but the algorithm doesn't break down when X is not full rank.
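The fallback strategy proposed above can be sketched roughly as follows. This is a simplified illustration of the idea, not the PR's code; the function name and the `1e-15` cutoff are made up for the sketch:

```python
import numpy as np
from scipy import linalg

def ridge_cholesky_with_svd_fallback(X, y, alpha):
    """Solve (X^T X + alpha*I) w = X^T y by Cholesky, falling back to an
    SVD-based solve when the factorization fails (illustrative sketch)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    b = X.T @ y
    try:
        # cho_factor raises LinAlgError when A is not positive definite,
        # e.g. rank-deficient X with alpha == 0.
        c, low = linalg.cho_factor(A)
        return linalg.cho_solve((c, low), b)
    except linalg.LinAlgError:
        # Slower but always defined: filter tiny singular values.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        d = np.zeros_like(s)
        mask = s > 1e-15
        d[mask] = s[mask] / (s[mask] ** 2 + alpha)
        return Vt.T @ (d * (U.T @ y))
```

The fast path is unchanged for well-conditioned problems; the expensive SVD only runs when Cholesky actually fails.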
Wow, slow SVD... The Cholesky idea sounds good to me. 👍
The error intervals in the plot look really great!
Can you explain again why we don't want the iterative solvers as the default?
Hi Andy,

It's a good question, one I would like to answer at some point, but benchmarking iterative solvers vs direct solvers is outside the scope of this PR, and I fear that if I dig into it deeper here it will never get finished. I just wanted an SVD-based solver that would solve the exceptions I was getting from the Cholesky-based one. I thought it would be a good idea to make it the default and then realized what a terrible mistake that was, hence the benchmarks.

In the benchmarks above the data I used was Gaussian i.i.d., which usually leads to well-conditioned systems. For testing Cholesky vs SVD this doesn't matter much since they have a fixed cost that depends only on n_samples and n_features (not strictly true for SVD, but gimme that), but for CG and LSQR this can make a huge difference. I suspect this is also the reason why CG is faster than LSQR in my benchmarks.

Another reason why the choice of iterative methods should be discussed with care is the issue of n_targets. The matrix Y in ridge is of size (n_samples, n_targets), and both SVD and Cholesky make one matrix decomposition and solve for all n_targets using the same decomposition (I use this extensively; in my application we have very small matrices (100x100) but n_targets ~ 50000). The iterative solvers, on the other hand, need to solve n_targets different systems one after the other. I suspect a reasonable rule of thumb would be to use direct methods for n_targets > min{n_features, n_samples} and iterative ones otherwise. Another idea would be to use preconditioning techniques for the sparse system in those cases. However, both require some testing, and careful benchmarking :-)
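The multi-target point above is worth seeing concretely: a direct solver pays for one factorization and then back-substitutes for every target column at once, whereas an iterative solver would loop over the columns of Y. A minimal sketch (the function name is made up for illustration):

```python
import numpy as np
from scipy import linalg

def ridge_multi_target(X, Y, alpha):
    """Ridge for Y of shape (n_samples, n_targets): one Cholesky
    factorization, one back-substitution for all targets together."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    c, low = linalg.cho_factor(A)            # one O(p^3) factorization
    return linalg.cho_solve((c, low), X.T @ Y)  # O(p^2 * n_targets) solve
```

With n_targets in the tens of thousands and a small design matrix, the amortized factorization is what makes the direct methods attractive.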
Thanks for the explanation. Definitely, more testing is needed, and that is clearly out of scope here :)
I'll blog about it soon. I promise :-) |
Any new development on this, Fabian? Concerning the different decompositions, there is a distinction in flexibility. While the Cholesky implementation decomposes the already penalized matrix, the SVD and eigendecompositions work with the design/Gram matrix itself, and penalties can be added at liberty after this decomposition step. This decoupling makes efficient cross-validation possible, because everything reuses the one decomposition. I was planning to extend this approach to individual penalties per target, as I had proposed in another pull request. Ideally I would build on your contribution, so I am really looking forward to your blog post ... :)
Hey Michael,

Thanks for your input. This PR is only concerned with the SVD solver and does not tackle the multiple-target problem. As such, I consider it finished and think it's ready to be merged (@mblodel do you agree?). As you say, once this is merged we can start thinking about an efficient ridge_path method based on the SVD. That would also be very useful for me :-)
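The ridge_path idea floated here follows directly from the decoupling Michael describes: the SVD does not depend on alpha, so one decomposition serves an entire grid of penalties. A rough sketch, assuming strictly positive alphas (the function name is hypothetical, not an existing API):

```python
import numpy as np

def ridge_path_svd(X, y, alphas):
    """Coefficients for a whole grid of penalties from a single SVD
    (illustrative sketch; assumes all alphas > 0 or X full rank).

    The decomposition is alpha-independent, so each extra penalty
    value costs only a couple of small matrix-vector products.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y  # computed once, shared across all alphas
    return np.array([Vt.T @ ((s / (s ** 2 + a)) * UTy) for a in alphas])
```

The same structure would extend naturally to one penalty per target, since each target column just gets its own shrinkage factors.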
Yes, +1 for merge.
Issue #582 is tracking the ridge path feature request. |
Thanks Mathieu, I'll merge it in 48h if there are no further comments. |
Cool! Yup, ridge path sounds pretty neat as well: I may have a few tweaks to offer to make that fast in CV with any fold type (not analytically formulated as in RidgeLOOV, but that doesn't matter), unless you have already optimized it. The way you wrote the code permits an easy extension to individual penalties for each target, which I will implement once this PR has passed, based on your code. |
MRG - Add SVD-based solver to ridge regression.
Merged, thanks
I added a solver based on the thin SVD of X to compute the ridge
coefficients. This is much more stable than the Cholesky decomposition
for singular matrices.
While in the Ridge regression the Gram matrix becomes nonsingular
because of the regularization, I've come across problems when the
regularization parameter is very small. This pull request remedies it,
making the algorithm defined and stable even for the border case
alpha=0 (useful for cross validating over a grid of parameters)
I took the liberty of changing the default algorithm to this one
(when there are no sample weights). I haven't done any benchmarks.
It should be slower than dense_cholesky but not much ...