Problem with ridgeCV: centering the design matrix messes the solution when n < p #1807

Open
bthirion opened this Issue Mar 24, 2013 · 5 comments

Owner

bthirion commented Mar 24, 2013

When n < p, centering the design matrix sets one of its singular values to zero: the centered matrix has rank at most n - 1. This makes the RidgeCV computation numerically unstable.
An example of the bad behaviour is given in:
https://gist.github.com/bthirion/5233416
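
To illustrate the rank drop described above, here is a minimal NumPy sketch. It is not taken from the linked gist; the matrix sizes, the seed, and the helper `count_nonzero_sv` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
n_samples, n_features = 5, 20   # hypothetical sizes with n < p
X = rng.randn(n_samples, n_features)

# Column-centering: every column of X_centered sums to zero, so the
# all-ones vector lies in its left null space and the rank drops by one.
X_centered = X - X.mean(axis=0)

sv_raw = np.linalg.svd(X, compute_uv=False)
sv_centered = np.linalg.svd(X_centered, compute_uv=False)


def count_nonzero_sv(sv, rtol=1e-10):
    """Count singular values that are non-negligible relative to the largest."""
    return int(np.sum(sv > rtol * sv[0]))


print("raw:     ", count_nonzero_sv(sv_raw), "non-zero singular values")
print("centered:", count_nonzero_sv(sv_centered), "non-zero singular values")
```

With a generic random matrix, the raw X keeps n_samples non-negligible singular values while the centered one keeps n_samples - 1. Any step that inverts quantities built from those singular values is then likely to be ill-conditioned for small alpha.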

I'm still not fully sure about the right solution. Comments welcome!
Best.

Owner

mblondel commented Mar 25, 2013

Thanks for the report! Any idea how to fix this?

Owner

agramfort commented Jul 25, 2013

I looked into it:

https://gist.github.com/agramfort/6078557

You'll see that if X is rank deficient, even your code, @bthirion, fails to find the good alpha.

I think both implementations match. So the question is how to fix these 10 lines of code
so that they work with a rank-deficient X.

Owner

bthirion commented Jul 25, 2013

On 25/07/2013 14:53, Alexandre Gramfort wrote:

> I looked into it:
>
> https://gist.github.com/agramfort/6078557
>
> You'll see that if X is rank deficient, even your code, @bthirion, fails to find the good alpha.

Yes, for sure.

> I think both implementations match. So the question is how to fix these 10 lines of code
> so that they work with a rank-deficient X.

I wanted to look at it, but haven't found time for that, sorry.
See you,

B



amueller added this to the 0.15.1 milestone Jul 18, 2014

amueller modified the milestone: 0.16, 0.17 Sep 11, 2015

amueller modified the milestone: 0.17, 0.18 Nov 2, 2015

Owner

bthirion commented Jan 3, 2016

Getting back to that one: see https://gist.github.com/bthirion/834b78c274e7f411665d
What sklearn's GCV computes is actually the MSE, but in a setting (fit_intercept=False, pre-centered data) that gives trivial and misleading results as soon as n_features >= n_samples - 1.

My suggestion is to change this so that the GCV results are equivalent to a setting where fit_intercept=True.
Any opinions?
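
To make the point above concrete, here is a minimal NumPy sketch. It is not scikit-learn's code and not the content of the gist: `loo_mse_shortcut` is the textbook closed-form leave-one-out identity for ridge without intercept applied to pre-centered data, and `loo_mse_explicit` is a brute-force loop that re-centers on each training fold, which is one way to mimic fit_intercept=True. The function names, the data sizes, and the use of pure-noise targets are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)     # arbitrary seed
n_samples, n_features = 10, 30     # hypothetical sizes, n_features >= n_samples - 1
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)           # pure noise: no alpha should predict it well

# Pre-center once on the full data, as in the fit_intercept=False setting.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def loo_mse_shortcut(Xc, yc, alpha):
    """Closed-form leave-one-out MSE of ridge without intercept."""
    n = len(yc)
    K = Xc @ Xc.T
    # Hat matrix H = (K + alpha I)^{-1} K, equal to K (K + alpha I)^{-1}.
    H = np.linalg.solve(K + alpha * np.eye(n), K)
    residuals = yc - H @ yc
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)


def loo_mse_explicit(X, y, alpha):
    """Brute-force leave-one-out, re-centering on each training fold."""
    errors = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xt, yt = X[train] - x_mean, y[train] - y_mean
        # Dual form of the ridge solution on the training fold.
        dual = np.linalg.solve(Xt @ Xt.T + alpha * np.eye(len(yt)), yt)
        pred = (X[i] - x_mean) @ Xt.T @ dual + y_mean
        errors.append((y[i] - pred) ** 2)
    return np.mean(errors)


alpha = 1e-6
shortcut = loo_mse_shortcut(Xc, yc, alpha)
explicit = loo_mse_explicit(X, y, alpha)
print("shortcut LOO MSE:", shortcut)
print("explicit LOO MSE:", explicit)
```

Under these assumptions the shortcut reports an error close to zero for a tiny alpha, because the centered problem can interpolate the centered targets, whereas the explicit loop reports an error on the order of the variance of y. A selection rule based on the shortcut would therefore favour the smallest alpha even on pure noise.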

Owner

agramfort commented Jan 3, 2016

amueller modified the milestone: 0.18, 0.19 Sep 22, 2016
