
[MRG+1] Simplify sample_weight support in Ridge. #4116

Closed
mblondel wants to merge 4 commits into scikit-learn:master from mblondel:sample_weights

Conversation

mblondel
Member

I simplified the code by doing the rescaling upfront. This makes it possible to support sample_weight in all solvers and fixes the long-standing bug #1190. Ping @eickenberg, @fabianp, @agramfort
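For illustration, the upfront rescaling amounts to the usual square-root-of-weights trick; a minimal sketch with illustrative names (not the PR's exact code), assuming dense arrays:

import numpy as np

def rescale_data_sketch(X, y, sample_weight):
    # Multiplying each row of X and each entry of y by sqrt(w_i) turns the
    # weighted least-squares term sum_i w_i * (y_i - x_i . beta)^2 into an
    # ordinary unweighted one, so any ridge solver can be reused unchanged.
    sw = np.sqrt(np.asarray(sample_weight, dtype=float))
    X_rescaled = X * sw[:, np.newaxis]
    y_rescaled = y * sw if y.ndim == 1 else y * sw[:, np.newaxis]
    return X_rescaled, y_rescaled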

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 7afa609 on mblondel:sample_weights into 2a2e7c6 on scikit-learn:master.

@agramfort
Member

lgtm

@mblondel mblondel changed the title [MRG] Simplify sample_weight support in Ridge. [MRG+1] Simplify sample_weight support in Ridge. Jan 18, 2015
def _rescale_data(X, y, sample_weight):
    """Rescale data so as to support sample_weight"""
    n_samples = X.shape[0]
    sample_weight = sample_weight * np.ones(n_samples)
Contributor

The default value of sample_weight coming in is None, unless I have missed a decision branch on the way. That would raise an error here.

Member Author

There is a branch at https://github.com/mblondel/scikit-learn/blob/sample_weights/sklearn/linear_model/ridge.py#L320. I was just assuming that _rescale_data is not called if sample_weight is None.

Member Author

The branch is not shown in the diff because it's already there in the current version.

Contributor

OK, I think it is good the way it is. Moving a check for sample_weight=None would distribute decision-making to auxiliary functions, which would be weird. Sorry for the noise.

@eickenberg
Contributor

I am actually surprised to see that we had the clause that sample weights forced the use of the cholesky solver. I was convinced I had implemented sw support for the other solvers, too. But I am probably confusing this with my efforts from half a year ago that I wasn't able to finalize.

I like the idea in general, but it does come with some drawbacks: as it is now, applying sample weights puts a weighted copy of X in memory, which for large design matrices can be prohibitive. On the other hand, if you multiply in place, all sorts of other (thread-related?) problems can crop up, in addition to the problem of sample weights equal to 0.

So I see it as a nice default fallback way of solving the sample weight problem, but it should probably give precedence to solver-specific implementations iff these are able to avoid full copies of X (for n_samples < n_features a copy of the kernel matrix is better than a copy of X).
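A minimal sketch of the precedence idea described above, purely hypothetical and not the code adopted in this PR: hand the weights to a solver that can use them natively, and fall back to the generic upfront rescaling otherwise.

def ridge_dispatch_sketch(X, y, sample_weight=None, solver="auto"):
    # Hypothetical dispatcher: a solver with its own sample_weight handling
    # (e.g. a kernelized Cholesky solve when n_samples < n_features) receives
    # the raw weights; every other solver gets pre-rescaled data instead,
    # using rescale_data_sketch from the sketch above.
    solvers_with_native_sw = {"cholesky"}
    if sample_weight is not None and solver not in solvers_with_native_sw:
        X, y = rescale_data_sketch(X, y, sample_weight)
        sample_weight = None
    return X, y, sample_weight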

@mblondel
Member Author

@eickenberg For the n_features < n_samples case, there is no benefit in solver-specific handling, since
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L103
allocates a copy of the design matrix as well. For the n_samples < n_features case, I agree with you that a copy of the kernel matrix could be better, but I am not a fan of adding more branching in ridge_regression. In any case, I am keeping the sample_weight support in _solve_cholesky_kernel because this is needed for pre-computed kernels.
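For reference, the kernel-side sample_weight handling mentioned here amounts to absorbing the weights into the kernel system; a minimal sketch of that trick (illustrative names, assuming a 1-d target and a precomputed kernel matrix K), not a copy of _solve_cholesky_kernel:

import numpy as np

def weighted_kernel_ridge_sketch(K, y, alpha, sample_weight):
    # Rescale the system: K' = sqrt(W) K sqrt(W), y' = sqrt(W) y, solve
    # (K' + alpha I) c' = y', then map back with c = sqrt(W) c'. Predictions
    # K @ c then solve the sample-weighted kernel ridge problem without ever
    # forming a weighted copy of X.
    sw = np.sqrt(np.asarray(sample_weight, dtype=float))
    Kw = K * np.outer(sw, sw)
    yw = y * sw
    dual_coef = np.linalg.solve(Kw + alpha * np.eye(K.shape[0]), yw)
    return sw * dual_coef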

@eickenberg
Contributor

For the n_features < n_samples case, there is no benefit in solver-specific handling since https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L103

Indeed, I had overlooked that. Makes sense. So then for me this is fine.

@mblondel
Member Author

@eickenberg I removed the else clause.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.linear_model.ridge import _solve_cholesky_kernel
from sklearn.utils.validation import check_is_fitted
Contributor

Tangential issue: I am confused now about whether or not to avoid relative imports in tests. For some reason, I thought absolute imports were preferred, although I do agree that, by a stroke of bad luck, an absolute import may pull in an entirely different copy of the package.

@agramfort why did you ask for absolute imports here?

Member Author

The way I see it: tests are like user scripts, and so they use absolute imports. The library is internal code, so it uses relative imports. However, absolute imports should work there too.
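To make the distinction concrete, an illustrative pair (the relative form is what library modules such as sklearn/linear_model/ridge.py typically use; the absolute form is what tests and user scripts use):

# Inside the library, a module imports its siblings relatively:
from .base import LinearModel

# Tests and user scripts import the installed package absolutely:
from sklearn.linear_model.ridge import _solve_cholesky_kernel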

Contributor

Ah yes, that is the difference. Thanks for the clarification!!

Member

Tests should always use absolute imports in scikit-learn. The imported code might not be the same version that the tests belong to, but it will be the version that the user actually gets.

@eickenberg
Contributor

LGTM!

@mblondel
Member Author

Merged by rebase.
