
Ridge(normalize=True) fails check_sample_weights_invariance #17444

Closed
rth opened this issue Jun 4, 2020 · 4 comments
Labels
Bug module:test-suite everything related to our tests

Comments

rth (Member) commented Jun 4, 2020

An issue possibly related to #3020 is that Ridge(normalize=True) breaks the sample weight invariance property: removing samples should be equivalent to setting their sample weights to 0. For instance, running the check_sample_weights_invariance(kind=zeros) estimator check for Ridge(normalize=True) in #17441 fails by a significant amount for all solvers:

```
$ pytest sklearn/tests/test_common_non_default.py -k "Ridge and normalize and check_sample_weights_invariance" -v
[..]
________________ test_common_non_default[Ridge(normalize=True)-check_sample_weights_invariance(kind=zeros)] _________________
[..]
>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=1e-09
E           For Ridge, a zero sample_weight is not equivalent to removing the sample
E           Mismatch: 100%
E           Max absolute difference: 0.13578947
E           Max relative difference: 0.10287081
E            x: array([1.184211, 1.184211, 1.184211, 1.184211, 1.710526, 1.710526,
E                  1.710526, 1.710526, 1.289474, 1.289474, 1.289474, 1.289474,
E                  1.815789, 1.815789, 1.815789, 1.815789])
E            y: array([1.32, 1.32, 1.32, 1.32, 1.6 , 1.6 , 1.6 , 1.6 , 1.4 , 1.4 , 1.4 ,
E                  1.4 , 1.68, 1.68, 1.68, 1.68])
```

This makes some sense: StandardScaler, for instance, does not account for sample weights, and normalize likely does not either. But then I don't understand why this happens for Ridge but not for LinearRegression, Lasso, or ElasticNet.
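As a side note, the invariance being tested here can be sketched in a few lines. This is not the actual estimator check from #17441, just a minimal illustration using LinearRegression, which (per the comment above) does satisfy the property:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fitting with some sample weights set to 0 should match
# fitting on the remaining samples only.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = X @ rng.normal(size=3)

sw = np.ones(20)
sw[:5] = 0.0  # "remove" the first 5 samples via zero weights

weighted = LinearRegression().fit(X, y, sample_weight=sw)
trimmed = LinearRegression().fit(X[5:], y[5:])

np.testing.assert_allclose(weighted.coef_, trimmed.coef_)
```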

ogrisel (Member) commented Feb 22, 2021

#19426 is a fix for the handling of sample_weight when normalize=True for any linear regression model (not just Ridge). It is therefore not enough to fix the weight semantics problem, which also happens for normalize=False.

To investigate the cause of the present issue (sample weight semantics), one could however try to use the minimal implementations proposed by @agramfort in the comment thread of #19426: #19426 (comment)

In particular, I have extended the ridge_lbfgs variant to check that doubling the weights is equivalent to concatenating the training set with itself: #19426 (comment)

This minimal implementation could serve both as a debugging tool and a reference to compare to in the solver consistency test.
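The weight-doubling property mentioned above can be checked directly against Ridge itself. A minimal sketch, assuming scikit-learn's convention that alpha is not rescaled by the total sample weight:

```python
import numpy as np
from sklearn.linear_model import Ridge

# sample_weight=2 for every sample should be equivalent to
# training on the dataset concatenated with itself.
rng = np.random.RandomState(42)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

doubled_w = Ridge(alpha=1.0).fit(X, y, sample_weight=2 * np.ones(30))
duplicated = Ridge(alpha=1.0).fit(np.vstack([X, X]), np.concatenate([y, y]))

np.testing.assert_allclose(doubled_w.coef_, duplicated.coef_, rtol=1e-6)
```

Both fits solve the same normal equations (2·XᵀX + αI)β = 2·Xᵀy, so the coefficients agree.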

ogrisel (Member) commented Feb 22, 2021

Here is a rather minimal reproduction case that shows that it also happens with normalize=False:

Edit: I made a mistake in the original code snippet; it is fixed below. I originally (wrongly) defined X_trimmed, y_trimmed as X[:cutoff], y[:cutoff] instead of X[cutoff:], y[cutoff:]. I confirm that there is no problem with normalize=False.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale


rng = np.random.RandomState(0)
n_samples, n_features, n_latents = 100, 300, 50
Z = rng.normal(size=(n_samples, n_latents))  # Low-rank-ish latent variables
X = Z @ rng.normal(size=(n_latents, n_features))
X = scale(X)  # to use normalize=False
y = Z @ rng.normal(size=n_latents) + 1e-2 * rng.normal(size=n_samples)

cutoff = X.shape[0] // 3
sw = np.ones(shape=X.shape[0])
sw[:cutoff] = 0.
X_trimmed, y_trimmed = X[cutoff:, :], y[cutoff:]

params = dict(
    alpha=1.,
    normalize=False,
    fit_intercept=True,
)
ridge_weighted = Ridge(**params).fit(X, y, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_trimmed, y_trimmed)

# Check that no coef is too close to zero to be able to compare
# coef with rtol:
assert np.all(np.abs(ridge_weighted.coef_) > 1e-4)

np.testing.assert_allclose(
    ridge_weighted.coef_,
    ridge_trimmed.coef_,
)
```

ogrisel (Member) commented Mar 3, 2021

For the normalize=True case, we indeed reproduce the original problem with the following simple reproducer:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale


rng = np.random.RandomState(0)
n_samples, n_features, n_latents = 100, 300, 50
Z = rng.normal(size=(n_samples, n_latents))  # Low-rank-ish latent variables
X = Z @ rng.normal(size=(n_latents, n_features))
X = scale(X)  # features scaled as in the previous snippet
y = Z @ rng.normal(size=n_latents) + 1e-2 * rng.normal(size=n_samples)

cutoff = X.shape[0] // 3
sw = np.ones(shape=X.shape[0])
sw[:cutoff] = 0.
X_trimmed, y_trimmed = X[cutoff:, :], y[cutoff:]

params = dict(
    alpha=1.,
    normalize=True,
    fit_intercept=True,
)
ridge_weighted = Ridge(**params).fit(X, y, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_trimmed, y_trimmed)

# Check that no coef is too close to zero to be able to compare
# coef with rtol:
assert np.all(np.abs(ridge_weighted.coef_) > 1e-4)

np.testing.assert_allclose(
    ridge_weighted.coef_,
    ridge_trimmed.coef_,
)
```
```
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 300 / 300 (100%)
Max absolute difference: 0.03935914
Max relative difference: 54.83439278
 x: array([-0.07431 , -0.095581, -0.236689,  0.018132,  0.127941,  0.128391,
        0.013526, -0.137097,  0.14734 ,  0.094625,  0.027823,  0.22886 ,
       -0.061843, -0.023826,  0.011162,  0.127679,  0.129191,  0.043196,...
 y: array([-8.113276e-02, -9.880947e-02, -2.575580e-01,  2.394298e-02,
        1.390761e-01,  1.397036e-01,  1.060587e-02, -1.531144e-01,
        1.615919e-01,  1.011639e-01,  3.139902e-02,  2.400596e-01,...
```

thomasjpfan (Member) commented
With normalize deprecated in Ridge, I think we can close this issue.
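For readers landing here after the deprecation: the replacement suggested by the deprecation is to scale features explicitly in a Pipeline. A minimal sketch (note this is not numerically identical to the old normalize=True, which divided centered columns by their L2 norm rather than their standard deviation, so alpha may need adjusting):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale features in a Pipeline instead of passing normalize=True.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5)) * rng.uniform(1, 10, size=5)
y = rng.normal(size=50)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
pred = model.predict(X[:3])  # predictions for the first 3 samples
```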
