
Ridge(normalize=True) fails check_sample_weights_invariance #17444

Closed
rth opened this issue Jun 4, 2020 · 4 comments
Labels
Bug module:test-suite everything related to our tests

Comments

rth (Member) commented Jun 4, 2020

An issue possibly related to #3020 is that Ridge(normalize=True) breaks the sample weight invariance property: removing samples should be equivalent to setting their sample weights to 0. For instance, running the check_sample_weights_invariance(kind=zeros) estimator check for Ridge(normalize=True) in #17441 fails by a significant amount for all solvers:

```
$ pytest sklearn/tests/test_common_non_default.py -k "Ridge and normalize and check_sample_weights_invariance" -v
[..]
________________ test_common_non_default[Ridge(normalize=True)-check_sample_weights_invariance(kind=zeros)] _________________
[..]
>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=1e-09
E           For Ridge, a zero sample_weight is not equivalent to removing the sample
E           Mismatch: 100%
E           Max absolute difference: 0.13578947
E           Max relative difference: 0.10287081
E            x: array([1.184211, 1.184211, 1.184211, 1.184211, 1.710526, 1.710526,
E                  1.710526, 1.710526, 1.289474, 1.289474, 1.289474, 1.289474,
E                  1.815789, 1.815789, 1.815789, 1.815789])
E            y: array([1.32, 1.32, 1.32, 1.32, 1.6 , 1.6 , 1.6 , 1.6 , 1.4 , 1.4 , 1.4 ,
E                  1.4 , 1.68, 1.68, 1.68, 1.68])
```

This makes some sense: StandardScaler, for instance, does not account for sample weights, and normalize likely does not either. But then I don't understand why this happens for Ridge but not for LinearRegression, Lasso, or ElasticNet.
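As a side note, the invariance being tested here can be sketched in a few lines. This is not the actual estimator check from #17441, just a minimal illustration using LinearRegression, which (per the comment above) does satisfy the property:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fitting with some sample weights set to 0 should match
# fitting on the remaining samples only.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = X @ rng.normal(size=3)

sw = np.ones(20)
sw[:5] = 0.0  # "remove" the first 5 samples via zero weights

weighted = LinearRegression().fit(X, y, sample_weight=sw)
trimmed = LinearRegression().fit(X[5:], y[5:])

np.testing.assert_allclose(weighted.coef_, trimmed.coef_)
```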

ogrisel (Member) commented Feb 22, 2021

#19426 is a fix for the handling of sample_weight when normalize=True for any linear regression model (not just Ridge). It is therefore not enough to fix the weight semantics problem, which also happens for normalize=False.

To investigate the cause of the present issue (sample weight semantics), one could however try to use the minimal implementations proposed by @agramfort in the comment thread of #19426: #19426 (comment)

In particular, I have extended the ridge_lbfgs variant to check that doubling the weights is equivalent to concatenating the training set with itself: #19426 (comment)

This minimal implementation could serve both as a debugging tool and a reference to compare to in the solver consistency test.
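The weight-doubling property mentioned above can be checked directly against Ridge itself. A minimal sketch, assuming scikit-learn's convention that alpha is not rescaled by the total sample weight:

```python
import numpy as np
from sklearn.linear_model import Ridge

# sample_weight=2 for every sample should be equivalent to
# training on the dataset concatenated with itself.
rng = np.random.RandomState(42)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

doubled_w = Ridge(alpha=1.0).fit(X, y, sample_weight=2 * np.ones(30))
duplicated = Ridge(alpha=1.0).fit(np.vstack([X, X]), np.concatenate([y, y]))

np.testing.assert_allclose(doubled_w.coef_, duplicated.coef_, rtol=1e-6)
```

Both fits solve the same normal equations (2·XᵀX + αI)β = 2·Xᵀy, so the coefficients agree.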

ogrisel (Member) commented Feb 22, 2021

Here is a rather minimal reproduction case that shows that it also happens with normalize=False:

Edit: I made a mistake in the original code snippet; it is fixed below. I originally (wrongly) defined X_trimmed, y_trimmed as X[:cutoff], y[:cutoff] instead of X[cutoff:], y[cutoff:]. I confirm that there is no problem with normalize=False.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale


rng = np.random.RandomState(0)
n_samples, n_features, n_latents = 100, 300, 50
Z = rng.normal(size=(n_samples, n_latents))  # Low-rank-ish latent variables
X = Z @ rng.normal(size=(n_latents, n_features))
X = scale(X)  # to use normalize=False
y = Z @ rng.normal(size=n_latents) + 1e-2 * rng.normal(size=n_samples)

cutoff = X.shape[0] // 3
sw = np.ones(shape=X.shape[0])
sw[:cutoff] = 0.
X_trimmed, y_trimmed = X[cutoff:, :], y[cutoff:]

params = dict(
    alpha=1.,
    normalize=False,
    fit_intercept=True,
)
ridge_weighted = Ridge(**params).fit(X, y, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_trimmed, y_trimmed)

# Check that no coef is too close to zero to be able to compare
# coef with rtol:
assert np.all(np.abs(ridge_weighted.coef_) > 1e-4)

np.testing.assert_allclose(
    ridge_weighted.coef_,
    ridge_trimmed.coef_,
)
```

ogrisel (Member) commented Mar 3, 2021

For the normalize=True case, we indeed reproduce the original problem with the following simple reproducer:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale


rng = np.random.RandomState(0)
n_samples, n_features, n_latents = 100, 300, 50
Z = rng.normal(size=(n_samples, n_latents))  # Low-rank-ish latent variables
X = Z @ rng.normal(size=(n_latents, n_features))
X = scale(X)  # features scaled as in the previous snippet
y = Z @ rng.normal(size=n_latents) + 1e-2 * rng.normal(size=n_samples)

cutoff = X.shape[0] // 3
sw = np.ones(shape=X.shape[0])
sw[:cutoff] = 0.
X_trimmed, y_trimmed = X[cutoff:, :], y[cutoff:]

params = dict(
    alpha=1.,
    normalize=True,
    fit_intercept=True,
)
ridge_weighted = Ridge(**params).fit(X, y, sample_weight=sw)
ridge_trimmed = Ridge(**params).fit(X_trimmed, y_trimmed)

# Check that no coef is too close to zero to be able to compare
# coef with rtol:
assert np.all(np.abs(ridge_weighted.coef_) > 1e-4)

np.testing.assert_allclose(
    ridge_weighted.coef_,
    ridge_trimmed.coef_,
)
```
```
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 300 / 300 (100%)
Max absolute difference: 0.03935914
Max relative difference: 54.83439278
 x: array([-0.07431 , -0.095581, -0.236689,  0.018132,  0.127941,  0.128391,
        0.013526, -0.137097,  0.14734 ,  0.094625,  0.027823,  0.22886 ,
       -0.061843, -0.023826,  0.011162,  0.127679,  0.129191,  0.043196,...
 y: array([-8.113276e-02, -9.880947e-02, -2.575580e-01,  2.394298e-02,
        1.390761e-01,  1.397036e-01,  1.060587e-02, -1.531144e-01,
        1.615919e-01,  1.011639e-01,  3.139902e-02,  2.400596e-01,...
```

thomasjpfan (Member) commented
With normalize deprecated in Ridge, I think we can close this issue.
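For readers landing here after the deprecation: the replacement suggested by the deprecation is to scale features explicitly in a Pipeline. A minimal sketch (note this is not numerically identical to the old normalize=True, which divided centered columns by their L2 norm rather than their standard deviation, so alpha may need adjusting):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale features in a Pipeline instead of passing normalize=True.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5)) * rng.uniform(1, 10, size=5)
y = rng.normal(size=50)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
pred = model.predict(X[:3])  # predictions for the first 3 samples
```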
