
LinearRegression with zero sample_weights is not the same as excluding those rows #26164

Closed
lorentzenchr opened this issue Apr 12, 2023 · 17 comments


@lorentzenchr
Member

lorentzenchr commented Apr 12, 2023

Describe the bug

Fitting LinearRegression with some sample_weight set to 0 does not give the same results as excluding those rows.

Steps/Code to Reproduce

from collections import Counter
import numpy as np
from sklearn.linear_model import LinearRegression

results = []
for i in range(100):
    rng = np.random.RandomState(i)
    n_samples, n_features = 10, 5
    X = rng.rand(n_samples, n_features)
    y = rng.rand(n_samples)
    reg = LinearRegression()
    sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
    sample_weight_0 = sample_weight.copy()
    sample_weight_0[-5:] = 0
    y[-5:] *= 1000  # to make excluding those samples important
    reg.fit(X, y, sample_weight=sample_weight_0)
    coef_0, intercept_0 = reg.coef_.copy(), reg.intercept_

    reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
    print(f"{coef_0=}")
    print(f"{reg.coef_=}")
    results.append(np.allclose(reg.coef_, coef_0, rtol=1e-6))
print(Counter(results))

Expected Results

Always True.

Actual Results

Counter({True: 79, False: 21})  # it fails ~20% of the time

The print statement gives:

coef_0 =    array([ 1.43516166, -1.78826443,  0.15365526,  1.82233166, -1.6       ])
reg.coef_ = array([ 2.24022351, -1.04917851,  0.45341088,  0.80315086, -0.25726798])

Versions

System:
    python: 3.9.15 (main, Nov 15 2022, 05:24:15)  [Clang 14.0.0 (clang-1400.0.29.202)]
   machine: macOS-13.3.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: 0.29.33

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
@glemaitre
Member

Interestingly, I cannot reproduce:

coef_0=array([ 2.24022351, -1.04917851,  0.45341088,  0.80315086, -0.25726798])
reg.coef_=array([ 2.24022351, -1.04917851,  0.45341088,  0.80315086, -0.25726798])
System:
    python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python3.10
   machine: macOS-13.2.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.dev0
          pip: 22.3.1
   setuptools: 65.5.1
        numpy: 1.24.2
        scipy: 1.10.1
       Cython: 0.29.33
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.3.0.dev0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libopenblas.0.dylib
        version: 0.3.21
threading_layer: openmp
   architecture: VORTEX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libomp.dylib
        version: None
    num_threads: 8

@glemaitre
Member

I can reproduce it on my Linux machine with:

System:
    python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
executable: /home/glemaitre/miniconda3/envs/dev/bin/python3.10
   machine: Linux-5.19.0-35-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.dev0
          pip: 22.3.1
   setuptools: 65.5.1
        numpy: 1.24.2
        scipy: 1.10.1
       Cython: 0.29.33
       pandas: 1.5.1
   matplotlib: 3.6.2
       joblib: 1.3.0.dev0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/glemaitre/miniconda3/envs/dev/lib/libopenblasp-r0.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/glemaitre/miniconda3/envs/dev/lib/libgomp.so.1.0.0
        version: None
    num_threads: 8

@glemaitre
Member

Note that my Mac has an ARM processor.

@lorentzenchr
Member Author

Could you try different random seeds on your Mac? I detected this bug while working on #15554 and then running SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -x sklearn/linear_model/tests/test_base.py::test_linear_regression_sample_weight_consistency.

@glemaitre
Member

Indeed, this depends on the random seed. In both configs, it fails 20% of the time.

@glemaitre
Member

@lorentzenchr Does this mean that scipy.linalg.lstsq is not stable, and could it be related to what @ogrisel observed in #22855 (comment)?

@lorentzenchr
Member Author

lorentzenchr commented Apr 14, 2023

Does it mean that the scipy.linalg.lstsq is not stable

I don't know. If so, we should report it to SciPy.
It could also be due to our handling of the intercept and sample weights.

@ogrisel
Member

ogrisel commented Apr 14, 2023

I confirm it also works on my M1 Mac (similar versions for the dependencies and OpenBLAS).

@lorentzenchr can you reproduce the problem with Ridge(alpha=0) and various solvers on your Linux machine? If it works with solver="lsqr", that would be one more reason to use lsqr as the default solver for LinearRegression.

Could you also please try changing the LAPACK driver used by LinearRegression's lstsq call, e.g. gelsy instead of the default gelsd?
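
For reference, here is a minimal sketch of what such an experiment could look like, calling scipy.linalg.lstsq directly on the weighted design with an explicit intercept column (this bypasses the estimator, since LinearRegression does not expose the driver; the data setup mirrors the reproducer above):

import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(2)
X = rng.rand(10, 5)
y = rng.rand(10)
sample_weight = rng.uniform(low=0.01, high=2, size=10)
sample_weight[-5:] = 0

# Weighted design with an explicit intercept column.
Xw = np.sqrt(sample_weight)[:, None] * np.c_[X, np.ones(10)]
yw = np.sqrt(sample_weight) * y

coef_gelsd, *_ = lstsq(Xw, yw)                         # default driver
coef_gelsy, *_ = lstsq(Xw, yw, lapack_driver="gelsy")  # alternative driver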

@ogrisel
Member

ogrisel commented Apr 14, 2023

If you set 4 sample weights to 0 instead of 5, or alternatively increase n_samples so that the non-zero-weighted data points always outnumber the features (making the problem non-singular), do you still see the problem? It could indeed be related to the behavior I observed in #22855 (comment).
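
To make the dimension count concrete, here is a small sketch (same 10x5 setup as the reproducer) of why zeroing 5 of the 10 weights leaves an underdetermined system:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 5)
w = np.ones(10)
w[-5:] = 0  # zero out 5 of the 10 sample weights

# Effective design after dropping the zero-weight rows, plus an intercept column:
X_eff = np.c_[X[w > 0], np.ones(int((w > 0).sum()))]
print(X_eff.shape)                   # (5, 6): fewer equations than unknowns
print(np.linalg.matrix_rank(X_eff))  # at most 5, so infinitely many exact solutions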

@glemaitre
Copy link
Member

I edited the script in the summary to show that it fails 20% of the time over 100 seeds.

@ogrisel
Member

ogrisel commented Apr 14, 2023

I confirm that I also reproduce 20% failures with the new script.

I also confirm that using Ridge(alpha=0, solver="lsqr") works fine 100% of the time.
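
For reproducibility, a sketch of that Ridge check, reusing the data generation from the script in the summary (with alpha=0, Ridge reduces to unpenalized least squares):

import numpy as np
from sklearn.linear_model import Ridge

results = []
for i in range(100):
    rng = np.random.RandomState(i)
    X = rng.rand(10, 5)
    y = rng.rand(10)
    sample_weight = rng.uniform(low=0.01, high=2, size=10)
    sample_weight_0 = sample_weight.copy()
    sample_weight_0[-5:] = 0
    y[-5:] *= 1000

    reg = Ridge(alpha=0, solver="lsqr")
    reg.fit(X, y, sample_weight=sample_weight_0)
    coef_0 = reg.coef_.copy()
    reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
    results.append(np.allclose(reg.coef_, coef_0, rtol=1e-6))

print(sum(results), "out of 100 seeds match")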

@ogrisel
Member

ogrisel commented Apr 14, 2023

I will try changing the LAPACK driver and maybe play with the cond option.

@ogrisel
Member

ogrisel commented Apr 14, 2023

With the gelsy driver I get a 13% failure rate. So not great either.

@ogrisel
Member

ogrisel commented Apr 14, 2023

By setting cond=1e-15 with the default gelsd LAPACK driver, I get a 100% pass rate, and we get the same solution as Ridge(alpha=0, solver="lsqr") (I only checked the last seed).

But I am not sure how to set a good value for cond in general.
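
For context, cond in scipy.linalg.lstsq is a relative cutoff on singular values; a minimal illustration on a random underdetermined system (not the reproducer itself):

import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
A = rng.rand(5, 6)  # underdetermined, like the effective weighted design
b = rng.rand(5)

# Singular values below cond * s_max are treated as zero, so a tiny cond keeps
# all numerically significant directions and returns the minimum-norm solution.
x, residues, rank, sv = lstsq(A, b, cond=1e-15)
print(rank, np.linalg.norm(A @ x - b))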

@lorentzenchr
Member Author

My take on this issue: it's the minimum-norm issue with LinearRegression, see #22947 and #22855. The effective shape of X is (5, 6) including the intercept column, so the system is underdetermined.

import numpy as np
from scipy.linalg import lstsq
import scipy.sparse as sparse
from sklearn.linear_model import LinearRegression


rng = np.random.RandomState(2)  # Seed dependent, np.allclose should be False.
n_samples, n_features = 10, 5
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)
reg = LinearRegression()
sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
sample_weight_0 = sample_weight.copy()
sample_weight_0[-5:] = 0
y[-5:] *= 1000  # to make excluding those samples important

reg.fit(X, y, sample_weight=sample_weight_0)
coef_0 = np.r_[reg.coef_.copy(), reg.intercept_]

reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
coef_1 = np.r_[reg.coef_.copy(), reg.intercept_]

# For fun, we also use the lsqr solver by converting X to sparse.
reg.fit(sparse.csr_array(X), y, sample_weight=sample_weight_0)
coef_2 = np.r_[reg.coef_.copy(), reg.intercept_] # same as coef_1

np.allclose(coef_0, coef_1, rtol=1e-6), np.allclose(coef_1, coef_2, rtol=1e-6)

The output should be (False, True).

# We solve with plain lstsq and an explicit intercept column.
X_with_intercept = np.c_[X, np.ones(10)]
coef_lstsq = lstsq(
    np.sqrt(sample_weight_0)[:, None] * X_with_intercept,
    np.sqrt(sample_weight_0) * y,
)[0]

# Norms of the weighted residuals
[
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_0 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_1 - y)),
    np.linalg.norm(sample_weight_0 * (X_with_intercept @ coef_lstsq - y)),
]

All are around 1e-16, so all three are valid solutions.

# Norm of coefficients
[
    np.linalg.norm(coef_0),
    np.linalg.norm(coef_1),
    np.linalg.norm(coef_lstsq),
]

[3.3496024918115093, 2.7998306762269527, 2.7862672453159094]
So the manual lstsq solution has the smallest L2 norm.

@thomasjpfan added the module:linear_model label and removed the Needs Triage label on Apr 21, 2023
@ogrisel
Member

ogrisel commented Apr 25, 2023

As discussed in #22947, the X_with_intercept approach can be problematic (for ridge), and I think it's important that our LinearRegression and Ridge estimators handle the intercept in a consistent manner.

So I would rather pass cond=1e-15 to lstsq in LinearRegression as a short-term fix for the underdetermined case, while keeping the current way of handling intercepts in OLS (that is, fit on centered data, then set self.intercept_ = y_mean - self.coef_ @ X_mean), even if that means the intercept does not take part in the computation of the norm of the minimal-norm solution.
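
For illustration, a minimal sketch of that centering scheme (assuming dense X and weighted means; not scikit-learn's actual implementation):

import numpy as np

def ols_centered(X, y, sample_weight):
    # Fit OLS on weighted, centered data; recover the intercept afterwards.
    X_mean = np.average(X, axis=0, weights=sample_weight)
    y_mean = np.average(y, weights=sample_weight)
    sqrt_sw = np.sqrt(sample_weight)
    coef, *_ = np.linalg.lstsq(
        sqrt_sw[:, None] * (X - X_mean), sqrt_sw * (y - y_mean), rcond=None
    )
    intercept = y_mean - coef @ X_mean
    return coef, intercept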

Later, we can also discuss other solvers for LinearRegression to improve its fit-time performance, see #22855.

@lorentzenchr
Member Author

Closing in favor of #22947, of which this is a duplicate.
