
Fix float32 cond linear regression #33249

Closed

antoinebaker wants to merge 12 commits into scikit-learn:main from antoinebaker:fix_float32_cond_linear_regression


Conversation

@antoinebaker (Contributor) commented Feb 9, 2026

Fixes #33032 but will reopen #26164.

What does this implement/fix? Explain your changes.

As explained in #33032, the cond choice made in PR #30040 to solve #26164 introduced new bugs, for instance on float32 data. This PR reverts the changes made in #30040 and adds a reproducer for the bugs reported in #33032.
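For context, here is a minimal sketch in the spirit of the reproducer (the exact test added by the PR may differ; the dataset shape and noise scale below are illustrative assumptions, following the "float32 and large n_samples" setting of the linked issue):

```python
# Hypothetical sketch of the float32 regression from #33032, not the PR's actual test.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_samples, n_features = 600_000, 3  # large n_samples triggers the issue on float32
X = rng.uniform(size=(n_samples, n_features))
true_coef = rng.uniform(size=n_features)
y = X @ true_coef + rng.normal(scale=1e-3, size=n_samples)

coef_64 = LinearRegression().fit(X, y).coef_
coef_32 = LinearRegression().fit(X.astype(np.float32), y.astype(np.float32)).coef_

# With the cond value introduced in #30040, the float32 fit could be badly off;
# after the revert, both fits should recover true_coef up to noise.
print(coef_64)
print(coef_32)
```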

Comments / next steps

This is a temporary "fix" following the plan outlined here. We need to reopen #26164.

In follow-up PRs we should either:

  1. find a "good" choice for cond that ideally works on any data shape and dtype, and passes the sample weight consistency checks
  2. expose cond as a parameter in LinearRegression, as was done for the sparse case in Add tol to LinearRegression #30521; we could actually re-use the tol parameter for this (which would map to cond in the dense case). See the sketch after this list.
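For reference, cond in the dense case is the cutoff passed to scipy.linalg.lstsq: singular values smaller than cond times the largest singular value are treated as zero. A small illustrative sketch (the near-collinear data is contrived on purpose; this is not code from this PR):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 5)).astype(np.float32)
X[:, 4] = X[:, 0] + 1e-4 * rng.uniform(size=100)  # nearly collinear columns
y = X @ rng.uniform(size=5).astype(np.float32)

# Singular values below cond * sigma_max are treated as zero by lstsq,
# so the effective rank (and hence the solution) depends on cond.
for cond in (None, 1e-6, 1e-2):
    coef, _, rank, sv = linalg.lstsq(X, y, cond=cond)
    print(f"cond={cond} -> effective rank={rank}")
```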

Review threads on doc/whats_new/upcoming_changes/sklearn.linear_model/33249.fix.rst (outdated)
antoinebaker and others added 2 commits February 16, 2026 14:35
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Junteng Li <JasonLiJT@users.noreply.github.com>
@antoinebaker antoinebaker marked this pull request as ready for review February 16, 2026 14:23
@antoinebaker (Contributor, Author)

Thanks for the quick review @JasonLiJT @ogrisel. I think this small PR can be quickly merged, so I've marked it ready for review.

What are your thoughts on the follow-up PR: do you prefer option 1 (find a good default for cond) or option 2 (expose cond)?

I personally prefer option 2; I think it will be easier (finding a good default cond seems to be quite finicky).

Review threads on doc/whats_new/upcoming_changes/sklearn.linear_model/33249.fix.rst (outdated) and sklearn/linear_model/tests/test_base.py
@antoinebaker (Contributor, Author) commented Feb 25, 2026

Rah, this is annoying :(

The cond=None option (the scipy default) is failing the recently added tests in #33020 (test_regularization_limits_ridge*, which check that the Ridge* estimators recover LinearRegression when alpha is near zero).
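In sketch form, the property those tests check looks something like this (a hand-written illustration on a well-conditioned dataset, not the actual test code from #33020):

```python
# Sketch of the regularization-limit property: Ridge with vanishing alpha
# should match the unpenalized least squares fit of LinearRegression.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 5))
y = rng.uniform(size=100)

coef_lr = LinearRegression().fit(X, y).coef_
coef_ridge = Ridge(alpha=1e-12).fit(X, y).coef_

np.testing.assert_allclose(coef_ridge, coef_lr, rtol=1e-4)
```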

I think the most probable cause is the cutoff for "zero eigenvalue" in the various solvers (for Ridge, _RidgeGCV, LinearRegression). Ideally we should come up with a consistent choice for all of them. In practice this means that, for a given dataset, the number of nonzero eigenvalues (in other words, the rank) should be the same for all solvers. This might be difficult as some solvers have the cutoff hardcoded, some use the scipy default, and some expose it as a parameter.
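To illustrate the point (a standalone numpy sketch, not sklearn internals): the effective rank of the same float32 matrix changes with the chosen cutoff, so solvers using different cutoffs can disagree.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 10)).astype(np.float32)
X[:, 9] = X[:, 0]  # exact duplicate column: the true rank is 9

s = np.linalg.svd(X, compute_uv=False)
# In float32, the "zero" singular value is computed as a small nonzero number,
# so whether it is counted depends on where the cutoff sits.
for cutoff in (1e-8, 1e-6, 1e-3):
    rank = int(np.sum(s > cutoff * s[0]))
    print(f"cutoff={cutoff:.0e} -> rank={rank}")
```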

I will try to investigate to be sure. But I have the intuition that we will probably need a more involved PR, either exposing this cutoff consistently or finding a good default.

cc @lorentzenchr @ogrisel

@antoinebaker antoinebaker marked this pull request as draft February 25, 2026 16:56
@antoinebaker (Contributor, Author)

> I think this small PR can be quickly merged, so I've marked it ready for review.

I changed my mind :)

@antoinebaker (Contributor, Author)

Closing in favor of #33565

@github-project-automation github-project-automation Bot moved this from In progress to Done in Labs Mar 18, 2026
@antoinebaker (Contributor, Author) commented Mar 18, 2026

> I think the most probable cause is the cutoff for "zero eigenvalue" in the various solvers (for Ridge, _RidgeGCV, LinearRegression). Ideally we should come up with a consistent choice for all of them. In practice this means that, for a given dataset, the number of nonzero eigenvalues (in other words, the rank) should be the same for all solvers. This might be difficult as some solvers have the cutoff hardcoded, some use the scipy default, and some expose it as a parameter.
>
> I will try to investigate to be sure. But I have the intuition that we will probably need a more involved PR, either exposing this cutoff consistently or finding a good default.

TL;DR: after investigation, finding a consistent choice for all solvers is not possible, or at least too difficult. For some solvers we do not have any control; for the others, the tol argument refers to quite different stopping criteria (see the docstring of tol in Ridge for a quick summary).
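For instance, tol is accepted by every Ridge solver but controls a different, solver-specific criterion in each. A quick sketch of what that looks like from the user side (solver names taken from the Ridge docstring; this is an illustration, not code from this PR):

```python
# tol is a single constructor parameter, but each solver interprets it as its
# own stopping/precision criterion (and some solvers may effectively ignore it).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 5))
y = rng.uniform(size=100)

for solver in ("svd", "cholesky", "lsqr", "sparse_cg", "sag"):
    model = Ridge(alpha=1.0, solver=solver, tol=1e-8).fit(X, y)
    print(solver, np.round(model.coef_[:2], 6))
```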



Development

Successfully merging this pull request may close these issues.

BUG: LinearRegression is wrong on float32 and large n_samples (> 500k rows) due to cond param to scipy.linalg.lstsq
