
FEAT - Basic GeneralizedLinearEstimatorCV #311


Merged
21 commits merged into scikit-learn-contrib:main on Jun 26, 2025

Conversation

floriankozikowski
Contributor

@floriankozikowski floriankozikowski commented May 22, 2025

Context of the PR

Implements a cross-validation wrapper for GeneralizedLinearEstimator along with a standalone example script. Provides hyperparameter grids over alpha and optional l1_ratio, custom k-fold splitting, warm starts, scoring, and a final full-data refit (closes issue #308).

Contributions of the PR

  • class inherits from GeneralizedLinearEstimator, CV inspired by Celer
  • full pass over l1_ratio (if available) and alpha grids
  • warm_start: carry forward each fold’s solution (w) as the initial w_start for the next alpha
  • score & select: compute MSE (or user-provided scorer) per fold, track the mean loss in mse_path, and pick the (alpha_, l1_ratio_) with lowest average loss
  • final refit: once the best hyperparameters are found, update self.penalty in place and call super().fit(X, y) to train the returned model on the entire dataset
  • plot_generalized_linear_estimator.cv provides a simple example file
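The score-and-select loop described above can be sketched as follows. This is a hypothetical illustration using sklearn's ElasticNet as a stand-in for GeneralizedLinearEstimator; the variable names (`mse_path`, `best_alpha`) mirror the PR description but the code is not the PR's implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=0)
alphas = [0.01, 0.1, 1.0]
kf = KFold(n_splits=4, shuffle=True, random_state=0)

# Track the per-fold test MSE for each alpha.
mse_path = np.zeros((len(alphas), kf.get_n_splits()))
for i, alpha in enumerate(alphas):
    for k, (train, test) in enumerate(kf.split(X)):
        model = ElasticNet(alpha=alpha).fit(X[train], y[train])
        mse_path[i, k] = np.mean((model.predict(X[test]) - y[test]) ** 2)

# Pick the alpha with lowest mean loss, then refit on the full dataset.
best_alpha = alphas[int(np.argmin(mse_path.mean(axis=1)))]
final_model = ElasticNet(alpha=best_alpha).fit(X, y)
```

The final full-data refit is what lets the returned estimator be used for prediction directly, rather than keeping only per-fold models.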

Checks before merging PR

  • added documentation for any new feature
  • added unit tests
  • edited the what's new (if applicable)

skglm/cv.py (outdated)

```python
est = GeneralizedLinearEstimator(
    datafit=self.datafit, penalty=pen, solver=self.solver
)
est.solver.warm_start = True
```
Collaborator


Using warm start here will have no impact since you just instantiated the model.

@mathurinm
Collaborator

@floko can you provide some timings on datasets of reasonable sizes?

We should handle warm start better, by storing K estimators (one for each fold) and using warm start on them, so that on fold k, when fitting for one value of alpha, we use warm start with the previous value of alpha on the same fold.

Right now warm start is not used since we recreate an estimator at each alpha.
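The suggested scheme can be sketched like this; it is a hypothetical illustration using sklearn's ElasticNet (whose `warm_start=True` flag reuses `coef_` between `fit` calls) as a stand-in for the skglm estimator:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=50, noise=0.1, random_state=0)
alphas = np.geomspace(1.0, 1e-3, num=10)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One warm-started estimator per fold, reused along the whole alpha path,
# instead of recreating an estimator at each alpha.
fold_models = [ElasticNet(warm_start=True) for _ in range(kf.get_n_splits())]

mse_path = np.zeros((len(alphas), kf.get_n_splits()))
for i, alpha in enumerate(alphas):
    for k, (train, test) in enumerate(kf.split(X)):
        est = fold_models[k]
        est.set_params(alpha=alpha)
        est.fit(X[train], y[train])  # starts from the previous alpha's solution
        mse_path[i, k] = np.mean((est.predict(X[test]) - y[test]) ** 2)
```

Because `kf.split` with a fixed `random_state` yields the same folds on every pass, fold k always sees the same training data, so warm starting from the previous alpha's solution is valid.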

@mathurinm mathurinm requested review from Badr-MOUFAD June 16, 2025 11:57
```python
from skglm.cv import GeneralizedLinearEstimatorCV


@pytest.mark.parametrize("seed", [0, 42])
```
Collaborator


Just fix a seed equal to your favorite number; testing 2 seems overkill.

```python
"""Test GeneralizedLinearEstimatorCV matches sklearn GridSearchCV for ElasticNet."""
X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=seed)

alphas = np.array([0.001, 0.01, 0.1, 1.0])
```
Collaborator


Picking absolute values for alphas is dangerous; scale them by `||X.T @ y||_inf / n`, which is the smallest alpha for which the solution is 0.
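A minimal sketch of this scaling (`alpha_max` is the standard Lasso notation for this quantity; the geometric grid spacing below is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=0)
n_samples = X.shape[0]

# Smallest alpha for which the (Lasso) solution is identically zero.
alpha_max = np.max(np.abs(X.T @ y)) / n_samples

# Geometric grid scaled relative to alpha_max instead of absolute values.
alphas = alpha_max * np.geomspace(1.0, 1e-3, num=4)
```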

Collaborator

@Badr-MOUFAD Badr-MOUFAD left a comment


Thanks @floriankozikowski for the great work, here are some remarks

For the example, what do you think about having something similar to
https://contrib.scikit-learn.org/skglm/auto_examples/plot_ucurve.html#sphx-glr-auto-examples-plot-ucurve-py

@Badr-MOUFAD
Collaborator

@floriankozikowski, I'm unable to spot where _kfold_split is implemented, can you point me to it?

By the way, ping me when you are done with the PR, so I can do another review.

@floriankozikowski
Contributor Author

@Badr-MOUFAD
There is no function or method named kfold_split implemented in the code. Instead, I used scikit-learn's built-in KFold and StratifiedKFold classes directly for splitting the data into folds (around line 127 to 133).
For classification problems (e.g., Logistic, QuadraticSVC), I use:

```python
kf = StratifiedKFold(n_splits=self.cv, shuffle=True, random_state=self.random_state)
split_iter = kf.split(np.arange(n_samples), y)
```

For regression or other problems, I use:

```python
kf = KFold(n_splits=self.cv, shuffle=True, random_state=self.random_state)
split_iter = kf.split(np.arange(n_samples))
```

Does this make sense to you?

Other than that, I'm done for now (except that maybe in the examples file we could consider deleting a plot). Thanks for reviewing again!

@floriankozikowski floriankozikowski marked this pull request as ready for review June 18, 2025 14:34
@Badr-MOUFAD
Collaborator

Does this make sense to you?

Makes sense. I was asking because it was mentioned in the PR description.

@floriankozikowski, can you please clean up test_cv.py (prints, ...)? If it has something important for the review, maybe put it in a debug_script.py that we can delete afterward.

@floriankozikowski
Contributor Author

@Badr-MOUFAD good catch, I removed it from the PR description (initially I had it like this, but then changed it at some point). I also made the unit test shorter. Let me know if that's fine.
@mathurinm initially you added a part that also tests alpha = 1e-4. For now I removed it, as it was in the debugging sections of the script, but if you want it back let me know. When I tried it, the pytest still passed, but it was much slower and I got convergence warnings.

Collaborator

@Badr-MOUFAD Badr-MOUFAD left a comment


LGTM @floriankozikowski, thanks for this amazing job 💪

just minor remarks before merging

@floriankozikowski
Contributor Author

@Badr-MOUFAD everything addressed and ready to merge. Thanks for the feedback!

@Badr-MOUFAD Badr-MOUFAD dismissed mathurinm’s stale review June 26, 2025 09:29

All changes were addressed

@Badr-MOUFAD Badr-MOUFAD merged commit 4b2344c into scikit-learn-contrib:main Jun 26, 2025
4 checks passed
@floriankozikowski floriankozikowski deleted the GLM_CV_simple branch June 26, 2025 11:22