
[python-package] support sub-classing scikit-learn estimators #6783

Merged (24 commits, Feb 12, 2025)

Conversation

@jameslamb (Collaborator) commented Jan 10, 2025

I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concern as #4426: it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor:

https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁
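For a sense of what this would enable, here's a minimal sketch of the kind of sub-class this PR is aiming to support (MyRegressor and my_param are hypothetical names, not code from this PR):

import lightgbm as lgb

class MyRegressor(lgb.LGBMRegressor):
    """LGBMRegressor with one extra constructor argument."""

    def __init__(self, *, my_param: int = 1, **kwargs):
        # store the new parameter under the same name, so
        # scikit-learn's get_params() / clone() can find it
        self.my_param = my_param
        # pass all other parameters through to the parent estimator
        super().__init__(**kwargs)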

Notes for Reviewers

Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for the scikit-learn estimators here (including those in lightgbm.dask) to supporting only keyword arguments.

Why I'm proposing this instead of a deprecation cycle: anyone passing positional arguments gets an immediate, descriptive TypeError rather than a silent change in behavior:

import lightgbm as lgb
lgb.LGBMClassifier("gbdt")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: LGBMClassifier.__init__() takes 1 positional argument but 2 were given
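
(For context: the keyword-only behavior comes from a bare * in the signature. A generic sketch, not LightGBM code:)

class Example:
    def __init__(self, *, boosting_type: str = "gbdt"):
        self.boosting_type = boosting_type

Example(boosting_type="gbdt")  # OK
# Example("gbdt")  # raises the same kind of TypeError shown above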

I posted a related answer to that Stack Overflow question:

https://stackoverflow.com/a/79344862/3986677

@jameslamb jameslamb changed the title WIP: [python-package] support sub-classing scikit-learn estimators [python-package] support sub-classing scikit-learn estimators Jan 11, 2025
@jameslamb jameslamb marked this pull request as ready for review January 11, 2025 05:06
@jameslamb jameslamb mentioned this pull request Jan 23, 2025
@StrikerRUS (Collaborator)

Could you please set up an RTD build for this branch? I'd like to see how the __init__ signature will be rendered there.

@jameslamb (Collaborator, Author)

Sure, here's a first build: https://readthedocs.org/projects/lightgbm/builds/26983170/

@StrikerRUS (Collaborator) left a comment:

Great simplification, thanks for working on it!

I don't have any serious comments, just want to get some answers before approving.

importance_type=importance_type,
**kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
@StrikerRUS (Collaborator) commented Jan 27, 2025:

Do you think it's OK to have just one client argument in the signature but describe all parent args in the docstring?

[screenshot of the rendered docs]

@jameslamb (Collaborator, Author) replied:

I think it's a little better for users to see all the parameters right here, instead of having to click over to another page.

This is what XGBoost is doing too: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor

But I do also appreciate that it could look confusing.

If we don't do it this way, then I'd recommend we add a link in the docs for `**kwargs` in these estimators, like this:

**kwargs Other parameters for the model. These can be any of the keyword arguments for LGBMModel or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

I have a weak preference for keeping it as-is (all parameters in the docstring, but not all in the signature), but happy to change it if you think that's confusing.

@StrikerRUS (Collaborator) replied:

Thanks for clarifying your opinion!
I love your suggestion for the **kwargs description. But my preference is also weak 🙂
I think we need a third judge's opinion on this question.

Either way, I'm approving this PR!

@jameslamb (Collaborator, Author) replied:

@jmoralez or @borchero could one of you comment on this thread and help us break the tie?

To make progress on the release, if we don't hear back in the next 2 days I'll merge this PR as-is and we can come back and change the docs later.

@borchero (Collaborator) replied:

Sorry, I only saw this now! My personal preference would actually be to keep all of the parameters (similar to the previous state) and simply make them keyword arguments. While this results in more code and some duplication of defaults, I think that this is the clearest interface for users. If you think this is undesirable @jameslamb, I'd at least opt for documenting all of the "transitive" parameters, just like in the XGBoost docs.

@StrikerRUS (Collaborator) replied:

Hmmm, I still think that

**kwargs Other parameters for the model. These can be any of the keyword arguments for LGBMModel or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

would be better... But OK.

What I'm definitely sure of is that the sklearn classes and the Dask ones should follow the same pattern.

[screenshots of the rendered docs for the sklearn and Dask estimators]

@jameslamb (Collaborator, Author) replied:

Sorry, I was so focused on the Dask estimators in the most recent round of changes that I forgot about the effect this would have on LGBM{Classifier,Ranker,Regressor}. I agree, I need to fix this inconsistency.

I do think that it'd be better to have all the arguments listed out in the signature explicitly. That's helpful for code completion in editors and help() in a REPL. And I strongly suspect that users use LGBM{Classifier,Ranker,Regressor} directly much more often than they use LGBMModel. It introduces duplication in the code, but I personally am OK with that in exchange for those benefits for users, for the reasons I mentioned in #6783 (comment)

Given that set of possible benefits, @StrikerRUS would you be ok with me duplicating all the defaults into the __init__() signature of LGBM{Classifier,Ranker,Regressor} too (as currently happens for the Dask estimators) and expanding the tests to confirm that the arguments are all consistent between LGBMModel, LGBM{Classifier,Ranker,Regressor}, and DaskLGBM{Classifier,Ranker,Regressor}?

Or would you still prefer having **kwargs and a docstring like this?

**kwargs Other parameters for the model. These can be any of the keyword arguments for LGBMModel or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

It seems from comments above that @borchero was also OK with either form... I think we are all struggling to choose a preferred form here. I don't have any other thoughts on this, so I'll happily defer to your decision.
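
To illustrate the two forms we're choosing between, a generic sketch (Base, SubExplicit, and SubKwargs are stand-ins, not LightGBM code):

class Base:
    def __init__(self, *, boosting_type: str = "gbdt", num_leaves: int = 31, **kwargs):
        self.boosting_type = boosting_type
        self.num_leaves = num_leaves
        self.kwargs = kwargs

# option 1: repeat every keyword-only argument and its default explicitly.
# More code and duplicated defaults, but help() and editor completion show everything.
class SubExplicit(Base):
    def __init__(self, *, boosting_type: str = "gbdt", num_leaves: int = 31, **kwargs):
        super().__init__(boosting_type=boosting_type, num_leaves=num_leaves, **kwargs)

# option 2: accept only **kwargs and forward them.
# No duplication, but the signature itself tells users nothing.
class SubKwargs(Base):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)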

@StrikerRUS (Collaborator) replied:

OK, I'll approve the consistent version with explicitly listed args.

@jameslamb (Collaborator, Author) replied:

Thank you! I'm sorry for how much effort reviewing this PR has turned out to require.

I do think LightGBM's users will appreciate sub-classing being easier, and still having tab completion for constructor arguments for LGBM{Classifier,Ranker,Regressor}.

I just pushed 3d351a4 repeating all the arguments in the constructors.

Also added a test in test_sklearn.py similar to the Dask one, to ensure that all the default values and the set of parameters stay the same.

Updated docs:

Now they look the same:

[two screenshots of the updated docs, Feb 10, 2025]

I also re-ran the sub-classing example being added to FAQ.rst here to be sure it works.

@StrikerRUS (Collaborator) replied:

No need to apologize! Thank you for working on this very important change!

jameslamb and others added 2 commits January 29, 2025 22:29
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@jameslamb jameslamb requested a review from StrikerRUS January 30, 2025 04:48
@StrikerRUS (Collaborator) left a comment:

Thank you very much!

@borchero (Collaborator) left a comment:

Thanks!

importance_type=importance_type,
**kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
A collaborator commented:

Going over the code again: given the number of times the args are repeated, I think using **kwargs is a very practical choice.

@jameslamb jameslamb changed the title [python-package] support sub-classing scikit-learn estimators WIP: [python-package] support sub-classing scikit-learn estimators Feb 6, 2025
@jameslamb jameslamb changed the title WIP: [python-package] support sub-classing scikit-learn estimators [python-package] support sub-classing scikit-learn estimators Feb 7, 2025

@@ -475,6 +501,193 @@ def test_clone_and_property():
assert isinstance(clf.feature_importances_, np.ndarray)


@pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
@jameslamb (Collaborator, Author) commented:

I tried intentionally making the kinds of changes this test should catch:

  • LGBMClassifier: removed min_child_weight
  • LGBMRanker: moved boosting_type before * (so not enforcing that it's keyword-only)
  • LGBMRegressor: changed default of subsample from 1.0 to 1.1
Full diff:
diff --git a/python-package/lightgbm/sklearn.py b/python-package/lightgbm/sklearn.py
index ab0686e2..7a854820 100644
--- a/python-package/lightgbm/sklearn.py
+++ b/python-package/lightgbm/sklearn.py
@@ -1330,7 +1330,7 @@ class LGBMRegressor(_LGBMRegressorBase, LGBMModel):
         min_split_gain: float = 0.0,
         min_child_weight: float = 1e-3,
         min_child_samples: int = 20,
-        subsample: float = 1.0,
+        subsample: float = 1.1,
         subsample_freq: int = 0,
         colsample_bytree: float = 1.0,
         reg_alpha: float = 0.0,
@@ -1438,7 +1438,6 @@ class LGBMClassifier(_LGBMClassifierBase, LGBMModel):
         objective: Optional[Union[str, _LGBM_ScikitCustomObjectiveFunction]] = None,
         class_weight: Optional[Union[Dict, str]] = None,
         min_split_gain: float = 0.0,
-        min_child_weight: float = 1e-3,
         min_child_samples: int = 20,
         subsample: float = 1.0,
         subsample_freq: int = 0,
@@ -1460,7 +1459,6 @@ class LGBMClassifier(_LGBMClassifierBase, LGBMModel):
             objective=objective,
             class_weight=class_weight,
             min_split_gain=min_split_gain,
-            min_child_weight=min_child_weight,
             min_child_samples=min_child_samples,
             subsample=subsample,
             subsample_freq=subsample_freq,
@@ -1689,8 +1687,8 @@ class LGBMRanker(LGBMModel):
     #       docs, help(), and tab completion.
     def __init__(
         self,
-        *,
         boosting_type: str = "gbdt",
+        *,
         num_leaves: int = 31,
         max_depth: int = -1,
         learning_rate: float = 0.1,

The test caught all of them.

pytest tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults
tests/python_package_test/test_sklearn.py FFF                                                                                                      [100%]

======================================================================== FAILURES ========================================================================
_________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMClassifier] __________________________________________

estimator = <class 'lightgbm.sklearn.LGBMClassifier'>

    @pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
    def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
        base_spec = inspect.getfullargspec(lgb.LGBMModel)
        subclass_spec = inspect.getfullargspec(estimator)
    
        # should not allow for any varargs
        assert subclass_spec.varargs == base_spec.varargs
        assert subclass_spec.varargs is None
    
        # the only varkw should be **kwargs,
        assert subclass_spec.varkw == base_spec.varkw
        assert subclass_spec.varkw == "kwargs"
    
        # default values for all constructor arguments should be identical
        #
        # NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
        #       any of LGBMModel's constructor arguments, this will need to be updated
>       assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E       AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
E         
E         Omitting 18 identical items, use -vv to show
E         Right contains 1 more item:
E         {'min_child_weight': 0.001}
E         Use -v to get more diff

tests/python_package_test/test_sklearn.py:521: AssertionError
__________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRegressor] __________________________________________

estimator = <class 'lightgbm.sklearn.LGBMRegressor'>

    @pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
    def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
        base_spec = inspect.getfullargspec(lgb.LGBMModel)
        subclass_spec = inspect.getfullargspec(estimator)
    
        # should not allow for any varargs
        assert subclass_spec.varargs == base_spec.varargs
        assert subclass_spec.varargs is None
    
        # the only varkw should be **kwargs,
        assert subclass_spec.varkw == base_spec.varkw
        assert subclass_spec.varkw == "kwargs"
    
        # default values for all constructor arguments should be identical
        #
        # NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
        #       any of LGBMModel's constructor arguments, this will need to be updated
>       assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E       AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
E         
E         Omitting 18 identical items, use -vv to show
E         Differing items:
E         {'subsample': 1.1} != {'subsample': 1.0}
E         Use -v to get more diff

tests/python_package_test/test_sklearn.py:521: AssertionError
___________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRanker] ____________________________________________

estimator = <class 'lightgbm.sklearn.LGBMRanker'>

    @pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
    def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
        base_spec = inspect.getfullargspec(lgb.LGBMModel)
        subclass_spec = inspect.getfullargspec(estimator)
    
        # should not allow for any varargs
        assert subclass_spec.varargs == base_spec.varargs
        assert subclass_spec.varargs is None
    
        # the only varkw should be **kwargs,
        assert subclass_spec.varkw == base_spec.varkw
        assert subclass_spec.varkw == "kwargs"
    
        # default values for all constructor arguments should be identical
        #
        # NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
        #       any of LGBMModel's constructor arguments, this will need to be updated
>       assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E       AssertionError: assert {'class_weigh...te': 0.1, ...} == {'boosting_ty... 'split', ...}
E         
E         Omitting 18 identical items, use -vv to show
E         Right contains 1 more item:
E         {'boosting_type': 'gbdt'}
E         Use -v to get more diff

tests/python_package_test/test_sklearn.py:521: AssertionError
================================================================ short test summary info =================================================================
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMClassifier] - AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRegressor] - AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRanker] - AssertionError: assert {'class_weigh...te': 0.1, ...} == {'boosting_ty... 'split', ...}
=================================================================== 3 failed in 0.46s ====================================================================

@jameslamb jameslamb requested a review from StrikerRUS February 11, 2025 05:35
@StrikerRUS (Collaborator) left a comment:

LGTM!
I checked each class's signature and docstring; they are all consistent.
Please update Dask test according to the latest changes and let's ship it!

@jameslamb (Collaborator, Author)

Ok thanks! Just pushed 51c18ad updating the Dask test.

After the release I'll return to #6677; hopefully that will make it less likely that I miss some tests when developing on a Mac.

I'll merge this when CI passes.

@jameslamb jameslamb merged commit c6d90bc into master Feb 12, 2025
49 checks passed
@jameslamb jameslamb deleted the python/sklearn-subclassing branch February 12, 2025 19:18
@jameslamb (Collaborator, Author)

I've removed this branch from the readthedocs versions: https://readthedocs.org/projects/lightgbm/versions/

Thanks @StrikerRUS and @borchero for the thorough reviews!!!
