[python-package] support sub-classing scikit-learn estimators #6783
Conversation
Could you please set up an RTD build for this branch? I'd like to see how …
Sure, here's a first build: https://readthedocs.org/projects/lightgbm/builds/26983170/
Great simplification, thanks for working on it!
I don't have any serious comments, just want to get some answers before approving.
python-package/lightgbm/dask.py (outdated)

    importance_type=importance_type,
    **kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
I think it's a little better for users to see all the parameters right here, instead of having to click over to another page.
This is what XGBoost is doing too: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRFRegressor
But I do also appreciate that it could look confusing.
If we don't do it this way, then I'd recommend we add a link in the docs for `**kwargs` in these estimators, like this:

    **kwargs
        Other parameters for the model. These can be any of the keyword arguments for LGBMModel
        or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
I have a weak preference for keeping it as-is (the signature in docs not having all parameters, but docstring having all parameters), but happy to change it if you think that's confusing.
Thanks for clarifying your opinion!
I love your suggestion for the `**kwargs` description. But my preference is also weak 🙂
I think we need a third opinion to settle this question.
Either way, I'm approving this PR!
Sorry, I only saw this now! My personal preference would actually be to keep all of the parameters (similar to the previous state) and simply make them keyword arguments. While this results in more code and some duplication of defaults, I think that this is the clearest interface for users. If you think this is undesirable @jameslamb, I'd at least opt for documenting all of the "transitive" parameters, just like in the XGBoost docs.
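(For concreteness, here is a heavily abridged sketch of that explicit-keyword-argument option. The defaults shown match the diff later in this thread, but the real signature has many more parameters, and the class name here is illustrative rather than the actual implementation.)

```python
# Sketch only (parameter list heavily abridged): re-list the base class's
# parameters as keyword-only arguments and forward them to LGBMModel.
from lightgbm import LGBMModel


class LGBMClassifierSketch(LGBMModel):
    def __init__(
        self,
        *,
        boosting_type: str = "gbdt",
        num_leaves: int = 31,
        learning_rate: float = 0.1,
        subsample: float = 1.0,
        **kwargs,
    ):
        super().__init__(
            boosting_type=boosting_type,
            num_leaves=num_leaves,
            learning_rate=learning_rate,
            subsample=subsample,
            **kwargs,
        )
```

This duplicates defaults between base class and subclass, but editors and `help()` can then show the full signature.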
Hmmm, I still think that

    **kwargs
        Other parameters for the model. These can be any of the keyword arguments for LGBMModel
        or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

would be better... But OK.
What I'm definitely sure of is that the sklearn classes and the Dask ones should follow the same pattern.
Sorry, I was so focused on the Dask estimators in the most recent round of changes that I forgot about the effect this would have on `LGBM{Classifier,Ranker,Regressor}`. I agree, I need to fix this inconsistency.

I do think that it'd be better to have all the arguments listed out in the signature explicitly. That's helpful for code completion in editors and `help()` in a REPL. And I strongly suspect that users use `LGBM{Classifier,Ranker,Regressor}` directly much more often than they use `LGBMModel`. It introduces duplication in the code, but I personally am OK with that in exchange for those benefits for users, for the reasons I mentioned in #6783 (comment).

Given that set of possible benefits, @StrikerRUS would you be OK with me duplicating all the defaults into the `__init__()` signature of `LGBM{Classifier,Ranker,Regressor}` too (as currently happens for the Dask estimators) and expanding the tests to confirm that the arguments are all consistent between `LGBMModel`, `LGBM{Classifier,Ranker,Regressor}`, and `DaskLGBM{Classifier,Ranker,Regressor}`?

Or would you still prefer having `**kwargs` and a docstring like this?

    **kwargs
        Other parameters for the model. These can be any of the keyword arguments for LGBMModel
        or any other LightGBM parameters documented at https://lightgbm.readthedocs.io/en/latest/Parameters.html.

It seems from the comments above that @borchero was also OK with either form... I think we are all struggling to choose a preferred form here. I don't have any other thoughts on this, so I'll happily defer to your decision.
OK, I'll approve the consistent version with explicitly listed args.
Thank you! I'm sorry for how much effort reviewing this PR has turned out to require.
I do think LightGBM's users will appreciate sub-classing being easier, while still having tab completion for constructor arguments for `LGBM{Classifier,Ranker,Regressor}`.
I just pushed 3d351a4 repeating all the arguments in the constructors. Also added a test in test_sklearn.py, similar to the Dask one, to ensure that all the default values and the set of parameters stay the same.
Updated docs:
- build: https://readthedocs.org/projects/lightgbm/builds/27144893/
- docs: https://lightgbm.readthedocs.io/en/python-sklearn-subclassing/
Now they look the same:


I also re-ran the sub-classing example being added to FAQ.rst here, to be sure it works.
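(For readers skimming this thread, here's a minimal sketch in the spirit of that sub-classing example; the class and parameter names are hypothetical, not the exact FAQ content.)

```python
# Sketch: with this PR, sub-classing only requires storing any new
# parameters and forwarding the rest through **kwargs.
# MyClassifier and my_custom_param are made-up names for illustration.
from lightgbm import LGBMClassifier


class MyClassifier(LGBMClassifier):
    def __init__(self, *, my_custom_param: float = 1.0, **kwargs):
        self.my_custom_param = my_custom_param
        super().__init__(**kwargs)


# LightGBM parameters still pass straight through to the parent constructor
clf = MyClassifier(my_custom_param=2.0, num_leaves=15)
```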
No need to apologize! Thank you for working on this very important change!
Thank you very much!
Thanks!
python-package/lightgbm/dask.py (outdated)

    importance_type=importance_type,
    **kwargs,
)
super().__init__(**kwargs)

_base_doc = LGBMClassifier.__init__.__doc__
Going over the code again: given the number of times the args are repeated, using `**kwargs` here is a very practical consideration.
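(As an aside, here's a small sketch of the pattern the hunk above touches, with illustrative names rather than the exact dask.py code: the Dask estimator forwards everything through `**kwargs` and then reuses the base class's `__init__` docstring, so `help()` and the rendered docs still show the full parameter list.)

```python
# Sketch (names illustrative, not the exact dask.py code): forward
# constructor arguments via **kwargs and borrow the base class's
# __init__ docstring so help() and Sphinx show the full parameter list.
from lightgbm import LGBMClassifier


class DaskLGBMClassifierSketch(LGBMClassifier):
    def __init__(self, *, client=None, **kwargs):
        self._client = client
        super().__init__(**kwargs)

    __init__.__doc__ = LGBMClassifier.__init__.__doc__
```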
tests/python_package_test/test_sklearn.py

@@ -475,6 +501,193 @@ def test_clone_and_property():
     assert isinstance(clf.feature_importances_, np.ndarray)


+@pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
I tried intentionally making the types of changes this test should catch:

- `LGBMClassifier`: removed `min_child_weight`
- `LGBMRanker`: moved `boosting_type` before `*` (so not enforcing that it's keyword-only)
- `LGBMRegressor`: changed default of `subsample` from `1.0` to `1.1`

Full diff:
diff --git a/python-package/lightgbm/sklearn.py b/python-package/lightgbm/sklearn.py
index ab0686e2..7a854820 100644
--- a/python-package/lightgbm/sklearn.py
+++ b/python-package/lightgbm/sklearn.py
@@ -1330,7 +1330,7 @@ class LGBMRegressor(_LGBMRegressorBase, LGBMModel):
min_split_gain: float = 0.0,
min_child_weight: float = 1e-3,
min_child_samples: int = 20,
- subsample: float = 1.0,
+ subsample: float = 1.1,
subsample_freq: int = 0,
colsample_bytree: float = 1.0,
reg_alpha: float = 0.0,
@@ -1438,7 +1438,6 @@ class LGBMClassifier(_LGBMClassifierBase, LGBMModel):
objective: Optional[Union[str, _LGBM_ScikitCustomObjectiveFunction]] = None,
class_weight: Optional[Union[Dict, str]] = None,
min_split_gain: float = 0.0,
- min_child_weight: float = 1e-3,
min_child_samples: int = 20,
subsample: float = 1.0,
subsample_freq: int = 0,
@@ -1460,7 +1459,6 @@ class LGBMClassifier(_LGBMClassifierBase, LGBMModel):
objective=objective,
class_weight=class_weight,
min_split_gain=min_split_gain,
- min_child_weight=min_child_weight,
min_child_samples=min_child_samples,
subsample=subsample,
subsample_freq=subsample_freq,
@@ -1689,8 +1687,8 @@ class LGBMRanker(LGBMModel):
# docs, help(), and tab completion.
def __init__(
self,
- *,
boosting_type: str = "gbdt",
+ *,
num_leaves: int = 31,
max_depth: int = -1,
learning_rate: float = 0.1,
The test caught all of them.
pytest tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults
tests/python_package_test/test_sklearn.py FFF [100%]
======================================================================== FAILURES ========================================================================
_________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMClassifier] __________________________________________
estimator = <class 'lightgbm.sklearn.LGBMClassifier'>
@pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
base_spec = inspect.getfullargspec(lgb.LGBMModel)
subclass_spec = inspect.getfullargspec(estimator)
# should not allow for any varargs
assert subclass_spec.varargs == base_spec.varargs
assert subclass_spec.varargs is None
# the only varkw should be **kwargs,
assert subclass_spec.varkw == base_spec.varkw
assert subclass_spec.varkw == "kwargs"
# default values for all constructor arguments should be identical
#
# NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
# any of LGBMModel's constructor arguments, this will need to be updated
> assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
E
E Omitting 18 identical items, use -vv to show
E Right contains 1 more item:
E {'min_child_weight': 0.001}
E Use -v to get more diff
tests/python_package_test/test_sklearn.py:521: AssertionError
__________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRegressor] __________________________________________
estimator = <class 'lightgbm.sklearn.LGBMRegressor'>
@pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
base_spec = inspect.getfullargspec(lgb.LGBMModel)
subclass_spec = inspect.getfullargspec(estimator)
# should not allow for any varargs
assert subclass_spec.varargs == base_spec.varargs
assert subclass_spec.varargs is None
# the only varkw should be **kwargs,
assert subclass_spec.varkw == base_spec.varkw
assert subclass_spec.varkw == "kwargs"
# default values for all constructor arguments should be identical
#
# NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
# any of LGBMModel's constructor arguments, this will need to be updated
> assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
E
E Omitting 18 identical items, use -vv to show
E Differing items:
E {'subsample': 1.1} != {'subsample': 1.0}
E Use -v to get more diff
tests/python_package_test/test_sklearn.py:521: AssertionError
___________________________________________ test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRanker] ____________________________________________
estimator = <class 'lightgbm.sklearn.LGBMRanker'>
@pytest.mark.parametrize("estimator", (lgb.LGBMClassifier, lgb.LGBMRegressor, lgb.LGBMRanker))
def test_estimators_all_have_the_same_kwargs_and_defaults(estimator):
base_spec = inspect.getfullargspec(lgb.LGBMModel)
subclass_spec = inspect.getfullargspec(estimator)
# should not allow for any varargs
assert subclass_spec.varargs == base_spec.varargs
assert subclass_spec.varargs is None
# the only varkw should be **kwargs,
assert subclass_spec.varkw == base_spec.varkw
assert subclass_spec.varkw == "kwargs"
# default values for all constructor arguments should be identical
#
# NOTE: if LGBMClassifier / LGBMRanker / LGBMRegressor ever override
# any of LGBMModel's constructor arguments, this will need to be updated
> assert subclass_spec.kwonlydefaults == base_spec.kwonlydefaults
E AssertionError: assert {'class_weigh...te': 0.1, ...} == {'boosting_ty... 'split', ...}
E
E Omitting 18 identical items, use -vv to show
E Right contains 1 more item:
E {'boosting_type': 'gbdt'}
E Use -v to get more diff
tests/python_package_test/test_sklearn.py:521: AssertionError
================================================================ short test summary info =================================================================
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMClassifier] - AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRegressor] - AssertionError: assert {'boosting_ty... 'split', ...} == {'boosting_ty... 'split', ...}
FAILED tests/python_package_test/test_sklearn.py::test_estimators_all_have_the_same_kwargs_and_defaults[LGBMRanker] - AssertionError: assert {'class_weigh...te': 0.1, ...} == {'boosting_ty... 'split', ...}
=================================================================== 3 failed in 0.46s ====================================================================
LGTM!
I checked each class's signature and docstring; they are all consistent.
Please update the Dask test according to the latest changes and let's ship it!
I've removed this branch from the readthedocs versions: https://readthedocs.org/projects/lightgbm/versions/

Thanks @StrikerRUS and @borchero for the thorough reviews!!!
I recently saw a Stack Overflow post ("Why can't I wrap LGBM?") expressing the same concerns from #4426 ... it's difficult to sub-class lightgbm's scikit-learn estimators.

It doesn't have to be! Look how minimal the code is for XGBRFRegressor: https://github.com/dmlc/xgboost/blob/45009413ce9f0d2bdfcd0c9ea8af1e71e3c0a191/python-package/xgboost/sklearn.py#L1869

This PR proposes borrowing some patterns I learned while working on xgboost's scikit-learn estimators to make it easier to sub-class lightgbm estimators. This also has the nice side effect of simplifying the lightgbm.dask code 😁

Notes for Reviewers
Why make the breaking change of requiring keyword args?

As part of this PR, I'm proposing immediately switching the constructors for scikit-learn estimators here (including those in lightgbm.dask) to only supporting keyword arguments (a short example of the caller-side impact is sketched at the end). Why I'm proposing this instead of a deprecation cycle:

- scikit-learn itself does this (HistGradientBoostingClassifier example)

I posted a related answer to that Stack Overflow question: https://stackoverflow.com/a/79344862/3986677
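(To make the breaking change concrete, here's a quick sketch of what keyword-only constructors mean for callers, assuming the `*` marker this PR adds to each `__init__`.)

```python
# Sketch: after this change, positional constructor arguments fail fast
# with a TypeError instead of being accepted.
from lightgbm import LGBMRegressor

reg = LGBMRegressor(boosting_type="gbdt")  # fine: keyword argument

try:
    LGBMRegressor("gbdt")  # positional use is no longer allowed
except TypeError as err:
    print(f"rejected as expected: {err}")
```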