ENH Specify categorical features with feature names in HGBDT #24889
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
LGTM
LGTM, only some questions.
This will improve user friendliness soooo much!!!
    "on data without feature names."
)
is_categorical = np.zeros(n_features, dtype=bool)
feature_names = self.feature_names_in_.tolist()
Is this conversion to a list necessary?
Arrays do not have the `index` method. Not sure how to implement this while staying in numpy and still making it easy to raise the error message in a timely manner.
Also, the feature names list should never be too long (a few hundred values) for HGBDT models in practice, because those models tend to perform poorly when `n_features >> n_samples`.
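For readers following the thread, here is a minimal sketch of the lookup being discussed, not the code merged in this PR. The helper name `names_to_mask`, the sample feature names, and the error wording are hypothetical; only `tolist()`, `list.index`, and `np.zeros` are taken from the snippet under review.

```python
import numpy as np


def names_to_mask(feature_names_in, categorical_features):
    """Hypothetical helper: turn a list of feature names into a boolean mask."""
    # A NumPy array has no .index method, hence the conversion to a list.
    feature_names = feature_names_in.tolist()
    is_categorical = np.zeros(len(feature_names), dtype=bool)
    for name in categorical_features:
        try:
            # list.index raises ValueError as soon as a name is unknown,
            # which makes it easy to report the offending name right away.
            is_categorical[feature_names.index(name)] = True
        except ValueError as exc:
            # Illustrative message only, not scikit-learn's actual wording.
            raise ValueError(
                f"{name!r} is not a feature name seen during fit."
            ) from exc
    return is_categorical


# Made-up feature names for illustration.
feature_names_in = np.array(["age", "city", "income"], dtype=object)
mask = names_to_mask(feature_names_in, ["city"])
```

Each `list.index` call is a linear scan, which is cheap for the few hundred names mentioned above.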
Thanks for the explanation.
Final adjustments.
Thanks for the final fixes @lorentzenchr!
Similar to #24855 but for the `categorical_features` parameter, as stated in #24852 (comment). Note that this works well with the `.set_output(transform="pandas")` API of this release. However, it requires disabling the verbose column names of the column transformer :)
Note: in the future we might directly inspect dataframe column dtypes in HGBDT and have an "auto" mode to trigger native categorical support for explicitly encoded categorical dtyped columns, but this will be the topic for a later PR.