Support categorical features in HistGradientBoosting estimators #15550

Closed
adrinjalali opened this issue Nov 6, 2019 · 10 comments
Labels: Enhancement, Moderate (anything that requires some knowledge of conventions and best practices), module:ensemble

Comments

@adrinjalali
Member

Similar to #12866 , HGBT based estimators can also natively support categorical features.

This issue is a placeholder to keep track of the discussions around the issue.

@adrinjalali added the Enhancement and Moderate labels on Nov 6, 2019
@NicolasHug
Member

Hi all,
Is anyone among the @scikit-learn/core-devs team willing to work on this soon-ish? It'd be better if I'm not the one doing it, because we don't have many devs acquainted with the HistGBDT code yet.

It's a fair amount of work, but it should be fun. The goal is to implement something similar to what LightGBM does (i.e. sorting of categories for fast split finding).

Pinging also @lorentzenchr and @johannfaouzi in case you'd be interested?

I'm happy to assist in any way I can (getting started with the codebase, etc.)
If nobody takes it, I'll start working on it.

Cheers

@lorentzenchr
Member

@NicolasHug Thanks for asking - sounds like fun and a very useful feature, too. But at the moment, I'd like to focus on some other features (e.g. different loss functions).

I'm interested to see how you design the API for this. Tagging certain features with additional info is very useful for other estimators and transformers, too (a certain SLEP is ringing a bell :smirk:).

@NicolasHug
Member

No worries! Thanks for the reply

In terms of API, we'll keep things simple and just add a categorical_features param to __init__, like we did for the monotonic constraints. Since the estimators are still experimental, we have a little more flexibility here.

Ideally, categorical features would be automatically inferred in fit, but as you know the SLEPs aren't there yet.

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 26, 2020 via email

@thomasjpfan
Member

I am interested in picking this up. Look out for a pull request soonish (within a week).

@johannfaouzi
Contributor

@NicolasHug Thanks for asking! Sorry for the delay, I saw the notification this morning but forgot to reply because I am working on too many side projects at the moment. I'm glad to see that @thomasjpfan is interested in picking this up, and I would be happy to review the PR. Also, I didn't know how tree-based algorithms deal with categorical features, so TIL.

@johannfaouzi
Contributor

johannfaouzi commented Mar 27, 2020

Having slept on the issue, here are my two cents. Feel free to ignore some or all of my remarks if they are irrelevant:

  • categorical_features could be None, an array of integers with the indices of the categorical features, or an array of booleans indicating which features are categorical. Would it be out of scope to have an "auto" value to infer which features are categorical? Since HGBT is usually applied to large data sets, it could be too costly. sklearn.preprocessing.OrdinalEncoder has an "auto" option (which is the default value).
  • If I understood correctly what LightGBM does, the idea is to split the categories into two groups. To do so, the histogram of the categorical feature is computed and then sorted, and the Fisher theorem states that the two groups are contiguous in that sorted order. So after this transformation, the feature can be treated like any binned continuous feature.
  • Are categorical features subject to the max_bins parameter? It could be tricky.
  • For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right? Edit: No.
  • How to deal with categories not seen when fitting the tree? Do we need to add a new bin for this case like it is done for missing values? This would lead to max_bins being lower than or equal to 254. This scenario may occur many times:
    • For child nodes, a subset of the samples is used, so some categories will no longer be present in this subset but they will be present for new samples when making predictions.
    • Rare categories when doing cross-validation.
    • Categories present in the test set that were not present in the training set.
  • Do we allow users to provide categorical features in varying formats (str) and use sklearn.preprocessing.OrdinalEncoder to encode them internally, or do we force users to do the preprocessing themselves? One of the first things done in fit is to call check_X_y.
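
To make the first bullet concrete, here is a minimal sketch of how the three proposed input forms (None, integer indices, boolean mask) could be normalized to a single boolean mask. The helper name `to_categorical_mask` and its error messages are hypothetical; this is not the scikit-learn implementation, only one possible way to validate such a parameter.

```python
from typing import List, Optional, Sequence, Union


def to_categorical_mask(
    categorical_features: Optional[Sequence[Union[int, bool]]],
    n_features: int,
) -> List[bool]:
    """Return one boolean per feature (hypothetical helper, not sklearn API)."""
    if categorical_features is None:
        # No categorical features at all.
        return [False] * n_features

    values = list(categorical_features)

    # bool is a subclass of int in Python, so test for booleans first.
    if values and all(isinstance(v, bool) for v in values):
        if len(values) != n_features:
            raise ValueError("a boolean mask needs one entry per feature")
        return values

    # Otherwise interpret the values as integer feature indices.
    mask = [False] * n_features
    for index in values:
        if isinstance(index, bool) or not isinstance(index, int):
            raise ValueError("expected only integers or only booleans")
        if not 0 <= index < n_features:
            raise ValueError(f"feature index {index} is out of range")
        mask[index] = True
    return mask


print(to_categorical_mask(None, 3))
print(to_categorical_mask([0, 2], 3))
print(to_categorical_mask([False, True, False], 3))
```

An "auto" value would need an extra branch that inspects the data, which is where the cost concern for large data sets comes in.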

@NicolasHug
Member

> For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right?

Not sure about this one: the gradients / hessians are updated at each iteration so the histograms are never the same, even at the root.

For the rest I concur; this is a good list of the things we need to take care of.
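
A toy illustration of why the root histogram cannot be cached across iterations: the samples at the root are always the same, but the gradients summed into each bin are not. The sketch assumes a squared-error loss 0.5 * (prediction - target)**2, whose gradient is prediction - target; the data and the helper `gradient_histogram` are made up for the example.

```python
from collections import defaultdict


def gradient_histogram(categories, gradients):
    """Sum the gradients of the samples that fall in each category."""
    histogram = defaultdict(float)
    for category, gradient in zip(categories, gradients):
        histogram[category] += gradient
    return dict(histogram)


categories = ["a", "a", "b", "b"]
targets = [1.0, 3.0, 5.0, 7.0]

# Iteration 0: every prediction is the mean target (4.0).
predictions = [4.0, 4.0, 4.0, 4.0]
gradients = [p - y for p, y in zip(predictions, targets)]
root_histogram_0 = gradient_histogram(categories, gradients)

# Iteration 1: assume the first tree moved each prediction halfway
# towards its category mean (2.0 for "a", 6.0 for "b").
predictions = [3.0, 3.0, 5.0, 5.0]
gradients = [p - y for p, y in zip(predictions, targets)]
root_histogram_1 = gradient_histogram(categories, gradients)

print(root_histogram_0)  # {'a': 4.0, 'b': -4.0}
print(root_histogram_1)  # {'a': 2.0, 'b': -2.0}
```

The same root node yields different gradient sums per category at each iteration, so both the histogram and any ordering derived from it have to be recomputed.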

@johannfaouzi
Contributor

> > For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right?
>
> Not sure about this one: the gradients / hessians are updated at each iteration so the histograms are never the same, even at the root.

Yeah, I guess this refers to the so-called pseudo-residuals in the Wikipedia page. The LightGBM page also states that:

> LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian)

I will remove that!
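
The sorting idea quoted above can be sketched as follows. Categories are ordered by sum_gradient / sum_hessian, and only the k - 1 contiguous split points in that order are scanned. The gain formula used here (G_L²/H_L + G_R²/H_R - G²/H, without regularization) is a common second-order choice and is an assumption, not something stated in this thread; the function name and the example histogram are hypothetical, and all hessian sums are assumed to be strictly positive.

```python
def best_categorical_split(histogram):
    """Find the best two-group split of the categories.

    `histogram` maps category -> (sum_gradient, sum_hessian).
    Returns (gain, set of categories sent to the left child).
    """
    # Order the categories by their accumulated gradient / hessian ratio.
    ordered = sorted(histogram, key=lambda c: histogram[c][0] / histogram[c][1])
    total_g = sum(g for g, _ in histogram.values())
    total_h = sum(h for _, h in histogram.values())

    best_gain, best_left = 0.0, None
    left_g = left_h = 0.0
    # Scan the k - 1 contiguous split points in the sorted order.
    for position, category in enumerate(ordered[:-1], start=1):
        g, h = histogram[category]
        left_g += g
        left_h += h
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = (left_g ** 2 / left_h
                + right_g ** 2 / right_h
                - total_g ** 2 / total_h)
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[:position])
    return best_gain, best_left


example = {
    "red": (4.0, 2.0),
    "green": (-3.0, 2.0),
    "blue": (3.0, 2.0),
    "grey": (-4.0, 2.0),
}
gain, left = best_categorical_split(example)
print(gain, sorted(left))  # 24.5 ['green', 'grey']
```

Once the categories are sorted, the scan is the same as for a binned continuous feature, which is the point made in the list above. Production implementations may add regularization or smoothing terms that are left out of this sketch.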

@lorentzenchr
Member

Native categorical support for HGBT was implemented in #18394.
