Support categorical features in HistGradientBoosting estimators #15550

adrinjalali · 2019-11-06T10:25:58Z

Similar to #12866 , HGBT based estimators can also natively support categorical features.

This issue is a placeholder to keep track of the discussions around the issue.

NicolasHug · 2020-03-25T17:35:59Z

Hi all,
Is anyone among the @scikit-learn/core-devs team willing to work on this soon-ish? It'd be better if I'm not the one doing it, because we don't have many devs acquainted with the HistGBDT code yet.

It's a fair amount of work, but it should be fun. The goal is to implement something similar to what LightGBM does (i.e. sorting of categories for fast split finding).

Pinging also @lorentzenchr and @johannfaouzi in case you'd be interested?

I'm happy to assist in any way I can (getting started with the codebase, etc.)
If nobody takes it, I'll start working on it.

Cheers

lorentzenchr · 2020-03-26T14:56:30Z

@NicolasHug Thanks for asking - sounds like fun and a very useful feature, too. But at the moment, I'd like to focus on some other features (e.g. different loss functions).

I'm interested to see how you design the API for this. Tagging certain features with additional info is very useful for other estimators and transformers, too (a certain SLEP is ringing a bell :smirk:).

NicolasHug · 2020-03-26T15:01:49Z

No worries! Thanks for the reply

In terms of API we'll keep things simple and just add a `categorical_features` param to `__init__`, like we did for the monotonic constraints. Since the estimators are still experimental, we have a little more flexibility here.

Ideally, categorical features would be automatically inferred in fit, but as you know the SLEPs aren't there yet.
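A minimal sketch of how such a `categorical_features` constructor parameter could be validated. It assumes the parameter accepts `None`, integer column indices, or a boolean mask; those accepted forms and the helper name `as_categorical_mask` are illustrative assumptions, not scikit-learn's actual validation code.

```python
def as_categorical_mask(categorical_features, n_features):
    """Return a list of n_features booleans marking categorical columns.

    Hypothetical helper for illustration only.
    """
    if categorical_features is None:
        return [False] * n_features
    values = list(categorical_features)
    if len(values) == 0:
        return [False] * n_features
    # A boolean mask must have exactly one entry per feature.
    if all(isinstance(v, bool) for v in values):
        if len(values) != n_features:
            raise ValueError("boolean mask must have n_features entries")
        return values
    # Otherwise treat the entries as integer column indices.
    mask = [False] * n_features
    for idx in values:
        if isinstance(idx, bool) or not isinstance(idx, int):
            raise ValueError("expected integer indices or a boolean mask")
        if not 0 <= idx < n_features:
            raise ValueError(f"feature index {idx} is out of range")
        mask[idx] = True
    return mask


print(as_categorical_mask([0, 2], n_features=3))
print(as_categorical_mask([False, True, False], n_features=3))
```

Normalizing every accepted form to one boolean mask early keeps the rest of the fitting code independent of how the user specified the columns.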

GaelVaroquaux · 2020-03-26T15:11:30Z

For information, the Paris team is very slowed down currently because those of us who are not sick are working on the Paris hospitals databases for real-time statistics and reporting on the Covid cases. Let's hope that this cools down soon.

thomasjpfan · 2020-03-26T15:22:03Z

I am interested in picking this up. Look out for a pull request soonish. (within a week)

johannfaouzi · 2020-03-26T18:40:15Z

@NicolasHug Thanks for asking! Sorry for the delay, I saw the notification this morning but forgot to reply because I'm working on too many side projects at the moment. I'm glad to see that @thomasjpfan is interested in picking this up and I would be happy to review the PR. Also I didn't know how tree-based algorithms deal with categorical features, so TIL.

johannfaouzi · 2020-03-27T08:23:12Z

Having slept on the issue, here are my two cents. Feel free to ignore some or all of my remarks if irrelevant:

- `categorical_features` could be None, an array of integers with the indices of the categorical features, or an array of booleans indicating which features are categorical. Would it be out of scope to have an "auto" value to infer which features are categorical? Since HGBT is usually applied to large data sets, it could be too costly. sklearn.preprocessing.OrdinalEncoder has an "auto" option (which is the default value).
- If I understood correctly what LightGBM does, the idea is to split the categories into two groups. To do so, the histogram of the categorical feature is computed and then sorted, and Fisher's theorem states that the two optimal groups are contiguous in that sorted order. So after this transformation, the feature can be treated like any binned continuous feature.
- Are categorical features subject to the `max_bins` parameter? It could be tricky.
- ~~For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right?~~ Edit: No.
- How to deal with categories not seen when fitting the tree? Do we need to add a new bin for this case, like it is done for missing values? This would lead to `max_bins` being lower than or equal to 254. This scenario can occur in several situations:
  - For child nodes, a subset of the samples is used, so some categories will no longer be present in this subset but they will be present for new samples when making predictions.
  - Rare categories when doing cross-validation.
  - Categories present in the test set that were not present in the training set.
- Do we allow users to provide categorical features in varying formats (str) and use sklearn.preprocessing.OrdinalEncoder to encode them internally, or do we force users to do the preprocessing themselves? One of the first things in `fit` is to call `check_X_y`.
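To make the "extra bin for unseen categories" option concrete, here is a small sketch. The class name `CategoryBinner`, its attributes, and the choice to put the reserved bin right after the known categories are all hypothetical; this only illustrates the idea raised in the list above, not what the estimators actually do.

```python
class CategoryBinner:
    """Hypothetical helper: map raw categories to small integer bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins

    def fit(self, column):
        categories = sorted(set(column))
        # One bin is reserved for unseen categories, so at most
        # max_bins - 1 known categories can be stored.
        if len(categories) > self.max_bins - 1:
            raise ValueError("too many categories for the available bins")
        self.bin_of_ = {cat: i for i, cat in enumerate(categories)}
        self.unknown_bin_ = len(categories)
        return self

    def transform(self, column):
        # Categories never seen during fit all land in the reserved bin.
        return [self.bin_of_.get(cat, self.unknown_bin_) for cat in column]


binner = CategoryBinner(max_bins=4).fit(["red", "green", "red", "blue"])
print(binner.transform(["green", "purple"]))  # "purple" was not seen in fit
```

Reserving the bin up front means prediction never fails on a new category, at the cost of one fewer bin for known categories.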

NicolasHug · 2020-03-27T12:35:47Z

> For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right?

Not sure about this one: the gradients / hessians are updated at each iteration so the histograms are never the same, even at the root.
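A toy illustration of that point, assuming a least-squares loss (gradient = prediction - target) and a made-up update step standing in for a fitted tree. The data and the helper `root_histogram` are invented for the example.

```python
from collections import defaultdict


def root_histogram(categories, gradients):
    """Sum gradients per category over all samples (the root node)."""
    hist = defaultdict(float)
    for cat, g in zip(categories, gradients):
        hist[cat] += g
    return dict(hist)


categories = ["a", "a", "b", "b"]
targets = [1.0, 3.0, 6.0, 10.0]

# Iteration 0: start from a constant prediction (the mean of the targets).
predictions = [5.0] * 4
gradients = [p - y for p, y in zip(predictions, targets)]
hist_0 = root_histogram(categories, gradients)

# Toy update standing in for a fitted tree: move each prediction
# halfway toward its target, then recompute the gradients.
predictions = [p - 0.5 * g for p, g in zip(predictions, gradients)]
gradients = [p - y for p, y in zip(predictions, targets)]
hist_1 = root_histogram(categories, gradients)

print(hist_0)
print(hist_1)
```

The samples falling into each category are identical in both iterations, yet the per-category gradient sums differ, so the root histogram has to be rebuilt for every tree.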

For the rest I concur, this is a good list of the things we need to take care of

johannfaouzi · 2020-03-27T16:54:20Z

> > For the first node of each tree, the histogram and its sorting can be computed only once, since it will always be the same, right?
>
> Not sure about this one: the gradients / hessians are updated at each iteration so the histograms are never the same, even at the root.

Yeah I guess this refers to the so-called pseudo-residuals in the Wikipedia page. The LightGBM page also states that

> LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian)

I will remove that!
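The sorting strategy quoted above can be sketched as follows. This is a simplified illustration, not LightGBM's or scikit-learn's implementation: the function name `best_categorical_split` is hypothetical, the gain formula is the plain unregularized one, and every category is assumed to have a positive hessian sum.

```python
def best_categorical_split(hist):
    """hist maps category -> (sum_gradient, sum_hessian).

    Returns (left_categories, gain) for the best split found by scanning
    the categories in order of sum_gradient / sum_hessian.
    """
    ordered = sorted(hist, key=lambda c: hist[c][0] / hist[c][1])
    total_g = sum(g for g, _ in hist.values())
    total_h = sum(h for _, h in hist.values())
    parent_score = total_g ** 2 / total_h

    best_left, best_gain = None, 0.0
    left_g = left_h = 0.0
    # Only contiguous prefixes of the sorted order are considered, so there
    # are k - 1 candidates instead of 2 ** (k - 1) - 1 for k categories.
    for i, cat in enumerate(ordered[:-1]):
        g, h = hist[cat]
        left_g += g
        left_h += h
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = left_g ** 2 / left_h + right_g ** 2 / right_h - parent_score
        if gain > best_gain:
            best_left, best_gain = set(ordered[: i + 1]), gain
    return best_left, best_gain


hist = {"a": (4.0, 2.0), "b": (-3.0, 1.0), "c": (3.0, 2.0), "d": (-4.0, 2.0)}
left, gain = best_categorical_split(hist)
print(sorted(left), round(gain, 3))
```

In this toy histogram the categories with negative gradient sums end up together on the left, which is the grouping one would expect from sorting by the gradient-to-hessian ratio.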

lorentzenchr · 2020-11-21T17:04:55Z

Native categorical support for HGBT was implemented in #18394.


adrinjalali added the Enhancement and Moderate (anything that requires some knowledge of conventions and best practices) labels Nov 6, 2019

NicolasHug added the module:ensemble label Mar 25, 2020

NicolasHug mentioned this issue Mar 30, 2020

Tree Optimal split - from LightGBM #9960

Closed

NicolasHug mentioned this issue Apr 9, 2020

Support sparse matrices in HistGradientBoosting estimators #16885

Closed

thomasjpfan mentioned this issue Apr 13, 2020

ENH Adds Categorical Support to Histogram Gradient Boosting #16909

Closed

h-vetinari mentioned this issue Sep 15, 2020

NOCATS: Categorical splits for tree-based learners (ctnd.) #12866

Open

lorentzenchr closed this as completed Nov 21, 2020

vruusmann added a commit to jpmml/jpmml-sklearn that referenced this issue Jan 27, 2021

Added support for histogram-based tree split values

db49079

See scikit-learn/scikit-learn#15550
