ENH Adds Categorical Support to Histogram Gradient Boosting #16909

Closed
wants to merge 89 commits

Changes shown below are from the first 7 commits.

Commits (89)
02d89d7
ENH Adds categorical support
thomasjpfan Apr 13, 2020
8472f60
DOC Improves english
thomasjpfan Apr 13, 2020
1198340
REV Less diffs
thomasjpfan Apr 13, 2020
63f56fd
DOC Adds comment
thomasjpfan Apr 13, 2020
f34087e
DOC Adds performance comment
thomasjpfan Apr 13, 2020
0b2ed9c
DOC Adds performance comment
thomasjpfan Apr 13, 2020
5eaf099
STY Fix
thomasjpfan Apr 13, 2020
43822ab
ENH Much faster bin mapping when transforming categories
thomasjpfan Apr 13, 2020
0d6012a
CLN Removes uneeded code
thomasjpfan Apr 13, 2020
b22151f
BUG Code actually is being used
thomasjpfan Apr 13, 2020
8432bac
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan Apr 29, 2020
ae9be56
CLN Address comments
thomasjpfan Apr 30, 2020
d0557a5
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan May 4, 2020
7692325
WIP Address more comments
thomasjpfan May 5, 2020
590d95f
CLN Address comments
thomasjpfan May 6, 2020
95e79f2
CLN Address comments
thomasjpfan May 6, 2020
e62479b
STY Linting
thomasjpfan May 6, 2020
9086fad
ENH Adds new method to binner
thomasjpfan May 7, 2020
3e323b2
CLN Binner refactor once again
thomasjpfan May 7, 2020
197fac0
CLN Address comments
thomasjpfan May 7, 2020
63af0d5
CLN More comments lol
thomasjpfan May 7, 2020
7ef6a8d
CLN Adds test for predict
thomasjpfan May 7, 2020
e6a03c6
ENH Adds categorical indicies support
thomasjpfan May 7, 2020
eabcfae
ENH Fix qsort
thomasjpfan May 7, 2020
2abe579
CLN Move missing_go_left code into grower
thomasjpfan May 7, 2020
9a5a3f4
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan May 8, 2020
ebb68e5
BUG Fix
thomasjpfan May 8, 2020
470c146
DOC More comments
thomasjpfan May 8, 2020
0fc4c24
DOC Update failing example
thomasjpfan May 8, 2020
cebd6c0
BUG Fixes
thomasjpfan May 8, 2020
d1478ba
DOC Update comment
thomasjpfan May 8, 2020
ba00644
BUG Fix
thomasjpfan May 8, 2020
1806c2b
BUG Fix
thomasjpfan May 8, 2020
95919e3
DOC Fix
thomasjpfan May 8, 2020
38966d5
WIP Try 32 bit
thomasjpfan May 8, 2020
c4869ba
WIP Fix bug
thomasjpfan May 8, 2020
f63ad6a
WIP Fix bug
thomasjpfan May 8, 2020
26d0796
WIP Fix bug
thomasjpfan May 8, 2020
17afb0f
WIP Fix bug
thomasjpfan May 8, 2020
5246cc1
REV Revert
thomasjpfan May 8, 2020
60523a3
DOC Fix
thomasjpfan May 8, 2020
b014d6e
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan May 25, 2020
96d0687
CLN Address comments
thomasjpfan May 28, 2020
dc0a3a4
WIP Updates binning
thomasjpfan May 28, 2020
e10b346
WIP Address more comments
thomasjpfan May 28, 2020
af58498
DOC Fix
thomasjpfan May 29, 2020
3c2f672
WIP moving to a predictor for bitset
thomasjpfan May 30, 2020
c8f31f9
ENH Fix binning tests
thomasjpfan May 30, 2020
fe16b42
ENH Removes binning in predict
thomasjpfan May 30, 2020
cf5bb6d
WIP Do not look still iterating lol
thomasjpfan May 31, 2020
3d9e449
WIP Do not look still iterating lol
thomasjpfan May 31, 2020
3615dc2
WIP Do not look still iterating lol
thomasjpfan May 31, 2020
6608715
WIP
thomasjpfan Jun 1, 2020
2c384e6
WIP Adds unknown category encoding
thomasjpfan Jun 1, 2020
8c6e985
WIP Removes binning during predict
thomasjpfan Jun 1, 2020
2357ae9
STY Update
thomasjpfan Jun 1, 2020
52048af
CLN Clean up commets
thomasjpfan Jun 1, 2020
f70416e
DOC Improve doc
thomasjpfan Jun 1, 2020
a4159cf
BUG Fix test
thomasjpfan Jun 1, 2020
a398786
ENH Only go in one direction when finding best split
thomasjpfan Jun 1, 2020
8ea46cc
ENH Do not include bitset if the split is not categorical
thomasjpfan Jun 1, 2020
3dcbd31
Fix
thomasjpfan Jun 2, 2020
c3b5eef
WIP Test seg fault
thomasjpfan Jun 2, 2020
280784a
BUG Fix
thomasjpfan Jun 2, 2020
019de8a
DOC Update doc
thomasjpfan Jun 5, 2020
24d0711
DOC Address comments
thomasjpfan Jun 5, 2020
2d0e79d
ENH Enables openmp
thomasjpfan Jun 5, 2020
966379c
BUG Fix
thomasjpfan Jun 5, 2020
1c920f7
Some comments
NicolasHug Jul 18, 2020
3966432
Merge branch 'master' of github.com:scikit-learn/scikit-learn into ca…
NicolasHug Jul 20, 2020
6c1af62
pep8
NicolasHug Jul 20, 2020
f535c33
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan Jul 31, 2020
2afca55
CLN Apply suggestions
thomasjpfan Aug 1, 2020
9b44d82
CLN More comments
thomasjpfan Aug 8, 2020
9f3fa46
categorical => categorical_features
ogrisel Aug 10, 2020
1054754
Merge branch 'master' of github.com:scikit-learn/scikit-learn into ca…
NicolasHug Aug 17, 2020
bbae955
Added grower test for OHE equivalent
NicolasHug Aug 17, 2020
c003b76
ENH Change sorting ordre to match lightgbm
thomasjpfan Aug 24, 2020
40c3f9b
CLN Less than equal
thomasjpfan Aug 24, 2020
bb0e899
CLN Adds splitting in both directions
thomasjpfan Aug 24, 2020
f47da15
Merge remote-tracking branch 'upstream/master' into cat_hgbt_rb
thomasjpfan Aug 24, 2020
bb5877d
CLN Fixes merge conflicts
thomasjpfan Aug 25, 2020
c3061b5
ENH Uses mask instead of pandas features in benchmark
thomasjpfan Aug 25, 2020
8762e88
DOC Remove reference to pandas in user guide
thomasjpfan Aug 25, 2020
6d7ec60
DOC Benchmark update
thomasjpfan Aug 25, 2020
69f3f9a
ENH Update benchmark parameters
thomasjpfan Aug 25, 2020
f9f837c
Merge branch 'master' of github.com:scikit-learn/scikit-learn into ca…
NicolasHug Aug 27, 2020
b913ff1
Remove pandas support for categorical features
NicolasHug Aug 27, 2020
730d69f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into ca…
NicolasHug Sep 4, 2020
38 changes: 38 additions & 0 deletions doc/modules/ensemble.rst
@@ -1095,6 +1095,44 @@ supported for multiclass context.

* :ref:`sphx_glr_auto_examples_ensemble_plot_monotonic_constraints.py`

.. _categorical_support_gbdt:

Categorical Support
-------------------

For datasets with categorical data, :class:`HistGradientBoostingClassifier`
and :class:`HistGradientBoostingRegressor` have native support for splitting
on categorical features. This is often better than one-hot encoding because
it leads to faster training times and shallower trees. When splitting a node,
a categorical feature is split into two subsets: one going to the left child
and the other going to the right child. To find the best split, the histogram
of each categorical feature is first sorted according to the ratio
`sum of gradients / sum of hessians` in each bin; candidate splits are then
evaluated along the sorted histogram.
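
For illustration, a minimal sketch of this split strategy (the per-bin
statistics below are hypothetical, and the actual implementation is in
Cython)::

    import numpy as np

    # Hypothetical per-category histogram statistics.
    sum_gradients = np.array([-4.0, 1.5, -0.5, 3.0])
    sum_hessians = np.array([2.0, 1.0, 1.0, 2.0])

    # Sort categories by their sum_gradient / sum_hessian ratio.
    order = np.argsort(sum_gradients / sum_hessians)

    # Candidate splits send the first i sorted categories to the left
    # child and the rest to the right child; a real implementation
    # evaluates the gain of each candidate and keeps the best one.
    for i in range(1, len(order)):
        left, right = order[:i], order[i:]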

If the cardinality of a categorical feature is greater than `max_bins`, the
`max_bins` most frequent categories are kept and the less frequent categories
are treated as missing. If there are missing values during training, the
missing values are treated as their own category. At prediction time,
categories that were unknown during fit are also treated as missing.
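
As a rough sketch of this encoding rule (illustrative only; the data and
`max_bins` value below are made up)::

    import numpy as np
    import pandas as pd

    max_bins = 3
    col = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d", np.nan])

    # Keep the max_bins most frequent categories; everything else,
    # including NaN and categories unseen at fit time, maps to missing.
    kept = col.value_counts().index[:max_bins]   # "a", "b", "c"
    encoded = col.where(col.isin(kept))          # "d" and NaN become missing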

To enable categorical support, a boolean mask can be passed to the
`categorical` parameter. In the following, the first feature is treated as
categorical and the second feature as numerical::

>>> gbdt = HistGradientBoostingRegressor(categorical=[True, False])

Another way to enable categorical support is to pass `'pandas'` to the
`categorical` parameter. This will infer the categorical features from
pandas' categorical dtype during `fit`.

>>> gbdt = HistGradientBoostingRegressor(categorical='pandas')

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_categorical.py`

Low-level parallelism
---------------------

93 changes: 93 additions & 0 deletions examples/ensemble/plot_gradient_boosting_categorical.py
@@ -0,0 +1,93 @@
"""
========================================
Categorical Support in Gradient Boosting
========================================

.. currentmodule:: sklearn

In this example, we will compare the performance of

:class:`~ensemble.HistGradientBoostingRegressor` using one hot encoding
and with native categorical support.

We will work with the Ames Iowa Housing dataset, which consists of numerical
and categorical features, where the target is the houses' sale prices.
Comment on lines +12 to +13

Contributor: Are the categorical features useful for this regression task? It may be worth adding another example where the categorical features are dropped: training should be faster but predictive performance should be worse. Dropping categorical features is another (dummy) way to deal with them.

Member Author: For this dataset, the categories do not matter as much, so I will be on the lookout for a nicer dataset.

"""
##############################################################################
# Load Ames Housing dataset
# -------------------------
# First, we load the Ames Housing data as a pandas dataframe. The features
# are either categorical or numerical:
print(__doc__)

from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=41211, as_frame=True, return_X_y=True)

n_features = X.shape[1]
n_categorical_features = (X.dtypes == 'category').sum()
n_numerical_features = (X.dtypes == 'float').sum()
print(f"Number of features: {X.shape[1]}")
print(f"Number of categorical featuers: {n_categorical_features}")
print(f"Number of numerical featuers: {n_numerical_features}")

##############################################################################
# Create gradient boosting estimator with one hot encoding
# --------------------------------------------------------
# Next, we create a pipeline that will one hot encode the categorical
# features and let the rest of the numerical data pass through:

from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer(
(OneHotEncoder(sparse=False, handle_unknown='ignore'),
make_column_selector(dtype_include='category')),
remainder='passthrough'
)

hist_one_hot = make_pipeline(preprocessor,
HistGradientBoostingRegressor(random_state=0))

##############################################################################
# Create gradient boosting estimator with native categorical support
# ------------------------------------------------------------------
# The :class:`~ensemble.HistGradientBoostingRegressor` has native support
# for categorical features using the `categorical` parameter:

hist_native = HistGradientBoostingRegressor(categorical='pandas',
random_state=0)

##############################################################################
# Train the models with cross-validation
# --------------------------------------
# Finally, we train the models using cross-validation. Here we compare the
# models' performance in terms of :func:`~metrics.r2_score` and fit times. We
# show that fit times are faster with native categorical support and that the
# test scores and score times are comparable:

from sklearn.model_selection import cross_validate
import matplotlib.pyplot as plt
import numpy as np

one_hot_result = cross_validate(hist_one_hot, X, y)
native_result = cross_validate(hist_native, X, y)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 8))

plot_info = [('fit_time', 'Fit times (s)', ax1),
('score_time', 'Score times (s)', ax2),
('test_score', 'Test Scores (r2 score)', ax3)]

x, width = np.arange(2), 0.9
for key, title, ax in plot_info:
items = [native_result[key], one_hot_result[key]]
ax.bar(x, [np.mean(item) for item in items],
width, yerr=[np.std(item) for item in items],
color=['b', 'r'])
    ax.set(title=title, xticks=[0, 1],
           xticklabels=['Native', 'One Hot'])
plt.show()
11 changes: 7 additions & 4 deletions sklearn/ensemble/_hist_gradient_boosting/_binning.pyx
@@ -34,12 +34,15 @@ def _map_to_bins(const X_DTYPE_C [:, :] data,
"""
cdef:
int feature_idx
X_DTYPE_C [:] binning_threshold

for feature_idx in range(data.shape[1]):
_map_num_col_to_bins(data[:, feature_idx],
binning_thresholds[feature_idx],
missing_values_bin_idx,
binned[:, feature_idx])
binning_threshold = binning_thresholds[feature_idx]
if binning_threshold is not None:
_map_num_col_to_bins(data[:, feature_idx],
binning_threshold,
Member: indentation needs a space

missing_values_bin_idx,
binned[:, feature_idx])


cdef void _map_num_col_to_bins(const X_DTYPE_C [:] data,
10 changes: 10 additions & 0 deletions sklearn/ensemble/_hist_gradient_boosting/_bitset.pxd
@@ -0,0 +1,10 @@
# cython: language_level=3
from .common cimport X_BITSET_DTYPE_C
from .common cimport X_BINNED_DTYPE_C


cdef void init_bitset(X_BITSET_DTYPE_C bitset) nogil

cdef void insert_bitset(X_BINNED_DTYPE_C val, X_BITSET_DTYPE_C bitset) nogil

cdef unsigned char in_bitset(X_BINNED_DTYPE_C val, X_BITSET_DTYPE_C bitset) nogil
33 changes: 33 additions & 0 deletions sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx
@@ -0,0 +1,33 @@
# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
# cython: language_level=3
from .common cimport X_BITSET_DTYPE_C
from .common cimport X_BINNED_DTYPE_C


cdef inline void init_bitset(X_BITSET_DTYPE_C bitset) nogil: # OUT
cdef:
unsigned int i

for i in range(8):
bitset[i] = 0

cdef inline void insert_bitset(X_BINNED_DTYPE_C val,
X_BITSET_DTYPE_C bitset) nogil: # OUT
cdef:
unsigned int i1 = val / 32
unsigned int i2 = val % 32

# It is assumed that val < 256 or i1 < 8
bitset[i1] |= (1 << i2)

cdef inline unsigned char in_bitset(X_BINNED_DTYPE_C val,
X_BITSET_DTYPE_C bitset) nogil:
cdef:
unsigned int i1 = val / 32
unsigned int i2 = val % 32

if i1 >= 8:
return 0
return (bitset[i1] >> i2) & 1
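
For intuition, here is a minimal Python sketch of the same bitset arithmetic
(illustrative only; the actual implementation is the Cython above):

    # Eight 32-bit words cover binned values in [0, 256).
    bitset = [0] * 8

    def insert_bitset(val, bitset):
        i1, i2 = val // 32, val % 32   # word index, bit index within the word
        bitset[i1] |= 1 << i2

    def in_bitset(val, bitset):
        i1, i2 = val // 32, val % 32
        if i1 >= 8:                    # values >= 256 are never in the set
            return False
        return (bitset[i1] >> i2) & 1 == 1

    insert_bitset(42, bitset)
    assert in_bitset(42, bitset) and not in_bitset(43, bitset)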
45 changes: 34 additions & 11 deletions sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx
@@ -17,44 +17,59 @@ from .common cimport Y_DTYPE_C
from .common import Y_DTYPE
from .common cimport X_BINNED_DTYPE_C
from .common cimport node_struct
from ._bitset cimport in_bitset


def _predict_from_numeric_data(
node_struct [:] nodes,
const X_DTYPE_C [:, :] numeric_data,
const X_BINNED_DTYPE_C [:, :] categorical_data,
const long[:] orig_feature_to_binned_cat,
Y_DTYPE_C [:] out):

cdef:
int i

for i in prange(numeric_data.shape[0], schedule='static', nogil=True):
out[i] = _predict_one_from_numeric_data(nodes, numeric_data, i)
out[i] = _predict_one_from_numeric_data(
nodes, numeric_data, categorical_data,
orig_feature_to_binned_cat, i)


cdef inline Y_DTYPE_C _predict_one_from_numeric_data(
node_struct [:] nodes,
const X_DTYPE_C [:, :] numeric_data,
const X_BINNED_DTYPE_C [:, :] categorical_data,
const long[:] orig_feature_to_binned_cat,
const int row) nogil:
# Need to pass the whole array and the row index, else prange won't work.
# See issue Cython #2798

cdef:
node_struct node = nodes[0]
long cat_idx
Member: Any reason to use long? We usually use unsigned int.

Member Author: To match the dtype of orig_feature_to_binned_cat, but this will change when we do not bin anymore in predict.


while True:
if node.is_leaf:
return node.value

if isnan(numeric_data[row, node.feature_idx]):
if node.missing_go_to_left:
if node.is_categorical:
cat_idx = orig_feature_to_binned_cat[node.feature_idx]
if in_bitset(categorical_data[row, cat_idx], node.cat_threshold):
Member: does lightgbm also do that? I.e. bin during predict, and rely on a bitset of internally encoded features?

Member Author: lightgbm does not bin during predict. It has a dynamically sized bitset that encodes the input categorical features, so it can accept a category with any cardinality.

Currently, the implementation accepts a category with any cardinality. If the cardinality is higher than max_bins, then only the top max_bins categories are kept, ranked by frequency, and the rest are considered missing. In this way, it also handles infrequent categories. This option is more flexible, but it means I have to bin during predict, which is disappointing.

A simpler alternative would be to restrict the input to be "ints" in the range ~ [0, max_bins] and consider anything outside of that range as missing. This would not do anything special to handle infrequent categories, but it would simplify some of the code.

Member: Thoughts @ogrisel? Personally, for a first version, I would prefer keeping things as simple as possible. As such, expecting ints in [0, max_bins] sounds reasonable.

Member Author: I can go either way on this. I spoke to @amueller about this and he seems to prefer the current approach of "binning categories during predict".

node = nodes[node.left]
else:
node = nodes[node.right]
else:
if numeric_data[row, node.feature_idx] <= node.threshold:
node = nodes[node.left]
if isnan(numeric_data[row, node.feature_idx]):
if node.missing_go_to_left:
node = nodes[node.left]
else:
node = nodes[node.right]
else:
node = nodes[node.right]
if numeric_data[row, node.feature_idx] <= node.threshold:
node = nodes[node.left]
else:
node = nodes[node.right]


def _predict_from_binned_data(
@@ -85,16 +100,24 @@ cdef inline Y_DTYPE_C _predict_one_from_binned_data(
while True:
if node.is_leaf:
return node.value
if binned_data[row, node.feature_idx] == missing_values_bin_idx:
if node.missing_go_to_left:

if node.is_categorical:
if in_bitset(binned_data[row, node.feature_idx],
node.cat_threshold):
node = nodes[node.left]
else:
node = nodes[node.right]
else:
if binned_data[row, node.feature_idx] <= node.bin_threshold:
node = nodes[node.left]
if binned_data[row, node.feature_idx] == missing_values_bin_idx:
if node.missing_go_to_left:
node = nodes[node.left]
else:
node = nodes[node.right]
else:
node = nodes[node.right]
if binned_data[row, node.feature_idx] <= node.bin_threshold:
node = nodes[node.left]
else:
node = nodes[node.right]

def _compute_partial_dependence(
node_struct [:] nodes,
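
For intuition, a minimal Python sketch of the binned prediction loop above
(illustrative only; the node fields mirror the node_struct used in this diff):

    def predict_one_binned(nodes, binned_row, missing_values_bin_idx, in_bitset):
        """Walk the tree from the root until a leaf is reached."""
        node = nodes[0]
        while True:
            if node.is_leaf:
                return node.value
            value = binned_row[node.feature_idx]
            if node.is_categorical:
                # Categorical split: go left iff the bin is in the node's bitset.
                go_left = in_bitset(value, node.cat_threshold)
            elif value == missing_values_bin_idx:
                # Missing values follow the direction learned during training.
                go_left = node.missing_go_to_left
            else:
                # Numerical split on the binned value.
                go_left = value <= node.bin_threshold
            node = nodes[node.left] if go_left else nodes[node.right]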