
[MRG] Native support for missing values in GBDTs #13911

Merged
merged 103 commits into from Aug 21, 2019

Conversation

@NicolasHug (Member) commented May 20, 2019:

This PR implements support for missing values in histogram-based GBDTs.

By missing values, I mean NaNs. I haven't implemented anything related to sparse matrices (that's for another PR).

Here's a breakdown of the changes, which follows LightGBM and XGBoost strategies. It's relatively simple:

  • When binning, missing values are assigned to a specific bin (the first one*). We ignore the missing values when computing the bin thresholds (i.e. we only use non-missing values, just like before).

  • When training, what changes is the strategy to find the best bin: when considering a bin to split on, we compute the gain under two scenarios: mapping the samples with missing values to the left child, or mapping them to the right child.
    Concretely, this is done by scanning the bins from left to right (like we used to) and from right to left. This is what LightGBM does, and it is explained in the XGBoost paper (Algorithm 3).

  • At prediction time, samples with NaNs are mapped to whichever child (left or right) was learned to give the best gain. Note that if there were no missing values during training, then samples with missing values are mapped to whichever child has the most samples. That's what H2O does (I haven't checked the other libraries, but it seems to be the most sensible behavior).

*LightGBM assigns missing values to the last bin. I initially hadn't found a compelling reason to do so and assigned them to the first bin instead, which has the advantage of a fixed index (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter, which is annoying). EDIT: we now assign missing values to the last bin as well.

EDIT: see list of technical changes at #13911 (comment)
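The two-scan split search described in the breakdown above can be sketched as follows. This is a toy pure-NumPy illustration, not the actual Cython implementation: the function name `find_best_split_with_missing` is made up, and the gain formula is the simplified sum_grad²/count criterion without regularization. The real code scans the bins in both directions; toggling where the missing samples go, as done here, enumerates the same candidate splits.

```python
import numpy as np

def find_best_split_with_missing(grad_hist, count_hist,
                                 grad_missing, count_missing):
    """Toy split finder: try both placements of the missing-values bin.

    grad_hist / count_hist hold per-bin sums of gradients and sample
    counts for the non-missing bins; grad_missing / count_missing are
    the totals of the dedicated missing-values bin.
    """
    total_grad = grad_hist.sum() + grad_missing
    total_count = count_hist.sum() + count_missing

    def leaf_value(g, c):
        # Contribution of one child to the gain (sum_grad^2 / count).
        return g * g / c if c > 0 else 0.0

    best_gain, best_bin, best_missing_go_left = -np.inf, None, None
    for missing_go_left in (True, False):
        g_left = grad_missing if missing_go_left else 0.0
        c_left = count_missing if missing_go_left else 0
        for b in range(len(grad_hist) - 1):  # split after bin b
            g_left += grad_hist[b]
            c_left += count_hist[b]
            gain = (leaf_value(g_left, c_left)
                    + leaf_value(total_grad - g_left, total_count - c_left)
                    - leaf_value(total_grad, total_count))
            if gain > best_gain:
                best_gain = gain
                best_bin = b
                best_missing_go_left = missing_go_left
    return best_gain, best_bin, best_missing_go_left
```

The learned `missing_go_left` flag is exactly what gets stored on the tree node and reused at prediction time.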

@NicolasHug (Member, Author) commented:

This comment is for documenting the technical changes made in this PR. Ping @ogrisel when you're back ;)

  • we always allocate one bin for missing values, even if there are no missing values. This is the last bin, its index is always equal to max_bins.
  • max_bins is now capped at 255. It's the number of bins used for non-missing values.
  • private classes (BinMapper, histogram, etc) now take n_bins = max_bins + 1 as argument, since max_bins isn't the number of bins anymore. Histograms have size n_bins, not max_bins.
  • types.pyx and types.pxd have been renamed to common.pyx and common.pxd because we needed a place to declare the ALMOST_INF constant.
  • Support for infinite values is unchanged, but code is a bit different (simplified). If a threshold is found to be +inf, we actually set it to ALMOST_INF = 1e300 like in LightGBM.
    • This avoids having special cases for correctly mapping +inf values when predicting and when binning.
    • This also allows us to set the threshold to +inf iff we are in a split-on-nan situation. Split-on-nan means that all the nans (and only nans) go to the right child, while the rest go to the left child.
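The bin layout above (a dedicated missing-values bin at index max_bins, +inf thresholds clipped to ALMOST_INF) can be sketched like this. It's a toy single-column illustration, not the real BinMapper: `bin_column` is a hypothetical helper, and the real code handles many more details.

```python
import numpy as np

ALMOST_INF = 1e300  # stand-in for the constant described above

def bin_column(values, thresholds, max_bins):
    """Toy binning for one feature column.

    Non-missing values fall into bins 0..max_bins-1 via the (sorted)
    thresholds, with ties going left (value <= threshold); NaNs always
    go to the dedicated last bin, index max_bins, so histograms have
    n_bins = max_bins + 1 entries.
    """
    # Clip +inf thresholds so comparisons stay finite when predicting.
    thresholds = np.minimum(thresholds, ALMOST_INF)
    binned = np.searchsorted(thresholds, values, side='left')
    binned[np.isnan(values)] = max_bins  # the missing-values bin
    return binned.astype(np.uint8)       # max_bins <= 255 fits in uint8
```

Because the missing-values bin is always the last one, its index is simply max_bins and never needs to be recomputed per feature.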

@ogrisel (Member) left a comment:

Hi @NicolasHug! Nice work. I have the following comments but otherwise it looks good.

Furthermore we should document in the API/estimator changes section of the what's new document that the internal data structure of the fitted model has changed to be able to add the native support for missing values.

Trying to load a pickled model fitted with scikit-learn 0.21 in scikit-learn 0.22
will yield an exception such as:

ValueError: Buffer dtype mismatch, expected 'unsigned char' but got 'unsigned int' in
'node_struct.missing_go_to_left'

Model re-training is required in this case.

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the old dtype and reshape it with some padding to make it match. That would be extra nice to our users and avoid confusing them with a complex changelog message. But it would imply writing a test.
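The __setstate__ idea could look roughly like the following. This is a hypothetical sketch with a made-up node dtype: the real PREDICTOR_RECORD_DTYPE in sklearn/ensemble/_hist_gradient_boosting/common.pyx has more fields, and the thread below leaned toward not implementing this at all.

```python
import numpy as np

# Hypothetical node record dtype with the new missing_go_to_left
# field appended; the real dtype has more fields.
NEW_NODE_DTYPE = np.dtype([
    ('value', np.float64),
    ('count', np.uint32),
    ('missing_go_to_left', np.uint8),
])

class Predictor:
    """Toy stand-in for the tree predictor, showing the upgrade path."""

    def __setstate__(self, state):
        nodes = state.get('nodes')
        if nodes is not None and nodes.dtype != NEW_NODE_DTYPE:
            # Old pickle: copy over the fields it has and leave the new
            # missing_go_to_left field at its zero default.
            upgraded = np.zeros(nodes.shape, dtype=NEW_NODE_DTYPE)
            for name in nodes.dtype.names:
                upgraded[name] = nodes[name]
            state = dict(state, nodes=upgraded)
        self.__dict__.update(state)
```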

(Two review comments on sklearn/ensemble/_hist_gradient_boosting/splitting.pyx; outdated, resolved.)
predictions_binned = predictor.predict_binned(
X_binned, missing_values_bin_idx=bin_mapper.missing_values_bin_idx_)
assert np.all(predictions == -gradients)
assert np.all(predictions_binned == -gradients)
@ogrisel (Member) commented on this snippet:

While this is a good test, I wonder whether it would be better to write an equivalent test using only the public API, to make it easier to understand. E.g. something based on the following:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.asarray([-np.inf, 0, 1, np.inf, np.nan]).reshape(-1, 1)
>>> X
array([[-inf],
       [  0.],
       [  1.],
       [ inf],
       [ nan]])
>>> y_isnan = np.isnan(X.ravel())                      
>>> y_isnan
array([False, False, False, False,  True])

>>> y_isinf = X.ravel() == np.inf                                                                                                                                               
>>> y_isinf
array([False, False, False,  True, False])

>>> stump_clf = HistGradientBoostingClassifier(min_samples_leaf=1, max_iter=1, learning_rate=1., max_depth=2)
>>> stump_clf.fit(X, y_isinf).score(X, y_isinf)
1.0

>>> stump_clf.fit(X, y_isnan).score(X, y_isnan)
1.0

@NicolasHug (Member, Author) replied:

The issue with the public API is that we can't test the predictions for X_binned which I think is important too.

I'll add your test as well though, it can't hurt ;)


# Make sure in particular that the +inf sample is mapped to the left child
# Note that lightgbm "fails" here and will assign the inf sample to the
# right child, even though it's a "split on nan" situation.
@ogrisel (Member) commented:

Are you sure LightGBM fails in this case? Why would they have introduced AvoidInf() if not for this case?

@NicolasHug (Member, Author) replied:

I opened microsoft/LightGBM#2277; it looks like they just fixed it (I haven't checked again though).

@adrinjalali (Member) commented:

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the old dtype and reshape it with some padding to make it match. That would be extra nice to our users and avoid confusing them with a complex changelog message. But it would imply writing a test.

Since this is still experimental, I would rather not worry about that. Adding sample weights and then categorical data would probably also cause the same issue.

@ogrisel (Member) commented Aug 20, 2019:

I doubt that sample_weights will change the structure of the parameters used by the predictor function but I get your point.

@NicolasHug (Member, Author) commented:

Thanks Olivier, I have addressed comments and updated the whatsnew with a compound entry of all the changes so far.

I labeled this as a MajorFeature, but feel free to change, I'm not sure.

Regarding the pickles: we don't even support pickling between major versions so I'm not sure we should have a special case for these estimators

@ogrisel (Member) commented Aug 20, 2019:

Regarding the pickles: we don't even support pickling between major versions so I'm not sure we should have a special case for these estimators

Well, it's cheap (and friendly) to warn the users in the changelog when that actually happens.

@ogrisel (Member) left a comment:
LGTM but some more comments / questions ;)

Whether this is to be considered a major feature can be changed later (either at release time, or after discussion at the next dev meeting).

(Two review comments on doc/modules/ensemble.rst; resolved.)
@NicolasHug (Member, Author) commented:

@adrinjalali Does this have your +1 as well?

Just in case, maybe @thomasjpfan @jnothman @rth @glemaitre @qinhanmin2014 would want to give it a quick look before we merge?

@jnothman (Member) commented:

I like the proposal. I don't have time soon to check for correctness, unfortunately.

@jnothman (Member) left a comment:

At some point our docs should have a reference listing of estimators that support missing values.

(Review comment on doc/modules/ensemble.rst; outdated, resolved.)
@adrinjalali (Member) commented:

The example in the docstrings still needs fixing.

@ogrisel (Member) commented Aug 21, 2019:

The example in the docstrings still needs fixing.

Indeed. This time I ran pytest locally prior to pushing :)

@adrinjalali (Member) left a comment:

This looks good now, and I guess the three of us are in agreement. Merging; nitpicks can go in other PRs. I'd rather have this in, and there's time to fix the issues before the release.
