
[MRG] Native support for missing values in GBDTs #13911

Merged
merged 103 commits into scikit-learn:master on Aug 21, 2019

Conversation

Contributor

NicolasHug commented May 20, 2019

This PR implements support for missing values in histogram-based GBDTs.

By missing values, I mean NaNs. I haven't implemented anything related to sparse matrices (that's for another PR).

Here's a breakdown of the changes, which follow the LightGBM and XGBoost strategies. It's relatively simple:

  • When binning, missing values are assigned to a specific bin (the first one*). We ignore the missing values when computing the bin thresholds (i.e. we only use non-missing values, just like before).

  • When training, what changes is the strategy to find the best split: when considering a bin to split on, we compute the gain according to two scenarios: mapping the samples with missing values to the left child, or mapping them to the right child.
    Concretely, this is done by scanning the bins from left to right (like we used to) and from right to left (see the sketch after this list). This is what LightGBM does, and it's explained in the XGBoost paper (Algorithm 3).

  • At prediction time, samples with NaNs are mapped to whichever child (left or right) was learned to give the best gain. Note that if there were no missing values during training, then samples with missing values are mapped to whichever child has the most samples. That's what H2O does (I haven't checked the other libraries, but that seems to be the most sensible behavior).

*LightGBM assigns missing values to the last bin. I hadn't found a compelling reason to do so, and assigned missing values to the first bin instead, which has the advantage of a fixed index of 0 (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter, which is annoying). EDIT: we now assign missing values to the last bin as well, see the technical changes below.
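A minimal sketch of the bidirectional scan in plain Python/NumPy, for illustration only: the find_best_split name and the toy gain criterion are made up for this sketch, and the real code works on gradient/hessian histograms in Cython.

import numpy as np

def find_best_split(grad_per_bin, grad_missing):
    """Toy scan: return (bin_idx, missing_go_left, gain) for the best candidate."""
    grad_per_bin = np.asarray(grad_per_bin, dtype=float)
    total = grad_per_bin.sum() + grad_missing
    n_bins = len(grad_per_bin)

    def gain(g_left, g_right):
        # Toy criterion: reward splits that separate large gradient sums.
        return g_left ** 2 + g_right ** 2 - total ** 2

    best_bin, best_missing_go_left, best_gain = -1, False, -np.inf

    # Left-to-right scan: samples with missing values go to the RIGHT child.
    g_left = 0.0
    for b in range(n_bins - 1):
        g_left += grad_per_bin[b]
        g = gain(g_left, total - g_left)
        if g > best_gain:
            best_bin, best_missing_go_left, best_gain = b, False, g

    # Right-to-left scan: samples with missing values go to the LEFT child.
    g_right = 0.0
    for b in range(n_bins - 1, 0, -1):
        g_right += grad_per_bin[b]
        g = gain(total - g_right, g_right)
        if g > best_gain:
            best_bin, best_missing_go_left, best_gain = b - 1, True, g

    return best_bin, best_missing_go_left, best_gain

The direction that wins (missing_go_left) is what gets stored on the node and reused at prediction time to route NaNs.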

EDIT: see list of technical changes at #13911 (comment)

NicolasHug added 5 commits May 20, 2019
WIP
Contributor Author

NicolasHug commented Jul 26, 2019

This comment is for documenting the technical changes made in this PR. Ping @ogrisel when you're back ;)

  • We always allocate one bin for missing values, even if there are no missing values. This is the last bin; its index is always equal to max_bins (see the bin-layout sketch after this list).
  • max_bins is now capped at 255. It is the number of bins used for non-missing values.
  • Private classes (BinMapper, histograms, etc.) now take n_bins = max_bins + 1 as argument, since max_bins isn't the total number of bins anymore. Histograms have size n_bins, not max_bins.
  • types.pyx and types.pxd have been renamed to common.*** because we needed to declare the ALMOST_INF constant.
  • Support for infinite values is unchanged, but the code is a bit different (simplified). If a threshold is found to be +inf, we actually set it to ALMOST_INF = 1e300, like in LightGBM.
    • This avoids having special cases for correctly mapping +inf values when predicting and when binning.
    • This also allows us to set the threshold to +inf iff we are in a split-on-nan situation. Split-on-nan means that all the nans (and only nans) go to the right child, while the rest go to the left child.
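A tiny illustration of this bin layout in plain NumPy, as a hedged sketch only (map_to_bins and its threshold handling are simplified and are not the actual BinMapper code):

import numpy as np

ALMOST_INF = 1e300  # used instead of +inf thresholds, as in LightGBM

def map_to_bins(values, thresholds, max_bins):
    """Toy binning: `thresholds` has max_bins - 1 entries, so non-missing
    values land in bins 0 .. max_bins - 1; NaNs go to the extra last bin."""
    n_bins = max_bins + 1              # one extra bin, always allocated
    missing_values_bin_idx = max_bins  # index of the missing-values bin
    thresholds = np.minimum(thresholds, ALMOST_INF)  # never compare to +inf
    # values v with thresholds[b-1] < v <= thresholds[b] fall into bin b
    binned = np.searchsorted(thresholds, values)
    binned[np.isnan(values)] = missing_values_bin_idx
    return binned.astype(np.uint8), n_bins

Because thresholds are capped at ALMOST_INF, a +inf feature value simply lands in the last non-missing bin, with no special case needed at binning or prediction time.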
NicolasHug and others added 4 commits Aug 5, 2019
Member

ogrisel left a comment

Hi @NicolasHug! Nice work. I have the following comments but otherwise it looks good.

Furthermore, we should document in the API/estimator changes section of the what's new document that the internal data structure of the fitted model has changed in order to add native support for missing values.

Trying to load a pickled model fitted with scikit-learn 0.21 in scikit-learn 0.22
will yield an exception such as:

ValueError: Buffer dtype mismatch, expected 'unsigned char' but got 'unsigned int' in
'node_struct.missing_go_to_left'

Model re-training is required in this case.

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the valid dtype and, if not, reshape it with some padding to make it match (a rough sketch of the idea follows). That would be extra nice to our users, and we wouldn't have to confuse them with a complex changelog message. But this would imply writing a test.
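A rough sketch of that idea, using a hypothetical helper that is not part of this PR: copy the old structured node array into the new dtype and zero-fill the fields that did not exist before, such as missing_go_to_left.

import numpy as np

def _upgrade_nodes(old_nodes, new_dtype):
    """Hypothetical helper: pad an old predictor node array so that fields
    added later (e.g. 'missing_go_to_left') exist and default to 0."""
    new_nodes = np.zeros(old_nodes.shape, dtype=new_dtype)
    for name in old_nodes.dtype.names:
        if name in new_dtype.names:
            new_nodes[name] = old_nodes[name]
    return new_nodes

A __setstate__ override could then run such a helper over the unpickled predictor arrays before restoring the rest of the state.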

predictions_binned = predictor.predict_binned(
    X_binned, missing_values_bin_idx=bin_mapper.missing_values_bin_idx_)
assert np.all(predictions == -gradients)
assert np.all(predictions_binned == -gradients)


ogrisel Aug 20, 2019

Member

While this is a good test, I wonder if it wouldn't be better to write something equivalent using only the public API, to make it easier to understand. E.g. something based on the following:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.asarray([-np.inf, 0, 1, np.inf, np.nan]).reshape(-1, 1)
>>> X
array([[-inf],
       [  0.],
       [  1.],
       [ inf],
       [ nan]])
>>> y_isnan = np.isnan(X.ravel())                      
>>> y_isnan
array([False, False, False, False,  True])

>>> y_isinf = X.ravel() == np.inf                                                                                                                                               
>>> y_isinf
array([False, False, False,  True, False])

>>> stump_clf = HistGradientBoostingClassifier(min_samples_leaf=1, max_iter=1, learning_rate=1., max_depth=2)
>>> stump_clf.fit(X, y_isinf).score(X, y_isinf)
1.0

>>> stump_clf.fit(X, y_isnan).score(X, y_isnan)
1.0


NicolasHug Aug 20, 2019

Author Contributor

The issue with the public API is that we can't test the predictions for X_binned, which I think is important too.

I'll add your test as well though, it can't hurt ;)


# Make sure in particular that the +inf sample is mapped to the left child
# Note that lightgbm "fails" here and will assign the inf sample to the
# right child, even though it's a "split on nan" situation.


ogrisel Aug 20, 2019

Member

Are you sure LightGBM fails in this case? Why would they have introduced AvoidInf() if not for this case?


NicolasHug Aug 20, 2019

Author Contributor

I opened microsoft/LightGBM#2277; it looks like they just fixed it (I haven't checked again though).

Member

adrinjalali commented Aug 20, 2019

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the valid dtype and, if not, reshape it with some padding to make it match. That would be extra nice to our users, and we wouldn't have to confuse them with a complex changelog message. But this would imply writing a test.

Since this is still experimental, I would rather not worry about that. Adding sample weights and then categorical data would probably also cause the same issue.

Member

ogrisel commented Aug 20, 2019

I doubt that sample_weights will change the structure of the parameters used by the predictor function, but I get your point.

Contributor Author

NicolasHug commented Aug 20, 2019

Thanks Olivier, I have addressed comments and updated the whatsnew with a compound entry of all the changes so far.

I labeled this as a MajorFeature, but feel free to change it; I'm not sure.

Regarding the pickles: we don't even support pickling between major versions, so I'm not sure we should have a special case for these estimators.

Member

ogrisel commented Aug 20, 2019

Regarding the pickles: we don't even support pickling between major versions, so I'm not sure we should have a special case for these estimators.

Well it's cheap (and friendly) to warn the users in the changelog when that actually happens.

Member

ogrisel left a comment

LGTM but some more comments / questions ;)

Whether this is to be considered a major feature or not can be changed later (either at release time or at the next dev meeting).

Contributor Author

NicolasHug commented Aug 20, 2019

@adrinjalali Does this have your +1 as well?

Just in case, maybe @thomasjpfan @jnothman @rth @glemaitre @qinhanmin2014 would want to give it a quick look before we merge?

Member

jnothman commented Aug 20, 2019

I like the proposal. I don't have time soon to check for correctness, unfortunately.

Member

jnothman left a comment

At some point our docs should have a reference listing of the estimators that support missing values.

Member

adrinjalali commented Aug 21, 2019

The examples in the docstrings still need fixing.

Member

ogrisel commented Aug 21, 2019

The examples in the docstrings still need fixing.

Indeed. This time I ran pytest locally prior to pushing :)

Member

adrinjalali left a comment

This looks good now, and I guess the three of us are in agreement. Merging; nitpicks can go in other PRs. I'd rather have this in, and there's time to fix the issues before the release.

@adrinjalali adrinjalali merged commit 4b6273b into scikit-learn:master Aug 21, 2019
15 checks passed
LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
scikit-learn.scikit-learn Build #20190821.13 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) succeeded
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas) succeeded
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas) succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit) succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit) succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl) succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl) succeeded
thomasjpfan mentioned this pull request Oct 10, 2019 (10 of 13 tasks complete)
adrinjalali moved this from To do to Done in Missing Values/Imputation, Oct 21, 2019