
[MRG] Native support for missing values in GBDTs #13911

Merged
merged 103 commits into from Aug 21, 2019

Conversation

@NicolasHug (Member) commented May 20, 2019:

This PR implements support for missing values in histogram-based GBDTs.

By missing values, I mean NaNs. I haven't implemented anything related to sparse matrices (that's for another PR).

Here's a breakdown of the changes, which follows LightGBM and XGBoost strategies. It's relatively simple:

  • When binning, missing values are assigned to a specific bin (the first one*). We ignore the missing values when computing the bin thresholds (i.e. we only use non-missing values, just like before).

  • When training, what changes is the strategy to find the best bin: when considering a bin to split on, we compute the gain under two scenarios: mapping the samples with missing values to the left child, or mapping them to the right child.
    Concretely, this is done by scanning the bins from left to right (like we used to) and from right to left. This is what LightGBM does, and it is explained in the XGBoost paper (Algorithm 3).

  • At prediction time, samples with NaNs are mapped to whichever child (left or right) was learned to give the best gain. Note that if there were no missing values during training, then samples with missing values are mapped to whichever child has the most samples. That's what H2O does (I haven't checked the other libraries, but it seems to be the most sensible behavior).

*LightGBM assigns missing values to the last bin. I initially hadn't found a compelling reason to do so and assigned them to the first bin instead, which has the advantage of a fixed index (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter, which is annoying). EDIT: we now assign missing values to the last bin as well.

EDIT: see list of technical changes at #13911 (comment)
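The two-scan split search described in the breakdown above can be sketched as follows. This is a toy pure-NumPy illustration, not the actual Cython implementation: the function name `find_best_split_with_missing` is made up, and the gain formula is the simplified sum_grad²/count criterion without regularization. The real code scans the bins in both directions; toggling where the missing samples go, as done here, enumerates the same candidate splits.

```python
import numpy as np

def find_best_split_with_missing(grad_hist, count_hist,
                                 grad_missing, count_missing):
    """Toy split finder: try both placements of the missing-values bin.

    grad_hist / count_hist hold per-bin sums of gradients and sample
    counts for the non-missing bins; grad_missing / count_missing are
    the totals of the dedicated missing-values bin.
    """
    total_grad = grad_hist.sum() + grad_missing
    total_count = count_hist.sum() + count_missing

    def leaf_value(g, c):
        # Contribution of one child to the gain (sum_grad^2 / count).
        return g * g / c if c > 0 else 0.0

    best_gain, best_bin, best_missing_go_left = -np.inf, None, None
    for missing_go_left in (True, False):
        g_left = grad_missing if missing_go_left else 0.0
        c_left = count_missing if missing_go_left else 0
        for b in range(len(grad_hist) - 1):  # split after bin b
            g_left += grad_hist[b]
            c_left += count_hist[b]
            gain = (leaf_value(g_left, c_left)
                    + leaf_value(total_grad - g_left, total_count - c_left)
                    - leaf_value(total_grad, total_count))
            if gain > best_gain:
                best_gain = gain
                best_bin = b
                best_missing_go_left = missing_go_left
    return best_gain, best_bin, best_missing_go_left
```

The learned `missing_go_left` flag is exactly what gets stored on the tree node and reused at prediction time.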

@NicolasHug (Member, Author) commented:

This comment is for documenting the technical changes made in this PR. Ping @ogrisel when you're back ;)

  • we always allocate one bin for missing values, even if there are no missing values. This is the last bin, its index is always equal to max_bins.
  • max_bins is now capped at 255. It's the number of bins used for non-missing values.
  • private classes (BinMapper, histogram, etc) now take n_bins = max_bins + 1 as argument, since max_bins isn't the number of bins anymore. Histograms have size n_bins, not max_bins.
  • types.pyx and types.pxd have been renamed to common.pyx and common.pxd because we needed a place to declare the ALMOST_INF constant.
  • Support for infinite values is unchanged, but code is a bit different (simplified). If a threshold is found to be +inf, we actually set it to ALMOST_INF = 1e300 like in LightGBM.
    • This avoids having special cases for correctly mapping +inf values when predicting and when binning.
    • This also allows us to set the threshold to +inf iff we are in a split-on-nan situation. Split-on-nan means that all the nans (and only nans) go to the right child, while the rest go to the left child.
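The bin layout above (a dedicated missing-values bin at index max_bins, +inf thresholds clipped to ALMOST_INF) can be sketched like this. It's a toy single-column illustration, not the real BinMapper: `bin_column` is a hypothetical helper, and the real code handles many more details.

```python
import numpy as np

ALMOST_INF = 1e300  # stand-in for the constant described above

def bin_column(values, thresholds, max_bins):
    """Toy binning for one feature column.

    Non-missing values fall into bins 0..max_bins-1 via the (sorted)
    thresholds, with ties going left (value <= threshold); NaNs always
    go to the dedicated last bin, index max_bins, so histograms have
    n_bins = max_bins + 1 entries.
    """
    # Clip +inf thresholds so comparisons stay finite when predicting.
    thresholds = np.minimum(thresholds, ALMOST_INF)
    binned = np.searchsorted(thresholds, values, side='left')
    binned[np.isnan(values)] = max_bins  # the missing-values bin
    return binned.astype(np.uint8)       # max_bins <= 255 fits in uint8
```

Because the missing-values bin is always the last one, its index is simply max_bins and never needs to be recomputed per feature.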

@ogrisel (Member) left a comment:

Hi @NicolasHug! Nice work. I have the following comments but otherwise it looks good.

Furthermore we should document in the API/estimator changes section of the what's new document that the internal data structure of the fitted model has changed to be able to add the native support for missing values.

Trying to load a pickled model fitted with scikit-learn 0.21 in scikit-learn 0.22
will yield an exception such as:

ValueError: Buffer dtype mismatch, expected 'unsigned char' but got 'unsigned int' in
'node_struct.missing_go_to_left'

Model re-training is required in this case.

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the old dtype and reshape it with some padding to make it match. That would be extra nice to our users and avoid confusing them with a complex changelog message. But it would imply writing a test.
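The __setstate__ idea could look roughly like the following. This is a hypothetical sketch with a made-up node dtype: the real PREDICTOR_RECORD_DTYPE in sklearn/ensemble/_hist_gradient_boosting/common.pyx has more fields, and the thread below leaned toward not implementing this at all.

```python
import numpy as np

# Hypothetical node record dtype with the new missing_go_to_left
# field appended; the real dtype has more fields.
NEW_NODE_DTYPE = np.dtype([
    ('value', np.float64),
    ('count', np.uint32),
    ('missing_go_to_left', np.uint8),
])

class Predictor:
    """Toy stand-in for the tree predictor, showing the upgrade path."""

    def __setstate__(self, state):
        nodes = state.get('nodes')
        if nodes is not None and nodes.dtype != NEW_NODE_DTYPE:
            # Old pickle: copy over the fields it has and leave the new
            # missing_go_to_left field at its zero default.
            upgraded = np.zeros(nodes.shape, dtype=NEW_NODE_DTYPE)
            for name in nodes.dtype.names:
                upgraded[name] = nodes[name]
            state = dict(state, nodes=upgraded)
        self.__dict__.update(state)
```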

(Two review comments on sklearn/ensemble/_hist_gradient_boosting/splitting.pyx; outdated, resolved.)
predictions_binned = predictor.predict_binned(
X_binned, missing_values_bin_idx=bin_mapper.missing_values_bin_idx_)
assert np.all(predictions == -gradients)
assert np.all(predictions_binned == -gradients)
@ogrisel (Member) commented on this snippet:

While this is a good test, I wonder whether it would be better to write an equivalent test using only the public API, to make it easier to understand. E.g. something based on the following:

>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.asarray([-np.inf, 0, 1, np.inf, np.nan]).reshape(-1, 1)
>>> X
array([[-inf],
       [  0.],
       [  1.],
       [ inf],
       [ nan]])
>>> y_isnan = np.isnan(X.ravel())                      
>>> y_isnan
array([False, False, False, False,  True])

>>> y_isinf = X.ravel() == np.inf                                                                                                                                               
>>> y_isinf
array([False, False, False,  True, False])

>>> stump_clf = HistGradientBoostingClassifier(min_samples_leaf=1, max_iter=1, learning_rate=1., max_depth=2)
>>> stump_clf.fit(X, y_isinf).score(X, y_isinf)
1.0

>>> stump_clf.fit(X, y_isnan).score(X, y_isnan)
1.0

@NicolasHug (Member, Author) replied:

The issue with the public API is that we can't test the predictions for X_binned which I think is important too.

I'll add your test as well though, it can't hurt ;)


# Make sure in particular that the +inf sample is mapped to the left child
# Note that lightgbm "fails" here and will assign the inf sample to the
# right child, even though it's a "split on nan" situation.
@ogrisel (Member) commented:

Are you sure LightGBM fails in this case? Why would they have introduced AvoidInf() if not for this case?

@NicolasHug (Member, Author) replied:

I opened microsoft/LightGBM#2277; it looks like they just fixed it (I haven't checked again though).

@adrinjalali (Member) commented:

Alternatively, we could override the __setstate__ method to detect whether the predictor array has the old dtype and reshape it with some padding to make it match. That would be extra nice to our users and avoid confusing them with a complex changelog message. But it would imply writing a test.

Since this is still experimental, I would rather not worry about that. Adding sample weights and then categorical data would probably also cause the same issue.

@ogrisel (Member) commented Aug 20, 2019:

I doubt that sample_weights will change the structure of the parameters used by the predictor function but I get your point.

@NicolasHug (Member, Author) commented:

Thanks Olivier, I have addressed comments and updated the whatsnew with a compound entry of all the changes so far.

I labeled this as a MajorFeature, but feel free to change, I'm not sure.

Regarding the pickles: we don't even support pickling between major versions so I'm not sure we should have a special case for these estimators

@ogrisel (Member) commented Aug 20, 2019:

Regarding the pickles: we don't even support pickling between major versions so I'm not sure we should have a special case for these estimators

Well, it's cheap (and friendly) to warn the users in the changelog when that actually happens.

@ogrisel (Member) left a comment:
LGTM but some more comments / questions ;)

Whether this is to be considered a major feature can be changed later (either at release time, or after discussion at the next dev meeting).

(Two review comments on doc/modules/ensemble.rst; resolved.)
@NicolasHug (Member, Author) commented:

@adrinjalali Does this have your +1 as well?

Just in case, maybe @thomasjpfan @jnothman @rth @glemaitre @qinhanmin2014 would want to give it a quick look before we merge?

@jnothman (Member) commented:

I like the proposal. I don't have time soon to check for correctness, unfortunately.

@jnothman (Member) left a comment:

At some point our docs should have a reference listing of estimators that support missing values.

(Review comment on doc/modules/ensemble.rst; outdated, resolved.)
@adrinjalali (Member) commented:

The example in the docstrings still needs fixing.

@ogrisel (Member) commented Aug 21, 2019:

The example in the docstrings still needs fixing.

Indeed. This time I ran pytest locally prior to pushing :)

@adrinjalali (Member) left a comment:

This looks good now, and I guess the three of us are in agreement. Merging; nitpicks can go in other PRs. I'd rather have this in, and there's time to fix the issues before the release.
