Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sparse matrices in HistGradientBoosting estimators #16885

Closed
NicolasHug opened this issue Apr 9, 2020 · 2 comments
Closed

Support sparse matrices in HistGradientBoosting estimators #16885

NicolasHug opened this issue Apr 9, 2020 · 2 comments

Comments

@NicolasHug
Copy link
Member

This is a placeholder issue for sparse matrices support in the Histogram-based GBDT estimators.

I guess #15550 should be tackled first.


Below are my thoughts and potential plan on the matter, feel free to ignore.

Binning:

We need a utility to compute quantiles on sparse data, and we need to map a float sparse matrix to a binned sparse matrix given those quantiles. To avoid having to densify X_binned, the zeros in X should be mapped to bin 0, even if that's not their actual bin (called actual_bin_zeros). I guess that means all the bins in range(0, actual_bin_zeros) have an offset of 1, i.e. now they're actually mapped to range(1, actual_bin_zeros + 1). Though maybe we can avoid the offset by distinguishing between explicit and implicit zeros, IDK.

Histograms:

We need a histogram builder that can handle sparse data and that is aware of actual_bin_zeros in some way. We can't just build the histograms as usual, because that would mean that the zeros would be treated as the lowest value in the splitter. In the histogram, the zeros should be placed in their proper bin, i.e. at index actual_bin_zeros. This way, the splitter can be left unchanged. The offset of the bins in range(1, actual_bin_zeros) should also be canceled here.

When building a histogram, we can focus only on the non-zeros entries. We already know the totals sum_gradients, sum_hessians, and count at any given node. So we can just go through the samples that have non-zero values and fill-in the histogram at their respective bins, and then set hist[actual_bin_zeros]['grad'] = total_sum_gradients - hist[:]['grad'].sum().

@StealthyKamereon
Copy link

I'm working on it.
Correct me if I'm wrong but isn't it a duplicate (or the source) of #15336 ?

@NicolasHug
Copy link
Member Author

Indeed this is a duplicate, thanks for noting. I'll close this one as the other one has precedence.

Thanks for giving this a try, please ping me on the PR ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants