Support sparse matrices in HistGradientBoosting estimators #16885

NicolasHug · 2020-04-09T23:06:19Z

This is a placeholder issue for sparse matrix support in the histogram-based GBDT estimators.

I guess #15550 should be tackled first.

Below are my thoughts and a potential plan on the matter. Feel free to ignore.

Binning:

We need a utility to compute quantiles on sparse data, and a way to map a float sparse matrix to a binned sparse matrix given those quantiles. To avoid having to densify X_binned, the zeros in X should be mapped to bin 0, even if that's not their actual bin (call it actual_bin_zeros). I guess that means all the bins in range(0, actual_bin_zeros) get an offset of 1, i.e. they're now mapped to range(1, actual_bin_zeros + 1). Though maybe we can avoid the offset by distinguishing between explicit and implicit zeros, IDK.
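A minimal sketch of the mapping for a single column, to make the offset concrete. This is not scikit-learn code: `bin_sparse_column` is a hypothetical helper, the thresholds are assumed to be already computed, binning is assumed to follow a "value <= threshold" rule via `np.searchsorted`, and there are assumed to be fewer than 256 bins. One case the plan above does not settle is a non-zero value that falls in the same bin as zero; the sketch assumes it can also be stored as 0, since it ends up in the same histogram slot anyway.

```python
import numpy as np


def bin_sparse_column(nonzero_values, thresholds):
    """Bin the explicitly stored values of one sparse column.

    Returns (stored_bins, actual_bin_zeros). Implicit zeros are never
    touched: they keep the implicit stored bin 0.
    """
    thresholds = np.asarray(thresholds, dtype=np.float64)
    values = np.asarray(nonzero_values, dtype=np.float64)

    # Bin that a zero would receive under ordinary dense binning.
    actual_bin_zeros = int(np.searchsorted(thresholds, 0.0, side="left"))

    # Ordinary bin of every stored value.
    bins = np.searchsorted(thresholds, values, side="left")

    # Compute both masks before modifying `bins`.
    below = bins < actual_bin_zeros   # bins that need the +1 offset
    same = bins == actual_bin_zeros   # values sharing the bin of zero

    bins[below] += 1                  # range(0, z) -> range(1, z + 1)
    bins[same] = 0                    # assumption: store them like zeros
    # Bins above actual_bin_zeros are left unchanged.

    return bins.astype(np.uint8), actual_bin_zeros


# Stored values of a column whose other entries are implicit zeros.
stored, z = bin_sparse_column([-2.0, 3.0, -0.5], thresholds=[-1.0, 1.0, 2.0])
print(stored, z)
```

With this convention the binned matrix keeps (at most) the sparsity pattern of X, at the cost of having to remember actual_bin_zeros for each feature.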

Histograms:

We need a histogram builder that can handle sparse data and that is aware of actual_bin_zeros in some way. We can't just build the histograms as usual, because then the splitter would treat the zeros as the lowest value. In the histogram, the zeros should be placed in their proper bin, i.e. at index actual_bin_zeros. This way, the splitter can be left unchanged. The offset of the bins in range(1, actual_bin_zeros + 1) should also be canceled here.

When building a histogram, we can focus only on the non-zero entries. We already know the totals sum_gradients, sum_hessians, and count at any given node. So we can just go through the samples that have non-zero values, fill in the histogram at their respective bins, and then set hist[actual_bin_zeros]['grad'] = total_sum_gradients - hist[:]['grad'].sum(), and similarly for the Hessians and the count.
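A rough sketch of that histogram step for one feature at one node, again not actual scikit-learn code. The function name, its signature, and the field names `grad`, `hess` and `count` are made up for illustration. It assumes the stored bins follow the convention above: stored bin 0 means "belongs in actual_bin_zeros", and stored bins in range(1, actual_bin_zeros + 1) carry the +1 offset.

```python
import numpy as np

# Hypothetical histogram layout, one record per bin.
HIST_DTYPE = np.dtype(
    [("grad", np.float64), ("hess", np.float64), ("count", np.uint32)]
)


def build_histogram_sparse(rows, stored_bins, gradients, hessians,
                           actual_bin_zeros, n_bins,
                           sum_gradients, sum_hessians, n_samples):
    """Build one feature's histogram from its stored (non-zero) entries.

    rows / stored_bins: sample indices and stored bins of the explicit
    entries of this feature at the node. The totals are passed in because
    they are already known at the node.
    """
    hist = np.zeros(n_bins, dtype=HIST_DTYPE)
    rows = np.asarray(rows, dtype=np.intp)
    stored = np.asarray(stored_bins, dtype=np.intp)

    # Entries stored as 0 are handled by the subtraction below.
    keep = stored != 0
    rows, stored = rows[keep], stored[keep]

    # Cancel the offset: range(1, z + 1) -> range(0, z).
    actual = np.where(stored <= actual_bin_zeros, stored - 1, stored)

    hist["grad"] = np.bincount(actual, weights=gradients[rows], minlength=n_bins)
    hist["hess"] = np.bincount(actual, weights=hessians[rows], minlength=n_bins)
    hist["count"] = np.bincount(actual, minlength=n_bins)

    # Everything not seen above sits in the bin of zero.
    z = actual_bin_zeros
    hist["grad"][z] = sum_gradients - hist["grad"].sum()
    hist["hess"][z] = sum_hessians - hist["hess"].sum()
    hist["count"][z] = n_samples - hist["count"].sum()
    return hist
```

The loop over samples only touches the stored entries, so the cost per feature scales with the number of non-zeros at the node rather than with the number of samples, and the resulting histogram has the same layout the splitter already expects.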

StealthyKamereon · 2020-12-22T22:56:26Z

I'm working on it.
Correct me if I'm wrong, but isn't this a duplicate (or the source) of #15336?

NicolasHug · 2020-12-23T08:56:55Z

Indeed, this is a duplicate, thanks for noting. I'll close this one as the other one has precedence.

Thanks for giving this a try, please ping me on the PR ;)

NicolasHug added the New Feature label Apr 9, 2020

cmarmo added the module:ensemble label Apr 14, 2020

jeremiedbb mentioned this issue May 18, 2020

HistGradientBoostingClassifier should support sparse matrices like GradientBoostingClassifier #17260

Closed

NicolasHug closed this as completed Dec 23, 2020

NicolasHug mentioned this issue Dec 23, 2020

Add Sparse Matrix Support For HistGradientBoostingClassifier #15336

Open

NicolasHug mentioned this issue Apr 9, 2021

[WIP] Add sparse matrix support for histgradientboostingclassifier #19187

Open
