Below are my thoughts and a potential plan on the matter; feel free to ignore.
Binning:
We need a utility to compute quantiles on sparse data, and we need to map a float sparse matrix to a binned sparse matrix given those quantiles. To avoid having to densify X_binned, the zeros in X should be mapped to bin 0, even if their actual bin (call it actual_bin_zeros) is a different one. I guess that means all the bins in range(0, actual_bin_zeros) get an offset of 1, i.e. they're now mapped to range(1, actual_bin_zeros + 1). Though maybe we can avoid the offset by distinguishing between explicit and implicit zeros, IDK.
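To make the mapping concrete, here is a rough sketch of both pieces, assuming CSC input. The helper names (sparse_column_quantiles, bin_sparse_csc), the nearest-rank quantile definition, and the searchsorted-based binning rule are all assumptions for illustration, not existing estimator code. Stored values that fall into the same bin as zero are simply dropped from the binned matrix, which is one possible choice among others.

```python
import numpy as np
from scipy import sparse


def sparse_column_quantiles(stored, n_samples, probs):
    """Nearest-rank quantiles of one column, counting implicit zeros.

    `stored` holds the explicitly stored values of the column; the
    remaining n_samples - len(stored) entries are implicit zeros.
    """
    stored = np.sort(stored[stored != 0])
    n_neg = int(np.searchsorted(stored, 0.0))  # number of negative values
    n_zeros = n_samples - stored.size          # implicit + explicit zeros
    ranks = (np.asarray(probs, dtype=float) * n_samples).astype(np.intp)
    ranks = np.minimum(ranks, n_samples - 1)
    out = np.zeros(ranks.shape)                # ranks inside the zero run stay 0
    neg = ranks < n_neg
    pos = ranks >= n_neg + n_zeros
    out[neg] = stored[ranks[neg]]
    out[pos] = stored[ranks[pos] - n_zeros]    # skip over the run of zeros
    return out


def bin_sparse_csc(X, thresholds_per_feature):
    """Map a float CSC matrix to a binned CSC matrix without densifying.

    Zeros are stored as bin 0. Bins below the zero bin are shifted up by 1,
    bins above it are unchanged. Returns (X_binned, actual_bin_zeros).
    """
    X = sparse.csc_matrix(X)
    # uint8 is an assumption: it limits this sketch to at most 255 bins.
    data = np.zeros(X.data.shape, dtype=np.uint8)
    actual_bin_zeros = np.zeros(X.shape[1], dtype=np.uint8)
    for j in range(X.shape[1]):
        thr = thresholds_per_feature[j]
        start, end = X.indptr[j], X.indptr[j + 1]
        actual = np.searchsorted(thr, X.data[start:end], side="left")
        zero_bin = np.searchsorted(thr, 0.0, side="left")
        stored = np.where(actual < zero_bin, actual + 1, actual)
        stored[actual == zero_bin] = 0         # same bin as zero: treat as zero
        data[start:end] = stored
        actual_bin_zeros[j] = zero_bin
    X_binned = sparse.csc_matrix(
        (data, X.indices.copy(), X.indptr.copy()), shape=X.shape
    )
    X_binned.eliminate_zeros()                 # drop entries that became bin 0
    return X_binned, actual_bin_zeros
```

The stored bins are a permutation of the actual bins for each feature, so nothing is lost as long as actual_bin_zeros is kept alongside X_binned.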
Histograms:
We need a histogram builder that can handle sparse data and that is aware of actual_bin_zeros in some way. We can't just build the histograms as usual, because then the splitter would treat the zeros as the lowest value. In the histogram, the zeros should be placed in their proper bin, i.e. at index actual_bin_zeros. This way, the splitter can be left unchanged. The offset of the bins in range(1, actual_bin_zeros + 1) should also be canceled here.
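Canceling the offset could look something like the sketch below. The function name stored_to_actual_bin is hypothetical, and the convention it decodes (stored 0 means the zero bin, stored bins 1 to actual_bin_zeros are shifted up by one, higher bins are unchanged) is the one described in the binning section, not something the current code does.

```python
import numpy as np


def stored_to_actual_bin(stored_bins, actual_bin_zeros):
    """Undo the +1 offset so histogram indices are the actual bins."""
    # Cast to a signed integer type first so that `stored - 1` cannot
    # wrap around when the input is an unsigned dtype such as uint8.
    stored = np.asarray(stored_bins).astype(np.intp)
    actual = np.where(stored <= actual_bin_zeros, stored - 1, stored)
    actual[stored == 0] = actual_bin_zeros  # stored 0 is the zero bin
    return actual
```

With actual_bin_zeros = 3, the stored bins 0, 1, 2, 3, 4 decode to 3, 0, 1, 2, 4, so the histogram is indexed exactly as the splitter expects.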
When building a histogram, we can focus only on the non-zero entries. We already know the totals sum_gradients, sum_hessians, and count at any given node. So we can just go through the samples that have non-zero values and fill in the histogram at their respective bins, and then set hist[actual_bin_zeros]['grad'] = total_sum_gradients - hist[:]['grad'].sum() (and likewise for the hessians and the count).
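A minimal sketch of that subtraction trick for a single feature at a single node, in plain NumPy. The dtype HIST_DTYPE, its field names (grad, hess, count, following the notation above), and the function build_sparse_histogram are made up for illustration. The bins passed in are assumed to be the actual bins of the non-zero samples, with the offset already canceled.

```python
import numpy as np

# Hypothetical record layout, following the field naming used above.
HIST_DTYPE = np.dtype(
    [("grad", np.float64), ("hess", np.float64), ("count", np.int64)]
)


def build_sparse_histogram(nonzero_bins, nonzero_grad, nonzero_hess,
                           sum_gradients, sum_hessians, count,
                           actual_bin_zeros, n_bins):
    """Histogram of one feature at one node, visiting non-zero samples only.

    sum_gradients, sum_hessians and count are the node totals, which are
    already known before the histogram is built.
    """
    hist = np.zeros(n_bins, dtype=HIST_DTYPE)
    # Accumulate only the samples that have a non-zero value.
    hist["grad"] = np.bincount(nonzero_bins, weights=nonzero_grad, minlength=n_bins)
    hist["hess"] = np.bincount(nonzero_bins, weights=nonzero_hess, minlength=n_bins)
    hist["count"] = np.bincount(nonzero_bins, minlength=n_bins)
    # Whatever is missing from the node totals belongs to the zeros.
    # Using += (rather than =) keeps this correct even if some non-zero
    # samples already landed in the zero bin.
    hist["grad"][actual_bin_zeros] += sum_gradients - hist["grad"].sum()
    hist["hess"][actual_bin_zeros] += sum_hessians - hist["hess"].sum()
    hist["count"][actual_bin_zeros] += count - hist["count"].sum()
    return hist
```

The cost is proportional to the number of non-zero entries at the node plus n_bins, instead of the number of samples at the node.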
This is a placeholder issue for sparse matrix support in the Histogram-based GBDT estimators.
I guess #15550 should be tackled first.