Support for using tdigests to compute approximate percentiles. #8983

Merged
merged 34 commits into from Sep 24, 2021
Changes from 1 commit
34 commits
a37b539
Add groupby_aggregation and groupby_scan_aggregation classes and forc…
nvdbaranec Jul 29, 2021
2dc3fd6
Add python bindings for new aggregation types. Fix JNI bindings and /…
nvdbaranec Aug 2, 2021
767f1dd
Formatting changes.
nvdbaranec Aug 2, 2021
9477f88
Fixed doc typo. Removed unneeded declarations from GroupbyAggregatio…
nvdbaranec Aug 2, 2021
d51d583
Super rough draft.
nvdbaranec Aug 4, 2021
e47e4f2
Merge tdigest aggregation working fully.
nvdbaranec Aug 5, 2021
3c4ce03
Cleanup and documentation.
nvdbaranec Aug 5, 2021
0d36303
Merge branch 'branch-21.10' into percentile_approx
nvdbaranec Aug 19, 2021
c405f7d
Merge branch 'branch-21.10' into percentile_approx
nvdbaranec Aug 20, 2021
8bc8e12
Docs and general cleanup. Removed code duplication in python from an …
nvdbaranec Aug 20, 2021
ea1662e
Move some code around to more logical file locations. Make sure to ty…
nvdbaranec Aug 23, 2021
b755e0f
Add tdigest.hpp
nvdbaranec Aug 23, 2021
352b8fc
Intermediate checkin. Properly handle nulls when computing cluster s…
nvdbaranec Aug 25, 2021
4f90bfd
Fix bugs in merge_tdigest aggregation. Use group_labels provided by …
nvdbaranec Aug 30, 2021
f967add
Handle empty or otherwise unweighted data when creating and merging t…
nvdbaranec Sep 2, 2021
d36f2d7
Fix an issue where small numbers of inputs could lead to holes in cen…
nvdbaranec Sep 8, 2021
e5201aa
Merge branch 'branch-21.10' into percentile_approx
nvdbaranec Sep 9, 2021
40d2063
Explicitly store source min and max values when creating and merging …
nvdbaranec Sep 10, 2021
abd324f
Add parameter to allow user to specify output column type for percent…
nvdbaranec Sep 15, 2021
fdf3315
Address PR review comments.
nvdbaranec Sep 15, 2021
036d2e7
Comment update.
nvdbaranec Sep 15, 2021
7ebffb3
Add more tests. Make floating point precision ulp level a parameter t…
nvdbaranec Sep 17, 2021
60d7969
Add grouped tests to percentile_approx tests.
nvdbaranec Sep 17, 2021
3e0119b
Fixed bug with min/max values in tdigest generation from decimal. Fix…
nvdbaranec Sep 19, 2021
8e14771
Tweak some parameters to speed up percentile approx tests. Removed a…
nvdbaranec Sep 19, 2021
fe3ecae
Doc update
nvdbaranec Sep 20, 2021
e1ccef5
PR review comments.
nvdbaranec Sep 21, 2021
708548f
More PR review changes.
nvdbaranec Sep 21, 2021
f0fb57f
Null percentiles return the minimum value. Empty percentiles return …
nvdbaranec Sep 22, 2021
4083e54
Wave of review changes. Notably: add tdigest namespace to help grou…
nvdbaranec Sep 22, 2021
5fbdaca
Misc PR review changes. Use 'centroid' instead of 'centroid_tuple' i…
nvdbaranec Sep 22, 2021
564f9c7
Refactor the guts of the percentile_approx kernel to be considerably …
nvdbaranec Sep 22, 2021
fb2816e
Another round of review changes.
nvdbaranec Sep 23, 2021
e5e9360
Change percentile_approx so null percentiles result in null results (…
nvdbaranec Sep 23, 2021
37 changes: 23 additions & 14 deletions cpp/src/quantiles/tdigest/tdigest.cu
@@ -69,14 +69,11 @@ __global__ void compute_percentiles_kernel(device_span<offset_type const> tdiges

   // size of the digest we're querying
   auto const tdigest_size = tdigest_offsets[tdigest_index + 1] - tdigest_offsets[tdigest_index];
-  if (tdigest_size == 0) {
-    // no work to do. values will be set to null
-    return;
-  }
+  // no work to do. values will be set to null
+  if (tdigest_size == 0 || !percentiles.is_valid(pindex)) { return; }

   output[tid] = [&]() {
-    double const percentage =
-      percentiles.is_valid(pindex) ? percentiles.element<double>(pindex) : 0.0;
+    double const percentage = percentiles.element<double>(pindex);
     double const* cumulative_weight = cumulative_weight_ + tdigest_offsets[tdigest_index];

     // centroids for this particular tdigest
@@ -207,15 +204,27 @@ std::unique_ptr<column> compute_approx_percentiles(structs_column_view const& in
                          weight.begin<double>(),
                          cumulative_weights->mutable_view().begin<double>());

-  // output is a column of size input.size() * percentiles.size()
-  auto result = cudf::make_fixed_width_column(data_type{type_id::FLOAT64},
-                                              input.size() * percentiles.size(),
-                                              mask_state::UNALLOCATED,
-                                              stream,
-                                              mr);
-
   auto percentiles_cdv = column_device_view::create(percentiles);
-  auto centroids = cudf::detail::make_counting_transform_iterator(
+
+  // leaf is a column of size input.size() * percentiles.size()
+  auto const num_output_values = input.size() * percentiles.size();
+
+  // null percentiles become null results.
+  auto [null_mask, null_count] = [&]() {
+    return percentiles.null_count() != 0
+             ? cudf::detail::valid_if(
+                 thrust::make_counting_iterator<size_type>(0),
+                 thrust::make_counting_iterator<size_type>(0) + num_output_values,
+                 [percentiles = *percentiles_cdv] __device__(size_type i) {
+                   return percentiles.is_valid(i % percentiles.size());
+                 })
+             : std::pair<rmm::device_buffer, size_type>{rmm::device_buffer{}, 0};
+  }();
+
+  auto result = cudf::make_fixed_width_column(
+    data_type{type_id::FLOAT64}, num_output_values, std::move(null_mask), null_count, stream, mr);
+
+  auto centroids = cudf::detail::make_counting_transform_iterator(
     0, make_centroid{mean.begin<double>(), weight.begin<double>()});

   constexpr size_type block_size = 256;
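
Note on the hunk above: the `valid_if` predicate marks output value `i` as valid only when percentile `i % percentiles.size()` is valid, so the null pattern of the percentiles column repeats once per tdigest. The following is a minimal host-side sketch of that index mapping, not the cudf implementation. The names `output_validity` and `count_nulls` are hypothetical, and plain `std::vector<bool>` stands in for the device bitmask.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the valid_if predicate.
// percentile_valid[p] is true when percentile p is non-null.
// The result holds one flag per output value, laid out as
// num_tdigests consecutive runs of percentile_valid.size() entries.
std::vector<bool> output_validity(std::vector<bool> const& percentile_valid,
                                  std::size_t num_tdigests)
{
  assert(!percentile_valid.empty());
  std::size_t const num_percentiles = percentile_valid.size();
  std::vector<bool> valid(num_tdigests * num_percentiles);
  for (std::size_t i = 0; i < valid.size(); ++i) {
    // Same mapping as the device lambda: output i reads percentile i % size.
    valid[i] = percentile_valid[i % num_percentiles];
  }
  return valid;
}

// Counts the null (invalid) entries, mirroring the null_count that
// valid_if returns alongside the mask.
std::size_t count_nulls(std::vector<bool> const& valid)
{
  std::size_t nulls = 0;
  for (bool v : valid) {
    if (!v) { ++nulls; }
  }
  return nulls;
}
```

With two tdigests and percentile validity `{0, 0, 1, 1}`, this model yields the flags `0 0 1 1 0 0 1 1` and a null count of 4, which matches the per-group pattern the updated test expects.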
12 changes: 6 additions & 6 deletions cpp/tests/quantiles/percentile_approx_test.cu
@@ -413,7 +413,7 @@ TEST_F(PercentileApproxTest, NullPercentiles)
   auto const delta = 1000;

   cudf::test::fixed_width_column_wrapper<double> values{1, 1, 2, 3, 4, 5, 6, 7, 8};
-  cudf::test::fixed_width_column_wrapper<int> keys{0, 0, 0, 0, 0, 0, 0, 0, 0};
+  cudf::test::fixed_width_column_wrapper<int> keys{0, 0, 0, 0, 0, 1, 1, 1, 1};
   cudf::table_view t({keys});
   cudf::groupby::groupby gb(t);
   std::vector<cudf::groupby::aggregation_request> requests;
@@ -424,12 +424,12 @@ TEST_F(PercentileApproxTest, NullPercentiles)

   structs_column_view scv(*tdigest_column.second[0].results[0]);

-  // nulls should result in the min value
-  cudf::test::fixed_width_column_wrapper<double> npercentiles{{1.0, 1.0, 0.5, 0.75}, {0, 0, 1, 1}};
+  cudf::test::fixed_width_column_wrapper<double> npercentiles{{0.5, 0.5, 1.0, 1.0}, {0, 0, 1, 1}};
   auto result = cudf::percentile_approx(scv, npercentiles);

-  cudf::test::fixed_width_column_wrapper<double> percentiles{0.0, 0.0, 0.5, 0.75};
-  auto expected = cudf::percentile_approx(scv, percentiles);
+  std::vector<bool> valids{0, 0, 1, 1};
+  cudf::test::lists_column_wrapper<double> expected{{{99, 99, 4, 4}, valids.begin()},
+                                                    {{99, 99, 8, 8}, valids.begin()}};

-  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, *expected);
+  CUDF_TEST_EXPECT_COLUMNS_EQUAL(*result, expected);
 }
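
Note on the updated test: the keys now split the values into two groups, `{1, 1, 2, 3, 4}` and `{5, 6, 7, 8}`. The expected result is one list per group, with nulls where the percentile is null and the group maximum where the percentile is 1.0. The `99` entries in the test are placeholders hidden by the validity mask. Below is a rough host-side reference for that expectation, assuming only what the test shows (percentile 1.0 yields the group maximum). `expected_rows` and `row` are hypothetical names, `std::optional` stands in for a nullable element, and no tdigest interpolation is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

using row = std::vector<std::optional<double>>;

// Hypothetical reference for the NullPercentiles expectation.
// Supports only null percentiles and percentile 1.0 (assumed to be the
// group maximum, as the test's expected values show).
std::vector<row> expected_rows(std::vector<std::vector<double>> const& groups,
                               std::vector<std::optional<double>> const& percentiles)
{
  std::vector<row> out;
  for (auto const& g : groups) {
    assert(!g.empty());
    double const max_value = *std::max_element(g.begin(), g.end());
    row r;
    for (auto const& p : percentiles) {
      if (!p.has_value()) {
        r.push_back(std::nullopt);  // null percentile -> null result
      } else {
        assert(*p == 1.0);          // other percentiles are out of scope here
        r.push_back(max_value);
      }
    }
    out.push_back(r);
  }
  return out;
}
```

Feeding in the two groups above with percentiles `{null, null, 1.0, 1.0}` gives `[null, null, 4, 4]` and `[null, null, 8, 8]`, the same shape as the `lists_column_wrapper` in the test.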