[FEA] Generate unique keys table in java JNI contiguousSplitGroups #11615

res-life · 2022-08-29T08:53:36Z

Is your feature request related to a problem? Please describe.
Table.contiguousSplitGroups splits a table into sub-groups, but it does not collect and de-duplicate the group-by key of each group.

contiguousSplitGroups already generates the split indices, so the unique group-by keys can be collected efficiently by gathering the key rows at those indices.

Describe the solution you'd like
Generate an extra table that holds the unique keys, one row per sub-group.
Current behavior example:

     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:
     *   Two tables, one table per group.
     *   Result[0]:
     *              a    1
     *
     *   Result[1]:
     *              b    2
     *              b    3

New requirement example:

contiguousSplitGroups
     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:  GroupByResult
     *   groups:  Two tables, one table per group.
     *          group[0]:
     *              a    1
     * 
     *          group[1]:
     *              b    2
     *              b    3
     *    uniqKeysTable: Two rows, one row per group.
     *      a  // for group 0
     *      b  // for group 1

Additional context

In NVIDIA/spark-rapids#5999, we need to split the input into groups and then get the unique keys table to generate partition strings. The distinct group-by keys could be generated with t.groupBy(columnIds).aggregate() after contiguousSplitGroups, but that is inefficient since it runs a second group-by pass over the data. Instead, the split indices that contiguousSplitGroups already produces can be used to gather the unique keys.
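To make the gather idea concrete, here is a minimal sketch in plain Java. It is not the cuDF API: the class `UniqueKeysSketch` and the methods `groupOffsets` and `gatherKeys` are hypothetical names used only for illustration, and the key column is modeled as a `String[]` in which equal keys are already contiguous (as they would be after the grouping step). The point is only that once the start offset of each group is known, the unique keys are the key rows at those offsets, with no second group-by required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Plain-Java model of the idea. This is NOT the cuDF API; all names are hypothetical.
final class UniqueKeysSketch {

    // Returns the start offset of every group in a key column whose equal keys
    // are already contiguous. These play the role of the split indices.
    static int[] groupOffsets(String[] keys) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            // A new group starts at row 0 and wherever the key changes.
            if (i == 0 || !Objects.equals(keys[i], keys[i - 1])) {
                offsets.add(i);
            }
        }
        return offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    // Gather: picks the key at the first row of each group, giving one row per group.
    static String[] gatherKeys(String[] keys, int[] offsets) {
        String[] unique = new String[offsets.length];
        for (int g = 0; g < offsets.length; g++) {
            unique[g] = keys[offsets[g]];
        }
        return unique;
    }

    public static void main(String[] args) {
        // Key column of the 3-row example table from this issue.
        String[] keys = {"a", "b", "b"};
        int[] offsets = groupOffsets(keys);
        String[] unique = gatherKeys(keys, offsets);
        System.out.println(Arrays.toString(offsets)); // prints [0, 1]
        System.out.println(Arrays.toString(unique));  // prints [a, b]
    }
}
```

In this model the gather touches one row per group, whereas re-running a group-by has to look at every input row again.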


Generate unique keys table in java JNI `contiguousSplitGroups`

closes #11615

`contiguousSplitGroups` splits a table into sub-groups, but each group's `group-by` key is not collected and de-duplicated. This PR is to generate an extra table to collect and deduplicate the unique keys corresponding to sub-groups.

```
contiguousSplitGroups
     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:  GroupByResult
     *   groups:  Two tables, one group one table.
     *          group[0]:
     *              a    1
     *
     *          group[1]:
     *              b    2
     *              b    3
     *    uniqKeysTable: Two rows, one row is corresponding to one group.
     *      a  // for group 0
     *      b  // for group 1
```

Authors:
- Chong Gao (https://github.com/res-life)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)

URL: #11614

res-life added the feature request, Needs Triage, and Java labels Aug 29, 2022

res-life self-assigned this Aug 29, 2022

github-actions bot added this to Needs prioritizing in Feature Planning Aug 29, 2022

res-life mentioned this issue Aug 29, 2022

Generate unique keys table in java JNI contiguousSplitGroups #11614

Merged

3 tasks

rapids-bot bot closed this as completed in #11614 Sep 5, 2022

Feature Planning automation moved this from Needs prioritizing to Closed Sep 5, 2022

res-life mentioned this issue Sep 19, 2022

Add dynamic partition concurrent writer to avoid full sort [databricks] NVIDIA/spark-rapids#6569

Merged

bdice removed the Needs Triage label Mar 4, 2024
