[FEA] Generate unique keys table in java JNI contiguousSplitGroups #11615

Closed
res-life opened this issue Aug 29, 2022 · 0 comments · Fixed by #11614
Labels
feature request New feature or request Java Affects Java cuDF API.

Is your feature request related to a problem? Please describe.
Table.contiguousSplitGroups splits a table into sub-groups, but each group's group-by key is not collected and de-duplicated.

contiguousSplitGroups already generates the split indices, so the group-by keys can be collected and de-duplicated cheaply by invoking a gather on those indices.
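
A minimal sketch of the gather idea in plain Java, not cuDF code. It assumes the grouping column is already ordered so that each group is contiguous, and that the offset where each group begins is known. The names `GatherKeysSketch`, `gatherKeys`, and `groupStarts` are hypothetical and only used for illustration.

```java
import java.util.Arrays;

public class GatherKeysSketch {
    // Hypothetical helper, not part of the cuDF Java API.
    // keys:        the grouping column, ordered so each group is contiguous.
    // groupStarts: row offset at which each group begins (assumed to be known
    //              from the split step).
    static String[] gatherKeys(String[] keys, int[] groupStarts) {
        String[] unique = new String[groupStarts.length];
        for (int i = 0; i < groupStarts.length; i++) {
            // One key row per group: read the key at the first row of the group.
            unique[i] = keys[groupStarts[i]];
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "b"};
        int[] groupStarts = {0, 1};
        // prints [a, b]
        System.out.println(Arrays.toString(gatherKeys(keys, groupStarts)));
    }
}
```

Because every row in a group shares the same key, reading one row per group is enough, and no second group-by pass is needed.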

Describe the solution you'd like
Generate an extra table that collects the unique keys, one row per sub-group.
Example of the current behavior:

     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:
     *   Two tables, one table per group.
     *   Result[0]:
     *              a    1
     *
     *   Result[1]:
     *              b    2
     *              b    3

Example of the requested behavior:

contiguousSplitGroups
     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:  GroupByResult
     *   groups:  Two tables, one table per group.
     *          group[0]:
     *              a    1
     * 
     *          group[1]:
     *              b    2
     *              b    3
     *    uniqKeysTable: Two rows, one row per group.
     *      a  // for group 0
     *      b  // for group 1
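
The result shape in the example above can be modeled in plain Java as a rough sketch. This is not the cuDF implementation: `GroupByResultSketch`, `Row`, `GroupByResult`, and `splitGroups` are hypothetical names, and the input is assumed to be ordered so that equal keys are adjacent.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupByResultSketch {
    // Hypothetical two-column row: a grouping key and a value.
    static final class Row {
        final String key;
        final int value;
        Row(String key, int value) { this.key = key; this.value = value; }
    }

    // Hypothetical stand-in for the result: the groups plus one key per group.
    static final class GroupByResult {
        final List<List<Row>> groups;
        final List<String> uniqKeys;
        GroupByResult(List<List<Row>> groups, List<String> uniqKeys) {
            this.groups = groups;
            this.uniqKeys = uniqKeys;
        }
    }

    // Assumes rows with equal keys are adjacent in the input.
    static GroupByResult splitGroups(List<Row> rows) {
        List<List<Row>> groups = new ArrayList<>();
        List<String> uniqKeys = new ArrayList<>();
        List<Row> current = null;
        for (Row row : rows) {
            // A key change marks the start of a new group.
            if (current == null || !current.get(0).key.equals(row.key)) {
                current = new ArrayList<>();
                groups.add(current);
                uniqKeys.add(row.key); // record the key once per group
            }
            current.add(row);
        }
        return new GroupByResult(groups, uniqKeys);
    }

    public static void main(String[] args) {
        List<Row> input = List.of(new Row("a", 1), new Row("b", 2), new Row("b", 3));
        GroupByResult result = splitGroups(input);
        System.out.println(result.groups.size()); // prints 2
        System.out.println(result.uniqKeys);      // prints [a, b]
    }
}
```

Row `i` of `uniqKeys` corresponds to `groups.get(i)`, which mirrors the "one row per group" relationship described for `uniqKeysTable`.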


Additional context

NVIDIA/spark-rapids#5999 needs to split the input into groups and then obtain the unique keys table in order to generate partition strings. The distinct group-by keys could be produced by calling t.groupBy(columnIds).aggregate() after contiguousSplitGroups, but that repeats the grouping work and is inefficient. Instead, the split indices that contiguousSplitGroups already produces can be used to gather the unique keys directly.

@res-life res-life added feature request New feature or request Needs Triage Need team to review and classify Java Affects Java cuDF API. labels Aug 29, 2022
@res-life res-life self-assigned this Aug 29, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 5, 2022
Generate unique keys table in java JNI `contiguousSplitGroups`
closes #11615

`contiguousSplitGroups` splits a table into sub-groups, but each group's `group-by` key is not collected and de-duplicated.
This PR generates an extra table that collects the de-duplicated keys, one row per sub-group.

```
contiguousSplitGroups
     * Example:
     *   Grouping column index: 0
     *   Input: A table of 3 rows (two groups)
     *             a    1
     *             b    2
     *             b    3
     *
     * Result:  GroupByResult
     *   groups:  Two tables, one table per group.
     *          group[0]:
     *              a    1
     * 
     *          group[1]:
     *              b    2
     *              b    3
     *    uniqKeysTable: Two rows, one row per group.
     *      a  // for group 0
     *      b  // for group 1
```

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #11614
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024