Is your feature request related to a problem? Please describe.
Table.contiguousSplitGroups splits a table into sub-groups, but it does not collect the de-duplicated group-by key of each group.
Because contiguousSplitGroups already generates the split indices, the unique group-by keys can be collected efficiently by invoking a gather on those indices.
Describe the solution you'd like
Generate an extra table to collect the unique keys corresponding to sub-groups.
Original implementation example:
* Example:
* Grouping column index: 0
* Input: A table of 3 rows (two groups)
* a 1
* b 2
* b 3
*
* Result:
* Two tables, one table per group.
* Result[0]:
* a 1
*
* Result[1]:
* b 2
* b 3
New requirement example:
contiguousSplitGroups
* Example:
* Grouping column index: 0
* Input: A table of 3 rows (two groups)
* a 1
* b 2
* b 3
*
* Result: GroupByResult
* groups: Two tables, one table per group.
* group[0]:
* a 1
*
* group[1]:
* b 2
* b 3
* uniqKeysTable: Two rows, one row per group.
* a // for group 0
* b // for group 1
Additional context
In NVIDIA/spark-rapids#5999, we need to split the input into groups and then get a table of the unique keys in order to generate partition strings. The distinct group-by keys could be produced by calling t.groupBy(columnIds).aggregate() after contiguousSplitGroups, but this is inefficient. Instead, we can gather at the split indices that contiguousSplitGroups already produces to generate the unique keys.
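To illustrate the idea, here is a minimal plain-Java sketch of gathering unique keys at split indices. It is only a model of the technique, not the cudf implementation: the class `UniqueKeysSketch` and the methods `groupStartOffsets` and `gather` are hypothetical names, and the key column is modeled as a `String[]` whose equal keys are assumed to be contiguous already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical model of "gather the keys at the split indices".
// This is NOT the cudf API; a key column is modeled as a plain String[].
final class UniqueKeysSketch {

    // Returns the start offset of each contiguous run of equal keys.
    // Assumes rows with equal keys are already adjacent.
    static int[] groupStartOffsets(String[] keys) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            if (i == 0 || !Objects.equals(keys[i], keys[i - 1])) {
                starts.add(i);
            }
        }
        return starts.stream().mapToInt(Integer::intValue).toArray();
    }

    // Gather: picks keys[offset] for every offset, giving one key per group.
    static String[] gather(String[] keys, int[] offsets) {
        String[] out = new String[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = keys[offsets[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "b"};           // key column from the example above
        int[] offsets = groupStartOffsets(keys);   // group start offsets
        System.out.println(Arrays.toString(offsets));               // prints [0, 1]
        System.out.println(Arrays.toString(gather(keys, offsets))); // prints [a, b]
    }
}
```

Since the group boundaries are already known from the split, the unique keys cost one gather of a single row per group, rather than a second group-by pass over all rows.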
Generate unique keys table in Java JNI `contiguousSplitGroups`
closes #11615
`contiguousSplitGroups` splits a table into sub-groups, but each group's `group-by` key is not collected and de-duplicated.
This PR generates an extra table that holds the de-duplicated keys corresponding to the sub-groups.
```
contiguousSplitGroups
* Example:
* Grouping column index: 0
* Input: A table of 3 rows (two groups)
* a 1
* b 2
* b 3
*
* Result: GroupByResult
* groups: Two tables, one table per group.
* group[0]:
* a 1
*
* group[1]:
* b 2
* b 3
* uniqKeysTable: Two rows, one row per group.
* a // for group 0
* b // for group 1
```
Authors:
- Chong Gao (https://github.com/res-life)
Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
URL: #11614