Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for lexicographic comparators for lists of structs #11222

Closed
2 tasks done
bdice opened this issue Jul 7, 2022 · 2 comments · Fixed by #12953
Closed
2 tasks done

Proposal for lexicographic comparators for lists of structs #11222

bdice opened this issue Jul 7, 2022 · 2 comments · Fixed by #12953
Assignees
Labels
libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code

Comments

@bdice
Copy link
Contributor

bdice commented Jul 7, 2022

Background

Part of a broad push for row operators on nested types in story issue #10186. Related to user request in #10408.

@devavret has designed lexicographic comparators for lists in #11129 but this design cannot support lists of structs. Meanwhile, structs can be lexicographically compared through decomposition (#10164).

Proposed approach

(Discussed with @devavret @ttnghia @vyasr.)

We propose to alter the self-comparator (two-table comparator) preprocessing step to replace all list<struct<...>> with ranked child elements as list<size_type> from innermost to outermost levels of nesting. The replacement data is computed by calling cudf::rank on the child column of the list. (For two tables, the child list columns from both tables are concatenated, ranked, and then split back into a child list column for each table. This can use a fancy "chained" iterator to avoid materializing the concatenation.)

The key insight is that any child type (int, struct<int, int>, ...) except for list<struct<...>> can be ranked using existing lexicographic comparators, and replacing the list child data with the corresponding rankings will preserve the desired sorting order.

After preprocessing, the columns will not contain list<struct<...>> at any level of nesting, and all other combinations of nested types can be compared using other methods (currently dremel encoding for lists as in #11129, and decomposition for structs as in #10164). Altogether, the number of rank calls during preprocessing is equal to the number of instances of list<struct<...>> in the type hierarchy.

Notes

This preprocessing step for each list<struct<...>> may be expensive due to its reliance on ranks, which requires a sort and is thus O(N log N). Originally we considered replacing the data with an order preserving hash (#10020) which would be O(N) but that is probably not easy to implement without stringent constraints or prior knowledge of the inputs. It may be possible to have special cases for some limited set of types where some form of compressed ranking (like an order-preserving hash) can be trivially computed without sorting (the cudf::rank step is not needed). For example, the child ordering of a non-nullable list<struct<int32_t>> can be preserved by extracting the single integer's value and sorting on that. However, if null values in the parent struct must be considered, this may not work. The mapping must be one-to-one (injective) and monotonic, which cudf::rank satisfies.

Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of list<struct<...>> data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).

Examples

Example 1

Given a column with type list<struct<int, int>>:

[{1, 2}, {2, 1}]
[{3, 4}, {1, 2}]
[{2, 1}]

The struct<int, int> leaf data looks like:

{1, 2}
{2, 1}
{3, 4}
{1, 2}
{2, 1}

The ranking of these values is 0, 2, 4, 0, 2. Replacing these values in the list column gives a list<size_type> column:

[0, 2]
[3, 0]
[2]

Then this can be sorted.

Example 2

Given a column with type list<struct<struct<list<int>, list<struct<int>>>, list<int>>>:

[{{[1], [{2}]}, [3, 4]}]
[{{[0], [{1}]}, [2, 3]}]

We can iteratively replace all list<struct<...> by list<size_type>:

list<struct<struct<list<int>, list<size_type>>, list<int>>>
[{{[1], [1]}, [3, 4]}]
[{{[0], [0]}, [2, 3]}]

At this point, the child type of the root list is struct<struct<list<int>, list<size_type>>, list<int>>. This child type contains no list<struct<...>> and can be ranked again:

list<size_type>
[1]
[0]

The table is now preprocessed and is now comparable.

Example 3

For a binary operation like column < scalar, the two-table comparator will be used. The preprocessing will need to rank all of the elements of a list<struct<...>> from the column against those of the scalar. To optimize this, the rank algorithm should be updated to make it possible to compute a rank over multiple non-contiguous column views without materializing their concatenation.

Tasks

  1. 3 - Ready for Review CMake Spark feature request libcudf non-breaking
    ttnghia
  2. 3 - Ready for Review CMake Spark feature request libcudf non-breaking
    ttnghia
@bdice bdice added proposal Change current process or code libcudf Affects libcudf (C++/CUDA) code. labels Jul 7, 2022
@jrhemstad
Copy link
Contributor

Additionally, this places a new requirement on the self-comparator (two-table comparator) to know the nullability and physical element comparator (#10870) at construction so that the child ranks of list<struct<...>> data are correctly preprocessed. Currently these arguments are given when constructing the device row comparator from the self-comparator (two-table comparator).

Given this requirement, I'd be less inclined to make this change: #11040 (comment)

@github-actions
Copy link

github-actions bot commented Aug 7, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@bdice bdice added this to Needs prioritizing in Feature Planning via automation Aug 29, 2022
@ttnghia ttnghia self-assigned this Mar 5, 2023
@rapids-bot rapids-bot bot closed this as completed in d0a7dec May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code
Projects
No open projects
Feature Planning
Needs prioritizing
Development

Successfully merging a pull request may close this issue.

3 participants