Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support lists of structs in row lexicographic comparator #12953

Merged
merged 182 commits into from
May 3, 2023

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 15, 2023

This implements support for lexicographic comparison for lists-of-structs, following the proposed idea in #11222:

  • The child column of the lists-of-structs column is replaced by an integer column of its rank values.
  • In the cases of comparing two tables, such child columns from both tables are concatenated, ranked, then split back into new child columns to replace the original child columns for each table.

Depends on:

Closes #11222.

ttnghia and others added 30 commits March 2, 2023 15:55
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a first pass through today, going to try and review this piecemeal to stop stalling progress so focused on getting a handle on the core algorithms in this review. I have a few concerns about specific aspects, but overall nothing too bad. I feel pretty confident that a couple more rounds of review will get us there.

cpp/src/table/row_operators.cu Show resolved Hide resolved
cpp/src/table/row_operators.cu Show resolved Hide resolved
cpp/src/table/row_operators.cu Show resolved Hide resolved
Comment on lines +445 to +449
// Dense ranks should be used instead of first rank.
// Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`.
// If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead
// to the result row(0) < row(1) which is incorrect.
// With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically this is just saying that we need exactly equal elements to map to the same rank, right?

Copy link
Contributor

@bdice bdice Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

As a consequence, the maximum value of the rank (minus 1) should be the cardinality of the input data, not the size of the input data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.

cpp/src/table/row_operators.cu Show resolved Hide resolved
cpp/src/table/row_operators.cu Show resolved Hide resolved
cpp/src/table/row_operators.cu Show resolved Hide resolved

// Recursively call transformation on the child column.
auto [new_child_lhs, new_child_rhs_opt, out_cols_child_lhs, out_cols_child_rhs] =
transform_lists_of_structs(child_lhs, child_rhs_opt, column_null_order, stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yes this kind of recursion makes sense, is what I'm worried might be missing in the struct code path above if there are deeper levels of nesting.

Copy link
Contributor Author

@ttnghia ttnghia Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original input should be passed in decompse_struct before flowing into this function. So basically this function cannot see any structs-of-lists column. It can only see structs-of-structs-of..... of one column of basic types, or lists-of-lists (including nested), or lists-of-structs (including nested).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right and decompose_structs will find structs at any level of nesting, even inside lists, right? That's the key here is ensuring that even when there's mixed nesting every level gets processed correctly (or at least, whatever constitutes the "leaf" nodes according to that function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decompose_structs will find structs at any level of nesting, even inside lists, right?

Yes. All possible mixed nesting at any level will be processed to make sure this function will not miss any cases.

cpp/src/table/row_operators.cu Outdated Show resolved Hide resolved
Comment on lines +685 to +688
* Note that the output of this factory function should not be used in `two_table_comparator` if
* the input table contains lists-of-structs. In such cases, please use the overload
* `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input
* tables at the same time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this seems error prone. I want to make sure I understand, I think there are three cases:

  1. Only have a single table and want to use self_comparator: Call preprocessed_table::create(tbl, ...);
  2. Have two tables and want to use two_table_comparator, but neither table has lists of structs: Call preprocessed_table::create(lhs, ...) and preprocessed_table::create(rhs, ...) independently.
  3. Have two tables and want to use two_table_comparator, and the tables may contain lists of structs: Call preprocessed_table::create(lhs, rhs, ...)

Is that right? If so, the differences between cases 2 and 3 seem tricky for callers to remember and it would be nice to avoid that. I don't have any great ideas yet though, without defining a completely different return type I don't see a way to avoid the second case from being used. I just wonder if we should always advise use of the new create and eat the cost to simplify developers' lives.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I don't want to encourage independent preprocessing for two-table comparators (equality or lexicographic). Enforcing joint preprocessing at the call site gives us far more control over future changes, and we can always fall back to independent preprocessing internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just add a comment to encourage using the overload create(lhs, rhs): 3a12c2d.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure that the two-table overload is being used everywhere in libcudf before merging this. I think that's in-scope for this PR -- or at least an immediate follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that change is somewhat very large, requiring the hash_join class to be reimplemented which is too much diverged from this PR. I plan to have two more separate PRs for doing so instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was wrong. When I was searching for two_table_comparator, I found a lot of instances and mistakenly considered all of them, including the instances of equality::two_table_comparator.

In fact, libcudf currently has only one instance of lexicographic::two_table_comparator and it doesn't need to be updated (already constructed properly). So this is no longer needed.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed a lot of follow-up work for this PR. I don't have a lot more to say about the current state, since we've deferred so much for now, so I am approving. Please make sure all the follow-up steps are organized and documented well.

cpp/src/table/row_operators.cu Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've stared at this enough to largely convince myself of the correctness of the algorithm at this point. I have some requests left, but they're mostly around testing or easy code quality improvements. In the interest of unblocking downstream development I am OK with merging this with more of the cleanup left as downstream work as long as it can be done in parallel with some of the feature development happening in other PRs.

order column_order,
null_policy null_handling,
null_order null_precedence,
bool percentage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we go ahead and make this an enum?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just exposing an API that already existed in a source file and changing this would affect other code paths I'm OK punting to a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just exposing the rank API in detail:: namespace. Changing this would be breaking so let's do it in some follow up PR.

Comment on lines +685 to +688
* Note that the output of this factory function should not be used in `two_table_comparator` if
* the input table contains lists-of-structs. In such cases, please use the overload
* `preprocessed_table::create(table_view const&, table_view const&,...)` to preprocess both input
* tables at the same time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's do that in this PR so that there's no chance of examples of the undesirable pattern lingering that could be copy-pasted.

cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
Comment on lines +445 to +449
// Dense ranks should be used instead of first rank.
// Consider this example: `input = [ [{0, "a"}, {3, "c"}], [{0, "a"}, {2, "b"}] ]`.
// If first rank is used, `transformed_input = [ [0, 3], [1, 2] ]`. Comparing them will lead
// to the result row(0) < row(1) which is incorrect.
// With dense rank, `transformed_input = [ [0, 2], [0, 1] ]`, producing correct comparison.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that you're invoking cardinality vs size in order to treat this as a set, yes I think we're on the same page.

cpp/tests/search/search_list_test.cpp Outdated Show resolved Hide resolved
cpp/tests/search/search_list_test.cpp Show resolved Hide resolved
cpp/tests/sort/sort_nested_types_tests.cpp Outdated Show resolved Hide resolved
cpp/tests/table/experimental_row_operator_tests.cu Outdated Show resolved Hide resolved
cpp/benchmarks/sort/sort_lists.cpp Outdated Show resolved Hide resolved
@ttnghia ttnghia requested a review from vyasr May 3, 2023 04:32
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've pushed a lot of work to follow-ups at this point, so I'm good with merging this so that we can get started on that work.

cpp/tests/search/search_list_test.cpp Show resolved Hide resolved
CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected_upper_bound, *result_upper_bound, verbosity);
}

TEST_F(ListBinarySearch, CrazyListTest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the name. Thank you for writing this test! It gives me a lot more confidence that we're handling the arbitrary depth correctly.

@ttnghia
Copy link
Contributor Author

ttnghia commented May 3, 2023

Thanks all for reviewing. I'm merging this now, so can unblock the follow-up work ASAP. The next step will be addressing

@ttnghia
Copy link
Contributor Author

ttnghia commented May 3, 2023

/merge

@rapids-bot rapids-bot bot merged commit d0a7dec into rapidsai:branch-23.06 May 3, 2023
43 checks passed
@ttnghia ttnghia deleted the two_tables_nested_types branch May 3, 2023 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal for lexicographic comparators for lists of structs
5 participants