-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor semi_anti_join
#11100
Refactor semi_anti_join
#11100
Conversation
semi_anti_join
to expose semi_join_contains
semi_anti_join
to expose left_semi_join_contains
Benchmark comparing before and after the modification:
So the performance is not always better, although in most cases we gain >10% speedup. The slower places are due to null search. I'm still trying to optimize it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some non-blocking nitpick. Thanks @ttnghia ! Great Work!
cpp/src/search/contains_table.cu
Outdated
// Otherwise, we have only one nullable column and can use its null mask directly. | ||
auto const row_bitmask = | ||
haystack_nullable_columns.size() > 1 | ||
? std::move( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This move
is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move
is necessary here because we need to access to the .first
element of the bitmask_and
result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow I didn't know that. Thanks 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was wrong, the .first
element here is a device_buffer
instead.
It seems that by not using std::move
, the device_buffer
is incorrectly copied (??) and CI test failed:
13:43:43 [ RUN ] DeathTest.CudaFatalError
13:43:43 /workspace/.conda-bld/work/cpp/tests/error/error_handling_test.cu:102: Failure
13:43:43 Death test: call_kernel()
13:43:43 Result: failed to die.
13:43:43 Error msg:
13:43:43 [ DEATH ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's related. #11100 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try one more time to see if removing std::move
actually can get rid of the death test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the failure here is just random? I just checked rmm::device_buffer
and see that its copy constructor is deleted so there should not be any copying.
Let's wait to see the CI test result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that test failed again. So indeed removing std::move
is not related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all CI tests passed (std::move
removed)! Definitely the test failed randomly. I'm going to merge this soon.
Rerun tests. |
Unfortunately Jake has no time to review again
@gpucibot merge |
This PR adds the following APIs for set operations: * `lists::have_overlap` * `lists::intersect_distinct` * `lists::union_distinct` * `lists::difference_distinct` ### Name Convention Except for the first API (`lists::have_overlap`) that returns a boolean column, the suffix `_distinct` of the rest APIs denotes that their results will be lists columns in which all list rows have been post-processed to remove duplicates. As such, their results are actually "set" columns in which each row is a "set" of distinct elements. --- Depends on: * #10945 * #11017 * NVIDIA/cuCollections#175 * #11052 * #11118 * #11100 * #11149 Closes #10409. Authors: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Michael Wang (https://github.com/isVoid) - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11043
A (left) semi-join between the left and right tables returns a set of rows in the left table that has matching rows (i.e., compared equally) in the right table. As such, for each row in the left table, it needs to check if that row has a match in the right table.
Such check is very generic and has applications in many other places, not just in semi-join. This PR exposes that check functionality as a new
cudf::detail::contains(table_view, table_view)
for internal usage.Closes #11037.
Depends on:
static_multimap::pair_contains
NVIDIA/cuCollections#175