Add distinct left join #15149

Merged Mar 6, 2024 (24 commits)

@PointKernel (Member) commented Feb 26, 2024

Description

Contributes to #14948

This PR adds distinct left join. It also cleans up the distinct inner code to use the terms "build" and "probe" consistently instead of "left" and "right".
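
For readers unfamiliar with the "build" and "probe" terminology, here is a minimal host-side sketch of what a distinct left join computes. It is not the libcudf implementation: the function name `distinct_left_join`, the `kNoMatch` sentinel, and the use of `std::unordered_map` are assumptions made purely for illustration. The point it shows is that when the build keys are distinct, each probe row matches at most one build row, so the result is a single index per probe row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel for "no matching build row"; not taken from the PR.
constexpr std::int32_t kNoMatch = -1;

// Host-side sketch of a distinct left join (illustration only).
// Precondition: build_keys contains no duplicate keys.
// Returns one build-row index per probe row, or kNoMatch when the key is absent.
std::vector<std::int32_t> distinct_left_join(std::vector<std::int32_t> const& build_keys,
                                             std::vector<std::int32_t> const& probe_keys)
{
  // Build phase: map each distinct key to its row index in the build table.
  std::unordered_map<std::int32_t, std::int32_t> key_to_row;
  key_to_row.reserve(build_keys.size());
  for (std::size_t i = 0; i < build_keys.size(); ++i) {
    bool const inserted =
      key_to_row.emplace(build_keys[i], static_cast<std::int32_t>(i)).second;
    assert(inserted && "build keys must be distinct");
    (void)inserted;
  }

  // Probe phase: every probe row yields exactly one output entry.
  std::vector<std::int32_t> build_indices;
  build_indices.reserve(probe_keys.size());
  for (auto const key : probe_keys) {
    auto const it = key_to_row.find(key);
    build_indices.push_back(it == key_to_row.end() ? kNoMatch : it->second);
  }
  return build_indices;
}
```

Because the output length always equals the number of probe rows, a sketch like this can size its output up front, which a general left join over possibly duplicated keys cannot do.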

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel added the feature request, 2 - In Progress, Performance, and non-breaking labels Feb 26, 2024
@PointKernel self-assigned this Feb 26, 2024
@github-actions bot added the libcudf label Feb 26, 2024
@PointKernel (Member, Author) commented:

Benchmark results on RTX8000:

left_join_32bit

[0] Quadro RTX 8000

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | Noise | GPU Time | Noise |
|----------|--------------|----------|------------------|------------------|---------|----------|-------|----------|-------|
| I32 | I32 | 0 | 100000 | 100000 | 3984x | 129.402 us | 13.54% | 125.573 us | 13.00% |
| I32 | I32 | 0 | 100000 | 400000 | 2256x | 269.767 us | 5.79% | 266.255 us | 5.60% |
| I32 | I32 | 0 | 10000000 | 10000000 | 30x | 16.885 ms | 0.23% | 16.882 ms | 0.22% |
| I32 | I32 | 0 | 10000000 | 40000000 | 11x | 47.211 ms | 0.25% | 47.207 ms | 0.25% |
| I32 | I32 | 0 | 10000000 | 100000000 | 11x | 108.764 ms | 0.04% | 108.762 ms | 0.04% |
| I32 | I32 | 0 | 80000000 | 100000000 | 11x | 159.143 ms | 0.04% | 159.141 ms | 0.04% |
| I32 | I32 | 0 | 100000000 | 100000000 | 11x | 174.163 ms | 0.03% | 174.160 ms | 0.03% |
| I32 | I32 | 0 | 10000000 | 240000000 | 11x | 259.735 ms | 0.05% | 259.734 ms | 0.05% |
| I32 | I32 | 0 | 80000000 | 240000000 | 11x | 299.279 ms | 0.02% | 299.278 ms | 0.02% |
| I32 | I32 | 0 | 100000000 | 240000000 | 11x | 313.821 ms | 0.04% | 313.821 ms | 0.04% |

distinct_left_join_32bit

[0] Quadro RTX 8000

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | Noise | GPU Time | Noise |
|----------|--------------|----------|------------------|------------------|---------|----------|-------|----------|-------|
| I32 | I32 | 0 | 100000 | 100000 | 10272x | 52.113 us | 16.87% | 48.697 us | 14.70% |
| I32 | I32 | 0 | 100000 | 400000 | 3184x | 160.391 us | 2.57% | 157.164 us | 1.39% |
| I32 | I32 | 0 | 10000000 | 10000000 | 41x | 12.482 ms | 0.06% | 12.478 ms | 0.06% |
| I32 | I32 | 0 | 10000000 | 40000000 | 13x | 38.902 ms | 0.04% | 38.899 ms | 0.04% |
| I32 | I32 | 0 | 10000000 | 100000000 | 11x | 89.294 ms | 0.03% | 89.287 ms | 0.03% |
| I32 | I32 | 0 | 80000000 | 100000000 | 11x | 121.486 ms | 0.02% | 121.482 ms | 0.02% |
| I32 | I32 | 0 | 100000000 | 100000000 | 11x | 129.176 ms | 0.02% | 129.172 ms | 0.02% |
| I32 | I32 | 0 | 10000000 | 240000000 | 11x | 212.003 ms | 0.05% | 211.999 ms | 0.05% |
| I32 | I32 | 0 | 80000000 | 240000000 | 11x | 245.199 ms | 0.03% | 245.195 ms | 0.03% |
| I32 | I32 | 0 | 100000000 | 240000000 | 11x | 254.463 ms | 0.04% | 254.461 ms | 0.04% |
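
To read the two result sets side by side, one can divide the `left_join_32bit` GPU time by the `distinct_left_join_32bit` GPU time for the same table sizes. The sketch below does this for three rows copied from the results above; the struct and function names are hypothetical and exist only for this illustration. Over all ten rows the ratio works out to roughly 1.2x to 2.6x in favor of the distinct variant on this GPU, with the largest ratio at the smallest table sizes, where the noise is also highest.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// One row of each benchmark, paired by table sizes (GPU times in ms).
// Values are copied from the results above; the type itself is hypothetical.
struct gpu_time_pair {
  long long build_rows;
  long long probe_rows;
  double left_join_ms;
  double distinct_left_join_ms;
};

constexpr std::array<gpu_time_pair, 3> kSamples{{
  {10000000LL, 10000000LL, 16.882, 12.478},
  {100000000LL, 100000000LL, 174.160, 129.172},
  {100000000LL, 240000000LL, 313.821, 254.461},
}};

// Speedup of distinct left join over the general left join for one row.
inline double speedup(gpu_time_pair const& s)
{
  assert(s.distinct_left_join_ms > 0.0);
  return s.left_join_ms / s.distinct_left_join_ms;
}

// Prints one line per sample row.
inline void print_speedups()
{
  for (auto const& s : kSamples) {
    std::printf("build=%lld probe=%lld speedup=%.2fx\n",
                s.build_rows, s.probe_rows, speedup(s));
  }
}
```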

@PointKernel marked this pull request as ready for review February 27, 2024 23:59
@PointKernel requested a review from a team as a code owner February 27, 2024 23:59
@PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label Feb 28, 2024
@bdice (Contributor) left a comment:

Do we need tests that explicitly compare distinct join results to "normal" join results? I might have overlooked that, but it would be good to confirm that the results match. (Distinct joins should just be faster since they have an additional requirement on the input.)

cpp/include/cudf/join.hpp (outdated, resolved)
cpp/tests/join/distinct_join_tests.cpp (resolved)
@PointKernel (Member, Author) commented:

> Do we need tests that explicitly compare distinct join results to "normal" join results?

Distinct join results are always compared against "gold references" instead of the output of normal joins in unit tests. It's a more reliable verification since "normal" joins can go wrong as well.
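
The two verification strategies discussed here can be sketched on the host as follows. This is an illustration only, not code from `distinct_join_tests.cpp`: `general_left_join`, `matches_gold`, and the `kUnmatched` sentinel are hypothetical names. The sketch checks a join result against a hand-written gold reference, so the check does not depend on another join implementation being correct, and it also shows why a general join needs a different output shape once build keys repeat.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sentinel for an unmatched probe row; not taken from the PR.
constexpr std::int32_t kUnmatched = -1;

// (probe_row, build_row) pairs.
using index_pairs = std::vector<std::pair<std::int32_t, std::int32_t>>;

// General left join on the host: build keys may repeat, so one probe row
// can produce several output pairs.
index_pairs general_left_join(std::vector<std::int32_t> const& build_keys,
                              std::vector<std::int32_t> const& probe_keys)
{
  std::unordered_multimap<std::int32_t, std::int32_t> rows_by_key;
  for (std::size_t i = 0; i < build_keys.size(); ++i) {
    rows_by_key.emplace(build_keys[i], static_cast<std::int32_t>(i));
  }

  index_pairs out;
  for (std::size_t p = 0; p < probe_keys.size(); ++p) {
    auto const probe_row = static_cast<std::int32_t>(p);
    auto const range     = rows_by_key.equal_range(probe_keys[p]);
    if (range.first == range.second) { out.emplace_back(probe_row, kUnmatched); }
    for (auto it = range.first; it != range.second; ++it) {
      out.emplace_back(probe_row, it->second);
    }
  }
  std::sort(out.begin(), out.end());  // make the order deterministic
  return out;
}

// Gold-reference check: the expected pairs are written by hand.
bool matches_gold(index_pairs actual, index_pairs gold)
{
  std::sort(actual.begin(), actual.end());
  std::sort(gold.begin(), gold.end());
  return actual == gold;
}

// Example check with distinct build keys: one output pair per probe row.
void run_gold_check()
{
  index_pairs const gold{{0, 1}, {1, kUnmatched}, {2, 0}};
  assert(matches_gold(general_left_join({10, 20, 30}, {20, 99, 10}), gold));
}
```

With distinct build keys the general join emits exactly one pair per probe row, which is the case a distinct join is specialized for. With duplicated build keys (for example `{10, 10}` probed by `{10}`) it emits two pairs, and the distinct precondition no longer holds.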

cpp/include/cudf/join.hpp (outdated, resolved)
cpp/include/cudf/join.hpp (outdated, resolved)
@ttnghia (Contributor) left a comment:

I just have a few questions/nits. Overall LGTM.

@PointKernel (Member, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit aabfd83 into rapidsai:branch-24.04 Mar 6, 2024
73 checks passed
@PointKernel PointKernel deleted the distinct-left-join branch March 6, 2024 06:58
rapids-bot bot pushed a commit that referenced this pull request Mar 6, 2024
Adds Java bindings to the distinct left join functionality added in #15149.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jim Brennan (https://github.com/jbrennan333)

URL: #15154