Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match pandas join ordering obligations in pandas-compatible mode #14428

Merged
merged 4 commits into from
Nov 17, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Nov 16, 2023

Description

If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when sorting results.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner November 16, 2023 12:08
@wence- wence- requested review from shwina and bdice November 16, 2023 12:08
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 16, 2023
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 16, 2023
@wence-
Copy link
Contributor Author

wence- commented Nov 16, 2023

Note that there is still one case where we don't match pandas, but I think that is a pandas bug: pandas-dev/pandas#55992

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bdice
Copy link
Contributor

bdice commented Nov 16, 2023

Can you share benchmarks?

Do we still need to propose moving this logic to libcudf, if the performance is reduced? It sounds like @shwina is approving but I thought we were wanting to confirm performance impact first.

@shwina
Copy link
Contributor

shwina commented Nov 16, 2023

I think we still want to move to libcudf in the medium term. In a private comment, Lawrence posted some results showing the perf impact that I'm copying here. It seems like a reasonable compromise for the short term:

(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --no-sort --no-compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f4a332ff5c0>
use_pool=True
use_spilling=False
string_categoricals=False
sort=False
compat=False
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.1451s
medium inner on int repeat=2: 0.1437s
medium outer on int repeat=1: 0.1527s
medium outer on int repeat=2: 0.1522s
medium inner on factor repeat=1: 0.5018s
medium inner on factor repeat=2: 0.501s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --no-sort --compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f553642ed00>
use_pool=True
use_spilling=False
string_categoricals=False
sort=False
compat=True
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.3911s
medium inner on int repeat=2: 0.3922s
medium outer on int repeat=1: 0.4247s
medium outer on int repeat=2: 0.4248s
medium inner on factor repeat=1: 0.7429s
medium inner on factor repeat=2: 0.7425s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --sort --no-compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f5e8430ae40>
use_pool=True
use_spilling=False
string_categoricals=False
sort=True
compat=False
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.631s
medium inner on int repeat=2: 1.631s
medium outer on int repeat=1: 2.066s
medium outer on int repeat=2: 2.062s
medium inner on factor repeat=1: 1.988s
medium inner on factor repeat=2: 1.987s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --sort --compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f92e00b80c0>
use_pool=True
use_spilling=False
string_categoricals=False
sort=True
compat=True
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.879s
medium inner on int repeat=2: 1.878s
medium outer on int repeat=1: 2.338s
medium outer on int repeat=2: 2.336s
medium inner on factor repeat=1: 2.231s
medium inner on factor repeat=2: 2.232s

@beckernick would you agree?

@bdice
Copy link
Contributor

bdice commented Nov 16, 2023

Great. I'm good with merging this in the short term and then revisiting moving this logic to libcudf. Ideally that could land in 24.02.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I like the tests -- very thorough!

@wence-
Copy link
Contributor Author

wence- commented Nov 17, 2023

The additional cost of compat-mode is independent of the merge type (or number of merge keys, etc...). It is effectively the cost of stable_sort_by_key(gather_map, iota[gather_map]). Migrating to libcudf could avoid some stream synchronisations which would probably help a bit. For a merge between a table with $3 \times 10^8$ rows and one with $3 \times 10^5$ rows, this cost is around 220ms.

sort=True is much more expensive and gets worse with both the number of keys being merged on and the number of columns in the output table.

no compat
Screenshot from 2023-11-17 12-36-10

compat
Screenshot from 2023-11-17 12-47-08

If we pass sort=True to merges we are on the hook to sort the result
in order with respect to the key columns. If those key columns have
repeated values there is still some space for ambiguity. Currently we
get a result back whose order (for the repeated key values) is
determined by the gather map that libcudf returns for the join. This
does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees
which we also do not match. To fix this, in pandas-compatible mode
only, reorder the gather maps according to the order of the input
keys. When sort=False this means that our result matches pandas
ordering. When sort=True, it ensures that (if we use a stable sort)
the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when
sorting results.

- Closes rapidsai#14001
@wence-
Copy link
Contributor Author

wence- commented Nov 17, 2023

/merge

@rapids-bot rapids-bot bot merged commit d2069f4 into rapidsai:branch-24.02 Nov 17, 2023
70 checks passed
@wence- wence- deleted the wence/fea/join-moar-sort branch November 17, 2023 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants