Match pandas join ordering obligations in pandas-compatible mode #14428

wence- · 2023-11-16T12:08:33Z

Description

If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when sorting results.

Closes [BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible #14001

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2023-11-16T13:13:40Z

Note that there is still one case where we don't match pandas, but I think that is a pandas bug: pandas-dev/pandas#55992

python/cudf/cudf/core/join/join.py

shwina

LGTM!

bdice · 2023-11-16T18:37:33Z

Can you share benchmarks?

Do we still need to propose moving this logic to libcudf, if the performance is reduced? It sounds like @shwina is approving but I thought we were wanting to confirm performance impact first.

shwina · 2023-11-16T19:19:32Z

I think we still want to move to libcudf in the medium term. In a private comment, Lawrence posted some results showing the perf impact that I'm copying here. It seems like a reasonable compromise for the short term:

(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --no-sort --no-compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f4a332ff5c0>
use_pool=True
use_spilling=False
string_categoricals=False
sort=False
compat=False
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.1451s
medium inner on int repeat=2: 0.1437s
medium outer on int repeat=1: 0.1527s
medium outer on int repeat=2: 0.1522s
medium inner on factor repeat=1: 0.5018s
medium inner on factor repeat=2: 0.501s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --no-sort --compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f553642ed00>
use_pool=True
use_spilling=False
string_categoricals=False
sort=False
compat=True
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 0.3911s
medium inner on int repeat=2: 0.3922s
medium outer on int repeat=1: 0.4247s
medium outer on int repeat=2: 0.4248s
medium inner on factor repeat=1: 0.7429s
medium inner on factor repeat=2: 0.7425s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --sort --no-compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f5e8430ae40>
use_pool=True
use_spilling=False
string_categoricals=False
sort=True
compat=False
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.631s
medium inner on int repeat=2: 1.631s
medium outer on int repeat=1: 2.066s
medium outer on int repeat=2: 2.062s
medium inner on factor repeat=1: 1.988s
medium inner on factor repeat=2: 1.987s
(rapids) rapids@compose:~/first-party/db-benchmark/cudf$ python run-join.py --use-pool --sort --compat 300_000_000
Running experiment
------------------
mr=<rmm._lib.memory_resource.PoolMemoryResource object at 0x7f92e00b80c0>
use_pool=True
use_spilling=False
string_categoricals=False
sort=True
compat=True
Left table has 300000000 rows and is 5.59 GiB
Right table is 300000 rows and is 0.01 GiB
medium inner on int repeat=1: 1.879s
medium inner on int repeat=2: 1.878s
medium outer on int repeat=1: 2.338s
medium outer on int repeat=2: 2.336s
medium inner on factor repeat=1: 2.231s
medium inner on factor repeat=2: 2.232s

@beckernick would you agree?

bdice · 2023-11-16T22:29:02Z

Great. I'm good with merging this in the short term and then revisiting moving this logic to libcudf. Ideally that could land in 24.02.

bdice

Nice. I like the tests -- very thorough!

wence- · 2023-11-17T12:48:40Z

The additional cost of compat-mode is independent of the merge type (or number of merge keys, etc...). It is effectively the cost of stable_sort_by_key(gather_map, iota[gather_map]). Migrating to libcudf could avoid some stream synchronisations which would probably help a bit. For a merge between a table with $3 \times 10^8$ rows and one with $3 \times 10^5$ rows, this cost is around 220ms.

sort=True is much more expensive and gets worse with both the number of keys being merged on and the number of columns in the output table.

no compat

compat

If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees. When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order. While we're here, switch from argsort + gather to sort_by_key when sorting results. - Closes rapidsai#14001

wence- · 2023-11-17T12:50:03Z

/merge

wence- requested a review from a team as a code owner November 16, 2023 12:08

wence- requested review from shwina and bdice November 16, 2023 12:08

github-actions bot added the Python Affects Python cuDF API. label Nov 16, 2023

wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 16, 2023

shwina reviewed Nov 16, 2023

View reviewed changes

python/cudf/cudf/core/join/join.py Show resolved Hide resolved

shwina approved these changes Nov 16, 2023

View reviewed changes

bdice approved these changes Nov 17, 2023

View reviewed changes

wence- added 4 commits November 17, 2023 12:49

PANDAS_GE_220

b2754b3

Tests of join order in pandas-compat mode

b2be23e

More tests (matching pandas ones)

1e9841c

wence- force-pushed the wence/fea/join-moar-sort branch from f93d5c3 to 1e9841c Compare November 17, 2023 12:49

rapids-bot bot merged commit d2069f4 into rapidsai:branch-24.02 Nov 17, 2023
70 checks passed

wence- deleted the wence/fea/join-moar-sort branch November 17, 2023 16:20

wence- mentioned this pull request Nov 23, 2023

[PERF/ENH] Series.map sorts a larger dataset than it needs to #14485

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Match pandas join ordering obligations in pandas-compatible mode #14428

Match pandas join ordering obligations in pandas-compatible mode #14428

wence- commented Nov 16, 2023

wence- commented Nov 16, 2023

shwina left a comment

bdice commented Nov 16, 2023

shwina commented Nov 16, 2023 •

edited

Loading

bdice commented Nov 16, 2023

bdice left a comment

wence- commented Nov 17, 2023

wence- commented Nov 17, 2023

Match pandas join ordering obligations in pandas-compatible mode #14428

Match pandas join ordering obligations in pandas-compatible mode #14428

Conversation

wence- commented Nov 16, 2023

Description

Checklist

wence- commented Nov 16, 2023

shwina left a comment

Choose a reason for hiding this comment

bdice commented Nov 16, 2023

shwina commented Nov 16, 2023 • edited Loading

bdice commented Nov 16, 2023

bdice left a comment

Choose a reason for hiding this comment

wence- commented Nov 17, 2023

wence- commented Nov 17, 2023

shwina commented Nov 16, 2023 •

edited

Loading