Add fast path for multi-column sorting #5

charlesbluca · 2021-09-23T19:59:30Z

Adds an optional fast path for multi-column sorting when all of the following are true:

dask-cudf is installed
we are trying to sort a dask-cudf dataframe
all columns are being sorted in ascending order with nulls positioned last
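The three conditions above could be checked with a guard along these lines. This is a minimal sketch, not the code from this PR: the function name `can_use_fast_path` and its parameters `ascending` and `nulls_first` (one boolean per sort column) are hypothetical.

```python
from typing import Any, Sequence

try:
    import dask_cudf  # optional GPU dependency
except ImportError:
    dask_cudf = None


def can_use_fast_path(
    df: Any, ascending: Sequence[bool], nulls_first: Sequence[bool]
) -> bool:
    """Return True only when every fast-path condition holds (hypothetical helper)."""
    # Condition 1: dask-cudf is installed.
    if dask_cudf is None:
        return False
    # Condition 2: the frame being sorted is a dask-cudf dataframe.
    if not isinstance(df, dask_cudf.DataFrame):
        return False
    # Condition 3: every column is ascending with nulls positioned last.
    return all(ascending) and not any(nulls_first)
```

Anything that fails the guard would fall through to the existing sorting workaround.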

There's some work in dask-cudf aiming to support descending sort and null positioning, which could expand the cases where the fast path can be used:

charlesbluca · 2021-09-27T16:21:53Z

rerun tests

VibhuJawa · 2021-09-30T21:13:58Z

dask_sql/physical/utils/sort.py

+    ):
+        try:
+            df = df.sort_values(sort_columns, ignore_index=True)
+            return df.persist()


We should not call .persist() on single-partition frames.

Just curious: does .persist() ensure we don't trigger duplicate computations? IIRC, .sort_values() is not lazy.

I wonder if this is a better pattern:

df = df.persist()
df = df.sort_values(sort_columns, ignore_index=True).persist()

Agreed, I can add in a call to map_partitions in the single partition case.
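For reference, the single partition case could look roughly like this. It is a sketch under assumptions, not the change that was pushed: `sort_lazy_frame` is a hypothetical name, and `df` is assumed to be a dask-cudf dataframe whose `sort_values` accepts `ignore_index`, as in the diff above.

```python
from typing import Any, List


def sort_lazy_frame(df: Any, sort_columns: List[str]) -> Any:
    """Sort a Dask-style dataframe, special-casing a single partition."""
    if df.npartitions == 1:
        # One partition: sorting that partition locally is already a global sort,
        # so no shuffle (and no persist) is needed.
        return df.map_partitions(
            lambda part: part.sort_values(sort_columns, ignore_index=True)
        )
    # Several partitions: use the collection-level multi-column sort.
    return df.sort_values(sort_columns, ignore_index=True)
```

The result stays lazy in both branches; whether to persist is left to the caller.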

@quasiben might know better than me the implications of calling persist here; I would assume this is here mostly to match up with the persist call happening in the workaround:

dask-sql/dask_sql/physical/utils/sort.py

Line 38 in 4d5f7dd

return df.persist()

EDIT:

Just saw your edit - knowing that, it looks like the current pattern should be good (once we account for the single partition case) - should we still opt to persist before running sort_values?

Testing it again now, will update here. Sorry for the edit and the confusion.

So I tested an example workflow with and without persisting first, and persisting before sorting does indeed prevent duplicate computation.

Without Persisting (DASK PROFILE):

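import time
import dask_cudf
from dask.distributed import performance_report, wait

# get_fp and columns come from the benchmark setup, which is not shown here
st = time.time()
with performance_report(filename="sort-without-persist.html"):
    df = dask_cudf.read_parquet(get_fp("web_sales"), columns=columns).shuffle(['ws_sold_date_sk', 'ws_ship_date_sk'])
    df = df.sort_values(by=['ws_bill_cdemo_sk'], ignore_index=True).persist()
    df = wait(df); del df
et = time.time()
print(f"et -st = {et-st}")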

et -st = 23.0989

With Persisting (DASK PROFILE):

st = time.time()
with performance_report(filename="sort-with-persist.html"):
    df = dask_cudf.read_parquet(get_fp("web_sales"), columns=columns).shuffle(['ws_sold_date_sk', 'ws_ship_date_sk'])
    df = df.persist().sort_values(by=['ws_bill_cdemo_sk'], ignore_index=True).persist()
    df = wait(df); del df
et = time.time()
print(f"et -st = {et-st}")

et -st = 16.24
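To put the two timings side by side, here is the arithmetic on the numbers reported above. Each figure comes from a single run, so treat the ratio as indicative only; the variable names are just for illustration.

```python
# Wall-clock times reported in this thread, in seconds.
without_persist_s = 23.0989
with_persist_s = 16.24

saved_s = without_persist_s - with_persist_s
reduction_pct = 100 * saved_s / without_persist_s
speedup = without_persist_s / with_persist_s

print(f"saved {saved_s:.2f} s ({reduction_pct:.1f}% less wall time, {speedup:.2f}x)")
```

That is roughly 6.9 s saved, or about 30% less wall time, for persisting before the sort in this workflow.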

The trade-off here is memory vs. duplicate computation. I think we might want to think more about this.
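A plain-Python analogy of that trade-off, for anyone skimming. This is not Dask code: `ToyLazyFrame` and `two_pass_sort` are made-up stand-ins, and the sketch assumes the sort needs one eager pass over its input (to pick split points) before the pass that does the actual sorting.

```python
from typing import Callable, List


class ToyLazyFrame:
    """Toy stand-in for a lazy collection; not Dask code."""

    def __init__(self, loader: Callable[[], List[int]]):
        self._loader = loader

    def compute(self) -> List[int]:
        # Every call re-runs the upstream work unless it was persisted.
        return self._loader()

    def persist(self) -> "ToyLazyFrame":
        # Run the upstream work once and keep the result in memory.
        data = self._loader()
        return ToyLazyFrame(lambda: data)


def two_pass_sort(frame: ToyLazyFrame) -> List[int]:
    # Pass 1 (eager): look at the data to choose a split point.
    sample = frame.compute()
    pivot = sorted(sample)[len(sample) // 2]
    # Pass 2: the actual sort, split into two "partitions" around the pivot.
    data = frame.compute()
    low = sorted(x for x in data if x < pivot)
    high = sorted(x for x in data if x >= pivot)
    return low + high


calls = {"n": 0}


def expensive_load() -> List[int]:
    calls["n"] += 1  # count how often the upstream work runs
    return [5, 3, 9, 1, 7]


two_pass_sort(ToyLazyFrame(expensive_load))
loads_without_persist = calls["n"]

calls["n"] = 0
two_pass_sort(ToyLazyFrame(expensive_load).persist())
loads_with_persist = calls["n"]
```

Without persisting, the upstream load runs once per pass; with persisting it runs once in total, at the cost of holding the loaded data in memory for the duration of the sort.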

I wonder if a version of in-place sorting might prevent some memory overheads.

Anyways, we should think deeply about this.

CC: @randerzander

I suspect the persist calls here are due to handling the multi-col sort on CPU. Once pandas-dev/pandas#43881 is resolved and Dask has a native multi-col sort we can probably remove them entirely. @charlesbluca is correct that I was originally intending to match the case when native multi-col sorting is not supported.

I think it's safe to remove persist in the initial try block and return the dataframe directly.

Pushed these changes to the original PR:

dask-contrib#229

quasiben and others added 9 commits August 27, 2021 06:51

add fast path for multi-column sorting

5ad586d

lint

8fc9a7b

Merge remote-tracking branch 'upstream/main' into multi-col-sort

ac7bad8

Prevent single column Dask dataframes from calling sort_values

c86cdab

Wrap dask_cudf import in try/except block

d321ca3

Add test for fast multi column sort

ed65228

Move multi_col_sort contents to apply_sort

76eb2aa

Ignore index for dask-cudf sorting

927c618

Fix show tables test for cudf enabled fixture

963ad5e

charlesbluca changed the base branch from main to branch-21.12 September 24, 2021 18:22

Trigger CI

5fb3c41

VibhuJawa reviewed Sep 30, 2021

charlesbluca closed this Oct 11, 2021

VibhuJawa mentioned this pull request Oct 11, 2021

[ENH] Support persist configuration for sorting dask-contrib/dask-sql#250

Open

charlesbluca deleted the multi-col-sort branch January 19, 2022 21:24


charlesbluca commented Sep 23, 2021
charlesbluca commented Sep 27, 2021
VibhuJawa Sep 30, 2021 (edited)
charlesbluca Sep 30, 2021 (edited)
VibhuJawa Sep 30, 2021
VibhuJawa Sep 30, 2021
VibhuJawa Oct 4, 2021
quasiben Oct 5, 2021
charlesbluca Oct 6, 2021

3 participants
