This repository was archived by the owner on Aug 29, 2024. It is now read-only.

Conversation

@charlesbluca
Member

Adds an optional fast path for multi-column sorting when all of the following are true:

  • dask-cudf is installed
  • we are trying to sort a dask-cudf dataframe
  • all columns are being sorted in ascending order with nulls positioned last
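
As a rough illustration of the three conditions above, the check could be sketched as below. The function names and the `ascending` / `nulls_first` parameters are hypothetical and are not taken from this PR; only the `dask_cudf.DataFrame` type is a real name.

```python
def sort_order_allows_fast_path(ascending, nulls_first):
    """Hypothetical check: every column ascending, every column nulls-last."""
    return all(ascending) and not any(nulls_first)


def can_use_dask_cudf_fast_path(df, ascending, nulls_first):
    """Hypothetical helper combining the three conditions listed above."""
    try:
        # Condition 1: dask-cudf must be installed
        import dask_cudf
    except ImportError:
        return False
    # Condition 2: the frame being sorted must be a dask-cudf dataframe
    if not isinstance(df, dask_cudf.DataFrame):
        return False
    # Condition 3: ascending order with nulls positioned last, for all columns
    return sort_order_allows_fast_path(ascending, nulls_first)
```

Keeping the sort-order check separate from the import check means it can be exercised without a GPU environment.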

There's some work in dask-cudf aiming to support descending sort and null positioning, which could expand the cases where the fast path can be used:

@charlesbluca charlesbluca changed the base branch from main to branch-21.12 September 24, 2021 18:22
@charlesbluca
Member Author

rerun tests

):
    try:
        df = df.sort_values(sort_columns, ignore_index=True)
        return df.persist()
Member

@VibhuJawa VibhuJawa Sep 30, 2021

  1. We should not call .persist() on single-partition frames.

  2. Just curious: does .persist() ensure we don't trigger duplicate computations? IIRC, .sort_values() is not lazy.

I wonder if this is a better pattern:

df = df.persist()
df = df.sort_values(sort_columns, ignore_index=True).persist()

Member Author

@charlesbluca charlesbluca Sep 30, 2021

  1. Agreed, I can add in a call to map_partitions in the single partition case.
  2. @quasiben might know better than me the implications of calling persist here; I would assume this is here mostly to match up with the persist call happening in the workaround:

return df.persist()

EDIT:

Just saw your edit - knowing that, it looks like the current pattern should be good (once we account for the single partition case) - should we still opt to persist before running sort_values?
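
For reference, one way the single-partition case could look is sketched below. The names `sort_fast_path` and `_sort_partition` are hypothetical, and this is not necessarily the change that was pushed. It relies only on `npartitions`, `map_partitions`, and `sort_values`, which Dask dataframes expose.

```python
def _sort_partition(partition, sort_columns):
    # Runs on one concrete (pandas/cudf-like) partition
    return partition.sort_values(sort_columns, ignore_index=True)


def sort_fast_path(df, sort_columns):
    """Hypothetical sketch of the fast path with a single-partition branch."""
    if df.npartitions == 1:
        # One partition: sort it in place lazily, with no shuffle and no persist
        return df.map_partitions(_sort_partition, sort_columns)
    # Several partitions: defer to the collection-level multi-column sort
    return df.sort_values(sort_columns, ignore_index=True)
```

Whether the multi-partition branch should persist before or after sorting is left open here, since that is the question under discussion.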

Member

> Just saw your edit - knowing that, it looks like the current pattern should be good (once we account for the single partition case) - should we still opt to persist before running sort_values?

Testing it again now; will update here. Sorry for the edit and confusion.

Member

So I tested an example workflow with and without persisting first, and persisting before sorting does indeed prevent duplicate computation.

Without Persisting (DASK PROFILE):

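import time
import dask_cudf
from dask.distributed import performance_report, wait

# get_fp and columns are not shown in this snippet
st = time.time()
with performance_report(filename="sort-without-persist.html"):
    df = dask_cudf.read_parquet(get_fp("web_sales"), columns=columns).shuffle(["ws_sold_date_sk", "ws_ship_date_sk"])
    df = df.sort_values(by=["ws_bill_cdemo_sk"], ignore_index=True).persist()
    wait(df)  # block until the persisted result is ready
    del df
et = time.time()
print(f"et -st = {et-st}")
et -st = 23.0989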

With Persisting (DASK PROFILE):

st = time.time()
with performance_report(filename="sort-with-persist.html"):
    df = dask_cudf.read_parquet(get_fp("web_sales"), columns=columns).shuffle(["ws_sold_date_sk", "ws_ship_date_sk"])
    # persist the shuffled input first so the sort does not recompute it
    df = df.persist().sort_values(by=["ws_bill_cdemo_sk"], ignore_index=True).persist()
    wait(df)  # block until the persisted result is ready
    del df
et = time.time()
print(f"et -st = {et-st}")
et -st = 16.24

The trade-off here is memory vs. duplicate computation. I think we might want to think more about this.

I wonder if a version of in-place sorting might prevent some memory overheads.

Anyways, we should think deeply about this.
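
To make the trade-off concrete, here is a toy model in plain Python. It is not Dask code. It assumes, as suggested earlier in the thread, that the sort eagerly evaluates its input once to pick a partition boundary and then evaluates it again for the actual sort. All class and function names are made up for illustration.

```python
class CountingSource:
    """Stand-in for an expensive upstream graph (e.g. read + shuffle)."""

    def __init__(self, data):
        self.data = list(data)
        self.loads = 0  # how many times the upstream work was executed

    def compute(self):
        self.loads += 1
        return list(self.data)


class Persisted:
    """Stand-in for persist(): run upstream once and keep the result in memory."""

    def __init__(self, source):
        self._cached = source.compute()

    def compute(self):
        return list(self._cached)


def eager_sort(source):
    # Pass 1: evaluate the input to choose a partition boundary
    sample = source.compute()
    pivot = sorted(sample)[len(sample) // 2]
    # Pass 2: evaluate the input again to do the actual sort
    values = source.compute()
    low = sorted(v for v in values if v < pivot)
    high = sorted(v for v in values if v >= pivot)
    return low + high


raw = CountingSource([3, 1, 2])
eager_sort(raw)             # upstream work runs twice
cached = CountingSource([3, 1, 2])
eager_sort(Persisted(cached))  # upstream work runs once, result held in memory
```

In this model the persisted variant halves the upstream work at the cost of holding a full copy of the input, which mirrors the memory vs. duplicate computation trade-off described above.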

Member

I suspect the persist calls here are due to handling the multi-col sort on CPU. Once pandas-dev/pandas#43881 is resolved and Dask has a native multi-col sort, we can probably remove them entirely. @charlesbluca is correct that I was originally intending to match the case when native multi-col sorting is not supported.

I think it's safe to remove persist in the initial try statement and return the dataframe directly.

Member Author

Pushed these changes to the original PR:

dask-contrib#229

3 participants