-
Notifications
You must be signed in to change notification settings - Fork 1
Add fast path for multi-column sorting #5
Conversation
|
rerun tests |
| ): | ||
| try: | ||
| df = df.sort_values(sort_columns, ignore_index=True) | ||
| return df.persist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
We should not call
.persist()on single patition frames . -
Just curious , Does
.persist()ensure we dont trigger duplicate computations as IIRC,.sort_values()is not lazy.
I wonder if this is a better patten
df = df.persist()
df = df.sort_values(sort_columns, ignore_index=True).persist()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Agreed, I can add in a call to
map_partitionsin the single partition case. - @quasiben might know better than me the implications of calling
persisthere; I would assume this is here mostly to match up with the persist call happening in the workaround:
dask-sql/dask_sql/physical/utils/sort.py
Line 38 in 4d5f7dd
| return df.persist() |
EDIT:
Just saw your edit - knowing that, it looks like the current pattern should be good (once we account for the single partition case) - should we still opt to persist before running sort_values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw your edit - knowing that, it looks like the current pattern should be good (once we account for the single partition case) - should we still opt to persist before running sort_values?
Testing it again now., will update here. Sorry for the edit and confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So i tested an example workflow with and without persisting first, and persisting before sorting indeed prevents duplicate computation.
Without Persisting (DASK PROFILE):
st = time.time()
with performance_report(filename="sort-without-persist.html"):
df = dask_cudf.read_parquet(get_fp("web_sales"),columns= columns).shuffle(['ws_sold_date_sk','ws_ship_date_sk'])
df = df.sort_values(by=['ws_bill_cdemo_sk'],ignore_index=True).persist()
df = wait(df);
del df
print(f"et -st = {et-st}")et -st = 23.0989
With Persisting (DASK PROFILE):
st = time.time()
with performance_report(filename="sort-with-persist.html"):
df = dask_cudf.read_parquet(get_fp("web_sales"),columns= columns).shuffle(['ws_sold_date_sk','ws_ship_date_sk'])
df = df.persist().sort_values(by=['ws_bill_cdemo_sk'],ignore_index=True).persist()
df = wait(df);
del df
et = time.time()
print(f"et -st = {et-st}")et -st = 16.24
The trade of here is memory vs duplicate computation. I think we might want to think more about this .
I wonder if a version of in-place sorting might prevent some memory overheads.
Anyways, we should think deeply about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC: @randerzander
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the persist calls here are due to handling the multi-col sort on CPU. Once pandas-dev/pandas#43881 is resolved and Dask has a native multi-col sort we can probably remove them entirely. @charlesbluca is correct that I was originally intending to match the the case when native mult-col sorting is not supported.
I think it's ok to safely remove persist in the initial try state and return the dataframe directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed these changes to the original PR:
Adds an optional fast path for multi-column sorting when the following is true:
There's some work in dask-cudf aiming to support descending sort and null positioning, which could potentially open up the cases the fast path can be used:
ascendingparameter for dask-cudfsort_valuescudf#9250na_positionparam to dask-cudfsort_valuescudf#9264