Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register partd encode dispatch in dask_cudf #14287

Merged

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Oct 16, 2023

Description

This PR enables "disk"-based shuffling of cudf-backed Dask-DataFrame collections, but does not yet add the shuffle="disk" option to the dask_cudf.DataFrame.shuffle/sort_values APIs.

We now use basic (slow) pickle logic to convert cudf.DataFrame objects to/from bytes here, so I'd like to consider further optimizations before making the shuffle="disk" option "official".

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

raydouglass and others added 30 commits March 30, 2020 11:03
Merge pull request rapidsai#5690 from ajschmidt8/phase2
[skip ci] Update master references for main branch
[RELEASE] Re-release v0.15 cudf [skip-ci]
[RELEASE] v0.18.2 `cudf` release [skip-ci]
wence-
wence- previously approved these changes Oct 18, 2023
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

@wence- wence- dismissed their stale review October 18, 2023 11:49

Implementation not yet fully working

@wence-
Copy link
Contributor

wence- commented Oct 18, 2023

Hmm, looks like the disk shuffle in dask wants some internal property of an Index that our indices do not provide:

FAILED python/dask_cudf/dask_cudf/tests/test_sort.py::test_disk_shuffle - AttributeError: 'Int64Index' object has no attribute '_attributes'

@rjzamora
Copy link
Member Author

Hmm, looks like the disk shuffle in dask wants some internal property of an Index that our indices do not provide:

Ah, right - Sorry. The test requires a newer version of dask, so I'll need to add a version check to that test.

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

/ok to test

@rjzamora
Copy link
Member Author

rjzamora commented Nov 3, 2023

/ok to test

import pickle
from functools import partial

import partd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add partd to our package requirements and conda recipes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it comes in transitively through dask? Which is a dependency of dask-cudf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right exactly, dask depends on partd. I think it should be safe for us to let dask worry about the partd dependency. If dask suddenly stops using partd for shuffle="disk", it will also stop using partd_encode_dispatch.

@rjzamora
Copy link
Member Author

rjzamora commented Nov 6, 2023

/ok to test

@rjzamora
Copy link
Member Author

rjzamora commented Nov 6, 2023

/ok to test

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 6, 2023
@rjzamora
Copy link
Member Author

rjzamora commented Nov 6, 2023

/merge

@rapids-bot rapids-bot bot merged commit 70c4283 into rapidsai:branch-23.12 Nov 6, 2023
61 checks passed
@rjzamora rjzamora deleted the register-partd_encode_dispatch branch November 6, 2023 19:45
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants