Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. #22665

Merged

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Feb 25, 2022

Alternative to #22580.

TODOs

  • Verify performance improvement on nightly test (need to wait for PR wheel build).

Related issue number

Closes #21274

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow
Copy link
Contributor Author

@ericl Got ~18s when running the nightly test, so the results are:

  • Status quo: 150s
  • Custom partition file implementation: 13s
  • (This PR) Custom file fragment serialization: 18s

Given that this PR adds a lot less tech debt, I think that going with this PR over the other makes sense. What do you think?

@ericl ericl self-assigned this Feb 28, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this LGTM as a much cleaner short-term fix.

@ericl ericl merged commit cf3577f into ray-project:master Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] [Bug] Piece Serialization with cloudpickle is very slow.
2 participants