[train] Speed up downloading of large checkpoints #38695

Merged (4 commits) on Aug 23, 2023

Conversation

ericl
Contributor

@ericl ericl commented Aug 21, 2023

Why are these changes needed?

This PR sets chunk_size for pyarrow.fs.copy_files and enables use_threads=True for downloads only, which speeds up downloads by 2-3x.

Some benchmarks (entries marked "(default)" are what this PR implements as the default for Train/Tune):

  • pyarrow's default S3FileSystem, 64MiB chunk size, use_threads=F, download: ~50MiB/s
  • pyarrow's default S3FileSystem, 64MiB chunk size, use_threads=F, upload: ~700MiB/s (default)
  • pyarrow's default S3FileSystem, 64MiB chunk size, use_threads=T, download: ~120MiB/s (default)
  • pyarrow's default S3FileSystem, 64MiB chunk size, use_threads=T, upload: hangs due to apache/arrow#32372 ([Python] Python hangs when use pyarrow.fs.copy_files with "used_threads=True")
  • fsspec/s3_fs, 8MiB multipart chunk size, download: ~120MiB/s
  • fsspec/s3_fs, 8MiB multipart chunk size, upload: ~300MiB/s
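As a rough illustration of the settings described above, here is a minimal sketch of passing chunk_size and use_threads to pyarrow.fs.copy_files. The helper names (copy_options, download_checkpoint) are hypothetical and not taken from the PR, and the 64 MiB value is assumed from the benchmark list rather than from the merged code.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB; assumed from the benchmark list above


def copy_options(direction: str) -> dict:
    """Return keyword arguments for pyarrow.fs.copy_files (hypothetical helper).

    Threads are enabled for downloads only, because threaded uploads to S3
    can hang (apache/arrow#32372).
    """
    if direction not in ("download", "upload"):
        raise ValueError(f"unknown direction: {direction!r}")
    return {"chunk_size": CHUNK_SIZE, "use_threads": direction == "download"}


def download_checkpoint(remote_uri: str, local_path: str, remote_fs=None) -> None:
    """Copy a remote checkpoint directory to local disk (hypothetical helper)."""
    import pyarrow.fs  # imported lazily so copy_options works without pyarrow

    pyarrow.fs.copy_files(
        remote_uri,
        local_path,
        source_filesystem=remote_fs,  # None lets pyarrow infer it from the URI
        **copy_options("download"),
    )
```

Keeping the options in one place makes the download/upload asymmetry explicit, so the upload path cannot accidentally enable threads.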

Note that aws s3 sync does better and can hit ~700MiB/s for both upload and download, but adopting it is out of scope for this 2.7 fix:

awsv2 configure set default.s3.preferred_transfer_client crt
awsv2 configure set default.s3.target_bandwidth 100Gb/s
awsv2 configure set default.s3.multipart_chunksize 8MB

awsv2 s3 sync s3://... /tmp/ckpt

Completed 15.4 GiB/69.4 GiB (632.9 MiB/s) with 14 file(s) remaining
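To put the throughput figures in perspective, the sketch below estimates how long a transfer would take at each reported rate. It assumes the 69.4 GiB total from the sync output above as an illustrative checkpoint size and assumes throughput stays constant; the function names are hypothetical.

```python
def transfer_seconds(size_gib: float, throughput_mib_s: float) -> float:
    """Estimated transfer time in seconds at a constant throughput."""
    return size_gib * 1024 / throughput_mib_s


def speedup(before_mib_s: float, after_mib_s: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after_mib_s / before_mib_s


if __name__ == "__main__":
    size_gib = 69.4  # illustrative size, taken from the sync progress line
    rates = [
        ("pyarrow, use_threads=F", 50),
        ("pyarrow, use_threads=T", 120),
        ("aws s3 sync (crt)", 632.9),
    ]
    for label, rate in rates:
        minutes = transfer_seconds(size_gib, rate) / 60
        print(f"{label}: {minutes:.1f} min")
    print(f"download speedup: {speedup(50, 120):.1f}x")
```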

Closes #38612

Signed-off-by: Eric Liang <ekhliang@gmail.com>
@ericl ericl added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 22, 2023
Contributor

@justinvyu justinvyu left a comment


Thanks! One comment

python/ray/train/_internal/storage.py (outdated, resolved)
Signed-off-by: Eric Liang <ekhliang@gmail.com>
Contributor

@justinvyu justinvyu left a comment


Thanks!

@ericl ericl merged commit da9c511 into ray-project:master Aug 23, 2023
42 of 45 checks passed
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Development

Successfully merging this pull request may close these issues.

[tune/train] Investigate default s3 filesystem performance with large files
2 participants