Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Replacing aws s3 sync with s5cmd with much faster speed #2291

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

landscapepainter
Copy link
Collaborator

@landscapepainter landscapepainter commented Jul 22, 2023

This PR closes #2176

User requested to use s5cmd sync instead of aws s3 sync for our storage upload and downloads due to it's speed. Confirmed it has most of necessary functionalities with a much faster speed. As s5cmd does not have --include option implemented yet, I will leave storage.py/S3Store.get_file_sync_command as it is.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • sky launch s3_storage.yaml
    s3_storage.yaml:
file_mounts:
  /s3-file: s3://s3-doyoung-test-file/test-file.yaml

  /s3-copy-dir: 
    name: s3-doyoung-copy-testing
    source: ~/s3-cos-copy
    store: s3
    mode: COPY

  /s3-copy-file:
    name: s3-doyoung-copy-file-testing
    source: [~/test-file.yaml]
    store: s3
    mode: COPY
  • Relevant individual smoke tests: pytest tests/test_smoke.py::TestStorageWithCredentials
  • pytest tests/test_smoke.py::test_aws_storage_mounts --aws

@romilbhardwaj
Copy link
Collaborator

romilbhardwaj commented Aug 27, 2023

Thanks @landscapepainter. Just noting our offline discussion here - instead of replacing aws cli, we want to first use s5cmd if it is installed, and if not installed we want to fall back to aws cli.

Additionally, looks like the --include flag is implemented now, so we can use it in S3Store.get_file_sync_command.

@concretevitamin
Copy link
Collaborator

Hey @landscapepainter - could we quickly check if the crt client is faster than s5cmd? https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html#preferred-transfer-client

@landscapepainter
Copy link
Collaborator Author

landscapepainter commented Sep 5, 2023

Just ran a benchmark with crt client. And it seems like setting the use for crt client affects the performance for s5cmd as well. Following is the comparison.

Sync on 1000 of 1MB files:
aws s3 sync with 10 threads with default client on 1000 of 1MB files: 20.738014650344848
aws s3 sync with 10 threads with crt client on 1000 of 1MB files: 22.9957523346
s5cmd sync with 10 threads with default client on 1000 of 1MB files 7.532667350769043
s5cmd sync with 10 threads with crt client on 1000 of 1MB files 4.55448350906372

Sync on 6 of 10GB files:
aws s3 sync with 10 threads with default client on 6 of 10GB files: 612.9618566036224
aws s3 sync with 10 threads with crt client on 6 of 10GB files: 412.372611364
s5cmd sync with 10 threads with default client on 6 of 10GB files: 386.55710673332214
s5cmd sync with 10 threads with crt client on 6 of 10GB files: 108.94380240440368

chatGPT generated graph:
image
image

Some ways to configure for the crt client is to run aws s3 configure command or set ~/.aws/config in it. Our current appoach for uploading from local to s3 is to use aws s3 sync as a default and note the users that installing s5cmd can improve performance. And if they install s5cmd, it'll use s5cmd. I can create a separte PR to update a doc on setting this config for improved performance in case users want it for either aws s3 sync or s5cmd sync. @concretevitamin What do you think?

@landscapepainter
Copy link
Collaborator Author

@romilbhardwaj --include is implemented. This is ready for another look!

@landscapepainter
Copy link
Collaborator Author

Need to add a setup for crt client at ~/.aws/config of remote machine for faster fetch when provisioning.

Copy link

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label May 23, 2024
@turian
Copy link
Contributor

turian commented May 25, 2024

@romilbhardwaj @landscapepainter is this PR ready?

@github-actions github-actions bot removed the Stale label May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] s5cmd command support for COPY Persistent Data Storage
4 participants