Sky Data #63

Merged
merged 21 commits into from Dec 10, 2021

Conversation

michaelzhiluo
Collaborator

@michaelzhiluo michaelzhiluo commented Nov 25, 2021

Completed Items

  • Backends for S3 and GCS Blobs
  • Creating S3/GCS Blobs
  • Deleting S3/GCS Blobs
  • Uploading Data (parallel upload) to S3/GCS Blobs
  • Downloading Data (not parallel yet) from S3/GCS Blobs
  • Transfer between S3 and GCS Blobs (1-2 GB/s bandwidth for Imagenet, 2-3 min transfer time)
  • Moving AWS authentication across all nodes
  • Integrating Storage Classes and Backend into Sky (currently it is a standalone module)
  • Download S3/GCS data onto the Cloud VM

TODOs

  • [Optional] Mounting S3/GCS Blob onto all Cluster nodes (per @concretevitamin's suggestion)

@michaelzhiluo michaelzhiluo changed the title from "[WIP] Sky Data" to "Sky Data" on Dec 2, 2021
@michaelzhiluo
Collaborator Author

@romilbhardwaj @concretevitamin Ready to review!

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Fantastic work! Left some comments, but I generally like how this has turned out.

prototype/examples/resnet_app_storage.py Show resolved Hide resolved
prototype/examples/storage_playground.py Outdated Show resolved Hide resolved
prototype/sky/__init__.py Outdated Show resolved Hide resolved
prototype/sky/backends/data_transfer.py Outdated Show resolved Hide resolved
prototype/sky/backends/data_transfer.py Outdated Show resolved Hide resolved
self.path = path
assert not is_new_bucket or self.path
if 'gcs://' not in self.path:
if 's3://' in self.path:
Collaborator

As we add more backends, we'll probably have to add more logic here for cross-cloud transfers. In a future PR we should clean this up by adding something like a transfer registry which maps Tuple[Backend1, Backend2] -> TransferFn(Backend1, Backend2).
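
A minimal sketch of what such a transfer registry could look like, assuming backends are identified by plain strings and a transfer function takes source and destination bucket names. None of these names (`register_transfer`, `transfer`, `S3`, `GCS`) come from the PR; they are hypothetical and only illustrate the lookup by (source, destination) pair.

```python
from typing import Callable, Dict, Tuple

# Hypothetical backend identifiers; the PR's real backend classes may differ.
S3 = 's3'
GCS = 'gcs'

# A transfer function takes (src_bucket, dst_bucket) and returns a summary.
TransferFn = Callable[[str, str], str]

# Registry keyed by (source backend, destination backend).
_TRANSFER_REGISTRY: Dict[Tuple[str, str], TransferFn] = {}


def register_transfer(src: str, dst: str) -> Callable[[TransferFn], TransferFn]:
    """Decorator that records a transfer function for a (src, dst) pair."""

    def decorator(fn: TransferFn) -> TransferFn:
        _TRANSFER_REGISTRY[(src, dst)] = fn
        return fn

    return decorator


@register_transfer(S3, GCS)
def s3_to_gcs(src_bucket: str, dst_bucket: str) -> str:
    # Placeholder body: a real implementation would start the cross-cloud copy.
    return f'copy {src_bucket} (s3) -> {dst_bucket} (gcs)'


@register_transfer(GCS, S3)
def gcs_to_s3(src_bucket: str, dst_bucket: str) -> str:
    # Placeholder body, same caveat as above.
    return f'copy {src_bucket} (gcs) -> {dst_bucket} (s3)'


def transfer(src_backend: str, dst_backend: str, src_bucket: str,
             dst_bucket: str) -> str:
    """Looks up the registered function for the pair and calls it."""
    try:
        fn = _TRANSFER_REGISTRY[(src_backend, dst_backend)]
    except KeyError:
        raise NotImplementedError(
            f'No transfer registered for {src_backend} -> {dst_backend}'
        ) from None
    return fn(src_bucket, dst_bucket)
```

With this shape, adding a new cloud means registering new pairs rather than extending an if/else chain on path prefixes, and an unsupported pair fails with an explicit `NotImplementedError`.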

prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/task.py Outdated Show resolved Hide resolved
Collaborator

@concretevitamin concretevitamin left a comment

Looking great!! First pass, didn't look into all details but I tried to raise pressing high-level concerns/suggestions.


Question:

For GNN, I imagine they will always write

s = sky.Storage('my-data', source='/data/gnn/train')
# First run: upload.
# Second+ runs: sync (some overhead even if no change to data).
# This overhead is OK for now; future we can optimize it away.
s.to_s3()
task.set_storage_mounts({s: '/data/gnn/train'})

Is that right?

[Not for this PR] Suppose they now run this Sky script on a different machine without access to source='/data/gnn/train'. We need a way (sky status? sky storage?) for them to find which S3 path is backing my-data, so that they can write sky.Storage('my-data', source=<correct s3 path>) instead.

prototype/sky/backends/data_transfer.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/storage.py Outdated Show resolved Hide resolved
prototype/sky/task.py Outdated Show resolved Hide resolved
prototype/sky/cloud_stores.py Show resolved Hide resolved
@romilbhardwaj
Collaborator

Also, I just realized we will need to add support for Storage in our YAML schema. That's probably best done as a separate PR...
cc @concretevitamin @michaelzhiluo @gmittal

@michaelzhiluo
Collaborator Author

> Also, I just realized we will need to add support for Storage in our YAML schema. That's probably best done as a separate PR... cc @concretevitamin @michaelzhiluo @gmittal

Will be done in a followup PR after this is merged

@michaelzhiluo
Collaborator Author

@romilbhardwaj @concretevitamin I think it's good to go! Addressed 99% of the comments.

Collaborator

@concretevitamin concretevitamin left a comment

Looks good! Mostly some more API ideas. Should be good to go after that.

prototype/sky/task.py Outdated Show resolved Hide resolved
prototype/sky/task.py Outdated Show resolved Hide resolved
prototype/sky/task.py Outdated Show resolved Hide resolved
prototype/sky/execution.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Fantastic work! Left a minor comment.

Also, in a future PR we should investigate directly mounting the storage using GcsFuse/s3Fuse instead of copying to local EBS:

  1. Reduces setup time and EBS size (at the cost of higher access latency and lower throughput than local disk; we need to evaluate this tradeoff)
  2. Allows writes to Storage that users can access from other workers or their local dev machine
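
As a rough illustration of the mounting alternative, the sketch below only builds the mount command for each store and does not run it. It assumes the `gcsfuse` CLI for GCS and the `s3fs` binary from s3fs-fuse for S3 (the comment's "s3Fuse" may refer to a different tool), and `fuse_mount_command` is a hypothetical helper, not part of the PR.

```python
import shlex
from typing import List


def fuse_mount_command(store: str, bucket: str, mount_path: str) -> List[str]:
    """Returns the argv for mounting `bucket` at `mount_path` via FUSE.

    Assumption: `gcsfuse <bucket> <mountpoint>` for GCS and
    `s3fs <bucket> <mountpoint>` for S3, with credentials already
    configured on the node.
    """
    if store == 'gcs':
        return ['gcsfuse', bucket, mount_path]
    if store == 's3':
        return ['s3fs', bucket, mount_path]
    raise ValueError(f'Unsupported store type: {store!r}')


if __name__ == '__main__':
    # Print the command that would run on each cluster node.
    print(shlex.join(fuse_mount_command('gcs', 'my-data', '/data/gnn/train')))
```

Keeping the command construction separate from execution makes it easy to compare mount-based access against the current copy-to-disk path when evaluating the tradeoff.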

prototype/examples/resnet_app_storage.py Outdated Show resolved Hide resolved
prototype/sky/data/storage.py Outdated Show resolved Hide resolved
@@ -12,7 +12,10 @@
import subprocess
Collaborator

Note to team: As sky.Storage becomes the standard method of uploading/interfacing with data in Sky, we should deprecate CloudStorage and move its functionality to AbstractStore
cc @concretevitamin @michaelzhiluo

Collaborator Author

Agreed, we can refactor in a later PR @romilbhardwaj @concretevitamin

Collaborator

@concretevitamin concretevitamin left a comment

Outstanding work @michaelzhiluo !

@michaelzhiluo michaelzhiluo merged commit 9da780f into master Dec 10, 2021
@michaelzhiluo
Collaborator Author

Thanks for the reviews! @romilbhardwaj @concretevitamin

@michaelzhiluo michaelzhiluo deleted the data_egress branch January 12, 2022 01:52
gmittal pushed a commit that referenced this pull request Mar 15, 2022
* Initial Commit

* Some more changes, mounting is left, finish tmrw

* Demo

* Fix

* MVP is Done

* ok

* squashed commits

* Yapf Fix

* Romil's Comments Addressed

* Partially addressed Zongheng's comments

* Temporary fix

* Fixed Resnet Storage example

* Yapf and Pylint

* Addressed Zongheng's comments, linting time

* Pylint fix

* Done

3 participants