[Storage] Support for storage info in cluster table's metadata #2322
This is now ready for a review.
Thanks @landscapepainter! Took a high-level look at the code. Haven't tested yet.
sky/backends/cloud_vm_ray_backend.py (outdated):

```python
for _, storage_obj in storage_mounts.items():
    for _, store_obj in storage_obj.stores.items():
        # Update some of the non-picklable attributes of store instances.
        store_obj.make_picklable()
```
Can we use and store `StorageMetadata` here? It was designed to be a picklable representation of `Storage` objects.
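For context on why a dedicated metadata class helps here, a minimal illustration (not SkyPilot code): live objects often hold non-picklable resources such as locks or cloud clients, while a plain-data metadata class round-trips through `pickle` cleanly.

```python
import dataclasses
import pickle
import threading


class LiveStorage:
    """Stand-in for a live Storage object holding a non-picklable resource."""

    def __init__(self, name: str):
        self.name = name
        self._lock = threading.Lock()  # e.g., a lock or a cloud client


@dataclasses.dataclass
class StorageMetadataSketch:
    """Plain-data stand-in for StorageMetadata: only picklable fields."""
    name: str


try:
    pickle.dumps(LiveStorage('my-bucket'))
except TypeError as e:
    print(f'Live object fails to pickle: {e}')

# The metadata-only object round-trips through pickle without issue.
restored = pickle.loads(pickle.dumps(StorageMetadataSketch('my-bucket')))
print(restored.name)  # my-bucket
```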
Thanks for pointing this out!
sky/backends/cloud_vm_ray_backend.py
Outdated
return | ||
|
||
cluster_name = handle.cluster_name | ||
cluster_metadata = global_user_state.get_cluster_metadata(cluster_name) |
I'm not sure what counts as metadata for a cluster. In other words, should storage objects instead be stored as part of the `handle`? Two reasons for this:

1. `get_cluster_metadata` and `set_cluster_metadata` appear to be unused before this... @Michaelvll - was there a specific use case for these methods?
2. `handle` seems to store all metadata for a cluster (e.g., name, resources, etc.)
To add on: the current implementation stores the `storage_mounts` dictionary, which maps each destination path (`str`) to a `Storage` object, in the `cluster` table's `metadata`. This is useful because I can pass it into the `sync_file_mounts` API just as we do on `sky launch`. If we store only the `Storage` objects in `handle`, we would need another way to keep each destination path (`dst`) paired with its `Storage` object.
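A minimal sketch of the mapping being described, assuming `storage_obj.handle` holds the picklable `StorageMetadata` (the exact accessor may differ in the codebase):

```python
from typing import Any, Dict


def storage_mounts_to_metadata(
        storage_mounts: Dict[str, Any]) -> Dict[str, Any]:
    """Map each destination path to the picklable metadata of its Storage.

    Keeping the dict keyed by `dst` preserves the pairing needed to call
    `sync_file_mounts` again on `sky start`.
    """
    return {
        dst: storage_obj.handle  # assumed to be the StorageMetadata
        for dst, storage_obj in storage_mounts.items()
    }
```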
@romilbhardwaj This is ready for a look. Did some additional backwards compatibility tests as well.
Additionally, can you also run
@romilbhardwaj This is ready for another look. Confirmed
@romilbhardwaj This is ready for a look. Additional testing done for

Following is a result from running
Thanks @landscapepainter! Left some minor comments; should be good to go after they are fixed. Please also run the file-mounting-related smoke tests and some manual backward compatibility tests to make sure the updates in `global_user_state` don't break existing storages/clusters.
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Thanks @landscapepainter! Left one comment - should be good to go if all smoke tests and manual backward compatibility tests pass.
Ran into an issue with backward compatibility:

1. On the master branch, run `sky launch -c test task.yaml --cloud gcp` with this YAML:

   ```yaml
   file_mounts:
     /nfs:
       name: romil-test-bucket
       store: s3
       mode: MOUNT

   run: |
     ls /nfs
   ```

2. Switch to this branch and run `sky launch -c test task.yaml --cloud gcp` again. (This step may not be required, but I ran it.)
3. Stop the cluster with `sky stop test`.
4. `sky start test` fails with:
```
I 11-07 17:41:15 cloud_vm_ray_backend.py:1692] Successfully provisioned or found existing VM.
D 11-07 17:41:15 cloud_vm_ray_backend.py:3001] Checking if skylet is running on the head node.
Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 311, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1162, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 332, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 2573, in start
    core.start(name,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 332, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/core.py", line 278, in start
    _start(cluster_name,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/core.py", line 211, in _start
    storage_mounts = backend.get_storage_mounts_metadata(handle.cluster_name)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 4708, in get_storage_mounts_metadata
    storage_mounts[dst] = storage_lib.Storage.from_metadata(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 760, in from_metadata
    mode=override_args.get('mode', metadata.mode))
AttributeError: 'StorageMetadata' object has no attribute 'mode'
```
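One conventional way to handle this kind of unpickling incompatibility is to read the new attribute defensively, so metadata pickled before the `mode` field existed falls back to a default. A hedged sketch (the `StorageMode` enum and default here are assumptions, not the exact upstream fix):

```python
import enum
from typing import Any, Dict


class StorageMode(enum.Enum):
    MOUNT = 'MOUNT'
    COPY = 'COPY'


def resolve_mode(metadata: Any, override_args: Dict[str, Any]) -> StorageMode:
    """Resolve the storage mode, tolerating StorageMetadata objects
    pickled by older versions that predate the `mode` attribute."""
    # getattr with a default avoids the AttributeError shown in the
    # traceback above when loading a pre-upgrade StorageMetadata.
    return override_args.get('mode',
                             getattr(metadata, 'mode', StorageMode.MOUNT))
```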
@romilbhardwaj Re-tested all of the following cases:

Below is the backwards compatibility test you commented about. Confirmed that it passes now. Thanks for the catch.

Edge cases:

Newly added smoke tests:
Thanks for the extensive tests @landscapepainter! Should be good to go now.
This PR resolves #1203.

Currently, when we run `sky stop <cluster-name>` on a cluster that has cloud storage mounted, the mount is lost when the cluster is restarted with `sky start <cluster-name>`, because the mounting command does not get run. In this PR, I save the `storage_mounts` dictionary as metadata in the `cluster` table's `storage_mounts_metadata` when `sky launch` is initially run. This is then loaded and used to re-mount the storage when `sky start <cluster-name>` is run.
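A sketch of the restore path this describes, reconstructed from the call sites visible in the traceback above (`get_storage_mounts_metadata`, `Storage.from_metadata`) and the `sync_file_mounts` API mentioned earlier in the thread; the argument names are assumptions, not the exact upstream signatures:

```python
def restore_storage_mounts_on_start(handle, backend):
    """On `sky start`, rebuild Storage objects from the metadata saved at
    `sky launch` time and re-run the mounting step."""
    # Loads the dst -> Storage mapping persisted in the cluster table.
    storage_mounts = backend.get_storage_mounts_metadata(handle.cluster_name)
    if storage_mounts:
        # Runs the same mounting commands `sky launch` would run.
        backend.sync_file_mounts(handle,
                                 all_file_mounts=None,
                                 storage_mounts=storage_mounts)
```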
Tested (run the relevant ones):

- `bash format.sh`
- Mounting with `gcsfuse`, `goofys`, and `rclone`.
- `pytest tests/test_smoke.py::TestStorageWithCredentials`
- `pytest tests/test_smoke.py::test_file_mounts`
- `state.db` created with `sky launch` in the `master` branch; `sky stop`/`start` run from the PR branch.
- `state.db` created with `sky launch` in the `master` branch; `sky stop` in the `master` branch; `sky start` in the PR branch.
- `sky storage delete` where the storage has 3 stores and 1 of them is externally deleted.
- `sky start` with a cluster that has a bucket mounted and then deleted externally.
- `pytest tests/test_smoke.py::test_aws_storage_mounts_with_stop --aws`
- `pytest tests/test_smoke.py::test_gcp_storage_mounts_with_stop --gcp`
- `pytest tests/test_smoke.py::test_kubernetes_storage_mounts_with_stop --kubernetes`