
[Storage] Support for storage info in cluster table's metadata #2322

Merged

Conversation


@landscapepainter landscapepainter commented Jul 28, 2023

This PR resolves #1203.

Currently, when we run sky stop <cluster-name> on a cluster that has cloud storage mounted, the mount is lost when the cluster is restarted with sky start <cluster-name>, because the mounting command does not get run.

In this PR, I save the storage_mounts dictionary as metadata in the cluster table's storage_mounts_metadata column when sky launch is initially run. Then, this is loaded and used to re-mount the storage when sky start <cluster-name> is run.
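
At a high level, the mechanism works like the sketch below. This is a minimal illustration of the idea, not the PR's actual code: the column name follows the description above, but the function names, schema details, and the direct use of pickle/sqlite3 here are assumptions.

import pickle
import sqlite3

_DB_PATH = 'state.db'  # hypothetical path; SkyPilot manages its own state DB

def save_storage_mounts_metadata(cluster_name: str, storage_mounts: dict) -> None:
    # At `sky launch` time: serialize the {dst: storage metadata} mapping
    # into the cluster's row.
    blob = pickle.dumps(storage_mounts)
    with sqlite3.connect(_DB_PATH) as conn:
        conn.execute(
            'UPDATE clusters SET storage_mounts_metadata = ? WHERE name = ?',
            (blob, cluster_name))

def load_storage_mounts_metadata(cluster_name: str):
    # At `sky start` time: load the mapping back so the mounts can be re-run.
    with sqlite3.connect(_DB_PATH) as conn:
        row = conn.execute(
            'SELECT storage_mounts_metadata FROM clusters WHERE name = ?',
            (cluster_name,)).fetchone()
    return pickle.loads(row[0]) if row and row[0] is not None else None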

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests on gcsfuse, goofys, and rclone.
  • Relevant individual smoke tests: pytest tests/test_smoke.py::TestStorageWithCredentials
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_file_mounts
  • state.db created with sky launch in master branch, sky stop/start ran from PR branch
  • state.db created with sky launch in master branch, sky stop in master branch, sky start in PR branch
  • Running sky storage delete where storage has 3 stores and one of them is externally deleted
  • Running sky start with a cluster that has a bucket mounted and deleted externally
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_aws_storage_mounts_with_stop --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_gcp_storage_mounts_with_stop --gcp
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_kubernetes_storage_mounts_with_stop --kubernetes

@landscapepainter landscapepainter marked this pull request as draft July 28, 2023 23:40
@landscapepainter landscapepainter marked this pull request as ready for review August 12, 2023 05:26

@landscapepainter landscapepainter left a comment


This is now ready for a review.


@romilbhardwaj romilbhardwaj left a comment


Thanks @landscapepainter! Took a high level look at the code. Haven't tested yet.

Comment on lines 4257 to 4260
for _, storage_obj in storage_mounts.items():
for _, store_obj in storage_obj.stores.items():
# Update some of the non-picklable attributes of store instances
store_obj.make_picklable()


Can we use and store StorageMetadata here? It was designed to be a picklable representation of Storage objects.
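
(For context, the suggested pattern round-trips a Storage object through its picklable metadata form, roughly as below. The from_metadata constructor matches a call seen later in this thread; treating handle as the StorageMetadata accessor, and the bucket name, are assumptions.)

from sky.data import storage as storage_lib

storage_obj = storage_lib.Storage(name='my-bucket')  # hypothetical bucket name
metadata = storage_obj.handle  # assumed accessor for the picklable StorageMetadata
restored = storage_lib.Storage.from_metadata(metadata)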



Thanks for pointing this out!

return

cluster_name = handle.cluster_name
cluster_metadata = global_user_state.get_cluster_metadata(cluster_name)


I'm not sure what counts as metadata for a cluster. In other words, should storage objects be instead stored as a part of the handle? Two reasons for this:

  1. get_cluster_metadata and set_cluster_metadata appear to be unused before this... @Michaelvll - was there a specific use case for these methods?
  2. handle seems to store all metadata for a cluster (e.g., name, resources etc.)



To add on: in my current implementation, the storage_mounts dictionary, which maps dst (str) to a Storage object, is stored in the cluster's metadata. This is useful because I can pass it into the sync_file_mounts API just like we do on sky launch. If we want to store only the Storage object in handle, we need to consider a way to store dst and keep it paired with its Storage object.
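
The shape of the mapping under discussion is roughly the following (an illustrative sketch; the mount path and bucket name are made up):

from typing import Dict

from sky.data import storage as storage_lib

# Mount destination on the VM -> Storage object describing the bucket.
storage_mounts: Dict[str, storage_lib.Storage] = {
    '/nfs': storage_lib.Storage(name='my-bucket',
                                mode=storage_lib.StorageMode.MOUNT),
}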

Resolved review thread: sky/core.py (outdated)
@landscapepainter commented Oct 2, 2023

@romilbhardwaj This is ready for a look. Did some additional backwards compatibility tests as well.

  • state.db created with sky launch in master branch, sky stop/start ran from PR branch
  • state.db created with sky launch in master branch, sky stop in master branch, sky start in PR branch

@romilbhardwaj

Additionally, can you also run test_smoke.py::test_file_mounts?

@landscapepainter

@romilbhardwaj This is ready for another look. Confirmed pytest tests/test_smoke.py::test_file_mounts passes.

@landscapepainter commented Oct 10, 2023

@romilbhardwaj This is ready for a look. Additional testing was done for sky storage delete and sky start after making _get_bucket raise an error when the bucket is externally deleted while sync_on_reconstruction is set to False.

  • Running sky storage delete where storage has 3 stores and one of them is externally deleted
  • Running sky start with a cluster that has a bucket mounted and deleted externally

The following is the result of running sky start after externally deleting the mounted bucket.

$ sky start doyoung-test -y
I 10-10 05:34:43 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-10-10-05-34-43-492829/provision.log
I 10-10 05:34:43 cloud_vm_ray_backend.py:1277] Cluster 'doyoung-test' (status: STOPPED) was previously launched in GCP us-central1. Relaunching in that region.
I 10-10 05:34:47 cloud_vm_ray_backend.py:1887] Launching on GCP us-central1 (us-central1-a)
I 10-10 05:35:21 log_utils.py:45] Head node is up.
I 10-10 05:35:58 cloud_vm_ray_backend.py:1692] Successfully provisioned or found existing VM.
I 10-10 05:36:07 cloud_vm_ray_backend.py:4587] Processing 1 storage mount.
sky.exceptions.StorageExternalDeletionError: The bucket, 'gcs-mount-testing', could not be mounted on cluster 'doyoung-test'. Please verify that the bucket exists. The cluster started successfully without mounting the bucket.
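
The check that produces this error looks roughly like the sketch below. This is an illustration of the behavior described above, not the PR's exact code; the _fetch_bucket_if_exists helper is hypothetical.

from sky import exceptions

def _get_bucket(self):
    # On `sky start`, sync_on_reconstruction is False: the bucket existed
    # at `sky launch` time, so a missing bucket implies external deletion.
    bucket = self._fetch_bucket_if_exists(self.name)  # hypothetical helper
    if bucket is not None:
        return bucket
    if not self.sync_on_reconstruction:
        raise exceptions.StorageExternalDeletionError(
            f'Bucket {self.name!r} was externally deleted and could not '
            'be mounted.')
    # Otherwise fall through to (re)creating the bucket (elided).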


@romilbhardwaj romilbhardwaj left a comment


Thanks @landscapepainter! Left some minor comments, should be good to go after they are fixed. Please also run file mounting related smoke tests and some manual backward compatibility tests to make sure the updates in global_user_state don't break for existing storages/clusters.

Resolved review threads: sky/data/storage.py (outdated), sky/exceptions.py (outdated), sky/data/storage.py (×3), tests/test_smoke.py

@romilbhardwaj romilbhardwaj left a comment


Thanks @landscapepainter! Left one comment - should be good to go if all smoke tests and manual backward compatibility tests pass.

Resolved review threads: sky/data/storage.py (×3)

@romilbhardwaj romilbhardwaj left a comment


Ran into an issue with backward compatibility:

  1. On the master branch, run sky launch -c test task.yaml --cloud gcp with this YAML:
file_mounts:
  /nfs:
    name: romil-test-bucket
    store: s3
    mode: MOUNT

run: |
  ls /nfs
  2. Switch to this branch, run sky launch -c test task.yaml --cloud gcp again. (This step may not be required, but I ran it.)
  3. Stop the cluster with sky stop test.
  4. sky start test fails with:
I 11-07 17:41:15 cloud_vm_ray_backend.py:1692] Successfully provisioned or found existing VM.
D 11-07 17:41:15 cloud_vm_ray_backend.py:3001] Checking if skylet is running on the head node.
Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 311, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1162, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 332, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 2573, in start
    core.start(name,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 332, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/core.py", line 278, in start
    _start(cluster_name,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/core.py", line 211, in _start
    storage_mounts = backend.get_storage_mounts_metadata(handle.cluster_name)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 4708, in get_storage_mounts_metadata
    storage_mounts[dst] = storage_lib.Storage.from_metadata(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 760, in from_metadata
    mode=override_args.get('mode', metadata.mode))
AttributeError: 'StorageMetadata' object has no attribute 'mode'
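
The failure indicates that StorageMetadata objects pickled by older versions predate the mode field, so unpickling them yields instances without that attribute. A common backward-compatible pattern, shown below as a sketch of the idea rather than the PR's actual fix, is to fall back to a default when the attribute is absent (choosing StorageMode.MOUNT as the default is an assumption):

from sky.data import storage as storage_lib

def _mode_from_metadata(metadata) -> storage_lib.StorageMode:
    # Metadata pickled before `mode` existed lacks the attribute;
    # getattr supplies a default instead of raising AttributeError.
    return getattr(metadata, 'mode', storage_lib.StorageMode.MOUNT)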

@landscapepainter

@romilbhardwaj Re-tested on all the following cases:

  • Relevant individual smoke tests: pytest tests/test_smoke.py::TestStorageWithCredentials
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_file_mounts

Below is the backwards compatibility test you commented about. Confirmed that it passes now. Thanks for the catch.

  • state.db created with sky launch in master branch, sky stop/start ran from PR branch
  • state.db created with sky launch in master branch, sky stop in master branch, sky start in PR branch

Edge cases:

  • Running sky storage delete where storage has 3 stores and one of them is externally deleted
  • Running sky start with a cluster that has a bucket mounted and deleted externally

Newly added smoke tests:

  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_aws_storage_mounts_with_stop --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_gcp_storage_mounts_with_stop --gcp
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_kubernetes_storage_mounts_with_stop --kubernetes


@romilbhardwaj romilbhardwaj left a comment


Thanks for the extensive tests @landscapepainter! Should be good to go now.

Successfully merging this pull request may close these issues.

[Storage] VM loses S3 mount after stop/start