Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner #2943

Merged
merged 25 commits into from
Jan 12, 2024

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jan 4, 2024

Changes

  1. Removes the dependency for ray for skypilot[gcp].
  2. Moved the TPU node creation and deletion in the new provisioner to avoid the depedencies of ray in the backend.
  3. Refactor the minimal permissions added in Relax required permissions for custom VPC (GCP) #2944 to gcp_utils.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-gcp --cloud gcp --cpus 2+ echo hi; sky exec test-gcp echo hi
    • sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky stop test-tpu-node; sky start test-tpu-node; sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky down test-tpu-node
    • sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node; sky status -r test-tpu-node; sky start test-tpu-node; sky autostop -i 0 --down test-tpu-node; sky status -r test-tpu-node
  • All smoke tests: pytest tests/test_smoke.py --gcp (with only skypilot[gcp] installed): Failed the following tests due to the lack of credentials for AWS.
    tests/test_smoke.py::TestStorageWithCredentials::test_nonexistent_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_upload_to_existing_bucket
    tests/test_smoke.py::test_example_app
    tests/test_smoke.py::TestStorageWithCredentials::test_bucket_bulk_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_upload_source_with_spaces
    tests/test_smoke.py::TestStorageWithCredentials::test_multiple_buckets_creation_and_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_new_bucket_creation_and_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_bucket_external_deletion
    tests/test_smoke.py::TestStorageWithCredentials::test_private_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_list_source
    tests/test_smoke.py::test_gcp_storage_mounts_with_stop
    tests/test_smoke.py::test_spot_storage
    tests/test_smoke.py::test_sky_bench
    tests/test_smoke.py::test_multiple_accelerators_unordered_with_default
    tests/test_smoke.py::test_file_mounts
    tests/test_smoke.py::TestStorageWithCredentials::test_excluded_file_cloud_storage_upload_copy
    tests/test_smoke.py::test_multiple_resources
    tests/test_smoke.py::TestStorageWithCredentials::test_public_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_public_bucket
    tests/test_smoke.py::TestStorageWithCredentials::test_copy_mount_existing_storage
    
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh (only with the cloud related dependencies installed)
    • master: sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
    • master: sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; this branch: sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node
  • Installation speed
    • master: pip install .[gcp] - 42.337s
    • current PR: pip install .[gcp] - 26.453s

@concretevitamin
Copy link
Collaborator

Awesome!! May want to do an install speed comparison, and/or smoke tests?

@Michaelvll Michaelvll changed the title [Provisioner] Remove ray dependency for GCP [Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner Jan 5, 2024
@Michaelvll
Copy link
Collaborator Author

Awesome!! May want to do an install speed comparison, and/or smoke tests?

Thanks for the comment @concretevitamin! Added the smoke tests and the speed comparison in the PR description.

Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @Michaelvll to see the installation speedups! Did a pass.

sky/backends/backend_utils.py Show resolved Hide resolved
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/clouds/utils/gcp_utils.py Show resolved Hide resolved
sky/provision/gcp/constants.py Show resolved Hide resolved
sky/setup_files/setup.py Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @Michaelvll, some minor nits, thanks.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
# Delete TPU node.
"""Delete a TPU node with gcloud CLI.

This is used for both stopping and terminating a TPU node. It is ok to call
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case maybe best to name it stop_or_delete_tpu_node(). How does the CLI cmd determine when to stop and when to delete?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, maybe I am not clear enough in the docstr. This function always delete the tpu accelerator, no matter it is stopping or terminating the cluster, because the host VM will be correctly stopped or terminated. Whenever we restart a stopped TPU node cluster, we will create a new TPU accelerator to attach to the host VM.

Just updated the docstr. PTAL

@@ -44,7 +44,7 @@
# e.g., when we add new events to skylet, or we fix a bug in skylet.
#
# TODO(zongheng,zhanghao): make the upgrading of skylet automatic?
SKYLET_VERSION = '5'
SKYLET_VERSION = '6'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the comment above to add this case as a reason we must bump the version? For future guidance.

sky/provision/gcp/instance_utils.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
@Michaelvll Michaelvll merged commit 71525cd into master Jan 12, 2024
19 checks passed
@Michaelvll Michaelvll deleted the gcp-remove-ray-dependency branch January 12, 2024 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants