[Provisioner] Remove ray dependency for GCP and move TPU node to new provisioner #2943
Conversation
Awesome!! May want to do an install speed comparison, and/or smoke tests?
Thanks for the comment @concretevitamin! Added the smoke tests and the speed comparison in the PR description.
Awesome to see the installation speedups, @Michaelvll! Did a pass.
LGTM @Michaelvll, some minor nits, thanks.
sky/provision/gcp/instance_utils.py
Outdated
# Delete TPU node.
"""Delete a TPU node with gcloud CLI.

This is used for both stopping and terminating a TPU node. It is ok to call
In this case maybe best to name it `stop_or_delete_tpu_node()`. How does the CLI cmd determine when to stop and when to delete?
Ahh, maybe I was not clear enough in the docstr. This function always deletes the TPU accelerator, no matter whether we are stopping or terminating the cluster, because the host VM will be correctly stopped or terminated either way. Whenever we restart a stopped TPU node cluster, we create a new TPU accelerator and attach it to the host VM.
Just updated the docstr. PTAL
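To illustrate the semantics described above, here is a minimal sketch of such a deletion helper. This is not SkyPilot's actual implementation; the function names, parameters, and the `dry_run` flag are assumptions for illustration, though `gcloud compute tpus delete` is the real CLI command for removing a (legacy) TPU node.

```python
# Hypothetical sketch of a TPU-node deletion helper; names and structure
# are assumptions, not SkyPilot's actual code.
import subprocess
from typing import List


def _tpu_delete_cmd(tpu_name: str, zone: str) -> List[str]:
    # Build the gcloud CLI command that removes the TPU accelerator.
    # The host VM is stopped or terminated separately, so deleting the
    # accelerator is safe for both `sky stop` and `sky down`; a restart
    # later creates a fresh accelerator and attaches it to the host VM.
    return [
        'gcloud', 'compute', 'tpus', 'delete', tpu_name,
        f'--zone={zone}', '--quiet',
    ]


def delete_tpu_node(tpu_name: str, zone: str,
                    dry_run: bool = False) -> List[str]:
    cmd = _tpu_delete_cmd(tpu_name, zone)
    if not dry_run:
        # Raises CalledProcessError if gcloud reports a failure.
        subprocess.run(cmd, check=True)
    return cmd
```

With `dry_run=True` the helper only returns the command it would run, which makes the stop-vs-terminate symmetry easy to see: the same deletion command is issued in both cases.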
@@ -44,7 +44,7 @@
# e.g., when we add new events to skylet, or we fix a bug in skylet.
#
# TODO(zongheng,zhanghao): make the upgrading of skylet automatic?
SKYLET_VERSION = '5'
SKYLET_VERSION = '6'
Can we update the comment above to add this case as a reason we must bump the version? For future guidance.
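The version bump matters because a running skylet only picks up changes after a restart. A hedged sketch of the mechanism, assuming a simple string comparison (the helper name and check logic here are illustrative, not SkyPilot's actual code):

```python
# Illustrative sketch: why SKYLET_VERSION must be bumped. The running
# skylet reports the version it was started with; any mismatch with the
# packaged constant (new events, bug fixes, or dependency changes like
# removing ray in this PR) triggers a restart of the skylet process.
SKYLET_VERSION = '6'


def should_restart_skylet(running_version: str) -> bool:
    # A plain inequality check: any difference, newer or older,
    # means the old skylet process must be replaced.
    return running_version != SKYLET_VERSION
```

Forgetting to bump the constant would leave clusters running the stale skylet, which is why the comment above the constant should enumerate every case that requires a bump.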
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Changes
- `skypilot[gcp]`
- `gcp_utils`

Tested (run the relevant ones):
- `bash format.sh`
- `sky launch -c test-gcp --cloud gcp --cpus 2+ echo hi; sky exec test-gcp echo hi`
- `sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky stop test-tpu-node; sky start test-tpu-node; sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky down test-tpu-node`
- `sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node; sky status -r test-tpu-node; sky start test-tpu-node; sky autostop -i 0 --down test-tpu-node; sky status -r test-tpu-node`
- `pytest tests/test_smoke.py --gcp` (with only `skypilot[gcp]` installed): failed the following test due to the lack of credentials for AWS: `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_comaptibility_tests.sh` (only with the cloud related dependencies installed)
  - `sky launch -c test-tpu-node examples/tpu/tpu_app.yaml`; this branch: `sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node`
  - `sky launch -c test-tpu-node examples/tpu/tpu_app.yaml`; this branch: `sky exec test-tpu-node examples/tpu/tpu_app.yaml; sky launch -c test-tpu-node examples/tpu/tpu_app.yaml; sky autostop -i 0 test-tpu-node`
Installation speed comparison:
- `pip install .[gcp]` - 42.337s
- `pip install .[gcp]` - 26.453s