[AWS] Bring-your-own-VPC that disables public IPs for all SkyPilot nodes. #1512
After iterating with the user, reverted to having a fixed config path at Reason: After implementing an env-var config, in my own testing, I exported the env var in one terminal, but forgot to export it in a new terminal where I ran With a global path like I've confirmed with the user that the fixed-path approach should work with their config distribution.
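The trade-off described above can be sketched in a few lines. This is only an illustration, not the PR's implementation: `EXAMPLE_SKY_CONFIG` is a hypothetical variable name, and the fixed path is assumed to be the `~/.sky/config.yaml` named in the PR description.

```python
import os
from typing import Mapping, Optional

# Assumption: the fixed location is the one named in the PR description.
FIXED_CONFIG_PATH = '~/.sky/config.yaml'
# Hypothetical env var name, used only for this illustration.
HYPOTHETICAL_ENV_VAR = 'EXAMPLE_SKY_CONFIG'


def resolve_from_env(environ: Mapping[str, str]) -> Optional[str]:
    """Env-var approach: the result depends on what each terminal exported."""
    return environ.get(HYPOTHETICAL_ENV_VAR)


def resolve_fixed(environ: Mapping[str, str]) -> str:
    """Fixed-path approach: `environ` is deliberately ignored."""
    return os.path.expanduser(FIXED_CONFIG_PATH)


# Terminal A exported the variable; terminal B did not.
terminal_a = {HYPOTHETICAL_ENV_VAR: '/tmp/custom-config.yaml'}
terminal_b: Mapping[str, str] = {}

# The env-var lookup silently disagrees between the two terminals ...
print(resolve_from_env(terminal_a), resolve_from_env(terminal_b))
# ... while the fixed path is identical in both.
print(resolve_fixed(terminal_a) == resolve_fixed(terminal_b))
```

With the env-var approach, a command run from the second terminal would see no config at all, which is the failure mode the comment describes.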
Thanks for the great effort @concretevitamin! This will unblock our user. ; )
I haven't done a complete pass yet (only covered the first 6 files for now), but I feel it might be good to submit the comments first. I will continue reading the PR tomorrow.
Partially addressed comments. Main things left:
- test spot controller in an old VPC
- try to always drop ssh proxy command from launch hash
A minor nit: we query for the IP of the cluster twice, even if it only has internal IPs (here). This is fine since both queries return internal IPs, but we can optimize it by querying the IP only once.
Good call, done.
Thanks @Michaelvll. Addressed. PTAL.
Thanks for the fix @concretevitamin! Left several comments. A big concern is the `launch_hash` calculated in the autostop event. We may need to drop the `ssh_proxy_command` directly from the ray yaml, to avoid a different launch hash causing leaked instances.
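To illustrate the concern, here is a minimal sketch of why dropping a key before hashing keeps the hash stable. It does not reproduce how Ray or SkyPilot actually compute the launch hash: the `example_launch_hash` helper, the SHA-1/JSON scheme, and the assumption that `ssh_proxy_command` sits under an `auth` section are all illustrative.

```python
import copy
import hashlib
import json
from typing import Any, Dict


def example_launch_hash(cluster_config: Dict[str, Any]) -> str:
    """Hypothetical hash over a cluster config, ignoring ssh_proxy_command."""
    # Work on a copy so the caller's config is left untouched.
    cfg = copy.deepcopy(cluster_config)
    # Assumption: the proxy command lives under the "auth" section.
    cfg.get('auth', {}).pop('ssh_proxy_command', None)
    # Sort keys so that dict ordering does not affect the hash.
    payload = json.dumps(cfg, sort_keys=True).encode('utf-8')
    return hashlib.sha1(payload).hexdigest()


with_proxy = {
    'auth': {
        'ssh_user': 'ubuntu',
        'ssh_proxy_command': 'ssh -W %h:%p user@jump-host',
    },
    'node_config': {'InstanceType': 'm5.large'},
}
without_proxy = {
    'auth': {'ssh_user': 'ubuntu'},
    'node_config': {'InstanceType': 'm5.large'},
}

# Both configs hash identically once the proxy command is dropped.
print(example_launch_hash(with_proxy) == example_launch_hash(without_proxy))
```

If the proxy command were included in the hashed payload, the same cluster seen from two contexts (one with the proxy setting, one without) would produce different hashes, and existing instances could fail to be matched, which is the leak scenario raised in the review.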
Thanks @Michaelvll for catching an important bug: after a recent commit that fixes the monkey patching, the second `ray up` in the backend was leaking VMs if hit. Fixed now. PTAL.
Thanks for the quick fix @concretevitamin! Just tested with the following (with private VPC set up on us-east-1 and us-east-2):
- `sky launch -c test-hash-2 --num-nodes 2 -i 0 echo hi`: both nodes have the same launch hash and correctly autostopped.
- `sky status -r`: correctly gets the stopped status.
- `sky start test-hash-2`: correctly restarts the cluster without leakage.
- `sky spot launch --cloud aws --region us-east-2 echo hi`: the controller is launched on us-east-1 and the spot cluster is correctly launched on us-east-2.
- Remove the controller, `mv ~/.aws ~/.aws.bk; sky check` and `sky spot launch echo hi`. The controller is launched on Azure and the spot cluster is correctly launched on GCP. (for testing that sys.executable works on Azure)
- Remove the controller, `mv ~/.azure ~/.azure.bk; sky check` and `sky spot launch echo hi`. The controller is launched on GCP and the spot cluster is correctly launched on GCP. (for testing that sys.executable works on GCP)
Thanks @Michaelvll for your thoughtful reviews! Passed the following, and I also pushed some simple stability fixes for smoke tests issues discovered:
Merging this.
Support bring-your-own-VPC that disables public IPs for all SkyPilot nodes (incl. spot controller).
Introduces an experimental `sky_config` module that reads `~/.sky/config.yaml`, a YAML config file for such networking/auth settings. See below for an example config.
NOTEs
Steps to set up a new VPC (single region) to test this PR
- `t2.micro` VM in a "public" subnet of the new VPC; select `sky-key` as keypair; make sure to choose assign public IP
- `~/.sky` environment: for example, `cp -r` that dir somewhere, and remove `~/.sky/*`.
- `~/.sky/config.yaml`. Example:
- `vpc_name` with the newly created VPC name; and in `ssh_proxy_command` replace `<user>@<jump server public IP>`. (Alternatively, use your own proxy setup there.)
Steps to set up VPC peering between 2 regions
More info on peering
Deferred for the future
Testing
TODOs
logger.info
Tested:
- `bash tests/run_smoke_tests.sh` (e82c4b4)
  - `TestStorageWithCredentials::test_bucket_bulk_deletion[StoreType.GCS]` being flaky (1-2 times out of ~5 times pass rate), unrelated?
- `bash tests/backward_comaptibility_tests.sh` (commit e82c4b4)