
[AWS] Bring-your-own-VPC that disables public IPs for all SkyPilot nodes. #1512

Merged

merged 60 commits on Jan 4, 2023

Conversation

concretevitamin
Collaborator

@concretevitamin concretevitamin commented Dec 12, 2022

Support bring-your-own-VPC that disables public IPs for all SkyPilot nodes (incl. spot controller).

Introduces an experimental sky_config module that reads ~/.sky/config.yaml, a YAML config file for such networking/auth settings. See below for an example config.

NOTEs

  • Any existing clusters will not pick up new configs, even if they are stopped and then restarted
  • Configs are tied to the lifetime of each cluster: they are read once when a new cluster is launched and are immutable afterwards
    • i.e., if cluster X was launched with config C and the config is later changed, restarting cluster X still works and still uses the original config C
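
The "read once, immutable afterwards" behavior can be sketched as follows; `load_config` and `get_nested` are illustrative names for this sketch, not SkyPilot's actual API.

```python
# Minimal sketch of the config snapshot semantics; load_config and
# get_nested are illustrative, not SkyPilot's actual API.
import copy

def load_config(raw: dict) -> dict:
    # A cluster snapshots its config at launch time; deep-copying
    # models the "read once, immutable afterwards" behavior.
    return copy.deepcopy(raw)

def get_nested(config: dict, keys: tuple, default=None):
    """Look up nested keys, e.g. ('aws', 'vpc_name')."""
    cur = config
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return default
        cur = cur[k]
    return cur

user_config = {'aws': {'vpc_name': 'skypilot-vpc', 'use_internal_ips': True}}
snapshot = load_config(user_config)           # read once at launch
user_config['aws']['vpc_name'] = 'other-vpc'  # later config edits...
print(get_nested(snapshot, ('aws', 'vpc_name')))  # -> skypilot-vpc
```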

Steps to set up a new VPC (single region) to test this PR

  • Provision a new VPC
  • Provision a jump server VM
    • Ensure local client can ssh into jump server
    • Ensure jump server can ping the newly created VPC using private IP
    • Both of the above are satisfied by:
      • Manually launching a t2.micro VM in a "public" subnet of the new VPC; selecting sky-key as the keypair; and making sure to assign a public IP
  • Switch to a new ~/.sky environment: for example, cp -r that dir somewhere, and remove ~/.sky/*.
    • Things will still work without this step, but this will make testing easier
  • Add a config file at ~/.sky/config.yaml. Example:
```yaml
aws:
  # The VPC to use in each region.
  #
  # If this is set, any region without a unique VPC with this name will not be
  # able to launch SkyPilot nodes. SkyPilot's failover will still properly
  # function to look for such an eligible region.
  #
  # Optional; default: None (SkyPilot will use the default VPC in each region).
  vpc_name: skypilot-vpc

  # Set to true to use private IPs to communicate between the local client and
  # any SkyPilot nodes. This requires that the networking stack be properly set
  # up. Specifically, setting this flag means SkyPilot will only use subnets
  # that satisfy both of the following to launch nodes:
  #  - subnets with name tags containing the substring "private"
  #  - subnets that are configured to not assign public IPs (the
  #    `map_public_ip_on_launch` attribute is False)
  #
  # This flag is typically set together with 'vpc_name' above and
  # 'auth.ssh_proxy_command'.
  #
  # Optional; default: False.
  use_internal_ips: True

auth:
  # If set, this is passed as the '-o ProxyCommand' option for any SSH
  # connections (including rsync) used to communicate between the local client
  # and any SkyPilot nodes. This option is not used between SkyPilot nodes,
  # since they may not have such a proxy set up.
  #
  # Useful for using a jump server to communicate with SkyPilot nodes hosted in
  # private subnets without public IPs.
  #
  # Optional; default: None.
  ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@xxxxxx
```
  • Be sure to replace vpc_name with the newly created VPC's name, and in ssh_proxy_command replace ec2-user@xxxxxx with <user>@<jump server public IP>. (Alternatively, use your own proxy setup there.)
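
A sketch of the subnet filter that `use_internal_ips: True` implies, per the comments in the config above. The function name and the plain-dict subnet representation are illustrative; real code would inspect boto3 subnet objects.

```python
# Illustrative sketch of the subnet filtering described above; subnets
# are plain dicts here rather than boto3 objects, and the function
# name is hypothetical.
def pick_private_subnets(subnets):
    """Keep only subnets eligible for nodes without public IPs."""
    eligible = []
    for s in subnets:
        name = s.get('tags', {}).get('Name', '')
        # Rule 1: the name tag contains the substring "private".
        # Rule 2: the subnet does not auto-assign public IPs
        #         (map_public_ip_on_launch is False).
        if 'private' in name and not s.get('map_public_ip_on_launch', True):
            eligible.append(s)
    return eligible

subnets = [
    {'tags': {'Name': 'skypilot-vpc-public-1a'}, 'map_public_ip_on_launch': True},
    {'tags': {'Name': 'skypilot-vpc-private-1a'}, 'map_public_ip_on_launch': False},
]
print([s['tags']['Name'] for s in pick_private_subnets(subnets)])
# -> ['skypilot-vpc-private-1a']
```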

Steps to set up VPC peering between 2 regions

  • Use the steps above to create 2 VPCs in 2 regions, using the same name
  • Create a VPC Peering Connection
    • Requires: the two VPCs must have non-overlapping CIDR ranges (e.g., 10.0.0.0/16, 10.1.0.0/16, ...)
      • These are specified at VPC creation, and thus require "network/IP address range planning"
  • Modify the route tables of each VPC
    • In each of VPC 1's route tables, add a route: (Destination = the other VPC's CIDR, e.g., 10.1.0.0/16; Target = pcx-xxxxxx, the peering connection ID)
    • In each of VPC 2's route tables, add a route: (Destination = the other VPC's CIDR, e.g., 10.0.0.0/16; Target = pcx-xxxxxx, the peering connection ID)
    • Do this for all route tables, including the single public route table and the per-AZ private route tables in each VPC
    • NOTE
      • Technically, only the "jump server VPC -> sky VPC" direction is needed (i.e., the above only needs to be done in the jump server's VPC)
      • But between 2 sky VPCs, set up both directions (since the spot controller can be in any region).
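
The non-overlap requirement on the CIDR ranges can be checked up front with the standard library before creating the VPCs:

```python
# Check that two VPCs' CIDR blocks do not overlap before peering them,
# using only the standard library.
import ipaddress

def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))

print(cidrs_overlap('10.0.0.0/16', '10.1.0.0/16'))   # -> False (safe to peer)
print(cidrs_overlap('10.0.0.0/16', '10.0.128.0/17')) # -> True (re-plan ranges)
```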

More info on peering

Deferred for the future

  • (deferred) per-region proxy command
  • have a simple preflight test for proxy command
  • (deferred) make Python API pick up configs easier

Testing

TODOs

  • smoke tests
  • remove some logger.info

Tested:

  • failover cross-region
  • spot launch with region migration
  • Ran the following, with the above single-region setup:
    • autostop
    • launch
    • exec
    • queue
    • logs
    • status -r
    • ssh clus
    • spot *
  • cross-region with peering setup: sky spot launch 'echo $(hostname)'
    • VPC 1: us-east-1
      • jump server
      • spot controller
    • VPC 2: us-west-2
      • data plane
  • cross-region with peering setup:
    • VPC 1: us-east-1
      • jump server
    • VPC 2: us-west-2
      • data plane
  • All smoke tests, without any config: bash tests/run_smoke_tests.sh (e82c4b4)
  • All smoke tests, with the above config and setup: bash tests/run_smoke_tests.sh (e82c4b4)
    • This requires setting up (1) 4 VPCs in 4 AWS US regions (2) peering connections between them
    • All passed, except for TestStorageWithCredentials::test_bucket_bulk_deletion[StoreType.GCS] being flaky (1-2 times out of ~5 times pass rate), unrelated?
  • Backward compatibility tests, without any config: bash tests/backward_comaptibility_tests.sh (commit e82c4b4)
  • Backward compatibility tests, with the above config and setup: bash tests/backward_comaptibility_tests.sh (commit e82c4b4)

@Michaelvll Michaelvll self-requested a review December 12, 2022 20:28
@concretevitamin
Collaborator Author

After iterating with the user, reverted to having a fixed config path at ~/.sky/config.yaml.

Reason: after implementing an env-var config, in my own testing I exported the env var in one terminal but forgot to export it in a new terminal where I ran sky launch. Those launches failed to pick up the config, so nodes were launched with public IPs, which was undesired.

With a global path like ~/.sky/config.yaml it should be hard to trigger this.

I've confirmed with the user that the fixed-path approach should work with their config distribution.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the great effort @concretevitamin! This will unblock our user. ; )

I haven't done a complete pass yet (only covered the first 6 files for now), but I feel it might be good to submit the comments first. I will continue reading the PR tomorrow.

Collaborator Author

@concretevitamin concretevitamin left a comment

Partially addressed comments. Main things left:

  • test the spot controller in an old VPC
  • try to always drop the ssh proxy command from the launch hash

> A minor nit: we query for the IP of the cluster twice, even if it only has internal IPs (here). It is fine since both queries return internal IPs, but we can optimize by querying the IP only once.

Good call, done.

Collaborator Author

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll. Addressed. PTAL.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the fix @concretevitamin! Left several comments. A big concern is the launch_hash calculated in the autostop event. We may need to drop the ssh_proxy_command directly from the Ray YAML, to avoid differing launch hashes causing leaked instances.
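
The launch-hash concern above, as a sketch: if the proxy command participates in the hash, two clients with different proxy settings compute different hashes for the same cluster; dropping it before hashing keeps the hash stable. Key names and the hashing scheme here are illustrative, not SkyPilot's implementation.

```python
import hashlib
import json

def launch_hash(cluster_config: dict) -> str:
    # Drop client-specific settings so every client computes the same
    # hash for the same cluster (illustrative key name).
    stripped = {k: v for k, v in cluster_config.items()
                if k != 'ssh_proxy_command'}
    # sort_keys gives a deterministic serialization to hash.
    blob = json.dumps(stripped, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

with_proxy = {'region': 'us-east-1', 'instance_type': 't2.micro',
              'ssh_proxy_command': 'ssh -W %h:%p ec2-user@jump'}
without_proxy = {'region': 'us-east-1', 'instance_type': 't2.micro'}
# Same cluster, different client-side proxy settings -> same hash.
assert launch_hash(with_proxy) == launch_hash(without_proxy)
```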

Collaborator Author

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for catching an important bug: after a recent commit that fixed the monkey patching, the second `ray up` in the backend leaked VMs when hit. Fixed now. PTAL.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @concretevitamin! Just tested with the following (with private VPC set up on us-east-1 and us-east-2):

  • sky launch -c test-hash-2 --num-nodes 2 -i 0 echo hi: both nodes have the same launch hash and correctly autostopped.
  • sky status -r correctly gets the stopped status.
  • sky start test-hash-2 correctly restarts the cluster without leakage.
  • sky spot launch --cloud aws --region us-east-2 echo hi: the controller is launched in us-east-1 and the spot cluster is correctly launched in us-east-2.
  • Removed the controller, then mv ~/.aws ~/.aws.bk; sky check and sky spot launch echo hi: the controller is launched on Azure and the spot cluster is correctly launched on GCP (to test that sys.executable works on Azure).
  • Removed the controller, then mv ~/.azure ~/.azure.bk; sky check and sky spot launch echo hi: the controller is launched on GCP and the spot cluster is correctly launched on GCP (to test that sys.executable works on GCP).

@concretevitamin
Collaborator Author

Thanks @Michaelvll for your thoughtful reviews!

Passed the following, and I also pushed some simple stability fixes for smoke-test issues discovered:

  • All smoke tests, without any config: bash tests/run_smoke_tests.sh (e82c4b4)
  • All smoke tests, with the above config and setup: bash tests/run_smoke_tests.sh (e82c4b4)
    • This requires setting up (1) 4 VPCs in 4 AWS US regions (2) peering connections between them
    • All passed, except for TestStorageWithCredentials::test_bucket_bulk_deletion[StoreType.GCS] being flaky (1-2 times out of ~5 times pass rate), unrelated?
  • Backward compatibility tests, without any config: bash tests/backward_comaptibility_tests.sh (commit e82c4b4)
  • Backward compatibility tests, with the above config and setup: bash tests/backward_comaptibility_tests.sh (commit e82c4b4)

Merging this.

@concretevitamin concretevitamin merged commit 8d6f6a9 into master Jan 4, 2023
@Michaelvll Michaelvll mentioned this pull request May 12, 2023