
[Provisioner] New provisioner with AWS support #1702

Merged

134 commits merged into master from new_provisioner on Sep 29, 2023
Conversation

@suquark (Collaborator) commented Feb 18, 2023

This PR is a major rewrite of SkyPilot's provisioning layer. Here are the benefits, summarized:

  • Simplifies the provisioner and eliminates a large amount of tech debt inherited from the original Ray-based implementation
  • 100% speed improvement for multi-node provisioning
  • 20x (or even more) speed improvement when failing over across regions (due to resource limits) with multiple nodes
  • Easy implementation of provisioners for new clouds, with a clear API
  • Bridges the gap between the provisioner and other tools like Terraform/Kubernetes via CRUD-compatible APIs
  • Minimal dependencies (only the cloud SDK)
  • Provisions new clusters with minimal memory & CPU; can run on a cheap instance type or in a Lambda function
  • Full control over cloud config for easy feature additions and bug fixes
  • Makes it easier to add new features such as customized permissions, EFS support, customized AMIs/images, and more
  • Easy performance profiling
  • Supports ARM and other architectures
  • Separates resource allocation (instance creation) from dependency setup (uploading sky wheels, installing Ray, etc.)
  • Faster cluster status refreshing
  • Faster zone fail-over
  • (and more)
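For readers unfamiliar with the design, the CRUD-compatible provisioner API mentioned above might look roughly like the following minimal sketch. All class and method names here are hypothetical, for illustration only — the actual interface lives in sky/provision/ and may differ:

```python
# Hypothetical sketch of a CRUD-style, per-cloud provisioner interface.
# Names are illustrative, not SkyPilot's actual API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceConfig:
    """Cloud-agnostic description of the instances to create."""
    cluster_name: str
    region: str
    instance_type: str
    count: int
    tags: Dict[str, str] = field(default_factory=dict)


class CloudProvisioner:
    """Each cloud implements these CRUD-style operations."""

    def create_instances(self, config: InstanceConfig) -> List[str]:
        """Create instances; return their instance IDs."""
        raise NotImplementedError

    def query_instances(self, cluster_name: str, region: str) -> Dict[str, str]:
        """Return {instance_id: status} for the cluster (the 'read' op)."""
        raise NotImplementedError

    def stop_instances(self, cluster_name: str, region: str) -> None:
        raise NotImplementedError

    def terminate_instances(self, cluster_name: str, region: str) -> None:
        raise NotImplementedError
```

Because each operation is idempotent and keyed on the cluster name, a failed provision can be retried or torn down without consulting Ray's autoscaler state, which is what enables the faster fail-over described above.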

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@concretevitamin (Collaborator) commented:

(Was planning to try this out) I have a stopped controller in my cluster table and no other clusters. After switching to this branch, status shows

...

  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/backends/backend_utils.py", line 2188, in get_clusters
    records = global_user_state.get_clusters()
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/global_user_state.py", line 533, in get_clusters
    'handle': pickle.loads(handle),
AttributeError: Can't get attribute 'CloudVmRayResourceHandle' on <module 'sky.backends.cloud_vm_ray_backend' from '/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/backends/cloud_vm_ray_backend.py'>

Is this expected?
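For context on the AttributeError: pickle records the defining module and class name of an object, so unpickling a stored cluster handle fails as soon as the class no longer exists under that name in the module. A minimal, self-contained reproduction of this failure mode (the module and class names here are made up):

```python
# Minimal reproduction of the unpickling failure above: pickle stores the
# defining module and class name, so loading a pickle whose class no longer
# exists under that name raises AttributeError.
# 'fake_backend' and 'ResourceHandle' are made-up names for illustration.
import pickle
import sys
import types

# "Old" version of a module defines a class; an instance gets pickled
# (e.g. into the clusters table in global_user_state).
old = types.ModuleType('fake_backend')
old.ResourceHandle = type('ResourceHandle', (), {})
old.ResourceHandle.__module__ = 'fake_backend'
sys.modules['fake_backend'] = old
blob = pickle.dumps(old.ResourceHandle())

# "New" version of the module no longer defines that class name.
sys.modules['fake_backend'] = types.ModuleType('fake_backend')
try:
    pickle.loads(blob)
except AttributeError as exc:
    print(exc)  # Can't get attribute 'ResourceHandle' on <module 'fake_backend'>
```

A common mitigation (assumption on my part, not necessarily what this PR does) is to keep the old name as an alias of the new class, or to remap names in a custom Unpickler.find_class.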

@suquark force-pushed the new_provisioner branch 6 times, most recently from 4265d4d to 7ffc227 (March 4, 2023 23:09)
@suquark (Collaborator, Author) commented Mar 5, 2023

Although there are still some details to fix (mostly mild UX issues), it is OK to start reviewing now.

@suquark (Collaborator, Author) commented Mar 5, 2023

I rebased it a few times to sync with the latest upstream.

@suquark suquark added the P0 label Mar 5, 2023
@concretevitamin (Collaborator) left a comment:

This is super awesome. Just started trying it out. Sorry, I can't finish in one pass, so I'm pasting some UX comments first.

What I've tried & confirmed to work:

sky launch -c dbg --region us-east-1 --use-spot --cloud aws -t t3.micro
sky launch -c dbg2 --cloud aws -t t3.micro

# Autodown works
sky autostop -i0 --down 'dbg*'

# Another region works: us-east-2
sky launch -c dbg  --use-spot --cloud aws --cpus 3+

# Use a GCP controller to launch a spot node on EC2
sky spot launch --cloud aws --cpus 2 echo hi

# private VPC (3 lines in config)
sky launch -c dbg-private-vpc  --use-spot --cloud aws -t t3.micro
# logging in works:
ssh dbg-private-vpc
# terminate on console and let's try refresh: works
sky status -r

@Michaelvll (Collaborator) left a comment:

This is fantastic @suquark! I am still reading the PR, but I think it would be good to leave my draft comments first. A backward compatibility problem I hit:

  1. sky launch -c min -t t3.micro with the master branch
  2. sky launch -c min with the current branch. The following error occurs:
sky launch -c min
Running task on cluster min...
I 03-09 04:47:42 cloud_vm_ray_backend.py:1153] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-03-09-04-47-40-498429/provision.log
I 03-09 04:47:42 cloud_vm_backend.py:51] Launching on AWS us-east-1 (us-east-1a)
I 03-09 04:47:47 cloud_vm_backend.py:111] Successfully provisioned or found existing VM.
E 03-09 04:47:48 cloud_vm_backend.py:346] Post provision setup of cluster min failed.
E 03-09 04:47:48 cloud_vm_backend.py:346] Traceback (most recent call last):
E 03-09 04:47:48 cloud_vm_backend.py:346]   File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 338, in post_provision_setup
E 03-09 04:47:48 cloud_vm_backend.py:346]     return _post_provision_setup(cloud_name,
E 03-09 04:47:48 cloud_vm_backend.py:346]   File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 217, in _post_provision_setup
E 03-09 04:47:48 cloud_vm_backend.py:346]     raise RuntimeError(f'Provision failed for cluster "{cluster_name}". '
E 03-09 04:47:48 cloud_vm_backend.py:346] RuntimeError: Provision failed for cluster "min". Could not find any head instance.
Clusters
NAME  LAUNCHED        RESOURCES         STATUS  AUTOSTOP  COMMAND
min   a few secs ago  1x AWS(t3.micro)  INIT    -         sky launch -c min

Traceback (most recent call last):
  File "/opt/conda/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/cli.py", line 1024, in invoke
    return super().invoke(ctx)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/cli.py", line 1257, in launch
    _launch_with_confirm(task,
  File "/home/gcpuser/skypilot/sky/cli.py", line 753, in _launch_with_confirm
    sky.launch(
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/execution.py", line 421, in launch
    _execute(
  File "/home/gcpuser/skypilot/sky/execution.py", line 264, in _execute
    handle = backend.provision(task,
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/backends/backend.py", line 56, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2314, in _provision
    cluster_metadata = cloud_vm_backend.post_provision_setup(
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 338, in post_provision_setup
    return _post_provision_setup(cloud_name,
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 217, in _post_provision_setup
    raise RuntimeError(f'Provision failed for cluster "{cluster_name}". '
RuntimeError: Provision failed for cluster "min". Could not find any head instance.
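For what it's worth, a "Could not find any head instance" error typically comes from locating the head node by instance tags; if clusters launched from master tag the head node under a different scheme than the new provisioner expects, the lookup comes up empty. A rough illustration with boto3-style instance dicts (the tag keys here are hypothetical):

```python
# Rough illustration of head-node lookup by EC2 tags. Tag keys are
# hypothetical; if instances launched by the old code path use a different
# tagging scheme, the lookup returns None and provisioning fails with
# "Could not find any head instance".
from typing import Dict, List, Optional


def find_head_instance(instances: List[Dict],
                       cluster_name: str) -> Optional[Dict]:
    """instances: boto3-style dicts with a 'Tags' list of Key/Value pairs."""
    for inst in instances:
        tags = {t['Key']: t['Value'] for t in inst.get('Tags', [])}
        if (tags.get('ray-cluster-name') == cluster_name and
                tags.get('ray-node-type') == 'head'):
            return inst
    return None


# An "old-style" instance tagged under a different key for the node role:
old_style = [{
    'InstanceId': 'i-0abc',
    'Tags': [{'Key': 'ray-cluster-name', 'Value': 'min'},
             {'Key': 'ray-node-kind', 'Value': 'head'}],  # mismatched key
}]
print(find_head_instance(old_style, 'min'))  # None: no head found
```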

@Michaelvll (Collaborator) left a comment:

Finished a pass over the files, except for the ones in provision/. Sending these comments first.

@Michaelvll (Collaborator) left a comment:

Finished a pass! The PR looks pretty good to me generally.

@Michaelvll Michaelvll self-requested a review March 14, 2023 20:44
@suquark (Collaborator, Author) commented Mar 16, 2023

Sorry, I have to squash and rebase this PR due to too many conflicts with upstream. This should not affect the reviews so far, as most of the comments have already been addressed.

@suquark force-pushed the new_provisioner branch 2 times, most recently from d52e7fa to c9cc769 (March 17, 2023 00:45)
@suquark (Collaborator, Author) commented Mar 21, 2023

@Michaelvll I have addressed all the comments. Feel free to take another look. Thanks.

@concretevitamin concretevitamin added this to the 0.3 milestone Mar 22, 2023
@suquark (Collaborator, Author) commented Mar 22, 2023

@concretevitamin there is a general issue we need to make a decision on: should the new provisioner imitate Ray?

By imitating Ray, I mean preserving most of Ray's behaviors even when they are not useful for Sky -- for example, adding the "ray" prefix to cluster names, using "ray-node-type: head" to mark Ray head nodes, and so on.

I ask because I see changes that gradually break the connection with Ray: for example, we now use our own security group on AWS, with a "skypilot" prefix instead of the original "ray" prefix. Also, since we are no longer using Ray as the provisioner, I am a bit worried that the Ray name tags would confuse people.

@concretevitamin (Collaborator) commented Mar 25, 2023

Tested some more. The back-compat tests below all passed for me (before we finally merge the PR, we should compile all these tests from the various PR comments and rerun them) -- which is awesome:

  • back compat: An old autostopped cluster with a custom tag.
# in master
sky launch --cloud aws --cpus 2 -i0 -c dbg-autopped-custom-tag
# wait till autostopped
# switch to this branch
sky status -r  # shows stopped
sky launch --cloud aws --cpus 2 -i0 -c dbg-autopped-custom-tag  # restarted; tags intact
# observed it got stopped
sky status -r  # shows stopped
sky down dbg-autopped-custom-tag  # works
  • back compat: an old autostopped spot controller. Check sky spot queue -r restarts it and displays prior jobs.

  • back compat: An old running spot controller with running AWS spot jobs

    • can launch a new AWS job
      • sky spot launch --cloud aws sleep 1000 -n launched-from-new-provisioner-branch
    • can cancel old jobs
      • sky spot cancel -ay
  • Custom tags in aws-ray.yml.j2 work

    • modify the j2 to add some tag, e.g.,
diff --git a/sky/templates/aws-ray.yml.j2 b/sky/templates/aws-ray.yml.j2
index 1859aba9..05cbeaf6 100644
--- a/sky/templates/aws-ray.yml.j2
+++ b/sky/templates/aws-ray.yml.j2
@@ -63,6 +63,8 @@ available_node_types:
           Tags:
             - Key: skypilot-user
               Value: {{ user }}
+            - Key: my-custom-tag
+              Value: my-custom-value
 {% if num_nodes > 1 %}
   ray.worker.default:
     min_workers: {{num_nodes - 1}}
  • sky launch --cloud aws --cpus 2 -c dbg-custom-tag-launched-from-new-provisioner -i0

Potential issue

  • The VPC test in [Provisioner] New provisioner with AWS support #1702 (review) doesn't seem to work now

    • set up a private VPC
    • set up local ~/.sky/config.yaml
    • the sky launch command in the comment above got stuck in ⠸ Waiting SSH connection for dbg-private-vpc ...
    • however, I can do this to log in ssh -i ~/.ssh/sky-key -J ec2-user@<my jump host public ip> ubuntu@<my node's private ip>
    • This relates to the debugging comment above; also, maybe print the exact SSH command used to probe in this log file, so users can copy-paste and try it.
  • I ran sky launch --cloud aws --zone us-east-1d -c dbg --use-spot --cpus 2+ -i10 --down and got

...
I 03-29 20:40:38 provision_utils.py:58] Launching on AWS us-east-1 (us-east-1d)
E 03-29 20:40:39 provision_utils.py:74] Failed to bootstrap configurations for "dbg".
E 03-29 20:40:39 provision_utils.py:156] *** Failed provisioning the cluster (dbg). ***
E 03-29 20:40:39 provision_utils.py:166] *** Terminating the failed cluster. ***
W 03-29 20:40:40 cloud_vm_ray_backend.py:1803] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in us-east-1d. Try changing resource requirements or u
se another zone.

The 'Failed to bootstrap configurations for "dbg".' line doesn't give enough info to know why. Also, normal users will not know what "bootstrap configs" means.

I had to look inside the provision.log to see

...
RuntimeError: No usable subnets matching availability zone ['us-east-1d'] found. Choose a different availability zone or try manually creating an instance in your specified region to populate the list of subnets and trying this again. If you have set `use_internal_ips`, check that this zone has a subnet that (1) has the substring "private" in its name tag and (2) does not assign public IPs (`map_public_ip_on_launch` is False).

Can we directly print such a bootstrap-time error outside, replacing the current message? Note that I triggered this because I have ~/.sky/config.yaml with vpc_name, use_internal_ips, etc. set.
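One common way to surface such an error is plain exception chaining: catch the bootstrap failure and include the root-cause message in the user-facing error. A minimal sketch of the idea — the function names below are illustrative, not SkyPilot's actual API:

```python
# Sketch of surfacing the bootstrap-time root cause to the user instead of
# a bare "Failed to bootstrap configurations" line. Function names are
# illustrative, not SkyPilot's actual API.

def bootstrap_instances(config: dict) -> None:
    # Stand-in for the real AWS bootstrap step that raised in provision.log.
    raise RuntimeError(
        "No usable subnets matching availability zone ['us-east-1d'] found.")


def provision(config: dict) -> None:
    try:
        bootstrap_instances(config)
    except Exception as exc:
        # Include the original message in the user-facing error;
        # 'raise ... from exc' keeps the full traceback for the logs.
        raise RuntimeError(
            f'Failed to bootstrap configurations for '
            f'"{config["cluster_name"]}": {exc}') from exc


try:
    provision({'cluster_name': 'dbg'})
except RuntimeError as exc:
    print(exc)  # now includes the root-cause subnet message
```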

A collaborator commented:

Let's keep the providers in case they are required for old AWS clusters. We can remove them in a future PR.

A collaborator commented:

There are some other problems with getting rid of the ray dependency locally for AWS. I am going to add the ray dependency back for AWS in this PR first, and have another PR (#2625) to fix the dependency issue. Wdyt?

@suquark (Collaborator, Author) commented Sep 29, 2023

These changes LGTM. Thank you @Michaelvll !

@concretevitamin (Collaborator) commented Sep 29, 2023

ece8564 smoke tests all passed with private VPC (1 region), except for some expected failures:

pytest tests/test_smoke.py --aws
FAILED tests/test_smoke.py::test_docker_preinstalled_package - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6...
FAILED tests/test_smoke.py::test_job_queue_with_docker - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000...
FAILED tests/test_smoke.py::test_aws_http_server_with_custom_ports - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xj...
FAILED tests/test_smoke.py::test_aws_zero_quota_failover - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr66000...
FAILED tests/test_smoke.py::test_spot_pipeline_recovery_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr66...

@Michaelvll Michaelvll merged commit 3d46396 into master Sep 29, 2023
18 checks passed
@Michaelvll Michaelvll deleted the new_provisioner branch September 29, 2023 23:49
@concretevitamin (Collaborator) commented:

Congratulations on shipping this major component @suquark @Michaelvll! 🚀

@suquark (Collaborator, Author) commented Sep 30, 2023

This is great work! Thanks a lot for the help, @concretevitamin @Michaelvll!

jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
* new provisioner with AWS support

fix sky status refresh

update setup

update metadata

per-instance logs

improve ux

update comments

update logging

reduce retries of provisioning to match the original number

fix Ray compat issue

update the config with the latest changes

sync events with the latest changes

sync with upstream

resolve conflict

* cleanup unused functions

* lint

* update

* fix

* fix

* Fix network configuration and authentication

* Fix creation cluster name

* fix logging

* format

* format

* slight improvement for logging

* minor fix logging

* fix autostop

* fix autostop

* Fix ray ports used

* fix ray start

* Fix logging for ray start and skylet

* fix ports for ray dashboard

* add head / worker in name

* More logging information for launching

* fix cluster name in progress bar

* Fix ssh

* format

* grammar

* Fix message

* fix ssh_proxy_commands

* fix merge error

* Get rid of ray logging

* use info

* change logger to use contextmanager

* format

* [major] remove dependency for ray on aws

* minor fix

* allow warning to be printed

* adopt changes

* rename

* Better spinner

* format

* update logging

* rich_utils in provision_utils

* Fix instance log path

* Log head node to provision.log

* remove unused var

* fix ssh method

* Fix ray for worker node

* Fix ray cluster on worker

* fix ray cluster

* Fix logging and ip fetching

* show cluster name

* fix wait after cluster launch

* Add logging

* Add continue

* Change back to sleep before wait

* use sleep 1 instead

* clean

* Avoid check skylet twice

* skip docker

* remove get_feasible_ips

* fix comments

* minor fix for logs

* minor fix

* Fix the ip fetching logic

* Add quote for `cluster_name!r`

* fix return

* fix test

* fix comments

* fix comments

* fix comments

* Smoke tests changes to use a private VPC region.

* refactor get_ips

* minor debug info fix

* fix

* UX fix

* Fix proxy command to be None

* Address comments

* format

* Fix quote

* update logging

* compatibility for python 3.7

* format

* format

* change launched to provisioned

* clean up ray yaml

* format

* [Provisioner] Add docker support back (skypilot-org#2507)

* Add docker support back

* Fix

* Works!

* format

* Add docker back to the supported features

* fix docker_cmd

* Fix docker cmd

* fix DockerLoginConfig

* Move docker login config

* Fix backward compatibility for docker

* Fix docker login

* Fix docker login config new line issue

* Fix string process in ray yaml

* Update sky/provision/docker_utils.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/provision/instance_setup.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/provision/docker_utils.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/provision/instance_setup.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* address comments

* format

* format

* Add comment

* Address comments

* format

---------

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Wording for SSH connection

* Fix ray status check for backward compatibility

* remnant

* stronger backward compatibility tests

* remove unused tag

* Revert "remove unused tag"

This reverts commit a12df8e.

* Add default value for `docker_user`

* fix botocore config

* lint

* Fix mypy

* remove unused variable

* minor fix for logging

* expose bootstrap error

* format

* minor change for `wait_instances` API

* lint

* use `command_runner.ssh_options_list`

* fix ssh command

* reword

* Dim error for bootstrapping

* fix user known file

* Update sky/setup_files/setup.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add ray back, will fix the dependency issue skypilot-org#2625

* move dependency to local ray

* Address renaming comments

* renamed to provisioner

* refactoring for logging to reduce confusion

* format

* rename back to meta data for metadata utils

* format

* move provisioner to `provision/`

* Do not propagate to provisioner logger

* Minor changes

* Fix color for error of provisioner

* Remove dimmer

* Add back missing handler for `provision_logger`

* add comments

* Print error message for the failed ssh command

* Make the error yellow

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>