[Provisioner] New provisioner with AWS support #1702
Conversation
Force-pushed from 7382c7c to 1590d11
Force-pushed from 415d834 to ee849d7
(Was planning to try this out) I have a stopped controller in my cluster table and no other clusters. After switching to this branch,
Is this expected?
Force-pushed from 4265d4d to 7ffc227
Although there are still some details to fix (mostly mild UX issues), it is OK to start reviewing now.
I rebased it a few times to sync with the latest upstream.
This is super awesome. Just started trying it out. Sorry, I can't finish in one pass, so I'm pasting some UX comments first.
What I've tried & confirmed to work:
sky launch -c dbg --region us-east-1 --use-spot --cloud aws -t t3.micro
sky launch -c dbg2 --cloud aws -t t3.micro
# Autodown works
sky autostop -i0 --down 'dbg*'
# Another region works: us-east-2
sky launch -c dbg --use-spot --cloud aws --cpus 3+
# Use a GCP controller to launch a spot node on EC2
sky spot launch --cloud aws --cpus 2 echo hi
# private VPC (3 lines in config)
sky launch -c dbg-private-vpc --use-spot --cloud aws -t t3.micro
# logging in works:
ssh dbg-private-vpc
# terminate on console and let's try refresh: works
sky status -r
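For context on the "private VPC (3 lines in config)" step above: SkyPilot's private-VPC setup is normally done through `~/.sky/config.yaml`. A minimal sketch (the VPC name and jump-host address are placeholders, and field availability depends on the SkyPilot version):

```yaml
# ~/.sky/config.yaml -- sketch of a private-VPC setup; values are placeholders
aws:
  vpc_name: my-private-vpc   # use only subnets in this VPC (no public IPs)
  use_internal_ips: true     # reach instances via their private IPs
  ssh_proxy_command: ssh -W %h:%p ubuntu@jump-host.example.com  # optional jump host
```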
This is fantastic @suquark! I am still reading the PR, but I think it would be good to leave my draft comments first. A backward compatibility problem I hit: launch a cluster with
sky launch -c min -t t3.micro
on the master branch, then run
sky launch -c min
on the current branch. The following error occurs:
sky launch -c min
Running task on cluster min...
I 03-09 04:47:42 cloud_vm_ray_backend.py:1153] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-03-09-04-47-40-498429/provision.log
I 03-09 04:47:42 cloud_vm_backend.py:51] Launching on AWS us-east-1 (us-east-1a)
I 03-09 04:47:47 cloud_vm_backend.py:111] Successfully provisioned or found existing VM.
E 03-09 04:47:48 cloud_vm_backend.py:346] Post provision setup of cluster min failed.
E 03-09 04:47:48 cloud_vm_backend.py:346] Traceback (most recent call last):
E 03-09 04:47:48 cloud_vm_backend.py:346] File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 338, in post_provision_setup
E 03-09 04:47:48 cloud_vm_backend.py:346] return _post_provision_setup(cloud_name,
E 03-09 04:47:48 cloud_vm_backend.py:346] File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 217, in _post_provision_setup
E 03-09 04:47:48 cloud_vm_backend.py:346] raise RuntimeError(f'Provision failed for cluster "{cluster_name}". '
E 03-09 04:47:48 cloud_vm_backend.py:346] RuntimeError: Provision failed for cluster "min". Could not find any head instance.
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
min a few secs ago 1x AWS(t3.micro) INIT - sky launch -c min
Traceback (most recent call last):
File "/opt/conda/envs/sky/bin/sky", line 8, in <module>
sys.exit(cli())
File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 220, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/cli.py", line 1024, in invoke
return super().invoke(ctx)
File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/envs/sky/lib/python3.10/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/cli.py", line 1257, in launch
_launch_with_confirm(task,
File "/home/gcpuser/skypilot/sky/cli.py", line 753, in _launch_with_confirm
sky.launch(
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/execution.py", line 421, in launch
_execute(
File "/home/gcpuser/skypilot/sky/execution.py", line 264, in _execute
handle = backend.provision(task,
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 241, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 220, in _record
return f(*args, **kwargs)
File "/home/gcpuser/skypilot/sky/backends/backend.py", line 56, in provision
return self._provision(task, to_provision, dryrun, stream_logs,
File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 2314, in _provision
cluster_metadata = cloud_vm_backend.post_provision_setup(
File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 338, in post_provision_setup
return _post_provision_setup(cloud_name,
File "/home/gcpuser/skypilot/sky/backends/cloud_vm_backend.py", line 217, in _post_provision_setup
raise RuntimeError(f'Provision failed for cluster "{cluster_name}". '
RuntimeError: Provision failed for cluster "min". Could not find any head instance.
Finished a pass over the files, except for the ones in provision/. Sending these comments first.
Finished a pass! The PR looks pretty good to me generally.
Sorry, I have to squash and rebase this PR due to too many conflicts with upstream. This should not affect the reviews so far, as most of them have been addressed.
Force-pushed from d52e7fa to c9cc769
@Michaelvll I have addressed all comments. Feel free to take another review. Thanks.
@concretevitamin there is a general issue on which I think we need to make a decision: should the new provisioner imitate Ray? By imitating Ray, I mean preserving most of Ray's behaviors even when they are not useful for Sky, for example, adding the "ray" prefix to cluster names, using "ray-node-type: head" to mark Ray head nodes, and so on. I ask because I see changes that gradually break the connection with Ray; for example, we now use our own security group on AWS, and it uses a "skypilot" prefix instead of the original "ray" prefix. Also, since we are no longer using Ray as the provisioner, I am a bit worried that the Ray name tags would confuse people.
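To make the tag question (and the "Could not find any head instance" failure above) concrete: backward compatibility likely requires recognizing both naming schemes when identifying the head node. A hypothetical sketch, where the legacy key `ray-node-type` comes from Ray's tag convention and `skypilot-node-type` is an illustrative stand-in for whatever new key the provisioner chooses:

```python
def find_head_instance(instances):
    """Return the first instance tagged as a head node, or None.

    Accepts both the legacy Ray-style tag (ray-node-type) and a
    hypothetical skypilot-prefixed equivalent, so clusters launched
    by either the old or the new provisioner are recognized.
    """
    for inst in instances:
        tags = inst.get("tags", {})
        if (tags.get("ray-node-type") == "head"
                or tags.get("skypilot-node-type") == "head"):
            return inst
    return None
```

With a lookup like this, a cluster launched on master (Ray-style tags) and one launched on this branch would both resolve to a head instance instead of raising "Could not find any head instance".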
Tested some more. The back-compat tests below all passed for me (before we finally merge the PR, we should compile all of these tests from the various PR comments and rerun them) -- which is awesome:
# in master
sky launch --cloud aws --cpus 2 -i0 -c dbg-autopped-custom-tag
# wait till autostopped
# switch to this branch
sky status -r # shows stopped
sky launch --cloud aws --cpus 2 -i0 -c dbg-autopped-custom-tag # restarted; tags intact
# observed it got stopped
sky status -r # shows stopped
sky down dbg-autopped-custom-tag # works
Potential issue
The real error was not surfaced; I had to look inside the log to find it.
Can we directly print such a bootstrap-time error outside, replacing the current message? Note that I triggered this because of a setting in my config.
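One way to surface a bootstrap-time error directly, rather than only burying it in the provision log, is to catch it at the call site and re-raise with the original cause attached. A hypothetical sketch (the wrapper and its names are illustrative, not the PR's actual code):

```python
import logging

logger = logging.getLogger("sky.provisioner")

def bootstrap_with_surfaced_errors(bootstrap_fn, cluster_name):
    """Run a bootstrap step, logging the real cause before re-raising.

    `bootstrap_fn` stands in for whatever config-bootstrapping routine
    the provisioner uses; the point is that its error message reaches
    the user instead of living only in the log file.
    """
    try:
        return bootstrap_fn()
    except Exception as e:
        logger.error('Bootstrapping of cluster %r failed: %s', cluster_name, e)
        raise RuntimeError(
            f'Failed to bootstrap cluster {cluster_name!r}: {e}') from e
```

Chaining with `raise ... from e` keeps the original traceback available while the user-facing message carries the real cause.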
Let's keep the providers in case they are required for old AWS clusters. We can remove them in a future PR.
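Keeping the old providers around implies a small dispatch layer that routes pre-existing clusters to the legacy path. A hypothetical sketch of such routing (the `provisioner` field and value names are illustrative, not the PR's actual schema):

```python
def select_provisioner(cluster_record: dict) -> str:
    """Pick a provisioning path for a cluster.

    `cluster_record` is a hypothetical row from the cluster table.
    Clusters created before the rewrite have no 'provisioner' field
    recorded, so they fall back to the legacy Ray-autoscaler providers.
    """
    return cluster_record.get("provisioner", "ray-autoscaler-legacy")
```

The design choice here is defaulting to the legacy path when the field is absent, so old clusters keep working without migration.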
There are some other problems with getting rid of the Ray dependency locally for AWS. I am going to add the Ray dependency back for AWS in this PR and open another PR #2625 to fix the dependency issue. Wdyt?
These changes LGTM. Thank you @Michaelvll!
ece8564 smoke tests all passed with private VPC (1 region), except for some expected failures:
Congratulations on shipping this major component @suquark @Michaelvll! 🚀
This is great work! Thank you a lot for the help @concretevitamin @Michaelvll!
* new provisioner with AWS support; fix sky status refresh; update setup; update metadata; per-instance logs; improve ux; update comments; update logging; reduce retries of provisioning to match the original number; fix Ray compat issue; update the config with the latest changes; sync events with the latest changes; sync with upstream; resolve conflict
* cleanup unused functions
* lint
* update
* fix
* fix
* Fix network configuration and authentication
* Fix creation cluster name
* fix logging
* format
* format
* slight improvement for logging
* minor fix logging
* fix autostop
* fix autostop
* Fix ray ports used
* fix ray start
* Fix logging for ray start and skylet
* fix ports for ray dashboard
* add head / worker in name
* More logging information for launching
* fix cluster name in progress bar
* Fix ssh
* format
* grammar
* Fix message
* fix ssh_proxy_commands
* fix merge error
* Get rid of ray logging
* use info
* change logger to use contextmanager
* format
* [major] remove dependency for ray on aws
* minor fix
* allow warning to be printed
* adopt changes
* rename
* Better spinner
* format
* update logging
* rich_utils in provision_utils
* Fix instance log path
* Log head node to provision.log
* remove unused var
* fix ssh method
* Fix ray for worker node
* Fix ray cluster on worker
* fix ray cluster
* Fix logging and ip fetching
* show cluster name
* fix wait after cluster launch
* Add logging
* Add continue
* Change back to sleep before wait
* use sleep 1 instead
* clean
* Avoid check skylet twice
* skip docker
* remove get_feasible_ips
* fix comments
* minor fix for logs
* minor fix
* Fix the ip fetching logic
* Add quote for `cluster_name!r`
* fix return
* fix test
* fix comments
* fix comments
* fix comments
* Smoke tests changes to use a private VPC region.
* refactor get_ips
* minor debug info fix
* fix
* UX fix
* Fix proxy command to be None
* Address comments
* format
* Fix quote
* update logging
* compatibility for python 3.7
* format
* format
* change launched to provisioned
* clean up ray yaml
* format
* [Provisioner] Add docker support back (skypilot-org#2507)
  * Add docker support back
  * Fix
  * Works!
  * format
  * Add docker back to the supported features
  * fix docker_cmd
  * Fix docker cmd
  * fix DockerLoginConfig
  * Move docker login config
  * Fix backward compatibility for docker
  * Fix docker login
  * Fix docker login config new line issue
  * Fix string process in ray yaml
  * Update sky/provision/docker_utils.py (Co-authored-by: Tian Xia <cblmemo@gmail.com>)
  * Update sky/provision/instance_setup.py (Co-authored-by: Tian Xia <cblmemo@gmail.com>)
  * Update sky/provision/docker_utils.py (Co-authored-by: Tian Xia <cblmemo@gmail.com>)
  * Update sky/provision/instance_setup.py (Co-authored-by: Tian Xia <cblmemo@gmail.com>)
  * address comments
  * format
  * format
  * Add comment
  * Address comments
  * format
  Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Wording for SSH connection
* Fix ray status check for backward compatibility
* remnant
* stronger backward compatibility tests
* remove unused tag
* Revert "remove unused tag" (this reverts commit a12df8e)
* Add default value for `docker_user`
* fix botocore config
* lint
* Fix mypy
* remove unused variable
* minor fix for logging
* expose bootstrap error
* format
* minor change for `wait_instances` API
* lint
* use `command_runner.ssh_options_list`
* fix ssh command
* reword
* Dim error for bootstrapping
* fix user known file
* Update sky/setup_files/setup.py (Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>)
* Add ray back, will fix the dependency issue skypilot-org#2625
* move dependency to local ray
* Address renaming comments
* renamed to provisioner
* refactoring for logging to reduce confusion
* format
* rename back to meta data for metadata utils
* format
* move provisioner to `provision/`
* Do not propagate to provisioner logger
* Minor changes
* Fix color for error of provisioner
* Remove dimmer
* Add back missing handler for `provision_logger`
* add comments
* Print error message for the failed ssh command
* Make the error yellow

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
This PR is a major rewrite of the provisioning layer of SkyPilot. Here are the benefits, summarized:
Tested (run the relevant ones):
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_compatibility_tests.sh