Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support Multiple Resources #2498

Merged
merged 128 commits into from
Nov 12, 2023
Merged

[Core] Support Multiple Resources #2498

merged 128 commits into from
Nov 12, 2023

Conversation

MaoZiming
Copy link
Collaborator

@MaoZiming MaoZiming commented Aug 31, 2023

This partially builds upon an old PR #1389.
The user can specify an ordered preference list or an unordered restriction set for accelerators.
Ordered preference list. The user prefers A100-40GB:1 over V100:1, etc. The optimizer will follow user-specified order in picking the accelerators.

accelerators: ['A100-40GB:1', 'V100:1', 'K80:1', 'T4:1']

Unordered restriction set. The user specifies multiple accelerators for accelerator to choose from, with no preference among them.

accelerators: {'A100-40GB:1', 'K80:1', 'V100:1', 'T4:1', 'T4:4'}

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • New manual tests: sky launch examples/multi_resources.yaml
  • New manual tests: sky launch examples/multi_accelerators.yaml
  • New smoke test: pytest tests/test_smoke.py::test_multiple_resources_unordered
  • New smoke test: pytest tests/test_smoke.py::test_multiple_resources_ordered
  • All smoke tests: pytest tests/test_smoke.py

@MaoZiming
Copy link
Collaborator Author

MaoZiming commented Aug 31, 2023

@infwinston We might need to think about the UI.
The optimizer will only print one chosen_resource per cloud. If the user specifies multiple accelerators, only one of them (picked by the optimizer) will be printed.

@concretevitamin
Copy link
Collaborator

This is awesome @MaoZiming. Tried it out real quick, some UX notes:

» sky launch --use-spot --down t-multi.yaml
Task from YAML spec: t-multi.yaml
Task(run=<empty>)
  resources: {<Cloud>([Spot], {'A100-80GB': 1}), <Cloud>([Spot], {'A100-40GB': 1})}
I 08-31 10:42:49 optimizer.py:129] Using user-specified resource list.
I 08-31 10:42:49 optimizer.py:1135] No resource satisfying <Cloud>([Spot], {'A100-40GB': 1}) on [AWS, Azure, GCP, Lambda, OCI, SCP].
I 08-31 10:42:49 optimizer.py:1135] No resource satisfying <Cloud>([Spot], {'A100-40GB': 1}) on [AWS, Azure, GCP, Lambda, OCI, SCP].
I 08-31 10:42:49 optimizer.py:703] == Optimizer ==
I 08-31 10:42:49 optimizer.py:715] Target: minimizing cost
I 08-31 10:42:49 optimizer.py:726] Estimated cost: $1.9 / hour
I 08-31 10:42:49 optimizer.py:726]
I 08-31 10:42:49 optimizer.py:800] Considered resources (1 node):
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855]  CLOUD   INSTANCE               vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE      COST ($)   CHOSEN
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855]  GCP     a2-ultragpu-1g[Spot]   12      170       A100-80GB:1    europe-west4-a   1.86          ✔
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855]
Launching a new cluster 'sky-957c-zongheng'. Proceed? [Y/n]:

I intentionally used an invalid GPU name and it worked out well by skipping it!

resources:
  accelerators: ['A100-40GB:1', 'A100-80GB']
  1. This set-based repr is surprising, because previously we used a set to mean "pick the best of these candidates" (see example_app.py), rather than "try these sequentially":
Task(run=<empty>)
  resources: {<Cloud>([Spot], {'A100-80GB': 1}), <Cloud>([Spot], {'A100-40GB': 1})}
  1. We may want to change this
I 08-31 10:42:49 optimizer.py:129] Using user-specified resource list. 

to bold/colored,

I 08-31 10:42:49 optimizer.py:129] Using user-specified accelerators list (will be tried in the listed order).
  1. A bit surprising to see No resource satisfying <Cloud>([Spot], {'A100-40GB': 1}) on [AWS, Azure, GCP, Lambda, OCI, SCP]. being printed twice?

@MaoZiming
Copy link
Collaborator Author

@concretevitamin Thanks for the suggestions!
I have fixed 2 and 3. For 1, we are still deciding on the interface. We initially wanted to use
['A100-40GB:1', 'A100-80GB'] to mean a list of accelerators that will be considered sequentially.

@MaoZiming
Copy link
Collaborator Author

@infwinston hey this should be working with the spot controller. When you get some time maybe could you review the code? Thanks!

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @MaoZiming! The code looks mostly good to me! Once we have done the test for the smoke test and some manual tests. I think it should be good to go in.

sky/optimizer.py Outdated Show resolved Hide resolved
sky/optimizer.py Show resolved Hide resolved
sky/task.py Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/task.py Outdated Show resolved Hide resolved
sky/utils/dag_utils.py Outdated Show resolved Hide resolved
Comment on lines +53 to +57
}, {
'type': 'array',
'items': {
'type': 'string',
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, in that case, which of the schemas here represents the {V100:1, A100:1}?

sky/utils/schemas.py Outdated Show resolved Hide resolved
@MaoZiming MaoZiming changed the title [Core] Support Multiple Resources and Regions [Core] Support Multiple Resources Nov 10, 2023
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @MaoZiming! I tried the following, but an error occurred:

resources:
  any_of:
    - cloud: aws
      region: us-east-1
      accelerators: A100:8
    - cloud: gcp
      accelerators: T4:4
    - cloud: aws

run:
  echo hi

sky launch -c test test.yaml twice:

sky/backends/cloud_vm_ray_backend.py", line 4347, in _check_existing_cluster
    assert len(task.resources) == 1
AssertionError

sky/optimizer.py Show resolved Hide resolved
sky/utils/schemas.py Outdated Show resolved Hide resolved
sky/optimizer.py Outdated Show resolved Hide resolved
@MaoZiming
Copy link
Collaborator Author

@Michaelvll Thanks! pushed a fix

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great effort @MaoZiming! The LGTM now. Once it has passed the smoke test and any additional manual tests, it should be good to go in. : )

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/optimizer.py Outdated Show resolved Hide resolved
tests/test_multi_regions.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Show resolved Hide resolved
@MaoZiming
Copy link
Collaborator Author

Thanks for the review! Addressed the comments.
The smoke tests should be good (except the ones I mentioned in the issue)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resources: make clouds take a list of clouds Support for multiple regions or zones
5 participants