[Core] Support Multiple Resources #2498
@infwinston We might need to think about the UI.
This is awesome @MaoZiming. Tried it out real quick, some UX notes:
» sky launch --use-spot --down t-multi.yaml
Task from YAML spec: t-multi.yaml
Task(run=<empty>)
resources: {<Cloud>([Spot], {'A100-80GB': 1}), <Cloud>([Spot], {'A100-40GB': 1})}
I 08-31 10:42:49 optimizer.py:129] Using user-specified resource list.
I 08-31 10:42:49 optimizer.py:1135] No resource satisfying <Cloud>([Spot], {'A100-40GB': 1}) on [AWS, Azure, GCP, Lambda, OCI, SCP].
I 08-31 10:42:49 optimizer.py:1135] No resource satisfying <Cloud>([Spot], {'A100-40GB': 1}) on [AWS, Azure, GCP, Lambda, OCI, SCP].
I 08-31 10:42:49 optimizer.py:703] == Optimizer ==
I 08-31 10:42:49 optimizer.py:715] Target: minimizing cost
I 08-31 10:42:49 optimizer.py:726] Estimated cost: $1.9 / hour
I 08-31 10:42:49 optimizer.py:726]
I 08-31 10:42:49 optimizer.py:800] Considered resources (1 node):
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855] GCP a2-ultragpu-1g[Spot] 12 170 A100-80GB:1 europe-west4-a 1.86 ✔
I 08-31 10:42:49 optimizer.py:855] ------------------------------------------------------------------------------------------------------
I 08-31 10:42:49 optimizer.py:855]
Launching a new cluster 'sky-957c-zongheng'. Proceed? [Y/n]:
I intentionally used an invalid GPU name and it worked out well by skipping it!
resources:
  accelerators: ['A100-40GB:1', 'A100-80GB']
to bold/colored,
@concretevitamin Thanks for the suggestions!
@infwinston Hey, this should be working with the spot controller. Could you review the code when you get some time? Thanks!
Thanks for the update @MaoZiming! The code looks mostly good to me! Once we have run the smoke tests and some manual tests, I think it should be good to go in.
}, {
    'type': 'array',
    'items': {
        'type': 'string',
    }
I see. In that case, which of the schemas here represents {V100:1, A100:1}?
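To make the question concrete, here is a minimal Python sketch of how the three shapes of the accelerators field could be told apart once the YAML has been parsed. The helper name parse_accelerators and the 'single' / 'ordered' / 'unordered' labels are hypothetical and not taken from this PR. It assumes the YAML loader turns the brace form into a mapping, either with empty values ({V100:1, A100:1}) or with counts as values ({V100: 1, A100: 1}).

```python
from typing import Any, List, Tuple


def parse_accelerators(value: Any) -> Tuple[str, List[str]]:
    """Classify a parsed `accelerators` field (hypothetical helper).

    Returns (kind, candidates), where kind is 'single', 'ordered',
    or 'unordered'.
    """
    if isinstance(value, str):
        # Plain string, e.g. 'A100:8': exactly one accelerator spec.
        return 'single', [value]
    if isinstance(value, list):
        # Sequence, e.g. ['A100-40GB:1', 'A100-80GB']: order is kept
        # because it expresses the user's preference.
        return 'ordered', [str(item) for item in value]
    if isinstance(value, dict):
        # Mapping, e.g. {'V100:1': None} or {'V100': 1}: no preference
        # among the keys, so sort only to get a deterministic result.
        candidates = [
            str(key) if count is None else f'{key}:{count}'
            for key, count in value.items()
        ]
        return 'unordered', sorted(candidates)
    raise TypeError(f'Unsupported accelerators value: {value!r}')
```

Under these assumptions, the array-of-strings schema above would cover only the 'ordered' branch; the brace form would need an object-typed alternative.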
Thanks for the update @MaoZiming! I tried the following, but an error occurred:
resources:
  any_of:
    - cloud: aws
      region: us-east-1
      accelerators: A100:8
    - cloud: gcp
      accelerators: T4:4
    - cloud: aws
run:
  echo hi
Running sky launch -c test test.yaml twice:
sky/backends/cloud_vm_ray_backend.py", line 4347, in _check_existing_cluster
assert len(task.resources) == 1
AssertionError
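The failure mode can be illustrated with a small sketch: on the second launch the task carries three candidate resources, so any check that assumes exactly one will trip. The code below is not the fix that was pushed to this PR. It is a hypothetical alternative in which the existing cluster is compared against each candidate in turn; the Resources alias, the matches and pick_for_existing_cluster helpers, and the example cluster fields are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a resources spec: field name -> value.
Resources = Dict[str, str]


def matches(candidate: Resources, existing: Resources) -> bool:
    """True if every field the candidate sets agrees with the cluster."""
    return all(existing.get(key) == val for key, val in candidate.items())


def pick_for_existing_cluster(candidates: List[Resources],
                              existing: Resources) -> Optional[Resources]:
    """Return the first candidate compatible with the cluster, or None."""
    for candidate in candidates:
        if matches(candidate, existing):
            return candidate
    return None


# The three any_of entries from the YAML above.
candidates = [
    {'cloud': 'aws', 'region': 'us-east-1', 'accelerators': 'A100:8'},
    {'cloud': 'gcp', 'accelerators': 'T4:4'},
    {'cloud': 'aws'},
]
# Made-up description of the cluster created by the first launch.
existing = {'cloud': 'aws', 'region': 'us-west-2'}

chosen = pick_for_existing_cluster(candidates, existing)
```

Here len(candidates) is 3, which is why an assertion of the form len(task.resources) == 1 fails, while the per-candidate comparison still finds the bare cloud: aws entry.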
d56cb9a to 7a8bee4
@Michaelvll Thanks! Pushed a fix.
Thanks for the great effort @MaoZiming! The code LGTM now. Once it has passed the smoke test and any additional manual tests, it should be good to go in. : )
Thanks for the review! Addressed the comments.
…o multi-acc-rebase
This partially builds upon an old PR #1389.
The user can specify an ordered preference list or an unordered restriction set for accelerators.
Ordered preference list. The user prefers A100-40GB:1 over V100:1, etc. The optimizer will follow the user-specified order in picking the accelerators.
Unordered restriction set. The user specifies multiple accelerators for the optimizer to choose from, with no preference among them.
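The difference between the two modes can be sketched in a few lines of Python. This is an illustration only, not the optimizer code from this PR: pick_ordered and pick_unordered are hypothetical helpers, the prices table is made up, and the sketch assumes that an unordered set is resolved by lowest hourly cost, in line with the "minimizing cost" target shown in the optimizer log. An accelerator missing from the table stands for one that no cloud can satisfy and is skipped.

```python
from typing import Dict, Iterable, List, Optional


def pick_ordered(preferences: List[str],
                 prices: Dict[str, float]) -> Optional[str]:
    """Return the first available accelerator in the user's order."""
    for acc in preferences:
        if acc in prices:  # Unavailable entries are skipped.
            return acc
    return None


def pick_unordered(choices: Iterable[str],
                   prices: Dict[str, float]) -> Optional[str]:
    """Return the cheapest available accelerator among the choices."""
    available = [acc for acc in choices if acc in prices]
    if not available:
        return None
    return min(available, key=lambda acc: prices[acc])


# Made-up hourly prices; 'A100-40GB:1' is deliberately absent.
prices = {'A100-80GB:1': 1.86, 'V100:1': 0.90}

ordered_choice = pick_ordered(['A100-40GB:1', 'A100-80GB:1', 'V100:1'], prices)
unordered_choice = pick_unordered({'A100-80GB:1', 'V100:1'}, prices)
```

With these numbers the ordered list skips the unavailable A100-40GB:1 and settles on A100-80GB:1, even though V100:1 is cheaper, whereas the unordered set yields V100:1.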
Tested (run the relevant ones):
bash format.sh
sky launch examples/multi_resources.yaml
sky launch examples/multi_accelerators.yaml
pytest tests/test_smoke.py::test_multiple_resources_unordered
pytest tests/test_smoke.py::test_multiple_resources_ordered
pytest tests/test_smoke.py