Docs: add a hint to customizing spot controller. #2753
Conversation
LGTM! Left one suggestion 🫡
disk_size: 100

The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`__.

.. note::
  These settings will not take effect if you have an existing controller (either
Shall we add a note saying that changing the config without terminating the spot controller could lead to a failure?
Related: #2703 (for context of this PR, I'm waiting for Zhanghao's opinion)
Probably should add such a warning as part of #2703, since right now it doesn't error out. Wdyt?
Now it still errors out due to sky.exceptions.ResourcesMismatchError, though arguably the error message is informative, so it should be fine to leave it to #2703 ; )
sky spot launch whoami
Task from command: whoami
Managed spot job 'sky-cmd' will be launched on (estimated):
I 11-05 10:56:23 optimizer.py:674] == Optimizer ==
I 11-05 10:56:23 optimizer.py:686] Target: minimizing cost
I 11-05 10:56:23 optimizer.py:697] Estimated cost: $0.0 / hour
I 11-05 10:56:23 optimizer.py:697]
I 11-05 10:56:23 optimizer.py:770] Considered resources (1 node):
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818] GCP n2-standard-8[Spot] 8 32 - northamerica-northeast2-a 0.04 ✔
I 11-05 10:56:23 optimizer.py:818] AWS m6i.2xlarge[Spot] 8 32 - eu-north-1c 0.12
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818]
Launching the spot job 'sky-cmd'. Proceed? [Y/n]:
Launching managed spot job 'sky-cmd' from spot controller...
Launching spot controller...
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
Requested: 1x AWS(cpus=4, disk_size=50)
Existing: 1x GCP(n2-standard-4, disk_size=50)
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-spot-controller-402b1bba
Yep, let's leave it there. I think the situation where the same cloud is used but a smaller controller is requested is more common.
3. Changing the disk_size of the spot controller to store more logs. (Default: 50GB)
1. Use a lower-cost controller (if you have a low number of concurrent spot jobs).
2. Enforcing the spot controller to run on a specific location. (Default: cheapest location)
3. Changing the maximum number of spot jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16)
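The customizations listed above map onto the controller's :code:`resources` section in the SkyPilot config. A minimal sketch of what ``~/.sky/config.yaml`` could look like, assuming the field names from the SkyPilot YAML spec (the specific values below are illustrative, not recommendations):

```yaml
# ~/.sky/config.yaml -- illustrative values only
spot:
  controller:
    resources:            # same spec as a normal SkyPilot job
      cloud: gcp          # pin the controller to a specific cloud
      region: us-central1 # ...and a specific region
      cpus: 4+            # fewer vCPUs -> lower cost, but fewer concurrent jobs
      disk_size: 100      # larger disk to store more logs (default: 50 GB)
```

Per the discussion above, these settings only take effect on a fresh controller; with an existing controller you would need to tear it down first (``sky down <controller-name>``).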
An unrelated question: where is this logic implemented? I briefly went through the source code but didn't find any logic that makes a job pending when there are too many jobs.
Oh nvm.. Just noticed that it is because our default job takes 0.5 CPUs.
I only have one service that doesn't need autoscaling, what would be the lowest controller config possible? I am not able to get anything lower than m6i.xlarge even though I changed the configuration and made sure no controller was live. Thanks
Hi @cyril94440 ! Thanks for your interest in SkyServe. Could you try the following config?

```yaml
serve:
  controller:
    resources:
      cpus: 2
```
I tried that without any success, still an m6i.large unfortunately.

Hey @cyril94440, you may have to

Thank you very much. It is working with the "serve:" directive. Is this config the lowest possible? Or can we use a t3 or t4g instance as a controller? Thank you

We would recommend having at least 2GB of memory with Intel or AMD CPUs for a controller that serves only a single service, so t3.small might be good.
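If you want to pin the controller to that exact instance type rather than letting the optimizer pick one, something like the following sketch could work, assuming :code:`instance_type` and :code:`cloud` are accepted in the controller's :code:`resources` block as in a normal SkyPilot job spec (unverified here; t3.small is AWS's 2 vCPU / 2 GB burstable type):

```yaml
# ~/.sky/config.yaml -- hypothetical single-service setup
serve:
  controller:
    resources:
      cloud: aws
      instance_type: t3.small  # 2 vCPUs, 2 GB memory
```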
This has come up quite a few times, so adding a hint.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh