Docs: add a hint to customizing spot controller. #2753

Merged 1 commit on Nov 6, 2023

concretevitamin
Collaborator

This has come up quite a few times so adding a hint.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below): rendered locally
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Collaborator

@cblmemo left a comment

LGTM! Left one suggestion 🫡

disk_size: 100

The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`__.

.. note::
  These settings will not take effect if you have an existing controller (either
Collaborator

Shall we add a note saying that changing the config without terminating the spot controller could lead to a failure?
Related: #2703 (for context of this PR, I'm waiting for Zhanghao's opinion)

Collaborator Author

Probably should add such a warning as part of #2703, since right now it doesn't error out. Wdyt?

Collaborator

Now it still errors out due to sky.exceptions.ResourcesMismatchError, though arguably the error message is informative, so it should be fine to leave to #2703 ; )

sky spot launch whoami
Task from command: whoami
Managed spot job 'sky-cmd' will be launched on (estimated):
I 11-05 10:56:23 optimizer.py:674] == Optimizer ==
I 11-05 10:56:23 optimizer.py:686] Target: minimizing cost
I 11-05 10:56:23 optimizer.py:697] Estimated cost: $0.0 / hour
I 11-05 10:56:23 optimizer.py:697] 
I 11-05 10:56:23 optimizer.py:770] Considered resources (1 node):
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818]  CLOUD   INSTANCE              vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                 COST ($)   CHOSEN   
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818]  GCP     n2-standard-8[Spot]   8       32        -              northamerica-northeast2-a   0.04          ✔     
I 11-05 10:56:23 optimizer.py:818]  AWS     m6i.2xlarge[Spot]     8       32        -              eu-north-1c                 0.12                
I 11-05 10:56:23 optimizer.py:818] ----------------------------------------------------------------------------------------------------------------
I 11-05 10:56:23 optimizer.py:818] 
Launching the spot job 'sky-cmd'. Proceed? [Y/n]: 
Launching managed spot job 'sky-cmd' from spot controller...
Launching spot controller...

sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    1x AWS(cpus=4, disk_size=50) 
  Existing:     1x GCP(n2-standard-4, disk_size=50)
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-spot-controller-402b1bba

Collaborator Author

Yep, let's leave it there. I think the case of the same cloud but a smaller controller is more common.

3. Changing the disk_size of the spot controller to store more logs. (Default: 50GB)
1. Use a lower-cost controller (if you have a low number of concurrent spot jobs).
2. Enforcing the spot controller to run on a specific location. (Default: cheapest location)
3. Changing the maximum number of spot jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16)
Collaborator

An unrelated question: where is this logic implemented? I briefly went through the source code but didn't find any logic that keeps a job pending when there are too many jobs.

Collaborator

Oh, never mind. I just noticed that it is because our default job takes 0.5 CPUs.
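As a rough illustration of the arithmetic in this thread: if each managed job reserves 0.5 CPUs on the controller, the number of jobs that fit is the controller's vCPU count divided by 0.5, which is 2x the vCPUs. The sketch below is not SkyPilot code. `max_concurrent_jobs` and `CPUS_PER_JOB` are hypothetical names, and the 8-vCPU default controller is only inferred from the documented default of 16 jobs.

```python
# Hypothetical sketch, not SkyPilot's implementation.
# Assumption from the thread: each managed job takes 0.5 CPUs on the controller.
CPUS_PER_JOB = 0.5


def max_concurrent_jobs(controller_vcpus: int,
                        cpus_per_job: float = CPUS_PER_JOB) -> int:
    """Return how many jobs fit on a controller with the given vCPU count."""
    if controller_vcpus <= 0 or cpus_per_job <= 0:
        raise ValueError("vCPU count and per-job CPUs must be positive")
    return int(controller_vcpus / cpus_per_job)


if __name__ == "__main__":
    # 8 vCPUs is inferred from "Default: 16" and the 2x rule.
    for vcpus in (2, 4, 8):
        print(f"{vcpus} vCPUs -> {max_concurrent_jobs(vcpus)} concurrent jobs")
```

Under these assumptions, halving the controller's vCPUs also halves the number of spot jobs that can run concurrently, which is the trade-off behind the "lower-cost controller" hint.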

@concretevitamin merged commit d80c47b into master on Nov 6, 2023
18 checks passed
@concretevitamin deleted the ctrl-docs branch on November 6, 2023 at 02:53
@cyril94440

I only have one service that doesn't need autoscaling, what would be the lowest controller config possible?

I am not able to get anything lower than m6i.xlarge even though I changed the configuration and made sure no controller was live.

Thanks

@cblmemo
Collaborator

cblmemo commented Feb 22, 2024

Hi @cyril94440! Thanks for your interest in SkyServe. Could you try the following config?

serve:
  controller:
    resources:
      cpus: 2

@cyril94440

I tried that without any success, still an m6i.large unfortunately

@Michaelvll
Collaborator

Michaelvll commented Feb 23, 2024

Hey @cyril94440, you may have to sky down sky-serve-controller-<hash> manually, before the config will take effect : )

@cyril94440

Thank you very much. It is working with the "serve:" directive.

Is this config the lowest possible? Or can we use a t3 or t4g instance as a controller?

Thank you

@Michaelvll
Collaborator

We would recommend at least 2GB of memory with Intel or AMD CPUs for a controller that serves only a single service, so t3.small might be good.
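To summarize the configs discussed above, here is a minimal sketch that renders the `controller.resources` block for a given top-level section. `render_controller_config` is a hypothetical helper and not part of SkyPilot. The `serve` section and `cpus: 2` come from this thread; that the spot controller uses a parallel `spot` section is an assumption.

```python
# Hypothetical helper for illustration only; not a SkyPilot API.
# It renders the nested YAML block quoted in this thread as plain text.
ALLOWED_SECTIONS = ("spot", "serve")  # "spot" is assumed by analogy with "serve"


def render_controller_config(section: str, resources: dict) -> str:
    """Render `<section>.controller.resources` as YAML text."""
    if section not in ALLOWED_SECTIONS:
        raise ValueError(f"section must be one of {ALLOWED_SECTIONS}")
    lines = [f"{section}:", "  controller:", "    resources:"]
    for key, value in resources.items():
        lines.append(f"      {key}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The SkyServe controller config suggested in the thread.
    print(render_controller_config("serve", {"cpus": 2}))
```

The rendered text would presumably go into the SkyPilot config file (typically `~/.sky/config.yaml`). As noted in the thread, an existing controller has to be torn down with `sky down` before a changed config takes effect.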
