Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SkyServe] deprecate old serve autoscaler policy #2868

Merged
merged 10 commits into from
Dec 16, 2023
Merged

Conversation

MaoZiming
Copy link
Collaborator

@MaoZiming MaoZiming commented Dec 14, 2023

After discussing with @cblmemo
This PR replaces the old Sky Serve autoscaler policy with the new counter-based version (as in the spot policy)

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual Test: sky serve up tests/test_yamls/test_serve_autoscaler.yaml followed by python3 tests/test_serve_autoscaler.py
  • All smoke tests: pytest tests/test_smoke.py

@MaoZiming MaoZiming changed the title [Serve] deprecate old serve autoscaler policy [SkyServe] deprecate old serve autoscaler policy Dec 14, 2023
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Maybe we could migrate more implementation into our on-demand autoscaler, such as:

  • Change the AutoscalerDecision to the latest version in the spot policy;
  • Change the type of return value for evaluate_scaling (i.e., List[AutoscalerDecision]);
  • Using the same interface (target_qps instead of {upper,lower}_threshold) as spot policy;
    • Probably also migrate the _get_desired_num_replicas function.

The latest bullet point needs more discussion; cc @Michaelvll @concretevitamin for a look.

sky/serve/autoscalers.py Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
@MaoZiming
Copy link
Collaborator Author

@cblmemo we probably want to think about how much to migrate from the spot policy PR. The current PR is kinda a minimal version of adding counter-based policy

@cblmemo
Copy link
Collaborator

cblmemo commented Dec 14, 2023

Seems like AUTOSCALER_COOLDOWN_SECONDS is deprecated too

@cblmemo
Copy link
Collaborator

cblmemo commented Dec 14, 2023

@cblmemo we probably want to think about how much to migrate from the spot policy PR. The current PR is kinda a minimal version of adding counter-based policy

Yeah, we should discuss more. Agreed that we should migrate the counter-based autoscaling policy to master ASAP

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are doing minimal migration then this version of the code looks great to me 🫡

sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Show resolved Hide resolved
@MaoZiming
Copy link
Collaborator Author

AUTOSCALER_COOLDOWN_SECONDS

https://github.com/skypilot-org/skypilot/blob/serve-spot/sky/serve/constants.py#L41
it is still hard-coded

@cblmemo
Copy link
Collaborator

cblmemo commented Dec 14, 2023

AUTOSCALER_COOLDOWN_SECONDS

https://github.com/skypilot-org/skypilot/blob/serve-spot/sky/serve/constants.py#L41 it is still hard-coded

I mean all variable/constants related to cooldown should be able to remove, including this constant and self.cooldown for RequestRateAutoscaler

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I missed one thing. We should not let the user wait for 5 minutes for the bootstrap case and we need some special judgment for this case.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great to me! Left some nits, and please run some manual tests and include them in the PR description.

sky/serve/autoscalers.py Outdated Show resolved Hide resolved
sky/serve/autoscalers.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still a constant to delete IIRC, other than that it looks great to me! After some tests this PR should be ready to go 🫡

@MaoZiming
Copy link
Collaborator Author

@cblmemo added a manual test. commented out the constant

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for not making it clear.. I mean test to make sure it doesn't contain any unexpected errors. Probably we could eliminate it in the PR since it is very similar to our http server, and the interface will change soon (to target qps per replica).

examples/serve/autoscaler.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are some manual tests for the autoscaling behavior.
Are there other scripts where we can test the skyserve autoscaler?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually just use watch -n curl http://...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the yamls that are mainly for testing, we can probably have them in ./tests/test_yamls?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to test

@MaoZiming MaoZiming merged commit b1b606d into master Dec 16, 2023
19 checks passed
@MaoZiming MaoZiming deleted the master-autoscaler branch December 16, 2023 16:24
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
* deprecate old autoscaler policy

* remove redundant variable

* remove redundant line

* log upscale counter and downscale coutner

* fast autoscaling when num_replicas below min or above max

* address tian comments

* remove constant

* added test

* update script

* moved to tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants