[Serve] Scale down old terminal replicas #3550

Open
wants to merge 2 commits into
base: master

cblmemo
Collaborator

@cblmemo cblmemo commented May 14, 2024

The previous implementation keeps old terminal (i.e. failed) replicas forever. This PR changes the behavior so that old terminal replicas are also scaled down when an update is initiated.
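As a rough illustration of the selection rule, here is a minimal standalone sketch. It assumes each replica exposes `replica_id`, `version`, and `is_terminal`; the `ReplicaInfo` dataclass and the function name below are hypothetical stand-ins, not the actual SkyServe classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaInfo:
    """Hypothetical stand-in for replica_managers.ReplicaInfo."""
    replica_id: int
    version: int
    is_terminal: bool


def select_outdated_terminal_replicas(replica_infos: List[ReplicaInfo],
                                      latest_version: int) -> List[int]:
    """Return the IDs of terminal replicas older than latest_version."""
    return [
        info.replica_id
        for info in replica_infos
        if info.version < latest_version and info.is_terminal
    ]


# Mirrors the manual test below: replica 1 failed on version 1,
# replica 2 is still starting on version 2.
replicas = [
    ReplicaInfo(replica_id=1, version=1, is_terminal=True),
    ReplicaInfo(replica_id=2, version=2, is_terminal=False),
]
outdated = select_outdated_terminal_replicas(replicas, latest_version=2)
print(outdated)
```

Only replica 1 is selected here, since it is both terminal and on an older version.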

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
$ sky serve up tests/skyserve/failures/initial_delay.yaml -n fail-2
$ sky serve status
Services
NAME    VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT                                                                  
fail-2  -        -       NO_REPLICA  0/0       http://localhost:30072/skypilot/sky-serve-controller-402b1bba-402b/30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED  RESOURCES  STATUS                REGION  
fail-2        1   1        -         -         -          FAILED_INITIAL_DELAY  -       
$ sky serve update fail-2 tests/skyserve/failures/initial_delay.yaml
$ sky serve status
Services
NAME    VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT                                                                  
fail-2  -        -       NO_REPLICA  0/2       http://localhost:30072/skypilot/sky-serve-controller-402b1bba-402b/30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED     RESOURCES              STATUS         REGION      
fail-2        1   1        -         -            -                      SHUTTING_DOWN  -           
fail-2        2   2        -         30 secs ago  1x Kubernetes(vCPU=2)  STARTING       Kubernetes 
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for submitting the PR @cblmemo! Left several questions.

Comment on lines +321 to +329
        all_replica_ids_to_scale_down.extend(
            self._select_outdated_nonterminal_replicas_to_scale_down(
                replica_infos))
        return all_replica_ids_to_scale_down

    def _select_outdated_nonterminal_replicas_to_scale_down(
        self,
        replica_infos: List['replica_managers.ReplicaInfo']) -> List[int]:
        """Select outdated nonterminal replicas to scale down."""
Collaborator

This refactoring seems unnecessary.

Suggested change
all_replica_ids_to_scale_down.extend(
self._select_outdated_nonterminal_replicas_to_scale_down(
replica_infos))
return all_replica_ids_to_scale_down
def _select_outdated_nonterminal_replicas_to_scale_down(
self,
replica_infos: List['replica_managers.ReplicaInfo']) -> List[int]:
"""Select outdated nonterminal replicas to scale down."""

Comment on lines +319 to +320
            if info.version < self.latest_version and info.is_terminal:
                all_replica_ids_to_scale_down.append(info.replica_id)
Collaborator

IIRC, the original behavior already terminates the failed replicas. Is this only for removing the FAILED replica entries from the serve status table? If so, should we instead make the sky serve down command able to remove a FAILED entry from the table?

I feel that keeping the FAILED replicas from the previous version might still be useful.

Collaborator Author

Yes, it is only for removing those failed records, at a user's request. Agreed that adding the ability to remove a record in sky serve down is a better idea. It is implemented in #3179, and I'll take a look soon 🫡
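To make the alternative concrete, here is a rough sketch of removing a single failed record on request. It is purely illustrative and is not the implementation in #3179: the in-memory table, the `purge_replica_record` helper, and the set of statuses treated as failed are all assumptions.

```python
from typing import Dict

# Assumed set of failed statuses; the names come from this thread, but the
# real list of terminal statuses in SkyServe may differ.
FAILED_STATUSES = {'FAILED', 'FAILED_INITIAL_DELAY'}


def purge_replica_record(records: Dict[int, str], replica_id: int) -> bool:
    """Remove a replica's record only if it exists and has failed.

    `records` is a hypothetical table mapping replica ID to status.
    Returns True if a record was removed, False otherwise.
    """
    status = records.get(replica_id)
    if status is None or status not in FAILED_STATUSES:
        return False
    del records[replica_id]
    return True


records = {1: 'FAILED_INITIAL_DELAY', 2: 'STARTING'}
removed_failed = purge_replica_record(records, 1)    # failed record: removed
removed_starting = purge_replica_record(records, 2)  # still starting: kept
print(removed_failed, removed_starting, records)
```

With this approach, failed replicas from a previous version stay visible until someone explicitly removes them.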

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Sep 15, 2024
@cblmemo cblmemo removed the Stale label Sep 15, 2024

2 participants