Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Support manually terminating a replica #3179

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

dtran24
Copy link
Contributor

@dtran24 dtran24 commented Feb 18, 2024

Fixes #3135

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Terminated replicas with YAML specifying replicas: 2
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@dtran24 dtran24 changed the title [Serve] Manually terminate a replica [Serve] Support manually terminating a replica Feb 18, 2024
@dtran24 dtran24 marked this pull request as ready for review February 18, 2024 21:36
@dtran24
Copy link
Contributor Author

dtran24 commented Feb 18, 2024

Requesting review @cblmemo @MaoZiming

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this exciting feature!! This would be very helpful. Left some comments

sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/serve/controller.py Outdated Show resolved Hide resolved
sky/serve/core.py Outdated Show resolved Hide resolved
sky/serve/serve_utils.py Outdated Show resolved Hide resolved
sky/serve/serve_utils.py Outdated Show resolved Hide resolved
sky/serve/serve_utils.py Outdated Show resolved Hide resolved
if replica_id is None:
return {'message': 'Error: Replica ID is not specified.'}
logger.info(f'Terminating replica {replica_id}')
self._replica_manager.scale_down(replica_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not enough to only call replica_manager.scale_down since one important usecase of this command is cleanup failed records, but this function will keep any failed record. Should we add a purge argument to force clean it from database?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is something like this change (b838b1f) what you had in mind? If so, what's the best way to test this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is, except for we need to terminate the cluster too when purging. IIUC current implementation will only remove the replica record. You can try to sky serve down --purge <service-name> -r <id> for : 1. a failed replica, 2. a ready replica and see if the record and cluster got removed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should print some error message if we terminate a failed replica without --purge and prompt the user to use --purge to cleanup the record.

@dtran24 dtran24 requested a review from cblmemo February 20, 2024 05:21
sky/cli.py Show resolved Hide resolved
Comment on lines +108 to +114
'message':
f'{colorama.Fore.YELLOW}Purged replica {replica_id} '
f'with failed status ({replica_info.status}). This may'
f' indicate a resource leak. Please check the following'
f' SkyPilot cluster on the controller: '
f'{replica_info.cluster_name}{colorama.Style.RESET_ALL}'
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as said in the comment above, we should cleanup this cluster as well. how about adding an argument purge: bool in ReplicaManager.scale_down and remove the replica record here?

if info.status_property.is_scale_down_succeeded(
self._get_initial_delay_seconds(info.version)):
# This means the cluster is deleted due to
# a scale down or the cluster is recovering
# from preemption. Delete the replica info
# so it won't count as a replica.
if info.status_property.preempted:
removal_reason = 'for preemption recovery'
else:
removal_reason = 'normally'
# Don't keep failed record for version mismatch replicas,
# since user should fixed the error before update.
elif info.version != self.latest_version:
removal_reason = 'for version outdated'
else:
logger.info(f'Termination of replica {replica_id} '
'finished. Replica info is kept since some '
'failure detected.')
serve_state.add_or_update_replica(self._service_name,
replica_id, info)
if removal_reason is not None:
serve_state.remove_replica(self._service_name, replica_id)
logger.info(f'Replica {replica_id} removed from the '
f'replica table {removal_reason}.')

sky/serve/controller.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll mentioned this pull request Mar 9, 2024
6 tasks
@Michaelvll Michaelvll added the P0 label Mar 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Serve] Support for terminating one of the replica manually
3 participants