[Serve] Support manually terminating a replica #3179
base: master
Conversation
Requesting review: @cblmemo @MaoZiming
Thanks for adding this exciting feature!! This would be very helpful. Left some comments
sky/serve/controller.py
Outdated
if replica_id is None:
    return {'message': 'Error: Replica ID is not specified.'}
logger.info(f'Terminating replica {replica_id}')
self._replica_manager.scale_down(replica_id)
It is not enough to only call replica_manager.scale_down, since one important use case of this command is cleaning up failed records, but this function keeps any failed record. Should we add a purge argument to force-clean it from the database?
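The suggested behavior could be sketched as follows. This is a hypothetical illustration only: the handler shape, function name, and status names are my assumptions, not SkyPilot's actual API.

```python
# Hypothetical sketch of the suggested `purge` behavior: terminating a
# failed replica without purge is rejected with a hint, while purge
# force-removes the record. Names here are illustrative placeholders.
FAILED_STATUSES = {'FAILED', 'FAILED_INITIAL_DELAY', 'FAILED_PROBING'}


def handle_terminate(replica_status: str, purge: bool) -> str:
    """Return the action taken for a single terminate request."""
    if replica_status in FAILED_STATUSES and not purge:
        # Plain scale_down would keep the failed record; prompt the user.
        return 'error: replica has failed; use --purge to clean the record'
    if replica_status in FAILED_STATUSES:
        return 'purged'  # force-remove the record from the database
    return 'scaled_down'  # normal graceful termination
```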
Is something like this change (b838b1f) what you had in mind? If so, what's the best way to test this?
Yes it is, except that we need to terminate the cluster too when purging. IIUC, the current implementation will only remove the replica record. You can try sky serve down --purge <service-name> -r <id> for: 1. a failed replica, 2. a ready replica, and see if the record and cluster get removed.
Also, we should print an error message if the user terminates a failed replica without --purge, and prompt them to use --purge to clean up the record.
'message':
    f'{colorama.Fore.YELLOW}Purged replica {replica_id} '
    f'with failed status ({replica_info.status}). This may'
    f' indicate a resource leak. Please check the following'
    f' SkyPilot cluster on the controller: '
    f'{replica_info.cluster_name}{colorama.Style.RESET_ALL}'
}
As said in the comment above, we should clean up this cluster as well. How about adding an argument purge: bool in ReplicaManager.scale_down and removing the replica record here?
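A minimal stand-in for that idea: scale_down takes a purge flag, always terminates the underlying cluster (so purging never leaks one), and only drops the record of a failed replica when purge is set. The in-memory table and its field names are invented for illustration; the real logic lives in sky/serve/replica_managers.py.

```python
# Toy model of the suggested scale_down(purge=...) behavior. The dict
# table and 'failed'/'cluster_alive' fields are invented placeholders.
def scale_down(table: dict, replica_id: int, purge: bool = False) -> None:
    info = table.get(replica_id)
    if info is None:
        return
    # Terminate the cluster in every path, so purging never leaks one.
    info['cluster_alive'] = False
    if info['failed'] and not purge:
        return  # keep the failed record for debugging
    del table[replica_id]  # remove the replica record from the table


table = {
    1: {'failed': True, 'cluster_alive': True},
    2: {'failed': False, 'cluster_alive': True},
}
scale_down(table, 1)              # failed, no purge: record is kept
scale_down(table, 2)              # healthy: record removed
scale_down(table, 1, purge=True)  # purge: failed record removed too
```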
skypilot/sky/serve/replica_managers.py
Lines 752 to 775 in e51c2fb
if info.status_property.is_scale_down_succeeded(
        self._get_initial_delay_seconds(info.version)):
    # This means the cluster is deleted due to
    # a scale down or the cluster is recovering
    # from preemption. Delete the replica info
    # so it won't count as a replica.
    if info.status_property.preempted:
        removal_reason = 'for preemption recovery'
    else:
        removal_reason = 'normally'
# Don't keep failed records for version-mismatched replicas,
# since the user should fix the error before updating.
elif info.version != self.latest_version:
    removal_reason = 'for version outdated'
else:
    logger.info(f'Termination of replica {replica_id} '
                'finished. Replica info is kept since some '
                'failure was detected.')
    serve_state.add_or_update_replica(self._service_name,
                                      replica_id, info)
if removal_reason is not None:
    serve_state.remove_replica(self._service_name, replica_id)
    logger.info(f'Replica {replica_id} removed from the '
                f'replica table {removal_reason}.')
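The branch structure above can be distilled into a small pure function. This is a simplification for illustration; the argument names are mine, not the code's.

```python
# Simplified distillation of the removal decision in the snippet above.
# Returns the reason the replica record is removed, or None to keep it
# (a failed, up-to-date replica whose record is kept for debugging).
def removal_reason(scale_down_succeeded: bool, preempted: bool,
                   version: int, latest_version: int):
    if scale_down_succeeded:
        return 'for preemption recovery' if preempted else 'normally'
    if version != latest_version:
        # Outdated replicas are dropped even on failure: the user should
        # fix the error before updating, so the record is not kept.
        return 'for version outdated'
    return None
```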
Fixes #3135
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_compatibility_tests.sh