[Serve] Support manually terminating a replica #3179
base: master
Conversation
Requesting review: @cblmemo @MaoZiming
Thanks for adding this exciting feature!! This would be very helpful. Left some comments
sky/serve/controller.py
Outdated
if replica_id is None:
    return {'message': 'Error: Replica ID is not specified.'}
logger.info(f'Terminating replica {replica_id}')
self._replica_manager.scale_down(replica_id)
It is not enough to only call replica_manager.scale_down, since one important use case of this command is cleaning up failed records, but this function keeps any failed record. Should we add a purge argument to force-clean it from the database?
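The suggested behavior could be sketched as follows. This is a hypothetical illustration only: the handler shape, function name, and status names are my assumptions, not SkyPilot's actual API.

```python
# Hypothetical sketch of the suggested `purge` behavior: terminating a
# failed replica without purge is rejected with a hint, while purge
# force-removes the record. Names here are illustrative placeholders.
FAILED_STATUSES = {'FAILED', 'FAILED_INITIAL_DELAY', 'FAILED_PROBING'}


def handle_terminate(replica_status: str, purge: bool) -> str:
    """Return the action taken for a single terminate request."""
    if replica_status in FAILED_STATUSES and not purge:
        # Plain scale_down would keep the failed record; prompt the user.
        return 'error: replica has failed; use --purge to clean the record'
    if replica_status in FAILED_STATUSES:
        return 'purged'  # force-remove the record from the database
    return 'scaled_down'  # normal graceful termination
```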
Is something like this change (b838b1f) what you had in mind? If so, what's the best way to test this?
Yes it is, except that we need to terminate the cluster too when purging. IIUC, the current implementation will only remove the replica record. You can try sky serve down --purge <service-name> -r <id> for: 1. a failed replica, 2. a ready replica, and see if the record and cluster get removed.
Also, we should print an error message if the user terminates a failed replica without --purge, and prompt them to use --purge to clean up the record.
'message':
    f'{colorama.Fore.YELLOW}Purged replica {replica_id} '
    f'with failed status ({replica_info.status}). This may'
    f' indicate a resource leak. Please check the following'
    f' SkyPilot cluster on the controller: '
    f'{replica_info.cluster_name}{colorama.Style.RESET_ALL}'
}
As said in the comment above, we should clean up this cluster as well. How about adding an argument purge: bool in ReplicaManager.scale_down and removing the replica record here?
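A minimal stand-in for that idea: scale_down takes a purge flag, always terminates the underlying cluster (so purging never leaks one), and only drops the record of a failed replica when purge is set. The in-memory table and its field names are invented for illustration; the real logic lives in sky/serve/replica_managers.py.

```python
# Toy model of the suggested scale_down(purge=...) behavior. The dict
# table and 'failed'/'cluster_alive' fields are invented placeholders.
def scale_down(table: dict, replica_id: int, purge: bool = False) -> None:
    info = table.get(replica_id)
    if info is None:
        return
    # Terminate the cluster in every path, so purging never leaks one.
    info['cluster_alive'] = False
    if info['failed'] and not purge:
        return  # keep the failed record for debugging
    del table[replica_id]  # remove the replica record from the table


table = {
    1: {'failed': True, 'cluster_alive': True},
    2: {'failed': False, 'cluster_alive': True},
}
scale_down(table, 1)              # failed, no purge: record is kept
scale_down(table, 2)              # healthy: record removed
scale_down(table, 1, purge=True)  # purge: failed record removed too
```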
skypilot/sky/serve/replica_managers.py
Lines 752 to 775 in e51c2fb
if info.status_property.is_scale_down_succeeded(
        self._get_initial_delay_seconds(info.version)):
    # This means the cluster is deleted due to
    # a scale down or the cluster is recovering
    # from preemption. Delete the replica info
    # so it won't count as a replica.
    if info.status_property.preempted:
        removal_reason = 'for preemption recovery'
    else:
        removal_reason = 'normally'
# Don't keep failed records for version-mismatched replicas,
# since the user should fix the error before updating.
elif info.version != self.latest_version:
    removal_reason = 'for version outdated'
else:
    logger.info(f'Termination of replica {replica_id} '
                'finished. Replica info is kept since some '
                'failure was detected.')
    serve_state.add_or_update_replica(self._service_name,
                                      replica_id, info)
if removal_reason is not None:
    serve_state.remove_replica(self._service_name, replica_id)
    logger.info(f'Replica {replica_id} removed from the '
                f'replica table {removal_reason}.')
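The branch structure above can be distilled into a small pure function. This is a simplification for illustration; the argument names are mine, not the code's.

```python
# Simplified distillation of the removal decision in the snippet above.
# Returns the reason the replica record is removed, or None to keep it
# (a failed, up-to-date replica whose record is kept for debugging).
def removal_reason(scale_down_succeeded: bool, preempted: bool,
                   version: int, latest_version: int):
    if scale_down_succeeded:
        return 'for preemption recovery' if preempted else 'normally'
    if version != latest_version:
        # Outdated replicas are dropped even on failure: the user should
        # fix the error before updating, so the record is not kept.
        return 'for version outdated'
    return None
```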
Fixes #3135
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_compatibility_tests.sh