[serve] Split out metadata for checkpointing #11533
Nice progress
python/ray/serve/controller.py
Outdated
# Fetch actor handles for all of the backend replicas in the system.
# All of these workers are guaranteed to already exist because they
It might make sense to move this into actor_nursery, e.g. as actor_nursery.restore_from_checkpoint.
even a custom deserializer?
python/ray/serve/controller.py
Outdated
  # would not be written to a checkpoint in self.actor_nursery.workers
  # until they were created.
  for backend_tag, replica_tags in self.actor_nursery.replicas.items():
      for replica_tag in replica_tags:
          replica_name = format_actor_name(replica_tag,
                                           self.controller_name)
-         self.workers[backend_tag][replica_tag] = ray.get_actor(
-             replica_name)
+         self.actor_nursery.workers[backend_tag][
+             replica_tag] = ray.get_actor(replica_name)

  # Push configuration state to the router.
  # TODO(edoakes): should we make this a pull-only model for simplicity?
- for endpoint, traffic_policy in self.traffic_policies.items():
+ for endpoint, traffic_policy in self.configuration_store.\
+         traffic_policies.items():
The checkpoint recovery logic between the configuration store and the actor nursery seems pretty tightly coupled. I would much prefer if we weren't directly accessing the fields of the actor nursery here (it makes the abstraction leaky). Would it be possible to load the config from the checkpoint, pass it into the actor nursery, and have the nursery reconcile the state and initialize itself?
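A minimal sketch of what that could look like, not the actual Serve code. `ConfigurationStore`, `ActorNursery.reconcile`, `get_replica_handles`, and the injected `get_handle` callable are all hypothetical names; in the controller, `get_handle` would presumably wrap `ray.get_actor(format_actor_name(...))` as in the diff above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ConfigurationStore:
    # Hypothetical stand-in for the checkpointed metadata.
    replicas: Dict[str, List[str]] = field(default_factory=dict)
    traffic_policies: Dict[str, dict] = field(default_factory=dict)


class ActorNursery:
    # Hypothetical stand-in: owns the actor handles and their lifetime.
    def __init__(self, get_handle: Callable[[str], object]):
        # Assumption: handle lookup is injected so the nursery does not
        # need a reference back to the controller.
        self._get_handle = get_handle
        self._workers: Dict[str, Dict[str, object]] = defaultdict(dict)

    def reconcile(self, config: ConfigurationStore) -> None:
        """Rebuild a handle for every replica listed in the restored config."""
        for backend_tag, replica_tags in config.replicas.items():
            for replica_tag in replica_tags:
                self._workers[backend_tag][replica_tag] = self._get_handle(
                    replica_tag)

    def get_replica_handles(self, backend_tag: str) -> Dict[str, object]:
        # Return a copy so callers never touch the nursery's own fields.
        return dict(self._workers[backend_tag])


# Recovery path: restore the config, then hand it to the nursery.
config = ConfigurationStore(replicas={"backend_a": ["a#1", "a#2"]})
nursery = ActorNursery(get_handle=lambda name: f"handle:{name}")
nursery.reconcile(config)
handles = nursery.get_replica_handles("backend_a")
```

With this shape the controller only calls `reconcile` and the accessor, and never writes into `actor_nursery.workers` directly.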
LGTM, just a few small style comments.
python/ray/serve/controller.py
Outdated
@dataclass
class Checkpoint:
extra space
For the follow-up PR, I think there are a few things to verify/update:
Co-authored-by: Simon Mo <simon.mo@hey.com>
python/ray/serve/controller.py
Outdated
        self.backends_to_remove.clear()

    async def _start_backend_worker(
            self, controller: "ServeController", backend_tag: BackendTag,
Pass in configuration store
python/ray/serve/controller.py
Outdated
controller.autoscaling_policies[
    backend] = BasicAutoscalingPolicy(
        backend, metadata.autoscaling_config)
Return dict of autoscaling policies!
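Roughly what I have in mind, as a sketch only. `BackendMetadata` and `BasicAutoscalingPolicy` below are simplified stand-ins for the real classes, `build_autoscaling_policies` is a hypothetical name, and skipping backends without an autoscaling config is my assumption, not something taken from the diff.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BackendMetadata:
    # Simplified stand-in; only the field used here.
    autoscaling_config: Optional[dict] = None


class BasicAutoscalingPolicy:
    # Simplified stand-in for the class named in the diff.
    def __init__(self, backend: str, config: dict):
        self.backend = backend
        self.config = config


def build_autoscaling_policies(
        backends: Dict[str, BackendMetadata]
) -> Dict[str, BasicAutoscalingPolicy]:
    """Return the policies instead of assigning them onto the controller."""
    return {
        backend: BasicAutoscalingPolicy(backend, metadata.autoscaling_config)
        for backend, metadata in backends.items()
        # Assumption: backends with no autoscaling config get no policy.
        if metadata.autoscaling_config is not None
    }


policies = build_autoscaling_policies({
    "a": BackendMetadata(autoscaling_config={"max_replicas": 2}),
    "b": BackendMetadata(),
})
```

The caller can then assign the result itself, so this code does not need a reference to the controller.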
LGTM after the changes we discussed to remove the cyclic dependency between the reconciler and the controller.
Why are these changes needed?
This PR splits out the data stored in the Controller by whether it is more 'stored metadata' or 'lifetime management data'.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.