Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Split out metadata for checkpointing #11533

Merged
merged 18 commits into from
Nov 3, 2020

Conversation

ijrsvt
Copy link
Contributor

@ijrsvt ijrsvt commented Oct 21, 2020

Why are these changes needed?

This PR splits out the data stored in the Controller by whether it is more 'stored metadata' or 'lifetime management data'

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt ijrsvt marked this pull request as ready for review October 22, 2020 07:12
Copy link
Contributor

@simon-mo simon-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress

python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
Comment on lines 503 to 504
# Fetch actor handles for all of the backend replicas in the system.
# All of these workers are guaranteed to already exist because they
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to move this to actor_nursery like actor_nursery.restore_from_checkpoint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even a custom deserializer?

python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
Comment on lines 505 to 517
# would not be written to a checkpoint in self.actor_nursery.workers
# until they were created.
for backend_tag, replica_tags in self.actor_nursery.replicas.items():
for replica_tag in replica_tags:
replica_name = format_actor_name(replica_tag,
self.controller_name)
self.workers[backend_tag][replica_tag] = ray.get_actor(
replica_name)
self.actor_nursery.workers[backend_tag][
replica_tag] = ray.get_actor(replica_name)

# Push configuration state to the router.
# TODO(edoakes): should we make this a pull-only model for simplicity?
for endpoint, traffic_policy in self.traffic_policies.items():
for endpoint, traffic_policy in self.configuration_store.\
traffic_policies.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkpoint recovery logic between the configuration store and the actor nursery seems pretty tightly coupled. I would much prefer if we weren't directly accessing the fields of the actor nursery here (makes it a leaky abstraction). Would it be possible to just do something like load the config from the checkpoint and pass it into the actor nursery and have it reconcile the state & initialize?

Copy link
Contributor

@simon-mo simon-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM few small style comments

python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved
python/ray/serve/controller.py Outdated Show resolved Hide resolved

@dataclass
class Checkpoint:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space

@simon-mo
Copy link
Contributor

simon-mo commented Nov 2, 2020

For the follow up PR, I think there few things to verify/update:

  • Do we need the writer_lock? what's the implication of having it/removing it?
  • Can we move shutdown (and more method) to actor_state_reconciler as well?
  • Merge the upcoming long pulling with the two state storage.
  • Can we simplify the _start/stop_pending_replicas logic?

self.backends_to_remove.clear()

async def _start_backend_worker(
self, controller: "ServeController", backend_tag: BackendTag,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass in configuration store

Comment on lines 432 to 435
controller.autoscaling_policies[
backend] = BasicAutoscalingPolicy(
backend, metadata.autoscaling_config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return dict of autoscaling policies!

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after changes we discussed to remove cyclic dependency between the reconciler and controller

@simon-mo simon-mo merged commit c3074f5 into ray-project:master Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants