[Serve] ensure replica reconfigure runs after allocation check #24052

Merged

Conversation

Member

@iasoon iasoon commented Apr 20, 2022

Why are these changes needed?

Since remote calls provide no ordering guarantees, reconfigure can end up being called before is_allocated. Because reconfigure then runs the user initialization code, the replica actor could get blocked and never respond to its allocation check.
This PR ensures that the allocation proof has been received before we run the replica initialization.
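
To illustrate the ordering idea, here is a minimal sketch in plain asyncio rather than Ray. ToyReplica, start_replica, and the after parameter are hypothetical names made up for this example; they are not the code merged in this PR. The sketch only shows the general technique: hand the pending result of the allocation check to the initialization call, so the initialization cannot proceed until the check has been answered.

```python
import asyncio


class ToyReplica:
    """Hypothetical stand-in for a replica actor; not Ray Serve's class."""

    def __init__(self):
        self.events = []

    async def is_allocated(self):
        # Simulate the allocation check being handled late.
        await asyncio.sleep(0.01)
        self.events.append("is_allocated")
        return "allocation-proof"

    async def reconfigure(self, user_config, after):
        # `after` is the pending result of `is_allocated`. Awaiting it
        # guarantees the allocation check was answered before the
        # (potentially blocking) user initialization would run.
        await after
        self.events.append("reconfigure")
        return user_config


async def start_replica(replica, user_config):
    # Both calls are issued right away; the caller never polls.
    allocated = asyncio.ensure_future(replica.is_allocated())
    ready = asyncio.ensure_future(replica.reconfigure(user_config, allocated))
    await ready
    return replica.events
```

The caller still fires both calls without waiting in between, which is the property the description above cares about: the ordering is enforced on the receiving side, not by polling from the caller.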

I believe this approach is better than handling the ordering in the calling code (in deployment_state), since that would require the controller to poll the replica in order to make progress on the initialization.

Would we want a test for this? What would that look like?

Related issue number

#24044

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@iasoon iasoon requested a review from edoakes April 20, 2022 20:55
@iasoon iasoon changed the title [Serve] ensure reconfigure runs after allocation check [Serve] ensure replica reconfigure runs after allocation check Apr 20, 2022
Contributor

@edoakes edoakes left a comment

Clever & simple solution! Disappointed that I didn't think of this...

Not sure how to write a test for this, but given the simplicity I don't think it's a hard requirement. Do you have any ideas for how it could be tested?

Comment on lines 340 to 341
# ensure that `reconfigure` will only be called after a response
# has been received from `is_allocated`.
Contributor

Suggested change
- # ensure that `reconfigure` will only be called after a response
- # has been received from `is_allocated`.
+ # Ensure that `reconfigure` will only be called after a response
+ # has been received from `is_allocated`.

nit: capitalization

please also add why we care about the ordering here

Member Author

Added a comment on that, but I find it a bit difficult to concisely explain. Feel free to nitpick it if you can think of something clearer!

Contributor

looks great 👍

python/ray/serve/replica.py (outdated, resolved)
@edoakes edoakes self-assigned this Apr 20, 2022
Contributor

edoakes commented Apr 21, 2022

@iasoon looks like there's a failure in the cross-language/java support:
https://buildkite.com/ray-project/ray-builders-pr/builds/29930#ce581916-8dda-4c06-a300-c52202e270a0/242-3751

@simon-mo could you provide some guidance on what to update to fix this?

Member Author

iasoon commented Apr 21, 2022

I think the failure is because I used a keyword argument for the reconfigure call. I think it's fixed now.

@edoakes edoakes merged commit c9f0e48 into ray-project:master Apr 21, 2022
@jjyao jjyao mentioned this pull request Jun 8, 2023
8 tasks
jjyao added a commit that referenced this pull request Jun 13, 2023
Disallow the STARTING -> UPDATING transition. By only updating running replicas, we no longer need to worry about the ordering issue between initialize_and_get_metadata and reconfigure shown in #24052, and the code becomes easier to reason about.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
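
As a rough illustration of what disallowing a transition can look like, here is a small sketch of a transition guard. The ReplicaState enum, the ALLOWED table, and the transition helper are hypothetical and written only for this example. The only facts taken from the commit message are that STARTING -> UPDATING is disallowed and that running replicas can be updated; the remaining entries in the table, including the RUNNING state name, are assumptions.

```python
from enum import Enum


class ReplicaState(Enum):
    # State names are illustrative; RUNNING is an assumed name.
    STARTING = "STARTING"
    UPDATING = "UPDATING"
    RUNNING = "RUNNING"


# Assumed transition table. STARTING -> UPDATING is deliberately absent,
# so a replica must finish starting before it can be updated.
ALLOWED = {
    ReplicaState.STARTING: {ReplicaState.RUNNING},
    ReplicaState.RUNNING: {ReplicaState.UPDATING},
    ReplicaState.UPDATING: {ReplicaState.RUNNING},
}


def transition(current, target):
    """Return the target state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

With a guard like this, an update requested while a replica is still starting is rejected up front, so the initialization and reconfigure calls never race.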
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Disallow the STARTING -> UPDATING transition. By only updating running replicas, we no longer need to worry about the ordering issue between initialize_and_get_metadata and reconfigure shown in ray-project#24052, and the code becomes easier to reason about.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>