-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] ensure replica reconfigure runs after allocation check #24052
[Serve] ensure replica reconfigure runs after allocation check #24052
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever & simple solution! Disappointed that I didn't think of this...
Not sure how to write a test for this, but given the simplicity I don't think it's a hard requirement. Do you have any ideas for how it could be tested?
python/ray/serve/deployment_state.py
Outdated
# ensure that `reconfigure` will only be called after a response | ||
# has been received from `is_allocated`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# ensure that `reconfigure` will only be called after a response | |
# has been received from `is_allocated`. | |
# Ensure that `reconfigure` will only be called after a response | |
# has been received from `is_allocated`. |
nit: capitalization
please also add why we care about the ordering here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment on that, but I find it a bit difficult to concisely explain. Feel free to nitpick it if you can think of something clearer!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great 👍
@iasoon looks like there's a failure in the cross-language/java support: @simon-mo could you provide some guidance on what to update to fix this? |
I think the failure is because I used a keyword argument for the reconfigure call, I think it's fixed now |
Disallow STARTING -> UPDATING transition. By only updating running replicas, we don't need to worry about the ordering issue between initialize_and_get_metadata and reconfigure as show in #24052 and will make the code easier to reason about. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Disallow STARTING -> UPDATING transition. By only updating running replicas, we don't need to worry about the ordering issue between initialize_and_get_metadata and reconfigure as show in ray-project#24052 and will make the code easier to reason about. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Since remote calls provide no ordering guarantees, it could happen that
reconfigure
gets called beforeis_allocated
Sincereconfigure
then runs the user initialization code, the replica actor could get blocked and never provide its allocation check.This PR ensures that the allocation proof has been received before we run the replica initialization.
I believe this approach is better than handling the ordering in the calling code (in deployment_state), since that would require the controller to poll the replica in order to make progress on the initialization.
Would we want a test for this? What would that look like?
Related issue number
#24044
Checks
scripts/format.sh
to lint the changes in this PR.