[serve] Proxy can mark itself healthy before receiving routes from the controller #43076

Closed
edoakes opened this issue Feb 9, 2024 · 0 comments · Fixed by #43086
Labels: bug, P1, ray 2.10, serve

edoakes commented Feb 9, 2024

From a Slack user:

I observed some interesting behavior today and wanted to check whether my assessment is plausible.
Observations / order of events:
T0: Head node alive
T1: Replica is assigned to new worker node
T2: Head node goes down
T3: Health checks start passing on the new worker node
T4: Requests start arriving at the new worker node; they all fail with 404 errors
T5: Head node eventually gets restarted
T6: Requests start succeeding to that node, and we also see the HTTPProxy CPU usage metric reported for that node for the first time
Ray version: 2.9.1
Based on my high-level reading of the code, it seems that the HTTPProxy fetches its routes via long polling to the head node, and that metrics such as the HTTPProxy CPU usage are also sent to the head node to be reported.
What I'm wondering/speculating: is it possible for the proxy to get initialized and respond positively to health checks even when it hasn't yet successfully fetched the routes from the head node / Serve controller?
I.e., if the head node goes down mid-initialization, can we end up in a bad state where a worker node has an HTTPProxy that is marked as healthy but doesn't know about any routes until the head node is alive again?
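
To make the suspected failure mode concrete, here is a minimal sketch in plain Python. It does not use Ray and is not Ray Serve's actual proxy code: `ToyProxy`, `update_routes`, `is_healthy`, `handle`, and the `require_routes_for_health` flag are all hypothetical names chosen for illustration. It assumes the route table arrives asynchronously (as it would with long polling) and contrasts a health check that ignores route state with one that waits for the first route update.

```python
from typing import Dict, Optional


class ToyProxy:
    """Hypothetical stand-in for a proxy; NOT Ray Serve's HTTPProxy."""

    def __init__(self, require_routes_for_health: bool) -> None:
        # None means no route table has ever arrived from the controller.
        self._routes: Optional[Dict[str, str]] = None
        self._require_routes = require_routes_for_health

    def update_routes(self, routes: Dict[str, str]) -> None:
        # Assumption: in a real system a long-poll callback would call this.
        self._routes = dict(routes)

    def is_healthy(self) -> bool:
        # Gated variant: report healthy only after the first route update.
        if self._require_routes:
            return self._routes is not None
        # Ungated variant: healthy as soon as the process is up.
        return True

    def handle(self, path: str) -> int:
        # Return an HTTP-like status code: 404 if no route prefix matches.
        routes = self._routes or {}
        for prefix in routes:
            if path.startswith(prefix):
                return 200
        return 404


# Controller unreachable: no routes have been delivered yet.
ungated = ToyProxy(require_routes_for_health=False)
gated = ToyProxy(require_routes_for_health=True)

print(ungated.is_healthy(), ungated.handle("/app"))  # True 404
print(gated.is_healthy())  # False

# Controller comes back and pushes routes.
gated.update_routes({"/app": "app_deployment"})
print(gated.is_healthy(), gated.handle("/app"))  # True 200
```

The ungated proxy reproduces the T3/T4 symptom (healthy, yet every request returns 404), while the gated one stays out of rotation until routes arrive. Whether the real proxy should behave like the gated variant is the question this issue raises.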

@edoakes edoakes added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue ray 2.10 labels Feb 9, 2024
@edoakes edoakes self-assigned this Feb 9, 2024