I observed some interesting behavior today and wanted to check whether my assessment is plausible.
Observations, in order of events:
T0: Head node alive
T1: Replica is assigned to new worker node
T2: Head node goes down
T3: Health checks start passing on the new worker node
T4: Requests start arriving at the new worker node; they all fail with 404 errors
T5: Head node eventually gets restarted
T6: Requests to that node start succeeding, and the HTTPProxy CPU usage metric is reported for that node for the first time
Ray version: 2.9.1
Based on my high-level reading of the code:
It seems that the HTTPProxy fetches its routes by long polling the head node, and that metrics such as the HTTPProxy CPU usage are also sent to the head node to be reported.
What I'm wondering/speculating: is it possible for the proxy to get initialized and respond positively to health checks even when it hasn't yet successfully fetched the routes from the head node / Serve controller?
i.e., if the head node goes down mid-initialization, can we end up in a bad state where a worker node has an HTTPProxy that is marked as healthy but doesn't know about any routes until the head node is alive again?
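To make the hypothesis concrete, here is a minimal toy model of the suspected failure mode. It does not use Ray and is not Ray Serve's actual implementation: `ToyController`, `ToyProxy`, `refresh_routes`, `health_check`, and `ready_check` are all hypothetical names invented for illustration. It only shows how a health check that does not depend on the route table could pass while every request returns 404.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class ControllerUnavailable(Exception):
    """Raised when the simulated controller on the head node is unreachable."""


@dataclass
class ToyController:
    # Hypothetical stand-in for the Serve controller running on the head node.
    routes: Dict[str, str]
    alive: bool = True

    def poll_routes(self) -> Dict[str, str]:
        if not self.alive:
            raise ControllerUnavailable("head node is down")
        return dict(self.routes)


@dataclass
class ToyProxy:
    # Hypothetical stand-in for an HTTPProxy on a worker node.
    controller: ToyController
    routes: Optional[Dict[str, str]] = None  # None until the first successful poll

    def refresh_routes(self) -> None:
        # One polling attempt; on failure, keep whatever routes we already had.
        try:
            self.routes = self.controller.poll_routes()
        except ControllerUnavailable:
            pass

    def health_check(self) -> bool:
        # Assumed behavior under test: reports process liveness only,
        # without checking whether a route table was ever received.
        return True

    def ready_check(self) -> bool:
        # Possible stricter check: ready only after routes were fetched once.
        return self.routes is not None

    def handle(self, path: str) -> int:
        # Return an HTTP status code for the request path.
        if self.routes is None or path not in self.routes:
            return 404
        return 200


def simulate():
    controller = ToyController(routes={"/app": "app_deployment"})
    controller.alive = False            # T2: head node goes down
    proxy = ToyProxy(controller)        # proxy starts on the new worker node
    proxy.refresh_routes()              # first poll fails, routes stay None
    during = (proxy.health_check(), proxy.ready_check(), proxy.handle("/app"))

    controller.alive = True             # T5: head node is restarted
    proxy.refresh_routes()              # next poll succeeds
    after = (proxy.health_check(), proxy.ready_check(), proxy.handle("/app"))
    return during, after


if __name__ == "__main__":
    print(simulate())
```

In this model the outage window matches T3/T4 (healthy, yet 404 on every path), and the state after the restart matches T6. If the real proxy behaves this way, gating the health check on having received at least one route update would keep the node out of rotation until it can actually serve traffic.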
edoakes added the bug, P1, serve, and ray 2.10 labels on Feb 9, 2024
From a slack user: