[Core] Make sure dashboard agent will exit if grpc server fails #44899
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
 if self.server:
-    await self.server.wait_for_termination()
+    tasks.append(self.server.wait_for_termination())
I manually tested it with a broken grpcio version, but I don't know how to write a test for it since I don't know how to make wait_for_termination() throw an exception with a good grpcio version.
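The effect of the change above can be illustrated without a broken grpcio by substituting a stub for the server. This is a minimal sketch under stated assumptions, not Ray's actual agent code: FailingServer, module_loop, run_old, run_new and demo are hypothetical names, and the stub simply raises from wait_for_termination().

```python
import asyncio


class FailingServer:
    """Hypothetical stand-in for a server whose termination wait fails."""

    async def wait_for_termination(self):
        raise RuntimeError("grpc server failed")


async def module_loop():
    # Stands in for a dashboard module task that never returns.
    while True:
        await asyncio.sleep(0.01)


async def run_old(server):
    # Old ordering: gather() never finishes, so the wait below is unreachable.
    tasks = [module_loop()]
    await asyncio.gather(*tasks)
    await server.wait_for_termination()


async def run_new(server):
    # New ordering: the wait is gathered together with the module tasks, so
    # its exception propagates out of gather() and the caller can exit.
    tasks = [module_loop()]
    tasks.append(server.wait_for_termination())
    await asyncio.gather(*tasks)


async def demo():
    server = FailingServer()
    try:
        # A short timeout is used only to observe that run_old() is stuck.
        await asyncio.wait_for(run_old(server), timeout=0.1)
        old_result = "returned"
    except asyncio.TimeoutError:
        old_result = "hung"
    try:
        await asyncio.wait_for(run_new(server), timeout=0.1)
        new_result = "returned"
    except RuntimeError as exc:
        new_result = f"raised: {exc}"
    return old_result, new_result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the old ordering only ends because of the timeout, while the new ordering surfaces the server failure right away.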
@@ -50,4 +50,4 @@ async def run(self, server):

 @staticmethod
 def is_minimal_module():
-    return True
+    return False
I think this is not a minimal module since it depends on the aiohttp server. cc @rkooo567, can you confirm?
yeah I think you are right
LGTM, thanks for the fix! Please confirm if healthz agent is minimal or not.
Maybe add a unit test?
@rkooo567 I don't know how to trigger an exception from wait_for_termination() besides using a bad grpcio version. Do you have any ideas?
Maybe you can manually cause a port conflict or something?
I think there's an option called agent grpc port or something (can be found in agent.py).
Port conflict will fail
Hmm, I see. Yeah, maybe manual checking is enough in this case.
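For reference, one way such a failure could be injected with a working grpcio is to mock the server object rather than break the real one. The sketch below assumes the standard library's unittest.mock.AsyncMock; run_agent, never_returns and check_agent_exits_on_server_failure are hypothetical names, and run_agent is a simplified stand-in for the agent's run loop, not the real implementation.

```python
import asyncio
from unittest.mock import AsyncMock


async def run_agent(server, module_tasks):
    # Hypothetical simplification of the run loop after this PR:
    # the server wait is gathered together with the module tasks.
    tasks = list(module_tasks)
    if server:
        tasks.append(server.wait_for_termination())
    await asyncio.gather(*tasks)


async def never_returns():
    # Stands in for a module task with an infinite loop.
    await asyncio.Event().wait()


def check_agent_exits_on_server_failure():
    # Awaiting the mocked wait_for_termination() raises the side_effect.
    server = AsyncMock()
    server.wait_for_termination.side_effect = RuntimeError("boom")
    try:
        # The timeout guards against hanging if the failure is not noticed.
        asyncio.run(
            asyncio.wait_for(run_agent(server, [never_returns()]), timeout=1)
        )
    except RuntimeError as exc:
        return str(exc)
    return None
```

Whether this would be a meaningful test for the real agent depends on how its run loop can be driven in isolation, which this sketch does not address.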
Why are these changes needed?
If the grpc server fails, wait_for_termination() will raise an exception, and the dashboard agent should exit in this case. However, the dashboard agent currently never calls wait_for_termination() because it is stuck at await asyncio.gather(*tasks): those module tasks have infinite loops, so it never gets a chance to run wait_for_termination() and discover the failure of the grpc server. This PR makes wait_for_termination() part of asyncio.gather so that it runs.
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.