Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Make sure dashboard agent will exit if grpc server fails #44899

Merged
merged 1 commit into from
Apr 22, 2024

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Apr 22, 2024

Why are these changes needed?

If grpc server fails, wait_for_termination() will raise an exception and dashboard agent should exit in this case. However, currently dashboard agent won't call wait_for_termination() since it's stuck at await asyncio.gather(*tasks) and those module tasks have infinite loop so it never has a chance to run wait_for_termination() and discover the failure of grpc server. This PR makes wait_for_termination() part of asyncio.gather so that it runs.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao marked this pull request as ready for review April 22, 2024 17:47

if self.server:
await self.server.wait_for_termination()
tasks.append(self.server.wait_for_termination())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually test it with a broken grpcio version but I don't know how to write a test for it since I don't know how to let wait_for_termination() throw an exception with a good grpcio version.

@@ -50,4 +50,4 @@ async def run(self, server):

@staticmethod
def is_minimal_module():
return True
return False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not a minimal module since it depends on aio http server. cc @rkooo567 can you confirm?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think you are right

Copy link
Contributor

@ruisearch42 ruisearch42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix! Please confirm if healthz agent is minimal or not.

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a unit tests?

@@ -50,4 +50,4 @@ async def run(self, server):

@staticmethod
def is_minimal_module():
return True
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think you are right

@jjyao jjyao merged commit 8e36e4d into ray-project:master Apr 22, 2024
5 checks passed
@jjyao jjyao deleted the jjyao/grpc branch April 22, 2024 23:32
@jjyao
Copy link
Collaborator Author

jjyao commented Apr 22, 2024

maybe add a unit tests?

@rkooo567 I don't know how to trigger an exception from wait_for_termination() beside using a bad grpcio version. Do you have any ideas?

@rkooo567
Copy link
Contributor

maybe you can manually cause port conflict or sth?

@rkooo567
Copy link
Contributor

I think there's an option called agent grpc port or sth (can be found in agent.py)

@jjyao
Copy link
Collaborator Author

jjyao commented Apr 23, 2024

Port conflict will fail grpc_server.add_insecure_port() instead of grpc_server.wait_for_shutdown()

@rkooo567
Copy link
Contributor

hmm I see. Yeah maybe manual checking is enough in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants