
[ray_client] close ray connection upon client deactivation #13919

Merged
18 commits merged into ray-project:master from fix-connection-client
Feb 7, 2021

Conversation

richardliaw
Contributor

@richardliaw richardliaw commented Feb 4, 2021

Why are these changes needed?

Disconnects from Ray when num_clients drops to 0. This kills actors/tasks when the job disconnects.

However, I think this doesn't really work if you have multiple clients.
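The num_clients bookkeeping described above can be sketched roughly as follows. This is an illustrative, self-contained model of the behavior, not the actual ray_client_server code; the class and method names (ClientTracker, on_last_client_gone) are hypothetical, with the callback standing in for the real ray.disconnect()/server teardown.

```python
import threading

class ClientTracker:
    """Hypothetical sketch: disconnect Ray when the last client leaves."""

    def __init__(self, on_last_client_gone):
        self._lock = threading.Lock()
        self._num_clients = 0
        # Callback standing in for ray.disconnect() / server teardown.
        self._on_last_client_gone = on_last_client_gone

    def client_connected(self):
        with self._lock:
            self._num_clients += 1

    def client_disconnected(self):
        with self._lock:
            self._num_clients -= 1
            if self._num_clients == 0:
                # Last client left: dropping the Ray connection also kills
                # any actors/tasks owned by the client's job.
                self._on_last_client_gone()
```

As the caveat above notes, per-job isolation is not handled here: with multiple clients, the teardown fires only when all of them are gone.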

Questions:

  • I disabled a check for ray initialization upon client connection. Is this safe? (probably not?) Fixed
  • Should we create a new protobuf type for "servicer alive" (as opposed to servicer alive == ray connected)?
  • What's the right way to write a test here? Done

Related issue number

Closes #13354

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@ericl ericl self-assigned this Feb 5, 2021
break
# Ray is not ready yet, wait a timeout
time.sleep(timeout)
# # Now the HTTP2 channel is ready, or proxied, but the
Contributor

Hmm what problem was this causing? Shouldn't is_initialized still work as before after the client connects?

Contributor

Oh I see, the data connection isn't added until later. How about we add a new dummy ping operation here then instead of is_initialized()? It can be anything really, is_initialized was just used as a convenient no_op.

Contributor

is_initialized() also had the advantage of checking whether Ray was initialized, which is kind of useful here. What you might want to do, though, is open the data connection earlier. You might need to wait for it to return ready (meaning the send/receive loop thread pair is kicked off and sets a flag, or, even better, returns its client_id) -- but in general, yeah, some op here is useful.
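The "wait for the data connection itself to report ready" idea from this comment can be sketched as below. All names (DataConnection, _recv_loop, the simulated "client-123" handshake) are illustrative stand-ins, not the real DataClient; the real loop would read from the gRPC data stream.

```python
import threading

class DataConnection:
    """Sketch: block until the send/receive loop completes the handshake."""

    def __init__(self, connect_timeout=10):
        self._ready = threading.Event()
        self._client_id = None
        self._connect_timeout = connect_timeout

    def _recv_loop(self):
        # Simulate the server completing the handshake by returning a
        # client id; a real implementation would read it off the stream.
        self._client_id = "client-123"
        self._ready.set()

    def start(self):
        threading.Thread(target=self._recv_loop, daemon=True).start()
        # Wait for the handshake instead of assuming the channel is usable.
        if not self._ready.wait(self._connect_timeout):
            raise ConnectionError("ray client connection timeout")
        return self._client_id
```

Returning the client_id (rather than just setting a flag) gives the caller positive confirmation that the server-side handshake ran, which is what the reviewer suggests testing.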

Contributor Author

OK, I did something, take a look! Happy to hear any suggestions.

@ericl
Contributor

ericl commented Feb 5, 2021

Great start! Comments inline.

However, I think this doesn't really work if you have multiple clients.

Yeah, this is probably fine for now if we consider the common use case here to be a cluster with one or zero clients. There might be extra clients attached just for debugging purposes.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 5, 2021
address=args.redis_address,
_redis_password=args.redis_password)

def ray_connect_handler():
Contributor

Maybe make this top-level and pass in redis_address/redis_password?
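The suggested refactor could look roughly like this. It is a hedged sketch only: ray_init is an injected stand-in for ray.init so the example stays self-contained, and the real handler's signature may differ.

```python
def ray_connect_handler(ray_init, redis_address=None, redis_password=None):
    """Hypothetical top-level version of the closure above: connection
    info is passed in explicitly instead of being captured from args."""
    if redis_address is not None:
        # Mirrors the snippet above: connect to an existing cluster.
        return ray_init(address=redis_address, _redis_password=redis_password)
    # No address given: start/connect with defaults.
    return ray_init()
```

Lifting the handler to module level makes it importable and unit-testable without constructing the surrounding server state.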

python/ray/util/client/server/dataservicer.py (comment resolved)
@@ -139,6 +139,7 @@ message ClusterInfoType {
CLUSTER_RESOURCES = 2;
AVAILABLE_RESOURCES = 3;
RUNTIME_CONTEXT = 4;
IS_ALIVE = 5;
Contributor

Not sure if a protocol change is strictly necessary -- we know it's alive, for instance, if the handshake from the data channel completes (which is what you'd want to test anyway)

Contributor

Yeah agreed; I'm ok with this though as an explicit check since it seems straightforward.
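For illustration, server-side handling of the new IS_ALIVE type might look like the sketch below. The enum values mirror the ClusterInfoType diff above; everything else (the handle_cluster_info function, the fetch parameter) is hypothetical, not the actual servicer code.

```python
# Enum values copied from the .proto diff above.
CLUSTER_RESOURCES = 2
AVAILABLE_RESOURCES = 3
RUNTIME_CONTEXT = 4
IS_ALIVE = 5

def handle_cluster_info(request_type, fetch=None):
    """Hypothetical dispatch for cluster-info requests."""
    if request_type == IS_ALIVE:
        # Answer immediately: this only proves the servicer is up,
        # independent of whether Ray itself is connected -- exactly the
        # "servicer alive" vs. "ray connected" split discussed above.
        return True
    if fetch is None:
        raise ValueError(f"no fetcher for cluster info type {request_type}")
    return fetch(request_type)
```

This keeps the explicit health check cheap while leaving the resource-style queries to whatever backend lookup the server already performs.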

This reverts commit 0410c37.
ray.disconnect()
finally:
ray_client_server.shutdown_with_server(server_handle.grpc_server)
time.sleep(2)
Contributor Author

I just moved this stuff into the fixture

raise ConnectionError("ray client connection timeout")

# Initialize the streams to finish protocol negotiation.
self.data_client = DataClient(self.channel, self._client_id,
Contributor Author

@barakmich I just moved this into the try-except

# Ray is not ready yet, wait a timeout
time.sleep(timeout)
# Initialize the streams to finish protocol negotiation.
self.data_client = DataClient(self.channel, self._client_id,
Contributor

The constructor of dataclient doesn't look like it actually does anything, so this won't suffice as a health check. How about we just add a dummy ping RPC instead and preserve the original code structure here.

Contributor Author
@richardliaw richardliaw Feb 5, 2021

Dumb question - what's the easiest way to implement a ping here? Should I modify the protobuf struct to have a dummy value that I will handle on the server side?

@ericl
Contributor

ericl commented Feb 5, 2021 via email

@richardliaw
Contributor Author

richardliaw commented Feb 5, 2021 via email

@ericl
Contributor

ericl commented Feb 6, 2021

Test client failing

@richardliaw richardliaw merged commit 3a230fa into ray-project:master Feb 7, 2021
@richardliaw richardliaw deleted the fix-connection-client branch February 7, 2021 21:11
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Development

Successfully merging this pull request may close these issues.

[core/client] Kill running actors and tasks when the client disconnects
3 participants