Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ray client] enable ray.get with >2 sec timeout (#21883) #22165

Merged

Conversation

takeshi-yoshimura
Copy link
Contributor

Commit 2cf4c72 ("[ray client] Fix ctrl-c for ray.get() by setting a short-server side timeout") introduced a short server-side timeout not to block later operations.

However, the fix implicitly assumes that get() is complete within MAX_BLOCKING_OPERATION_TIME_S (two seconds). This becomes a problem when apps use heavy objects or limited network I/O bandwidth that require more than two seconds to push all chunks. The current retry logic needs to re-push from the beginning of chunks and block clients with the infinite re-push.

I updated the logic to directly pass timeout if it is explicitly given. Without timeout, it still uses MAX_BLOCKING_OPERATION_TIME_S for polling with the short server-side timeout.

Initially, I thought the fix should be exponential back-off, but that may change the behavior of many apps. So, this patch focuses on enabling bigger timeouts.

Why are these changes needed?

Related issue number

Closes #21883

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Commit 2cf4c72 ("[ray client] Fix ctrl-c for ray.get() by setting a
short-server side timeout") introduced a short server-side timeout not
to block later operations.

However, the fix implicitly assumes that get() is complete within
MAX_BLOCKING_OPERATION_TIME_S (two seconds). This becomes a problem
when apps use heavy objects or limited network I/O bandwidth that
require more than two seconds to push all chunks. The current retry
logic needs to re-push from the beginning of chunks and block clients
with the infinite re-push.

I updated the logic to directly pass timeout if it is explicitly given.
Without timeout, it still uses MAX_BLOCKING_OPERATION_TIME_S for
polling with the short server-side timeout.
@stale
Copy link

stale bot commented Mar 24, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 24, 2022
@ckw017 ckw017 self-requested a review March 25, 2022 00:41
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 25, 2022
Copy link
Member

@ckw017 ckw017 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, can you rebase/merge with master? That should rerun CI and hopefully fix the failing tests.

@takeshi-yoshimura
Copy link
Contributor Author

@ckw017
Thanks. I merged with master.

@ckw017 ckw017 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 30, 2022
Copy link
Member

@ckw017 ckw017 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, re-evaluating this, I think this overloads the meaning of ray.get's timeout a bit too much. For example, users will have to explicitly add timeout to work in client, even though it would be unnecessary when attaching a driver directly.

@takeshi-yoshimura would being able to set an environment variable, i.e. something like RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S=20 to increase the timeout work for your usecase instead?

Alternatively we can try bumping MAX_BLOCKING_OPERATION_TIME to a higher value

cc @ericl (author of #14425)

@takeshi-yoshimura
Copy link
Contributor Author

@ckw017
Updated the code to introduce a new environmental variable RAY_CLIENT_MAX_BLOCKING_OPERATION_TIME_S according to your request.

@ckw017
Copy link
Member

ckw017 commented Apr 5, 2022

Looks good, looks like the only thing failing is the lint. Can you update to match what the linter is suggesting?

Screen Shot 2022-04-05 at 9 59 15 AM

@takeshi-yoshimura
Copy link
Contributor Author

I fixed the format. thanks.

@ckw017
Copy link
Member

ckw017 commented Apr 7, 2022

Thanks! Looking into the failing tests

@ckw017 ckw017 removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 7, 2022
@ckw017
Copy link
Member

ckw017 commented Apr 21, 2022

Looks like we had a few problems with our testing infra when you first ran the PR, if you merge with master and push again it should be good to go!

@takeshi-yoshimura
Copy link
Contributor Author

Merged again. let's see...

@ckw017
Copy link
Member

ckw017 commented Apr 22, 2022

Okay, looks like failures are:

  • LinkCheck (safe to ignore, failing on master as well)
  • rllib:learning_tests_pendulum_ddppo (safe to ignore, failing on master)
  • rllib:learning_tests_cartpole_ddppo (safe to ignore, failing on master at time of merge)
  • mac tests (Looks like something in our infra dropped a bunch of mac agents around that time, IMO safe to ignore)

@ckw017 ckw017 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 22, 2022
@ckw017 ckw017 requested a review from ijrsvt April 22, 2022 19:09
@ckw017
Copy link
Member

ckw017 commented Apr 22, 2022

cc @ijrsvt can you take a look/merge this when you get the chance

Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ijrsvt
Copy link
Contributor

ijrsvt commented Apr 25, 2022

RLLIB Tests & OSX test_shuffle are failing on master at the point of rebase.

@ijrsvt ijrsvt merged commit e115545 into ray-project:master Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Hard-coded timeout for get()
3 participants