
[Client][Proxy] Prevent Logstream from Timing Out when Delays in DataClient #16180

Merged

Conversation


@ijrsvt commented Jun 1, 2021

Why are these changes needed?

This ensures that the LogstreamServicer does not time out fetching a channel when the DataStream initial connection takes a long time.

Related issue number

Closes #16178

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
  • Manually tested that connections still succeed with a time.sleep(10) in rewrite_runtime_env_uris:


from ray.job_config import JobConfig

def delay_in_rewrite(input: JobConfig):
    import time
    time.sleep(6)
@ericl commented:
Did you test that this fails before the fix?

@ijrsvt (author) replied:

@ericl Added a check to ensure that the logs client is working! I had previously tested this interactively and forgot that the log client thread failing mainly just prints a scary warning.

@ijrsvt requested a review from mwtian on June 2, 2021 19:43
@mwtian (Member) left a comment:


Sorry, I don't think I grok all the details of this change. Besides moving the logic in the new create_specific_server() to run earlier, and skipping not-ready servers in the background _check_processes() thread, what other changes help prevent Logstream from timing out?

python/ray/util/client/server/proxier.py (review thread resolved)
@mwtian (Member) left a comment:


LGTM!


ijrsvt commented Jun 2, 2021

> Sorry, I don't think I grok all the details of this change. Besides moving the logic in the new create_specific_server() to run earlier, and skipping not-ready servers in the background _check_processes() thread, what other changes help prevent Logstream from timing out?

No worries, @mwtian!

Moving the creation logic to create_specific_server() means that when the Logstream makes a call to get_channel, the call will wait on wait_ready() instead of failing to find the SpecificServer!

I prefer to keep the creation logic associated with DataClient/Datapath, because that is how the server normally manages the lifecycle of the connection (i.e. the DataServicer manages num_clients: https://github.com/ray-project/ray/blob/master/python/ray/util/client/server/dataservicer.py)
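The waiting behavior described here can be sketched with a toy proxier. This is an illustrative sketch only: SpecificServer, create_specific_server, and get_channel below are simplified stand-ins for the real Ray proxier classes, and the Event-based readiness and channel string are invented for the example.

```python
import threading
from typing import Dict, Optional


class SpecificServer:
    """Stand-in for a per-client proxy server; becomes ready once started."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def set_ready(self) -> None:
        self._ready.set()

    def wait_ready(self, timeout: Optional[float] = None) -> bool:
        # Blocks until set_ready() is called, or the timeout elapses.
        return self._ready.wait(timeout)


class ProxyManager:
    def __init__(self) -> None:
        self._servers: Dict[str, SpecificServer] = {}

    def create_specific_server(self, client_id: str) -> SpecificServer:
        # The entry is registered immediately, BEFORE any slow startup
        # work runs, so concurrent lookups find something to wait on.
        server = SpecificServer()
        self._servers[client_id] = server
        return server

    def get_channel(self, client_id: str, timeout: float = 5.0) -> Optional[str]:
        server = self._servers.get(client_id)
        if server is None:
            return None  # old failure mode: entry not created yet
        if not server.wait_ready(timeout):
            return None  # startup still in progress past the deadline
        return f"channel-for-{client_id}"


manager = ProxyManager()
server = manager.create_specific_server("client-1")


def slow_startup() -> None:
    import time
    time.sleep(0.2)  # stand-in for a slow InitRequest / working-dir hash
    server.set_ready()


threading.Thread(target=slow_startup).start()
# Blocks briefly on wait_ready(), then succeeds instead of failing fast.
print(manager.get_channel("client-1"))
```

Registering the entry before the slow startup work is what turns the Logstream's lookup from a hard failure into a bounded wait.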


mwtian commented Jun 2, 2021

> Sorry, I don't think I grok all the details of this change. Besides moving the logic in the new create_specific_server() to run earlier, and skipping not-ready servers in the background _check_processes() thread, what other changes help prevent Logstream from timing out?
>
> No worries, @mwtian!
>
> Moving the creation logic to create_specific_server() means that when the Logstream makes a call to get_channel, the call will wait on wait_ready() instead of failing to find the SpecificServer!
>
> I prefer to keep the creation logic associated with DataClient/Datapath, because that is how the server normally manages the lifecycle of the connection (i.e. the DataServicer manages num_clients: https://github.com/ray-project/ray/blob/master/python/ray/util/client/server/dataservicer.py)

I see. It seems that in the existing implementation, start_specific_server() creates the SpecificServer for client_id and is also called before get_channel(). How does the server currently fail to be found?


ijrsvt commented Jun 2, 2021

@mwtian The current failure is that the SpecificServer is created after the InitRequest is received. This can be delayed if hashing the working directory takes a long time here:

runtime_env.rewrite_runtime_env_uris(job_config)
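The race described here can be sketched in miniature. This is a toy illustration, not Ray code: the dict, the delays, the polling loop, and the function names below are all invented stand-ins for the pre-fix ordering, where the server entry is only registered after the slow rewrite finishes.

```python
import threading
import time

servers = {}  # client_id -> channel info (toy stand-in for the proxier's map)


def old_datapath(client_id: str) -> None:
    # Stand-in for rewrite_runtime_env_uris hashing a large working_dir:
    # the SpecificServer entry is only created AFTER this delay.
    time.sleep(0.3)
    servers[client_id] = "ready"


def logstream_get_channel(client_id: str, timeout: float = 0.1):
    # Toy polling lookup: with no entry registered yet, there is
    # nothing to wait on, so the lookup times out and returns None.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if client_id in servers:
            return servers[client_id]
        time.sleep(0.01)
    return None


t = threading.Thread(target=old_datapath, args=("client-1",))
t.start()
result = logstream_get_channel("client-1")  # None: entry not registered yet
t.join()
```

Moving the registration ahead of the slow work (as the PR does with create_specific_server()) removes this window, because the lookup then finds an entry and can block on wait_ready() instead.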

@ijrsvt force-pushed the avoid-synchronization-issues-on-server branch from 36981fc to 431f16e on June 3, 2021 06:35
@ericl merged commit 22bd7ce into ray-project:master on Jun 3, 2021
Successfully merging this pull request may close these issues.

[Core] [Runtime env] [Client] Client hangs sometimes when specifying runtime_env
6 participants