
[Bug] ERROR gcs_utils.py:137 – Failed to send request to gcs #22326

Closed
Nithanaroy opened this issue Feb 11, 2022 · 7 comments
@Nithanaroy

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

After running `ray start` on the worker node, I expected the head and worker to be connected, but instead I see the error below, and only on Ray 1.9.
The worker can talk to the head on Ray 1.8.

$ ray start --address='100.96.24.172:6379' --redis-password='5241590000000000'
Local node IP: 100.96.191.45
2022-02-05 16:39:15,662 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1644079155.661725492","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644079155.661724203","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>
2022-02-05 16:39:16,665 ERROR gcs_utils.py:137 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1644079156.664823970","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1644079156.664822821","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

Versions / Dependencies

  • Ray 1.9
  • Python 3.7.10
  • OS Linux

Reproduction script

Install the above versions on 2 nodes.

Start the head node using `ray.init()`.

Start the worker node using `ray start --address=... --redis-password=...`.
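For reference, a rough sketch of the two steps. The head IP and Redis password below are taken from the log above and will differ per cluster; the interactive Python session is just a stand-in for however the head process is launched.

# On the head node (Python shell or notebook cell):
$ python
>>> import ray
>>> ray.init()   # starts a local head node on this machine

# On the worker node, using the head's Redis address and password:
$ ray start --address='100.96.24.172:6379' --redis-password='5241590000000000'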

Anything else

https://discuss.ray.io/t/error-gcs-utils-py-137-failed-to-send-request-to-gcs/4936

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@Nithanaroy Nithanaroy added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 11, 2022
@rkooo567
Contributor

Cc @iycheng @mwtian, this is a regression from 1.8 -> 1.9.

@mwtian
Member

mwtian commented Feb 12, 2022

@Nithanaroy thanks for reporting the issue. Just to make sure, in the case of Ray 1.9, does `curl 100.96.24.172:6379` get back some result (from Redis)? Is Redis running on the head node and listening on 6379?

@Nithanaroy
Author

@mwtian I did not try to curl 100.96.24.172:6379, but to verify whether port 6379 is accessible I stopped Ray and started a simple HTTP server with Python on port 6379, and I was able to curl that. I also attached the nmap result in https://discuss.ray.io/t/error-gcs-utils-py-137-failed-to-send-request-to-gcs/4936/2
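For context, the port check described above was roughly the following (a sketch, assuming the same head IP as in the log; `python -m http.server` is just a stand-in listener on port 6379):

# On the head node, after stopping Ray:
$ ray stop
$ python -m http.server 6379

# On the worker node:
$ curl http://100.96.24.172:6379/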

@rkooo567
Contributor

@mwtian @iycheng Is it okay if I assign one of you guys here? It is a clear regression, and I think we should fix it to make sure it works well with manual deployment.

@mwtian
Member

mwtian commented Feb 15, 2022

@Nithanaroy I sent a message to you on discuss.ray.io to set up a debugging session. What I want to investigate together:

  1. See if the issue exists in Ray 1.10.0.
  2. After starting the Ray 1.9 head node, use redis-cli to read the value of the key GcsServerAddress from the head node address, e.g. 100.96.24.172:6379, and verify that it is set correctly (see the sketch below).
  3. curl the GCS address from the worker node.

Please let us know if having a debugging session would help!
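For step 2, a minimal sketch of the redis-cli check, using the host, port, and password from the log in the issue description:

$ redis-cli -h 100.96.24.172 -p 6379 -a '5241590000000000' GET GcsServerAddress
# Expected: an <ip>:<port> value pointing at the running GCS; (nil) would suggest GCS never registered itself.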

@rkooo567 rkooo567 added this to the Core Backlog milestone Feb 16, 2022
@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 17, 2022
@mwtian
Member

mwtian commented Feb 18, 2022

Debugged with @Nithanaroy today. The issue appears under the following condition: starting the Ray head node in a notebook cell with `! ray start --head`. It was observed that only Redis keeps running; GCS and the other processes exited abruptly without an error, soon after `ray start`.

The issue is reproducible in my local Jupyter notebook as well. If Ray is not started from a notebook cell (e.g. it is started from the notebook terminal), or it is started with `ray.init()`, the issue does not appear.

The best workaround right now is to start Ray with `ray.init()` if it needs to be done inside a notebook cell.
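A sketch of the workaround in a notebook cell (hypothetical cell contents; the key point is calling `ray.init()` in-process instead of shelling out with `!`):

# Notebook cell on the head node -- instead of:  ! ray start --head
import ray
ray.init()  # starts the head inside the notebook's Python process

Workers can then join with `ray start --address=... --redis-password=...` as before.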

@mwtian
Member

mwtian commented Feb 19, 2022

Closing this issue since there is a workaround. But feel free to reopen if you think this use case needs to be supported!
