Ray Cluster: Failed to create a ray cluster using running container #45148

Open
hahmad2008 opened this issue May 4, 2024 · 1 comment
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-clusters: For launching and managing Ray clusters/jobs/kubernetes
P2: Important issue, but not time-critical

What happened + What you expected to happen

I am using ray==2.9.2 inside a running container, so I need to create a cluster using the following command:

docker exec -it MY_CONTAINER ray start --head --object-manager-port=8076 --node-manager-port=8077
I then got a message saying the head node of the cluster was created successfully.
However, when I tried to check the cluster status, the command failed:

docker exec -it MY_CONTAINER ray status

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 3168, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 580, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:11.1.1.111:6379: Failed to connect to remote host: Connection refused

What is the problem here?
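One way to narrow this down is to check whether anything is actually listening on the address from the traceback. The following is a minimal sketch, not part of Ray: the helper name `can_connect` is hypothetical, and the host and port are simply copied from the error message above. It only tests whether a plain TCP connection succeeds; it does not verify that the listener is a healthy Ray head node.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return False


if __name__ == "__main__":
    # Address copied from the traceback above (assumed to be the head node).
    host, port = "11.1.1.111", 6379
    state = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Running this inside the same container (for example through `docker exec`, with the script saved under any file name) right after `ray start` and again just before `ray status` would show whether the head node process stopped listening between the two commands or was never reachable at that address.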

Versions / Dependencies

ray==2.9.2

Issue Severity

High: It blocks me from completing my task.

hahmad2008 added the bug and triage labels on May 4, 2024
anyscalesam added the core label on May 6, 2024

jjyao (Contributor) commented May 6, 2024

KubeRay (https://github.com/ray-project/kuberay) is the recommended way to run a Ray cluster inside containers and k8s. Can you try that?

jjyao added the P2 and core-clusters labels and removed the triage label on May 6, 2024