[runtime_env] Client connection fails with TimeoutError: CreateRuntimeEnv request failed after 5 attempts. when using working dir with different file system on head node #19792

Closed
1 of 2 tasks
ckw017 opened this issue Oct 27, 2021 · 13 comments · Fixed by #19827
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks

Comments

@ckw017
Member

ckw017 commented Oct 27, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

The following script fails when connecting to a cluster with a different file system than the client:

# test.py
import ray

# Replace this with a valid dir on your machine
d = "REPLACEME"
j = ray.job_config.JobConfig(runtime_env={"working_dir": d})
ray.util.connect("localhost:10001", job_config=j)
print("Done!")

On my machine, this hangs for ~30 seconds and then fails with:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    ray.util.connect("localhost:10001", job_config=j)
  File "/Users/cwong/anaconda3/envs/ray38/lib/python3.8/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/Users/cwong/anaconda3/envs/ray38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 221, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/Users/cwong/anaconda3/envs/ray38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 81, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/Users/cwong/anaconda3/envs/ray38/lib/python3.8/site-packages/ray/util/client/worker.py", line 705, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 622, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 279, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 258, in _create_runtime_env
    raise TimeoutError(
TimeoutError: CreateRuntimeEnv request failed after 5 attempts.

Versions / Dependencies

Python 3.8
Ray commit 99a0088
MacOS client, Ubuntu head node

Reproduction script

The repro runs the head node in Docker and connects the client from the host machine.

Docker setup

docker pull ubuntu
docker run -p 10001:10001 -it ubuntu

Inside the container

apt-get -y update
apt-get -y install python3
apt-get -y install python3-pip
apt-get -y install wget
wget https://ray-wheels.s3.us-west-2.amazonaws.com/master/99a00882337e85589a8fbc193b8ec77846a4dd6a/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
pip install ./ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl[default]
ray start --head 

In host shell

pip install https://ray-wheels.s3.us-west-2.amazonaws.com/master/99a00882337e85589a8fbc193b8ec77846a4dd6a/ray-2.0.0.dev0-cp38-cp38-macosx_10_15_x86_64.whl
python test.py 

Anything else

Sanity check

Inside Docker:

pip install 'ray[default]==1.7.1'
ray stop
ray start --head

On the local machine:

pip install 'ray[default]==1.7.1'
python test.py

This works (prints "Done!").

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@ckw017 ckw017 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 27, 2021
@architkulkarni architkulkarni added P0 Issue that must be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 27, 2021
@architkulkarni
Contributor

cc @edoakes, I was able to repro this using the steps above and I'm looking into it now. I think the breaking commit landed in the last 7 days, because the working_dir+client OSS nightly test passed when I ran it manually at commit 6d23fb1.

@architkulkarni
Contributor

architkulkarni commented Oct 27, 2021

Found this in the logs on the Docker container. I wonder if the runtime env agent failed to start. Let me try installing the dependencies to see if that fixes it. (If this is the case, we also need to understand whether #19491 is working correctly.)

root@ce509e942283:/tmp/ray/session_latest/logs# cat dashboard.log
2021-10-27 19:35:20,644	INFO head.py:122 -- Dashboard head grpc address: 172.17.0.3:43225
2021-10-27 19:35:20,645	INFO dashboard.py:90 -- Setup static dir for dashboard: /usr/local/lib/python3.8/dist-packages/ray/dashboard/client/build
2021-10-27 19:35:20,648	INFO head.py:50 -- Connect to GCS at b'172.17.0.3:34535'
2021-10-27 19:35:20,650	INFO utils.py:200 -- Get all modules by type: DashboardHeadModule
2021-10-27 19:35:20,669	ERROR dashboard.py:239 -- The dashboard on node ce509e942283 failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py", line 224, in <module>
    loop.run_until_complete(dashboard.run())
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/dashboard.py", line 116, in run
    await self.dashboard_head.run()
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/head.py", line 211, in run
    modules = self._load_modules()
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/head.py", line 159, in _load_modules
    head_cls_list = dashboard_utils.get_all_modules(
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/utils.py", line 206, in get_all_modules
    importlib.import_module(name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/job/data_types.py", line 1, in <module>
    from pydantic import BaseModel
ModuleNotFoundError: No module named 'pydantic'
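
For anyone who wants to script this check, here is a minimal sketch that pulls missing module names out of a log like the one above. It assumes the log contains standard Python `ModuleNotFoundError` lines; `find_missing_modules` is a hypothetical helper and not part of Ray.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches the final line of a Python import failure, e.g.
# "ModuleNotFoundError: No module named 'pydantic'"
_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def find_missing_modules(log_text: str) -> List[str]:
    """Return the distinct module names reported missing, in first-seen order."""
    seen: Dict[str, None] = {}
    for match in _MISSING.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    # Path taken from the shell prompt above; adjust for your session.
    log_path = Path("/tmp/ray/session_latest/logs/dashboard.log")
    if log_path.exists():
        print(find_missing_modules(log_path.read_text()))
```

Run on the head node, this would be expected to list `pydantic` for the log shown in this comment.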

@architkulkarni
Contributor

architkulkarni commented Oct 27, 2021

Yup, this is just an issue of missing dependencies when only ray[default] is installed. After installing pydantic and ray[serve], the repro passes.

The two recent PRs cause the dashboard not to start when only ray[default] is installed, and this wasn't caught in CI because we don't have pip install ray[default]-only tests in CI (similar to how we have "Minimal install" tests in CI which only use pip install ray):

I think I'll downgrade to P1 since we have a simple workaround, which is to install pydantic and ray[serve] on the cluster.
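
To confirm the workaround took effect before restarting the head node, a quick check along these lines may help. This is a sketch using only the standard library; `missing_packages` is a hypothetical helper, and the list of names to check is an assumption based on the traceback above (only `pydantic` is named there).

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "pydantic" is the module named in the dashboard traceback above.
    missing = missing_packages(["pydantic"])
    if missing:
        print("Missing on this node:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

The helper only handles top-level names, because `find_spec` on a dotted name imports the parent package first and raises if that parent is absent.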

@architkulkarni architkulkarni added P1 Issue that should be fixed within a few weeks and removed P0 Issue that must be fixed in short order labels Oct 27, 2021
@edoakes
Contributor

edoakes commented Oct 27, 2021

cc @jiaodong

Let's make the pydantic import conditional for now (only if you use job submission API) and not import from serve in the job submission server.
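
One common way to make an import conditional is to defer it until the feature is actually used and raise an actionable error if the package is absent. The following is a minimal sketch of that pattern, not the change that was merged for this issue; `require_module`, `make_job_model`, and `JobRequest` are hypothetical names.

```python
import importlib
from types import ModuleType


def require_module(name: str, feature: str) -> ModuleType:
    """Import `name` lazily, with an actionable error if it is not installed."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{feature} requires the optional dependency '{name}'. "
            f"Install it with: pip install {name}"
        ) from exc


def make_job_model():
    """Hypothetical entry point: pydantic is imported only when this is called."""
    pydantic = require_module("pydantic", "The job submission API")

    class JobRequest(pydantic.BaseModel):
        entrypoint: str

    return JobRequest
```

With this shape, importing the module that defines `make_job_model` succeeds without pydantic installed, so an unrelated component would not fail at startup.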

@edoakes
Contributor

edoakes commented Oct 27, 2021

@richardliaw do we have any plans to add test cases for ray[default] similar to the minimal test?

@ckw017
Member Author

ckw017 commented Oct 27, 2021

@architkulkarni Sorry to piggyback on this issue but the repro is 95% the same:

After doing the original setup in docker, and then:

ray stop
pip install pydantic
pip install ./ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl[default,serve]
ray start --head

python test.py works the first time

After a second run of python test.py:

l^CTraceback (most recent call last):
  File "/Users/cwong/anaconda3/envs/timeouttest/lib/python3.8/site-packages/ray/util/client/__init__.py", line 81, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/Users/cwong/anaconda3/envs/timeouttest/lib/python3.8/site-packages/ray/util/client/worker.py", line 705, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 627, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23001.err for detailed logs.

ray_client_server_23001.err:

bash: line 0: cd: /tmp/ray/session_2021-10-27_21-36-58_923820_4931/runtime_resources/_ray_pkg_d9: No such file or directory
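
To check on the head node whether the directory named in the .err file really is gone, a small diagnostic like the one below could be used. It is a sketch that assumes the bash error format shown above; `missing_cd_target` and `still_missing` are hypothetical helpers, not part of Ray.

```python
import os
import re
from typing import Optional

# Matches e.g. "bash: line 0: cd: /some/path: No such file or directory"
_CD_FAILURE = re.compile(r"cd: (?P<path>.+): No such file or directory")


def missing_cd_target(err_text: str) -> Optional[str]:
    """Return the directory bash failed to cd into, or None if no such line exists."""
    match = _CD_FAILURE.search(err_text)
    return match.group("path") if match else None


def still_missing(err_text: str) -> bool:
    """True if the log names a cd target that does not exist on this machine."""
    path = missing_cd_target(err_text)
    return path is not None and not os.path.isdir(path)
```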

@architkulkarni
Contributor

Thanks a lot for the investigation and detailed repros! This issue sounds like the right place for it, I'll continue to investigate

@architkulkarni
Contributor

The issue of bash: line 0: cd: /tmp/ray/session_2021-10-27_21-36-58_923820_4931/runtime_resources/_ray_pkg_d9: No such file or directory happening on the second and subsequent runs is fixed on master by https://github.com/ray-project/ray/pull/19651/files#diff-8b80beaae6d13cb37b31712b50db575327c8d0449f644883687a7c511d64af27R109.

@wjrforcyber
Contributor

I am still seeing this issue on macOS. My environment:

  • macOS Monterey, version 12.5
  • Chip: Apple M2
  • Memory: 24 GB
  • Ray 1.13.0

My error message:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/homebrew/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/uvicorn/subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/uvicorn/server.py", line 60, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/opt/homebrew/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/uvicorn/server.py", line 67, in serve
    config.load()
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/uvicorn/config.py", line 458, in load
    self.loaded_app = import_from_string(self.app)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/opt/homebrew/Cellar/python@3.9/3.9.15/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/./eda_design_flow/main.py", line 17, in <module>
    init_ray()
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/./eda_design_flow/utils.py", line 6, in init_ray
    return ray.init(address='ray://localhost:10001',dashboard_host='0.0.0.0',dashboard_port=8265)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/worker.py", line 887, in init
    return builder.connect()
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/client_builder.py", line 160, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/util/client_connect.py", line 36, in connect
    conn = ray.connect(
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/util/client/__init__.py", line 243, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/util/client/__init__.py", line 94, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/Users/jingrenwang/Github/EDA/Project/eda-flow-python/venv/lib/python3.9/site-packages/ray/util/client/worker.py", line 811, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/opt/eda-venv/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 664, in Datapath
    raise RuntimeError(
RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

@architkulkarni
Contributor

Hi @wjrforcyber, do you have any more details about your workload? Are there any relevant logs in /tmp/ray/session_latest/logs/ray_client_server_23000.err?

@wjrforcyber
Contributor

wjrforcyber commented Oct 21, 2022

I found another issue with the exact same log here, but there was no further discussion and a bot closed it. I think it's still an M2 chip issue.

@architkulkarni
Contributor

@wjrforcyber got it, let's keep this issue for the runtime_env dependency error, and move this discussion to the other issue.

@aleeve

aleeve commented Jan 11, 2023

Hi! Thanks for the great software!

Sorry if this is the wrong issue for this. I'm hitting
RuntimeError: Starting Ray client server failed. See ray_client_server_23004.err for detailed logs.
in different ways when passing py_modules in the runtime_env. One time I accidentally tried to load a numpy version different from the one on the node, which caused an OpenBLAS mixup. Another time, a subpackage of jedi called subprocess got imported as a top-level package, which created a circular dependency.

I wanted to pass the full env to a "blank" worker, which was perhaps not super. Do you think you will add more extensive import guards, or is the Docker image runtime my best bet?
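
In the meantime, a pre-flight check on the client side might catch the second case before anything is uploaded. This is a sketch rather than a Ray feature: `shadowing_entries` is a hypothetical helper that scans a directory you plan to pass via py_modules for top-level names that collide with standard-library modules. It relies on `sys.stdlib_module_names`, which exists on Python 3.10 and later; the fallback set is a deliberately incomplete assumption so the sketch still runs on older versions.

```python
import sys
from pathlib import Path
from typing import List

# sys.stdlib_module_names exists on Python 3.10+; the fallback below is a
# tiny, incomplete set used only so the sketch runs on older versions.
_STDLIB = getattr(
    sys, "stdlib_module_names", frozenset({"subprocess", "json", "os", "sys"})
)


def shadowing_entries(module_dir: str) -> List[str]:
    """List top-level modules/packages in module_dir whose names match stdlib modules."""
    hits = []
    for entry in sorted(Path(module_dir).iterdir()):
        name = entry.stem if entry.suffix == ".py" else entry.name
        is_module = entry.suffix == ".py" or (entry / "__init__.py").is_file()
        if is_module and name in _STDLIB:
            hits.append(name)
    return hits
```

If the directory holds a package named subprocess at its top level, the function returns a list containing "subprocess", which is the kind of collision described above.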

5 participants