[Bug] Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup. #19834
Comments
I just got the same cryptic exception. Traceback:
Traceback (most recent call last):
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 238, in __init__
ray._private.services.wait_for_node(
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/services.py", line 324, in wait_for_node
raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/.../ir-erank-2021/ir/__main__.py", line 90, in <module>
main()
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 105, in wrapper_sender
raise ex
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 63, in wrapper_sender
value = func(*args, **kwargs)
File "/home/.../ir-erank-2021/ir/__main__.py", line 54, in main
ray.init(local_mode=hyperparams.parser_args.ray_local_mode)
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/worker.py", line 908, in init
_global_node = ray.node.Node(
File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 242, in __init__
raise Exception(
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
OS
$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
This happens to me whenever I try to update an existing node with:
The head node updates fine, but any worker nodes shut down and restart completely, which takes a lot of time.
This happens to me when trying to do
Happening to me as well, with a more-or-less vanilla cluster setup on AWS EC2 (but on a private subnet), using cached instances: https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-cache-stopped-nodes
Has anyone figured out what log info to look at to get more details? I'd consider inspecting worker nodes, but that's hard to do when
I used a workaround to inspect pre-shutdown logs after restarting the node: #22707 (comment)
Looking through them, I see two types of logs: one showing things are OK, then one showing things are NOT OK:
The main thing from the failing node is gRPC failing:
However, I can't run
EDIT: I can reproduce by re-running
K, so if I ensure I only call
However, if I call
But this isn't great if I want to manually start a worker node, and then have
Are these the right expectations? And more important to this issue: is this what any of y'all are experiencing as well?
I'm getting the same error on HPC.
Getting the same error on HPC, can't start ray by:
Getting the same error on Debian when trying to run without a GPU. Works fine in an identical environment (pytorch with cpuonly) on a machine that does have a GPU.
same error on docker-compose, latest rayproject/ray docker image, command:
@lundybernard is this also on Windows?
no, this is on Macintosh M1 OSX, the container is running on
@rkooo567 Got this error on windows too.
Could someone provide a clear reproducer and description of the hardware/software stack?
TimeoutError Traceback (most recent call last)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\services.py:438, in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
TimeoutError: Timed out while waiting for node to startup.
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm_config.py:471, in AlgorithmConfig.build(self, env, logger_creator, use_copy)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:424, in Algorithm.__init__(self, config, env, logger_creator, **kwargs)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py:167, in Trainable.__init__(self, config, logger_creator, remote_checkpoint_dir, custom_syncer, sync_timeout)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:542, in Algorithm.setup(self, config)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:151, in WorkerSet.__init__(self, env_creator, validate_env, default_policy_class, config, num_workers, local_worker, logdir, _setup, policy_class, trainer_config)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:474, in WorkerSet.add_workers(self, num_workers, validate)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:475, in <listcomp>(.0)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:785, in WorkerSet._make_worker(self, cls, env_creator, validate_env, worker_index, num_workers, recreated_worker, config, spaces)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:529, in ActorClass.remote(self, *args, **kwargs)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:764, in ActorClass._remote(self, args, kwargs, **actor_options)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:124, in client_mode_should_convert(auto_init)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\worker.py:1428, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\node.py:319, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
My env:
How can I solve the problem on Windows? Thanks.
This can happen when your previous run hit an error that left the Ray node started but never stopped. I ran into the same problem and used the code below to fix it.
Then it works.
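The snippet this comment refers to did not survive in the thread. As a minimal sketch of one way to clear Ray processes left over from an earlier crashed run, assuming the `ray` CLI is on PATH and that `ray stop --force` is acceptable on your machine, something like the following could be tried. `stop_stale_ray` and its injectable `which` and `run` parameters are hypothetical names for illustration, not part of Ray.

```python
import shutil
import subprocess


def stop_stale_ray(which=shutil.which, run=subprocess.run):
    """Run `ray stop --force` if a ray CLI can be found.

    Returns True if the command was invoked, False if no ray CLI was found.
    `which` and `run` are injectable only so the helper can be tested
    without Ray installed (an assumption of this sketch).
    """
    exe = which("ray")
    if exe is None:
        return False
    # --force kills leftover Ray processes instead of waiting for a clean exit.
    run([exe, "stop", "--force"], check=False)
    return True
```

Calling `stop_stale_ray()` once before `ray.init()` would be the intended use; whether it fixes a given case of this error is not confirmed by the thread.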
@1121091694 could you give more information about your environment (where did you get Python, do you have an NVIDIA GPU as well as a CPU, which exact version of the nightly are you using)? It seems you are using a nightly (3.0.0.dev0); does the latest official release also fail?
I think we should close this. We have not gotten a complete report from a user that hits this:
Instead, we keep getting partial reports
Did you manage to resolve it? Is there any way to run Ray on Compute Canada?
I'm still experiencing this issue. It is probably due to a large number of jobs that are running or have run on a Slurm cluster, but I have no way to debug it further. There isn't any information in the logs, as far as I can tell. Has anyone been able to work around this somehow? To the developers: sorry, providing a reproducible example for this is pretty difficult, but I am on Ray 2.2 with Rocky Linux 8.5. Update: in my case, sometimes just calling .init() again solves the problem.
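The "just call .init() again" workaround can be sketched as a small retry helper. This is only an illustration: `init_with_retry`, its parameters, and the default attempt count and delay are assumptions, not anything Ray provides. It catches plain `Exception` because the tracebacks in this thread show the startup failure being raised as a bare `Exception`.

```python
import time


def init_with_retry(init_fn, attempts=3, delay_s=5.0, cleanup=None, sleep=time.sleep):
    """Call init_fn until it succeeds or the attempts run out.

    cleanup, if given, is called after every failed attempt. The last error
    is re-raised when all attempts fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except Exception as exc:  # the startup failure here is a bare Exception
            last_error = exc
            if cleanup is not None:
                cleanup()
            if attempt < attempts:
                sleep(delay_s)
    raise last_error
```

With Ray installed, a call might look like `init_with_retry(ray.init, cleanup=ray.shutdown)`. Retrying only papers over the failure; it does not explain why the node failed to start.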
(Running on Windows 10) I think I found a solution, but I don't know if I should laugh or cry right now. So basically I had the same error messages after using ray.init() or ray start --head:
I had no clue what this meant, and Google and ChatGPT had no answers that worked for me. I decided to find it myself and
"[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:581] String field 'ray.rpc.GcsNodeInfo.node_manager_hostname' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes."
I skipped it at first because I didn't understand it and thought maybe it was setting the hostname to None or something because it didn't find a node. But after several hours of trying other things and getting desperate, I luckily came back to this error and thought to myself:
And then it hit me. I instantly hit the Windows key, opened my system settings and went to the info tab. And there it was, the root of my problems: "Device name: der_gerät". The stupid name I gave my PC stopped me from using Ray and made me debug for at least 4 hours over multiple days. I don't know why, but I guess the letter ä is not in UTF-8 haha. After changing the name of my PC to something without weird letters of the German language, ray.init() started to work, finally. I hope it helps someone else, because I'm sure I will tell my colleagues (or in my case my fellow CS student friends) about this stupid bug. Cheers! 😄
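A note on the hostname finding above: ä can be encoded in UTF-8. A likelier explanation, offered here as an assumption rather than something the thread confirms, is that the hostname reached the serializer as bytes in a legacy Windows code page such as cp1252, where ä is the single byte 0xE4, which is not valid UTF-8 on its own. The sketch below shows that difference and a quick ASCII check for the local hostname; all function names are hypothetical.

```python
import socket


def is_ascii_hostname(name):
    """True if every character of the hostname is plain ASCII."""
    return name.isascii()


def decodes_as_utf8(raw):
    """True if the byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def local_hostname_is_ascii():
    """Check the machine's own hostname (not called automatically)."""
    return is_ascii_hostname(socket.gethostname())
```

For example, `decodes_as_utf8("der_gerät".encode("cp1252"))` is false while the UTF-8 encoding of the same name decodes fine, which matches the libprotobuf complaint about invalid UTF-8 data. Running `local_hostname_is_ascii()` before `ray.init()` is a cheap way to rule this cause in or out.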
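My error was: The above exception was the direct cause of the following exception: Traceback (most recent call last):
After a long time troubleshooting, I found that the disk was full, so the intermediate files Ray needs could not be created. After I cleaned up the disk, the problem was resolved.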
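The full-disk cause can be checked before starting Ray. The sketch below assumes Ray writes its session files under the system temp directory (on Linux the default is `/tmp/ray`); the helper names and any threshold you pass in are illustrative choices, not values from Ray.

```python
import shutil
import tempfile


def free_bytes(path):
    """Free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def has_room(path, min_free_bytes):
    """True if at least min_free_bytes are free on the filesystem of `path`."""
    return free_bytes(path) >= min_free_bytes


def temp_dir_has_room(min_free_bytes):
    """Convenience check against the system temp directory."""
    return has_room(tempfile.gettempdir(), min_free_bytes)
```

A call such as `temp_dir_has_room(1024**3)` (1 GiB, an arbitrary threshold) before `ray.init()` would flag the situation described in this comment. If Ray is configured with a different temp directory, check that path instead.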
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
2021-10-28 18:01:24,117 INFO services.py:1255 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
File "E:\software\conda\lib\site-packages\ray\node.py", line 265, in __init__
self.redis_password)
File "E:\software\conda\lib\site-packages\ray\_private\services.py", line 276, in wait_for_node
raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev\pydevd.py", line 1448, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/document/JacksonProject/wrap_angle/angle_ray.py", line 7, in <module>
ray.init()
File "E:\software\conda\lib\site-packages\ray\_private\client_mode_hook.py", line 89, in wrapper
return func(*args, **kwargs)
File "E:\software\conda\lib\site-packages\ray\worker.py", line 897, in init
ray_params=ray_params)
File "E:\software\conda\lib\site-packages\ray\node.py", line 268, in __init__
"The current node has not been updated within 30 "
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
[0x7FF97CFAE0A4] ANOMALY: use of REX.w is meaningless (default operand size is 64)
Versions / Dependencies
Name: ray
Version: 1.7.1
Summary: Ray provides a simple, universal API for building distributed applications.
Home-page: https://github.com/ray-project/ray
Name: numpy
Version: 1.19.5
windows10
anaconda
python3.7
Reproduction script
import ray
from ray.tune import register_trainable, run_experiments
import numpy as np
from ray.tune.utils import pin_in_object_store, get_pinned_object
ray.init()
# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))
def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X
register_trainable("f", f)
run_experiments(...)
Anything else
It happens every time.
Are you willing to submit a PR?