
[Bug] Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup. #19834

Closed
2 tasks done
JackonLiu opened this issue Oct 28, 2021 · 22 comments
Labels: bug (Something that is supposed to be working; but isn't), needs-repro-script (Issue needs a runnable script to be reproduced), P2 (Important issue, but not time-critical), QS (Quantsight triage label)

Comments

@JackonLiu

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Tune

What happened + What you expected to happen

2021-10-28 18:01:24,117 INFO services.py:1255 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
File "E:\software\conda\lib\site-packages\ray\node.py", line 265, in init
self.redis_password)
File "E:\software\conda\lib\site-packages\ray_private\services.py", line 276, in wait_for_node
raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev\pydevd.py", line 1448, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\software\PyCharm 2020.2.5\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/document/JacksonProject/wrap_angle/angle_ray.py", line 7, in
ray.init()
File "E:\software\conda\lib\site-packages\ray_private\client_mode_hook.py", line 89, in wrapper
return func(*args, **kwargs)
File "E:\software\conda\lib\site-packages\ray\worker.py", line 897, in init
ray_params=ray_params)
File "E:\software\conda\lib\site-packages\ray\node.py", line 268, in init
"The current node has not been updated within 30 "
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
[0x7FF97CFAE0A4] ANOMALY: use of REX.w is meaningless (default operand size is 64)

Versions / Dependencies

Name: ray
Version: 1.7.1
Summary: Ray provides a simple, universal API for building distributed applications.
Home-page: https://github.com/ray-project/ray
Name: numpy
Version: 1.19.5

Windows 10
Anaconda
Python 3.7

Reproduction script

import ray
from ray.tune import register_trainable, run_experiments

import numpy as np
from ray.tune.utils import pin_in_object_store, get_pinned_object

ray.init()

# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))

def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X

register_trainable("f", f)
run_experiments(...)
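For reference, pin_in_object_store / get_pinned_object come from an older Tune utility API; on recent Ray versions the same pattern can be written with plain ray.put / ray.get. A rough sketch of that equivalent (the X_ref name and the reporter(done=True) call are illustrative, not part of the original report):

import numpy as np
import ray
from ray.tune import register_trainable, run_experiments

ray.init()

# Put the large array into the object store once; trials fetch it by reference.
X_ref = ray.put(np.random.random(size=100000000))

def f(config, reporter):
    X = ray.get(X_ref)  # reads of numpy arrays from the local object store are zero-copy
    # use X
    reporter(done=True)

register_trainable("f", f)
# run_experiments(...)  # experiment spec elided, as in the original snippet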

Anything else

It happens every time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@JackonLiu added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Oct 28, 2021
@keyboardAnt

I just got the same cryptic exception when trying to init Ray on CentOS.

Traceback

Traceback (most recent call last):
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 238, in __init__
    ray._private.services.wait_for_node(
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/services.py", line 324, in wait_for_node
    raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/.../ir-erank-2021/ir/__main__.py", line 90, in <module>
    main()
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 105, in wrapper_sender
    raise ex
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/knockknock/slack_sender.py", line 63, in wrapper_sender
    value = func(*args, **kwargs)
  File "/home/.../ir-erank-2021/ir/__main__.py", line 54, in main
    ray.init(local_mode=hyperparams.parser_args.ray_local_mode)
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/worker.py", line 908, in init
    _global_node = ray.node.Node(
  File "/home/.../anaconda3/envs/ir-venv/lib/python3.8/site-packages/ray/node.py", line 242, in __init__
    raise Exception(
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

OS

$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

@outdoteth

This happens to me whenever I try to update an existing node with: ray up -y my_config.yaml.

The head node updates fine, but the worker nodes shut down and restart completely, which takes a lot of time.

@LuisFelipeLeivaH

LuisFelipeLeivaH commented Jan 22, 2022

This happens to me when trying to do ray.init() on an HPC cluster on Compute Canada.

@EricCousineau-TRI
Contributor

Happening to me as well, with a more-or-less vanilla cluster setup on AWS EC2 (but on a private subnet).

Using cached instance: https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-cache-stopped-nodes

Has anyone figured out which logs to look at to get more details? I'd consider inspecting the worker nodes, but that's hard to do when ray auto-stops the instance.
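For anyone else hunting for details: on Linux, Ray keeps per-node logs under its session directory (by default /tmp/ray/session_latest/logs; Windows uses a directory under the user temp folder). A minimal sketch that dumps the tail of the most relevant files before the node is torn down - the file names listed and the 50-line tail are just illustrative defaults:

import pathlib

# Default Ray session log directory on Linux; adjust if a custom --temp-dir was used.
log_dir = pathlib.Path("/tmp/ray/session_latest/logs")
for name in ("raylet.err", "raylet.out", "gcs_server.err", "gcs_server.out"):
    path = log_dir / name
    if path.exists():
        print(f"===== {name} =====")
        # Print only the last 50 lines of each log file.
        print("\n".join(path.read_text(errors="replace").splitlines()[-50:]))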

@EricCousineau-TRI
Contributor

EricCousineau-TRI commented Mar 1, 2022

I used a workaround to inspect pre-shutdown logs after restarting the node: #22707 (comment)

Looking through them, I see two types of logs - one that shows things being OK, and one showing things are NOT OK:
https://gist.github.com/EricCousineau-TRI/4822b8be94fccc7483a51040e7f44d47

The main thing from the failing node is gRPC failing:

grpc_server.cc:102:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. [...] Try running lsof -i :8076 to check if there are other processes listening to the port.

However, I can't run lsof -i :8076 because ray has already shut down the node :(

EDIT: I can reproduce by re-running ray up <config_file>, and it seems to happen when restarting the worker's ray.
I don't understand why, though, since we explicitly add ray stop as is the default:
https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands

@EricCousineau-TRI
Contributor

OK, so if I ensure I only call ray up once, then I do not get this issue.
If I call it twice or more, then I get the port conflict on the worker.

However, if I call ray up --no-restart, then it's fine.

But this isn't great if I want to manually start a worker node and then have ray use it.
I also expect ray up to be idempotent - especially since ray stop is explicitly in the worker's start commands.

Are these the right expectations?

And more importantly for this issue - is this what any of y'all are experiencing as well?

@robertreaney

robertreaney commented Mar 31, 2022

This happens to me when trying to do ray.init() on an HPC cluster on Compute Canada.

I'm getting the same error on HPC.

@AdamYoung71

Getting the same error on HPC; can't start ray with:
ray start --head
The error:
raise TimeoutError("Timed out while waiting for node to startup.")

@utkarshp

Getting the same error on Debian when trying to run without a GPU. Works fine in an identical environment (pytorch with cpuonly) on a machine that does have a GPU.

@lundybernard

Same error with docker-compose and the latest rayproject/ray docker image; command: ray start -v --head ...

@rkooo567
Contributor

rkooo567 commented Sep 2, 2022

@lundybernard is this also on Windows?

@lundybernard

@lundybernard is this also on Windows?

No, this is on an M1 Mac (macOS); the container is running on platform: linux/amd64

@eromoe

eromoe commented Oct 19, 2022

@rkooo567 Got this error on Windows too.

@mattip
Contributor

mattip commented Oct 20, 2022

Could someone provide a clear reproducer and description of the hardware/software stack?

@mattip added the needs-repro-script (Issue needs a runnable script to be reproduced) label on Oct 20, 2022
@1121091694

TimeoutError Traceback (most recent call last)
File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\node.py:312, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
311 try:
--> 312 ray._private.services.wait_for_node(
313 self.redis_address,
314 self.gcs_address,
315 self._plasma_store_socket_name,
316 self.redis_password,
317 )
318 except TimeoutError:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\services.py:438, in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
437 time.sleep(0.1)
--> 438 raise TimeoutError("Timed out while waiting for node to startup.")

TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
Input In [2], in <cell line: 7>()
5 config = PPOConfig().training(gamma=0.9, lr=0.01, kl_coeff=0.3).resources(num_gpus=0).rollouts(num_rollout_workers=1)
6 print(config.to_dict())
----> 7 algo = config.build(env="CartPole-v1")

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm_config.py:471, in AlgorithmConfig.build(self, env, logger_creator, use_copy)
468 if logger_creator is not None:
469 self.logger_creator = logger_creator
--> 471 return self.algo_class(
472 config=self if not use_copy else copy.deepcopy(self),
473 logger_creator=self.logger_creator,
474 )

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:424, in Algorithm.__init__(self, config, env, logger_creator, **kwargs)
412 # Initialize common evaluation_metrics to nan, before they become
413 # available. We want to make sure the metrics are always present
414 # (although their values may be nan), so that Tune does not complain
415 # when we use these as stopping criteria.
416 self.evaluation_metrics = {
417 "evaluation": {
418 "episode_reward_max": np.nan,
(...)
421 }
422 }
--> 424 super().__init__(
425 config=config,
426 logger_creator=logger_creator,
427 **kwargs,
428 )
430 # Check, whether training_iteration is still a tune.Trainable property
431 # and has not been overridden by the user in the attempt to implement the
432 # algos logic (this should be done now inside training_step).
433 try:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py:167, in Trainable.__init__(self, config, logger_creator, remote_checkpoint_dir, custom_syncer, sync_timeout)
165 start_time = time.time()
166 self._local_ip = ray.util.get_node_ip_address()
--> 167 self.setup(copy.deepcopy(self.config))
168 setup_time = time.time() - start_time
169 if setup_time > SETUP_TIME_THRESHOLD:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py:542, in Algorithm.setup(self, config)
535 if _init is False:
536 # - Create rollout workers here automatically.
537 # - Run the execution plan to create the local iterator to next()
538 # in each training iteration.
539 # This matches the behavior of using build_trainer(), which
540 # has been deprecated.
541 try:
--> 542 self.workers = WorkerSet(
543 env_creator=self.env_creator,
544 validate_env=self.validate_env,
545 default_policy_class=self.get_default_policy_class(self.config),
546 config=self.config,
547 num_workers=self.config["num_workers"],
548 local_worker=True,
549 logdir=self.logdir,
550 )
551 # WorkerSet creation possibly fails, if some (remote) workers cannot
552 # be initialized properly (due to some errors in the RolloutWorker's
553 # constructor).
554 except RayActorError as e:
555 # In case of an actor (remote worker) init failure, the remote worker
556 # may still exist and will be accessible, however, e.g. calling
557 # its sample.remote() would result in strange "property not found"
558 # errors.

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:151, in WorkerSet.__init__(self, env_creator, validate_env, default_policy_class, config, num_workers, local_worker, logdir, _setup, policy_class, trainer_config)
149 # Create a number of @ray.remote workers.
150 self._remote_workers = []
--> 151 self.add_workers(
152 num_workers,
153 validate=config.validate_workers_after_construction,
154 )
156 # Create a local worker, if needed.
157 # If num_workers > 0 and we don't have an env on the local worker,
158 # get the observation- and action spaces for each policy from
159 # the first remote worker (which does have an env).
160 if (
161 local_worker
162 and self._remote_workers
163 and not config.create_env_on_local_worker
164 and (not config.observation_space or not config.action_space)
165 ):

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:474, in WorkerSet.add_workers(self, num_workers, validate)
457 """Creates and adds a number of remote workers to this worker set.
458
459 Can be called several times on the same WorkerSet to add more
(...)
470 properly.
471 """
472 old_num_workers = len(self._remote_workers)
473 self._remote_workers.extend(
--> 474 [
475 self._make_worker(
476 cls=self._cls,
477 env_creator=self._env_creator,
478 validate_env=None,
479 worker_index=old_num_workers + i + 1,
480 num_workers=old_num_workers + num_workers,
481 config=self._remote_config,
482 )
483 for i in range(num_workers)
484 ]
485 )
487 # Validate here, whether all remote workers have been constructed properly
488 # and are "up and running". If not, the following will throw a RayError
489 # which needs to be handled by this WorkerSet's owner (usually
490 # a RLlib Algorithm instance).
491 if validate:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:475, in <listcomp>(.0)
457 """Creates and adds a number of remote workers to this worker set.
458
459 Can be called several times on the same WorkerSet to add more
(...)
470 properly.
471 """
472 old_num_workers = len(self._remote_workers)
473 self._remote_workers.extend(
474 [
--> 475 self._make_worker(
476 cls=self._cls,
477 env_creator=self._env_creator,
478 validate_env=None,
479 worker_index=old_num_workers + i + 1,
480 num_workers=old_num_workers + num_workers,
481 config=self._remote_config,
482 )
483 for i in range(num_workers)
484 ]
485 )
487 # Validate here, whether all remote workers have been constructed properly
488 # and are "up and running". If not, the following will throw a RayError
489 # which needs to be handled by this WorkerSet's owner (usually
490 # a RLlib Algorithm instance).
491 if validate:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py:785, in WorkerSet._make_worker(self, cls, env_creator, validate_env, worker_index, num_workers, recreated_worker, config, spaces)
782 logger.debug("Creating TF session {}".format(config["tf_session_args"]))
783 return tf1.Session(config=tf1.ConfigProto(**config["tf_session_args"]))
--> 785 worker = cls(
786 env_creator=env_creator,
787 validate_env=validate_env,
788 default_policy_class=self._policy_class,
789 tf_session_creator=(session_creator if config["tf_session_args"] else None),
790 config=config,
791 worker_index=worker_index,
792 num_workers=num_workers,
793 recreated_worker=recreated_worker,
794 log_dir=self._logdir,
795 spaces=spaces,
796 dataset_shards=self._ds_shards,
797 )
799 return worker

File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:529, in ActorClass.remote(self, *args, **kwargs)
517 def remote(self, *args, **kwargs):
518 """Create an actor.
519
520 Args:
(...)
527 A handle to the newly created actor.
528 """
--> 529 return self._remote(args=args, kwargs=kwargs, **self._default_options)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\util\tracing\tracing_helper.py:387, in _tracing_actor_creation.<locals>._invocation_actor_class_remote_span(self, args, kwargs, *_args, **_kwargs)
385 if not _is_tracing_enabled():
386 assert "_ray_trace_ctx" not in kwargs
--> 387 return method(self, args, kwargs, *_args, **_kwargs)
389 class_name = self.ray_metadata.class_name
390 method_name = "init"

File E:\Anaconda\envs\rllib\lib\site-packages\ray\actor.py:764, in ActorClass._remote(self, args, kwargs, **actor_options)
761 if actor_options.get("max_concurrency") is None:
762 actor_options["max_concurrency"] = 1000 if is_asyncio else 1
--> 764 if client_mode_should_convert(auto_init=True):
765 return client_mode_convert_actor(self, args, kwargs, **actor_options)
767 # fill actor required options

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:124, in client_mode_should_convert(auto_init)
118 import ray
120 if (
121 os.environ.get("RAY_ENABLE_AUTO_CONNECT", "") != "0"
122 and not ray.is_initialized()
123 ):
--> 124 ray.init()
126 # is_client_mode_enabled_by_default is used for testing with
127 # RAY_CLIENT_MODE=1. This flag means all tests run with client mode.
128 return (
129 is_client_mode_enabled or is_client_mode_enabled_by_default
130 ) and _get_client_hook_status_on_thread()

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\worker.py:1428, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
1386 ray_params = ray._private.parameter.RayParams(
1387 node_ip_address=node_ip_address,
1388 raylet_ip_address=raylet_ip_address,
(...)
1422 node_name=_node_name,
1423 )
1424 # Start the Ray processes. We set shutdown_at_exit=False because we
1425 # shutdown the node in the ray.shutdown call that happens in the atexit
1426 # handler. We still spawn a reaper process in case the atexit handler
1427 # isn't called.
-> 1428 _global_node = ray._private.node.Node(
1429 head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
1430 )
1431 else:
1432 # In this case, we are connecting to an existing cluster.
1433 if num_cpus is not None or num_gpus is not None:

File E:\Anaconda\envs\rllib\lib\site-packages\ray\_private\node.py:319, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
312 ray._private.services.wait_for_node(
313 self.redis_address,
314 self.gcs_address,
315 self._plasma_store_socket_name,
316 self.redis_password,
317 )
318 except TimeoutError:
--> 319 raise Exception(
320 "The current node has not been updated within 30 "
321 "seconds, this could happen because of some of "
322 "the Ray processes failed to startup."
323 )
324 node_info = ray._private.services.get_node_to_connect_for_driver(
325 self.redis_address,
326 self.gcs_address,
327 self._raylet_ip_address,
328 redis_password=self.redis_password,
329 )
330 if self._ray_params.node_manager_port == 0:

Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

My env:
Windows, ray '3.0.0.dev0'
The code is just: ray.init()

How can I solve this problem on Windows? Thanks.

@jzxycsjzy

jzxycsjzy commented Nov 9, 2022

This can happen when an error in a previous run left a Ray node started but never stopped. I hit the same problem, and I used the code below to fix it.

ray.shutdown()

ray.init()

Then it works.

@mattip
Contributor

mattip commented Nov 13, 2022

@1121091694 could you give more information about your environment (where did you get Python, do you have an NVidia GPU as well as a CPU, which exact version of the nightly are you using)? It seems you are using a nightly (3.0.0.dev0); does the latest official release also fail?

@hora-anyscale added the P2 (Important issue, but not time-critical) label on Dec 14, 2022
@hora-anyscale removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Dec 14, 2022
@mattip
Contributor

mattip commented Jan 18, 2023

I think we should close this. We have not gotten a complete report from a user that hits this:

  • a description of the hardware, OS and environment
  • what version of the software stack is being used, including the exact version of ray
  • a reproducer

Instead, we keep getting partial reports.

@Nikita-Dudorov

This happens to me when trying to do ray.init() on an HPC cluster on Compute Canada.

Did you manage to resolve it? Is there any way to run ray on Compute Canada?

@jpgard

jpgard commented May 2, 2023

I'm still experiencing this issue. It is probably due to a large number of jobs that are running/have run on a Slurm cluster, but there is no way to debug it further. There isn't any information in the logs, AFAICT.

Anyone able to work around this somehow?

To the developers: sorry, providing a reproducible example for this is pretty difficult. But I am on ray 2.2 with Rocky Linux 8.5.

Update: in my case, sometimes just trying to .init() again solves the problem.
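If retrying init is what works, a small wrapper along these lines automates it (a rough sketch only; init_with_retry is a made-up helper name, and the attempt count and delay are arbitrary - ray.init() and ray.shutdown() are the only Ray calls involved):

import time
import ray

def init_with_retry(attempts=3, delay=10, **init_kwargs):
    """Call ray.init() a few times, tearing down any half-started local Ray in between."""
    for i in range(attempts):
        try:
            return ray.init(**init_kwargs)
        except Exception:
            if i == attempts - 1:
                raise
            ray.shutdown()     # clean up whatever partially started
            time.sleep(delay)  # give lingering processes time to exit

init_with_retry()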

@vertfreeber

vertfreeber commented May 19, 2023

(Running on Windows 10)

I think I found a solution, but I don't know if I should laugh or cry right now...

So basically I had the same error messages after using ray.init() or ray start --head:
"Timed out after 60 seconds while waiting for node to startup. Did not find socket "socket name" in the list of object store socket names" and
"The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup."

I had no clue what this meant, and Google + ChatGPT had no answers that worked for me. I decided to find it myself and wasted a lot of time debugging the ray code in my IDE and sifting through the logs, trying to understand exactly what was happening, hoping to find the error. While looking through the logs I found this in raylet.err:

"[libprotobuf ERROR external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:581] String field 'ray.rpc.GcsNodeInfo.node_manager_hostname' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.
"

I skipped it at first because I didn't understand it and thought maybe it was setting the hostname to None or something because it didn't find a node. But after several hours of trying other things and getting desperate, I luckily came back to this error and thought to myself:
"Hey, what could they mean by node_manager_hostname?"

And then it hit me.

I instantly hit the Windows key, opened my system settings and went to the info tab. And there it was, the root of my problems:

"Device name: der_gerät"

The stupid name I gave my PC stopped me from using ray and made me debug for at least 4 hours over multiple days. I don't know why, but I guess the letter ä doesn't survive the UTF-8 serialization, haha.

After changing the name of my PC to something without the weird letters of the German language, ray.init() finally started to work. I hope this helps someone else, because I'm sure I will tell my colleagues (or in my case, my fellow CS student friends) about this stupid bug.

Cheers! 😄
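Based on that report, a quick preflight check for a non-ASCII machine name may save others the same hunt (socket.gethostname() is standard library; the ASCII check itself is just an illustrative heuristic, not something Ray requires explicitly):

import socket

hostname = socket.gethostname()
try:
    hostname.encode("ascii")
except UnicodeEncodeError:
    print(f"Hostname {hostname!r} contains non-ASCII characters; "
          "this has been reported to break Ray startup on Windows. "
          "Consider renaming the machine before calling ray.init().")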

@tongjingqi

tongjingqi commented Jun 7, 2024

My error is:
sampleing ===== SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['Question:', 'Question', 'USER:', 'USER', 'ASSISTANT:', 'ASSISTANT', 'Instruction:', 'Instruction', 'Response:', 'Response'], ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True)
Traceback (most recent call last):
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/node.py", line 318, in init
ray._private.services.wait_for_node(
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/services.py", line 464, in wait_for_node
raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2024-06-06_11-42-53_463432_8501/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/data/user/zhao_jun/MetaMath/eval/eval_GSM8K_category.py", line 134, in
gsm8k_test(model=args.model, data_path=args.data_file, start=args.start, end=args.end, batch_size=args.batch_size, tensor_parallel_size=args.tensor_parallel_size)
File "/mnt/data/user/zhao_jun/MetaMath/eval/eval_GSM8K_category.py", line 92, in gsm8k_test
llm = LLM(model=model,tensor_parallel_size=tensor_parallel_size)
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 228, in from_engine_args
distributed_init_method, placement_group = initialize_cluster(
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/vllm/engine/ray_utils.py", line 77, in initialize_cluster
ray.init(address=ray_address, ignore_reinit_error=True)
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/worker.py", line 1645, in init
_global_node = ray._private.node.Node(
File "/opt/miniconda3/envs/metamath/lib/python3.10/site-packages/ray/_private/node.py", line 323, in init
raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

After troubleshooting for a long time, I found that the disk was full, so the intermediate files Ray needs could not be created. After cleaning up the disk, the problem was solved.
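A quick way to rule out the disk-full case before calling ray.init() (shutil.disk_usage is standard library; /tmp is Ray's default temp directory on Linux, and the 1 GB threshold is arbitrary):

import shutil

# Ray creates its session files under /tmp/ray by default on Linux.
total, used, free = shutil.disk_usage("/tmp")
print(f"/tmp free space: {free / 1e9:.1f} GB")
if free < 1_000_000_000:  # arbitrary 1 GB threshold, purely illustrative
    print("Very little free space left; Ray may fail to create its session files.")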
