terminated idle nodes generate a misleading warning #14129

@AmeerHajAli

Description

Start an AWS cluster with 0 workers (m5.large, default YAML, idle timeout set to 1).
Then run this code:

import time

import ray

ray.init(address="auto")

@ray.remote(num_cpus=2)
def f():
    time.sleep(60)

# Call f.remote() so that two tasks are actually submitted.
ray.get([f.remote() for _ in range(2)])

This results in 1 worker being spun up. When this worker becomes idle, the autoscaler terminates it, but the following warning is generated:

>>> 2021-02-16 04:52:40,168     WARNING worker.py:1034 -- The node with node id 610641aa77c977618fefc691d433d4606876e904 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
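To check that the node ID in the warning belongs to the worker the autoscaler terminated, rather than to a raylet that crashed, one option is to list the nodes Ray currently considers dead and compare IDs. This is a minimal sketch, assuming `ray.nodes()` returns one dict per node with `NodeID` and `Alive` keys; the helper names `dead_node_ids` and `dead_nodes_in_cluster` are hypothetical and not part of Ray.

```python
from typing import Dict, Iterable, List


def dead_node_ids(nodes: Iterable[Dict]) -> List[str]:
    """Return the IDs of nodes whose "Alive" flag is false or missing."""
    return [n["NodeID"] for n in nodes if not n.get("Alive", False)]


def dead_nodes_in_cluster() -> List[str]:
    """Query the running cluster; assumes ray.init() has already been called."""
    import ray  # imported lazily so dead_node_ids() works without Ray installed

    return dead_node_ids(ray.nodes())
```

Running `dead_nodes_in_cluster()` on the head node after the warning appears should list the ID from the warning if the terminated worker is the node in question.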

- [ ] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/master/installation.html).

Metadata

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), usability
