
[Ray Autoscaler] Autoscaler kills working nodes unexpectedly #46492

Open
yx367563 opened this issue Jul 9, 2024 · 19 comments · May be fixed by #48519
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
P1: Issue that should be fixed within a few weeks

Comments

@yx367563

yx367563 commented Jul 9, 2024

What happened + What you expected to happen

I have found that the Ray autoscaler sometimes mistakenly kills nodes that are still working. In my scenario, 400 Ray tasks are submitted at the same time, but not all of them can be allocated resources at the start, and almost every time scale-down begins in the final stage I get the error ray.exceptions.NodeDiedError: Task failed due to the node dying.
What I have been able to troubleshoot so far is that killing a working node raises an exception, which then fails the whole job. I am currently using Ray 2.23.0 and KubeRay 1.0.0, without autoscaler v2 enabled.
Since development of autoscaler v2 has been temporarily suspended, I'd like to ask whether there is any workaround, or whether someone can point me to the likely cause and the relevant code logic.

Versions / Dependencies

Ray 2.23.0
Kuberay 1.0.0

Reproduction script

You can reproduce my problem with the following code. Be careful not to make all tasks runnable at the same time: at least one task needs to be pending for the bug to be triggered. Looking forward to your replies!

import ray
import time
import random

# Each inner task holds 8 CPUs for 120-300 seconds, which is longer than
# the node idleTimeoutSeconds of 60.
@ray.remote(max_retries=5, num_cpus=8)
def inside_ray_task():
    sleep_time = random.randint(120, 300)
    start_time = time.perf_counter()
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)

# The outer task uses the default of 1 CPU, fans out 100 inner tasks,
# and then blocks in ray.get() until they all finish.
@ray.remote(max_retries=0)
def outside_ray_task():
    future_list = [inside_ray_task.remote() for _ in range(100)]
    ray.get(future_list)

if __name__ == '__main__':
    ray.init()
    ray.get(outside_ray_task.remote())

Issue Severity

High: It blocks me from completing my task.

yx367563 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on Jul 9, 2024
@yx367563
Author

yx367563 commented Jul 9, 2024

I have tried autoscaler v2 and the bug did go away, but it introduced other new problems. Given that development of autoscaler v2 is currently on hold, I'm wondering whether there's a better solution?

@anyscalesam
Contributor

Thanks for reproducing on ASv2 ... we'll take a look at this @yx367563

anyscalesam added the core (Issues that should be addressed in Ray Core) label on Jul 10, 2024
@yx367563
Author

@anyscalesam Alternatively, I would be happy to fix this issue in autoscaler v1 myself. Where can I find documentation for autoscaler v1? It is difficult to locate the problem just by reading the source code. Or you could share the possible causes you can think of.

@rynewang
Contributor

@yx367563 can you provide a more detailed repro setup script, including the scaling-down part? We plan to repro it and try to fix it.

Expected behavior: when a task is assigned to a node and is in the "waiting for resources" state, and that node dies, the task should be reassigned to other nodes. Here it looks like the tasks are instead marked dead.

rynewang added the @external-author-action-required (Alternate tag for PRs where the author doesn't have labeling permission) and P1 (Issue that should be fixed within a few weeks) labels and removed the triage label on Jul 15, 2024
@rynewang
Contributor

@kevin85421 can you try to repro this with @yx367563 ?

@yx367563
Author

Of course. I set up a single worker group with minWorkerNum of 2, maxWorkerNum of 1000, 8 CPUs per worker, and idleTimeoutSeconds of 60. The cluster can provide at most 24 workers, so when the code above is run, at most 24 inner tasks execute at the same time and the rest stay pending. The bug is triggered almost every time the last tasks are scheduled and scale-down occurs. cc @rynewang @kevin85421

anyscalesam removed the @external-author-action-required label on Jul 19, 2024
@anyscalesam
Contributor

@rynewang is P1 right for this? (The OSS autoscaler killing nodes unexpectedly feels like an important issue.)

@yx367563
Author

yx367563 commented Jul 23, 2024

Additional discovery: when using ray job submit, if --entrypoint-num-cpus=1 is specified, this bug does not seem to be triggered.
Is there any way to achieve the same effect as specifying --entrypoint-num-cpus=1 when connecting to a Ray cluster interactively? @anyscalesam @kevin85421

@anyscalesam
Contributor

What other issues come up when ASv2 is turned on, @yx367563?

@yx367563
Author

@anyscalesam You can refer to ray-project/kuberay#2223.
Additional discovery: this bug is only triggered in the nested scenario. Will this help locate the bug? I think this bug has a large impact; many users may have encountered it before but could not identify the specific cause or produce reproducible code. cc @kevin85421

@yx367563
Author

Oh! I seem to have found the root cause! In the code snippet above, outside_ray_task requires 1 CPU, inside_ray_task requires 8 CPUs, and a single worker node has only 8 CPUs. In reality, though, outside_ray_task and one of the inside_ray_task instances are assigned to the same worker node, so that node is marked as idle as soon as the inside_ray_task on it completes.

I've found that in the nested case, no matter how many CPUs the outer task declares it needs, it runs on the same node as one of the inner tasks, which should not be expected. cc @anyscalesam @rynewang @kevin85421

[screenshot attached]
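A minimal sketch (not part of the original comment) of one way to confirm this co-location, assuming ray.get_runtime_context().get_node_id() is available in this Ray version; the task names and the count of 4 inner tasks are illustrative:

import ray

# Illustrative names; this mirrors the repro script's resource requests but
# only checks which node each task lands on.
@ray.remote(num_cpus=8)
def inside_ray_task_logged():
    # Report the node hosting this inner task.
    return ray.get_runtime_context().get_node_id()

@ray.remote
def outside_ray_task_logged():
    outer_node = ray.get_runtime_context().get_node_id()
    inner_nodes = ray.get([inside_ray_task_logged.remote() for _ in range(4)])
    return outer_node, inner_nodes

if __name__ == "__main__":
    ray.init()
    outer_node, inner_nodes = ray.get(outside_ray_task_logged.remote())
    print("outer task node:", outer_node)
    print("inner task nodes:", inner_nodes)
    # If this prints True, the outer task shares a node with an inner task, and
    # that node will look idle once the co-located inner task finishes.
    print("co-located:", outer_node in inner_nodes)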

@yx367563
Author

yx367563 commented Jul 23, 2024

In fact, none of the CPU resources of the outer task are counted; the eight CPUs occupied here are taken up entirely by the inner task.
[screenshot attached]

@yx367563
Author

If I set the memory requirement of the outer task to 1000 it runs fine, so the problem appears to be with how the outer task's CPU request is handled.
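For reference, a minimal sketch of the workaround described above, giving the outer task an explicit memory requirement. The value 1000 is the one from this comment (Ray's memory option is expressed in bytes), and the inner task is a simplified version of the one in the reproduction script:

import ray
import random
import time

@ray.remote(max_retries=5, num_cpus=8)
def inside_ray_task():
    # Simplified inner task: just hold the CPUs longer than idleTimeoutSeconds.
    time.sleep(random.randint(120, 300))

# Declaring a memory requirement on the outer task (1000 bytes, as in the
# comment above) avoids the premature scale-down reported here.
@ray.remote(max_retries=0, memory=1000)
def outside_ray_task():
    ray.get([inside_ray_task.remote() for _ in range(100)])

if __name__ == "__main__":
    ray.init()
    ray.get(outside_ray_task.remote())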

@yx367563
Author

It seems that when ray.get is called inside a task, the CPU resources requested on the current node are temporarily released so that other tasks can be scheduled. This design is reasonable, but it causes the autoscaler to kill such nodes by mistake. Is there any solution?
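A minimal sketch (not from this thread) illustrating the CPU-release behavior described above: on a local cluster with a single logical CPU, the child task can only run because the parent's CPU is returned to the scheduler while the parent blocks in ray.get():

import ray

@ray.remote(num_cpus=1)
def child():
    return "child done"

@ray.remote(num_cpus=1)
def parent():
    # While this ray.get() blocks, the parent's 1 CPU is released back to the
    # scheduler; otherwise the child could never start on a 1-CPU cluster and
    # this call would deadlock.
    return ray.get(child.remote())

if __name__ == "__main__":
    ray.init(num_cpus=1)  # local cluster with a single logical CPU
    print(ray.get(parent.remote()))  # prints "child done" instead of hanging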

jjyao assigned jjyao and unassigned kevin85421 on Jul 29, 2024
jjyao added the P0.5 label and removed the P1 (Issue that should be fixed within a few weeks) label on Jul 29, 2024
@DmitriGekhtman
Contributor

DmitriGekhtman commented Sep 6, 2024

Nice going finding the root cause!

This is one of Ray's classic sharp edges -- logical CPU resources of the parent task are released while the child is running, but other resources are not released.

It sort of makes sense, in that physical CPU is not consumed while the parent is waiting, whereas some other resources (like memory) are still blocked.

But then you get some funny edge cases, like this one.

Looking forward to the fix!

anyscalesam added the core-autoscaler (autoscaler related issues) label on Sep 9, 2024
@mimiliaogo

@jjyao @anyscalesam

We have recently identified the root cause of the issue and developed a fully reproducible script, and we would like to contribute a fix. However, there may be multiple potential solutions, and this is a high-impact issue, so I would like to know if you can provide any guidance.

The root cause is related to the nested setting: when a child task that is located on the same node as its parent task completes, and no new tasks are scheduled on that node, the node is removed after the idle timeout. As a result, unfinished child tasks fail.
It is reasonable that the parent worker releases its CPU resources while blocked in ray.get() to prevent deadlock; however, labeling the node as idle leads to it being killed unexpectedly.
Below is a diagram illustrating the issue:

[diagram attached]

@anyscalesam
Contributor

Way to go going deep here @mimiliaogo > so your expectation is that the outside task persists until all of its linked inside tasks (on other worker nodes) complete?

Can you link the repro script on this issue here?

@mimiliaogo

mimiliaogo commented Oct 3, 2024

This is the repro script: https://pastecode.io/s/wozizu6u
In this setting, you'll get ray.exceptions.NodeDiedError since the parent node gets removed after idle timeout while the longer child task is still executing on the other worker node.

It's reasonable that the parent task shouldn't occupy CPU resources, but its state (memory, etc.) should be preserved, e.g. migrated to another active node, before that node is removed during downscaling.

@kevin85421
Member

I will schedule a meeting with @mimiliaogo to discuss this issue.

jjyao added the P1 (Issue that should be fixed within a few weeks) label and removed the P0.5 label on Oct 30, 2024
kevin85421 assigned mimiliaogo and unassigned jjyao on Nov 5, 2024