[k8s][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod #16935

Closed
Luis-Victor opened this issue Jul 7, 2021 · 14 comments · Fixed by #19000
Labels: bug (Something that is supposed to be working, but isn't), k8s, P1 (Issue that should be fixed within a few weeks)

@Luis-Victor

What is the problem?

I'm trying to prevent task scheduling on the head node by setting rayResources: { "CPU": 0 } for rayHeadType in values.yaml, but Ray keeps trying to schedule tasks on the head pod, failing due to lack of resources, and as a result gets stuck without scaling up any worker pods.

  • Python version: 3.8.5
  • Linux: Ubuntu 18.04
  • Kubernetes version: 1.18.14 (AKS)

Reproduction (REQUIRED)

  • Install k8s ray cluster using helm
  • values.yaml
# RayCluster settings:
image: rayproject/ray:nightly-py38
headPodType: rayHeadType
podTypes:
  rayHeadType:
    minWorkers: 0
    maxWorkers: 0
    CPU: 1
    memory: 512Mi
    GPU: 0
    rayResources: { "CPU": 0, "GPU": 0 }
    nodeSelector: { use: development }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 512Mi
    CPU: 1
    GPU: 0
    rayResources: { "GPU": 0 }
    nodeSelector: { use: development }

# Operator settings:
operatorOnly: false
clusterOnly: false
namespacedOperator: false
operatorNamespace: default
operatorImage: rayproject/ray:nightly-py38
  • task.py
import ray
LOCAL_PORT = 10001

@ray.remote(num_cpus=1)
def f(i):
    import time
    print(f"Try number {i}")
    time.sleep(60)

if __name__ == "__main__":
    # Connect via Ray Client to the head node's client server
    # (port 10001 forwarded to localhost).
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
    ray.get([f.remote(i + 1) for i in range(10)])
  • task log
The actor or task with ID ehd9rsa79fe5783b8bbc6e3fba23srd094a7q98c7axec7 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
Luis-Victor added the bug and triage labels on Jul 7, 2021
@richardliaw
Contributor

@DmitriGekhtman take a look?

richardliaw added the P1 label and removed the triage label on Jul 7, 2021
richardliaw changed the title on Jul 7, 2021: [Autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
richardliaw added the k8s label on Jul 8, 2021
DmitriGekhtman self-assigned this on Jul 8, 2021
@DmitriGekhtman
Contributor

Taking a look

@DmitriGekhtman
Contributor

Reproduced.
The fact that nothing is getting scheduled on the head node is consistent with the CPU 0 annotation.
The fact that no nodes are coming up at all is alarming. Investigating.

DmitriGekhtman added the P0 label and removed the P1 label on Jul 8, 2021
@DmitriGekhtman
Contributor

Confirming that this is related to the resource annotation (it does not happen without the annotation).

@DmitriGekhtman
Contributor

DmitriGekhtman commented Jul 8, 2021

Reproduced with AWSNodeProvider: 0 CPU resource override for head node prevents workers from launching.

Specifically, it looks like there's no request for CPU resources registered with the autoscaler when the tasks are submitted.

Digging into why.

This is important to fix, as setting 0 CPU on the head node to make it a pure control node is a common configuration for large clusters.
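For reference, a minimal sketch of the kind of AWS autoscaler config this repro corresponds to (instance types and region are illustrative, not taken from this thread); the relevant part is the resources override of CPU: 0 on the head node type:

# Sketch of an AWS cluster config with a 0-CPU head override (illustrative values)
cluster_name: cpu0-repro
max_workers: 2

provider:
    type: aws
    region: us-west-2

available_node_types:
    head:
        resources: {"CPU": 0}        # head advertises zero CPUs, so no tasks land on it
        node_config:
            InstanceType: m5.large
    worker:
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 2
        node_config:
            InstanceType: m5.large

head_node_type: head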

DmitriGekhtman changed the title on Jul 8, 2021: [k8s] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][aws][autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
DmitriGekhtman added this to the Serverless Autoscaling milestone on Jul 8, 2021
@DmitriGekhtman
Contributor

autoscaler.sdk.request_resources works as expected, but task submission does not.
cc @wuisawesome for ideas on what might be off / debugging strategy
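For context, a minimal sketch of the two paths being compared here (not verbatim from this thread): an explicit hint through ray.autoscaler.sdk.request_resources does show up as a demand and triggers scale-up, while an ordinary 1-CPU task submitted on a 0-CPU head does not.

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # run from the head node of the running cluster

# Path 1: explicit autoscaler hint -- this registers a demand and scales up.
request_resources(bundles=[{"CPU": 1}])

# Path 2: an ordinary task needing 1 CPU -- with the head annotated as CPU: 0,
# this stays pending and no demand appears in `ray status`.
@ray.remote(num_cpus=1)
def f():
    return "ok"

print(ray.get(f.remote()))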

@DmitriGekhtman
Contributor

No resource demands are visible from ray status when the tasks are submitted, which suggests a Ray Core issue:

======== Autoscaler status: 2021-07-08 20:49:09.109187 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/4.313 GiB memory
 0.00/2.157 GiB object_store_memory

Demands:
 (no resource demands)

DmitriGekhtman changed the title on Jul 8, 2021: [k8s][aws][autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][aws][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
@DmitriGekhtman
Contributor

cc @mwtian, you might also be interested in the interaction between the C++ core and the autoscaler.

DmitriGekhtman added the P1 label and removed the P0 label on Jul 12, 2021
@DmitriGekhtman
Contributor

Actually, I'm no longer able to reproduce on AWS. It looks like a K8s issue, but I still have no idea why I'm not getting any resource demands from the GCS.

DmitriGekhtman changed the title on Jul 12, 2021: [k8s][aws][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
@wuisawesome
Contributor

The easiest repro @DmitriGekhtman and I have found is:

import ray

ray.init(num_cpus=0)

@ray.remote(num_cpus=1)
def foo():
    pass

ray.get(foo.remote())

ray status won't show this demand

cc @ericl

@DmitriGekhtman
Contributor

@wuisawesome Ping me on the PR fixing this when the core team gets to it -- I'm pretty curious about what silliness is happening here...

@max0x7ba
Contributor

max0x7ba commented Sep 15, 2021

I experience this problem on GCP, with no k8s. My config is:

cluster_name: ray

provider:
    type: gcp
    region: [...]
    availability_zone: [...]
    project_id: [...]
    cache_stopped_nodes: False

auth:
    ssh_user: max

available_node_types:
    worker-16:
        resources: {CPU: 16, GPU: 0}
        node_config:
            machineType: e2-standard-16
            sourceInstanceTemplate: global/instanceTemplates/worker-16
    head-4:
        resources: {CPU: 0, GPU: 0}
        node_config:
            machineType: e2-standard-4
            sourceInstanceTemplate: global/instanceTemplates/head-4

head_node_type: head-4
min_workers: 1
max_workers: 20

The cluster user reports:

2021-09-15 19:04:40.345 INFO ray.worker Using address ray://10.128.0.32:10001 set in the environment variable RAY_ADDRESS
The actor or task with ID 8bcdf0419748c8e5ad4f7818bb313cdbae807c4939698ef1 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

While ray status on the head node reports no resource demand:

======== Autoscaler status: 2021-09-15 19:05:01.104671 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-4
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/9.241 GiB memory
 0.00/4.621 GiB object_store_memory

Demands:
 (no resource demands)

What is a solution for this problem, please?

My additional expectation is that min_workers should start that many extra worker nodes in addition to the head node, but that doesn't happen.

@ericl
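Regarding the min_workers expectation above: in multi-node-type configs the minimum is usually declared per node type rather than (or in addition to) at the top level. A sketch of what that would look like for the config above (whether this changes the behavior reported here has not been verified in this thread):

available_node_types:
    worker-16:
        resources: {CPU: 16, GPU: 0}
        min_workers: 1     # per-node-type minimum worker count
        max_workers: 20
        node_config:
            machineType: e2-standard-16
            sourceInstanceTemplate: global/instanceTemplates/worker-16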

@mwtian
Member

mwtian commented Sep 15, 2021

This is on my TODO list. I'm aiming to have it fixed in Ray 1.7.

@mwtian
Member

mwtian commented Oct 4, 2021

The fix was too late to make it into Ray 1.7. It is currently available in the Ray nightly wheels.
