[k8s][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod #16935

Closed
Luis-Victor opened this issue Jul 7, 2021 · 14 comments · Fixed by #19000
Labels: bug (Something that is supposed to be working, but isn't), k8s, P1 (Issue that should be fixed within a few weeks)

@Luis-Victor

What is the problem?

I'm trying to prevent task scheduling on the head node by setting rayResources: { "CPU": 0 } for rayHeadType in values.yaml, but Ray keeps trying to schedule tasks on the head pod, failing due to lack of resources, and as a result gets stuck without scaling up any worker pods.

  • Python version: 3.8.5
  • Linux: Ubuntu 18.04
  • Kubernetes version: 1.18.14 (AKS)

Reproduction (REQUIRED)

  • Install k8s ray cluster using helm
  • values.yaml
# RayCluster settings:
image: rayproject/ray:nightly-py38
headPodType: rayHeadType
podTypes:
  rayHeadType:
    minWorkers: 0
    maxWorkers: 0
    CPU: 1
    memory: 512Mi
    GPU: 0
    rayResources: { "CPU": 0, "GPU": 0 }
    nodeSelector: { use: development }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 512Mi
    CPU: 1
    GPU: 0
    rayResources: { "GPU": 0 }
    nodeSelector: { use: development }

# Operator settings:
operatorOnly: false
clusterOnly: false
namespacedOperator: false
operatorNamespace: default
operatorImage: rayproject/ray:nightly-py38
  • task.py
import ray
LOCAL_PORT = 10001

@ray.remote(num_cpus=1)
def f(i):
    import time
    print(f"Try number {i}")
    time.sleep(60)

if __name__ == "__main__":
    # Connect via Ray Client to the head node's client server
    # (port 10001 forwarded to localhost).
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
    ray.get([f.remote(i + 1) for i in range(10)])
  • task log
The actor or task with ID ehd9rsa79fe5783b8bbc6e3fba23srd094a7q98c7axec7 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
Luis-Victor added the bug and triage labels on Jul 7, 2021
@richardliaw
Contributor

@DmitriGekhtman take a look?

richardliaw added the P1 label and removed the triage label on Jul 7, 2021
richardliaw changed the title on Jul 7, 2021: [Autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
richardliaw added the k8s label on Jul 8, 2021
DmitriGekhtman self-assigned this on Jul 8, 2021
@DmitriGekhtman
Contributor

Taking a look

@DmitriGekhtman
Contributor

Reproduced.
The fact that nothing is getting scheduled on the head node is consistent with the CPU 0 annotation.
The fact that no nodes are coming up at all is alarming. Investigating.

DmitriGekhtman added the P0 label and removed the P1 label on Jul 8, 2021
@DmitriGekhtman
Contributor

Confirming that this is related to the resource annotation (it does not happen without the annotation).

@DmitriGekhtman
Contributor

DmitriGekhtman commented Jul 8, 2021

Reproduced with AWSNodeProvider: 0 CPU resource override for head node prevents workers from launching.

Specifically, it looks like there's no request for CPU resources registered with the autoscaler when the tasks are submitted.

Digging into why.

This is important to fix, as setting 0 CPU on the head node to make it a pure control node is a common configuration for large clusters.
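For reference, a minimal sketch of the kind of AWS autoscaler config this repro corresponds to (instance types and region are illustrative, not taken from this thread); the relevant part is the resources override of CPU: 0 on the head node type:

# Sketch of an AWS cluster config with a 0-CPU head override (illustrative values)
cluster_name: cpu0-repro
max_workers: 2

provider:
    type: aws
    region: us-west-2

available_node_types:
    head:
        resources: {"CPU": 0}        # head advertises zero CPUs, so no tasks land on it
        node_config:
            InstanceType: m5.large
    worker:
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 2
        node_config:
            InstanceType: m5.large

head_node_type: head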

DmitriGekhtman changed the title on Jul 8, 2021: [k8s] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][aws][autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
DmitriGekhtman added this to the Serverless Autoscaling milestone on Jul 8, 2021
@DmitriGekhtman
Contributor

autoscaler.sdk.request_resources works as expected, but task submission does not.
cc @wuisawesome for ideas on what might be off / debugging strategy
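For context, a minimal sketch of the two paths being compared here (not verbatim from this thread): an explicit hint through ray.autoscaler.sdk.request_resources does show up as a demand and triggers scale-up, while an ordinary 1-CPU task submitted on a 0-CPU head does not.

import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # run from the head node of the running cluster

# Path 1: explicit autoscaler hint -- this registers a demand and scales up.
request_resources(bundles=[{"CPU": 1}])

# Path 2: an ordinary task needing 1 CPU -- with the head annotated as CPU: 0,
# this stays pending and no demand appears in `ray status`.
@ray.remote(num_cpus=1)
def f():
    return "ok"

print(ray.get(f.remote()))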

@DmitriGekhtman
Contributor

No resource demands are visible from ray status when the tasks are submitted, which suggests a Ray Core issue:

======== Autoscaler status: 2021-07-08 20:49:09.109187 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/4.313 GiB memory
 0.00/2.157 GiB object_store_memory

Demands:
 (no resource demands)

DmitriGekhtman changed the title on Jul 8, 2021: [k8s][aws][autoscaler] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][aws][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
@DmitriGekhtman
Contributor

cc @mwtian, you might also be interested in the interaction between the C++ core and the autoscaler.

DmitriGekhtman added the P1 label and removed the P0 label on Jul 12, 2021
@DmitriGekhtman
Contributor

Actually, I'm no longer able to reproduce on AWS. It looks like a K8s issue, but I still have no idea why I'm not getting any resource demands from the GCS.

DmitriGekhtman changed the title on Jul 12, 2021: [k8s][aws][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod → [k8s][autoscaler][core] { "CPU": 0 } is not working for avoid task scheduling in rayHeadType Pod
@wuisawesome
Contributor

The easiest repro @DmitriGekhtman and I have found is:

import ray

ray.init(num_cpus=0)

@ray.remote(num_cpus=1)
def foo():
    pass

ray.get(foo.remote())

ray status won't show this demand

cc @ericl

@DmitriGekhtman
Contributor

@wuisawesome Ping me on the PR fixing this when the core team gets to it -- I'm pretty curious about what silliness is happening here...

@max0x7ba
Contributor

max0x7ba commented Sep 15, 2021

I experience this problem on GCP, with no k8s. My config is:

cluster_name: ray

provider:
    type: gcp
    region: [...]
    availability_zone: [...]
    project_id: [...]
    cache_stopped_nodes: False

auth:
    ssh_user: max

available_node_types:
    worker-16:
        resources: {CPU: 16, GPU: 0}
        node_config:
            machineType: e2-standard-16
            sourceInstanceTemplate: global/instanceTemplates/worker-16
    head-4:
        resources: {CPU: 0, GPU: 0}
        node_config:
            machineType: e2-standard-4
            sourceInstanceTemplate: global/instanceTemplates/head-4

head_node_type: head-4
min_workers: 1
max_workers: 20

The cluster user reports:

2021-09-15 19:04:40.345 INFO ray.worker Using address ray://10.128.0.32:10001 set in the environment variable RAY_ADDRESS
The actor or task with ID 8bcdf0419748c8e5ad4f7818bb313cdbae807c4939698ef1 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.

While ray status on the head node reports no resource demand:

======== Autoscaler status: 2021-09-15 19:05:01.104671 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-4
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.00/9.241 GiB memory
 0.00/4.621 GiB object_store_memory

Demands:
 (no resource demands)

What is a solution for this problem, please?

My additional expectation is that min_workers should start that many extra worker nodes in addition to the head node, but that doesn't happen.

@ericl
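Regarding the min_workers expectation above: in multi-node-type configs the minimum is usually declared per node type rather than (or in addition to) at the top level. A sketch of what that would look like for the config above (whether this changes the behavior reported here has not been verified in this thread):

available_node_types:
    worker-16:
        resources: {CPU: 16, GPU: 0}
        min_workers: 1     # per-node-type minimum worker count
        max_workers: 20
        node_config:
            machineType: e2-standard-16
            sourceInstanceTemplate: global/instanceTemplates/worker-16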

@mwtian
Member

mwtian commented Sep 15, 2021

This is on my TODO list. I'm aiming to have it fixed in Ray 1.7.

@mwtian
Member

mwtian commented Oct 4, 2021

The fix was too late to make it into Ray 1.7. It is currently available in the Ray nightly wheels.
