
[Bug] Autoscaler doesn't scale to meet demand #846

Closed · tekumara opened this issue Dec 22, 2022 · 3 comments · Fixed by ray-project/ray#42962
Labels: docs (Improvements or additions to documentation), P1 (Issue that should be fixed within a few weeks)

@tekumara (Contributor)

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

======== Autoscaler status: 2022-12-22 00:22:04.898664 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
 1 workergroup
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0.00/3.725 GiB memory
 0.00/1.046 GiB object_store_memory

Demands:
 {'CPU': 2.0}: 10000+ pending tasks/actors
 2022-12-22 00:26:46,855 WARNING resource_demand_scheduler.py:783 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 2.0}...

Reproduction script

Install the raycluster Helm chart v0.4.0 with this values.yaml:

image:
  tag: 2.2.0-py39-cpu

head:
  enableInTreeAutoscaling: true
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"

worker:
  minReplicas: 1
  maxReplicas: 10
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"

Then run this client script:

from collections import Counter
import socket
import time

import ray

ray.init("ray://127.0.0.1:10001")

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    {} memory resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU'], ray.cluster_resources()['memory']))

@ray.remote(num_cpus=2)
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname('localhost')

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

The script hangs, and the autoscaler repeatedly emits log output like the above.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@tekumara tekumara added the bug Something isn't working label Dec 22, 2022
@DmitriGekhtman DmitriGekhtman added docs Improvements or additions to documentation P1 Issue that should be fixed within a few weeks and removed bug Something isn't working labels Dec 23, 2022
@DmitriGekhtman (Collaborator) commented Dec 23, 2022

Each task you are trying to schedule requires 2 CPUs, but the Ray pods you would like to fit these tasks into have only 1 CPU each. Thus, the autoscaler cannot schedule a Ray pod capable of fitting a task.
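
Concretely, a minimal sketch of one fix, assuming the same chart values layout as in the reproduction above: give each worker pod at least as many CPUs as the largest task request, so the autoscaler has a node type that can satisfy {'CPU': 2.0}:

worker:
  minReplicas: 1
  maxReplicas: 10
  resources:
    limits:
      cpu: "2"      # was "1"; a num_cpus=2 task now fits on a single worker pod
      memory: "2G"
    requests:
      cpu: "2"
      memory: "2G"

Alternatively, declaring the task with @ray.remote(num_cpus=1) would let it fit the existing 1-CPU worker pods.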

The behavior is expected, but I think it is poorly documented.
More generally, the Ray-on-K8s architecture of Ray tasks in Ray pods in K8s nodes is not explained clearly enough.

I've relabelled this as a documentation issue.

@DmitriGekhtman (Collaborator)

Related: #828

@kevin85421 (Member)

PR: ray-project/ray#42962
