
[Bug] Autoscaler doesn't scale to meet demand #846

Closed · tekumara opened this issue Dec 22, 2022 · 3 comments · Fixed by ray-project/ray#42962
Labels: docs (Improvements or additions to documentation), P1 (Issue that should be fixed within a few weeks)

@tekumara (Contributor)

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

======== Autoscaler status: 2022-12-22 00:22:04.898664 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
 1 workergroup
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0.00/3.725 GiB memory
 0.00/1.046 GiB object_store_memory

Demands:
 {'CPU': 2.0}: 10000+ pending tasks/actors
 2022-12-22 00:26:46,855 WARNING resource_demand_scheduler.py:783 -- The autoscaler could not find a node type to satisfy the request: [{'CPU': 2.0}...

Reproduction script

Install the raycluster Helm chart v0.4.0 with this values.yaml:

image:
  tag: 2.2.0-py39-cpu

head:
  enableInTreeAutoscaling: true
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"

worker:
  minReplicas: 1
  maxReplicas: 10
  resources:
    limits:
      cpu: "1"
      memory: "2G"
    requests:
      cpu: "1"
      memory: "2G"

Then run this client script:

from collections import Counter
import socket
import time

import ray

ray.init("ray://127.0.0.1:10001")

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
    {} memory resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU'], ray.cluster_resources()['memory']))

@ray.remote(num_cpus=2)
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname('localhost')

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

The script hangs, and the autoscaler repeatedly emits log output like the above.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@tekumara tekumara added the bug Something isn't working label Dec 22, 2022
@DmitriGekhtman DmitriGekhtman added docs Improvements or additions to documentation P1 Issue that should be fixed within a few weeks and removed bug Something isn't working labels Dec 23, 2022
@DmitriGekhtman (Collaborator) commented Dec 23, 2022

Each task you are trying to schedule requires 2 CPUs, but the Ray pods you would like to fit these tasks into have only 1 CPU each. Thus, the autoscaler cannot schedule a Ray pod capable of fitting a task.
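
Concretely, a minimal sketch of one fix, assuming the same chart values layout as in the reproduction above: give each worker pod at least as many CPUs as the largest task request, so the autoscaler has a node type that can satisfy {'CPU': 2.0}:

worker:
  minReplicas: 1
  maxReplicas: 10
  resources:
    limits:
      cpu: "2"      # was "1"; a num_cpus=2 task now fits on a single worker pod
      memory: "2G"
    requests:
      cpu: "2"
      memory: "2G"

Alternatively, declaring the task with @ray.remote(num_cpus=1) would let it fit the existing 1-CPU worker pods.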

The behavior is expected, but I think it is poorly documented.
More generally, the Ray-on-K8s architecture of Ray tasks in Ray pods in K8s nodes is not explained clearly enough.

I've relabelled this as a documentation issue.

@DmitriGekhtman (Collaborator)

Related: #828

@kevin85421 (Member)

PR: ray-project/ray#42962
