[Core][Cluster] Actors do not trigger the scale-up process because the default num_cpus seems not to work #27535

Closed
orcahmlee opened this issue Aug 5, 2022 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@orcahmlee
Contributor

What happened + What you expected to happen

I set up a Ray cluster on K8s that auto-scales the Ray workers from 0 to 6. The values.yaml is as follows:

image: rayproject/ray:1.12.0-py38
upscalingSpeed: 1.0
idleTimeoutMinutes: 5
headPodType: rayHeadType
podTypes:
  rayHeadType:
    CPU: 3
    memory: 28Gi
    rayResources: { "CPU": 0 }
  rayWorkerType:
    minWorkers: 0
    maxWorkers: 6
    memory: 28Gi
    CPU: 3
    GPU: 0

@ray.remote
def go():
    sleep(1)
    return "OK"


objs = [go.remote() for _ in range(50)]

I submitted 50 tasks when there were 0 Ray workers. The ray status output showed:

Demands:
{'CPU': 1.0}: 50+ pending tasks/actors

The scale-up process was then triggered immediately. This is the expected behavior.


@ray.remote
class ThreadedActor:
    def task(self):
        return f"native id: {threading.get_native_id()}"

actors = [ThreadedActor.options(max_concurrency=5).remote() for _ in range(50)]

However, when I submitted 50 actors while there were 0 Ray workers, the scale-up process was NOT triggered, and ray status showed:

Demands:
{}: 50+ pending tasks/actors

I noticed the ray status message did not list the CPU demand, so I ran two trials.

Trial 1: add num_cpus=1 to the decorator:

@ray.remote(num_cpus=1)  # <--- modified
class ThreadedActor:
    def task(self):
        return f"native id: {threading.get_native_id()}"

actors = [
    ThreadedActor.options(max_concurrency=5).remote() for _ in range(50)
]

Trial 2: pass num_cpus=1 via .options():

@ray.remote
class ThreadedActor:
    def task(self):
        return f"native id: {threading.get_native_id()}"

actors = [
    ThreadedActor.options(num_cpus=1, max_concurrency=5).remote() for _ in range(50)  # <--- modified
]

In both trials, the scale-up process was triggered immediately once num_cpus=1 was added, either in the decorator or in .options(). The ray status showed:

Demands:
{'CPU': 1.0}: 50+ pending tasks/actors

It's weird that the default num_cpus seems not to apply to actors.
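
My understanding (taken from the Ray docs, so this is my assumption rather than something confirmed in this thread) is that an actor by default needs 1 CPU to be scheduled but 0 CPU while it is running, whereas a task needs 1 CPU for both. A minimal sketch of that asymmetry:

import ray


@ray.remote
def task():
    # Task default: num_cpus=1, required both to schedule and to run.
    return "OK"


@ray.remote
class DefaultActor:
    # Actor default: 1 CPU to schedule, 0 CPU while alive; this looks like
    # the demand the autoscaler reports as {} here.
    pass


@ray.remote(num_cpus=1)
class ExplicitActor:
    # An explicit num_cpus=1 applies for the actor's whole lifetime, and the
    # demand then shows up as {'CPU': 1.0}, as in the two trials above.
    pass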

Versions / Dependencies

  • Python 3.8.13
  • ray 1.12.0
  • GKE 1.21.12-gke.1700

Reproduction script

from time import sleep
import threading
import ray


@ray.remote
def go():
    sleep(1)
    return "OK"


@ray.remote
class ThreadedActor:
    def task(self):
        return f"native id: {threading.get_native_id()}"


if __name__ == "__main__":
    ray.init(address="ray://HEAD-IP:10001")

    objs = [go.remote() for _ in range(50)]  # scaling up works

    actors = [
        ThreadedActor.options(max_concurrency=5).remote() for _ in range(50)  # scaling up does not work
    ]

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@orcahmlee added the bug and triage labels on Aug 5, 2022
@DmitriGekhtman
Contributor

This is a subtle issue that has been resolved on master in #26813.

@wuisawesome can elaborate here on why actor scheduling might not trigger scale-up with a 0-CPU head node.
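
A rough illustration of the setup in question (a sketch of the reporter's configuration and observations, not an authoritative description of the fix):

import ray

# The head pod from the reporter's values.yaml advertises no CPUs
# (rayResources: { "CPU": 0 }), so all real work has to land on a worker pod
# that the autoscaler must launch first.


@ray.remote
def go():
    # Reported to the autoscaler as {'CPU': 1.0} -> scale-up is triggered.
    return "OK"


@ray.remote
class ThreadedActor:
    # Reported as {} on ray 1.12 in this setup -> the autoscaler sees no
    # unsatisfied demand and never launches a worker.
    def task(self):
        return "OK"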

@wuisawesome
Contributor

@orcahmlee Happy to elaborate if you're interested, but for now I will close this as a duplicate of #26806. Please reopen if you can repro on a nightly build that includes that commit.
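
A minimal way to re-check on a nightly (just a sketch; HEAD-IP is a placeholder, same as in the reproduction script above):

import threading
import ray

ray.init(address="ray://HEAD-IP:10001")
print(ray.__version__)  # should be a nightly that already contains the fix


@ray.remote
class ThreadedActor:
    def task(self):
        return f"native id: {threading.get_native_id()}"


actors = [ThreadedActor.options(max_concurrency=5).remote() for _ in range(50)]
# Then run `ray status` on the head node; the Demands section should show
# {'CPU': 1.0}: 50+ pending tasks/actors instead of {}.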

@orcahmlee
Contributor Author

Thanks to both of you, I'll check the nightly version.
