
[autoscaler] Request resources behaves inconsistently #12443

Closed
2 tasks done
PidgeyBE opened this issue Nov 26, 2020 · 3 comments · Fixed by #12465
Labels: bug, triage

Comments

PidgeyBE (Contributor) commented Nov 26, 2020

What is the problem?

  • Ray version: nightly

Reproduction (REQUIRED)

  1. Set up a k8s autoscaling cluster.
  2. Run the following (I did it in ipython on the ray head):
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the existing cluster from the head node.
ray.init(address="auto")

# An actor that creates a {'CPU': 0.2} resource demand.
@ray.remote(num_cpus=0.2)
class ActorA:
    def __init__(self):
        pass

a = ActorA.remote()

# Request capacity for two 0.1-CPU bundles via the autoscaler SDK.
request_resources(bundles=[{"CPU": 0.1}, {"CPU": 0.1}])

-> Output of the autoscaling monitor is:

2020-11-26 10:37:16,419 INFO resource_demand_scheduler.py:193 -- Resource demands: [{'CPU': 0.2}]
...
2020-11-26 10:37:16,425 INFO autoscaler.py:612 -- StandardAutoscaler: resource_requests=[{'CPU': 0.1}, {'CPU': 0.1}]
...
2020-11-26 10:37:26,588 INFO resource_demand_scheduler.py:193 -- Resource demands: [{'CPU': 0.2}, {'CPU': 0.1}]

-> The expected output is [{'CPU': 0.2}, {'CPU': 0.1}, {'CPU': 0.1}]
In other tests I did, the requested resources were ignored entirely.

  3. If I now do ray.kill(a), the output shows:
    Resource demands: [{'CPU': 0.2}, {'CPU': 0.1}, {'CPU': 0.1}]
    So the missing request shows up, but the request related to the actor is not cleaned up ([autoscaler] Actor resource demands are not cleared after actor is scheduled #12441)
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
PidgeyBE added the bug and triage labels on Nov 26, 2020
AmeerHajAli (Contributor) commented Nov 27, 2020

Hi @PidgeyBE,
when you call request_resources(bundles=[{"CPU": 0.1}, {"CPU": 0.1}]) you get these resources "immediately", but they do not add on top of what you already have.
So if your resource demands are {"CPU": 0.2}, the demands become [{"CPU": 0.2}, {"CPU": 0.1}]: the requested [{"CPU": 0.1}, {"CPU": 0.1}] becomes available immediately, but the remaining {"CPU": 0.1} might take more time to become available.

Check out how request_resources() works here.
Does that make sense?
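
For reference, a minimal usage sketch of the SDK call (assuming a running autoscaling cluster): request_resources() takes either a list of bundles, as in this issue, or an aggregate num_cpus count (the value 4 below is just an illustrative choice).

from ray.autoscaler.sdk import request_resources

# Hint the autoscaler to hold capacity for two 0.1-CPU bundles
# (the bundle form used in the reproduction above).
request_resources(bundles=[{"CPU": 0.1}, {"CPU": 0.1}])

# Alternative aggregate form: hold capacity for this many CPUs in total.
request_resources(num_cpus=4)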

PidgeyBE (Contributor, Author) commented:

So if I do:

ray.remote(num_cpus=0.2)(ActorA).remote()
request_resources(bundles=[{"CPU": 0.1}, {"CPU": 0.1}])

the total Resource Demands become [{'CPU': 0.2}, {'CPU': 0.1}], because one requested bundle {'CPU': 0.1} fits into the currently deployed task with {'CPU': 0.2}.
But if I do

request_resources(bundles=[{"CPU": 0.2}, {"CPU": 0.2}])
ray.remote(num_cpus=0.1)(ActorA).remote()

the total Resource Demand becomes [{'CPU': 0.1}, {'CPU': 0.2}, {'CPU': 0.2}], because none of the requested bundles fits into any of the already deployed tasks?

So the Resource Demands are basically the requested resources, minus the bundles that are smaller than or equal to running tasks?
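
To make that rule concrete, here is a simplified illustration (not Ray's actual bin-packing code, just a sketch of the behaviour observed above): a requested bundle is dropped from the displayed demand when it fits entirely inside an existing demand that has not already absorbed another request.

# Simplified model of the observed behaviour; the real
# resource_demand_scheduler does per-node-type bin packing.
def combined_demand(existing, requested):
    used = [False] * len(existing)
    extra = []
    for bundle in requested:
        for i, demand in enumerate(existing):
            if not used[i] and all(bundle.get(k, 0) <= demand.get(k, 0) for k in bundle):
                used[i] = True  # request covered by an existing demand
                break
        else:
            extra.append(bundle)  # not covered, so it adds to the demand
    return existing + extra

print(combined_demand([{"CPU": 0.2}], [{"CPU": 0.1}, {"CPU": 0.1}]))
# -> [{'CPU': 0.2}, {'CPU': 0.1}]
print(combined_demand([{"CPU": 0.1}], [{"CPU": 0.2}, {"CPU": 0.2}]))
# -> [{'CPU': 0.1}, {'CPU': 0.2}, {'CPU': 0.2}]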

ericl (Contributor) commented Nov 28, 2020

@PidgeyBE that's right. Note that what you're seeing here is an artifact of the implementation of request_resources() (the internal bin packing algorithm). If only {"CPU": 1} bundles were used instead of different shapes, this artifact would disappear. We could improve the algorithm in the future to fix this edge case.

The intended use of request_resources() is as a hint to scale the cluster to accommodate the requests, ignoring any existing utilization of the cluster; the resulting cluster size might be slightly larger than necessary due to sub-optimal packing.
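
As a usage sketch of that intended pattern (the bundle count here is just an example): requesting uniform {"CPU": 1} shapes as a pure scaling hint avoids the packing artifact discussed above.

from ray.autoscaler.sdk import request_resources

# Hint the autoscaler to keep enough nodes for 8 single-CPU bundles,
# regardless of current utilization; sub-optimal packing may still make
# the resulting cluster slightly larger than strictly necessary.
request_resources(bundles=[{"CPU": 1}] * 8)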
