[autoscaler] Request_resources and actual actors are counted double #12498

Closed · Opened by PidgeyBE (Contributor) on Nov 30, 2020 · 0 comments · Fixed by #12661
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)

PidgeyBE commented Nov 30, 2020

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ray nightly
  • k8s autoscaling

Reproduction (REQUIRED)

I have an autoscaling cluster with:

available_node_types:
    def_head:
        node_config: {}
        resources: {"CPU": 2}
        max_workers: 1
    def_worker:
        node_config: {}
        resources: {"CPU": 2, "GPU": 1,  "WORKER": 1}
        max_workers: 3
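
To watch what the autoscaler does during the steps below, a small polling loop can be run from a second shell on the head node. This is only a sketch; the 5 second interval is arbitrary:

import time
import ray

ray.init(address="auto")

# Print the alive nodes and the resources each one advertises,
# so scale-up and scale-down events are easy to spot.
while True:
    alive = [n for n in ray.nodes() if n["Alive"]]
    print(f"{len(alive)} alive node(s)")
    for n in alive:
        print("  ", n["NodeManagerAddress"], n["Resources"])
    time.sleep(5)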

Then I run:

import os
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

@ray.remote(num_cpus=0.2, resources={"WORKER": 1.0})
class ActorA:
    def __init__(self):
        pass

# 1. Request resource bundle
request_resources(bundles=[{"CPU": 0.2, "WORKER": 1.0}])
# 2. Wait until the worker is online, then start the actor
a = ActorA.remote()

-> I see a second worker being scaled up. This is not needed, as the actor consumes exactly the same resources as were requested before.
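
To check whether the actor really landed on the first worker and consumed the requested resources (rather than needing a new node), the totals can be compared, for example:

# The 0.2 CPU and 1 WORKER used by ActorA should be subtracted from the
# available resources of the already-running worker node.
print(ray.cluster_resources())    # total resources across the cluster
print(ray.available_resources())  # resources still free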

# 3. Request the same resources again
request_resources(bundles=[{"CPU": 0.2, "WORKER": 1.0}])

-> I see a third worker is scaled up. This should not happen. Edit: I could not reproduce this step, so the third worker is possibly due to a race condition.
After some time, 1 or 2 workers are scaled down and an extra worker is immediately scaled up again. This bouncing behavior keeps going.
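
As a possible workaround, and assuming that a later call to request_resources() simply replaces the earlier request, the standing request could be cleared once the actor is scheduled:

# Clear the outstanding resource request so the autoscaler only sees the
# actor's actual usage. (Assumes a later request_resources() call
# overwrites the previous one.)
request_resources(bundles=[])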

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@PidgeyBE added the labels bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) on Nov 30, 2020
@ericl added the label P1 (Issue that should be fixed within a few weeks) and removed triage (Needs triage (eg: priority, bug/not-bug, and owning component)) on Nov 30, 2020
@ericl added this to the Serverless Autoscaling milestone on Nov 30, 2020