[autoscaler] Request_resources and actual actors are counted double #12498

Closed · Opened by PidgeyBE (Contributor) on Nov 30, 2020 · 0 comments · Fixed by #12661
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)

PidgeyBE commented Nov 30, 2020

What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ray nightly
  • k8s autoscaling

Reproduction (REQUIRED)

I have an autoscaling cluster with:

available_node_types:
    def_head:
        node_config: {}
        resources: {"CPU": 2}
        max_workers: 1
    def_worker:
        node_config: {}
        resources: {"CPU": 2, "GPU": 1,  "WORKER": 1}
        max_workers: 3
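
To watch what the autoscaler does during the steps below, a small polling loop can be run from a second shell on the head node. This is only a sketch; the 5 second interval is arbitrary:

import time
import ray

ray.init(address="auto")

# Print the alive nodes and the resources each one advertises,
# so scale-up and scale-down events are easy to spot.
while True:
    alive = [n for n in ray.nodes() if n["Alive"]]
    print(f"{len(alive)} alive node(s)")
    for n in alive:
        print("  ", n["NodeManagerAddress"], n["Resources"])
    time.sleep(5)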

Then I run:

import os
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

@ray.remote(num_cpus=0.2, resources={"WORKER": 1.0})
class ActorA:
    def __init__(self):
        pass

# 1. Request resource bundle
request_resources(bundles=[{"CPU": 0.2, "WORKER": 1.0}])
# 2. Wait until the worker is online, then start the actor
a = ActorA.remote()

-> I see a second worker being scaled up. This is not needed, as the actor consumes exactly the same resources as were requested before.
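
To check whether the actor really landed on the first worker and consumed the requested resources (rather than needing a new node), the totals can be compared, for example:

# The 0.2 CPU and 1 WORKER used by ActorA should be subtracted from the
# available resources of the already-running worker node.
print(ray.cluster_resources())    # total resources across the cluster
print(ray.available_resources())  # resources still free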

# 3. Request the same resources again
request_resources(bundles=[{"CPU": 0.2, "WORKER": 1.0}])

-> I see a third worker is scaled up. This should not happen. Edit: I could not reproduce this step, so the third worker is possibly due to a race condition.
After some time, 1 or 2 workers are scaled down and an extra worker is immediately scaled up again. This bouncing behavior keeps going.
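
As a possible workaround, and assuming that a later call to request_resources() simply replaces the earlier request, the standing request could be cleared once the actor is scheduled:

# Clear the outstanding resource request so the autoscaler only sees the
# actor's actual usage. (Assumes a later request_resources() call
# overwrites the previous one.)
request_resources(bundles=[])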

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@PidgeyBE added the labels bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) on Nov 30, 2020
@ericl added the label P1 (Issue that should be fixed within a few weeks) and removed triage (Needs triage (eg: priority, bug/not-bug, and owning component)) on Nov 30, 2020
@ericl added this to the Serverless Autoscaling milestone on Nov 30, 2020