
[autoscaler] Too many workers are scaled in kubernetes #11551

Closed
PidgeyBE opened this issue Oct 22, 2020 · 16 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical)

Comments

@PidgeyBE
Contributor

PidgeyBE commented Oct 22, 2020

What is the problem?

  • ray = 1.0.0
  • autoscaling on k8s

When autoscaling in Kubernetes, it sometimes happens (almost 50% of the time in my case) that 2 workers are scaled up instead of 1.

Reproduction (REQUIRED)

  • Have a ray autoscale cluster on kubernetes
  • Deploy an actor that requires new worker resources (a minimal sketch is included below)
  • Ray autoscale will start an extra worker
  • When the first worker is (almost) ready, the autoscaler triggers another worker to start.
    I see this happens because NumNodesConnected becomes equal to NumNodesUsed:
 - NumNodesConnected: 3
 - NumNodesUsed: 3

A few seconds later (while the extra node is already being started and the actor is deployed) I see:

 - NumNodesConnected: 3
 - NumNodesUsed: 1.2

I've cut the important part out of the monitor.err logs: monitor.err.txt

It seems like there is some race condition that causes the autoscaler to trigger a worker to start, while the actor is deploying...
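
For concreteness, here is a minimal sketch of the kind of actor deployment that triggers this for me (the resource request and class name are just placeholders, not my actual workload):

import ray

ray.init(address="auto")

# Placeholder actor: it requests a resource the head node does not have,
# so the autoscaler has to launch a new worker pod to place it.
@ray.remote(num_gpus=1)
class Worker:
    def ping(self):
        return "ready"

actor = Worker.remote()
ray.get(actor.ping.remote())  # blocks until the actor lands on the new node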

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

BR, Pieterjan
cc @edoakes

@PidgeyBE PidgeyBE added bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels Oct 22, 2020
@rkooo567
Contributor

Cc @wuisawesome probably should handle asap

@wuisawesome
Contributor

Cc @AmeerHajAli @ericl

I'll look into this, but feel free to steal it if you have the bandwidth

@wuisawesome wuisawesome self-assigned this Oct 22, 2020
@edoakes
Contributor

edoakes commented Oct 22, 2020

@wuisawesome I vaguely remember seeing this in the past. IIRC the issue is that the autoscaler has no notion of "pending" nodes, so while the node is pending in k8s (which can take awhile), it continues to add more nodes because resource utilization is still high.
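
To make that concrete, a rough sketch of the loop I have in mind (purely illustrative, made-up names, not the actual autoscaler code):

# Illustration only: hypothetical helper, not Ray's autoscaler implementation.
def extra_workers_to_launch(connected_nodes, pending_nodes, demanded_nodes,
                            count_pending=True):
    visible = len(connected_nodes)
    if count_pending:
        visible += len(pending_nodes)  # pods that are still booting in k8s
    # If pending pods are ignored, 'visible' stays low while the demand is
    # still unmet, so every update cycle requests yet another worker.
    return max(0, demanded_nodes - visible)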

@ericl
Contributor

ericl commented Oct 22, 2020

What's weird is that the autoscaler is reporting it has 1/1 target nodes, but 2 nodes connected. There seems to be a connected node that is not managed by the autoscaler, which leads to an off-by-one error.

@wuisawesome is this related to that issue you had with an unmanaged node being counted as a node?

@edoakes I do see
2020-10-20 09:30:33,392 INFO autoscaler.py:542 -- Cluster status: 1/2 target nodes (1 pending) (1 updating)
so it seems we do have pending nodes. What's odd is that the number of "used" nodes is 3 when the load is low, which seems wrong.

@edoakes
Contributor

edoakes commented Oct 22, 2020

huh, ok, that's actually sort of promising because it indicates that there's just some off-by-one or similar bug somewhere instead of a more intrinsic problem. @ericl does this mean you were able to easily reproduce the issue?

@PidgeyBE
Contributor Author

PidgeyBE commented Oct 22, 2020

Indeed, I think there is a notion of pending nodes in the k8s autoscaler.
e.g. If I add

setup_commands:
    - sleep 600

it does not influence the problem (doesn't make it better or worse)...

I'm not sure about the off-by-one thing...
It starts with 0/0; at this time the unmanaged node is already present and not counted.
Then, when deploying the actor, it goes to 0/1.
After some time, the status becomes 1/1, but the added GPU is not yet visible in the resource list.
At the point the GPU becomes visible in the ResourceUsage list, the target nodes are 1/2 and an extra node is spawned.
Then it goes to 2/2 (while the actor is not yet deployed, 0.0/1.0 GPU) and quickly to 2/1 as soon as the actor is actually deployed (1.0/1.0 GPU).

@ericl
Contributor

ericl commented Oct 22, 2020

@edoakes I was just reading the logs above.

It's possible this PR fixes the issue: #11458 (it should be available in nightly wheels by now)

@edoakes edoakes added P1 (Issue that should be fixed within a few weeks) and P2 (Important issue, but not time-critical), and removed triage (Needs triage (eg: priority, bug/not-bug, and owning component)) and P1 labels Oct 22, 2020
@edoakes
Contributor

edoakes commented Oct 26, 2020

@PidgeyBE have you had a chance to verify that the above PR fixes your issue?


@PidgeyBE
Contributor Author

@edoakes Not yet. I'll try to do it today! Our build chain is not so flexible for changing Ray versions, but if the API is stable between 1.0 and the nightly it might go faster. I'll try.

@PidgeyBE
Contributor Author

@edoakes I tried with the latest nightly wheel, but the autoscaling monitor doesn't seem to start, so I can't test...

2020-10-27 16:49:26,179	INFO log_timer.py:27 -- NodeUpdater: ray-head-qmcx9: Ray start commands succeeded [LogTimer=5822ms]
2020-10-27 16:49:26,179	INFO log_timer.py:27 -- NodeUpdater: ray-head-qmcx9: Applied config 47da8d49fe813c3c58828168d32f7bc8b397761c  [LogTimer=49794ms]
2020-10-27 16:49:26,269	INFO commands.py:716 -- get_or_create_head_node: Head node up-to-date, IP address is: 10.208.0.6
2020-10-27 16:48:33,431	INFO commands.py:204 -- Cluster: dynamic-ray
To monitor autoscaling activity, you can run:
  ray exec /etc/ray/config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
To open a console on the cluster:
  ray attach /etc/ray/config.yaml
To get a remote shell to the cluster manually, run:
  kubectl -n dynamic-ray exec -it ray-head-qmcx9 bash

This looks ok (as in ray 1.0.0), but when I look into the ray head, it seems like the monitor doesn't start:

(regulator) root@ray-head-qmcx9:/app/core# tail -n 100 -f /tmp/ray/session_latest/logs/monitor*
==> /tmp/ray/session_latest/logs/monitor.err <==
                                                                '"conda '
                                                                'activate '
                                                                'regulator" '
                                                                '>> '
                                                                '~/.bashrc; '
                                                                'trap : '
                                                                'TERM INT; '
                                                                'sleep '
                                                                'infinity '
                                                                '& wait;'],
                                                       'command': ['/bin/bash',
                                                                   '-c',
                                                                   '--'],
                                                       'env': ....}},
                      'service': {'apiVersion': 'v1',
                                  'kind': 'Service',
                                  'spec': {'p.....
     'worker_setup_commands': ['./deploy/install-pipelines.sh',
                               'python ./main/run_on_worker_start.py'],
     'worker_start_ray_commands': ['ray stop',
                                   'ulimit -n 65536; ray start '
                                   '--num-cpus=$MY_CPU_REQUEST '
                                   '--address=$RAY_HEAD_IP:6379 '
                                   '--object-manager-port=8076 '
                                   '--resources=\'{"READY":9999}\'']}

==> /tmp/ray/session_latest/logs/monitor.out <==

Also when I start actors, no workers are spawned...

@simon-mo
Contributor

simon-mo commented Nov 5, 2020

I will try to reproduce this on a local k8s instance.

@simon-mo
Contributor

simon-mo commented Nov 9, 2020

Ok so I tried running this script:

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def spin():
    import time
    time.sleep(10000)

ray.get(spin.remote())

This was on the nightly wheel + Docker image, using minikube on a pretty beefy machine (32 cores). Running:

$ ray up cluster.yaml
Cluster: default

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
2020-11-09 06:30:43,901	INFO node_provider.py:112 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
2020-11-09 06:30:43,924	INFO node_provider.py:85 -- KubernetesNodeProvider: Caught a 409 error while setting node tags. Retrying...
  New status: waiting-for-ssh
  [1/6] Waiting for SSH to become available
    Running `uptime` as a test.
2020-11-09 06:30:44,442	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'
error: unable to upgrade connection: container not found ("ray-node")
    SSH still not available (Exit Status 1): kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)', retrying in 5 seconds.
2020-11-09 06:30:49,569	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'
 06:30:50 up 3 days, 22:46,  0 users,  load average: 0.69, 0.47, 0.35
    Success.
  Updating cluster configuration. [hash=a9714d6ad81e442dc995ee12c944110d9ddddcde]
  New status: syncing-files
  [3/6] Processing file mounts
2020-11-09 06:30:50,132	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'
/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/command_runner.py:204: UserWarning: NodeUpdater: ray-head-mp2vs: rsync failed: '[Errno 2] No such file or directory: '/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/kubectl-rsync.sh''. Falling back to 'kubectl cp'
  warnings.warn(
  [4/6] No worker file mounts to sync
  New status: setting-up
  [3/6] No initialization commands to run.
  [4/6] No setup commands to run.
  [6/6] Starting the Ray runtime
2020-11-09 06:30:50,937	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'
Did not find any active Ray processes.
2020-11-09 06:30:51,989	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ulimit -n 65536; ray start --head --num-cpus=$MY_CPU_REQUEST --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'
Local node IP: 172.17.0.3
2020-11-09 06:30:53,338	INFO services.py:1110 -- View the Ray dashboard at http://172.17.0.3:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.17.0.3:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /home/ubuntu/robovision-repro/cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /home/ubuntu/robovision-repro/cluster.yaml
  Get a remote shell to the cluster manually:
    kubectl -n ray exec -it ray-head-mp2vs bash


$ ray submit cluster.yaml request_resource.py
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/command_runner.py:204: UserWarning: NodeUpdater: ray-head-mp2vs: rsync failed: '[Errno 2] No such file or directory: '/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/kubectl-rsync.sh''. Falling back to 'kubectl cp'
  warnings.warn(
2020-11-09 06:31:01,227	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/request_resource.py)'
2020-11-09 06:31:01,989	INFO worker.py:672 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
2020-11-09 06:31:45,315	WARNING worker.py:1111 -- The monitor failed with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 356, in <module>
    monitor.run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 301, in run
    self._run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 257, in _run
    self.autoscaler.update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 139, in update
    raise e
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 128, in update
    self._update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 222, in _update
    self.load_metrics.get_static_node_resources_by_ip())
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 84, in get_nodes_to_launch
    self._infer_legacy_node_resources_if_needed(static_node_resources)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 182, in _infer_legacy_node_resources_if_needed
    assert len(static_node_resources) == 1  # Only the head node.
AssertionError

^CTraceback (most recent call last):
  File "/root/request_resource.py", line 10, in <module>
    ray.get(spin.remote())
  File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1465, in get
    values = worker.get_objects(object_refs, timeout=timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 315, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 817, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt
command terminated with exit code 1
2020-11-09 06:32:04,619	ERROR command_runner.py:175 -- NodeUpdater: ray-head-mp2vs: Command failed:

  kubectl -n ray exec -it ray-head-mp2vs --'bash --login -c -i '"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/request_resource.py)'"'"''

I saw what @PidgeyBE observed: even though the head node has enough resources, an extra worker is started, and after 10+ seconds that worker somehow crashed. And the monitor failed with:

  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 84, in get_nodes_to_launch
    self._infer_legacy_node_resources_if_needed(static_node_resources)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 182, in _infer_legacy_node_resources_if_needed
    assert len(static_node_resources) == 1  # Only the head node.
AssertionError
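
As a side note, a quick way to double-check whether the head node really has spare capacity at the moment the extra worker appears (these are standard Ray 1.x calls; the CPU threshold is just an example):

import ray

ray.init(address="auto")

# Compare what the cluster advertises with what is currently free.
print("total:", ray.cluster_resources())
print("free:", ray.available_resources())

# If CPUs are still free while a new worker pod is being launched, the
# scale-up comes from the autoscaler's bookkeeping rather than from real,
# unmet resource demand.
if ray.available_resources().get("CPU", 0) >= 1:
    print("spare CPU available; the extra worker looks spurious")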

@AmeerHajAli
Contributor

@simon-mo, thanks for reporting this.
Will make a quick fix.

@PidgeyBE
Contributor Author

PidgeyBE commented Nov 9, 2020

I haven't seen the exact issue @simon-mo ran into, but I have faced several others. I was wondering whether k8s autoscaling is tested in CI/CD?

@ericl
Contributor

ericl commented Nov 23, 2020

This is fixed in the new autoscaler (nightly); please re-open if it still happens there.
