
[autoscaler] Too many workers are scaled in kubernetes #11551

Closed
PidgeyBE opened this issue Oct 22, 2020 · 16 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical)

Comments

@PidgeyBE
Contributor

PidgeyBE commented Oct 22, 2020

What is the problem?

  • ray = 1.0.0
  • autoscaling on k8s

When autoscaling in Kubernetes, it sometimes happens (almost 50% of the time in my case) that 2 workers are scaled up instead of 1.

Reproduction (REQUIRED)

  • Have a ray autoscale cluster on kubernetes
  • Deploy an actor that requires new worker resources (a minimal sketch is included below)
  • Ray autoscale will start an extra worker
  • When the first worker is (almost) ready, the autoscaler triggers another worker to start.
    I see this happens because NumNodesConnected becomes equal to NumNodesUsed:
 - NumNodesConnected: 3
 - NumNodesUsed: 3

A few seconds later (while the extra node is already being started and the actor is deployed) I see:

 - NumNodesConnected: 3
 - NumNodesUsed: 1.2

I've cut the important part out of the monitor.err logs: monitor.err.txt

It seems like there is some race condition that causes the autoscaler to trigger a worker to start, while the actor is deploying...
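
For concreteness, here is a minimal sketch of the kind of actor deployment that triggers this for me (the resource request and class name are just placeholders, not my actual workload):

import ray

ray.init(address="auto")

# Placeholder actor: it requests a resource the head node does not have,
# so the autoscaler has to launch a new worker pod to place it.
@ray.remote(num_gpus=1)
class Worker:
    def ping(self):
        return "ready"

actor = Worker.remote()
ray.get(actor.ping.remote())  # blocks until the actor lands on the new node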

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

BR, Pieterjan
cc @edoakes

@PidgeyBE PidgeyBE added bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels Oct 22, 2020
@rkooo567
Contributor

Cc @wuisawesome probably should handle asap

@wuisawesome
Contributor

Cc @AmeerHajAli @ericl

I'll look into this, but feel free to steal it if you have the bandwidth

@wuisawesome wuisawesome self-assigned this Oct 22, 2020
@edoakes
Contributor

edoakes commented Oct 22, 2020

@wuisawesome I vaguely remember seeing this in the past. IIRC the issue is that the autoscaler has no notion of "pending" nodes, so while the node is pending in k8s (which can take awhile), it continues to add more nodes because resource utilization is still high.
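
To make that concrete, a rough sketch of the loop I have in mind (purely illustrative, made-up names, not the actual autoscaler code):

# Illustration only: hypothetical helper, not Ray's autoscaler implementation.
def extra_workers_to_launch(connected_nodes, pending_nodes, demanded_nodes,
                            count_pending=True):
    visible = len(connected_nodes)
    if count_pending:
        visible += len(pending_nodes)  # pods that are still booting in k8s
    # If pending pods are ignored, 'visible' stays low while the demand is
    # still unmet, so every update cycle requests yet another worker.
    return max(0, demanded_nodes - visible)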

@ericl
Contributor

ericl commented Oct 22, 2020

What's weird is that the autoscaler is reporting it has 1/1 target nodes, but 2 nodes connected. There seems to be a connected node that is not managed by the autoscaler, which leads to an off-by-one error.

@wuisawesome is this related to that issue you had with an unmanaged node being counted as a node?

@edoakes I do see
2020-10-20 09:30:33,392 INFO autoscaler.py:542 -- Cluster status: 1/2 target nodes (1 pending) (1 updating)
so it seems we do have pending nodes. What's odd is that the number of "used" nodes is 3 when the load is low, which seems wrong.

@edoakes
Contributor

edoakes commented Oct 22, 2020

huh, ok, that's actually sort of promising because it indicates that there's just some off-by-one or similar bug somewhere instead of a more intrinsic problem. @ericl does this mean you were able to easily reproduce the issue?

@PidgeyBE
Contributor Author

PidgeyBE commented Oct 22, 2020

Indeed, I think there is a notion of pending nodes in the k8s autoscaler.
e.g. If I add

setup_commands:
    - sleep 600

it does not influence the problem (doesn't make it better or worse)...

I'm not sure about the off-by-one thing...
It starts with 0/0; at this time the unmanaged node is already present and not counted.
Then, when deploying the actor, it goes to 0/1.
After some time, the status becomes 1/1, but the added GPU is not yet visible in the resource list.
At the point the GPU becomes visible in the ResourceUsage list, the target nodes are 1/2 and an extra node is spawned.
Then it goes to 2/2 (while the actor is not yet deployed, 0.0/1.0 GPU) and quickly to 2/1 as soon as the actor is actually deployed (1.0/1.0 GPU).

@ericl
Contributor

ericl commented Oct 22, 2020

@edoakes I was just reading the logs above.

It's possible this PR fixes the issue: #11458 (it should be available in nightly wheels by now)

@edoakes edoakes added P1 (Issue that should be fixed within a few weeks) and P2 (Important issue, but not time-critical), and removed triage (Needs triage (eg: priority, bug/not-bug, and owning component)) and P1 labels Oct 22, 2020
@edoakes
Contributor

edoakes commented Oct 26, 2020

@PidgeyBE have you had a chance to verify that the above PR fixes your issue?


@PidgeyBE
Contributor Author

@edoakes Not yet. I'll try to do it today! Our build chain is not so flexible for changing Ray versions, but if the API is stable between 1.0 and the nightly it might go faster. I'll try.

@PidgeyBE
Contributor Author

@edoakes I tried with the latest nightly wheel, but the autoscaling monitor doesn't seem to start, so I can't test...

2020-10-27 16:49:26,179	INFO log_timer.py:27 -- NodeUpdater: ray-head-qmcx9: Ray start commands succeeded [LogTimer=5822ms]
2020-10-27 16:49:26,179	INFO log_timer.py:27 -- NodeUpdater: ray-head-qmcx9: Applied config 47da8d49fe813c3c58828168d32f7bc8b397761c  [LogTimer=49794ms]
2020-10-27 16:49:26,269	INFO commands.py:716 -- get_or_create_head_node: Head node up-to-date, IP address is: 10.208.0.6
2020-10-27 16:48:33,431	INFO commands.py:204 -- Cluster: dynamic-ray
To monitor autoscaling activity, you can run:
  ray exec /etc/ray/config.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
To open a console on the cluster:
  ray attach /etc/ray/config.yaml
To get a remote shell to the cluster manually, run:
  kubectl -n dynamic-ray exec -it ray-head-qmcx9 bash

This looks ok (as in ray 1.0.0), but when I look into the ray head, it seems like the monitor doesn't start:

(regulator) root@ray-head-qmcx9:/app/core# tail -n 100 -f /tmp/ray/session_latest/logs/monitor*
==> /tmp/ray/session_latest/logs/monitor.err <==
                                                                '"conda '
                                                                'activate '
                                                                'regulator" '
                                                                '>> '
                                                                '~/.bashrc; '
                                                                'trap : '
                                                                'TERM INT; '
                                                                'sleep '
                                                                'infinity '
                                                                '& wait;'],
                                                       'command': ['/bin/bash',
                                                                   '-c',
                                                                   '--'],
                                                       'env': ....}},
                      'service': {'apiVersion': 'v1',
                                  'kind': 'Service',
                                  'spec': {'p.....
     'worker_setup_commands': ['./deploy/install-pipelines.sh',
                               'python ./main/run_on_worker_start.py'],
     'worker_start_ray_commands': ['ray stop',
                                   'ulimit -n 65536; ray start '
                                   '--num-cpus=$MY_CPU_REQUEST '
                                   '--address=$RAY_HEAD_IP:6379 '
                                   '--object-manager-port=8076 '
                                   '--resources=\'{"READY":9999}\'']}

==> /tmp/ray/session_latest/logs/monitor.out <==

Also when I start actors, no workers are spawned...

@simon-mo
Contributor

simon-mo commented Nov 5, 2020

I will try to reproduce this on a local k8s instance.

@simon-mo
Contributor

simon-mo commented Nov 9, 2020

Ok so I tried running this script:

import ray

ray.init(address="auto")

@ray.remote(num_cpus=1)
def spin():
    import time
    time.sleep(10000)

ray.get(spin.remote())

This was on the nightly wheel + Docker image, using minikube on a pretty beefy machine (32 cores). Running:

$ ray up cluster.yaml
Cluster: default

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
2020-11-09 06:30:43,901	INFO node_provider.py:112 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
2020-11-09 06:30:43,924	INFO node_provider.py:85 -- KubernetesNodeProvider: Caught a 409 error while setting node tags. Retrying...
  New status: waiting-for-ssh
  [1/6] Waiting for SSH to become available
    Running `uptime` as a test.
2020-11-09 06:30:44,442	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'
error: unable to upgrade connection: container not found ("ray-node")
    SSH still not available (Exit Status 1): kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)', retrying in 5 seconds.
2020-11-09 06:30:49,569	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'
 06:30:50 up 3 days, 22:46,  0 users,  load average: 0.69, 0.47, 0.35
    Success.
  Updating cluster configuration. [hash=a9714d6ad81e442dc995ee12c944110d9ddddcde]
  New status: syncing-files
  [3/6] Processing file mounts
2020-11-09 06:30:50,132	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'
/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/command_runner.py:204: UserWarning: NodeUpdater: ray-head-mp2vs: rsync failed: '[Errno 2] No such file or directory: '/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/kubectl-rsync.sh''. Falling back to 'kubectl cp'
  warnings.warn(
  [4/6] No worker file mounts to sync
  New status: setting-up
  [3/6] No initialization commands to run.
  [4/6] No setup commands to run.
  [6/6] Starting the Ray runtime
2020-11-09 06:30:50,937	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'
Did not find any active Ray processes.
2020-11-09 06:30:51,989	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ulimit -n 65536; ray start --head --num-cpus=$MY_CPU_REQUEST --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0)'
Local node IP: 172.17.0.3
2020-11-09 06:30:53,338	INFO services.py:1110 -- View the Ray dashboard at http://172.17.0.3:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='172.17.0.3:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop
  New status: up-to-date

Useful commands
  Monitor autoscaling with
    ray exec /home/ubuntu/robovision-repro/cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
  Connect to a terminal on the cluster head:
    ray attach /home/ubuntu/robovision-repro/cluster.yaml
  Get a remote shell to the cluster manually:
    kubectl -n ray exec -it ray-head-mp2vs bash


$ ray submit cluster.yaml request_resource.py
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/command_runner.py:204: UserWarning: NodeUpdater: ray-head-mp2vs: rsync failed: '[Errno 2] No such file or directory: '/home/ubuntu/.local/lib/python3.8/site-packages/ray/autoscaler/_private/kubernetes/kubectl-rsync.sh''. Falling back to 'kubectl cp'
  warnings.warn(
2020-11-09 06:31:01,227	INFO command_runner.py:165 -- NodeUpdater: ray-head-mp2vs: Running kubectl -n ray exec -it ray-head-mp2vs -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/request_resource.py)'
2020-11-09 06:31:01,989	INFO worker.py:672 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379
2020-11-09 06:31:45,315	WARNING worker.py:1111 -- The monitor failed with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 356, in <module>
    monitor.run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 301, in run
    self._run()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 257, in _run
    self.autoscaler.update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 139, in update
    raise e
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 128, in update
    self._update()
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 222, in _update
    self.load_metrics.get_static_node_resources_by_ip())
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 84, in get_nodes_to_launch
    self._infer_legacy_node_resources_if_needed(static_node_resources)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 182, in _infer_legacy_node_resources_if_needed
    assert len(static_node_resources) == 1  # Only the head node.
AssertionError

^CTraceback (most recent call last):
  File "/root/request_resource.py", line 10, in <module>
    ray.get(spin.remote())
  File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1465, in get
    values = worker.get_objects(object_refs, timeout=timeout)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 315, in get_objects
    object_refs, self.current_task_id, timeout_ms)
  File "python/ray/_raylet.pyx", line 817, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt
command terminated with exit code 1
2020-11-09 06:32:04,619	ERROR command_runner.py:175 -- NodeUpdater: ray-head-mp2vs: Command failed:

  kubectl -n ray exec -it ray-head-mp2vs --'bash --login -c -i '"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/request_resource.py)'"'"''

I saw what @PidgeyBE observed: even though the head node has enough resources, an extra worker is started, and after 10+ seconds that worker somehow crashed. And the monitor failed with:

  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 84, in get_nodes_to_launch
    self._infer_legacy_node_resources_if_needed(static_node_resources)
  File "/root/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/resource_demand_scheduler.py", line 182, in _infer_legacy_node_resources_if_needed
    assert len(static_node_resources) == 1  # Only the head node.
AssertionError
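
As a side note, a quick way to double-check whether the head node really has spare capacity at the moment the extra worker appears (these are standard Ray 1.x calls; the CPU threshold is just an example):

import ray

ray.init(address="auto")

# Compare what the cluster advertises with what is currently free.
print("total:", ray.cluster_resources())
print("free:", ray.available_resources())

# If CPUs are still free while a new worker pod is being launched, the
# scale-up comes from the autoscaler's bookkeeping rather than from real,
# unmet resource demand.
if ray.available_resources().get("CPU", 0) >= 1:
    print("spare CPU available; the extra worker looks spurious")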

@AmeerHajAli
Contributor

@simon-mo, thanks for reporting this.
Will make a quick fix.

@PidgeyBE
Contributor Author

PidgeyBE commented Nov 9, 2020

I haven't seen the exact issue @simon-mo ran into, but I have faced several others. I was wondering whether k8s autoscaling is tested in CI/CD?

@ericl
Contributor

ericl commented Nov 23, 2020

This is fixed in the new autoscaler (nightly); please re-open if it still happens there.
