[Bug] Can't scaler up when using autoscaler v2 #2223

Open
1 of 2 tasks
yx367563 opened this issue Jul 8, 2024 · 6 comments
Labels: autoscaler, bug

Comments

@yx367563

yx367563 commented Jul 8, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

In the same environment, switching only between autoscaler v1 and v2, I submit 8000 tasks in a single batch. With v1 everything works normally, but with v2 the cluster always gets stuck and cannot scale up.
Versions:
Ray 2.23.0
KubeRay 1.1.1

Reproduction script

import random
import time

import ray


# Each inner task reserves 8 CPUs and stays busy for 2 to 10 minutes.
@ray.remote(max_retries=5, num_cpus=8)
def inside_ray_task():
    sleep_time = random.randint(120, 600)

    start_time = time.perf_counter()
    # Poll in 1 ms steps until sleep_time seconds have elapsed.
    while time.perf_counter() - start_time < sleep_time:
        time.sleep(0.001)


# The outer task runs on the cluster and submits all 8000 inner tasks at once.
@ray.remote(max_retries=0)
def outside_ray_task():
    future_list = [inside_ray_task.remote() for _ in range(8000)]
    ray.get(future_list)


if __name__ == '__main__':
    # Connect through Ray Client, then block until every inner task finishes.
    ray.init("ray://localhost:10001")
    ray.get(outside_ray_task.remote())
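
To put the size of this request in perspective, here is a rough sketch of the demand the script places on the autoscaler. The task count and per-task CPU request come from the script above; the worker size of 32 CPUs is a hypothetical value chosen only for illustration, since the issue does not state the actual worker group configuration. The `estimate_demand` helper is likewise illustrative and not part of Ray.

```python
import math


def estimate_demand(num_tasks: int, cpus_per_task: int, cpus_per_node: int) -> dict:
    """Estimate the CPU demand and node count needed to run every task at once."""
    if cpus_per_task > cpus_per_node:
        raise ValueError("a single task does not fit on one node")
    # Whole tasks that fit on one node; leftover CPUs on a node stay idle.
    tasks_per_node = cpus_per_node // cpus_per_task
    return {
        "total_cpus": num_tasks * cpus_per_task,
        "tasks_per_node": tasks_per_node,
        "nodes_for_full_parallelism": math.ceil(num_tasks / tasks_per_node),
    }


# 8000 tasks and 8 CPUs per task are taken from the reproduction script.
# cpus_per_node=32 is an assumption, not a value from the issue.
# The estimate ignores the CPU held by outside_ray_task itself
# (Ray tasks request 1 CPU by default).
demand = estimate_demand(num_tasks=8000, cpus_per_task=8, cpus_per_node=32)
print(demand)
```

Under that assumption the pending demand is 64000 CPUs, far more than a typical cluster maximum, so the autoscaler is expected to scale to its configured upper bound and leave the remaining tasks queued.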


Anything else

What recent progress has been made on autoscaler v2? It seems it has not been updated for a long time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
yx367563 added the bug and triage labels on Jul 8, 2024
@jjyao
Contributor

jjyao commented Jul 8, 2024

Hi @yx367563, for now please use autoscaler v1. v2 development is paused right now due to limited resources.

@yx367563
Author

yx367563 commented Jul 9, 2024

@jjyao In fact, I want to use autoscaler v2 only because v1 has a problem with killing working nodes (ray-project/ray#46492). I was recommended to try v2, and it did eliminate that bug. Is there any solution for v1?

@rickyyx
Contributor

rickyyx commented Jul 10, 2024

Thanks for reporting, @yx367563. Would it be easy for you to share some head node logs (particularly the monitor logs) with v2?
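
For anyone gathering the logs requested above, a minimal sketch for cutting a large log down to the lines worth sharing. The `filter_log_lines` helper, the keyword list, and the log path are all assumptions made for illustration; the actual location of the monitor logs depends on how the head pod and autoscaler container are set up.

```python
from pathlib import Path
from typing import Iterable, List


def filter_log_lines(
    lines: Iterable[str], keywords: Iterable[str], max_lines: int = 200
) -> List[str]:
    """Return the last max_lines lines that contain any keyword (case-insensitive)."""
    if max_lines <= 0:
        return []
    lowered = [k.lower() for k in keywords]
    matches = [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]
    return matches[-max_lines:]


if __name__ == "__main__":
    # Assumed path; adjust to wherever the monitor logs live in your deployment.
    log_path = Path("/tmp/ray/session_latest/logs/monitor.log")
    if log_path.exists():
        with log_path.open(errors="replace") as f:
            for line in filter_log_lines(f, ["error", "launch", "pending"]):
                print(line)
```

Filtering by keyword keeps the shared excerpt short, but the full log is usually more useful to maintainers if it can be attached.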

@yx367563
Author

Sorry, I have stopped using autoscaler v2. I hope this bug can be fixed in v1 (ray-project/ray#46492).

@rickyyx
Contributor

rickyyx commented Jul 11, 2024

> Sorry, I have stopped using autoscaler v2. I hope this bug can be fixed in v1 (ray-project/ray#46492).

Sure, I will see if I have time to reproduce this on my end. Thanks!

@yx367563
Author

@rickyyx Thank you! I look forward to your feedback.
