[Core] Nested remote function slow start up time #45926

Open
sitaowang1998 opened this issue Jun 13, 2024 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical), performance

Comments

@sitaowang1998

What happened + What you expected to happen

Symptom

Ray has significant overhead when running nested remote functions for the first time.

[Timeline chart of per-task execution times and the gaps between them]
In the chart above, the blue, green, and purple bars represent the time spent in the actual functions, while the orange and red bars represent the time Ray spends in between. For the first batch of tasks, Ray takes a long time to start each nested remote function; for later tasks, this overhead is much rarer.

Analysis

Since there are more tasks than CPUs, the top-layer functions consume all the CPUs, and Ray has to start new worker processes for the second- and third-layer functions.
The logs show that most of the time is spent between a new worker process being started and that process initializing its CoreWorker and connecting to the raylet. However, Ray spends much less time creating worker processes for the first-layer functions.
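For reference, these gaps can also be inspected with Ray's built-in Chrome trace. The snippet below is only a sketch; it assumes it runs in the same driver session, right after the reproduction script below has finished:

import ray

# Dump all task/worker events in Chrome tracing format. Open the file in
# chrome://tracing or https://ui.perfetto.dev to see the gap between a worker
# process starting and the nested task beginning to execute.
ray.timeline(filename="timeline.json")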

Versions / Dependencies

Ray 2.9.3
Python 3.8.0
OS: Ubuntu 18.04.1 LTS

Reproduction script

import ray
import time
from datetime import datetime


@ray.remote
def sleep_iter(iter: int):
    # Record start/end timestamps around the actual work (a 0.3 s sleep).
    start_time = datetime.now()
    time.sleep(0.3)
    end_time = datetime.now()

    # Recursively launch a nested remote task until the requested depth is reached.
    iter = iter - 1
    if iter > 0:
        times = ray.get(sleep_iter.remote(iter))
        return [start_time.timestamp() * 1000, end_time.timestamp() * 1000] + times

    return [start_time.timestamp() * 1000, end_time.timestamp() * 1000]


num_tasks = 100  # larger than the number of CPUs in the cluster
tasks = [sleep_iter.remote(3) for _ in range(num_tasks)]
ray.get(tasks)

Issue Severity

Low: It annoys or frustrates me.

@sitaowang1998 sitaowang1998 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 13, 2024
@ruisearch42 ruisearch42 added the core Issues that should be addressed in Ray Core label Jun 13, 2024
@jjyao
Collaborator

jjyao commented Jun 24, 2024

Hi @sitaowang1998, maybe you are hitting worker_maximum_startup_concurrency?
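If that is the cause, one quick experiment would be to raise the limit and re-run the reproduction script. This is only a sketch: _system_config is an internal, unstable option, and whether this particular key takes effect this way is an assumption, not something confirmed in this thread.

import ray

# Hypothetical experiment: raise the cap on concurrent worker-process startups
# before running the reproduction script again, and check whether the
# first-batch gaps shrink.
ray.init(_system_config={"worker_maximum_startup_concurrency": 64})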

@jjyao jjyao added performance P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 24, 2024
@sitaowang1998
Author

There is no warning about exceeding worker_maximum_startup_concurrency in the logs.
