[Core] Nested remote function slow start up time #45926

Open
sitaowang1998 opened this issue Jun 13, 2024 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P2 (Important issue, but not time-critical), performance

Comments

@sitaowang1998

What happened + What you expected to happen

Symptom

Ray has significant overhead when running nested remote functions for the first time.

[Timeline chart of per-task execution times and the gaps between them]
In the chart above, the blue, green, and purple bars represent the time spent in the actual functions, while the orange and red bars represent the time Ray spends in between. For the first batch of tasks, Ray takes a long time to start each nested remote function; for later tasks, this overhead is much rarer.

Analysis

Since there are more tasks than CPUs, the top-layer functions consume all the CPUs, and Ray has to start new worker processes for the second- and third-layer functions.
The logs show that most of the time is spent between a new worker process being started and that process initializing its CoreWorker and connecting to the raylet. However, Ray spends much less time creating worker processes for the first-layer functions.
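For reference, these gaps can also be inspected with Ray's built-in Chrome trace. The snippet below is only a sketch; it assumes it runs in the same driver session, right after the reproduction script below has finished:

import ray

# Dump all task/worker events in Chrome tracing format. Open the file in
# chrome://tracing or https://ui.perfetto.dev to see the gap between a worker
# process starting and the nested task beginning to execute.
ray.timeline(filename="timeline.json")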

Versions / Dependencies

Ray 2.9.3
Python 3.8.0
OS: Ubuntu 18.04.1 LTS

Reproduction script

import ray
import time
from datetime import datetime


@ray.remote
def sleep_iter(iter: int):
    # Record start/end timestamps around the actual work (a 0.3 s sleep).
    start_time = datetime.now()
    time.sleep(0.3)
    end_time = datetime.now()

    # Recursively launch a nested remote task until the requested depth is reached.
    iter = iter - 1
    if iter > 0:
        times = ray.get(sleep_iter.remote(iter))
        return [start_time.timestamp() * 1000, end_time.timestamp() * 1000] + times

    return [start_time.timestamp() * 1000, end_time.timestamp() * 1000]


num_tasks = 100  # larger than the number of CPUs in the cluster
tasks = [sleep_iter.remote(3) for _ in range(num_tasks)]
ray.get(tasks)

Issue Severity

Low: It annoys or frustrates me.

@sitaowang1998 sitaowang1998 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 13, 2024
@ruisearch42 ruisearch42 added the core Issues that should be addressed in Ray Core label Jun 13, 2024
@jjyao
Collaborator

jjyao commented Jun 24, 2024

Hi @sitaowang1998, maybe you are hitting worker_maximum_startup_concurrency?
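If that is the cause, one quick experiment would be to raise the limit and re-run the reproduction script. This is only a sketch: _system_config is an internal, unstable option, and whether this particular key takes effect this way is an assumption, not something confirmed in this thread.

import ray

# Hypothetical experiment: raise the cap on concurrent worker-process startups
# before running the reproduction script again, and check whether the
# first-batch gaps shrink.
ray.init(_system_config={"worker_maximum_startup_concurrency": 64})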

@jjyao jjyao added performance P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 24, 2024
@sitaowang1998
Author

There is no warning about exceeding worker_maximum_startup_concurrency in the logs.
