Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Streaming executor backpressure #40754

Closed
raulchen opened this issue Oct 27, 2023 · 2 comments
Closed

[Data] Streaming executor backpressure #40754

raulchen opened this issue Oct 27, 2023 · 2 comments
Assignees
Labels
data Ray Data-related issues data-stability P1 Issue that should be fixed within a few weeks ray 2.10

Comments

@raulchen
Copy link
Contributor

Ray Data now has switched to the streaming execution backend. For Datasets that don't have aggregation operators, all data should be streamed through all the operators. However, if any operator is slow, data will pile up in the buffer and may cause OOM, disk spilling, or even out-of-disk errors.

As of Ray 2.7, we have implemented following backpressure mechanisms:

  1. Resource-based backpressure: check if there are enough resources to run a new op.
  2. Actor pool map backpressure: check free slots for actor-based map operators.
  3. Prioritize ops with least output buffer: the idea is to allocate more resources for the downstream operators.

Despite above mechanisms, there are still some scenarios where backpressure doesn't work properly.

  1. The executor will allocate all resources for the upstream operators, making downstream operators have no resource to run and to consume the upstream outputs.
    1. An experimental feature (concurrency-cap backpressure) will be implemented in 2.8 to address this issue. It's disabled by default in 2.8. See [Data] Cap op concurrency with exponential ramp-up #40275 for how to enable it and other details.
  2. When a single task is too big, data will be outputted in a streaming manner. But the output is not backpressure.
    1. If you see this issue, you may want to increase the parallelism in your read op (e.g., ray.data.read_image(..., parallelism=N)), so that each task is more fine-grained. This can usually solve most cases, unless one single file is too big.
    2. A new experimental feature (streaming output backpressure) will be implemented in 2.8. It's also disabled by default. See [data] implement streaming output backpressure #40387 for how to enable it and other details.
  3. Slow consumer doesn't trigger backpressure. See [data] slow consumers don't trigger backpressure #40753. Planned in 2.9.
  4. The resource-based backpressure doesn't consider the real resource usage.
    1. Currently it only consider the logical resources (e.g., num_cpus/num_gpus), we should use the metrics in OpRuntimeMetrics to make the calculation more accurate.
@raulchen raulchen added P1 Issue that should be fixed within a few weeks data Ray Data-related issues ray 2.9 Issues targeting Ray 2.9 release (~Q4 CY2023) labels Oct 27, 2023
@raulchen
Copy link
Contributor Author

Updates on the 2 new backpressure policies:

@bveeramani
Copy link
Member

Fixed by #43171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues data-stability P1 Issue that should be fixed within a few weeks ray 2.10
Projects
None yet
Development

No branches or pull requests

3 participants