[Data] Streaming executor backpressure #40754

raulchen · 2023-10-27T18:33:47Z

Ray Data has now switched to the streaming execution backend. For Datasets that don't have aggregation operators, all data should be streamed through all the operators. However, if any operator is slow, data will pile up in the buffer and may cause OOM, disk spilling, or even out-of-disk errors.

As of Ray 2.7, we have implemented the following backpressure mechanisms:

Resource-based backpressure: check if there are enough resources to run a new op.
Actor pool map backpressure: check free slots for actor-based map operators.
Prioritize ops with the least output buffer: the idea is to allocate more resources to the downstream operators.
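
To make the three checks above concrete, here is a minimal sketch of how a scheduler could combine them when picking the next operator to run. It is not the Ray Data implementation: `OpState`, `select_op_to_run`, and all field names are hypothetical, and only logical CPUs are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpState:
    """Hypothetical, simplified view of one operator (not a Ray class)."""
    name: str
    cpus_per_task: float                    # logical CPUs needed for one more task
    input_queue: int                        # blocks waiting to be processed
    output_buffer: int                      # blocks produced but not yet consumed
    free_actor_slots: Optional[int] = None  # None means a task-based operator


def select_op_to_run(ops: List[OpState], available_cpus: float) -> Optional[OpState]:
    """Return the operator to launch a task for, or None if all are blocked."""
    eligible = []
    for op in ops:
        if op.input_queue == 0:
            continue  # nothing to process
        if op.cpus_per_task > available_cpus:
            continue  # resource-based backpressure
        if op.free_actor_slots is not None and op.free_actor_slots <= 0:
            continue  # actor pool backpressure: no free slot
        eligible.append(op)
    if not eligible:
        return None
    # Prefer the operator with the smallest output buffer, which tends to
    # favor downstream operators that drain upstream outputs.
    return min(eligible, key=lambda op: op.output_buffer)


ops = [
    OpState("read", 1, input_queue=10, output_buffer=8),
    OpState("map", 1, input_queue=8, output_buffer=2, free_actor_slots=0),
    OpState("write", 1, input_queue=2, output_buffer=0),
]
chosen = select_op_to_run(ops, available_cpus=2)
print(chosen.name)
```

In this toy pipeline the map operator is skipped because its actor pool is full, and the write operator wins over the read operator because it has the smaller output buffer.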

Despite the above mechanisms, there are still some scenarios where backpressure doesn't work properly.

The executor can allocate all resources to the upstream operators, leaving downstream operators with no resources to run and consume the upstream outputs.
1. An experimental feature (concurrency-cap backpressure) will be implemented in 2.8 to address this issue. It's disabled by default in 2.8. See [Data] Cap op concurrency with exponential ramp-up #40275 for how to enable it and other details.
When a single task is too big, its data is output in a streaming manner, but that output is not subject to backpressure.
1. If you see this issue, you may want to increase the parallelism in your read op (e.g., ray.data.read_images(..., parallelism=N)), so that each task is more fine-grained. This solves most cases, unless a single file is too big.
2. A new experimental feature (streaming output backpressure) will be implemented in 2.8. It's also disabled by default. See [data] implement streaming output backpressure #40387 for how to enable it and other details.
A slow consumer doesn't trigger backpressure. See [data] slow consumers don't trigger backpressure #40753. A fix is planned for 2.9.
The resource-based backpressure doesn't consider the real resource usage.
1. Currently it only considers the logical resources (e.g., num_cpus/num_gpus). We should use the metrics in OpRuntimeMetrics to make the calculation more accurate.
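
For readers who want a feel for the concurrency-cap idea in the first scenario, the following is a toy sketch of a per-operator cap with exponential ramp-up. It is not the code from #40275: the class name `ConcurrencyCap`, its methods, and the constants (initial cap, multiplier, threshold) are assumptions chosen for illustration.

```python
class ConcurrencyCap:
    """Toy per-operator concurrency cap; all constants are illustrative."""

    def __init__(self, init_cap: int = 4, multiplier: int = 2, threshold: float = 0.5):
        self.cap = init_cap           # max tasks allowed to run at once
        self.multiplier = multiplier  # factor applied when the cap is raised
        self.threshold = threshold    # fraction of the cap that must finish first
        self.running = 0
        self.finished = 0             # tasks finished since the last raise

    def can_launch(self) -> bool:
        return self.running < self.cap

    def on_task_started(self) -> None:
        self.running += 1

    def on_task_finished(self) -> None:
        self.running -= 1
        self.finished += 1
        # Raise the cap only after enough tasks have completed, so an upstream
        # operator cannot grab every resource before downstream ops start.
        if self.finished >= self.cap * self.threshold:
            self.cap *= self.multiplier
            self.finished = 0


cap = ConcurrencyCap(init_cap=2, multiplier=2, threshold=0.5)
launched = 0
while cap.can_launch():
    cap.on_task_started()
    launched += 1
blocked_at = launched        # launches stop once the initial cap is reached
cap.on_task_finished()       # one completion meets the threshold of 2 * 0.5
cap_after_ramp_up = cap.cap
print(blocked_at, cap_after_ramp_up)
```

The point of the ramp-up is that an operator starts with a small share of the cluster and earns more concurrency only as its tasks complete.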

raulchen · 2023-11-16T01:25:08Z

Updates on the 2 new backpressure policies:

StreamingOutput backpressure works well on benchmarks. It will be enabled by default in 2.9.
ConcurrencyCap backpressure still needs more improvements. It will be enabled by default in 2.10. [data] Enable ConcurrencyCap backpressure by default #41193
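
As a rough illustration of what streaming output backpressure means, the single-threaded simulation below pauses a streaming producer whenever too many of its blocks are waiting to be consumed. It is a sketch of the general idea only, not the StreamingOutputBackpressurePolicy code; `produce_blocks`, `run_with_backpressure`, and their parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Iterator, List, Tuple


def produce_blocks(n: int) -> Iterator[int]:
    """Stand-in for a task that streams out n blocks one at a time."""
    for i in range(n):
        yield i


def run_with_backpressure(
    num_blocks: int, max_buffered: int, consume_every: int
) -> Tuple[int, List[int]]:
    """Simulate a fast producer and a slow consumer.

    Returns the peak buffer size and the blocks in consumption order.
    """
    producer = produce_blocks(num_blocks)
    buffer: Deque[int] = deque()
    consumed: List[int] = []
    peak = 0
    step = 0
    done = False
    while not done or buffer:
        step += 1
        # Backpressure: only pull from the producer while the buffer has room.
        if not done and len(buffer) < max_buffered:
            try:
                buffer.append(next(producer))
            except StopIteration:
                done = True
        peak = max(peak, len(buffer))
        # The slow consumer takes one block every `consume_every` steps.
        if buffer and step % consume_every == 0:
            consumed.append(buffer.popleft())
    return peak, consumed


peak, consumed = run_with_backpressure(num_blocks=10, max_buffered=3, consume_every=4)
print(peak, consumed)
```

Without the `len(buffer) < max_buffered` check, the buffer in this simulation would grow toward the full output of the task, which is the pile-up described in the issue.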

bveeramani · 2024-03-04T21:29:53Z

Fixed by #43171

raulchen added the labels P1 (issue that should be fixed within a few weeks), data (Ray Data-related issues), and ray 2.9 (issues targeting the Ray 2.9 release, ~Q4 CY2023) Oct 27, 2023

raulchen assigned raulchen and bveeramani Oct 27, 2023

raulchen mentioned this issue Nov 16, 2023

[data] Add object_store_memory to incremental_resource_usage() #41190

Closed

raulchen added the data-stability label Nov 16, 2023

This was referenced Nov 16, 2023

[data] Allow specifying memory resource for data operators #41191

Open

[data] Enable ConcurrencyCap backpressure by default #41193

Closed

raulchen mentioned this issue Nov 22, 2023

[data] Enable streaming output backpressure #41327

Merged

8 tasks

scottjlee mentioned this issue Dec 4, 2023

[Data] Serial running _sample_block greatly harms performance when sorting #41356

Open

anyscalesam added the ray 2.10 label and removed the ray 2.9 label (issues targeting the Ray 2.9 release, ~Q4 CY2023) Dec 4, 2023

c21 unassigned raulchen and bveeramani Dec 11, 2023

raulchen mentioned this issue Jan 6, 2024

[data] Improve StreamingOutputBackpressurePolicy #42217

Closed

anyscalesam assigned raulchen Feb 6, 2024

xhook mentioned this issue Feb 10, 2024

[data] slow consumers don't trigger backpressure #40753

Closed

raulchen mentioned this issue Feb 22, 2024

[data] Enable per-op resource reservation #43171

Merged

8 tasks

bveeramani closed this as completed Mar 4, 2024
