Closed
Labels
community-backlog, data (Ray Data-related issues), question (Just a question :)
Description
Is there a reason why ray.data.Dataset.iter_batches executes the dataset's lazy transformations in full?
We have an application in which an extremely large Ray Dataset is consumed by a generator, so iter_batches is used to yield the data for that use case.
However, calling iter_batches triggers execution of the entire transformation pipeline, which exceeds the object store capacity and results in an OOM error (a minimal sketch of the pattern is shown below).
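For context, here is a minimal sketch of the pattern we use; the Parquet path, batch size, and the identity transform are placeholders rather than our actual pipeline:

```python
import ray

# Build a lazily transformed Ray Dataset (path and transform are placeholders).
ds = ray.data.read_parquet("s3://bucket/very-large-dataset/")
ds = ds.map_batches(lambda batch: batch)  # stand-in for our real transformation

def batch_generator():
    # In our experience, the first call to iter_batches triggers execution of
    # the whole transformation pipeline, which fills the object store and
    # causes an OOM for very large datasets.
    for batch in ds.iter_batches(batch_size=4096):
        yield batch

for batch in batch_generator():
    ...  # consume batches downstream
```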
- Why doesn't the object store spill to disk to prevent the OOM error?
- Why does iter_batches need to execute the lazy transformations in full?
- Is a streaming version of iter_batches planned?
Thanks
Use case
Another user also reports needing a streaming iter_batches; we have similar applications in mind:
https://discuss.ray.io/t/ray-dataset-from-iterabledataset-no-lazy-implementation/20630