[data] iter_batches needs streaming operation #49072

@kekulai-fredchang

Description

Is there a reason for ray.data.iter_batches to execute the full lazy transformation of the dataset?

We have an application in which an extremely large Ray Dataset is consumed by a generator, so iter_batches is used to yield the data for that use case.

However, iterating with iter_batches triggers execution of the entire transformation, which exceeds the object store capacity and causes an OOM error (a minimal sketch of the pattern follows the questions below).

  • why doesn't the object store spill to disk to prevent the OOM error?
  • why does iter_batches need to execute the full lazy transformation?
  • is a streaming iter_batches in the pipeline?
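
To make this concrete, here is a minimal sketch of the pattern, assuming a Parquet source; the path, batch size, and identity transform are hypothetical stand-ins for our real pipeline:

```python
import ray

# Hypothetical input; the real dataset is far larger than the object store.
ds = ray.data.read_parquet("s3://bucket/very-large-dataset/")

# Lazy transformation; nothing should execute yet.
ds = ds.map_batches(lambda batch: batch)  # stand-in for our real transform

def batch_generator():
    # We expected this loop to stream batch by batch, but in our setup
    # iteration triggers the whole transformation and the object store
    # fills up, leading to OOM.
    for batch in ds.iter_batches(batch_size=1024):
        yield batch

for batch in batch_generator():
    ...  # downstream consumer of each batch
```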

Thanks

Use case

Another user also reports needing a streaming iter_batches; we have similar applications in mind:
https://discuss.ray.io/t/ray-dataset-from-iterabledataset-no-lazy-implementation/20630
