[Datasets] Make Dataset lazy-only #31639

jianoaix · 2023-01-12T17:55:08Z

Dataset is lazy by default as of #31286.

Keeping eager execution as an option still causes problems: memory tracking for blocks is complicated right now and makes the transition to the new execution backend difficult (#30903), and the in-place conversion from eager to lazy (ds.lazy()) has confusing semantics. Making Dataset lazy-only will make the execution semantics clearer and let us clean up the complexity around handling block GC.

For the memory model in particular, we'll rely only on whether the blocks are "owned" by the consumer: if they are, we can release them eagerly. Blocks are not owned in the following cases:

input blocks for from_XXX;
output blocks from split(); and
output blocks from fully_executed().
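The ownership rule above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBlockList`, `owned_by_consumer`, and `run_stage` are hypothetical names invented for illustration and are not Ray's internal classes or arguments.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyBlockList:
    """Hypothetical stand-in for a materialized block list (not Ray's BlockList)."""

    blocks: List[list]
    owned_by_consumer: bool
    released: bool = False

    def release(self) -> None:
        # Drop the references so the blocks can be garbage collected.
        self.blocks = []
        self.released = True


def run_stage(inp: ToyBlockList, fn: Callable[[list], list]) -> ToyBlockList:
    """Apply fn to every block, then free the input only if the consumer owns it."""
    out_blocks = [fn(block) for block in inp.blocks]
    if inp.owned_by_consumer:
        inp.release()
    # Intermediate outputs exist only for the consumer, so they are owned.
    return ToyBlockList(out_blocks, owned_by_consumer=True)


# Input from a from_XXX-style source: the caller still holds it, so it is NOT owned.
source = ToyBlockList([[1, 2], [3, 4]], owned_by_consumer=False)
doubled = run_stage(source, lambda b: [x * 2 for x in b])
summed = run_stage(doubled, lambda b: [sum(b)])
```

In this sketch the unowned `source` survives the first stage, while the owned intermediate `doubled` is released as soon as the second stage has consumed it.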

Key items:

Make from_XXX APIs lazy: currently they create an eager dataset, since they take an in-memory blocklist. We will handle from_XXX() and split() in a unified way, i.e. by creating a lazy dataset that takes in a materialized blocklist that is NOT owned (cannot be eagerly released after use).
Make fully_executed() and split() produce blocklists that are NOT owned by the consumer (cannot be eagerly released after use).
Deprecate the .lazy() API: there will be no eager dataset, so this API will be obsolete.
Remove the run_by_consumer arg: it indicates whether the blocklists are produced by consumption APIs (if so, the blocks can be eagerly released after use). With lazy-only, run_by_consumer should always be True, so it is no longer needed.
Remove the allow_clear_input_blocks arg: this is also used to determine eager memory release. With lazy-only, it should also always be True, so it is no longer needed.
Make sure all tests (CI and nightly) pass.
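As a rough illustration of the .lazy() deprecation item, one common pattern is to keep the method as a no-op that emits a warning. The sketch below assumes a hypothetical `ToyDataset` class; it is not the change made in Ray and only shows the general shape of such a shim.

```python
import warnings


class ToyDataset:
    """Hypothetical lazy-only dataset: transformations are recorded, not run."""

    def __init__(self, items, stages=None):
        self._items = list(items)
        self._stages = list(stages or [])

    def map(self, fn):
        # Record the stage; nothing executes until a consumption call.
        return ToyDataset(self._items, self._stages + [fn])

    def lazy(self):
        # Obsolete once every dataset is lazy: warn and return self unchanged.
        warnings.warn(
            "lazy() is deprecated: the dataset is already lazy.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self

    def take_all(self):
        # Consumption call: run the recorded stages now.
        out = self._items
        for fn in self._stages:
            out = [fn(x) for x in out]
        return out


ds = ToyDataset([1, 2, 3]).map(lambda x: x + 1)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = ds.lazy()
```

Returning `self` keeps existing call chains such as `ds.lazy().map(...)` working while the warning nudges callers to drop the call.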

@ericl @clarkzinzow @c21


jianoaix added the P1 (issue that should be fixed within a few weeks) and data (Ray Data-related issues) labels Jan 12, 2023

c21 added this to the Dataset Execution Optimizer milestone Feb 16, 2023

c21 added the P0 (issue that must be fixed in short order) and P1 labels and removed the P1 and P0 labels Feb 16, 2023

jianoaix mentioned this issue Mar 28, 2023

Deprecate ds.lazy() since Dataset is lazy already #33812

Merged

8 tasks

ericl pushed a commit that referenced this issue Mar 29, 2023

Deprecate ds.lazy() since Dataset is lazy already (#33812)

dc0cee4

#31639

joncarter1 pushed a commit to joncarter1/ray that referenced this issue Apr 2, 2023

Deprecate ds.lazy() since Dataset is lazy already (ray-project#33812)

20a2038

ray-project#31639 Signed-off-by: Jonathan Carter <jonathan.carter@magd.ox.ac.uk>

jianoaix closed this as completed Apr 12, 2023

elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023

Deprecate ds.lazy() since Dataset is lazy already (ray-project#33812)

baa5c4b

ray-project#31639 Signed-off-by: elliottower <elliot@elliottower.com>

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023

Deprecate ds.lazy() since Dataset is lazy already (ray-project#33812)

a309593

ray-project#31639 Signed-off-by: Jack He <jackhe2345@gmail.com>
