Keeping eager execution as an option still causes problems, such as the memory tracking for blocks (complicated right now, and making the transition to the new execution backend difficult: #30903) and the in-place conversion from eager to lazy via ds.lazy() (confusing semantics). Making Dataset lazy-only will make the execution semantics clearer and let us clean up the complexity around handling block GC.
In particular, for the memory model we'll rely solely on whether the blocks are "owned" by the consumer: blocks that are owned by the consumer can be eagerly released. Blocks are not owned in these cases:
input blocks for from_XXX;
output blocks from split(); and
output blocks from fully_executed().
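The ownership rule above can be sketched as follows. This is a minimal illustration, not Ray's implementation: `BlockList`, `owned_by_consumer`, `release_after_use`, and the dict-based `store` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class BlockList:
    """Hypothetical stand-in for a list of block references."""

    block_ids: List[str]
    # True: the consumer owns the blocks, so they may be freed right after use.
    owned_by_consumer: bool


def release_after_use(blocks: BlockList, store: Dict[str, Any]) -> int:
    """Free the blocks from `store` only if the consumer owns them.

    Returns the number of blocks that were released.
    """
    if not blocks.owned_by_consumer:
        # Not owned (e.g. from_XXX input, split() or fully_executed() output):
        # someone else may still reference these blocks, so keep them.
        return 0
    released = 0
    for block_id in blocks.block_ids:
        if store.pop(block_id, None) is not None:
            released += 1
    return released


# A toy object store holding three blocks.
store = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
intermediate = BlockList(["a", "b"], owned_by_consumer=True)
user_input = BlockList(["c"], owned_by_consumer=False)

released_owned = release_after_use(intermediate, store)
released_input = release_after_use(user_input, store)
```

With a single ownership flag, the release decision no longer needs to know which API produced the blocks.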
Key items:
Make from_XXX APIs lazy: currently they create an eager dataset since they take an in-memory blocklist. We will handle from_XXX() and split() in a unified way, i.e. by creating a lazy dataset that takes in a materialized blocklist that is NOT owned (cannot be eagerly released after use).
Make fully_executed() and split() produce blocklists that are NOT owned by the consumer (cannot be eagerly released after use).
Deprecate the .lazy() API: there will be no eager dataset, so this API will be obsolete.
Remove the run_by_consumer arg: it indicates whether the blocklists are produced by consumption APIs (if so, the blocks can be eagerly released after use). With lazy-only, run_by_consumer would always be True, so it is no longer needed.
Remove the allow_clear_input_blocks arg: this is also used to determine whether memory can be eagerly released. With lazy-only, it would also always be True, so it is no longer needed.
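To make the key items concrete, here is a toy lazy-only dataset. It is a sketch under stated assumptions, not the actual Dataset code: the `LazyDataset` class, its `from_items`/`map`/`take_all` methods, the `input_owned` flag, and the use of `list.clear()` as a stand-in for releasing a block are all hypothetical. It only mirrors the proposal that from_XXX, split(), and fully_executed() hand out blocklists that are not owned, while intermediate blocks are owned and released eagerly.

```python
from typing import Callable, List, Optional

Block = List[int]


class LazyDataset:
    """Toy lazy-only dataset (hypothetical, for illustration only)."""

    def __init__(
        self,
        input_blocks: List[Block],
        input_owned: bool,
        stages: Optional[List[Callable[[Block], Block]]] = None,
    ):
        self._input_blocks = input_blocks
        self._input_owned = input_owned
        self._stages = stages or []

    @classmethod
    def from_items(cls, items: List[int], num_blocks: int = 2) -> "LazyDataset":
        # Materialized input is NOT owned: it must survive repeated execution.
        blocks = [items[i::num_blocks] for i in range(num_blocks)]
        return cls(blocks, input_owned=False)

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Only record the stage; nothing runs until a consumption API is called.
        def block_fn(block: Block) -> Block:
            return [fn(x) for x in block]

        return LazyDataset(
            self._input_blocks, self._input_owned, self._stages + [block_fn]
        )

    def _execute(self) -> List[Block]:
        blocks, owned = self._input_blocks, self._input_owned
        for stage in self._stages:
            outputs = [stage(b) for b in blocks]
            if owned:
                for b in blocks:
                    b.clear()  # stand-in for eagerly releasing an owned block
            # Outputs of a stage are intermediate, hence owned by the consumer.
            blocks, owned = outputs, True
        return blocks

    def fully_executed(self) -> "LazyDataset":
        # The materialized snapshot is NOT owned by later consumers.
        return LazyDataset(self._execute(), input_owned=False)

    def split(self, n: int) -> List["LazyDataset"]:
        # Each split is a lazy dataset over materialized, NOT owned blocks.
        blocks = self._execute()
        return [LazyDataset(blocks[i::n], input_owned=False) for i in range(n)]

    def take_all(self) -> List[int]:
        return [x for block in self._execute() for x in block]


ds = LazyDataset.from_items([1, 2, 3, 4])
mapped = ds.map(lambda x: x * 2).map(lambda x: x + 1)
first_run = sorted(mapped.take_all())
second_run = sorted(mapped.take_all())  # input blocks were not released

snapshot = mapped.fully_executed()
scaled_once = sorted(snapshot.map(lambda x: x * 10).take_all())
scaled_twice = sorted(snapshot.map(lambda x: x * 10).take_all())
shards = mapped.split(2)
```

Because the ownership flag lives on the blocklist itself, this sketch needs no equivalent of run_by_consumer or allow_clear_input_blocks.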
c21 added the P0 (issue that must be fixed in short order) and P1 (issue that should be fixed within a few weeks) labels and removed the P1 and P0 labels on Feb 16, 2023
The Dataset is lazy by default with #31286.
@ericl @clarkzinzow @c21