[Datasets] Make Dataset lazy-only #31639

Closed
jianoaix opened this issue Jan 12, 2023 · 0 comments
Labels
data Ray Data-related issues P1 Issue that should be fixed within a few weeks

Comments

@jianoaix
Contributor

The Dataset is lazy by default with #31286.

Keeping eager execution as an option still causes problems: memory tracking for blocks is complicated and makes the transition to the new execution backend difficult (#30903), and the in-place conversion from eager to lazy (ds.lazy()) has confusing semantics. Making Dataset lazy-only will make the execution semantics clearer and let us clean up the complexity around block GC.

For the memory model in particular, we'll rely only on whether the blocks are "owned" by the consumer: blocks owned by the consumer can be eagerly released. Blocks are not owned in the following cases:

  • input blocks for from_XXX;
  • output blocks from split(); and
  • output blocks from fully_executed().
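The ownership rule above can be sketched with a toy model. This is only an illustration in plain Python, not Ray's implementation: the `BlockList` class, the `owned_by_consumer` flag, and `consume()` are hypothetical names, and "releasing" a block is simulated by dropping the reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockList:
    """Toy stand-in for a list of blocks (hypothetical, not Ray's BlockList)."""
    blocks: List[Optional[list]]
    owned_by_consumer: bool  # hypothetical flag name


def consume(block_list: BlockList) -> int:
    """Sum all rows, releasing each block after use only if it is owned."""
    total = 0
    for i, block in enumerate(block_list.blocks):
        total += sum(block)
        if block_list.owned_by_consumer:
            # Owned: nothing else can reference this block, so free it eagerly.
            block_list.blocks[i] = None
    return total


# Intermediate output produced during consumption: owned, so it can be released.
intermediate = BlockList([[1, 2], [3, 4]], owned_by_consumer=True)
# Blocks handed in by the user (the from_XXX case): not owned, so they must survive.
user_input = BlockList([[1, 2], [3, 4]], owned_by_consumer=False)

total_owned = consume(intermediate)
total_unowned = consume(user_input)
```

After both calls, `intermediate.blocks` holds only `None` entries while `user_input.blocks` is unchanged, which is the distinction the three non-owned cases are meant to preserve.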

Key items:

  • Make from_XXX APIs lazy: currently they create an eager dataset because they take an in-memory blocklist. We will handle from_XXX() and split() in a unified way, i.e. by creating a lazy dataset that takes in a materialized blocklist that is NOT owned (cannot be eagerly released after use).
  • Make fully_executed() and split() produce blocklists that are NOT owned by the consumer (cannot be eagerly released after use).
  • Deprecate the .lazy() API: there will be no eager dataset, so this API will be obsolete.
  • Remove run_by_consumer arg: it indicates whether the blocklists are produced by consumption APIs (if so, the blocks can be eagerly released after use). With lazy-only, run_by_consumer should always be True, so it is no longer needed.
  • Remove allow_clear_input_blocks arg: this is also used to determine eager memory release. With lazy-only, it should also always be True, so it is no longer needed.
  • Make sure all tests (CI and Nightly) pass.
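As a rough sketch of how the first two items could fit together, here is a toy lazy-only dataset in plain Python. It does not use or mirror Ray's internals: `LazyDataset`, `from_items()`, and the method bodies are hypothetical stand-ins, chosen only to show that from_XXX, fully_executed(), and split() can all wrap a materialized, non-owned blocklist, while intermediate blocks created during execution are released as soon as the next stage has consumed them.

```python
from typing import Callable, List, Optional

Block = List[int]


class LazyDataset:
    """Toy lazy dataset: non-owned materialized input blocks plus pending stages."""

    def __init__(self, input_blocks: List[Block],
                 stages: Optional[List[Callable[[int], int]]] = None):
        self._input_blocks = input_blocks  # never released by execution
        self._stages = list(stages or [])

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Only records the stage; nothing runs until a consumption call.
        return LazyDataset(self._input_blocks, self._stages + [fn])

    def _execute(self) -> List[Block]:
        blocks, owned = self._input_blocks, False
        for fn in self._stages:
            out = [[fn(row) for row in block] for block in blocks]
            if owned:
                blocks.clear()  # intermediate blocks are owned: release eagerly
            blocks, owned = out, True
        return blocks

    def take_all(self) -> List[int]:
        return [row for block in self._execute() for row in block]

    def fully_executed(self) -> "LazyDataset":
        # The result wraps materialized blocks that are treated as not owned.
        return LazyDataset(self._execute())

    def split(self, n: int) -> List["LazyDataset"]:
        blocks = self._execute()
        return [LazyDataset(blocks[i::n]) for i in range(n)]


def from_items(items: List[int], block_size: int = 2) -> LazyDataset:
    """Hypothetical from_XXX: a lazy dataset over in-memory, non-owned blocks."""
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    return LazyDataset(blocks)


ds = from_items([1, 2, 3, 4, 5])
doubled = ds.map(lambda x: x * 2).map(lambda x: x + 1)
result = doubled.take_all()
shards = doubled.split(2)
```

In this sketch the input blocks survive every execution, so `ds.take_all()` still returns the original items after `doubled` has been consumed, and each shard from `split()` can be read more than once. No `run_by_consumer`-style flag is passed around, since ownership follows from where the blocks came from.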

@ericl @clarkzinzow @c21

@jianoaix jianoaix added P1 Issue that should be fixed within a few weeks data Ray Data-related issues labels Jan 12, 2023
@c21 c21 added this to the Dataset Execution Optimizer milestone Feb 16, 2023
@c21 c21 added P0 Issue that must be fixed in short order P1 Issue that should be fixed within a few weeks and removed P1 Issue that should be fixed within a few weeks P0 Issue that must be fixed in short order labels Feb 16, 2023
joncarter1 pushed a commit to joncarter1/ray that referenced this issue Apr 2, 2023
ray-project#31639
Signed-off-by: Jonathan Carter <jonathan.carter@magd.ox.ac.uk>
elliottower pushed a commit to elliottower/ray that referenced this issue Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023