
[Datasets] [Out-of-Band Serialization: 1/3] Refactor LazyBlockList. #23821

Merged

Conversation


@clarkzinzow clarkzinzow commented Apr 10, 2022

This PR refactors LazyBlockList in service of out-of-band serialization (see mono-PR) and is a precursor to an execution plan refactor (PR #2) and to adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:

  1. ReadTasks are now a first-class concept, replacing the bare block-generating calls that LazyBlockList previously stored;
  2. read-stage progress tracking is consolidated into LazyBlockList._get_blocks_with_metadata(), and more of the read-task complexity (e.g. the read remote function) has been pushed into LazyBlockList to make ray.data.read_datasource() simpler;
  3. we are smarter about how we progressively launch tasks and fetch and cache metadata: .iter_blocks_with_metadata() now fetches the metadata for the read tasks instead of relying on the less accurate pre-read task metadata, and some small bugs in the lazy ramp-up around progressive metadata fetching are fixed.

(1) is the most important item for supporting out-of-band serialization and fundamentally changes the LazyBlockList data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the mono-PR for more context.
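
To make the data-model change concrete, here is a minimal sketch of the idea (the names and signatures are illustrative stand-ins, not Ray's exact internals):

```python
from typing import Callable, Iterable

# Illustrative stand-ins; the real types live in ray.data.block and
# ray.data.datasource and differ in detail.
class BlockMetadata:
    def __init__(self, num_rows, size_bytes, schema, input_files):
        self.num_rows = num_rows  # May be None before the read runs.
        self.size_bytes = size_bytes
        self.schema = schema
        self.input_files = input_files

class ReadTask:
    """A first-class read task: a deferred read plus its pre-read metadata.

    Holding the task itself (rather than a bare callable) lets us rewrite
    read stages during optimization and serialize the Dataset's lineage.
    """

    def __init__(self, read_fn: Callable[[], Iterable], metadata: BlockMetadata):
        self._read_fn = read_fn
        self._metadata = metadata

    def get_metadata(self) -> BlockMetadata:
        # Metadata known before reading; may be partial (e.g. num_rows=None).
        return self._metadata

    def __call__(self) -> Iterable:
        # Execute the actual read, producing one or more blocks.
        return self._read_fn()
```

A LazyBlockList can then hold a List[ReadTask] instead of a list of bare calls, which is what the read-stage rewriting and lineage serialization need to reference.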

Other changes:

  1. Changed the stats actor to a global named-actor singleton, obviating the need to serialize the actor handle with the Dataset stats; without this, we were encountering serialization failures.
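
For reference, a minimal sketch of the global named-actor singleton pattern, using only public Ray APIs (the StatsActor methods and the helper name here are illustrative, not the actual implementation):

```python
import ray

@ray.remote(num_cpus=0)
class StatsActor:
    def __init__(self):
        self._stats = {}

    def record(self, stats_uuid, task_idx, metadata):
        self._stats.setdefault(stats_uuid, {})[task_idx] = metadata

    def get(self, stats_uuid):
        return self._stats.get(stats_uuid, {})

def get_or_create_stats_actor(name="datasets_stats_actor"):
    # Fetch the singleton by name instead of holding an actor handle, so
    # nothing actor-related needs to be serialized with the Dataset stats.
    try:
        return ray.get_actor(name)
    except ValueError:
        # First use: create the named, detached singleton. (A production
        # implementation would also handle the creation race, e.g. with
        # .options(get_if_exists=True) on newer Ray versions.)
        return StatsActor.options(name=name, lifetime="detached").remote()
```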

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow force-pushed the datasets/feat/refactor-lazy-block-list branch from fd7f3a7 to d9b2891 on April 10, 2022 08:48

@ericl ericl left a comment


A few questions on why the test cases change, but the high level approach looks good.

@jianoaix can you take the detailed review?

python/ray/data/impl/lazy_block_list.py (outdated)
This serves as a cache of fetched block metadata.
ray_remote_args: Ray remote arguments for the read tasks.
stats_uuid: UUID for the dataset stats, used to group and fetch read task
stats. If not provided, a new UUID will be created.
Contributor:

This is great.

python/ray/data/tests/test_dataset_formats.py (outdated)
@@ -750,7 +748,7 @@ def test_numpy_read(ray_start_regular_shared, tmp_path):
     np.save(os.path.join(path, "test.npy"), np.expand_dims(np.arange(0, 10), 1))
     ds = ray.data.read_numpy(path)
     assert str(ds) == (
-        "Dataset(num_blocks=1, num_rows=None, "
+        "Dataset(num_blocks=1, num_rows=10, "
Contributor:

Hmm, how are we able to get the number of rows here without eager reading?

Contributor (Author):

AFAICT, before this PR, we were already eagerly submitting the first read task AND blocking on that read task finishing in order to fetch the schema in this case (1, 2). After this PR, we fetch all of the block metadata, so we also get a more informative view of e.g. the number of rows for datasources that lack metadata peeking, as in this case.
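
Roughly, the schema peek described above amounts to the following (a hypothetical sketch; execute_read_task is an illustrative remote wrapper, not the actual internal remote function):

```python
import ray

@ray.remote
def execute_read_task(task):
    # Illustrative remote wrapper around a deferred read task.
    return task()

def peek_schema(read_tasks):
    # Eagerly submit only the FIRST read task and block on its result to
    # recover the schema; the remaining read tasks stay unlaunched (lazy).
    first_block = ray.get(execute_read_task.remote(read_tasks[0]))
    return first_block.schema  # e.g. a pyarrow.Table's schema
```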

Contributor:

The linked code still seems to just look at the metadata for the first block? Also, the above "test_numpy_roundtrip" test still has num_rows=None.

Contributor (Author):

> The linked code still seems to just look at the metadata for the first block?

@jianoaix That's an older commit in this PR; as you can see in the current state, we block and fetch the metadata for the first block if the schema isn't set.

> Also, the above "test_numpy_roundtrip" test still has num_rows=None.

The reason is that the test_numpy_read dataset consists of only a single block, so blocking and fetching the metadata for the first block gives us the exact number of rows, and we can calculate the dataset's row count from the block metadata alone. The test_numpy_roundtrip dataset, on the other hand, has 2 blocks, so when we block and fetch the metadata for the first block, we only have the row count for one of the two blocks and can't calculate the row count of the full dataset from the block metadata alone.
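
In other words, a dataset-level row count is only reported when every block's metadata carries one; a sketch of that reduction (illustrative, not the exact Ray code):

```python
from typing import List, Optional

def count_rows(per_block_num_rows: List[Optional[int]]) -> Optional[int]:
    # The dataset-level row count is only known if every block reports one.
    if any(n is None for n in per_block_num_rows):
        return None
    return sum(per_block_num_rows)

# test_numpy_read: one block whose metadata was fetched -> exact count.
assert count_rows([10]) == 10
# test_numpy_roundtrip: two blocks, only the first fetched -> unknown.
assert count_rows([5, None]) is None
```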

@ericl ericl added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send it back to the reviewer) on Apr 10, 2022

ericl commented Apr 10, 2022

> already-read block partitions are reused when possible to prevent unnecessary re-reads;

I'm a little worried about this introducing unnecessary complexity in the timing / stats, but if you and Jian think it's a good idea I'm ok with it.


clarkzinzow commented Apr 11, 2022

> I'm a little worried about this introducing unnecessary complexity in the timing / stats, but if you and Jian think it's a good idea I'm ok with it.

Agreed about the stats; the end result will be that, for a fused read + transformation stage, the stats for a task operating on an already-read block will only include the transformation time, not the read time. However, I still think it's worth it in order to avoid unnecessary re-reading, and if the stats discrepancy is untenable, I'd rather do something like a best-effort merge of the pre-fusion read stats with the post-fusion read + transform stats than do unnecessary file re-reads.
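
A hypothetical sketch of such a best-effort merge, with plain dicts standing in for DatasetStats and illustrative stage names:

```python
from typing import Dict, List

def merge_stats(
    pre_fusion: Dict[str, List[float]], post_fusion: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    # Best-effort: carry timings from the partial pre-fusion read into the
    # fused stage's stats rather than re-reading files just to recover
    # accurate read timings.
    merged = {stage: list(timings) for stage, timings in post_fusion.items()}
    for stage, timings in pre_fusion.items():
        merged.setdefault(stage, []).extend(timings)
    return merged

read_stats = {"read": [0.21, 0.19]}          # partial pre-fusion read timings
fused_stats = {"read->map_batches": [0.33]}  # fused read + transform timings
print(merge_stats(read_stats, fused_stats))
# {'read->map_batches': [0.33], 'read': [0.21, 0.19]}
```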

@jianoaix

> > I'm a little worried about this introducing unnecessary complexity in the timing / stats, but if you and Jian think it's a good idea I'm ok with it.

> Agreed about the stats; the end result will be that, for a fused read + transformation stage, the stats for a task operating on an already-read block will only include the transformation time, not the read time. However, I still think it's worth it in order to avoid unnecessary re-reading, and if the stats discrepancy is untenable, I'd rather do something like a best-effort merge of the pre-fusion read stats with the post-fusion read + transform stats than do unnecessary file re-reads.

Based on our offline discussion, how about dropping this optimization from this PR for now, and evaluating in a follow-up whether it's worth the complexity it may add?


@jianoaix jianoaix left a comment


Thank you, Clark, for splitting the mono-PR. The split and this PR itself look good to me!

python/ray/data/impl/lazy_block_list.py
python/ray/data/impl/lazy_block_list.py (outdated)
fetched_metadata: An optional list of already computed AND fetched metadata.
This serves as a cache of fetched block metadata.
ray_remote_args: Ray remote arguments for the read tasks.
stats_uuid: UUID for the dataset stats, used to group and fetch read task
Contributor:

Is this UUID per dataset, or per <dataset, stage>? If it's the latter, naming it more accurately would be helpful.

Contributor (Author):

This UUID is per stats object, which amounts to roughly per read stage and therefore roughly per dataset, but it predates the creation of a Dataset and its UUID, hence the two different UUIDs.

The UUID is only used for gathering stats about read stages, so we could rename it to read_stats_uuid if that makes it clearer.

python/ray/data/impl/lazy_block_list.py (outdated)
python/ray/data/impl/lazy_block_list.py
python/ray/data/impl/lazy_block_list.py (outdated)
python/ray/data/impl/lazy_block_list.py
python/ray/data/impl/lazy_block_list.py (outdated)
@clarkzinzow

> Based on our offline discussion, how about dropping this optimization from this PR for now, and evaluating in a follow-up whether it's worth the complexity it may add?

@jianoaix I'll time-box the stats merging that we talked about (15 minutes), and if it's not straightforward, I'll revert that optimization and add it in a follow-up PR!

@clarkzinzow clarkzinzow force-pushed the datasets/feat/refactor-lazy-block-list branch from 703bace to dade522 on April 13, 2022 04:59
@clarkzinzow

@jianoaix I did the time-boxed investigation of stats merging: the merging itself is simple; the complicated bit is propagating the partial-read stats to the downstream read + transform stats creation, which would require adding a "partially computed stats" concept that's passed around in a clunky way. Ultimately, I opted to remove the optimization.

python/ray/data/impl/plan.py (outdated)
python/ray/data/impl/lazy_block_list.py (outdated)
from ray.data.impl.block_list import BlockList
from ray.data.impl.progress_bar import ProgressBar
from ray.data.impl.remote_fn import cached_remote_fn
from ray.data.impl.stats import DatasetStats, _get_or_create_stats_actor


class LazyBlockList(BlockList):
Contributor:

Can you add some unit tests for the public methods of this class?

Contributor (Author):

I'm happy to do this as a follow-up, but since we don't currently have any unit testing at this layer (all unit testing is done on the Dataset and DatasetPipeline methods), I'd prefer not to block out-of-band serialization on adding a new testing layer.
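
For context, a representative sketch of what coverage at the existing Dataset layer looks like (not an actual test from this PR; the test name is illustrative):

```python
import ray

def test_lazy_block_list_via_dataset(ray_start_regular_shared):
    # LazyBlockList is exercised indirectly through the public Dataset API,
    # which is the layer where current unit tests live.
    ds = ray.data.range(100, parallelism=4)
    assert ds.num_blocks() == 4
    assert ds.count() == 100
    assert ds.take(3) == [0, 1, 2]
```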

Contributor:

In general, I think we should strive to cover code with unit tests in the same PR rather than in follow-ups, which are essentially (short-term) tech debt, especially for newly added classes/functions or new public methods on existing classes. This PR is mostly refactoring, and we have so far lumped unit tests together anyway, so LG to move forward.

Contributor (Author):

Strongly agree. I'm not suggesting that we add new abstractions and the corresponding unit tests in separate PRs; I'm pushing back on making a large test-coverage change in this PR, which is a refactoring of an existing abstraction that's blocking some AIR features. I would really like us to have a much stricter unit-testing standard, since most of our Python tests in core Ray and our ecosystem are technically integration tests, leaving many intermediate edge cases uncovered. So I'm very happy that you're of the same mind; definitely looking forward to pushing on this!

@jianoaix

> @jianoaix I did the time-boxed investigation of stats merging: the merging itself is simple; the complicated bit is propagating the partial-read stats to the downstream read + transform stats creation, which would require adding a "partially computed stats" concept that's passed around in a clunky way. Ultimately, I opted to remove the optimization.

Sounds good, thanks for the quick investigation! Please update the PR description and drop item #2.
