
[Data] Postpone reader.get_read_tasks until execution #38373

Merged
merged 6 commits into from
Aug 16, 2023

Conversation

@c21 (Contributor) commented Aug 11, 2023

Why are these changes needed?

This PR postpones `reader.get_read_tasks()` (i.e., generating the `List[ReadTask]`) until the Dataset is executed. It also introduces a hook that allows post-processing of input files inside `reader.get_read_tasks()`, so custom logic can run on the input files before the `List[ReadTask]` is returned.
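To make the change concrete, here is a minimal, self-contained sketch; all names are simplified stand-ins for Ray Data's actual `Reader`/`ReadTask`/`Dataset` classes, not the real implementation. The point is that read-task creation moves out of dataset construction and into execution:

```python
from typing import Any, Callable, List


class ReadTask:
    """Stand-in for a Ray Data read task: a callable producing records."""

    def __init__(self, fn: Callable[[], List[Any]]):
        self._fn = fn

    def __call__(self) -> List[Any]:
        return self._fn()


class Reader:
    """Stand-in reader; expensive metadata work happens in get_read_tasks()."""

    def __init__(self, items: List[Any]):
        self._items = items

    def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
        # Partition the input into `parallelism` chunks, one task per chunk.
        chunks = [self._items[i::parallelism] for i in range(parallelism)]
        return [ReadTask(lambda c=c: c) for c in chunks]


class Dataset:
    def __init__(self, reader: Reader, parallelism: int):
        # After this PR: only the reader is stored at construction time;
        # read-task creation is postponed until execution.
        self._reader = reader
        self._parallelism = parallelism

    def execute(self) -> List[Any]:
        # get_read_tasks() is only invoked here, at execution time.
        out: List[Any] = []
        for task in self._reader.get_read_tasks(self._parallelism):
            out.extend(task())
        return out
```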

Related issue number

Checks

  • I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1413,12 +1413,6 @@ def test_unsupported_pyarrow_versions_check_disabled(
except ImportError as e:
pytest.fail(f"_check_pyarrow_version failed unexpectedly: {e}")

# Test read_parquet.
Contributor Author:

PyArrow 5 does not support pickling the Parquet reader class. Given we no longer support PyArrow 5, remove the test code here. I already verified this is not an issue for PyArrow 6+.

Contributor:

should we just remove the PyArrow 5 CI?

Contributor Author:

We already removed it. CI only tests PyArrow 6 and 12; this test manually installs PyArrow 5.

@@ -384,6 +387,7 @@ def read_datasource(
# Compute the number of blocks the read will return. If the number of blocks is
# expected to be less than the requested parallelism, boost the number of blocks
# by adding an additional split into `k` pieces to each read task.
additional_split_factor = None
Contributor:

One small issue: for the new code path, `get_read_tasks` has actually already created the read tasks, but they are only used for the following calculations and then discarded.
I'm wondering if we can move that calculation code into the reader, so that here we only create the reader, not the read tasks. Then we also wouldn't need to expose `additional_split_factor` to the operator.

Contributor Author:

Yes, that's the main awkward part here. The other code path, `LazyBlockList`, depends on `List[ReadTask]` throughout, so I didn't spend time refactoring `LazyBlockList`. Shall we just delete `DatasetPipeline` and `LazyBlockList`/`BlockList` during 2.8?

Contributor:

That's fine. Can you leave a TODO here?

Contributor Author:

Sounds good, added.

@@ -0,0 +1,2 @@
# Default file shuffler class to use.
Contributor Author:

I plan to move all of the default config values (https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L17-L133) to this file. But that involves more code changes, because other code paths use them directly, so I plan to do it in a separate PR.

@@ -513,3 +513,20 @@ def unify_block_metadata_schema(
# return the first schema.
return schemas_to_unify[0]
return None


def get_attribute_from_class_name(class_name: str) -> Any:
Contributor Author:

Checked online; this looks like the recommended way to do it: https://stackoverflow.com/questions/452969/does-python-have-an-equivalent-to-java-class-forname .
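For reference, here is one plausible implementation of such a helper along the lines of the linked answer (a sketch, not necessarily the exact code in the PR):

```python
import importlib
from typing import Any


def get_attribute_from_class_name(class_name: str) -> Any:
    """Resolve a fully qualified name like "pkg.module.ClassName" to the object."""
    module_name, _, attr_name = class_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)
```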

paths: List[str],
file_sizes: List[int],
reader_args: Dict[str, Any]
) -> Tuple[List[str], List[int]]:
Contributor:

Add docstring for this?

Contributor Author:

Sorry, but this file has been deleted.

@@ -370,6 +372,7 @@ def read_datasource(
min_safe_parallelism,
inmemory_size,
read_tasks,
reader,
) = ray.get(
get_read_tasks.remote(
Contributor:

How is `get_read_tasks` being postponed if it's still being called here?

Contributor Author:

Added a comment below; this call is still needed for the `LazyBlockList` code path.

super().__init__(f"Read{datasource.get_name()}{suffix}", None, ray_remote_args)
self._datasource = datasource
self._estimated_num_blocks = estimated_num_blocks
self._read_tasks = read_tasks
self._reader = reader
Contributor:

Where does `self._reader` get used here?

@@ -42,7 +42,10 @@ def _plan_read_op(op: Read) -> PhysicalOperator:
"""

def get_input_data() -> List[RefBundle]:
read_tasks = op._read_tasks
read_tasks = op._reader.get_read_tasks(op._parallelism)
Contributor Author (Aug 15, 2023):

> where does self._reader get used here?

@amogkam - this is used in the planner here.

@@ -30,4 +34,4 @@ def fusable(self) -> bool:
as fusion would prevent the blocks from being dispatched to multiple processes
for parallel processing in downstream operators.
"""
return self._estimated_num_blocks == len(self._read_tasks)
return self._parallelism == self._estimated_num_blocks
Contributor:

Just `self._additional_split_factor is None`?

Contributor Author:

If `_additional_split_factor == 1`, we can still do the fusion, right?
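To make the disagreement concrete, here is a toy model (function names are hypothetical): if each of `parallelism` read tasks is split into `factor` extra pieces, the two proposed checks differ exactly when the factor is 1:

```python
from typing import Optional


def estimated_num_blocks(parallelism: int, factor: Optional[int]) -> int:
    # Each read task is additionally split into `factor` pieces; None means no split.
    return parallelism * (factor or 1)


def fusable_by_count(parallelism: int, num_blocks: int) -> bool:
    # The check used in the diff above: fusable when no extra splitting
    # inflated the block count beyond the requested parallelism.
    return parallelism == num_blocks


def fusable_by_factor(factor: Optional[int]) -> bool:
    # The reviewer's suggested check.
    return factor is None
```

With `factor == 1` the split is a no-op, so the count-based check still reports fusable while the `is None` check does not, which is the point the author is making.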

def get_read_tasks(self, parallelism: int) -> List[ReadTask]:
from bson.objectid import ObjectId

self._create_client()
Contributor:

Nit: this would be clearer as `client = self._get_or_create_client()`.

Contributor Author:

updated.

@@ -0,0 +1,2 @@
# Default file shuffler class to use.
DEFAULT_FILE_SHUFFLER = "ray.data.datasource.file_shuffler.SequentialFileShuffler"
Contributor:

Can we store the concrete class here, rather than just a string?

Contributor Author:

I prefer to store the class name here, because at runtime different objects will have different parameters. For example, Spark does this: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2317-L2322 .
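The trade-off being discussed can be sketched as follows (the helper name and the demo class path are hypothetical): storing a `"module.Class"` string in the config lets each call site construct the class with its own runtime arguments:

```python
import importlib
from typing import Any


def instantiate_from_config(class_path: str, *args: Any, **kwargs: Any) -> Any:
    """Resolve a "module.Class" string and construct it with call-site arguments."""
    module_name, _, cls_name = class_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    # Because the config stores only the class *name*, different call sites
    # can pass different constructor arguments at runtime.
    return cls(*args, **kwargs)
```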

@@ -204,6 +206,7 @@ def __init__(
self.use_ray_tqdm = use_ray_tqdm
self.use_legacy_iter_batches = use_legacy_iter_batches
self.enable_progress_bars = enable_progress_bars
self.file_shuffler = file_shuffler
Contributor:

Nit: there is no need to add this `__init__` argument.

@property
def file_shuffler_cls(self):
    # Import here to avoid cyclic dependencies.
    return DEFAULT_FILE_SHUFFLER

Contributor Author (Aug 15, 2023):

I guess so; we can change the whole `__init__` later. Let's do that in a separate PR? For now this just follows the other configs here.

from typing import Any, Dict, List, Tuple


class FileShuffler:
Contributor:

Maybe call it `FileMetadataShuffler`. The current name sounds like it shuffles the files themselves.

Contributor Author:

Sure, updated.

@c21 c21 force-pushed the read-api branch 3 times, most recently from 2fb516a to 214c438 Compare August 15, 2023 20:00
def __init__(self, reader_args: Dict[str, Any]):
self._reader_args = reader_args

def shuffle_files(
Contributor:

Nit: similar to the class name update, should we update the method name and docstrings to something like `shuffle_file_metadata` or `shuffle_metadata`? Otherwise it could be confused with shuffling the files themselves instead of their metadata.

Contributor Author:

Hmm, I think it's probably fine, given we already have `FileMetadataShuffler` as the class name.
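Putting the discussed interface together, here is a sketch of what `FileMetadataShuffler` might look like; the random subclass and exact bodies are illustrative, reconstructed from the fragments shown in this diff:

```python
import random
from typing import Any, Dict, List, Tuple


class FileMetadataShuffler:
    """Shuffles file *metadata* (paths and sizes), not the file contents."""

    def __init__(self, reader_args: Dict[str, Any]):
        self._reader_args = reader_args

    def shuffle_files(
        self, paths: List[str], file_sizes: List[int]
    ) -> Tuple[List[str], List[int]]:
        # Sequential (no-op) shuffle: return the metadata unchanged.
        return paths, file_sizes


class RandomFileMetadataShuffler(FileMetadataShuffler):
    def shuffle_files(
        self, paths: List[str], file_sizes: List[int]
    ) -> Tuple[List[str], List[int]]:
        # Shuffle paths and sizes together so they stay aligned.
        pairs = list(zip(paths, file_sizes))
        random.shuffle(pairs)
        if not pairs:
            return [], []
        shuffled_paths, shuffled_sizes = zip(*pairs)
        return list(shuffled_paths), list(shuffled_sizes)
```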

Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21 c21 merged commit a2e0ce1 into ray-project:master Aug 16, 2023
51 of 53 checks passed
@c21 c21 deleted the read-api branch August 16, 2023 00:47
c21 added a commit that referenced this pull request Aug 16, 2023
This is a follow-up to #38373 that changes the interface of `FileMetadataShuffler` to take tuples instead of two lists. This guarantees that paths and sizes have the same length from the API perspective.

Signed-off-by: Cheng Su <scnju13@gmail.com>
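The follow-up interface change can be sketched as (function names are hypothetical):

```python
from typing import List, Tuple


def shuffle_files_v1(
    paths: List[str], sizes: List[int]
) -> Tuple[List[str], List[int]]:
    # Before: two parallel lists; nothing stops a caller passing mismatched lengths.
    return paths, sizes


def shuffle_files_v2(files: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    # After: one list of (path, size) tuples; lengths cannot diverge by construction.
    return list(files)
```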
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
4 participants