
[Data] Time streaming split overhead #43477

Merged
merged 17 commits into ray-project:master on Mar 4, 2024

Conversation

@omatthew98 (Contributor) commented Feb 27, 2024

Why are these changes needed?

This adds timing for the overhead created by using the streaming_split iterator. There are two primary overheads that are measured:

  1. The OutputSplitter operator. This operator is responsible for distributing data to different splits. It runs in the scheduler thread.
  2. The SplitCoordinator actor. The overhead is mainly from the get calls and the barrier, which happen in background threads that handle the remote task requests.

The first overhead is tracked at the operator level in the OutputSplitter operator. The second overhead is tracked at the Dataset level in the SplitCoordinator actor.
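
A minimal sketch of the measurement pattern (illustrative only; the class and attribute names below, such as _split_overhead_s, are hypothetical and not the exact fields added by this PR):

    import time

    class OutputSplitterSketch:
        """Toy model of an operator that times its own dispatch overhead."""

        def __init__(self, n: int):
            self._num_output = [0 for _ in range(n)]
            self._split_overhead_s = 0.0  # accumulated overhead, in seconds

        def add_input(self, bundle) -> None:
            start = time.perf_counter()
            # ... decide which output split receives the bundle and enqueue it ...
            target = min(range(len(self._num_output)), key=self._num_output.__getitem__)
            self._num_output[target] += 1
            # Only the splitter's own bookkeeping counts toward the overhead.
            self._split_overhead_s += time.perf_counter() - start

The same wrap-and-accumulate pattern applies to the SplitCoordinator actor around its get and barrier handling, with the result surfaced through the Dataset-level stats.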

Related issue number

Closes #42802

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scottjlee self-assigned this Feb 29, 2024
Comment on lines 239 to 243
stats = (
self._executor.get_stats()
if self._executor
else self._base_dataset._plan.stats()
)
Contributor:

maybe we can rewrite SplitCoordinator.stats() to return this instead of DatasetStatsSummary, and use the method here
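
If SplitCoordinator.stats() were rewritten as suggested, it might look roughly like this (a sketch of the reviewer's idea rather than the merged code; it assumes the coordinator holds the _executor and _base_dataset references used in the snippet above):

    def stats(self) -> "DatasetStats":
        # Return the raw DatasetStats instead of a DatasetStatsSummary so that
        # call sites like the snippet above can simply delegate to this method.
        if self._executor:
            return self._executor.get_stats()
        return self._base_dataset._plan.stats()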

@omatthew98 changed the title from "Time streaming split overhead" to "[Data] Time streaming split overhead" on Feb 29, 2024
@scottjlee (Contributor) left a comment:

could you include an example of the new additions to the output string in the PR description?
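
For illustration, the new additions would appear in the stats output roughly as follows (the values are made up; the exact label text comes from the diff below):

    * Num blocks unknown location: 0
    * Streaming split coordinator overhead time: 12.34ms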

@@ -1256,6 +1259,9 @@ def to_string(self) -> str:
         out += " * Num blocks unknown location: {}\n".format(
             self.iter_unknown_location
         )
+        if self.streaming_split_coord_time.get() != 0:
+            out += "* Streaming split coordinator overhead time: "
Contributor:

for formatting, should we omit the * since this is the start of a new section?
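
One way to apply this suggestion (a sketch only; fmt is assumed to be the module's existing time formatter):

    if self.streaming_split_coord_time.get() != 0:
        # New section, so no leading "*" on the header line.
        out += "Streaming split coordinator overhead time: "
        out += "{}\n".format(fmt(self.streaming_split_coord_time.get()))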

@@ -631,6 +631,9 @@ def __init__(
         self.global_bytes_restored: int = 0
         self.dataset_bytes_spilled: int = 0

+        # Streaming split iterator stats
Contributor:

Let's also note in the comment that this is measured at Dataset level (as opposed to operator level)

Suggested change
-        # Streaming split iterator stats
+        # Streaming split coordinator stats

@@ -47,6 +48,8 @@ def __init__(
         self._output_queue: deque[RefBundle] = deque()
         # The number of rows output to each output split so far.
         self._num_output: List[int] = [0 for _ in range(n)]
+        # The time of the overhead for the output splitter
Contributor:

Let's also note in the comment that this is measured at operator level (as opposed to Dataset level)

@omatthew98 marked this pull request as ready for review March 1, 2024 22:47
@@ -204,7 +208,7 @@ def get(

         This is intended to be called concurrently from multiple clients.
         """
+        start_time = time.perf_counter()
Contributor:

nit, consider creating a context manager or decorator util to make the code cleaner.

Contributor (Author):

I considered that, but I think it's best to leave this as is in case the stats object is None. For my previous PR, switching to a context manager caused some integration tests to fail (#43283). I could use the context manager only when stats is not None, but I'm not sure that would end up being cleaner.
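
For reference, the context-manager utility discussed here could look roughly like the following (a hypothetical sketch, not code from this PR; it assumes the stats sink exposes an add(seconds)-style counter and simply skips recording when no stats object is available):

    import time
    from contextlib import contextmanager
    from typing import Callable, Iterator, Optional

    @contextmanager
    def timed(record: Optional[Callable[[float], None]]) -> Iterator[None]:
        # Time the wrapped block and report the elapsed seconds to `record`,
        # doing nothing when no stats object is available (the None case above).
        start = time.perf_counter()
        try:
            yield
        finally:
            if record is not None:
                record(time.perf_counter() - start)

    # Hypothetical usage inside SplitCoordinator.get():
    # with timed(stats.streaming_split_coord_time.add if stats else None):
    #     bundle = self._next_bundle(split_idx)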

@omatthew98 added the release-blocker (P0: Issue that blocks the release) label on Mar 4, 2024
@raulchen merged commit 164fc16 into ray-project:master on Mar 4, 2024
8 of 9 checks passed
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
This adds timing for the overhead created by using the [`streaming_split`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.streaming_split.html) iterator. There are two primary overheads that are measured:
1. The [OutputSplitter](https://github.com/ray-project/ray/blob/1cc18ea622a6e1899a17f4548aa3734c3656a90f/python/ray/data/_internal/execution/operators/output_splitter.py#L18) operator. This operator is responsible for distributing data to different splits. It runs in the scheduler thread.
2. The [SplitCoordinator actor](https://github.com/ray-project/ray/blob/783da640a20ddbd3b41b893485abf187f0f27223/python/ray/data/_internal/iterator/stream_split_iterator.py#L124). The overhead is mainly from the get and the barrier, which happen in some background threads that handle the remote task requests.

The first overhead is tracked at the operator level in the OutputSplitter operator. The second overhead is tracked at the Dataset level in the SplitCoordinator actor.

Labels: release-blocker (P0: Issue that blocks the release)
Projects: None yet
Development: Successfully merging this pull request may close these issues: [Data] Measure overhead from streaming split
3 participants