[Data] Time streaming split overhead #43477
Signed-off-by: Matthew Owen <mowen@anyscale.com>
…inal stats object Signed-off-by: Matthew Owen <mowen@anyscale.com>
stats = (
    self._executor.get_stats()
    if self._executor
    else self._base_dataset._plan.stats()
)
Maybe we can rewrite `SplitCoordinator.stats()` to return this instead of `DatasetStatsSummary`, and use the method here.
…dd to streaming test Signed-off-by: Matthew Owen <mowen@anyscale.com>
could you include an example of the new additions to the output string in the PR description?
python/ray/data/_internal/stats.py
Outdated
@@ -1256,6 +1259,9 @@ def to_string(self) -> str:
            out += " * Num blocks unknown location: {}\n".format(
                self.iter_unknown_location
            )
        if self.streaming_split_coord_time.get() != 0:
            out += "* Streaming split coordinator overhead time: "
for formatting, should we omit the * since this is the start of a new section?
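The formatting question above can be illustrated with a minimal sketch. The function name `format_stats`, its parameters, and the "Iterator stats:" header are hypothetical and are not taken from Ray's `to_string`; the sketch only shows a section line that is emitted when the timer is nonzero and that omits the leading `*` because it opens a new section.

```python
def format_stats(iter_unknown_location: int, coord_overhead_s: float) -> str:
    """Build a stats string; all names here are illustrative assumptions."""
    out = "Iterator stats:\n"
    out += "    * Num blocks unknown location: {}\n".format(iter_unknown_location)
    if coord_overhead_s != 0:
        # Start of a new top-level section, so no leading "*".
        out += "Streaming split coordinator overhead time: {:.2f}s\n".format(
            coord_overhead_s
        )
    return out
```

With a zero overhead value the extra line is skipped entirely, which keeps the output unchanged for datasets that do not use `streaming_split`.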
python/ray/data/_internal/stats.py
Outdated
@@ -631,6 +631,9 @@ def __init__(
        self.global_bytes_restored: int = 0
        self.dataset_bytes_spilled: int = 0

        # Streaming split iterator stats
Let's also note in the comment that this is measured at Dataset level (as opposed to operator level)
Suggested change:
- # Streaming split iterator stats
+ # Streaming split coordinator stats
@@ -47,6 +48,8 @@ def __init__(
        self._output_queue: deque[RefBundle] = deque()
        # The number of rows output to each output split so far.
        self._num_output: List[int] = [0 for _ in range(n)]
        # The time of the overhead for the output splitter
Let's also note in the comment that this is measured at operator level (as opposed to Dataset level)
…mments Signed-off-by: Matthew Owen <mowen@anyscale.com>
@@ -204,7 +208,7 @@ def get(

        This is intended to be called concurrently from multiple clients.
        """

        start_time = time.perf_counter()
nit, consider creating a context manager or decorator util to make the code cleaner.
I considered that, I think best to leave as is in case the stats object is `None`. I think for my previous PR switching to a context manager caused some integration tests to fail (#43283). I could use the context manager only when stats is not `None` but not sure if that would end up being cleaner.
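For reference, one way the context-manager idea could handle a missing stats object is sketched below. This is an illustration under assumptions, not the code in this PR: `SimpleTimer` and `timed` are hypothetical names, and the sketch does not address the integration test failures mentioned above.

```python
import time
from contextlib import contextmanager
from typing import Iterator, Optional


class SimpleTimer:
    """Hypothetical accumulator standing in for a timer held by a stats object."""

    def __init__(self) -> None:
        self._total = 0.0

    def add(self, seconds: float) -> None:
        self._total += seconds

    def get(self) -> float:
        return self._total


@contextmanager
def timed(timer: Optional[SimpleTimer]) -> Iterator[None]:
    """Add the wall-clock time of the enclosed block to `timer`.

    If `timer` is None the block still runs, but nothing is recorded.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        if timer is not None:
            timer.add(time.perf_counter() - start)
```

A caller would wrap the measured region in `with timed(maybe_timer): ...`, which moves the `None` check into a single place instead of repeating it at each call site.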
This adds timing for the overhead created by using the [`streaming_split`](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.streaming_split.html) iterator. There are two primary overheads that are measured:

1. The [OutputSplitter](https://github.com/ray-project/ray/blob/1cc18ea622a6e1899a17f4548aa3734c3656a90f/python/ray/data/_internal/execution/operators/output_splitter.py#L18) operator. This operator is responsible for distributing data to different splits. It runs in the scheduler thread.
2. The [SplitCoordinator actor](https://github.com/ray-project/ray/blob/783da640a20ddbd3b41b893485abf187f0f27223/python/ray/data/_internal/iterator/stream_split_iterator.py#L124). The overhead is mainly from the get and the barrier, which happen in some background threads that handle the remote task requests.

The first overhead is tracked at the operator level in the OutputSplitter operator. The second overhead is tracked at the Dataset level in the SplitCoordinator actor.

---------

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Why are these changes needed?
This adds timing for the overhead created by using the `streaming_split` iterator. There are two primary overheads that are measured: the OutputSplitter operator and the SplitCoordinator actor. The first overhead is tracked at the operator level in the OutputSplitter operator. The second overhead is tracked at the Dataset level in the SplitCoordinator actor.
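The operator-level measurement can be pictured with a small stand-alone sketch. `ToySplitter` is a hypothetical class, and its policy of sending each bundle to the split with the fewest rows so far is an assumption made for illustration, not a description of OutputSplitter's actual logic. The only point is that the dispatch work is bracketed by `time.perf_counter()` calls and accumulated on the operator itself.

```python
import time
from collections import deque
from typing import Deque, List


class ToySplitter:
    """Hypothetical stand-in for an output-splitting operator."""

    def __init__(self, n: int) -> None:
        self._queues: List[Deque[int]] = [deque() for _ in range(n)]
        # The number of rows output to each split so far.
        self._num_output: List[int] = [0] * n
        # Accumulated dispatch overhead in seconds (operator level).
        self.overhead_s = 0.0

    def add_input(self, rows: int) -> None:
        start = time.perf_counter()
        # Assumed policy: pick the split that has received the fewest rows.
        target = min(range(len(self._num_output)), key=self._num_output.__getitem__)
        self._queues[target].append(rows)
        self._num_output[target] += rows
        self.overhead_s += time.perf_counter() - start
```

A Dataset-level counterpart would accumulate time in the same way, but around the coordinator's request handling rather than inside an operator.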
Related issue number
Closes #42802