[data] create StatsManager to manage _StatsActor remote calls #40913
Conversation
Should we still submit the stats update tasks on a different thread? If we're doing it every 10s, the overhead is a lot less significant.
Could you measure the latency for the task submission? You can use one of the release tests. We just have to be careful here because even if the % time is small, it can block GPU time in training scenarios (i.e. tail latency is important, not just average). If we put it on a background thread, we could also reduce the interval a bit to get more interactivity in the metrics.
Ran the release test. [benchmark results not captured]

Also timed the overhead for starting a thread on the same test: [results not captured]
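For reference, a minimal sketch of how such a submission-latency measurement could look. The `_Sink` actor and its payload are hypothetical stand-ins, not the real `_StatsActor` API:

```python
import time

import ray


@ray.remote
class _Sink:
    # Stand-in for the stats actor; we only care about submission cost.
    def update(self, payload):
        pass


ray.init()
sink = _Sink.remote()

# Time only the submission of the remote call (not its completion),
# since submission is what would block the iteration thread.
start = time.perf_counter()
ref = sink.update.remote({"rows": 1000})
print(f"submission overhead: {(time.perf_counter() - start) * 1e6:.1f} us")
ray.get(ref)  # ensure the call actually completed before exit
```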
Hmm, the 1% overhead is not too bad, but I think it's probably best to put it in a background thread still (to allow a shorter interval and make the code a bit more robust). No need to start new threads; you can put the stats update in the same loop as
Sounds good 👍.
What about the update done in
(force-pushed from eda3e67 to 8431a5d)
Ah, that one is probably okay since it's running on the driver. For latency-critical scenarios, the iter_batches loop is usually run on a different process. Also, the granularity in the scheduling loop is coarser - there can be lots of batches produced by one Data op task.
(force-pushed from 9b06ae1 to ea31013)
```python
for batch in formatted_batch_iter:
    yield batch
    # Update stats in here to avoid blocking main
    # iteration thread with task submission overhead.
    if stats_update_lock.acquire(blocking=False):
        if (
            time.time() - last_stats_update[0]
            >= STATS_ACTOR_UPDATE_INTERVAL_SECONDS
        ):
            update_stats_actor_iter_metrics(stats, metrics_tag)
            last_stats_update[0] = time.time()
        stats_update_lock.release()
```
I ended up adding this here since `format_batches` is also used in other places; please let me know if there's anything wrong with this.
it's kinda weird to couple stats updating code with batch formatting code. Especially the thread pool and lock would make the code hard to read. I'd prefer keeping it in the main thread given the 1% overhead, or using a dedicated component (e.g., StatsManager) with a dedicated thread to update the stats periodically.
I asked for this:
> Hmm the 1% overhead is not too bad, but I think it's probably best to put it in a background thread still (to allow shorter interval and make the code a bit more robust).
The robustness benefit is worth the code-readability cost to me, but happy to sync offline about it. I also think the readability concern is a bit subjective, and it seems fine to me at the moment.
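For illustration, a minimal sketch of the StatsManager direction proposed here (and described in the PR summary at the bottom): a process-wide singleton whose single daemon thread flushes buffered updates on an interval and stops itself after a period of inactivity. Apart from the `StatsManager` name and the 5s interval, every detail below is an assumed simplification, not Ray's actual implementation:

```python
import threading
import time

UPDATE_INTERVAL_S = 5    # per the PR description: report metrics every 5s
IDLE_SHUTDOWN_S = 30     # assumed value; the real inactivity threshold isn't quoted here


class StatsManager:
    """Hypothetical sketch of a singleton that batches stats-actor updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # tag -> latest metrics payload
        self._last_update_t = time.time()
        self._thread = None

    def update_iter_metrics(self, tag, payload):
        """Called from iteration code; cheap, never blocks on the actor."""
        with self._lock:
            self._pending[tag] = payload
            self._last_update_t = time.time()
            if self._thread is None or not self._thread.is_alive():
                # (Re)start the reporting thread after an idle shutdown.
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()

    def _flush(self, batch):
        # Placeholder: the real code would submit a remote call to _StatsActor.
        print(f"flushing {len(batch)} metric update(s)")

    def _run(self):
        while True:
            time.sleep(UPDATE_INTERVAL_S)
            with self._lock:
                batch, self._pending = self._pending, {}
                idle = time.time() - self._last_update_t > IDLE_SHUTDOWN_S
            if batch:
                self._flush(batch)
            if idle:
                # Stop the thread when inactive; a later update restarts it.
                # (A real implementation would guard the restart race here.)
                return


STATS_MANAGER = StatsManager()
```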
python/ray/data/_internal/stats.py (Outdated)
```python
def update_stats_actor_dataset(dataset_tag, state):
    global _stats_actor
    _check_cluster_stats_actor()
    _stats_actor.update_dataset.remote(dataset_tag, state)
```
This function is now merged into the general `update_stats_actor_metrics`.
```diff
-return (self._base_dataset._plan._dataset_name or "") + self._base_dataset._uuid
+return (
+    self._base_dataset._plan._dataset_name or "dataset"
+) + self._base_dataset._uuid
```
I forgot to update this part back when we started using the stats actor to generate dataset UUIDs.
```diff
@@ -162,6 +161,7 @@ def _async_iter_batches(
     batch_format=batch_format,
     collate_fn=collate_fn,
     num_threadpool_workers=prefetch_batches,
+    metrics_tag=metrics_tag,
```
(Not blocking this PR.) Does it make more sense to save this `metrics_tag` in `DatasetStats`, so we don't need to pass them around together?
> it's kinda weird to couple stats updating code with batch formatting code. Especially the thread pool and lock would make the code hard to read. I'd prefer keeping it in the main thread given the 1% overhead, or using a dedicated component (e.g., StatsManager) with a dedicated thread to update the stats periodically.
Nice!
python/ray/data/_internal/stats.py (Outdated)
```diff
-def clear_stats_actor_iter_metrics(tags: Dict[str, str]):
-    global _stats_actor
-    _check_cluster_stats_actor()
+def update_stats_actor_iter_metrics(
```
Is it possible for this to be called by multiple iterators at once?
In theory yes, but the common case is that iterators will be in different processes.
What can happen if they're in the same process?
This isn't thread-safe right now.
It's not a common practice, but theoretically possible. Can we pass in a tag to distinguish them? This tag will be useful for the dashboard as well. Basically, I think we need the dataset ID + execution ID + `StreamSplitDataIterator` index in this tag.
I added support for multiple iterators: we store the last call from each iterator, and the background thread sends all of them to the stats actor at once.

For multiple iterators of the same dataset in the same process, it should be uncommon for their iteration to overlap, right? If their iterations are non-overlapping, the separate iterations should be easy to see in the time-series view. And even if we do use the execution ID, is there a way for the user to get the ID themselves?

For the streaming split iterator, I added the index.
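For illustration, a sketch of the tag shape discussed in this thread; the helper name and key names are hypothetical:

```python
from typing import Dict, Optional


def make_iter_metrics_tag(
    dataset_id: str, split_index: Optional[int] = None
) -> Dict[str, str]:
    # The dataset id distinguishes datasets; for StreamSplitDataIterator,
    # the split index distinguishes concurrent iterators in one process.
    tag = {"dataset": dataset_id}
    if split_index is not None:
        tag["split_index"] = str(split_index)
    return tag
```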
(force-pushed from 2cd09a8 to dd4f333)
python/ray/data/_internal/stats.py (Outdated)

```python
def update_stats_actor_iter_metrics(
    self, stats: "DatasetStats", tags: Dict[str, str]
):
```
Should we also force-update when the iterator finishes?
The point of the force-update in the other codepath is to update `_StatsActor.datasets`, not to update the metrics. Since we clear the metrics right afterwards anyway, it's unlikely those last metrics will even be emitted, since Prometheus only samples every couple of seconds.
This is a separate issue that would require support for actually stopping the emission of a metric for a certain tag (right now we just set the metric to 0). A simple way to make sure that the last metrics get emitted is to wait 10s before clearing.
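A sketch of that "wait before clearing" idea; the constant and the `clear_fn` callback are assumptions, not the actual code:

```python
import threading

# Assumed grace period matching the suggestion above, so Prometheus
# (which scrapes every few seconds) can sample the final values.
METRICS_CLEAR_GRACE_PERIOD_S = 10.0


def clear_metrics_after_grace(clear_fn, tag):
    # clear_fn is a hypothetical callback that zeroes the metrics for `tag`.
    timer = threading.Timer(METRICS_CLEAR_GRACE_PERIOD_S, clear_fn, args=(tag,))
    timer.daemon = True
    timer.start()
    return timer
```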
```diff
@@ -449,6 +445,7 @@ def _debug_dump_topology(topology: Topology, log_to_stdout: bool = True) -> None
     for i, (op, state) in enumerate(topology.items()):
         logger.get_logger(log_to_stdout).info(
             f"{i}: {state.summary_str()}, "
-            f"Blocks Outputted: {state.num_completed_tasks}/{op.num_outputs_total()}"
+            f"Blocks Outputted: {state.num_completed_tasks}/{op.num_outputs_total()}\n"
+            f"{op.metrics.as_dict()}"
```
Nit: maybe print the metrics in a different loop; mixing them together may make the logs harder to read.
Also, now that we've moved the interval checking out of this part of the code, do we still want to log this on an interval?
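For illustration, the nit above could look like the following sketch, reusing the names from the diff; this is an assumed restructuring, not necessarily what was merged:

```python
# First loop: one compact summary line per operator.
for i, (op, state) in enumerate(topology.items()):
    logger.get_logger(log_to_stdout).info(
        f"{i}: {state.summary_str()}, "
        f"Blocks Outputted: {state.num_completed_tasks}/{op.num_outputs_total()}"
    )
# Second loop: the verbose per-op metrics, kept apart for readability.
for i, (op, state) in enumerate(topology.items()):
    logger.get_logger(log_to_stdout).info(f"{i} metrics: {op.metrics.as_dict()}")
```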
(force-pushed from b741a47 to dd03bb2)
Signed-off-by: Andrew Xue <andewzxue@gmail.com>
(force-pushed from a3db5ca to cb3bdaf)
Looks great after this iteration. I left some small comments.
…oject#40913)

Creates a `StatsManager` class to manage remote calls to `_StatsActor`. This singleton manager controls the time interval for reporting metrics to `_StatsActor`:
- Runs a single background thread that reports metrics to `_StatsActor` every 5s
- This thread is stopped after being inactive for too long, and will be restarted if there is a new update afterwards

Also logs op metrics for `_debug_dump_topology`.

Signed-off-by: Andrew Xue <andewzxue@gmail.com>
Why are these changes needed?

Creates a `StatsManager` class to manage remote calls to `_StatsActor`. This singleton manager controls the time interval for reporting metrics to `_StatsActor`:
- Runs a single background thread that reports metrics to `_StatsActor` every 5s
- This thread is stopped after being inactive for too long, and will be restarted if there is a new update afterwards

Also logs op metrics for `_debug_dump_topology`.

Related issue number
.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.