
[Data] Fix throughput time calculations #44138

Merged · 13 commits · Mar 26, 2024

Conversation

omatthew98 (Contributor)

Why are these changes needed?

While documenting the new stats we added, I noticed some quirks in the dataset throughput reporting. Dataset throughput was often reported as worse through Ray Data than the single node approximation, but the single node approximation should be a floor for Ray Data throughput (i.e. the two should match only if a single node / single block were used).

We want to compute the Ray Data dataset throughput by dividing the total number of rows produced by the total wall time the operation took to run (i.e. from process start to process completion). We want to compute the single node approximation by dividing the total number of rows produced by the total wall time of all blocks across all operators (i.e. the total time if there were no concurrent processes).

Upon digging, I found that the calculation was incorrect. For the Ray Data dataset throughput, we were dividing by `self.time_total_s`, which is the total time for the last operator in the chain (sometimes correct, but incorrect with multiple non-fused operators). For the single node approximation, we were dividing by `self.get_total_wall_time()`, which is actually what we want for the Ray Data dataset throughput, but incorrect for the single node approximation because it does not sum time over the multiple blocks.
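The two intended formulas can be sketched as follows (a minimal illustration of the description above, not the actual DatasetStats code; all names here are hypothetical):

```python
def dataset_throughput(total_rows, wall_time_s):
    # Observed Ray Data throughput: rows divided by the end-to-end wall
    # time of the whole (possibly concurrent) execution.
    return total_rows / wall_time_s


def single_node_throughput(total_rows, block_wall_times_s):
    # Single node approximation: rows divided by the summed wall time of
    # every block across every operator, i.e. as if all work ran serially.
    return total_rows / sum(block_wall_times_s)


# 1000 rows, 10 s end-to-end, with 4 blocks of 10 s of work each running
# concurrently: observed throughput is 4x the single node estimate.
print(dataset_throughput(1000, 10.0))            # 100.0 rows/s
print(single_node_throughput(1000, [10.0] * 4))  # 25.0 rows/s
```

With concurrency, the end-to-end wall time is smaller than the summed block time, so the observed throughput should always be at least the single node estimate.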

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

scottjlee (Contributor) left a comment:

LGTM, maybe add a unit test to cover the case with multiple non-fused ops

Comment on lines 906 to 921
total_time = self.get_total_wall_time()
total_time_all_blocks = self.get_total_time_all_blocks()
Contributor:

can we add docstrings for get_total_wall_time() and get_total_time_all_blocks() to explain their return value / differentiate them?

omatthew98 (Author):

Will do and explain a bit more why we are doing the calculation the way we are for both this and the operator throughput.

Comment on lines 995 to 1037
return parent_time_total + sum(
ss.wall_time.get("sum", 0) if ss.wall_time else 0
for ss in self.operators_stats
)
Contributor:

this is really getting the total wall time across all operators, not across blocks right? would a more appropriate name for the method be get_total_time_all_operators?

omatthew98 (Author):

I think that name makes more sense, but note that because we use ss.wall_time.get("sum", 0), the value is already a sum across all blocks within each operator.
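The point in this exchange can be sketched as follows (hypothetical, simplified classes modeled on the snippet under review, not the real DatasetStats implementation): each operator's wall_time["sum"] already aggregates across that operator's blocks, so summing it over operators totals time over all operators and all blocks.

```python
class OperatorStats:
    """Stats for one operator; wall_time["sum"] aggregates its blocks."""

    def __init__(self, block_wall_times):
        self.wall_time = (
            {"sum": sum(block_wall_times), "max": max(block_wall_times)}
            if block_wall_times
            else None
        )


class DatasetStats:
    def __init__(self, operators_stats, parent_time_total=0.0):
        self.operators_stats = operators_stats
        self.parent_time_total = parent_time_total

    def get_total_time_all_blocks(self):
        # Sums wall time across all operators AND all blocks, since each
        # operator's wall_time["sum"] is itself a sum over its blocks.
        return self.parent_time_total + sum(
            ss.wall_time.get("sum", 0) if ss.wall_time else 0
            for ss in self.operators_stats
        )


# Two operators: one with blocks of 1 s and 2 s, one with a single 3 s block.
stats = DatasetStats([OperatorStats([1.0, 2.0]), OperatorStats([3.0])])
print(stats.get_total_time_all_blocks())  # 6.0
```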

Signed-off-by: Matthew Owen <mowen@anyscale.com>
Comment on lines +1322 to +1332
# For throughput, we compute both an observed Ray Data operator throughput
# and an estimated single node operator throughput.

# The observed Ray Data operator throughput is computed by dividing the
# total number of rows produced by the wall time of the operator,
# time_total_s.

# The estimated single node operator throughput is computed by dividing the
# total number of rows produced by the sum of the wall times across all
# blocks of the operator. This assumes that on a single node the work done
# would be equivalent, with no concurrency.
Contributor:

i think this comment is kind of copied from above? or should they be in both places?

omatthew98 (Author):

One is for the dataset throughput and the other is for operator throughput; they are similar but have some differences. Either way, I wanted some information about the throughput in both places where we calculate it, but if that feels redundant, a common place for the shared info makes sense too.

Comment on lines 1238 to 1239
total_time, total_percent = metrics_dict["Total"]
metrics_dict.pop("Total")
Contributor:

nit

Suggested change
total_time, total_percent = metrics_dict["Total"]
metrics_dict.pop("Total")
total_time, total_percent = metrics_dict.pop("Total")
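The suggested one-liner is equivalent because dict.pop returns the removed value; a quick check with illustrative values:

```python
# Hypothetical metrics_dict contents, just to show the equivalence.
metrics_dict = {"Total": (12.5, 100.0), "ReadRange": (4.0, 32.0)}

# pop() both removes the key and returns its value, so the lookup and
# the pop collapse into one statement.
total_time, total_percent = metrics_dict.pop("Total")
print(total_time, total_percent)  # 12.5 100.0
print("Total" in metrics_dict)    # False
```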

r"Operator (\d+).*?Ray Data throughput: (\d+\.\d+) rows/s.*?Estimated single node throughput: (\d+\.\d+) rows/s", # noqa: E501
re.DOTALL,
)
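The pattern above can be exercised against a mocked stats string (the sample text below is illustrative, not real Ray output); re.DOTALL lets the non-greedy `.*?` spans cross line breaks between the operator header and its throughput lines:

```python
import re

pattern = re.compile(
    r"Operator (\d+).*?Ray Data throughput: (\d+\.\d+) rows/s"
    r".*?Estimated single node throughput: (\d+\.\d+) rows/s",
    re.DOTALL,
)

# Mocked multi-line stats output for a single operator.
sample = (
    "Operator 1 ReadRange->MapBatches\n"
    "* Ray Data throughput: 120.50 rows/s\n"
    "* Estimated single node throughput: 30.25 rows/s\n"
)

match = pattern.search(sample)
print(match.groups())  # ('1', '120.50', '30.25')
```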

Contributor:

just adding a clarifying comment

Suggested change
# Ray data throughput should always be better than single node throughput for multi-cpu case.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
@c21 c21 merged commit a7e3a2e into ray-project:master Mar 26, 2024
5 checks passed
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024