[serve] Stabilize metrics pusher #38349

zcin · 2023-08-11T05:26:21Z

Why are these changes needed?

This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:

Record start = time.time()
Execute task
Record task last_call_succeeded_time
Sleep until time = start + 2

The problem:
Strictly speaking, start + 2 < task.last_call_succeeded_time + 2 because start < task.last_call_succeeded_time. This is not always a problem because task execution is very fast (1-5 ms) and time.sleep(2) sleeps for "at least 2 seconds", so the thread often wakes up after task.last_call_succeeded_time + 2. Sometimes, however, the thread wakes up before task.last_call_succeeded_time + 2, so the if statement is not satisfied and the task that pushes the metric is not executed. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least last_call_succeeded_time + 2.
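
The timing issue above can be illustrated with a small deterministic sketch. This is not the Ray code: `should_run`, `wake_time_old`, and `wake_time_new` are hypothetical names, and the 3 ms task duration is an assumed value inside the 1-5 ms range mentioned above.

```python
def should_run(now: float, last_call_succeeded_time: float, interval_s: float) -> bool:
    """Hypothetical per-task check: run only once a full interval has elapsed."""
    return now >= last_call_succeeded_time + interval_s


def wake_time_old(start: float, interval_s: float) -> float:
    # Old behavior: sleep until `start + interval_s`.
    return start + interval_s


def wake_time_new(last_call_succeeded_time: float, interval_s: float) -> float:
    # Proposed behavior: sleep until `last_call_succeeded_time + interval_s`.
    return last_call_succeeded_time + interval_s


start = 100.0
last_success = start + 0.003  # assumed: the task took 3 ms to execute
interval_s = 2.0

# If the thread wakes exactly on time, the old target is about 3 ms too early.
old_runs = should_run(wake_time_old(start, interval_s), last_success, interval_s)
new_runs = should_run(wake_time_new(last_success, interval_s), last_success, interval_s)
print(old_runs, new_runs)
```

In this sketch the old wake-up target fails the check whenever the sleep does not overshoot by at least the task's execution time, which is the "skipped interval" case.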

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting least_interval_s to min(task.interval_s for all tasks), we should check, each time the thread wakes up, when the next task is due, so the sleep time varies. E.g. with an 8s interval task and a 10s interval task, the thread should sleep roughly 8 seconds, then 2 seconds, and so on, always waking at the next due time.
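
A rough simulation of the two strategies, using a fake clock instead of real sleeps. Everything here (`Task`, `simulate_fixed_sleep`, `simulate_dynamic_sleep`, the 40-second horizon) is a hypothetical illustration under the assumption that a task runs once `now - last_run >= interval_s`; it is not the MetricsPusher implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    interval_s: float
    last_run: float = 0.0


def run_due_tasks(tasks, now, runs):
    # Execute (here: just record) every task whose interval has elapsed.
    for t in tasks:
        if now - t.last_run >= t.interval_s:
            runs[t.name].append(now)
            t.last_run = now


def simulate_fixed_sleep(tasks, horizon_s):
    """Old behavior: always sleep min(interval_s) between checks."""
    sleep_s = min(t.interval_s for t in tasks)
    runs = {t.name: [] for t in tasks}
    now = 0.0
    while now + sleep_s <= horizon_s:
        now += sleep_s
        run_due_tasks(tasks, now, runs)
    return runs


def simulate_dynamic_sleep(tasks, horizon_s):
    """Proposed behavior: sleep until the earliest next due time."""
    runs = {t.name: [] for t in tasks}
    sleeps = []
    now = 0.0
    while True:
        least_interval_s = math.inf
        for t in tasks:
            least_interval_s = min(least_interval_s, t.last_run + t.interval_s - now)
        if now + least_interval_s > horizon_s:
            break
        sleeps.append(least_interval_s)
        now += least_interval_s
        run_due_tasks(tasks, now, runs)
    return runs, sleeps


runs_old = simulate_fixed_sleep([Task("a", 8), Task("b", 10)], 40)
runs_new, sleeps = simulate_dynamic_sleep([Task("a", 8), Task("b", 10)], 40)
print(runs_old["b"])  # 10s task under the fixed 8s sleep
print(runs_new["b"])  # 10s task under the dynamic sleep
print(sleeps)
```

With the fixed sleep the 10s task only runs at every second wake-up (every 16 seconds); with the dynamic sleep it runs every 10 seconds, and the first two sleeps are 8 and 2 seconds.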

Related issue number

Closes #38360
Closes #38361

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

zcin · 2023-08-11T05:27:54Z

python/ray/serve/tests/test_util.py

@@ -665,6 +667,21 @@ def test_get_all_live_placement_group_names(ray_instance):
    assert set(get_all_live_placement_group_names()) == {"pg3", "pg4", "pg5", "pg6"}


+def test_metrics_pusher():


Without this change, this test almost never passes. counter["val"] at the end of the test is 16-17.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

sihanwang41 · 2023-08-11T16:15:38Z

Nice catch on this issue! (It makes our autoscaler more robust!)

Synced offline: the sleep should respect the last success time.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

python/ray/serve/tests/test_util.py

shrekris-anyscale

Nice fix! This approach looks good to me, pending @edoakes's comment.

shrekris-anyscale · 2023-08-11T18:22:11Z

python/ray/serve/_private/utils.py

+
+                # For all tasks, check when the task should be executed
+                # next. Sleep until the next closest time.
+                least_interval_s = math.inf


[Nit] Could we raise an error if the MetricsPusher is started without any tasks registered? As written, it silently sleeps forever since least_interval_s gets set to math.inf.

That makes sense! I've added this to MetricsPusher.start()
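
A minimal sketch of this kind of guard, assuming a simplified class. `MiniMetricsPusher`, its attributes, and the choice of `ValueError` are hypothetical and only illustrate the fail-fast check; the actual MetricsPusher.start() may differ.

```python
import threading


class MiniMetricsPusher:
    """Hypothetical stand-in for illustration; not the real MetricsPusher."""

    def __init__(self):
        self.tasks = {}
        self.stop_event = threading.Event()
        self.thread = None

    def register_task(self, name, task_func, interval_s):
        self.tasks[name] = (task_func, interval_s)

    def start(self):
        # Fail fast: with no tasks, the computed sleep would stay at math.inf.
        if not self.tasks:
            raise ValueError("MiniMetricsPusher has no tasks registered.")
        # Placeholder thread: the real scheduling loop is omitted in this sketch.
        self.thread = threading.Thread(target=self.stop_event.wait, daemon=True)
        self.thread.start()

    def shutdown(self):
        self.stop_event.set()
        if self.thread is not None:
            self.thread.join()
```

Raising at start time surfaces the misconfiguration immediately instead of leaving a thread that never wakes up.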

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

edoakes

Looks great. Thank you for updating the unit tests

shrekris-anyscale

Nice work!

shrekris-anyscale · 2023-08-11T22:20:54Z

python/ray/serve/tests/test_util.py

+        for _ in range(10000000):
+            for key in result.keys():
+                assert result[key] == expected_results[key]
+            if len(result) == 3:


Should the test fail if len(result) is never 3? We should add an assertion after the for loop in that case. Same question for test_metrics_pusher_basic.

Yeah good call, I've added this in both tests!
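
One way to express the suggested check is Python's for/else, where the else branch runs only if the loop never hit `break`. This is a sketch under assumptions: `wait_for_results` and `max_iters` are hypothetical names, and the real tests may use a plain assertion after the loop instead.

```python
def wait_for_results(result, expected_results, max_iters=1000):
    """Poll `result` until it holds 3 entries; fail if that never happens."""
    for _ in range(max_iters):
        # Every entry seen so far must already match the expected value.
        for key in result.keys():
            assert result[key] == expected_results[key]
        if len(result) == 3:
            break
    else:
        # Reached only when the loop finished without `break`.
        raise AssertionError(f"Expected 3 results, got {len(result)}")
```

Without the else branch (or an equivalent assertion after the loop), the test would pass silently when the loop runs out of iterations before all three results arrive.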

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>


zcin commented Aug 11, 2023


zcin marked this pull request as ready for review August 11, 2023 05:29

stabilize metrics pusher

56bc3d9

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

fix case when there are multiple tasks with offset intervals

9b3dd84

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

zcin force-pushed the metrics-pusher branch from b7394cf to 9b3dd84 Compare August 11, 2023 16:33

add comments

cc87bde

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

zcin requested a review from a team August 11, 2023 16:39

zcin force-pushed the metrics-pusher branch from b0a9934 to cc87bde Compare August 11, 2023 17:06

edoakes requested changes Aug 11, 2023


python/ray/serve/tests/test_util.py (outdated)

shrekris-anyscale reviewed Aug 11, 2023


zcin added 3 commits August 11, 2023 11:57

make unit tests real unit tests

34a1749

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

improve

6a405b2

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

address comments

1d823cd

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

zcin requested a review from edoakes August 11, 2023 19:15

edoakes approved these changes Aug 11, 2023


shrekris-anyscale approved these changes Aug 11, 2023


akshay-anyscale assigned zcin Aug 12, 2023

add check at end of test

facee76

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>

edoakes approved these changes Aug 14, 2023


edoakes merged commit 0dbaa0a into ray-project:master Aug 14, 2023
34 of 36 checks passed

zcin deleted the metrics-pusher branch August 25, 2023 17:33
