Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] Stabilize metrics pusher #38349

Merged
merged 7 commits into from Aug 14, 2023
Merged

Conversation

zcin
Copy link
Contributor

@zcin zcin commented Aug 11, 2023

Why are these changes needed?

This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:

  1. Record start = time.time()
  2. Execute task
  3. Record task last_call_succeeded_time
  4. Sleep until time = start + 2

The problem:
Strictly speaking, start + 2 < task.last_call_succeeded_time because start < task.last_call_succeeded_time. This is not always a problem because the task execution is very fast (1-5ms) and time.sleep(2) sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after task.last_call_succeeded_time + 2. However sometimes the thread wakes up before task.last_call_succeeded_time + 2 meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least last_call_succeeded_time + 2.

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting least_interval_s to min(task.interval_s for all tasks), we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.

Related issue number

Closes #38360
Closes #38361

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -665,6 +667,21 @@ def test_get_all_live_placement_group_names(ray_instance):
assert set(get_all_live_placement_group_names()) == {"pg3", "pg4", "pg5", "pg6"}


def test_metrics_pusher():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, this test almost never passes. counter["val"] at the end of the test is 16-17.

@zcin zcin marked this pull request as ready for review August 11, 2023 05:29
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@sihanwang41
Copy link
Contributor

Nice to catch this issue! (make our autoscaler more robust!)

sync offline to respect last success time to sleep.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin requested a review from a team August 11, 2023 16:39
python/ray/serve/tests/test_util.py Outdated Show resolved Hide resolved
Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix! This approach looks good to me, pending @edoakes's comment.


# For all tasks, check when the task should be executed
# next. Sleep until the next closest time.
least_interval_s = math.inf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] Could we raise an error if the MetricsPusher is started without any tasks registered? As written, it silently sleeps forever since least_interval_s gets set to math.inf.

Copy link
Contributor Author

@zcin zcin Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! I've added this to MetricsPusher.start()

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin requested a review from edoakes August 11, 2023 19:15
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thank you for updating the unit tests

Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

for _ in range(10000000):
for key in result.keys():
assert result[key] == expected_results[key]
if len(result) == 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the test fail if len(result) is never 3? We should add an assertion after the for loop in that case. Same question for test_metrics_pusher_basic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good call, I've added this in both tests!

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@edoakes edoakes merged commit 0dbaa0a into ray-project:master Aug 14, 2023
34 of 36 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:
1. Record `start = time.time()`
2. Execute task
3. Record task `last_call_succeeded_time`
9. Sleep until `time = start + 2`

The problem:
Strictly speaking, `start + 2 < task.last_call_succeeded_time` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after `task.last_call_succeeded_time + 2`. However sometimes the thread wakes up before `task.last_call_succeeded_time + 2` meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:
1. Record `start = time.time()`
2. Execute task
3. Record task `last_call_succeeded_time`
9. Sleep until `time = start + 2`

The problem:
Strictly speaking, `start + 2 < task.last_call_succeeded_time` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after `task.last_call_succeeded_time + 2`. However sometimes the thread wakes up before `task.last_call_succeeded_time + 2` meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.

Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:
1. Record `start = time.time()`
2. Execute task
3. Record task `last_call_succeeded_time`
9. Sleep until `time = start + 2`

The problem:
Strictly speaking, `start + 2 < task.last_call_succeeded_time` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after `task.last_call_succeeded_time + 2`. However sometimes the thread wakes up before `task.last_call_succeeded_time + 2` meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.


Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.
@zcin zcin deleted the metrics-pusher branch August 25, 2023 17:33
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:
1. Record `start = time.time()`
2. Execute task
3. Record task `last_call_succeeded_time`
9. Sleep until `time = start + 2`

The problem:
Strictly speaking, `start + 2 < task.last_call_succeeded_time` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after `task.last_call_succeeded_time + 2`. However sometimes the thread wakes up before `task.last_call_succeeded_time + 2` meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
This PR solves two problems.
First, suppose there is one task registered on the metrics pusher with interval 2 seconds.
What the metrics pusher does repeatedly:
1. Record `start = time.time()`
2. Execute task
3. Record task `last_call_succeeded_time`
9. Sleep until `time = start + 2`

The problem:
Strictly speaking, `start + 2 < task.last_call_succeeded_time` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so oftentimes the thread actually wakes up at a time that is after `task.last_call_succeeded_time + 2`. However sometimes the thread wakes up before `task.last_call_succeeded_time + 2` meaning it doesn't satisfy the if statement and doesn't execute the task to push the metric. This causes the metrics pusher to "skip" intervals.
Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with interval 8s and one with interval 10s. Currently, the thread on the metrics pusher will wake up every 8 seconds to check if any tasks need executing. This means the first task will be executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every interval when is the next time any task needs to be executed, so the sleep time is always varying. E.g. with an 8s interval task and 10s interval task, the thread should roughly sleep 8 seconds then 2 seconds and repeat.

Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants