[serve] Stabilize metrics pusher #38349
python/ray/serve/tests/test_util.py
@@ -665,6 +667,21 @@ def test_get_all_live_placement_group_names(ray_instance):
    assert set(get_all_live_placement_group_names()) == {"pg3", "pg4", "pg5", "pg6"}


def test_metrics_pusher():
Without this change, this test almost never passes: `counter["val"]` at the end of the test is 16-17.
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Nice catch! This makes our autoscaler more robust. We synced offline and agreed that the sleep should respect the last success time.
Nice fix! This approach looks good to me, pending @edoakes's comment.
# For all tasks, check when the task should be executed
# next. Sleep until the next closest time.
least_interval_s = math.inf
[Nit] Could we raise an error if the `MetricsPusher` is started without any tasks registered? As written, it silently sleeps forever since `least_interval_s` gets set to `math.inf`.
That makes sense! I've added this to `MetricsPusher.start()`.
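A minimal sketch of the kind of guard discussed in this thread. The class below is a hypothetical stand-in, not Ray's actual `MetricsPusher`; the names `PusherSketch` and `register_task`, and the choice of `ValueError`, are assumptions made for illustration only.

```python
class PusherSketch:
    """Hypothetical stand-in for a metrics pusher (not Ray's implementation)."""

    def __init__(self):
        self.tasks = {}
        self.started = False

    def register_task(self, name, fn, interval_s):
        # Remember the callable and how often it should run.
        self.tasks[name] = (fn, interval_s)

    def start(self):
        # Fail fast instead of computing an infinite sleep over zero tasks.
        if not self.tasks:
            raise ValueError("start() called without any tasks registered")
        # A real implementation would start its background thread here.
        self.started = True


pusher = PusherSketch()
try:
    pusher.start()
except ValueError as err:
    print(f"refused to start: {err}")

pusher.register_task("push_metrics", lambda: None, interval_s=2)
pusher.start()
```

Raising at start time surfaces the misconfiguration immediately, rather than leaving a thread that never wakes up.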
Looks great. Thank you for updating the unit tests.
Nice work!
for _ in range(10000000):
    for key in result.keys():
        assert result[key] == expected_results[key]
    if len(result) == 3:
Should the test fail if `len(result)` is never 3? We should add an assertion after the `for` loop in that case. Same question for `test_metrics_pusher_basic`.
Yeah good call, I've added this in both tests!
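One way to express the assertion requested in this thread is Python's `for`/`else`, where the `else` branch runs only if the loop never hit `break`. This is a sketch, not the code added in the PR; the helper name `poll_until_complete` and its parameters are hypothetical.

```python
def poll_until_complete(get_result, expected, attempts=1000):
    """Poll get_result() until every expected key is present, or fail loudly."""
    for _ in range(attempts):
        result = get_result()
        # Every key seen so far must already hold the expected value.
        for key in result:
            assert result[key] == expected[key]
        if len(result) == len(expected):
            break
    else:
        # Reached only when the loop finished without `break`.
        raise AssertionError(f"result never had {len(expected)} keys")
    return result


# Succeeds on the first poll because all keys are already present.
print(poll_until_complete(lambda: {"a": 1, "b": 2}, {"a": 1, "b": 2}))
```

Without the `else` branch, a loop that never sees the full result would fall through and the test would pass silently.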
This PR solves two problems.

First, suppose there is one task registered on the metrics pusher with an interval of 2 seconds. The metrics pusher repeatedly does the following:
1. Record `start = time.time()`
2. Execute the task
3. Record the task's `last_call_succeeded_time`
4. Sleep until `time = start + 2`

The problem: strictly speaking, `start + 2 < task.last_call_succeeded_time + 2` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so the thread often wakes up after `task.last_call_succeeded_time + 2`. Sometimes, however, the thread wakes up before `task.last_call_succeeded_time + 2`, so the if statement is not satisfied and the task that pushes the metric is not executed. This causes the metrics pusher to "skip" intervals. Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with an interval of 8s and one with an interval of 10s. Currently, the thread on the metrics pusher wakes up every 8 seconds to check whether any tasks need executing. This means the first task is executed every 8 seconds, and the second task every 16 seconds. Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every wake-up when any task next needs to be executed, so the sleep time varies. E.g. with an 8s interval task and a 10s interval task, the thread should sleep roughly 8 seconds, then 2 seconds, then 6 seconds, and so on.

Signed-off-by: NripeshN <nn2012@hw.ac.uk>
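The second problem can be illustrated with a small simulation. This is a sketch on an idealized clock where tasks take no time and sleeps are exact; it is not the code in this PR, and the function names and the dict-based task records are assumptions made for illustration.

```python
def run_due_tasks(tasks, now, runs):
    """Run every task whose interval has elapsed since its last success."""
    for task in tasks:
        if now - task["last_call_succeeded_time"] >= task["interval_s"]:
            task["last_call_succeeded_time"] = now
            runs[task["interval_s"]].append(now)


def simulate(intervals_s, horizon_s, fixed_wakeup):
    """Simulate the pusher loop up to horizon_s seconds.

    fixed_wakeup=True: always sleep min(interval_s), as before this change.
    fixed_wakeup=False: sleep until the earliest next due time.
    Returns (sleep durations, {interval_s: [times the task ran]}).
    """
    tasks = [{"interval_s": i, "last_call_succeeded_time": 0} for i in intervals_s]
    runs = {i: [] for i in intervals_s}
    now, sleeps = 0, []
    while True:
        if fixed_wakeup:
            sleep_s = min(intervals_s)
        else:
            sleep_s = min(
                t["last_call_succeeded_time"] + t["interval_s"] - now for t in tasks
            )
        if now + sleep_s > horizon_s:
            return sleeps, runs
        now += sleep_s
        sleeps.append(sleep_s)
        run_due_tasks(tasks, now, runs)


old_sleeps, old_runs = simulate([8, 10], horizon_s=40, fixed_wakeup=True)
new_sleeps, new_runs = simulate([8, 10], horizon_s=40, fixed_wakeup=False)
print(old_runs[10])  # [16, 32]
print(new_runs[10])  # [10, 20, 30, 40]
print(new_sleeps)    # [8, 2, 6, 4, 4, 6, 2, 8]
```

With fixed 8-second wake-ups the 10s task only runs at every second wake-up, while sleeping until the next due time lets it run on its own schedule.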
Why are these changes needed?
This PR solves two problems.

First, suppose there is one task registered on the metrics pusher with an interval of 2 seconds. The metrics pusher repeatedly does the following:
1. Record `start = time.time()`
2. Execute the task
3. Record the task's `last_call_succeeded_time`
4. Sleep until `time = start + 2`

The problem: strictly speaking, `start + 2 < task.last_call_succeeded_time + 2` because `start < task.last_call_succeeded_time`. This is not always a problem because the task execution is very fast (1-5ms) and `time.sleep(2)` sleeps for "at least 2 seconds", so the thread often wakes up after `task.last_call_succeeded_time + 2`. Sometimes, however, the thread wakes up before `task.last_call_succeeded_time + 2`, so the if statement is not satisfied and the task that pushes the metric is not executed. This causes the metrics pusher to "skip" intervals.

Since all our tasks and callback functions are essentially no-ops, we should just sleep until at least `last_call_succeeded_time + 2`.

Second, suppose there is more than one task registered on the shared metrics pusher, one with an interval of 8s and one with an interval of 10s. Currently, the thread on the metrics pusher wakes up every 8 seconds to check whether any tasks need executing. This means the first task is executed every 8 seconds, and the second task every 16 seconds.

Instead of setting `least_interval_s` to `min(task.interval_s for all tasks)`, we should check at every wake-up when any task next needs to be executed, so the sleep time varies. E.g. with an 8s interval task and a 10s interval task, the thread should sleep roughly 8 seconds, then 2 seconds, then 6 seconds, and so on.

Related issue number
Closes #38360
Closes #38361
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.