Local timeouts for remote activities #620
```rust
let timeout_at = [
    task.resp.heartbeat_timeout.clone(),
    task.resp.start_to_close_timeout.clone(),
]
.into_iter()
.flatten()
.map(Duration::try_from)
.filter_map(Result::ok)
.chain(sched)
.filter(|d| !d.is_zero())
.min();
```
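The chain above just takes the minimum of whichever timeouts are actually set, ignoring zeros. A minimal standalone sketch of the same selection logic, using plain `Option<Duration>` stand-ins for the proto fields (function name hypothetical):

```rust
use std::time::Duration;

// Hypothetical stand-in for the selection done on the task response: pick the
// earliest of the configured timeouts, treating zero durations as unset.
fn earliest_timeout(
    heartbeat: Option<Duration>,
    start_to_close: Option<Duration>,
    schedule_to_close: Option<Duration>,
) -> Option<Duration> {
    [heartbeat, start_to_close]
        .into_iter()
        .flatten()
        .chain(schedule_to_close)
        .filter(|d| !d.is_zero())
        .min()
}

fn main() {
    // Only start-to-close set: it wins.
    assert_eq!(
        earliest_timeout(None, Some(Duration::from_secs(30)), None),
        Some(Duration::from_secs(30))
    );
    // Heartbeat is shorter, so it drives the local timer.
    assert_eq!(
        earliest_timeout(Some(Duration::from_secs(10)), Some(Duration::from_secs(30)), None),
        Some(Duration::from_secs(10))
    );
    // Zero durations (unset on the wire) are ignored entirely.
    assert_eq!(earliest_timeout(Some(Duration::ZERO), None, None), None);
}
```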
This is the rustiest piece of code I've ever seen 😆
```rust
/// after one of them elapses. This is to avoid racing with server's own tracking of the
/// timeout.
#[builder(default = "Duration::from_secs(5)")]
pub local_timeout_buffer_for_activities: Duration,
```
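If this were exposed, configuring it would presumably look like the other builder-backed knobs. A toy mirror of the derive_builder-style default behavior (struct and builder names hypothetical; the real field lives on core's worker config):

```rust
use std::time::Duration;

// Hypothetical mirror of the generated builder, to show the default semantics.
struct WorkerConfig {
    local_timeout_buffer_for_activities: Duration,
}

#[derive(Default)]
struct WorkerConfigBuilder {
    local_timeout_buffer_for_activities: Option<Duration>,
}

impl WorkerConfigBuilder {
    fn local_timeout_buffer_for_activities(mut self, d: Duration) -> Self {
        self.local_timeout_buffer_for_activities = Some(d);
        self
    }
    fn build(self) -> WorkerConfig {
        WorkerConfig {
            // Matches the annotated default of 5s when the caller sets nothing.
            local_timeout_buffer_for_activities: self
                .local_timeout_buffer_for_activities
                .unwrap_or(Duration::from_secs(5)),
        }
    }
}

fn main() {
    let cfg = WorkerConfigBuilder::default().build();
    assert_eq!(cfg.local_timeout_buffer_for_activities, Duration::from_secs(5));

    let cfg = WorkerConfigBuilder::default()
        .local_timeout_buffer_for_activities(Duration::from_secs(10))
        .build();
    assert_eq!(cfg.local_timeout_buffer_for_activities, Duration::from_secs(10));
}
```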
👍 From a lang POV, I would wait to ever expose this until a user has a problem with this default
```rust
let local_timeout_buffer = self.local_timeout_buffer;
let sched = task
    .resp
    .schedule_to_close_timeout
```
I don't think schedule-to-close should ever come into play; I think only start-to-close, because this is only about one attempt, right? And does the server set start-to-close on the task response even if a user didn't set it? I'm also scared of doing anything with server-provided times like scheduled_time. I think we need heartbeat and start-to-close only.
But maybe I didn't read/understand this algorithm close enough to understand what schedule time is for here.
Because we use the scheduled time coming in the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.
As for heartbeat, I got rid of it per Roey's point, b/c there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.
> Because we use the scheduled time coming in the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.
I don't think we need to protect against that, only the attempt. I was under the impression start-to-close is always bound by the schedule-to-close time server-side anyway.
> As for heartbeat, I got rid of it per Roey's point, b/c there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.
It's not that weird, this code is very normal with a large start-to-close timeout:

```python
while True:
    activity.heartbeat()
    await my_http_client.do_hopefully_short_http_poll()
```

In fact, that's kinda the genesis of this whole issue: users are not setting good short timeouts on HTTP calls they hoped were quick. If do_hopefully_short_http_poll hangs, the entire system gums up. I didn't read the logic in the PR, but ideally we can have a start-to-close timer that will fail the task and a heartbeat timer that will fail the task, the latter simply getting reset on every heartbeat call.
There is a start to close timer. So that example still gets ended by that. But, sure, I'll see how bad it is to refresh the heartbeat and have a timer around that too
👍 If too much of a mess, can defer it. I wonder if the start-to-close timer is enough to never care about schedule-to-close.
I think we want to keep schedule to close because a user might only have a schedule timeout but not a start one.
Hrmm, I thought server always set a start-to-close based on schedule-to-close if the user doesn't explicitly have one. But now I understand the logic; I wasn't reading it closely enough (the zip, slice-of-two, min, etc. threw me a bit). All good, can resolve. I wish Rust made it easier to read this if/else simple logic.
> I thought server always set a start-to-close based on schedule-to-close if user does not explicitly have a start-to-close
I mentioned this in one of the outdated comments, I would check what server does here and avoid this SDK complexity if possible.
I'm also a bit worried about clock skew when referencing system time, you will decrease the overall timeout if the local clock is ahead of the server clock.
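To make the skew concern concrete: when the deadline is derived from the server's scheduled_time plus a timeout, a local clock running ahead shortens the effective window. A toy sketch with times as plain seconds (all names hypothetical):

```rust
// Toy model of the skew problem: all values are seconds since some shared epoch.
// The server scheduled the activity at `server_scheduled_at` with a
// schedule-to-close of `schedule_to_close`; the worker computes how long it
// should wait using its own clock, `local_now`.
fn local_deadline_in(server_scheduled_at: u64, schedule_to_close: u64, local_now: u64) -> i64 {
    (server_scheduled_at + schedule_to_close) as i64 - local_now as i64
}

fn main() {
    // Clocks agree: the local timer fires a full 60s after scheduling.
    assert_eq!(local_deadline_in(0, 60, 0), 60);
    // Local clock is 10s ahead of the server: the worker would cancel 10s early,
    // before the server's own timeout has actually elapsed.
    assert_eq!(local_deadline_in(0, 60, 10), 50);
}
```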
Hrmm, do we need to up that buffer leeway time from 5s to 10s, because it needs to represent the skew between server and client and give the server time to timeout first?
I added heartbeat timeouts back with resetting. Schedule doesn't really add any complexity compared to the others, so I don't see why not to keep it.
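The heartbeat-reset machinery can be modeled as a deadline that gets pushed out every time a heartbeat is recorded. A sketch with std types only (names hypothetical; the real code lives in core's activity task tracking and uses tokio timers):

```rust
use std::time::{Duration, Instant};

// Minimal model of a resettable heartbeat deadline.
struct HeartbeatTimer {
    timeout: Duration,
    deadline: Instant,
}

impl HeartbeatTimer {
    fn new(timeout: Duration) -> Self {
        Self { timeout, deadline: Instant::now() + timeout }
    }
    // Called whenever the activity heartbeats: the local timeout starts over.
    fn record_heartbeat(&mut self) {
        self.deadline = Instant::now() + self.timeout;
    }
    fn expired(&self) -> bool {
        Instant::now() >= self.deadline
    }
}

fn main() {
    let mut timer = HeartbeatTimer::new(Duration::from_millis(50));
    std::thread::sleep(Duration::from_millis(30));
    timer.record_heartbeat(); // heartbeat arrives, deadline pushed out
    std::thread::sleep(Duration::from_millis(30));
    assert!(!timer.expired()); // only ~30ms since last heartbeat
    std::thread::sleep(Duration::from_millis(30));
    assert!(timer.expired()); // ~60ms since last heartbeat, past the 50ms timeout
}
```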
This reverts commit 012545b.
A bit scared of local timeouts for users that have something like:
```python
@activity.defn
def my_activity() -> None:
    do_sync_thing_that_never_fails_because_I_never_heartbeat()
    do_cleanup_guaranteed_to_run_even_on_server_timeout()
```

The second call used to run for them, but maybe not anymore. But I guess they could never gracefully shut down workers then either. I guess we just need to be real clear in the release notes.
```rust
} else {
    tokio::time::sleep(sleep_time).await;
}
let _ = cancel_tx.send(PendingActivityCancel {
```
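The overall shape here, sleep until the computed deadline and then synthesize a cancel, can be sketched with std threads and channels (the real code uses tokio and sends a PendingActivityCancel; the types and names below are hypothetical stand-ins):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the cancel message core sends itself.
#[derive(Debug, PartialEq)]
struct PendingCancel {
    task_token: Vec<u8>,
    reason: &'static str,
}

fn spawn_local_timeout_watchdog(
    sleep_time: Duration,
    task_token: Vec<u8>,
    cancel_tx: mpsc::Sender<PendingCancel>,
) {
    thread::spawn(move || {
        thread::sleep(sleep_time);
        // Ignore send errors: the activity may have completed and dropped the receiver.
        let _ = cancel_tx.send(PendingCancel { task_token, reason: "local timeout" });
    });
}

fn main() {
    let (tx, rx) = mpsc::channel();
    spawn_local_timeout_watchdog(Duration::from_millis(10), vec![1, 2, 3], tx);
    let cancel = rx.recv().expect("watchdog should fire");
    assert_eq!(cancel.task_token, vec![1, 2, 3]);
    assert_eq!(cancel.reason, "local timeout");
}
```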
Do we want to increment a metric here or log or somehow let lang know via the proto that this was due to local timeout? I'm just starting to get scared that a user's clock is gonna skew like 10s off of server's and things are gonna just start cancelling everywhere.
Or I guess we can look at all the activity failures and if they are failures for untriggered cancel we can assume it was either worker shutdown or this timeout? That's probably good enough.
I suppose skew is only an issue with the scheduled timeout, since that's the only one that uses an absolute time. We could remove it for just that reason, since you're more likely to fail an attempt timeout or heartbeat anyway.
I don't want to add another metric until asked for... we've already got so many. I'll add a debug log line though.
👍 Very happy to see schedule timeout removed. I think it's much safer to do single-attempt duration than clock-based. May be worth checking whether server always sets start-to-close, I forget, but I do feel safer now w/ what is here.
What was changed
Core will now send cancellations if an activity would already have timed out according to the server, but we haven't learned of that (typically because it isn't heartbeating). There is a configurable buffer, defaulting to 5s, to avoid races with the server (we would prefer to learn of the timeout from the server).
Why?
It's nice to have your activities get cancelled when they would have been, even if they don't heartbeat.
Checklist
Closes [Feature Request] Worker enforcement of activity timeouts (features#170)
How was this tested:
Unit & Integ
Any docs updates needed?