
Local timeouts for remote activities#620

Merged
Sushisource merged 10 commits into temporalio:master from Sushisource:local-timeouts on Nov 3, 2023

Conversation

@Sushisource (Member)

What was changed

Core will now send a cancellation if an activity would already have timed out according to the server, but we haven't yet learned of that (typically because the activity isn't heartbeating). There is a configurable buffer, defaulting to 5s, to avoid races with the server (we would prefer to learn of the timeout from the server).
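A rough sketch of the described behavior (all names here are illustrative, not Core's actual API): take the earliest applicable non-zero timeout, pad it with the buffer, and use the result as the local cancellation deadline.

```rust
use std::time::Duration;

// Illustrative sketch only -- not Core's actual API. Pick the earliest
// applicable non-zero timeout and pad it with the buffer, yielding the
// point at which Core would locally cancel a non-heartbeating activity.
fn local_cancel_after(
    start_to_close: Option<Duration>,
    heartbeat: Option<Duration>,
    buffer: Duration, // defaults to 5s
) -> Option<Duration> {
    [start_to_close, heartbeat]
        .into_iter()
        .flatten()
        .filter(|d| !d.is_zero())
        .min()
        .map(|t| t + buffer)
}
```

If no timeout applies, no local deadline is enforced and Core relies entirely on the server.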

Why?

It's nice to have your activities get cancelled when they would have been by the server, even if they don't heartbeat.

Checklist

  1. Closes [Feature Request] Worker enforcement of activity timeouts features#170

  2. How was this tested:
    Unit & Integ

  3. Any docs updates needed?

@Sushisource Sushisource requested a review from a team as a code owner October 27, 2023 17:44
Comment thread core/src/worker/activities.rs Outdated
Comment on lines +555 to +565
let timeout_at = [
    task.resp.heartbeat_timeout.clone(),
    task.resp.start_to_close_timeout.clone(),
]
.into_iter()
.flatten() // drop timeouts that aren't set
.map(Duration::try_from)
.filter_map(Result::ok) // drop any unconvertible protobuf durations
.chain(sched) // also consider the remaining schedule-to-close, if any
.filter(|d| !d.is_zero()) // zero-valued timeouts mean "unset"
.min(); // the earliest timeout wins
Member

This is the rustiest piece of code I've ever seen 😆

Member Author

🦀

Comment thread core-api/src/worker.rs
/// after one of them elapses. This is to avoid racing with server's own tracking of the
/// timeout.
#[builder(default = "Duration::from_secs(5)")]
pub local_timeout_buffer_for_activities: Duration,
Contributor
👍 From a lang POV, I would wait to expose this until a user has a problem with the default.

Comment thread core/src/worker/activities.rs Outdated
let local_timeout_buffer = self.local_timeout_buffer;
let sched = task
.resp
.schedule_to_close_timeout
Contributor

I don't think schedule-to-close should ever come into play; I think only start-to-close, because this is only about one attempt, right? And does the server set start-to-close on the task response even if a user didn't set it? I'm also scared of doing anything with server-provided times like scheduled_time. I think we need heartbeat and start-to-close only.

But maybe I didn't read/understand this algorithm closely enough to understand what the schedule time is for here.

Member Author

Because we use the scheduled time coming in on the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.

As for heartbeat, I got rid of it per Roey's point, because there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.
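A sketch of that derivation (hypothetical names, not the PR's actual code): subtract the time already elapsed since the task's scheduled time from the schedule-to-close timeout to get what this attempt may still use.

```rust
use std::time::{Duration, SystemTime};

// Hypothetical sketch (not the PR's actual code): derive how much of the
// schedule-to-close timeout remains for this attempt, based on the
// scheduled time delivered with the task. Note this compares a
// server-generated timestamp against the local clock, so it is the one
// place clock skew can bite (a concern raised later in this thread).
fn remaining_schedule_to_close(
    scheduled_time: SystemTime,
    schedule_to_close: Duration,
) -> Option<Duration> {
    let elapsed = SystemTime::now().duration_since(scheduled_time).ok()?;
    // None if the overall timeout has already elapsed.
    schedule_to_close.checked_sub(elapsed)
}
```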

@cretz (Contributor), Oct 30, 2023

> Because we use the scheduled time coming in the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.

I don't think we need to protect against that, only the attempt. I was under the impression start-to-close is always bounded by the schedule-to-close time server-side anyway.

> As for heartbeat, I got rid of it per Roey's point, b/c there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.

It's not that weird; this code is very normal with a large start-to-close timeout:

while True:
    activity.heartbeat()
    await my_http_client.do_hopefully_short_http_poll()

In fact, that's kinda the genesis of this whole issue: users are not setting good short timeouts on HTTP calls they hoped were quick. If do_hopefully_short_http_poll hangs, the entire system gums up. I didn't read the logic in the PR, but ideally we can have a start-to-close timer that will fail the task and a heartbeat timer that will fail the task, the latter simply getting reset on every heartbeat call.
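The reset-on-heartbeat timer asked for here can be sketched like so (std channels stand in for Core's real machinery; all names are illustrative):

```rust
use std::sync::mpsc;
use std::time::Duration;

// Illustrative sketch, not the PR's actual code. Each heartbeat resets the
// deadline; if none arrives within `heartbeat_timeout`, report a local
// timeout (true). The sender being dropped means the attempt finished.
fn watch_heartbeats(heartbeats: mpsc::Receiver<()>, heartbeat_timeout: Duration) -> bool {
    loop {
        match heartbeats.recv_timeout(heartbeat_timeout) {
            Ok(()) => continue, // heartbeat arrived: timer effectively resets
            Err(mpsc::RecvTimeoutError::Disconnected) => return false, // done
            Err(mpsc::RecvTimeoutError::Timeout) => return true, // cancel locally
        }
    }
}
```

Core's actual implementation is async (tokio) rather than thread-blocking, but the reset-on-receive shape is the same.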

Member Author

There is a start-to-close timer, so that example still gets ended by that. But, sure, I'll see how bad it is to refresh the heartbeat and have a timer around that too.

Contributor

👍 If too much of a mess, can defer it. I wonder if the start-to-close timer is enough to never care about schedule-to-close.

Member Author

I think we want to keep schedule to close because a user might only have a schedule timeout but not a start one.

@cretz (Contributor), Oct 30, 2023

Hrmm, I thought server always set a start-to-close based on schedule-to-close if the user does not explicitly have a start-to-close. But now I understand the logic; I wasn't reading it closely enough (zip, slice of two, min, etc. threw me a bit). All good, can resolve. I wish Rust made it easier to read this simple if/else logic.

Member

> I thought server always set a start-to-close based on schedule-to-close if user does not explicitly have a start-to-close

I mentioned this in one of the outdated comments: I would check what the server does here and avoid this SDK complexity if possible.
I'm also a bit worried about clock skew when referencing system time; you will decrease the overall timeout if the local clock is ahead of the server clock.

Contributor

Hrmm, do we need to up that buffer leeway time from 5s to 10s, since it needs to cover the skew between server and client and give the server time to time out first?

Member Author

I added heartbeat timeouts back with resetting. Schedule doesn't really add any complexity compared to the others, so I don't see why not to keep it.

@cretz (Contributor) left a comment

A bit scared of local timeouts for users that have something like:

@activity.defn
def my_activity() -> None:
    do_sync_thing_that_never_fails_because_I_never_heartbeat()
    do_cleanup_guaranteed_to_run_even_on_server_timeout()

The second call used to run for them, but maybe not anymore. But I guess they could never gracefully shut down workers then either. I guess we just need to be really clear in the release notes.

} else {
tokio::time::sleep(sleep_time).await;
}
let _ = cancel_tx.send(PendingActivityCancel {
Contributor

Do we want to increment a metric here or log or somehow let lang know via the proto that this was due to local timeout? I'm just starting to get scared that a user's clock is gonna skew like 10s off of server's and things are gonna just start cancelling everywhere.

Or I guess we can look at all the activity failures and if they are failures for untriggered cancel we can assume it was either worker shutdown or this timeout? That's probably good enough.

@Sushisource (Member Author), Nov 2, 2023

I suppose skew is only an issue with the scheduled timeout, since that's the only one that uses an absolute time. We could remove it for just that reason, since you're more likely to fail an attempt timeout or heartbeat anyway.

I don't want to add another metric until asked for... we've already got so many. I'll add a debug log line though.

Contributor

👍 Very happy to see the schedule timeout removed. I think it's much safer to do single-attempt durations than clock-based ones. May be worth checking whether server always sets start-to-close (I forget), but I do feel safer now with what is here.
