Local timeouts for remote activities #620
```rust
let timeout_at = [
    task.resp.heartbeat_timeout.clone(),
    task.resp.start_to_close_timeout.clone(),
]
.into_iter()
.flatten()
.map(Duration::try_from)
.filter_map(Result::ok)
.chain(sched)
.filter(|d| !d.is_zero())
.min();
```
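The chain above just takes the minimum of whichever timeouts are actually set, ignoring zeros. A minimal standalone sketch of the same selection logic, using plain `Option<Duration>` stand-ins for the proto fields (function name hypothetical):

```rust
use std::time::Duration;

// Hypothetical stand-in for the selection done on the task response: pick the
// earliest of the configured timeouts, treating zero durations as unset.
fn earliest_timeout(
    heartbeat: Option<Duration>,
    start_to_close: Option<Duration>,
    schedule_to_close: Option<Duration>,
) -> Option<Duration> {
    [heartbeat, start_to_close]
        .into_iter()
        .flatten()
        .chain(schedule_to_close)
        .filter(|d| !d.is_zero())
        .min()
}

fn main() {
    // Only start-to-close set: it wins.
    assert_eq!(
        earliest_timeout(None, Some(Duration::from_secs(30)), None),
        Some(Duration::from_secs(30))
    );
    // Heartbeat is shorter, so it drives the local timer.
    assert_eq!(
        earliest_timeout(Some(Duration::from_secs(10)), Some(Duration::from_secs(30)), None),
        Some(Duration::from_secs(10))
    );
    // Zero durations (unset on the wire) are ignored entirely.
    assert_eq!(earliest_timeout(Some(Duration::ZERO), None, None), None);
}
```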
This is the rustiest piece of code I've ever seen 😆
```rust
/// after one of them elapses. This is to avoid racing with server's own tracking of the
/// timeout.
#[builder(default = "Duration::from_secs(5)")]
pub local_timeout_buffer_for_activities: Duration,
```
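If this were exposed, configuring it would presumably look like the other builder-backed knobs. A toy mirror of the derive_builder-style default behavior (struct and builder names hypothetical; the real field lives on core's worker config):

```rust
use std::time::Duration;

// Hypothetical mirror of the generated builder, to show the default semantics.
struct WorkerConfig {
    local_timeout_buffer_for_activities: Duration,
}

#[derive(Default)]
struct WorkerConfigBuilder {
    local_timeout_buffer_for_activities: Option<Duration>,
}

impl WorkerConfigBuilder {
    fn local_timeout_buffer_for_activities(mut self, d: Duration) -> Self {
        self.local_timeout_buffer_for_activities = Some(d);
        self
    }
    fn build(self) -> WorkerConfig {
        WorkerConfig {
            // Matches the annotated default of 5s when the caller sets nothing.
            local_timeout_buffer_for_activities: self
                .local_timeout_buffer_for_activities
                .unwrap_or(Duration::from_secs(5)),
        }
    }
}

fn main() {
    let cfg = WorkerConfigBuilder::default().build();
    assert_eq!(cfg.local_timeout_buffer_for_activities, Duration::from_secs(5));

    let cfg = WorkerConfigBuilder::default()
        .local_timeout_buffer_for_activities(Duration::from_secs(10))
        .build();
    assert_eq!(cfg.local_timeout_buffer_for_activities, Duration::from_secs(10));
}
```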
👍 From a lang POV, I would wait to ever expose this until a user has a problem with this default
```rust
let local_timeout_buffer = self.local_timeout_buffer;
let sched = task
    .resp
    .schedule_to_close_timeout
```
I don't think schedule-to-close should ever come into play; I think only start-to-close, because this is only about one attempt, right? And does the server set start-to-close on the task response even if a user didn't set it? I'm also scared of doing anything with server-provided times like scheduled_time. I think we need heartbeat and start-to-close only.
But maybe I didn't read/understand this algorithm close enough to understand what schedule time is for here.
Because we use the scheduled time coming in the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.
As for heartbeat, I got rid of it per Roey's point, b/c there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.
> Because we use the scheduled time coming in the task, we know if this attempt would hit the overall schedule timeout - so we can handle that knowing it will have already elapsed on the server in the same way.
I don't think we need to protect against that, only the attempt. I was under the impression start-to-close is always bound by the schedule-to-close time server-side anyway.
> As for heartbeat, I got rid of it per Roey's point, b/c there's a good amount of extra machinery involved to refresh it per-heartbeat, and that seems not worth it since the only situation it matters in is when you've specified a HB timeout but aren't actually heartbeating, which would be weird.
It's not that weird, this code is very normal with a large start-to-close timeout:

```python
while True:
    activity.heartbeat()
    await my_http_client.do_hopefully_short_http_poll()
```

In fact, that's kinda the genesis of this whole issue: users are not setting good short timeouts on HTTP calls they hoped were quick. If do_hopefully_short_http_poll hangs, the entire system gums up. I didn't read the logic in the PR, but ideally we can have a start-to-close timer that will fail the task and a heartbeat timer that will fail the task, the latter simply getting reset on every heartbeat call.
There is a start to close timer. So that example still gets ended by that. But, sure, I'll see how bad it is to refresh the heartbeat and have a timer around that too
👍 If too much of a mess, can defer it. I wonder if the start-to-close timer is enough to never care about schedule-to-close.
I think we want to keep schedule to close because a user might only have a schedule timeout but not a start one.
Hrmm, I thought server always set a start-to-close based on schedule-to-close if the user doesn't explicitly have one. But now I understand the logic; I wasn't reading it closely enough (the zip, slice-of-two, min, etc. threw me a bit). All good, can resolve. I wish Rust made it easier to read this if/else simple logic.
> I thought server always set a start-to-close based on schedule-to-close if user does not explicitly have a start-to-close
I mentioned this in one of the outdated comments, I would check what server does here and avoid this SDK complexity if possible.
I'm also a bit worried about clock skew when referencing system time, you will decrease the overall timeout if the local clock is ahead of the server clock.
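To make the skew concern concrete: when the deadline is derived from the server's scheduled_time plus a timeout, a local clock running ahead shortens the effective window. A toy sketch with times as plain seconds (all names hypothetical):

```rust
// Toy model of the skew problem: all values are seconds since some shared epoch.
// The server scheduled the activity at `server_scheduled_at` with a
// schedule-to-close of `schedule_to_close`; the worker computes how long it
// should wait using its own clock, `local_now`.
fn local_deadline_in(server_scheduled_at: u64, schedule_to_close: u64, local_now: u64) -> i64 {
    (server_scheduled_at + schedule_to_close) as i64 - local_now as i64
}

fn main() {
    // Clocks agree: the local timer fires a full 60s after scheduling.
    assert_eq!(local_deadline_in(0, 60, 0), 60);
    // Local clock is 10s ahead of the server: the worker would cancel 10s early,
    // before the server's own timeout has actually elapsed.
    assert_eq!(local_deadline_in(0, 60, 10), 50);
}
```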
Hrmm, do we need to up that buffer leeway time from 5s to 10s, because it needs to represent the skew between server and client and give the server time to timeout first?
I added heartbeat timeouts back with resetting. Schedule doesn't really add any complexity compared to the others, so I don't see why not to keep it.
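The heartbeat-reset machinery can be modeled as a deadline that gets pushed out every time a heartbeat is recorded. A sketch with std types only (names hypothetical; the real code lives in core's activity task tracking and uses tokio timers):

```rust
use std::time::{Duration, Instant};

// Minimal model of a resettable heartbeat deadline.
struct HeartbeatTimer {
    timeout: Duration,
    deadline: Instant,
}

impl HeartbeatTimer {
    fn new(timeout: Duration) -> Self {
        Self { timeout, deadline: Instant::now() + timeout }
    }
    // Called whenever the activity heartbeats: the local timeout starts over.
    fn record_heartbeat(&mut self) {
        self.deadline = Instant::now() + self.timeout;
    }
    fn expired(&self) -> bool {
        Instant::now() >= self.deadline
    }
}

fn main() {
    let mut timer = HeartbeatTimer::new(Duration::from_millis(50));
    std::thread::sleep(Duration::from_millis(30));
    timer.record_heartbeat(); // heartbeat arrives, deadline pushed out
    std::thread::sleep(Duration::from_millis(30));
    assert!(!timer.expired()); // only ~30ms since last heartbeat
    std::thread::sleep(Duration::from_millis(30));
    assert!(timer.expired()); // ~60ms since last heartbeat, past the 50ms timeout
}
```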
This reverts commit 012545b.
A bit scared of local timeouts for users that have something like:
```python
@activity.defn
def my_activity() -> None:
    do_sync_thing_that_never_fails_because_I_never_heartbeat()
    do_cleanup_guaranteed_to_run_even_on_server_timeout()
```

The second call used to run for them, but maybe not anymore. But I guess they could never gracefully shut down workers then either. I guess we just need to be real clear in the release notes.
```rust
} else {
    tokio::time::sleep(sleep_time).await;
}
let _ = cancel_tx.send(PendingActivityCancel {
```
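The overall shape here, sleep until the computed deadline and then synthesize a cancel, can be sketched with std threads and channels (the real code uses tokio and sends a PendingActivityCancel; the types and names below are hypothetical stand-ins):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the cancel message core sends itself.
#[derive(Debug, PartialEq)]
struct PendingCancel {
    task_token: Vec<u8>,
    reason: &'static str,
}

fn spawn_local_timeout_watchdog(
    sleep_time: Duration,
    task_token: Vec<u8>,
    cancel_tx: mpsc::Sender<PendingCancel>,
) {
    thread::spawn(move || {
        thread::sleep(sleep_time);
        // Ignore send errors: the activity may have completed and dropped the receiver.
        let _ = cancel_tx.send(PendingCancel { task_token, reason: "local timeout" });
    });
}

fn main() {
    let (tx, rx) = mpsc::channel();
    spawn_local_timeout_watchdog(Duration::from_millis(10), vec![1, 2, 3], tx);
    let cancel = rx.recv().expect("watchdog should fire");
    assert_eq!(cancel.task_token, vec![1, 2, 3]);
    assert_eq!(cancel.reason, "local timeout");
}
```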
Do we want to increment a metric here or log or somehow let lang know via the proto that this was due to local timeout? I'm just starting to get scared that a user's clock is gonna skew like 10s off of server's and things are gonna just start cancelling everywhere.
Or I guess we can look at all the activity failures and if they are failures for untriggered cancel we can assume it was either worker shutdown or this timeout? That's probably good enough.
I suppose skew is only an issue with the scheduled timeout, since that's the only one that uses an absolute time. We could remove it for just that reason, since you're more likely to fail an attempt timeout or heartbeat anyway.
I don't want to add another metric until asked for... we've already got so many. I'll add a debug log line though.
👍 Very happy to see schedule timeout removed. I think it's much safer to do single-attempt duration than clock-based. May be worth checking whether server always sets start-to-close, I forget, but I do feel safer now w/ what is here.
What was changed
Core will now send cancellations if an activity would already have timed out according to the server, but we haven't learned of that (typically because it isn't heartbeating). There is a configurable buffer, defaulting to 5s, to avoid races with the server (we would prefer to learn of the timeout from the server).
Why?
It's nice to have your activities get cancelled when they would have been, even if they don't heartbeat.
Checklist
Closes [Feature Request] Worker enforcement of activity timeouts (features#170)
How was this tested:
Unit & Integ
Any docs updates needed?