Fix schedule workflow to CAN after signals #5799

Merged Apr 25, 2024 (2 commits)

Conversation

dnr (Member) commented Apr 25, 2024

What changed and why?

If a schedule was paused and received a large number of signals (trigger, backfill, etc.), it wouldn't be able to continue-as-new and history size could grow past the suggested limit.

For various reasons, the main loop receives and processes signals across iterations, with the check for continue-as-new (CAN) in between, so a wakeup caused by a signal (instead of a timer) would prevent CAN. Refactoring is hard because of the determinism requirement, so the simplest fix is to add a second check.
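
To make the failure mode concrete, here is a toy simulation in plain Python. It is not the scheduler workflow's code and uses no Temporal API. The function name, the `history` counter, and the `pending_signal` flag are all hypothetical stand-ins, and the model assumes that a signal received during sleep is carried into the next iteration and blocks the first CAN check.

```python
def iterations_until_can(wakeups, limit, second_check):
    """Return the iteration index at which the loop would continue-as-new,
    or None if it never gets the chance. Purely illustrative model."""
    history = 0             # stand-in for workflow history size
    pending_signal = False  # signal received during sleep, handled next iteration
    for i, wakeup in enumerate(wakeups):
        # First check: only passes when no signal was carried over.
        if history >= limit and not pending_signal:
            return i
        if pending_signal:
            history += 1            # handling the signal adds to history
            pending_signal = False
            # Second check (the fix, in this model): retry once the signal is handled.
            if second_check and history >= limit:
                return i
        history += 1                # each sleep/wakeup adds to history
        if wakeup == "signal":
            pending_signal = True
    return None


timers = ["timer"] * 10
signals = ["signal"] * 10
print(iterations_until_can(timers, 3, second_check=False))   # 3
print(iterations_until_can(signals, 3, second_check=False))  # None
print(iterations_until_can(signals, 3, second_check=True))   # 2
```

With only timer wakeups, the single check is enough. With a steady stream of signals, the first check never passes and history keeps growing, while the second check lets the loop exit as soon as the limit is reached.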

How did you test it?

New unit test

dnr requested a review from a team as a code owner on April 25, 2024 22:03
dnr merged commit 90a809e into temporalio:main on Apr 25, 2024
46 checks passed
dnr deleted the sched83 branch on April 25, 2024 23:25
yycptt pushed a commit that referenced this pull request Apr 25, 2024
yycptt pushed a commit that referenced this pull request Apr 27, 2024
dnr added a commit to dnr/temporal that referenced this pull request May 1, 2024
dnr added a commit that referenced this pull request May 31, 2024
## What changed?
In the schedule workflow, we should always reprocess from the last
action in order to handle getting woken up by a signal in between the
nominal time and actual time. This is basically what #5381 should have
been in the first place, and applies that logic to all iterations.

Note that this adds the logic but doesn't activate it yet.

## Why?
Fixes bug: signals (including refresh) in between nominal and actual
time could lead to dropped actions. This only happens if the cache runs
out or we do a CaN at just the right time, so it's not that easy to
reproduce.

This has also blocked activating #5799 since that makes this bug more
likely.

## How did you test it?
Added new unit test and replay test.
Reproduced locally by disabling cache and sending frequent signals to
try to disrupt a schedule. Verified that new version did not drop
actions.

## Potential risks
This changes workflow logic, but it is pretty easy to see that the old
control flow is unaffected.

There's one more potential situation which isn't always handled
correctly: unpause in between nominal and actual times will usually run
the jittered action (this is probably the less surprising behavior), but
rarely, if a CaN happens at the same time, it can get skipped because
the cache will be regenerated.

---------

Co-authored-by: Rodrigo Zhou <rodrigozhou@users.noreply.github.com>
pdoerner pushed a commit that referenced this pull request May 31, 2024
dnr added a commit to dnr/temporal that referenced this pull request Jun 6, 2024
dnr added a commit that referenced this pull request Jun 7, 2024
## What changed?
Bump version to turn on recent workflow changes.

## Why?
Fix bugs.

## How did you test it?
existing tests

## Potential risks
nondeterminism errors