Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recalculate schedule times from previous action on update #5381

Merged
merged 4 commits into from
Mar 2, 2024

Conversation

dnr
Copy link
Member

@dnr dnr commented Feb 1, 2024

What changed?

After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time.

Why?

Fixes #4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action.

How did you test it?

new unit test

@dnr dnr requested a review from a team as a code owner February 1, 2024 08:27
@dnr
Copy link
Member Author

dnr commented Feb 1, 2024

Actually this replay test is not effective here since it doesn't exercise the bug. I'll remove it and try to create one that does, but we can review and merge in the meantime.

@dnr dnr merged commit b51b899 into temporalio:main Mar 2, 2024
57 of 58 checks passed
@dnr dnr deleted the sched79 branch March 2, 2024 01:13
stephanos pushed a commit to stephanos/temporal that referenced this pull request Mar 21, 2024
…#5381)

## What changed?
After an update, the schedule workflow goes back to the previous
action/update and recomputes next times from there, instead of from the
current time.

## Why?
Fixes temporalio#4866. Fixes the situation where an update happens in between the
nominal time and the jittered time of an action.

## How did you test it?
new unit test
dnr added a commit to dnr/temporal that referenced this pull request Apr 9, 2024
dnr added a commit that referenced this pull request Apr 10, 2024
…5698)

## What changed?
Activate schedule workflow logic changes. Some of this code is not in
1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to
the 1.23 series.

## Why?
Fix bugs, support new features, make more efficient.

## How did you test it?
existing tests (on those PRs)

## Potential risks
schedule workflow determinism errors
dnr added a commit that referenced this pull request Apr 10, 2024
…5698)

## What changed?
Activate schedule workflow logic changes. Some of this code is not in
1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to
the 1.23 series.

## Why?
Fix bugs, support new features, make more efficient.

## How did you test it?
existing tests (on those PRs)

## Potential risks
schedule workflow determinism errors
yycptt pushed a commit that referenced this pull request Apr 27, 2024
## What changed?
After an update, the schedule workflow goes back to the previous
action/update and recomputes next times from there, instead of from the
current time.

## Why?
Fixes #4866. Fixes the situation where an update happens in between the
nominal time and the jittered time of an action.

## How did you test it?
new unit test
dnr added a commit that referenced this pull request May 31, 2024
## What changed?
In the schedule workflow, we should always reprocess from the last
action in order to handle getting woken up by a signal in between the
nominal time and actual time. This is basically what #5381 should have
been in the first place, and applies that logic to all iterations.

Note that this adds the logic but doesn't activate it yet.

## Why?
Fixes bug: signals (including refresh) in between nominal and actual
time could lead to dropped actions. This only happens if the cache runs
out or we do a CaN at just the right time, so it's not that easy to
reproduce.

This has also blocked activating #5799 since that makes this bug more
likely.

## How did you test it?
Added new unit test and replay test.
Reproduced locally by disabling cache and sending frequent signals to
try to disrupt a schedule. Verified that new version did not drop
actions.

## Potential risks
This changes workflow logic, but is pretty easy to see that the old
control flow is unaffected.

There's one more potential situation which isn't always handled
correctly: unpause in between nominal and actual times will usually run
the jittered action (this is probably the less surprising behavior), but
rarely, if a CaN happens at the same time, it can get skipped because
the cache will be regenerated.

---------

Co-authored-by: Rodrigo Zhou <rodrigozhou@users.noreply.github.com>
pdoerner pushed a commit that referenced this pull request May 31, 2024
## What changed?
In the schedule workflow, we should always reprocess from the last
action in order to handle getting woken up by a signal in between the
nominal time and actual time. This is basically what #5381 should have
been in the first place, and applies that logic to all iterations.

Note that this adds the logic but doesn't activate it yet.

## Why?
Fixes bug: signals (including refresh) in between nominal and actual
time could lead to dropped actions. This only happens if the cache runs
out or we do a CaN at just the right time, so it's not that easy to
reproduce.

This has also blocked activating #5799 since that makes this bug more
likely.

## How did you test it?
Added new unit test and replay test.
Reproduced locally by disabling cache and sending frequent signals to
try to disrupt a schedule. Verified that new version did not drop
actions.

## Potential risks
This changes workflow logic, but is pretty easy to see that the old
control flow is unaffected.

There's one more potential situation which isn't always handled
correctly: unpause in between nominal and actual times will usually run
the jittered action (this is probably the less surprising behavior), but
rarely, if a CaN happens at the same time, it can get skipped because
the cache will be regenerated.

---------

Co-authored-by: Rodrigo Zhou <rodrigozhou@users.noreply.github.com>
lina-temporal added a commit that referenced this pull request Jun 14, 2024
… update time, respect RemainingActions counter (#6122)

## What changed/why?
In #6028/#5381, our scheduler's run loop accounted for update requests
that occur between nominal and actual/jittered times. The run loop will
correctly skip over action times who are scheduled prior to the
schedule's update time (such as when a schedule is recalculated to a new
spec). This PR updates `getFutureActionTimes` (used as part of the
scheduler's describe and list operations) to account for action times
preceding the update time, as well as to properly reflect the number of
remaining actions in a schedule.

## How did you test it?
- Updated unit tests for both limited actions (`TestLimitedActions`) as
well as filtered update times (`TestUpdateNotRetroactive`).
`TestUpdateNotRetroactive` is the test that goes red when the [parallel
condition in the
scheduler](https://github.com/temporalio/temporal/blob/44a6fe777e1dc950dd278cb1b7f6ffc8b3ba4328/service/worker/scheduler/workflow.go#L655-L660)
is commented out, so it seemed the appropriate place to also test
`getFutureActionTimes` lines up with the fix in the run loop.
- Verified both updated tests only pass when new version
(`AccurateFutureActionTimes`) is set; verified no other tests break
(`make test-unit`)

## Potential risks
- We break a use case that relied on speculative future action times/a
constant number of action times returned

## Documentation
N/A

## Is hotfix candidate?
No
stephanos pushed a commit that referenced this pull request Jun 14, 2024
… update time, respect RemainingActions counter (#6122)

## What changed/why?
In #6028/#5381, our scheduler's run loop accounted for update requests
that occur between nominal and actual/jittered times. The run loop will
correctly skip over action times who are scheduled prior to the
schedule's update time (such as when a schedule is recalculated to a new
spec). This PR updates `getFutureActionTimes` (used as part of the
scheduler's describe and list operations) to account for action times
preceding the update time, as well as to properly reflect the number of
remaining actions in a schedule.

## How did you test it?
- Updated unit tests for both limited actions (`TestLimitedActions`) as
well as filtered update times (`TestUpdateNotRetroactive`).
`TestUpdateNotRetroactive` is the test that goes red when the [parallel
condition in the
scheduler](https://github.com/temporalio/temporal/blob/44a6fe777e1dc950dd278cb1b7f6ffc8b3ba4328/service/worker/scheduler/workflow.go#L655-L660)
is commented out, so it seemed the appropriate place to also test
`getFutureActionTimes` lines up with the fix in the run loop.
- Verified both updated tests only pass when new version
(`AccurateFutureActionTimes`) is set; verified no other tests break
(`make test-unit`)

## Potential risks
- We break a use case that relied on speculative future action times/a
constant number of action times returned

## Documentation
N/A

## Is hotfix candidate?
No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schedule update should start from previous action instead of current time
3 participants