
Do schedule backfills incrementally #5344

Merged: 4 commits merged into temporalio:main on Jan 29, 2024
Conversation

@dnr (Member) commented Jan 24, 2024

What changed?

Previously, schedule backfills were processed synchronously: the workflow would run through the whole given time range and either start workflows or add them to the buffer. This is changed to process them incrementally and keep track of the time range processed so far. There can be up to 1000 buffered backfills (reusing the same limit as the action buffer).
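
In rough outline, the new shape looks something like the sketch below; the type, field, and function names here are illustrative assumptions rather than the actual scheduler workflow code (the real loop condition is quoted in the review thread further down).

package scheduler

import "time"

// BackfillProgress records how far into a requested backfill range the
// workflow has advanced, so work can resume on the next iteration.
type BackfillProgress struct {
	NextTime time.Time // start of the not-yet-processed part of the range
	EndTime  time.Time // exclusive end of the requested range
}

type state struct {
	OngoingBackfills []BackfillProgress
	BufferedStarts   []time.Time
}

const maxBufferSize = 1000 // shared limit with the action buffer

// processBackfills does a bounded amount of backfill work per call instead of
// walking the whole time range synchronously.
func (s *state) processBackfills(nextAction func(time.Time) time.Time, limit int) {
	for len(s.OngoingBackfills) > 0 && limit > 0 &&
		len(s.BufferedStarts) < maxBufferSize/2 { // use only half the buffer for backfills
		bf := &s.OngoingBackfills[0]
		t := nextAction(bf.NextTime) // next scheduled time after the processed range
		if !t.Before(bf.EndTime) {
			s.OngoingBackfills = s.OngoingBackfills[1:] // this backfill is done
			continue
		}
		s.BufferedStarts = append(s.BufferedStarts, t) // buffer the start
		bf.NextTime = t.Add(time.Nanosecond)           // remember progress so far
		limit--
	}
}

Each call does a bounded amount of work, so a single workflow task never has to walk an arbitrarily long backfill range.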

Why?

  • This allows backfills to be arbitrarily long, instead of limited to 1000 actions.
  • In some situations, evaluating a long backfill could take more than 1s of elapsed time, causing the SDK to think the workflow was deadlocked and fail the workflow task, effectively leaving the scheduler workflow stuck.

How did you test it?

new unit tests, replay test

@dnr dnr requested a review from a team as a code owner January 24, 2024 05:53
AllowZeroSleep: true,
ReuseTimer: true,
NextTimeCacheV2Size: 14, // see note below
-	Version: DontTrackOverlapping, // TODO: upgrade to InclusiveBackfillStartTime
+	Version: DontTrackOverlapping, // TODO: upgrade to IncrementalBackfill
Contributor:

When will this no longer be a TODO?

Member Author:

We have to make changes to workflow logic in two stages: write the logic first but leave it disabled and release that, then enable it and release that. Otherwise we couldn't roll back an upgrade without possibly breaking schedules that were updated during that time, which would add too much risk to releases.
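
A minimal sketch of that two-stage gate, assuming a version constant along the lines of the DontTrackOverlapping / IncrementalBackfill values visible in the diff above; the struct scaffolding and method names below are hypothetical, not the actual scheduler code.

package scheduler

// SchedulerVersion gates workflow logic changes; the constant names mirror the
// diff above, but everything else here is illustrative only.
type SchedulerVersion int

const (
	DontTrackOverlapping SchedulerVersion = iota
	IncrementalBackfill
)

type tweakables struct{ Version SchedulerVersion }

type scheduler struct{ tweakables tweakables }

// Release 1 ships both code paths with the default Version still below
// IncrementalBackfill, so the new path never executes and rollback is safe.
// Release 2 raises the default Version, which finally enables the new path.
func (s *scheduler) processBackfills() {
	if s.tweakables.Version >= IncrementalBackfill {
		s.processIncrementally() // new behavior
	} else {
		s.processSynchronously() // old behavior, kept for rollback
	}
}

func (s *scheduler) processIncrementally() { /* hypothetical new logic */ }
func (s *scheduler) processSynchronously() { /* hypothetical old logic */ }

The first release only adds the gated code; the second release flips the default, so rolling back either release leaves schedules on a code path the older binary still understands.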

Comment on lines +759 to +762
for len(s.State.OngoingBackfills) > 0 &&
limit > 0 &&
// use only half the buffer for backfills
len(s.State.BufferedStarts) < s.tweakables.MaxBufferSize/2 {
Contributor:

If we have a maximum size, why not use a circular buffer for s.State.OngoingBackfills so we never need to reallocate it? Dropping the front reduces the capacity, meaning we'll need to reallocate regularly.

Member Author:

In general I don't expect people to do more than one backfill at once. That seems like a premature optimization. If there's some existing data structure that would let the code remain as simple but have better allocation behavior, we could switch to that.

Contributor:

When do backfills happen, only when a schedule is created? I don't consider good data structure selection to be premature optimization, but if this is a one-off thing that happens then I care far less.

Member Author:

It happens whenever a user requests it by API call, but it is generally a one-off or few-off thing, not regular.

Anyway, I did look into replacing it with a linked list. The problem is that the slices (BufferedStarts and OngoingBackfills) are in a proto-generated struct (that's used as the current state, to make continue-as-new easy). So we'd need to shuffle them back and forth... possible, but it's getting more complicated.

Overall I think the overhead of some extra allocation is going to be in the noise considering workflow tasks, SDK overhead, etc. If it was an every-iteration thing, then it'd be worth paying more attention.

Contributor:

I agree wholly. Let’s ship it
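
As a small aside on the capacity point raised above, this standalone snippet (not scheduler code) shows why repeatedly dropping the front of a slice forces regular reallocation, which is the behavior a ring buffer would avoid:

package main

import "fmt"

func main() {
	buf := make([]int, 0, 8)
	for i := 0; i < 20; i++ {
		buf = append(buf, i) // reallocates once the remaining capacity is exhausted
		buf = buf[1:]        // dropping the front shrinks this slice's capacity by one
		fmt.Println(len(buf), cap(buf))
	}
}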

Comment on lines +1393 to +1395
// This has been run for up to 3000, but it takes a very long time. Run only 100 normally.
const backfillRuns = 100
const backfills = 4
Contributor:

Could we inject or configure the sleep duration when testing so that we don't have to care?

Member Author:

It's already running in the workflow test framework, which does time skipping. The bookkeeping in the test framework is just really, really slow (and there are some quadratic bits in there: elapsed time scales with the square of backfillRuns). Probably at least some of the quadratic behavior is from the mocks.

@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
Contributor:

Does the test framework guarantee that this test will never run concurrently with other tests in the file?

Member Author:

Yes, because it's not marked with t.Parallel(). We could probably arrange things to work in parallel with some refactoring, but it doesn't seem urgent.

Contributor:

I agree, it is not urgent to make the tests run in parallel. I just wanted to know what the framework will do since we modify global state in the test. Thank you for the explanation.

@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
Contributor:

What is the expected behavior tested by this test?

Member Author:

When backfillRuns exceeds the max buffer size, the new logic will successfully run all the requested workflows, while the old logic will fail (due to hitting the buffer size).

In the other test (line 1390) I reduce the buffer size to make the test more meaningful; I should do the same here.

@dnr dnr merged commit 40387b9 into temporalio:main Jan 29, 2024
56 of 58 checks passed
@dnr dnr deleted the sched78 branch January 29, 2024 21:28
dnr added a commit to dnr/temporal that referenced this pull request Apr 9, 2024
dnr added a commit that referenced this pull request Apr 10, 2024
…5698)

## What changed?
Activate schedule workflow logic changes. Some of this code is not in
1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to
the 1.23 series.

## Why?
Fix bugs, support new features, make more efficient.

## How did you test it?
existing tests (on those PRs)

## Potential risks
schedule workflow determinism errors
yycptt pushed a commit that referenced this pull request Apr 27, 2024
Previously schedule backfills were processed synchronously: the workflow
would run through the whole given time range and either start workflows
or add them to the buffer. This is changed to process them incrementally
and keep track of the time range processed so far. There can be up to
1000 buffered backfills (reusing same limit as action buffer).

- This allows backfills to be arbitrarily long, instead of limited to
1000 actions.
- In some situations, evaluating a long backfill could take more than 1s
of elapsed time, causing the SDK to think that the workflow was
deadlocked and failing the workflow task, effectively causing the
scheduler workflow to get stuck.

new unit tests, replay test