
Do schedule backfills incrementally #5344

Merged: 4 commits merged into temporalio:main on Jan 29, 2024
Conversation

@dnr (Member) commented Jan 24, 2024

What changed?

Previously, schedule backfills were processed synchronously: the workflow would run through the whole given time range and either start workflows or add them to the buffer. This is changed to process them incrementally and keep track of the time range processed so far. There can be up to 1000 buffered backfills (reusing the same limit as the action buffer).
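
In rough outline, the new shape looks something like the sketch below; the type, field, and function names here are illustrative assumptions rather than the actual scheduler workflow code (the real loop condition is quoted in the review thread further down).

package scheduler

import "time"

// BackfillProgress records how far into a requested backfill range the
// workflow has advanced, so work can resume on the next iteration.
type BackfillProgress struct {
	NextTime time.Time // start of the not-yet-processed part of the range
	EndTime  time.Time // exclusive end of the requested range
}

type state struct {
	OngoingBackfills []BackfillProgress
	BufferedStarts   []time.Time
}

const maxBufferSize = 1000 // shared limit with the action buffer

// processBackfills does a bounded amount of backfill work per call instead of
// walking the whole time range synchronously.
func (s *state) processBackfills(nextAction func(time.Time) time.Time, limit int) {
	for len(s.OngoingBackfills) > 0 && limit > 0 &&
		len(s.BufferedStarts) < maxBufferSize/2 { // use only half the buffer for backfills
		bf := &s.OngoingBackfills[0]
		t := nextAction(bf.NextTime) // next scheduled time after the processed range
		if !t.Before(bf.EndTime) {
			s.OngoingBackfills = s.OngoingBackfills[1:] // this backfill is done
			continue
		}
		s.BufferedStarts = append(s.BufferedStarts, t) // buffer the start
		bf.NextTime = t.Add(time.Nanosecond)           // remember progress so far
		limit--
	}
}

Each call does a bounded amount of work, so a single workflow task never has to walk an arbitrarily long backfill range.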

Why?

  • This allows backfills to be arbitrarily long, instead of limited to 1000 actions.
  • In some situations, evaluating a long backfill could take more than 1s of elapsed time, causing the SDK to think the workflow was deadlocked and fail the workflow task, effectively leaving the scheduler workflow stuck.

How did you test it?

new unit tests, replay test

@dnr dnr requested a review from a team as a code owner January 24, 2024 05:53
AllowZeroSleep: true,
ReuseTimer: true,
NextTimeCacheV2Size: 14, // see note below
-	Version: DontTrackOverlapping, // TODO: upgrade to InclusiveBackfillStartTime
+	Version: DontTrackOverlapping, // TODO: upgrade to IncrementalBackfill
Contributor:

When will this no longer be a TODO?

Member Author:

We have to make changes to workflow logic in two stages: write the logic first but leave it disabled and release that, then enable it and release that. Otherwise we couldn't roll back an upgrade without possibly breaking schedules that were updated during that time, which would add too much risk to releases.
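
A minimal sketch of that two-stage gate, assuming a version constant along the lines of the DontTrackOverlapping / IncrementalBackfill values visible in the diff above; the struct scaffolding and method names below are hypothetical, not the actual scheduler code.

package scheduler

// SchedulerVersion gates workflow logic changes; the constant names mirror the
// diff above, but everything else here is illustrative only.
type SchedulerVersion int

const (
	DontTrackOverlapping SchedulerVersion = iota
	IncrementalBackfill
)

type tweakables struct{ Version SchedulerVersion }

type scheduler struct{ tweakables tweakables }

// Release 1 ships both code paths with the default Version still below
// IncrementalBackfill, so the new path never executes and rollback is safe.
// Release 2 raises the default Version, which finally enables the new path.
func (s *scheduler) processBackfills() {
	if s.tweakables.Version >= IncrementalBackfill {
		s.processIncrementally() // new behavior
	} else {
		s.processSynchronously() // old behavior, kept for rollback
	}
}

func (s *scheduler) processIncrementally() { /* hypothetical new logic */ }
func (s *scheduler) processSynchronously() { /* hypothetical old logic */ }

The first release only adds the gated code; the second release flips the default, so rolling back either release leaves schedules on a code path the older binary still understands.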

Comment on lines +759 to +762
for len(s.State.OngoingBackfills) > 0 &&
limit > 0 &&
// use only half the buffer for backfills
len(s.State.BufferedStarts) < s.tweakables.MaxBufferSize/2 {
Contributor:

If we have a maximum size, why not use a circular buffer for s.State.OngoingBackfills so we never need to reallocate it? Dropping the front reduces the capacity, meaning we'll need to reallocate regularly.

Member Author:

In general I don't expect people to do more than one backfill at once. That seems like a premature optimization. If there's some existing data structure that would let the code remain as simple but have better allocation behavior, we could switch to that.

Contributor:

When do backfills happen, only when a schedule is created? I don't consider good data structure selection to be premature optimization, but if this is a one-off thing that happens then I care far less.

Member Author:

It happens whenever a user requests it by API call, but it is generally a one-off or few-off thing, not regular.

Anyway, I did look into replacing it with a linked list. The problem is that the slices (BufferedStarts and OngoingBackfills) are in a proto-generated struct (that's used as the current state, to make continue-as-new easy). So we'd need to shuffle them back and forth... possible, but it's getting more complicated.

Overall I think the overhead of some extra allocation is going to be in the noise considering workflow tasks, SDK overhead, etc. If it was an every-iteration thing, then it'd be worth paying more attention.

Contributor:

I agree wholly. Let’s ship it
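
As a small aside on the capacity point raised above, this standalone snippet (not scheduler code) shows why repeatedly dropping the front of a slice forces regular reallocation, which is the behavior a ring buffer would avoid:

package main

import "fmt"

func main() {
	buf := make([]int, 0, 8)
	for i := 0; i < 20; i++ {
		buf = append(buf, i) // reallocates once the remaining capacity is exhausted
		buf = buf[1:]        // dropping the front shrinks this slice's capacity by one
		fmt.Println(len(buf), cap(buf))
	}
}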

Comment on lines +1393 to +1395
// This has been run for up to 3000, but it takes a very long time. Run only 100 normally.
const backfillRuns = 100
const backfills = 4
Contributor:

Could we inject or configure the sleep duration when testing so that we don't have to care?

Member Author:

It's already running in the workflow test framework, which does time skipping. The bookkeeping in the test framework is just really, really slow (and there are some quadratic bits in there: elapsed time scales with the square of backfillRuns). Probably at least some of the quadratic behavior is from the mocks.

@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
Contributor:

Does the test framework guarantee that this test will never run concurrently with other tests in the file?

Member Author:

Yes, because it's not marked with t.Parallel(). We could probably arrange things to work in parallel with some refactoring, but it doesn't seem urgent.

Contributor:

I agree, it is not urgent to make the tests run in parallel. I just wanted to know what the framework will do since we modify global state in the test. Thank you for the explanation.

@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
Contributor:

What is the expected behavior tested by this test?

Member Author:

When backfillRuns exceeds the max buffer size, the new logic will successfully run all the requested workflows, while the old logic will fail (due to hitting the buffer size).

In the other test (line 1390) I reduce the buffer size to make the test more meaningful; I should do the same here.

@dnr dnr merged commit 40387b9 into temporalio:main Jan 29, 2024
56 of 58 checks passed
@dnr dnr deleted the sched78 branch January 29, 2024 21:28
dnr added a commit to dnr/temporal that referenced this pull request Apr 9, 2024
dnr added a commit that referenced this pull request Apr 10, 2024
…5698)

## What changed?
Activate schedule workflow logic changes. Some of this code is not in
1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to
the 1.23 series.

## Why?
Fix bugs, support new features, make more efficient.

## How did you test it?
existing tests (on those PRs)

## Potential risks
schedule workflow determinism errors
yycptt pushed a commit that referenced this pull request Apr 27, 2024
Previously schedule backfills were processed synchronously: the workflow
would run through the whole given time range and either start workflows
or add them to the buffer. This is changed to process them incrementally
and keep track of the time range processed so far. There can be up to
1000 buffered backfills (reusing same limit as action buffer).

- This allows backfills to be arbitrarily long, instead of limited to
1000 actions.
- In some situations, evaluating a long backfill could take more than 1s
of elapsed time, causing the SDK to think that the workflow was
deadlocked and failing the workflow task, effectively causing the
scheduler workflow to get stuck.

new unit tests, replay test