Do schedule backfills incrementally #5344
Conversation
  AllowZeroSleep:      true,
  ReuseTimer:          true,
  NextTimeCacheV2Size: 14, // see note below
- Version: DontTrackOverlapping, // TODO: upgrade to InclusiveBackfillStartTime
+ Version: DontTrackOverlapping, // TODO: upgrade to IncrementalBackfill
When will this no longer be a TODO?
We have to make changes to workflow logic in two stages: write the logic first but leave it disabled, release that, then enable it, then release that. Otherwise we can't roll back an upgrade without possibly breaking schedules that were updated during that time, which would add too much risk to releases.
for len(s.State.OngoingBackfills) > 0 &&
	limit > 0 &&
	// use only half the buffer for backfills
	len(s.State.BufferedStarts) < s.tweakables.MaxBufferSize/2 {
If we have a maximum size, why not use a circular buffer for s.State.OngoingBackfills so we never need to reallocate it? Dropping the front reduces the capacity, meaning we'll need to reallocate regularly.
In general I don't expect people to do more than one backfill at once. That seems like a premature optimization. If there's some existing data structure that would let the code remain as simple but have better allocation behavior, we could switch to that.
When do backfills happen? Only when a schedule is created? I don't consider good data-structure selection to be premature optimization, but if this is a one-off thing then I care far less.
It happens whenever a user requests it by API call, but it is generally a one-off or few-off thing, not regular.
Anyway, I did look into replacing it with a linked list. The problem is that the slices (BufferedStarts and OngoingBackfills) are in a proto-generated struct (that's used as the current state, to make continue-as-new easy). So we'd need to shuffle them back and forth... possible, but it's getting more complicated.
Overall I think the overhead of some extra allocation is going to be in the noise considering workflow tasks, SDK overhead, etc. If it was an every-iteration thing, then it'd be worth paying more attention.
I agree wholly. Let’s ship it
// This has been run for up to 3000, but it takes a very long time. Run only 100 normally.
const backfillRuns = 100
const backfills = 4
Could we inject or configure the sleep duration when testing so that we don't have to care?
It's already running in the workflow test framework, which does time skipping. The bookkeeping in the test framework is just really, really slow (and there are some quadratic bits in there: elapsed time scales with the square of backfillRuns). Probably at least some of the quadratic behavior comes from the mocks.
@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
	)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
does the test framework guarantee that this test will never run concurrently with other tests in the file?
Yes, because it's not marked t.Parallel(). We could probably arrange things to work in parallel with some refactoring, but it doesn't seem urgent.
I agree, it is not urgent to make the tests run in parallel. I just wanted to know what the framework will do since we modify global state in the test. Thank you for the explanation.
@@ -1311,6 +1320,140 @@ func (s *workflowSuite) TestBackfillInclusiveStartEnd() {
	)
}

func (s *workflowSuite) TestHugeBackfillAllowAll() {
What is the expected behavior tested by this test?
When backfillRuns exceeds the max buffer size, the new logic will successfully run all the requested workflows, while the old logic would fail (due to hitting the buffer size limit).
In the other test (line 1390) I reduce the buffer size to make the test more meaningful; I should do the same here.
…5698)
What changed?
Activate schedule workflow logic changes. Some of this code is not in 1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to the 1.23 series.
Why?
Fix bugs, support new features, make more efficient.
How did you test it?
existing tests (on those PRs)
Potential risks
schedule workflow determinism errors
What changed?
Previously schedule backfills were processed synchronously: the workflow would run through the whole given time range and either start workflows or add them to the buffer. This is changed to process them incrementally, keeping track of the time range processed so far. There can be up to 1000 buffered backfills (reusing the same limit as the action buffer).
Why?
- This allows backfills to be arbitrarily long, instead of being limited to 1000 actions.
- In some situations, evaluating a long backfill could take more than 1s of elapsed time, causing the SDK to think the workflow was deadlocked and fail the workflow task, effectively causing the scheduler workflow to get stuck.
How did you test it?
new unit tests, replay test