-
Notifications
You must be signed in to change notification settings - Fork 1.2k
✨ Implement warm replica support for controllers #3192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @godwinpang. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
5d1be84
to
5db61c7
Compare
/retest |
5db61c7
to
be1b1c2
Compare
/retest |
/retest |
/retest |
/retest |
e05677a
to
1d07efc
Compare
/retest |
1 similar comment
/retest |
1c32d37
to
7cc29dc
Compare
@godwinpang Pleae keep the conversations open (going forward). It's really hard to find the just resolved/fixed conversations to check that everything was addressed |
@godwinpang Heyho, discussed this with @alvaroaleman. I think we're making really good progress on this PR and getting very close to merging it. We would just like to take a bit more time to ensure we don't break anything (I"ll also try to test it with Cluster API). So we'll go ahead and cut the CR v0.21 release now already (a few folks are waiting for that :)). Once this PR merges we'll cherry-pick and cut a v0.21.1 patch release. (I also modified the PR title as this PR is not breaking) |
A heads up that I am OOO this week so there will not be much progress on this PR. We're going to enable this warmup feature internally on June 2nd when I get back for some soak time. |
We've been running a forked version of c-r with I'll take a stab at
Anything else we are looking to do to gain more confidence around this feature and get it merged? 😄 |
Sounds fine to me |
@godwinpang any plans on wrapping this up? |
Sorry, I missed this Github notification. I've updated the PR description and also duplicated the tests between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
/hold
in case Stefan has anything to add
LGTM label has been added. Git tree hash: 781742318cfc9c5c72173ae51ebe9fb41673044d
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alvaroaleman, godwinpang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
case err := <-sourceStartErrChan: | ||
return err | ||
case <-sourceStartCtx.Done(): | ||
defer func() { <-hasAccessedQueueChan }() // Ensure that watch.Start has been called to avoid prematurely releasing lock before accessing c.Queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this (i.e. hasAccessedQueueChan) since we are starting the queue inside the sync.Once? (xref previous discussion: #3192 (comment))
We write c.Queue only once and directly above, I'm not sure if there is still a race condition around this field
If we actually need it, would it be a simpler way to achieve the same by passing in c.Queue as a parameter into the func in l.357?
nit for the godoc: this does not ensure that watch.Start was called only that q := c.Queue
is done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree we don't need this synchronization anymore with the move into sync.Once, looks like this was leftover as a change before the suggestions were taken into account from the linked discussion. I'll take out the code that protects against races on c.Queue
access
}() | ||
|
||
<-m.Elected() | ||
Expect(<-runnableExecutionOrderChan).To(Equal(warmupRunnableName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry we might have discussed this before, how does this proof that warmupRunnable was run before leader election was won?
(This checks the order but not before/after leader election was won, correct?)
@@ -48,30 +53,46 @@ var _ = Describe("controller", func() { | |||
Describe("controller", func() { | |||
// TODO(directxman12): write a whole suite of controller-client interaction tests | |||
|
|||
It("should reconcile", func() { | |||
// Since all tests are run in parallel and share the same testenv, we namespace the objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are tests actually run in parallel? I thought they are run sequentially only tests of different packages are run in parallel (but each package has its own testenv)
|
||
err := ctrl.startEventSourcesAndQueueLocked(ctx) | ||
Expect(err).NotTo(HaveOccurred()) | ||
Expect(ctrl.startWatches).To(BeNil(), "startWatches should be reset to nil after returning") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're here, maybe let's also check startedEventSourcesAndQueue is true
}() | ||
|
||
<-m.Elected() | ||
Expect(<-runnableExecutionOrderChan).To(Equal(warmupRunnableName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here, how does this verify warmupRunnable was run before leader election is done?
} | ||
|
||
By("calling Warmup and Start concurrently") | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also add a channel for start and block on it at the end of the test to ensure it was actually called?
ctrl.CacheSyncTimeout = time.Second | ||
|
||
var watchStartedCount atomic.Int32 | ||
ctrl.startWatches = []source.TypedSource[reconcile.Request]{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT about adding more watches for some easy additional coverage? (e.g. add 10 of these in a for loop)
@godwinpang Thank you very much! We're almost there (only one idea for simplification in the prod code, otherwise just a few comments on tests) |
This change implements the proposal for warm replicas as proposed in #3121.
It adds an
EnableWarmup
option for controllers to optionally start as warmed replicas, which means that sources will be started before leader election has been won by the instance.It also adds a new internal runnable interface called
warmupRunnable
with a singleWarmup
method that will be called before leader election. There is no guarantee that theWarmup
method returns before leader election, just that it is called.The controller's implementation of the
Warmup
method for thewarmupRunnable
starts the sources with a threadsafe methodstartEventSourcesAndQueueLocked
that also adds events to the queue. In the case of a non-leader elected controller, it is intended for theWarmup
method to race with theStart
method to start the sources. The methods are synchronized by thedidStartEventSourcesOnce
sync.Once used instartEventSourcesAndQueueLocked
.For the most part,
Warmup
behaves exactly the same asStart
with the following differencesWarmup
can be called, vs.Start
which can only be called once.Warmup
doesn't initialize metrics for the controller, nor does it start the worker goroutines.Shutdown
Warmup runnables are shutdown in a separate goroutine from the one that stops the leader election runnables, because there is no guarantee as to whether or not the
Warmup
or theStart
method is holding the lock for thedidStartEventSourcesOnce
sync.Once instance.Testing
On top of the unit + integration tests, this PR was tested with an internal project that ran this branch of controller-runtime for a week with (EnableWarmup=true`, and verified that the controllers on the non-leader elected replica had a populated workqueue.