Description
I'm thinking about checkpointing controller state to blob storage for some of our high-load controllers, where starting from the snapshot (even with leader elected) can lead to long startup times.
There seems to be a similar issue in #728, but it wasn't properly spec'd.
I'm thinking about this on a per-controller basis, optimistically storing:
- Latest source watch bookmark
- Current workqueue
On startup, optimistically load the previous state, configure the source with the last bookmark, and reload the workqueue.
The restrictions:
- It's on the user to indicate when to invalidate checkpoints and start from snapshot (e.g., when controller logic changes)
- Don't record the requeue times per item in the workqueues
- Don't snapshot the informer state
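To make the shape of the stored state concrete, here is a minimal sketch of what a per-controller checkpoint could look like under the restrictions above. The `Checkpoint` type, its fields, and the `Encode`/`Decode` helpers are all hypothetical names for illustration; nothing like this exists in controller-runtime today, and the user-supplied version string is just one possible way to handle invalidation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Checkpoint is a hypothetical per-controller snapshot. It holds only the
// watch bookmark and the pending queue keys: no requeue times and no
// informer state.
type Checkpoint struct {
	// Version is chosen by the user; changing it (e.g. when controller
	// logic changes) invalidates older checkpoints.
	Version string `json:"version"`
	// ResourceVersion is the latest bookmark observed by the source watch.
	ResourceVersion string `json:"resourceVersion"`
	// Queue holds the pending request keys ("namespace/name").
	Queue []string `json:"queue"`
}

// Encode serializes the checkpoint so it can be uploaded to blob storage.
func (c Checkpoint) Encode() ([]byte, error) { return json.Marshal(c) }

// Decode parses a stored checkpoint. It returns false when the data is
// unreadable or the version differs, in which case the caller should fall
// back to starting from a full snapshot.
func Decode(data []byte, wantVersion string) (Checkpoint, bool) {
	var c Checkpoint
	if err := json.Unmarshal(data, &c); err != nil {
		return Checkpoint{}, false
	}
	if c.Version != wantVersion {
		return Checkpoint{}, false
	}
	return c, true
}

func main() {
	cp := Checkpoint{Version: "v1", ResourceVersion: "12345", Queue: []string{"default/a", "default/b"}}
	data, err := cp.Encode()
	if err != nil {
		panic(err)
	}
	if restored, ok := Decode(data, "v1"); ok {
		fmt.Println("resuming from bookmark", restored.ResourceVersion, "with", len(restored.Queue), "queued items")
	}
	if _, ok := Decode(data, "v2"); !ok {
		fmt.Println("checkpoint invalidated, starting from snapshot")
	}
}
```

Loading is optimistic in this sketch: any decode failure or version mismatch simply means the controller starts from the snapshot as it does today.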
In #728 (comment), the suggestion was to use a separate Runnable. This works for the Start case (easy to wrap and load state), but not for the shutdown case, as we need to ensure the runnable is no longer running before checkpointing the state.
Hooking into the Manager also doesn't seem to be an option, as it exits Start(ctx) as soon as the context is cancelled, and also doesn't provide access to the underlying runnables. Controllers do wait to return from Start(ctx) until they are fully finished.
The least intrusive way I think we could expose this is via an opt-in extension to the Runnable interface, e.g. FinishableRunnable:
```go
// FinishableRunnable is a Runnable that can indicate when it has finished running.
type FinishableRunnable interface {
	Runnable
	Done() <-chan struct{}
}
```
In the manager, it could block until all such runnables are complete (or until an error is reported via errChan).
See controller-runtime/pkg/manager/internal.go, lines 468 to 469 at commit 5af1f3e.
Does anyone have any other less-intrusive ideas?
I'd be open to contributing the full checkpointing solution, but would like to see if this can be done in user-space first. I briefly talked with @alvaroaleman about this, though he's out for a few more months IIRC.