Controller Checkpointing

I'm thinking about checkpointing controller state to blob storage for some of our high-load controllers, where starting from a full snapshot (even with leader election) can lead to long startup times.

A similar request seems to have been raised in https://github.com/kubernetes-sigs/controller-runtime/issues/728, but it was never properly spec'd.

I'm thinking about this on a per-controller basis, optimistically storing:
* Latest source watch bookmark
* Current workqueue

On startup, optimistically load the previous state, configure the source with the last bookmark, and reload the workqueue.

The restrictions:
* It's on the user to indicate when to invalidate checkpoints and start from a snapshot (e.g., when controller logic changes)
* Don't record the requeue times per item in the workqueues
* Don't snapshot the informer state 


In https://github.com/kubernetes-sigs/controller-runtime/issues/728#issuecomment-571352448, the suggestion was to use a separate [Runnable](https://github.com/kubernetes-sigs/controller-runtime/blob/d9ff283bfe844e8e3806eb2d264b2a6fa7815f66/pkg/manager/manager.go#L290-L298). This works for the Start case (easy to wrap and load state), but not for the shutdown case, as we need to ensure the runnable is no longer running before checkpointing the state.
Hooking into the Manager also doesn't seem to be an option as it exits `Start(ctx)` as soon as the context is cancelled, and also doesn't provide access to the underlying runnables.

Controllers do not return from `Start(ctx)` until they are fully finished: https://github.com/kubernetes-sigs/controller-runtime/blob/5f5daf39228530d7c5ed8f54e9a80c0e4528c9f6/pkg/internal/controller/controller.go#L216
The least intrusive way I think we could expose this is via an opt-in extension to the `Runnable` interface, e.g. `FinishableRunnable`:

```go
// FinishableRunnable is a Runnable that can indicate when it has finished running.
type FinishableRunnable interface {
	Runnable
	Done() <-chan struct{}
}
``` 

On shutdown, the manager could then block until all such runnables are complete (or until an error is reported via errChan):

https://github.com/kubernetes-sigs/controller-runtime/blob/5af1f3ebd472b62a4a708ad3aa2d252489b91b27/pkg/manager/internal.go#L468-L469

Does anyone have any other less-intrusive ideas? 


I'd be open to contributing the full checkpointing solution, but would like to see if this can be done in user-space first. I'd briefly talked with @alvaroaleman about this, though he's out for a few more months IIRC.





Controller Checkpointing #3234
