Controller Checkpointing #3234

Open
@austince

I'm thinking about checkpointing controller state to blob storage for some of our high-load controllers, where starting from a full snapshot (even with leader election) can lead to long startup times.

There seems to be a similar issue in #728, but it wasn't properly spec'd.

I'm thinking about this on a per-controller basis, optimistically storing:

  • Latest source watch bookmark
  • Current workqueue

On startup, optimistically load the previous state, configure the source with the last bookmark, and reload the workqueue.
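
As a rough sketch of what such a per-controller checkpoint could look like, the snippet below serializes a bookmark and the queued items to JSON and reads them back. The names `Checkpoint`, `Item`, `Encode`, and `Decode` are hypothetical and not part of controller-runtime; `Item` stands in for whatever request type the controller's workqueue actually holds, and writing the bytes to blob storage is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item identifies one queued object. Hypothetical stand-in for the
// request type a real controller workqueue would hold.
type Item struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// Checkpoint is a hypothetical per-controller snapshot: the last watch
// bookmark seen by the source and the keys currently in the workqueue.
type Checkpoint struct {
	Bookmark string `json:"bookmark"` // resourceVersion of the latest bookmark
	Queue    []Item `json:"queue"`
}

// Encode turns a checkpoint into bytes suitable for blob storage.
func Encode(cp Checkpoint) ([]byte, error) {
	return json.Marshal(cp)
}

// Decode reads a checkpoint back. Callers should treat any error as
// "no usable checkpoint" and fall back to starting from snapshot.
func Decode(data []byte) (Checkpoint, error) {
	var cp Checkpoint
	err := json.Unmarshal(data, &cp)
	return cp, err
}

func main() {
	cp := Checkpoint{
		Bookmark: "12345",
		Queue:    []Item{{Namespace: "default", Name: "widget-a"}},
	}
	data, err := Encode(cp)
	if err != nil {
		panic(err)
	}
	restored, err := Decode(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("resume watch from %s with %d queued item(s)\n",
		restored.Bookmark, len(restored.Queue))
}
```

Since loading is optimistic, a decode failure here would simply mean falling back to the normal snapshot-based startup.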

The restrictions:

  • It's on the user to indicate when to invalidate checkpoints and start from snapshot (e.g., when controller logic changes)
  • Don't record the requeue times per item in the workqueues
  • Don't snapshot the informer state
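
One way to picture the first two restrictions, purely as an illustration: the user supplies a version string for the controller logic, and a checkpoint written under a different version is ignored. Requeue delays are dropped when the snapshot is taken, so only keys survive. `PendingItem`, `VersionedCheckpoint`, `Snapshot`, and `Restore` are all hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// PendingItem is a hypothetical view of one workqueue entry.
type PendingItem struct {
	Key          string
	RequeueAfter time.Duration
}

// VersionedCheckpoint carries a user-supplied logic version so that
// stale checkpoints can be rejected after the controller logic changes.
type VersionedCheckpoint struct {
	LogicVersion string
	Bookmark     string
	Keys         []string
}

// Snapshot keeps only the keys; per-item requeue times are dropped on purpose.
func Snapshot(version, bookmark string, pending []PendingItem) VersionedCheckpoint {
	keys := make([]string, 0, len(pending))
	for _, p := range pending {
		keys = append(keys, p.Key)
	}
	sort.Strings(keys) // stable output makes checkpoints easy to diff
	return VersionedCheckpoint{LogicVersion: version, Bookmark: bookmark, Keys: keys}
}

// Restore returns ok=false when the checkpoint was written by a different
// logic version, in which case the caller starts from snapshot instead.
func Restore(cp VersionedCheckpoint, currentVersion string) (string, []string, bool) {
	if cp.LogicVersion != currentVersion {
		return "", nil, false
	}
	return cp.Bookmark, cp.Keys, true
}

func main() {
	cp := Snapshot("v1", "12345", []PendingItem{
		{Key: "default/b", RequeueAfter: 30 * time.Second},
		{Key: "default/a"},
	})

	_, _, ok := Restore(cp, "v2")
	fmt.Println("usable after logic change:", ok)

	bookmark, keys, ok := Restore(cp, "v1")
	fmt.Println(bookmark, keys, ok)
}
```

With this shape, every restored key would be re-enqueued immediately on startup, which is the cost of not recording requeue times.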

In #728 (comment), the suggestion was to use a separate Runnable. This works for the Start case (easy to wrap and load state), but not for the shutdown case, as we need to ensure the runnable is no longer running before checkpointing the state.
Hooking into the Manager also doesn't seem to be an option as it exits Start(ctx) as soon as the context is cancelled, and also doesn't provide access to the underlying runnables.

Controllers, on the other hand, do wait to return from Start(ctx) until they are fully finished.


The least intrusive way I think we could expose this is via an opt-in extension to the Runnable interface, e.g. FinishableRunnable:

// FinishableRunnable is a Runnable that can indicate when it has finished running.
type FinishableRunnable interface {
	Runnable
	Done() <-chan struct{}
}
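
For illustration, a wrapper along these lines could satisfy the proposed interface for any existing Runnable by closing a channel once the wrapped Start returns. This is only a sketch: `Runnable` and `RunnableFunc` are redeclared locally so the snippet is self-contained, and `Finishable`, `NewFinishable`, and `runDemo` are hypothetical names. It assumes Start is called at most once.

```go
package main

import (
	"context"
	"fmt"
)

// Runnable is redeclared locally so the sketch is self-contained.
type Runnable interface {
	Start(ctx context.Context) error
}

// RunnableFunc adapts a plain function to Runnable (local helper).
type RunnableFunc func(ctx context.Context) error

func (f RunnableFunc) Start(ctx context.Context) error { return f(ctx) }

// Finishable wraps a Runnable and exposes when its Start has returned.
type Finishable struct {
	inner Runnable
	done  chan struct{}
}

func NewFinishable(inner Runnable) *Finishable {
	return &Finishable{inner: inner, done: make(chan struct{})}
}

// Start runs the wrapped Runnable and closes done when it returns.
// Assumes it is called at most once; a second call would panic on close.
func (f *Finishable) Start(ctx context.Context) error {
	defer close(f.done)
	return f.inner.Start(ctx)
}

// Done is closed once the wrapped Start has fully returned.
func (f *Finishable) Done() <-chan struct{} { return f.done }

// runDemo reports whether Done fired before cancellation, plus Start's error.
func runDemo() (finishedEarly bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	f := NewFinishable(RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done() // stand-in for a controller draining its workers
		return nil
	}))

	errCh := make(chan error, 1)
	go func() { errCh <- f.Start(ctx) }()

	select {
	case <-f.Done():
		finishedEarly = true
	default: // still running, as expected before cancellation
	}

	cancel()
	<-f.Done() // only now would it be safe to checkpoint state
	return finishedEarly, <-errCh
}

func main() {
	early, err := runDemo()
	fmt.Println("finished before cancel:", early, "error:", err)
}
```

The point of waiting on Done() rather than on context cancellation is that the checkpoint is only taken after the runnable has actually stopped touching its state.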

In the manager, it could block until all such runnables are complete (or until an error is reported via errChan).

// We are done
return nil
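
To make the manager side concrete, here is a hedged sketch of the blocking step. It is not the manager's actual code: `waitForFinishable`, `stub`, `plain`, and `runDemo` are hypothetical, and the interfaces are redeclared locally. Runnables that don't opt in are skipped, and an error on `errChan` ends the wait early.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type Runnable interface {
	Start(ctx context.Context) error
}

type FinishableRunnable interface {
	Runnable
	Done() <-chan struct{}
}

// waitForFinishable blocks until every opted-in runnable reports Done,
// or returns the first error received on errChan.
func waitForFinishable(runnables []Runnable, errChan <-chan error) error {
	for _, r := range runnables {
		fr, ok := r.(FinishableRunnable)
		if !ok {
			continue // plain Runnables keep today's behavior
		}
		select {
		case <-fr.Done():
		case err := <-errChan:
			return err
		}
	}
	return nil
}

// stub is a test runnable that finishes when its context is cancelled.
type stub struct{ done chan struct{} }

func (s *stub) Start(ctx context.Context) error {
	<-ctx.Done()
	close(s.done)
	return nil
}
func (s *stub) Done() <-chan struct{} { return s.done }

// plain does not implement Done and is therefore ignored by the wait.
type plain struct{}

func (plain) Start(ctx context.Context) error { return nil }

func runDemo() (clean error, failed error) {
	ctx, cancel := context.WithCancel(context.Background())
	a := &stub{done: make(chan struct{})}
	go a.Start(ctx)
	cancel()
	clean = waitForFinishable([]Runnable{a, plain{}}, make(chan error))

	stuck := &stub{done: make(chan struct{})} // never started, Done never closes
	errChan := make(chan error, 1)
	errChan <- errors.New("runnable failed")
	failed = waitForFinishable([]Runnable{stuck}, errChan)
	return clean, failed
}

func main() {
	clean, failed := runDemo()
	fmt.Println("clean shutdown:", clean)
	fmt.Println("failed shutdown:", failed)
}
```

A real implementation would presumably also want an upper bound on how long to wait, which this sketch leaves out.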

Does anyone have any other less-intrusive ideas?

I'd be open to contributing the full checkpointing solution, but would like to see if this can be done in user-space first. I briefly talked with @alvaroaleman about this, though he's out for a few more months IIRC.
