[design] Proposal of triggering backups based on Kubernetes events. #2119

Closed
35 changes: 35 additions & 0 deletions design/trigger-backups-api-events.md
@@ -0,0 +1,35 @@
# Trigger backups based on Kubernetes events

Triggering backups with Velero based on Kubernetes API events (Terminating, deleting, CrashLoopBackOff, etc.).

## Goals

- Create backups in response to events, rather than only manually or on a schedule.
- Recover cluster components that were deleted accidentally.
Contributor

Can you clarify this some? How is this different from what Velero already does?

Contributor Author

With Velero, we create backups manually or by scheduling them. What is different from what Velero already does is that we trigger backups for the components that are about to be deleted.

Contributor

So you're envisioning that this would intercept a delete request and back up the components before allowing them to delete?

- Block the deletion of a component in the cluster if it was initiated by a human, and trigger a backup.
Contributor

This is an interesting idea - can you go into detail on how you'd determine whether the deletion was triggered by a human or not?

Contributor Author

I am thinking about watching all the delete events and listening for kubectl delete commands; this way we could distinguish whether the action was performed by a human or not.

Contributor

Ok. Do you know how to distinguish this via the events emitted on the Kubernetes API server? They may be labelled or annotated differently, but I'm not sure.

If they are, I'd like to see that called out in the document.
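
As a hedged illustration of the question above: to my knowledge, core Kubernetes Event objects and watch notifications do not record who issued a request, while API server audit events carry `user.username` and `userAgent`, and admission requests carry `userInfo.username`. The sketch below is a heuristic under those assumptions, not a reliable classifier: a script can call kubectl, and a person can use a service account token. The `RequestInfo` type and `LikelyHuman` function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// RequestInfo holds the two fields this sketch relies on. The struct is
// hypothetical; the values would come from an audit event (user.username and
// userAgent) or, for the username only, from an admission request's userInfo.
type RequestInfo struct {
	Username  string
	UserAgent string
}

// LikelyHuman is a heuristic, not a guarantee. It treats a request as
// human-initiated when the user is outside the reserved "system:" namespace
// and the client identifies itself as kubectl.
func LikelyHuman(r RequestInfo) bool {
	// Service accounts ("system:serviceaccount:<ns>:<name>") and built-in
	// components use the reserved "system:" username prefix.
	if strings.HasPrefix(r.Username, "system:") {
		return false
	}
	// kubectl's default User-Agent begins with "kubectl/".
	return strings.HasPrefix(r.UserAgent, "kubectl/")
}

func main() {
	reqs := []RequestInfo{
		{Username: "alice@example.com", UserAgent: "kubectl/v1.17.0 (linux/amd64)"},
		{Username: "system:serviceaccount:kube-system:deployment-controller", UserAgent: "kube-controller-manager/v1.17.0"},
	}
	for _, r := range reqs {
		fmt.Printf("%s -> likely human: %v\n", r.Username, LikelyHuman(r))
	}
}
```

If the proposal adopts something along these lines, it would need to state which data source it reads (audit log, audit webhook, or admission webhook), since each requires different cluster configuration.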


## Non Goals

- N/A
Contributor

I think some non-goals here might be useful in providing constraints around the design.


## Background

## High-Level Design

A custom controller that lists and watches all Kubernetes events and triggers backups based on them.
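
A minimal sketch of that loop, using only the Go standard library so it runs on its own. Every name here (`ResourceEvent`, `BackupFunc`, `Run`) is hypothetical; a real controller would more likely receive events from client-go informers and create a Velero Backup object inside the callback. One caveat worth keeping in mind: a DELETED watch notification arrives after the object has been removed, so a backup triggered at that point cannot capture the object itself.

```go
package main

import "fmt"

// EventType mirrors the watch notification types this sketch cares about.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// ResourceEvent is a hypothetical, simplified view of one watch notification.
type ResourceEvent struct {
	Type      EventType
	Kind      string
	Namespace string
	Name      string
}

// BackupFunc stands in for whatever would actually create the backup.
type BackupFunc func(ev ResourceEvent) error

// Run consumes events until the channel is closed and calls backup for every
// event whose type is enabled in triggerOn. It returns how many backups
// were triggered successfully.
func Run(events <-chan ResourceEvent, triggerOn map[EventType]bool, backup BackupFunc) int {
	triggered := 0
	for ev := range events {
		if !triggerOn[ev.Type] {
			continue
		}
		if err := backup(ev); err != nil {
			fmt.Printf("backup for %s/%s failed: %v\n", ev.Namespace, ev.Name, err)
			continue
		}
		triggered++
	}
	return triggered
}

func main() {
	events := make(chan ResourceEvent, 3)
	events <- ResourceEvent{Type: Added, Kind: "Deployment", Namespace: "production", Name: "web"}
	events <- ResourceEvent{Type: Modified, Kind: "Deployment", Namespace: "production", Name: "web"}
	events <- ResourceEvent{Type: Deleted, Kind: "Deployment", Namespace: "production", Name: "web"}
	close(events)

	n := Run(events, map[EventType]bool{Deleted: true}, func(ev ResourceEvent) error {
		fmt.Printf("would back up %s %s/%s\n", ev.Kind, ev.Namespace, ev.Name)
		return nil
	})
	fmt.Println("backups triggered:", n)
}
```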

## Detailed Design

A custom controller with its own reflector that lists and watches specific components using the Kubernetes watch API.
Contributor

How would users specify what components to watch? How would users specify what events to watch for?

Contributor Author

This could be achieved through namespace selectors. As for specifying which events to watch, the default should be to watch all delete events.

Contributor

Namespaces are one way, sure. I was also thinking about specifying per resource type.

I'd recommend providing explicit examples of what the user would type to list out events and namespaces. What would the configuration file or CRD look like? Would it be an addition to an existing CRD? It would be good to answer these kinds of questions in the proposal itself.
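
To make the configuration question concrete, here is one hypothetical shape such a spec could take, written as Go types with JSON tags. Nothing here exists in Velero: `BackupTriggerSpec` and its `events` field are invented for illustration, and the `includedNamespaces` / `includedResources` names and `*` wildcard are borrowed from Velero's Backup spec only as a naming assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackupTriggerSpec is a hypothetical spec combining the two selection axes
// raised in the discussion (namespaces and resource types) with the watch
// event types that should trigger a backup.
type BackupTriggerSpec struct {
	IncludedNamespaces []string `json:"includedNamespaces"`
	IncludedResources  []string `json:"includedResources"`
	Events             []string `json:"events"`
}

// contains reports whether list holds v or the "*" wildcard.
func contains(list []string, v string) bool {
	for _, item := range list {
		if item == "*" || item == v {
			return true
		}
	}
	return false
}

// Matches reports whether an event on the given resource should trigger a backup.
func (s BackupTriggerSpec) Matches(namespace, resource, event string) bool {
	return contains(s.IncludedNamespaces, namespace) &&
		contains(s.IncludedResources, resource) &&
		contains(s.Events, event)
}

func main() {
	spec := BackupTriggerSpec{
		IncludedNamespaces: []string{"production"},
		IncludedResources:  []string{"deployments", "persistentvolumeclaims"},
		Events:             []string{"DELETED"},
	}

	// Print the spec roughly as a user might write it.
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	fmt.Println(spec.Matches("production", "deployments", "DELETED"))
	fmt.Println(spec.Matches("staging", "deployments", "DELETED"))
}
```

Whether this would be a new CRD or additional fields on an existing one is left open, as in the discussion above.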

It will add the monitored object, with its status/current event, to a queue.
Then it will pop objects off the queue based on their status (criticality and priority).
Contributor

How is the priority determined?

Contributor Author

I would say based on labels, e.g. status: critical.

Contributor

So each object would have to be labelled? Would this be a standard label, or are we introducing a new label just for this?

You can incorporate your answers into the proposal doc.
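
The label-based ordering discussed in this thread could be sketched with Go's `container/heap`, as below. The `status: critical` label comes from the discussion; treating every other object as normal priority, and breaking ties by arrival order, are assumptions made for this example. `BackupQueue`, `Item`, and `priority` are hypothetical names.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Item is one object waiting for a backup.
type Item struct {
	Name   string
	Labels map[string]string
	seq    int // arrival order, used to break ties
}

// priority maps labels to a rank; lower values are served first.
// Only "status: critical" is recognized in this sketch.
func priority(labels map[string]string) int {
	if labels["status"] == "critical" {
		return 0
	}
	return 1
}

// queue implements heap.Interface over pending items.
type queue []*Item

func (q queue) Len() int { return len(q) }
func (q queue) Less(i, j int) bool {
	pi, pj := priority(q[i].Labels), priority(q[j].Labels)
	if pi != pj {
		return pi < pj
	}
	return q[i].seq < q[j].seq
}
func (q queue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x interface{}) { *q = append(*q, x.(*Item)) }
func (q *queue) Pop() interface{} {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

// BackupQueue wraps the heap and assigns arrival sequence numbers.
type BackupQueue struct {
	q    queue
	next int
}

// Add enqueues an object together with its labels.
func (b *BackupQueue) Add(name string, labels map[string]string) {
	heap.Push(&b.q, &Item{Name: name, Labels: labels, seq: b.next})
	b.next++
}

// Next removes and returns the most urgent object, or false if the queue is empty.
func (b *BackupQueue) Next() (string, bool) {
	if b.q.Len() == 0 {
		return "", false
	}
	return heap.Pop(&b.q).(*Item).Name, true
}

func main() {
	var bq BackupQueue
	bq.Add("web", nil)
	bq.Add("db", map[string]string{"status": "critical"})
	bq.Add("cache", nil)
	for {
		name, ok := bq.Next()
		if !ok {
			break
		}
		fmt.Println(name) // prints db, then web, then cache
	}
}
```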

Block the delete process if it was performed by a human and trigger a backup for that component.
Try to catch errors occurring on the components in order to predict when they will go down, and trigger backups.
Contributor

I'm not sure how this part connects to backups - there's something I'm missing.

Contributor Author

For example, we could use other monitoring tools (Promtail and Loki); based on metrics, we could predict whether the application is going to go down (for example, when it is consuming a lot of resources such as memory or disk).

Contributor

I think that's getting outside of Velero's scope and into machine learning. While I think it's useful functionality for sure, I don't think I would accept this portion into Velero core.
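
Separately from the prediction idea, the "block the delete and trigger a backup" step in the detailed design could plausibly be built on a validating admission webhook, since a watch-based controller only observes a deletion after the fact. The sketch below shows only the decision logic, under several assumptions: the structs are a hand-written subset of the `admission.k8s.io/v1` AdmissionReview fields, `Decide` and the `backedUp` callback are hypothetical, and the request is denied rather than held open because admission webhooks are bounded by a short timeout that a backup may not fit into.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Minimal, hand-written subset of admission.k8s.io/v1 AdmissionReview.
type AdmissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *AdmissionRequest  `json:"request,omitempty"`
	Response   *AdmissionResponse `json:"response,omitempty"`
}

type UserInfo struct {
	Username string `json:"username"`
}

type AdmissionRequest struct {
	UID       string   `json:"uid"`
	Operation string   `json:"operation"`
	Namespace string   `json:"namespace"`
	Name      string   `json:"name"`
	UserInfo  UserInfo `json:"userInfo"`
}

type Status struct {
	Message string `json:"message"`
}

type AdmissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *Status `json:"status,omitempty"`
}

// Decide allows everything except DELETE requests from non-"system:" users
// for objects that backedUp (a hypothetical lookup) reports as not yet backed up.
func Decide(req AdmissionRequest, backedUp func(namespace, name string) bool) AdmissionResponse {
	resp := AdmissionResponse{UID: req.UID, Allowed: true}
	if req.Operation != "DELETE" || strings.HasPrefix(req.UserInfo.Username, "system:") {
		return resp
	}
	if !backedUp(req.Namespace, req.Name) {
		resp.Allowed = false
		resp.Status = &Status{Message: "backup pending; retry the delete once it completes"}
	}
	return resp
}

func main() {
	// Sample request body, trimmed to the fields used here.
	raw := `{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",
	"request":{"uid":"123","operation":"DELETE","namespace":"production",
	"name":"web","userInfo":{"username":"alice@example.com"}}}`

	var review AdmissionReview
	if err := json.Unmarshal([]byte(raw), &review); err != nil {
		panic(err)
	}
	noBackupYet := func(namespace, name string) bool { return false }
	resp := Decide(*review.Request, noBackupYet)
	review.Request = nil
	review.Response = &resp

	out, err := json.Marshal(review)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A real webhook would also need to start the backup when it denies the request, and to decide what happens if the backup fails; both are outside this sketch.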


## Alternatives Considered

Apply the same design in a separate plugin as a new kind.

## Security Considerations

N/A