[design] Proposal of triggering backups based on Kubernetes events. #2119

Closed

Contributor

@ezzoueidi ezzoueidi commented Dec 11, 2019

Closes #2111

Any new ideas/thoughts would be much appreciated.

Signed-off-by: Naeil Ezzoueidi naeilzoueidi@ubuntu.com

cc @nrb

@nrb nrb added the Area/Design Design Documents label Dec 11, 2019
Signed-off-by: Naeil Ezzoueidi <naeilzoueidi@ubuntu.com>
@carlisia
Contributor

👀

@carlisia
Contributor

I'm still not quite sure that this belongs in Velero core. At the very least, I'd like to see more users requesting this feature.

My initial comment: This proposal has as a goal to trigger a backup in case of an accidental deletion, and to block a deletion of a component if it was performed by a human. I'd like to see it address how the system would distinguish an accidental from an intentional deletion.

The part that proposes blocking deletions could also be addressed with more details. It states it would happen in the case of human intervention (and trigger a backup), but how would the system distinguish this?

Contributor

@nrb nrb left a comment

While I think triggering of backups based on kubernetes events is definitely an interesting feature, I think it's also one that's complicated.

I'd like more details on how users define the particular resources and actions to watch for before accepting this proposal, and even if accepted, I don't believe that's a commitment from the core team to implement it any time soon.

That being said, I also believe this is something that could very well be implemented as an external controller by someone else, and deployed alongside Velero.

Also, the 3rd goal mentioned doesn't seem to fit to me - I would feel better about this proposal if it was excluded.


## Non Goals

- N/A
Contributor

I think some non-goals here might be useful in providing constraints around the design.

## Goals

- Rather than performing backups manually or by scheduling them, backups are created based on events.
- Recovering the components of the cluster when they were deleted accidentally.
Contributor

Can you clarify this some? How is this different from what Velero already does?

Contributor Author

With Velero today, we create backups manually or by scheduling them. What is different from what Velero already does is that a backup is triggered for components that are about to be deleted.

Contributor

So you're envisioning that this would intercept a delete request and back up the components before allowing them to delete?


- Blocking the deletion of a component in the cluster if it was initiated by a human, and triggering a backup.
Contributor

This is an interesting idea - can you go into detail on how you'd determine whether the deletion was triggered by a human or not?

Contributor Author

I am thinking about watching all the delete events and listening for kubectl delete commands; this way we could distinguish whether the action was performed by a human or not.

Contributor

Ok. Do you know how to distinguish this via the events emitted by the Kubernetes API server? They may be labelled or annotated differently, but I'm not sure.

If they are, I'd like to see that called out in the document.
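
One possible way to make this concrete: as far as I know, watch events (ADDED, MODIFIED, DELETED) carry only the event type and the object, not the identity of the requester, whereas an admission request for a DELETE does include the caller's username and groups. Below is a minimal sketch of a heuristic on that identity. The `userInfo` struct and the `likelyHuman` function are hypothetical names for illustration, and the rule itself is an assumption, not something Velero or Kubernetes provides: it treats `system:`-prefixed identities (service accounts, nodes, control-plane components) as automation and everything else as a human.

```go
package main

import (
	"fmt"
	"strings"
)

// userInfo is a simplified, hypothetical stand-in for the caller identity
// (username and groups) that accompanies an admission request.
type userInfo struct {
	Username string
	Groups   []string
}

// likelyHuman is an assumed heuristic, not an official API:
// identities in the reserved "system:" space are treated as automation.
func likelyHuman(u userInfo) bool {
	if u.Username == "" || strings.HasPrefix(u.Username, "system:") {
		return false
	}
	for _, g := range u.Groups {
		// Service accounts and nodes belong to these built-in groups.
		if g == "system:serviceaccounts" || g == "system:nodes" {
			return false
		}
	}
	return true
}

func main() {
	alice := userInfo{Username: "alice@example.com", Groups: []string{"system:authenticated"}}
	sa := userInfo{
		Username: "system:serviceaccount:kube-system:deployer",
		Groups:   []string{"system:serviceaccounts"},
	}
	fmt.Println(likelyHuman(alice)) // true
	fmt.Println(likelyHuman(sa))    // false
}
```

A pipeline that authenticates with a personal user credential would be misclassified as a human here, so this can only serve as a hint, not a decision.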


## Detailed Design

A custom controller with its own reflector that lists and watches specific components using the Kubernetes watch API.
Contributor

How would users specify what components to watch? How would users specify what events to watch for?

Contributor Author

This could be achieved through namespace selectors. As for which events to watch, the default should be to watch all delete events.

Contributor

Namespaces are one way, sure. I was also thinking about specifying per resource type.

I'd recommend providing explicit examples of what the user would type to list out events and namespaces. What would the configuration file or CRD look like? Would it be an addition to an existing CRD? These kinds of questions would be good to be answered in the proposal itself.
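
To illustrate what such a configuration could look like, here is a sketch of a rule type in Go. `WatchRule`, its fields, and `Matches` are hypothetical and exist in neither Velero nor the proposal. The only behaviour taken from this thread is selection by namespace, selection by resource type, and defaulting to delete events.

```go
package main

import "fmt"

// WatchRule is a hypothetical configuration entry describing what to watch.
type WatchRule struct {
	Namespaces []string // empty means all namespaces
	Resources  []string // e.g. "deployments"; empty means all resource types
	Events     []string // watch event types; empty defaults to DELETED only
}

// matchOrAll reports whether v is in list, treating an empty list as "match all".
func matchOrAll(list []string, v string) bool {
	if len(list) == 0 {
		return true
	}
	for _, x := range list {
		if x == v {
			return true
		}
	}
	return false
}

// Matches reports whether an observed event falls under this rule.
func (r WatchRule) Matches(namespace, resource, event string) bool {
	events := r.Events
	if len(events) == 0 {
		events = []string{"DELETED"} // default suggested in the thread
	}
	return matchOrAll(r.Namespaces, namespace) &&
		matchOrAll(r.Resources, resource) &&
		matchOrAll(events, event)
}

func main() {
	rule := WatchRule{Namespaces: []string{"prod"}, Resources: []string{"deployments"}}
	fmt.Println(rule.Matches("prod", "deployments", "DELETED"))  // true
	fmt.Println(rule.Matches("prod", "deployments", "MODIFIED")) // false
	fmt.Println(rule.Matches("dev", "deployments", "DELETED"))   // false
}
```

Whether a shape like this would live in a new CRD or extend an existing one is exactly the open question raised above; the sketch takes no position on that.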


It will add the monitored object, with its status/current event, to a queue.
Then pop the object based on its status (criticality and priority).
Contributor

How is the priority determined?

Contributor Author

I would say based on labels, e.g. `status: critical`.

Contributor

So each object would have to be labelled? Would this be a standard label, or are we introducing a new label just for this?

You can incorporate your answers into the proposal doc.

Block the delete process if it was performed by a human and trigger a backup for that component.
Try to catch errors occurring on the components so it can predict when they will go down and trigger backups.
Contributor

I'm not sure how this part connects to backups - there's something I'm missing.

Contributor Author

For example, we could rely on other monitoring tools (Promtail and Loki). Based on metrics, we could predict whether the application is about to go down (e.g. when it is consuming a lot of memory, disk, etc.).

Contributor

I think that's getting outside of Velero's scope and into machine learning. While I think it's useful functionality for sure, I don't think I would accept this portion into Velero core.

@ezzoueidi
Contributor Author

Thank you @carlisia and @nrb for the review and the comments. Sorry for the delay; I will reply and push some modifications this week.

@ezzoueidi
Contributor Author

> I'm still not quite sure that this belongs in Velero core. At the very least, I'd like to see more users requesting this feature.

This could also be implemented as a plugin with a new kind. IMHO, this is a good feature, as it covers a real scenario that can happen.

> My initial comment: This proposal has as a goal to trigger a backup in case of an accidental deletion, and to block a deletion of a component if it was performed by a human. I'd like to see it address how the system would distinguish an accidental from an intentional deletion.

I would say that this is the role of a custom controller that uses the Kubernetes watch API and listens for any kubectl delete commands to distinguish whether the action was performed by a human or not.
Either way, I think backing up the components in both situations is good. What do you think?

Contributor

@nrb nrb left a comment

> This could also be implemented as a plugin with a new kind. IMHO, this is a good feature, as it covers a real scenario that can happen.

I honestly see this proposal as being far larger than a plugin. It's a new type of controller, possibly multiple controllers. These controllers could also easily live outside the Velero codebase, I think, triggering backups by submitting a Backup CRD to the kubernetes API server.

> Either way, I think backing up the components in both situations is good. What do you think?

I think that this is a complicated thing to achieve. I would like a lot more detail on exactly how a system would determine whether a deletion was from a human or automated - was it a cascading delete? Are there finalizers blocking deletion? And when I say detail, I think here I would really like examples of data structure fields or functions showing how the decision is made.

I'd also like to know - if this is based on deletion events, is the intention to force the delete to pause until the backup runs and completes? What happens in the case of a finalizer that isn't removed and the deletion never actually happens?

These scenarios can get pretty complicated, and I'd like to see them at least mentioned in the proposal.

@carlisia
Contributor

I'm putting this on the agenda for tomorrow's meeting. @nzoueidi if you could join it would be great: https://velero.io/community/.

@carlisia
Contributor

We discussed this in our community meeting today. The general consensus is that this request overall is useful but it is a great use-case for an operator, and not so much for inclusion into the Velero code base.

We will leave this PR open for a couple weeks with the intent of welcoming additional opinions in favor of this request, in case we are missing the needs of more users and how they would be using this. Else we'll close it but this conversation can always be picked up later.

Will post a link to the meeting recording once it's up on YT.

@skriss
Member

skriss commented Feb 11, 2020

We haven't gotten any further input here; I'm going to close this out.

If someone works on this outside of core Velero, we'd love to see it!

Successfully merging this pull request may close these issues.

Trigger backups based on k8s events
4 participants