
Velero should support incremental restores #4066

Closed
pradeep288 opened this issue Aug 24, 2021 · 18 comments
Labels
kind/requirement, Needs info (Waiting for information), Needs Product (Blocked needing input or feedback from Product)

Comments


pradeep288 commented Aug 24, 2021

Describe the problem/challenge you have
Currently, Velero doesn't support incremental restore. What I mean by incremental is: if I have restored X already, and after some time I want to restore X + delta X, Velero doesn't have support for this.

Describe the solution you'd like
During restore, Velero should find the difference between X and X + delta X and restore only delta X.

Environment:

  • Velero version (use velero version): 1.6.1
  • Kubernetes version (use kubectl version): 1.20
  • Kubernetes installer & version: 1.20
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): macOS

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@reasonerjt
Contributor

@pradeep288 thanks for the issue.
I'm relatively new to this project, but I found this piece of code:
https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/restore.go#L1240
It seems velero DOES try to modify the resource even if it exists during the restore.

Could you please give us a more concrete example regarding the incremental restore: what kind of resource failed to be restored when you tried it, and what is in the delta X?

reasonerjt added the kind/requirement and Needs info labels on Aug 25, 2021
@sseago
Collaborator

sseago commented Aug 25, 2021

@reasonerjt If it already exists, the code below that line determines whether the in-cluster resource differs from the resource in the backup. However, only service accounts are actually patched when differences are found. For everything else, whether the in-cluster version matches the backup or not only changes the log message; Velero doesn't currently attempt to update resources in the general case. When Velero runs a restore and there are non-serviceaccount resources that already exist in the cluster, you'll see in the Velero logs the message from this line:
https://github.com/vmware-tanzu/velero/blob/main/pkg/restore/restore.go#L1290
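
For readers following the thread, here is a condensed Go sketch of the branch described above (illustrative only, not the actual Velero source; the comparison and the ServiceAccount handling are deliberately simplified):

```go
// Illustrative sketch of the "resource already exists" branch described above.
// Not the actual Velero source; the comparison and the ServiceAccount patch
// are simplified paraphrases of the behavior discussed in this thread.
package restore

import (
	"log"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func handleAlreadyExists(current, backedUp *unstructured.Unstructured) {
	if apiequality.Semantic.DeepEqual(current.Object["spec"], backedUp.Object["spec"]) {
		// Same as the backup: nothing to do, only an informational log line.
		log.Printf("%s already exists and is unchanged, skipping", backedUp.GetName())
		return
	}
	if backedUp.GetKind() == "ServiceAccount" {
		// ServiceAccounts are the one case that is actually patched when it
		// differs from the backup (patch construction omitted here).
		return
	}
	// Everything else is left untouched; Velero only emits the warning that
	// shows up in the restore output quoted later in this thread.
	log.Printf("could not restore, %s already exists. Warning: the in-cluster "+
		"version is different than the backed-up version.", backedUp.GetName())
}
```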

@pradeep288
Author

Backup of X:

  1. Create a source cluster called A.
  2. Create a namespace called pradeep-draas.
  3. Create a deployment called example with replicas = 3.
  4. Install Velero and take a backup (b1) using Velero.
  5. Create a target cluster called B.
  6. Install Velero and verify backup b1 exists.
  7. Restore the backup (b1) using Velero.
  8. Ensure namespace pradeep-draas and example with replicas = 3 get restored.

Backup of X + Delta:

  1. Switch to cluster A.
  2. Switch to namespace pradeep-draas.
  3. Increase the replicas of the example deployment from 3 to 4.
  4. Take a backup (b2) using Velero.
  5. Switch to cluster B.
  6. Verify backup b2 exists.
  7. Restore the backup (b2) using Velero.
  8. Ensure namespace pradeep-draas and example with replicas = 4 get restored.

Actual Result:

Velero restores 4 pods but doesn't update the ReplicaSet, so we end up with 3 replicas because the ReplicaSet kills the extra pod.

Namespaces:
    pradeep-draas:  could not restore, configmaps "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, pods "example-d6dcb985-2k6q8" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, pods "example-d6dcb985-jpsdv" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, pods "example-d6dcb985-n6kbf" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, pods "example-d6dcb985-xhvjn" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, replicasets.apps "example-d6dcb985" already exists. Warning: the in-cluster version is different than the backed-up version.
                    could not restore, deployments.apps "example" already exists. Warning: the in-cluster version is different than the backed-up version.

@dsu-igeek
Contributor

In general we have avoided this because it has the potential to seriously disrupt a working cluster. Our general advice has been to restore into a new cluster/namespace rather than trying to overwrite an existing installation. The opposite of the problem you are describing may happen in that case - for example, you increased your replica count to 5 after the backup, while 4 was in the backup. At that point, if we were modifying resources, we would set it back to 4.

@pradeep288 Would you give some examples of use cases for why you would want to modify an existing installation with a restore rather than simply creating a new install?

reasonerjt added the Needs info and Needs Product labels and removed the Needs info label on Aug 31, 2021
codegold79 self-assigned this on Oct 29, 2021
@codegold79
Contributor

@dsu-igeek asked,

> Would you give some examples of use cases for why you would want to modify an existing installation with a restore rather than simply creating a new install?

We would like to set up a backup cluster that is identical to a running production cluster, but with one difference: all the replicas in the backup cluster have been scaled down to zero; it has been quiesced. Essentially, the backup cluster is a dormant version of the production cluster.

The reason for a current yet dormant cluster is that, in the case of a disaster, this backup cluster can be quickly unquiesced and become the new production cluster.

In order to keep such a quiesced backup cluster updated, we would like to be able to take backups on a schedule and use Velero restore to update the backup cluster on a similar schedule.

What you suggested, "simply creating a new install", would not work for us, as that would require clearing an entire cluster only to restore it again in its entirety, repeatedly. We would like a solution where an incremental restore is possible. It would be much faster to update a dormant copy of a production cluster than to restore it from scratch based on an updated backup.

I hope that answers your question.

@codegold79
Contributor

codegold79 commented Nov 1, 2021

Solution Brainstorming

Summary

Though not perfect, I suggest that when --force is specified for a Velero restore, Velero should attempt an update on resources already found in the cluster. The downside to this approach is that users will encounter a warning when a resource is not updated because the resource, or a field in the resource, is immutable.

The alternative is not to attempt an update, but rather a delete followed by a create. However, deletion has the potential to disrupt a cluster too much. A delete/create approach is feasible, but perhaps in a future issue/PR, after finalizers and owner references with cascading deletes can be considered.

For now, being able to force a restore, and consequently update a destination cluster while leaving a small number of immutable resources untouched, will be good enough.

Consider Updating Resources

Velero restore uses the dynamic client from client-go to create backed-up objects on the destination cluster.

Currently, the method called on the dynamic client is "Create" located in restore.go line 1234.

Perhaps for this issue, I can create a velero restore command flag such as --override, --force, or --overwrite.

If one of those flags is used, and the "Create" method returns an "Already Exists" error, then restore will call client-go's Update method: https://github.com/kubernetes/client-go/blob/2f5d8b0c528db5467159baccddc3f634fcea0747/dynamic/simple.go#L146. The dynamic client's Update method does an HTTP PUT call to the Kubernetes API server.

In short, if the --force or other similar flag is used, the following will happen (a rough Go sketch follows this list):

  1. Try to create a resource.
  2. If creating a resource fails because resource already exists, then check if the resource has changed.
    i. If the resource on the destination cluster is the same as the object being restored, then no-op.
    ii. If the resources are different, attempt an update (PUT call to Kubernetes server).
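
A minimal sketch of that flow with client-go's dynamic client, assuming the hypothetical flag has already been handled elsewhere (illustrative only, not a Velero implementation; the comparison is deliberately simplified, and note that the live object's resourceVersion must be copied onto the backed-up object before the PUT, which is also relevant to the validation error quoted further down):

```go
// Minimal sketch of the proposed "create, then update on AlreadyExists" flow
// using client-go's dynamic client. Illustrative only; the comparison step is
// deliberately simplified, and flag wiring/error reporting are omitted.
package restore

import (
	"context"

	apiequality "k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
)

func createOrUpdate(ctx context.Context, c dynamic.ResourceInterface, desired *unstructured.Unstructured) error {
	// 1. Try to create the resource.
	if _, err := c.Create(ctx, desired, metav1.CreateOptions{}); err == nil || !apierrors.IsAlreadyExists(err) {
		return err
	}

	// 2. It already exists: fetch the live object so we can compare it with the
	//    backup and reuse its resourceVersion (a bare PUT without one is rejected
	//    with "metadata.resourceVersion ... must be specified for an update").
	current, err := c.Get(ctx, desired.GetName(), metav1.GetOptions{})
	if err != nil {
		return err
	}

	// 2.i. If the live object already matches the backup, no-op.
	if apiequality.Semantic.DeepEqual(current.Object["spec"], desired.Object["spec"]) {
		return nil
	}

	// 2.ii. Otherwise attempt an update (an HTTP PUT). Immutable fields will
	//       still be rejected by API-server validation, as noted below.
	desired = desired.DeepCopy()
	desired.SetResourceVersion(current.GetResourceVersion())
	_, err = c.Update(ctx, desired, metav1.UpdateOptions{})
	return err
}
```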

Consider Deleting and Recreating Resources

This issue is similar to #469, where the idea is to delete and re-create a resource that already exists on the destination cluster. See the comments in that issue for thoughts on this way of tackling the solution.

Concerns and Questions Regarding PUT

  • Q: Does PUT (replacing an object) trigger finalizers the way deletion does? A: No, finalizers don't interfere with the PUT operation.
  • Q: Is PUT able to bypass the immutable field validation protection? A: No. I added "immutable: true" to a secret manifest. Here was the resulting error when I tried to update the secret using PUT:
Failure: Secret \"immutable\" is invalid: data: Forbidden: field is immutable when `immutable` is set
  • Q: Is PUT able to bypass inherently immutable fields (e.g. fields like service.spec.ipFamilies)? A: No, not for Service ipFamilies. I tested changing the ipFamilies in a service resource using PUT and got these errors:
Failure: Service \"my-nginx-2\" is invalid: [metadata.resourceVersion: Invalid value: \"\": must be specified for an update, spec.clusterIPs[0]: Invalid value: []string(nil): primary clusterIP can not be unset, spec.ipFamilies[0]: Invalid value: []core.IPFamily{\"IPv6\"}: may not change once set] 

It seems the error originates from Kubernetes validation, but I can't track down what is doing the validation. Is it the admission webhooks? It seems the error ultimately is reported here.

  • Q: What does PUT do with PVs?
  • Q: Is it easy to call for an incremental Restic restore for data in PVs that already exist on the cluster?
  • Concern: Objects with controllers (e.g. pods) will be terminated and re-created during the time that the deployment and pods are inconsistent. The inconsistency will happen because, in a pod/deployment example, I'll be replacing the old pods with new ones, but the deployment will not yet have been updated. So the old deployment will terminate the newly restored pods and bring back the old pods. When Velero gets to replacing the deployment, it will see that the old pods are inconsistent with the new spec and they will be terminated. Response: I don't think this concern should stop us from having a --force feature. It would be reasonable to expect people to implement hooks to quiesce pods before doing a backup, and to unquiesce after a restore.

Concerns and Questions Regarding Delete and Create

  • Q: When attempting to delete (and re-create) a resource that has a finalizer, how do we deal with finalizers that might prevent deletion of the resource until cleanup events happen? A: I'm not sure what the consequences of this are, but I can use PATCH to remove the finalizer, then DELETE, then CREATE the resource from the restore, then re-add the finalizer (a rough sketch follows this list). This avoids the cleanup logic that finalizers are designed to wait for, but is that an acceptable thing to do?
  • Q: How do we update a PV instead of deleting everything in that part of storage and recreating it?
  • Q: Can we take advantage of Restic incremental PV updates when we see a PV being restored already on the cluster?
  • Concern: Consider resources with owner references. Say there is a parent resource and a child one. Given an owner reference link, deleting the parent will cause deletion of the child. If I go the delete/create route, I'll need to ensure that any child that was deleted is restored. Seems like a lot of complicated logic to add just to do an incremental restore.
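
A rough sketch of the PATCH/DELETE/CREATE sequence from the first bullet above, using client-go's dynamic client (illustrative only; waiting for the deletion to actually complete and re-adding the finalizer afterwards are omitted, and whether skipping finalizer cleanup is acceptable remains the open question):

```go
// Rough sketch of the "strip finalizers, delete, recreate" idea discussed in
// the first bullet above. Illustrative only; not a Velero implementation.
package restore

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

func deleteAndRecreate(ctx context.Context, c dynamic.ResourceInterface, backedUp *unstructured.Unstructured) error {
	// PATCH: clear any finalizers so the delete is not blocked on cleanup logic.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	if _, err := c.Patch(ctx, backedUp.GetName(), types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// DELETE the in-cluster copy. In reality we would also have to wait until
	// the object is actually gone before recreating it.
	if err := c.Delete(ctx, backedUp.GetName(), metav1.DeleteOptions{}); err != nil {
		return err
	}

	// CREATE the object again from the backup (re-adding the finalizer is left out).
	_, err := c.Create(ctx, backedUp, metav1.CreateOptions{})
	return err
}
```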


stale bot commented Jan 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the staled label on Jan 8, 2022
@dymurray
Contributor

I would like to see this feature pushed forward. I see some benefit in having more granular control over the backup/restore process, to force updates to a certain subset of resources.

@dymurray
Contributor

What work is left here for 1.10? My understanding is that ExistingResourcePolicy satisfies this use case as you can now restore the delta of the resources in the cluster.

@reasonerjt
Contributor

I agree with @dymurray

@pradeep288 could you clarify?

reasonerjt removed the 1.10-candidate label on Jun 21, 2022
@reasonerjt
Contributor

I'm closing this issue, as ExistingResourcePolicy was introduced in v1.9.


lxs137 commented Mar 8, 2023

Does ExistingResourcePolicy=update mean it bypasses the immutable field validation protection?

@shubham-pampattiwar
Collaborator

@lxs137 Not exactly. If you set it to update and you try to update an immutable spec, the restore will error out for that particular k8s resource.


boedy commented Apr 4, 2023

Our use-case is exactly the same as described by @codegold79. For me it's unclear if ExistingResourcePolicy actually provides a solution for this use-case. Some questions I have:

  1. Does this allow for incremental restores using restic, only downloading the backup delta?
  2. Having a quiesced cluster would need support for design for data-only restores #504, I presume?

@codegold79 It has been about 1.5 years since your last post. Have you pursued an alternative solution?

@sseago
Collaborator

sseago commented Apr 11, 2023

For restic, you'd need to delete the mounting pod prior to restore, since restic requires an initcontainer added at pod startup. If you do that, velero should perform a restic restore for any volumes mounted by the newly-created pod. With that done, existingResourcePolicy is probably irrelevant for Restic, since that field addresses kubernetes metadata update, not content on volumes.

@codegold79
Contributor

> @codegold79 It has been about 1.5 years since your last post. Have you pursued an alternative solution?

Sorry for the delayed response @boedy. Shortly after I made the post, I moved to a different team and project and haven't been working on backup/restore since.

@panpan0000
Contributor

@codegold79 your scenario is a very good use case for Velero: building an active-standby cluster. So far, thanks to ExistingResourcePolicy, we have come closer to that goal!

@dymurray I think the other thing left is some kind of more flexible "override policy": so far only the image and storage class (sc) can be altered on restore (refer to https://velero.io/docs/main/restore-reference/#changing-poddeploymentstatefulsetdaemonsetreplicasetreplicationcontrollerjobcronjob-image-repositories), but changing the pod replicas to zero or a smaller value is what this scenario needs.

We can learn from Karmada, which provides a powerful "override policy".

@lizhifengones

#6867 (comment)

Restoring data in situ should be a very common scenario, shouldn't it? Would we consider it?
