
Design doc for Velero Restore progress reporting #3016

Merged: 1 commit into vmware-tanzu:main on Nov 19, 2020

Conversation

pranavgaikwad (Contributor):

This is an initial design doc explaining how progress reporting can be implemented for Velero Restores.

@pranavgaikwad changed the title from "Propose progress reporting for Velero Restore" to "Design doc for Velero Restore progress reporting" on Oct 19, 2020

## High-Level Design

We propose to divide the restore process into two steps. The first step collects all the items to be restored from the backup tarball: it applies the label selector and include/exclude rules to the resources/items and stores them, preserving the priority order, in an in-memory data structure. The second step reads the collected items and restores them.
Contributor:

Nice, glad we think we can achieve the 2-step approach

[Truncated code excerpt: the design doc here quotes the current nested restore loop in restore.go.]

We propose to remove the call to `restoreItem()` in the innermost loop and instead store the item in a data structure. Once all the items are collected, we loop through the array of collected items and make a call to `restoreItem()` for each.
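A minimal, self-contained sketch of that two-pass shape (everything except the `restoreItem` name is illustrative, not Velero's actual code):

```go
package main

import "fmt"

// item is a hypothetical stand-in for an entry parsed from the backup
// tarball after label selectors and include/exclude rules are applied.
type item struct {
	groupResource string
	namespace     string
	name          string
}

// restoreItem stands in for Velero's restoreItem(); here it only logs
// the item it would restore.
func restoreItem(it item) error {
	fmt.Printf("restoring %s %s/%s\n", it.groupResource, it.namespace, it.name)
	return nil
}

func main() {
	// Step 1: collect items instead of restoring them in the innermost
	// loop; the real code preserves the resource priority order here.
	collected := []item{
		{"persistentvolumes", "", "pv-1"},
		{"pods", "ns-1", "pod-1"},
		{"pods", "ns-2", "pod-2"},
	}

	// Step 2: the total is known up front, so every restore can be
	// accompanied by an accurate progress update.
	for i, it := range collected {
		if err := restoreItem(it); err != nil {
			fmt.Printf("error restoring %s: %v\n", it.name, err)
			continue
		}
		fmt.Printf("progress: %d of %d items restored\n", i+1, len(collected))
	}
}
```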
Contributor:
At first I was concerned this would vastly slow down the restore process, but thinking it through, since we aren't making API calls to get these resources, this first loop should be fairly fast.

Contributor (author):

There's really a tradeoff between computation and memory here:

  1. The first step reads from the tarball and stores the resources in memory; the second step uses the resources directly from memory.
  2. The first step reads from the tarball but stores only the path to each resource in memory; the second step reads from the tarball again.

Both approaches are possible. It depends on which we want to weigh more.

Contributor:
I'm inclined to lean towards option 2 for now - loading all the JSON into memory right now isn't really feasible given our default memory constraints, especially on very large clusters.
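To illustrate what option 2 implies, here is a hedged sketch (all names hypothetical): the collection phase records only each item's file path inside the extracted tarball, and the restore phase decodes the JSON lazily, so memory scales with the number of items rather than their size.

```go
package restore

import (
	"encoding/json"
	"os"
)

// restoreItemRef is a hypothetical collected item under option 2: only
// the location within the extracted tarball is kept in memory.
type restoreItemRef struct {
	path            string // file path inside the extracted backup
	targetNamespace string
	name            string
}

// loadItem reads and decodes an item's JSON only when it is about to be
// restored, keeping the collection phase cheap on memory.
func loadItem(ref restoreItemRef) (map[string]interface{}, error) {
	data, err := os.ReadFile(ref.path)
	if err != nil {
		return nil, err
	}
	var obj map[string]interface{}
	if err := json.Unmarshal(data, &obj); err != nil {
		return nil, err
	}
	return obj, nil
}
```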

@jenting (Contributor) left a comment:

LGTM

```go
[...]
for namespace, items := range resourceList.ItemsByNamespace {
[...]
```
Contributor:

During this section of the loop in the current implementation, namespaces are handled differently than other resources. We don't explicitly restore them unless a resource is going to be restored into that namespace: they are created if needed, then the resources are restored into them. Are you intending to capture namespaces in the count of items to restore? Will they be counted and restored as items, or left out of the count and restored as they currently are during the main restore loop?
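For reference, a simplified sketch of the lazy behavior this comment describes (hypothetical names, not Velero's actual code): a namespace is only created when the first item targeting it is about to be restored, which is why it is unclear whether namespaces belong in the item count.

```go
package restore

import "fmt"

// item is a hypothetical collected item; namespace is empty for
// cluster-scoped resources.
type item struct {
	namespace string
	name      string
}

// restoreAll sketches the lazy behavior described above: a namespace is
// only created when the first item targeting it is about to be restored.
func restoreAll(collected []item) error {
	ensured := map[string]bool{}
	for _, it := range collected {
		if it.namespace != "" && !ensured[it.namespace] {
			// The real code checks for and creates the namespace via
			// the Kubernetes API before restoring into it.
			fmt.Printf("ensuring namespace %q exists\n", it.namespace)
			ensured[it.namespace] = true
		}
		fmt.Printf("restoring %q into %q\n", it.name, it.namespace)
	}
	return nil
}
```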

```go
func (ctx *context) getOrderedResourceCollection(...) {
	collectedResources := []restoreResource{}
	for _, resource := range getOrderedResources(...) {
	[...]
```
Contributor:

De-duplication of resources happens in this section of the loop. This will also need to be included in this function.
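A sketch of the kind of de-duplication this refers to, keyed on resource/namespace/name (illustrative only, not the actual restore.go logic):

```go
package restore

// itemKey identifies a collected item; a set of these keys can be used
// to skip duplicates during collection.
type itemKey struct {
	resource  string
	namespace string
	name      string
}

// dedupe drops items that have already been seen, preserving order, so
// the totals used for progress are not inflated by duplicates.
func dedupe(items []itemKey) []itemKey {
	seen := make(map[itemKey]bool, len(items))
	var out []itemKey
	for _, it := range items {
		if seen[it] {
			continue
		}
		seen[it] = true
		out = append(out, it)
	}
	return out
}
```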

We introduce two new structs to hold the collected items:

```go
type restoreResource struct {
	[...]
}
```
Contributor:

Small thing, but both `restoreResource` and `restoreItem` are used as function names in restore.go, so they will need to be updated or different names should be used here.
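One possible resolution of that naming clash, sketched with hypothetical field layouts (the renamed identifiers and fields here are illustrative, not necessarily what was merged):

```go
package restore

// Hypothetical rename to avoid clashing with the existing
// restoreResource()/restoreItem() functions in restore.go.
type restoreableResource struct {
	resource         string
	itemsByNamespace map[string][]restoreableItem
	totalItems       int
}

type restoreableItem struct {
	path            string // path to the item's JSON within the tarball
	targetNamespace string
	name            string
}
```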


As an alternative, we have considered an approach which doesn't divide the restore process into two steps.

With that approach, the total number of items will be read from the Backup CR. We will keep three counters: `totalItems`, `skippedItems`, and `restoredItems`.
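As a sketch, progress under this alternative would be derived from the three counters like so (hypothetical helper, not part of the design):

```go
package restore

// progress sketches the counter-based alternative: totalItems comes
// from the Backup CR's status, while skippedItems and restoredItems are
// incremented as the single restore loop runs.
type progress struct {
	totalItems    int
	skippedItems  int
	restoredItems int
}

// percentComplete treats skipped items as done, since they will never
// be restored; its accuracy depends on being able to count skips at
// all, which is the concern raised in the comment below.
func (p progress) percentComplete() float64 {
	if p.totalItems == 0 {
		return 0
	}
	return 100 * float64(p.restoredItems+p.skippedItems) / float64(p.totalItems)
}
```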
Contributor:

I don't know if this approach would work, as the restore skips over resources/namespaces without iterating through them to find out how many items are being skipped. We could know the number of items available in the backup to be restored, and how many have been restored so far, but without iterating, I don't know if there's a way to determine how many are being skipped.

@nrb (Contributor) left a comment:

This looks pretty reasonable to me for a first approach! I think longer term we'll want to flesh out a more complete directed graph structure, but for progress reporting this is a perfectly workable solution at the moment, and I'd be willing to proceed with it.

@carlisia (Contributor) left a comment:

LGTM. @zubron made a couple of observations worth noting.

@carlisia merged commit a757304 into vmware-tanzu:main on Nov 19, 2020