Restore progress reporting #3125

Merged: 3 commits into vmware-tanzu:main from restore-progress, Mar 4, 2021

Conversation

@pranavgaikwad (Contributor) commented Dec 2, 2020

This PR introduces progress reporting for Velero Restore.

Closes #21

The changes in this PR follow the design doc: https://github.com/vmware-tanzu/velero/blob/main/design/restore-progress.md

Summary of changes:

  • pkg/apis/velero/v1/restore.go: Introduces a new field, RestoreProgress, in RestoreStatus
  • config/crd/bases/velero.io_restores.yaml & config/crd/crds/crds.go: Updates the Restore CRD to include the above field
  • pkg/cmd/server/server.go: Passes the velero client when creating a new Restore; the client is needed to send updates to RestoreStatus during an ongoing restore
  • pkg/restore/restore.go: Modifies the restore flow as per https://github.com/vmware-tanzu/velero/blob/main/design/restore-progress.md#modifications-to-restorego
  • pkg/cmd/util/output/restore_describer.go: Updates the velero describe command to show restore progress information
  • pkg/restore/restore_test.go: Adds the missing velero client in one of the Restore tests
  • pkg/util/kube/utils.go: Updates the EnsureNamespaceExistsAndIsReady method to return an additional boolean indicating whether a new namespace was created. If a new namespace was created, it is added to the list of restored items (see the sketch after this list)
  • pkg/util/kube/utils_test.go: Updates the test to verify the value of the new boolean
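For reference, a minimal sketch of the API and helper changes listed above. The RestoreProgress field names match the status output quoted later in this thread; the helper body is a simplified assumption, not the PR's exact code (the real helper also has to wait for a terminating namespace to go away), and the package layout here is for illustration only:

package sketch

import (
	"context"
	"time"

	corev1api "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// RestoreProgress mirrors the new field added to RestoreStatus; the JSON tags
// match the progress output shown in the Restore YAML later in this thread.
type RestoreProgress struct {
	TotalItems    int `json:"totalItems,omitempty"`
	ItemsRestored int `json:"itemsRestored,omitempty"`
}

// EnsureNamespaceExistsAndIsReady returns (ready, nsCreated, err). The second
// boolean reports whether this call actually created the namespace, so the
// caller can count it as a restored item exactly once.
func EnsureNamespaceExistsAndIsReady(namespace *corev1api.Namespace, client corev1client.NamespaceInterface, timeout time.Duration) (bool, bool, error) {
	ns, err := client.Get(context.TODO(), namespace.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		if _, err := client.Create(context.TODO(), namespace, metav1.CreateOptions{}); err != nil {
			return false, false, err
		}
		return true, true, nil
	}
	if err != nil {
		return false, false, err
	}
	// An existing namespace is only ready if it is not terminating.
	return ns.Status.Phase != corev1api.NamespaceTerminating, false, nil
}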

@pranavgaikwad force-pushed the restore-progress branch 4 times, most recently from 2d764c0 to 0aeba45 on December 16, 2020 23:07
@pranavgaikwad marked this pull request as ready for review December 16, 2020 23:09
@pranavgaikwad changed the title from "WIP - Restore progress reporting" to "Restore progress reporting" Dec 16, 2020
@jenting (Contributor) left a comment:

  1. Could we add restore progress reporting to velero restore describe, the same as the backup progress report in velero backup describe? Something like:

    Total items to be restored from:
    Items restored from:

  2. I created a backup resource; at backup reporting, the total items are 106. However, when I restore from the backup resource, the total items are 22. Why? 🤔

@pranavgaikwad (Contributor, Author) commented:

@jenting Thanks for pointing that out. I will take a look; my first guess is that those resources were skipped during restore. I will confirm, though.

@pranavgaikwad (Contributor, Author) commented Jan 22, 2021

@jenting In my environment, I didn't see a huge difference in the number of items backed up vs restored:

apiVersion: velero.io/v1
kind: Backup
metadata:
  [...]
  labels:
    velero.io/storage-location: default
  name: bkp-1
  namespace: velero
  [...]
spec:
  [...]
  includedNamespaces:
  - test-bkp
  - test-bkp-2
status:
  [...]
  phase: Completed
  progress:
    itemsBackedUp: 389
    totalItems: 389
  startTimestamp: "2021-01-22T19:13:06Z"
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: bkp-1-20210122141607
  namespace: velero
  [...]
spec:
  backupName: bkp-1
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  hooks: {}
  includedNamespaces:
  - '*'
status:
  completionTimestamp: "2021-01-22T19:16:29Z"
  phase: Completed
  progress:
    itemsRestored: 387
    totalItems: 387
  startTimestamp: "2021-01-22T19:16:08Z"

The difference of 2 items between Backup and Restore comes from the fact that this PR doesn't include namespaces in the count of restored items. There is a way to update the EnsureNamespaceExistsAndIsReady function to report the namespaces it creates so that they can be included in the count. I will look into that.

I'd be interested to know what kind of resources you tried to restore when you saw this mismatch in numbers.

@jenting (Contributor) commented Jan 26, 2021

I forgot the original test steps, but here is a new one.

Steps:

  1. Create a new namespace called nginx
  2. Install nginx: kubectl apply -f https://k8s.io/examples/application/deployment.yaml
  3. Back up namespace nginx: velero backup create backup-nginx --include-namespaces=nginx
  4. Describe the backup: velero backup describe backup-nginx --details
Total items to be backed up:  31
Items backed up:              31

Resource List:
  apps/v1/Deployment:
    - nginx/nginx-deployment
  apps/v1/ReplicaSet:
    - nginx/nginx-deployment-66b6c48dd5
  v1/Event:
    - nginx/nginx-deployment-66b6c48dd5-9l4n9.165dccc49aae8161
    - nginx/nginx-deployment-66b6c48dd5-9l4n9.165dccc4b5fe1f95
    - nginx/nginx-deployment-66b6c48dd5-9l4n9.165dccc4bac83caf
    - nginx/nginx-deployment-66b6c48dd5-9l4n9.165dccc4c2609ca9
    - nginx/nginx-deployment-66b6c48dd5-9l4n9.165dcccc5f50a571
    - nginx/nginx-deployment-66b6c48dd5-czxj5.165dccd10a7cede5
    - nginx/nginx-deployment-66b6c48dd5-czxj5.165dccd126e35f36
    - nginx/nginx-deployment-66b6c48dd5-czxj5.165dccd13023360e
    - nginx/nginx-deployment-66b6c48dd5-czxj5.165dccd13700b6cf
    - nginx/nginx-deployment-66b6c48dd5-kwwmn.165dccd1083e8e3f
    - nginx/nginx-deployment-66b6c48dd5-kwwmn.165dccd12510c30b
    - nginx/nginx-deployment-66b6c48dd5-kwwmn.165dccd130b5c888
    - nginx/nginx-deployment-66b6c48dd5-kwwmn.165dccd136ed2aa2
    - nginx/nginx-deployment-66b6c48dd5-ph7tc.165dccc49ab2bd34
    - nginx/nginx-deployment-66b6c48dd5-ph7tc.165dccc4b9fc9cf2
    - nginx/nginx-deployment-66b6c48dd5-ph7tc.165dccc4c1536718
    - nginx/nginx-deployment-66b6c48dd5-ph7tc.165dccc4c8c0fc9a
    - nginx/nginx-deployment-66b6c48dd5-ph7tc.165dcccc5ec2d1b3
    - nginx/nginx-deployment-66b6c48dd5.165dccc49802b570
    - nginx/nginx-deployment-66b6c48dd5.165dccc4992f8858
    - nginx/nginx-deployment-66b6c48dd5.165dccd107609102
    - nginx/nginx-deployment-66b6c48dd5.165dccd10865dc6d
    - nginx/nginx-deployment.165dccc4964f2922
    - nginx/nginx-deployment.165dccd10652f80c
  v1/Namespace:
    - nginx
  v1/Pod:
    - nginx/nginx-deployment-66b6c48dd5-czxj5
    - nginx/nginx-deployment-66b6c48dd5-kwwmn
  v1/Secret:
    - nginx/default-token-qwsgw
  v1/ServiceAccount:
    - nginx/default
  5. Delete namespace nginx
  6. Restore from backup backup-nginx: velero restore create restore-nginx --from-backup=backup-nginx --include-namespaces=nginx
  7. Check the restore status: kubectl get restores restore-nginx -o yaml
spec:
  backupName: backup-nginx
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  hooks: {}
  includedNamespaces:
  - nginx
status:
  completionTimestamp: "2021-01-26T14:06:10Z"
  phase: Completed
  progress:
    itemsRestored: 6
    totalItems: 6
  startTimestamp: "2021-01-26T14:06:09Z"

@pranavgaikwad (Contributor, Author) commented Jan 26, 2021

@jenting As events are excluded from the restore, the 24 instances of the v1/Event resource present in the backup were not restored (31 - 24 = 7), and the v1/Namespace is not yet counted by this PR, which leaves the 6 reported items. Please also note that I am working on a way to include namespaces in the count.

@jenting (Contributor) commented Jan 27, 2021

> @jenting As events are excluded from the restore, the 24 instances of the v1/Event resource present in the backup were not restored, and the v1/Namespace is not yet counted by this PR, which leaves the 6 reported items.

Cool, thank you for the explanation.

I'd love for velero restore describe to display the counts of total items to be restored and currently restored items.

@nrb added the kind/release-blocker (Must fix issues for the coming release) and P2 - Long-term important labels Feb 3, 2021
@nrb (Contributor) left a comment:

I think I understand how this is all working, but I'd like some more comments, as the restore code can be a little tricky :)

Overall I think it makes sense, though, and I appreciate your efforts here!

And finally, it needs a rebase before we'll be able to merge.

pkg/restore/restore.go:
return
}

func (ctx *restoreContext) getSelectedRestoreableItems(resource, targetNamespace, originalNamespace string, items []string) (res restoreableResource, warnings Result, errs Result) {
@nrb (Contributor) commented:

If I'm reading this code correctly, getOrderedResourceCollection is very similar to the previous restore loop in that it applies the selection criteria like namespace and resource includes/excludes. The change here is that this function, getSelectedRestoreableItems, then groups those resources by namespace and applies the Kubernetes label selection. From what I can tell, it replaces some of the functionality of the restoreResource function, which should be removed since it's no longer relevant with this PR.

Do I have that right? If so, please add a comment to clarify that's what this function is doing, so we can keep track of it.

Also, based on this code it looks like we're unmarshaling twice.

@pranavgaikwad (Contributor, Author) replied:

@nrb That is correct, we are unmarshalling twice. Unmarshalling is needed first to apply the selection criteria, and then again to actually restore the object. Another alternative was to keep the entire object in-memory until it's restored, which isn't very attractive.

@pranavgaikwad (Contributor, Author) replied:

And you're correct about the duplicate code from the restoreResource function. Thanks for pointing that out; I will clean it up.

@nrb (Contributor) replied:

> Another alternative was to keep the entire object in-memory until it's restored, which isn't very attractive.

Right, that is the trade we'd have to make. I honestly think we will end up doing this at some point, though, as we'd like to form a graph of objects. For now I think this will work and we can revisit the unmarshalling mechanisms in the future.
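To make the trade-off concrete, here is a minimal, self-contained sketch of the two-pass flow discussed in this thread. The names mirror the design doc's terminology (restoreableResource, the grouping done by getSelectedRestoreableItems), but the types and bodies are assumptions, not the PR's code:

package main

import "fmt"

// restoreableItem stands in for one selected item from the backup tarball.
type restoreableItem struct {
	targetNamespace, name string
}

// restoreableResource groups selected items by target namespace, mirroring
// what getSelectedRestoreableItems produces after applying include/exclude
// rules and label selectors (the first unmarshalling pass).
type restoreableResource struct {
	resource                 string
	selectedItemsByNamespace map[string][]restoreableItem
}

func main() {
	collection := []restoreableResource{{
		resource: "pods",
		selectedItemsByNamespace: map[string][]restoreableItem{
			"nginx": {{"nginx", "web-0"}, {"nginx", "web-1"}},
		},
	}}

	// Because selection already happened, the total is known before any
	// item is restored, which is what makes progress reporting accurate.
	total := 0
	for _, r := range collection {
		for _, items := range r.selectedItemsByNamespace {
			total += len(items)
		}
	}

	restored := 0
	for _, r := range collection {
		for _, items := range r.selectedItemsByNamespace {
			for _, item := range items {
				// The second unmarshalling pass would happen here, when
				// the item's JSON is read back and applied to the cluster.
				restored++
				fmt.Printf("progress: %d/%d (%s %s/%s)\n", restored, total, r.resource, item.targetNamespace, item.name)
			}
		}
	}
}

The point is that selection (and the first unmarshal) happens entirely before any item is restored, so totalItems is exact from the start instead of growing as the restore proceeds.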

@pranavgaikwad (Contributor, Author) commented:

@jenting This commit addresses your feedback about the velero restore describe command: 35f8e27
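For reference, the added describe output is expected to look something like this (labels assumed, modeled on the backup describe output quoted earlier; the counts are from the example restore above):

    Total items to be restored:  387
    Items restored:              387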

@pranavgaikwad (Contributor, Author) commented:

@nrb When you have a moment, could you take a look at the new code I added to include namespaces in the total count? That change updates the EnsureNamespaceExistsAndIsReady() util method to return whether it created a new namespace. If a new namespace is created, it is added to the list of restored items.

@nrb (Contributor) commented Feb 9, 2021

@pranavgaikwad Looks like there was a conflict with the colorized output PR that's affecting this.

@nrb added this to the v1.6.0 milestone Feb 10, 2021
@pranavgaikwad (Contributor, Author) commented:

> @pranavgaikwad Looks like there was a conflict with the colorized output PR that's affecting this.

Sure, will rebase again.

pkg/restore/restore.go, comment on lines +439 to +467:
if namespace != "" && !existingNamespaces.Has(selectedItem.targetNamespace) {
logger := ctx.log.WithField("namespace", namespace)
ns := getNamespace(logger, archive.GetItemFilePath(ctx.restoreDir, "namespaces", "", namespace), selectedItem.targetNamespace)
if _, nsCreated, err := kube.EnsureNamespaceExistsAndIsReady(ns, ctx.namespaceClient, ctx.resourceTerminatingTimeout); err != nil {
errs.AddVeleroError(err)
continue
} else {
// add the newly created namespace to the list of restored items
if nsCreated {
itemKey := velero.ResourceIdentifier{
GroupResource: kuberesource.Namespaces,
Namespace: ns.Namespace,
Name: ns.Name,
}
ctx.restoredItems[itemKey] = struct{}{}
}
}
@jenting (Contributor) commented:

This block of code looks similar to:

// If the namespace scoped resource should be restored, ensure that the
// namespace into which the resource is being restored exists.
// This is the *remapped* namespace that we are ensuring exists.
nsToEnsure := getNamespace(ctx.log, archive.GetItemFilePath(ctx.restoreDir, "namespaces", "", obj.GetNamespace()), namespace)
if _, nsCreated, err := kube.EnsureNamespaceExistsAndIsReady(nsToEnsure, ctx.namespaceClient, ctx.resourceTerminatingTimeout); err != nil {
	errs.AddVeleroError(err)
	return warnings, errs
} else {
	// add the newly created namespace to the list of restored items
	if nsCreated {
		itemKey := velero.ResourceIdentifier{
			GroupResource: kuberesource.Namespaces,
			Namespace:     nsToEnsure.Namespace,
			Name:          nsToEnsure.Name,
		}
		ctx.restoredItems[itemKey] = struct{}{}
	}
}

And this code logic would run twice. I'm not sure what the difference between these two is; could we merge them into one?

@pranavgaikwad (Contributor, Author) replied:

@jenting This duplication is required because restoreItem() can be called for additional items returned by plugins. The recursive call in that case triggers only the restoreItem() function, not the rest of the preceding logic.
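To illustrate why, a small self-contained sketch (assumed control flow and names, not the PR's code): an additional item returned by a RestoreItemAction plugin re-enters restoreItem() directly, bypassing the outer loop, so the namespace-ensure and once-only counting logic has to live inside restoreItem() as well.

package main

import "fmt"

// existingNamespaces tracks namespaces already ensured, so a namespace is
// counted as a restored item at most once no matter which path created it.
var existingNamespaces = map[string]bool{}

func ensureNamespace(ns string) (created bool) {
	if existingNamespaces[ns] {
		return false
	}
	existingNamespaces[ns] = true
	return true
}

// restoreItem ensures the target namespace itself because plugins can return
// additional items that re-enter restoreItem directly, bypassing the outer
// loop where namespaces are first ensured.
func restoreItem(namespace, name string, fromPlugin bool) {
	if ensureNamespace(namespace) {
		fmt.Printf("counted namespace %q as a restored item\n", namespace)
	}
	fmt.Printf("restored %s/%s\n", namespace, name)
	if !fromPlugin {
		// Hypothetical additional item returned by a RestoreItemAction
		// plugin, possibly in a namespace the outer loop never saw.
		restoreItem("other-ns", "tls-secret", true)
	}
}

func main() {
	restoreItem("nginx", "web-0", false)
}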

@jenting previously approved these changes Feb 18, 2021
@jenting (Contributor) left a comment:

LGTM

@jenting (Contributor) commented Feb 19, 2021

@pranavgaikwad I just realized there is a merge conflict. Please resolve it if you have time. 🙇

@carlisia (Contributor) commented:

I'll be reviewing this tomorrow.

@jenting (Contributor) left a comment:

LGTM, thank you @pranavgaikwad

@nrb (Contributor) left a comment:

🎉

v1.6.0 automation moved this from In progress to Reviewer approved Mar 4, 2021
@nrb merged commit c46fe71 into vmware-tanzu:main Mar 4, 2021
v1.6.0 automation moved this from Reviewer approved to Done Mar 4, 2021
dharmab pushed a commit to dharmab/velero that referenced this pull request May 25, 2021
* restore progress reporting

Signed-off-by: Pranav Gaikwad <pgaikwad@redhat.com>

* add restore statistics to describe restore

Signed-off-by: Pranav Gaikwad <pgaikwad@redhat.com>

* address feedback, include namespaces in the count

Signed-off-by: Pranav Gaikwad <pgaikwad@redhat.com>
@rishi-anand commented:

@pranavgaikwad, @jenting, @nrb, @carlisia I have seen this issue again in my cluster, which is on v1.6.0.

I restored into the same cluster with preserveNodePorts: true, and Velero threw an error about the nodePort already being in use. Later, I changed preserveNodePorts: false, but then the Restore CR says all items are restored while the phase has been InProgress for a very long time. I tried restarting the Velero pod, but that didn't fix the issue.

apiVersion: velero.io/v1
kind: Restore
metadata:
  creationTimestamp: "2021-06-08T01:57:44Z"
  name: backup-1
  namespace: velero
  resourceVersion: "13816289"
  selfLink: /apis/velero.io/v1/namespaces/velero/restores/backup-1
  uid: c8cc70b3-4b7a-4bde-aadb-da5856aeb259
spec:
  backupName: backup-1
  excludedResources:
  - nodes
  - events
  - events.events.k8s.io
  - backups.velero.io
  - restores.velero.io
  - resticrepositories.velero.io
  hooks: {}
  includeClusterResources: true
  preserveNodePorts: true
  restorePVs: true
status:
  phase: InProgress
  progress:
    itemsRestored: 1037
    totalItems: 1037
  startTimestamp: "2021-06-08T01:57:44Z"

ywk253100 pushed a commit to ywk253100/velero that referenced this pull request Jun 29, 2021
gyaozhou pushed a commit to gyaozhou/velero-read that referenced this pull request May 14, 2022
Labels: has-changelog, has-unit-tests, kind/release-blocker (Must fix issues for the coming release)
Projects: v1.6.0 (Done)
Development: Successfully merging this pull request may close these issues: Restore progress
5 participants