Backup progress #20

Closed
kstewart opened this issue Aug 9, 2017 · 19 comments · Fixed by #2440
Labels: Enhancement/User (End-User Enhancement to Velero), Help wanted

Comments

@kstewart

kstewart commented Aug 9, 2017

Provide a way for users to see the progress of an in-flight backup. Some thoughts:

  • if backing up all namespaces, first get a list of all namespaces so we know how many there are to process
  • try to find out how many different types of GroupResources are being backed up, so we can record progress per namespace
  • periodically record status somewhere in backup.status
@ncdc
Contributor

ncdc commented Aug 10, 2017

Idea: store per-backup log file to object storage. Add ark backup logs command to retrieve it.

@ncdc ncdc self-assigned this Aug 11, 2017
@ncdc ncdc modified the milestone: v0.4.0 Sep 5, 2017
@ncdc ncdc removed their assignment Sep 6, 2017
@ncdc
Contributor

ncdc commented Sep 6, 2017

Note, backup logs are separate from progress. We'll be coming up with ways to track real-time progress as described above to close out this issue.

@skriss
Member

skriss commented Sep 6, 2017

For backups, we process first by resource then by namespace.
For restores, we process first by namespace then by resource.

I think for progress reporting, we should not be tightly coupled to the current mode/order of processing since that could change. We can count the number of resource-namespace combinations and then report as each pair gets completed.

My initial thought is something like:

// OperationProgress describes the overall progress for a backup
// or restore operation.
//
// BackupStatus and RestoreStatus will each have a field of this type.
type OperationProgress struct {
	// PercentComplete is a weighted sum of individual items' progress.
	// Each item has a weight of 1/len(Items), and a value equal to the
	// item's PercentComplete. This is provided as a useful summary
	// of the more detailed item progress information.
	PercentComplete int

	// Items is the individual units of work that make up the
	// operation (currently defined as namespace-resource pairs).
	Items []ItemProgress
}

// ItemProgress describes the current backup/restore progress for
// a namespace-resource pair.
type ItemProgress struct {
	Namespace string
	Resource  string

	// PercentComplete is computed simply as (# items processed)
	// divided by (total # items for namespace-resource pair)
	PercentComplete int
}
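
To illustrate the weighting described in the comments of the proposed struct, here is a minimal sketch of how the summary value could be derived. The helper name `overallPercent` is hypothetical and not part of the proposal; it assumes each item contributes equally (weight 1/len(items)) and uses integer division.

```go
package main

import "fmt"

// ItemProgress mirrors the struct proposed in this comment.
type ItemProgress struct {
	Namespace       string
	Resource        string
	PercentComplete int
}

// overallPercent is a hypothetical helper (not from the proposal). It returns
// the mean of the items' PercentComplete values, which is the same as giving
// each item a weight of 1/len(items).
func overallPercent(items []ItemProgress) int {
	if len(items) == 0 {
		return 0 // nothing to do; avoid dividing by zero
	}
	sum := 0
	for _, it := range items {
		sum += it.PercentComplete
	}
	return sum / len(items) // integer division truncates toward zero
}

func main() {
	items := []ItemProgress{
		{Namespace: "default", Resource: "pods", PercentComplete: 100},
		{Namespace: "default", Resource: "secrets", PercentComplete: 50},
		{Namespace: "kube-system", Resource: "pods", PercentComplete: 0},
	}
	fmt.Println(overallPercent(items)) // prints 50
}
```

Note that an equal weight per pair means a pair with 3 items counts as much as a pair with 3,000, which is one reason the summary can be misleading.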

cc @ncdc

@ncdc ncdc modified the milestones: v0.4.0, v0.5.0 Sep 7, 2017
@skriss
Member

skriss commented Sep 26, 2017

@ncdc let me know if you have any thoughts on this. We may have to wait until we decide on a revised backup/restore design before finalizing the implementation plan.

@ncdc
Contributor

ncdc commented Sep 26, 2017

@skriss it would be nice if operationProgress.percentComplete were more accurate, based on the percentages of the individual ItemProgresses.

When we do a backup, we know up front how many different types of resources we have, and how many items we have per resource type. If we store that information in backup.status, we can use it when restoring. Doing it that way, we could simply use % complete per resource type, and not worry about namespace.

@ncdc
Contributor

ncdc commented Sep 26, 2017

We could store a map of GroupResource string to

type ResourceStatus struct {
	Processed int
	Total     int
}

It could look like this:

resourceProgress:
  pods:
    processed: 15
    total: 100
  storageclasses.storage.k8s.io:
    processed: 0
    total: 3
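
As a rough sketch of how a consumer might read such a map, the following turns it into "x of y" lines plus an overall count. The `summarize` helper and its output format are assumptions for illustration only; nothing here reflects an actual Ark/Velero API.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceStatus mirrors the struct proposed in this comment.
type ResourceStatus struct {
	Processed int
	Total     int
}

// summarize is a hypothetical consumer-side helper. It renders one
// "x of y" line per GroupResource, followed by an overall line.
func summarize(progress map[string]ResourceStatus) []string {
	names := make([]string, 0, len(progress))
	for name := range progress {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output

	lines := make([]string, 0, len(names)+1)
	processed, total := 0, 0
	for _, name := range names {
		rs := progress[name]
		processed += rs.Processed
		total += rs.Total
		lines = append(lines, fmt.Sprintf("%s: %d of %d", name, rs.Processed, rs.Total))
	}
	return append(lines, fmt.Sprintf("overall: %d of %d", processed, total))
}

func main() {
	progress := map[string]ResourceStatus{
		"pods":                          {Processed: 15, Total: 100},
		"storageclasses.storage.k8s.io": {Processed: 0, Total: 3},
	}
	for _, line := range summarize(progress) {
		fmt.Println(line)
	}
}
```

Storing only the raw counts keeps the status small and leaves any percentage calculation to whoever displays it.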

@ncdc
Contributor

ncdc commented Sep 26, 2017

And if we need to precalculate and store percentages, we could, although it's probably easy enough not to and just let consumers do it.

Also, if we ever move to a work queue and we want to have multiple workers independently updating progress, we could get a lot of conflicts and retries trying to update a single map in a single Backup. Maybe json patch would help there...

@skriss
Member

skriss commented Sep 26, 2017

I'd be fine just doing it per-resource rather than also by namespace.

> When we do a backup, we know up front how many different types of resources we have, and how many items we have per resource type

How do we know total # items per resource type? We list/back them up per-namespace

> If we store that information in backup.status, we can use it when restoring

I like this idea in theory, but need to think about how it interacts with restore includes/excludes, label selectors

> Also, if we ever move to a work queue and we want to have multiple workers independently updating progress, we could get a lot of conflicts and retries trying to update a single map in a single Backup. Maybe json patch would help there...

WDYT about having a single goroutine responsible for updating progress, and the workers just report to that goroutine?
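
One way to picture the single-updater idea is the sketch below: workers send a message on a channel for each item they finish, and one goroutine owns the progress map. All names here (`track`, `update`) are hypothetical, and the "work" is simulated; a real implementation would also need to decide when to write the map back to backup.status.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceStatus mirrors the struct proposed earlier in the thread.
type ResourceStatus struct {
	Processed int
	Total     int
}

// update is a hypothetical message a worker sends after finishing one item.
type update struct {
	Resource string
}

// track simulates a backup: workers pull items from a queue and report each
// finished item to a single aggregator goroutine, which is the only goroutine
// that touches the progress map while work is in flight.
func track(totals map[string]int, workers int) map[string]ResourceStatus {
	if workers < 1 {
		workers = 1 // at least one worker is needed to drain the queue
	}
	progress := make(map[string]ResourceStatus, len(totals))
	for res, n := range totals {
		progress[res] = ResourceStatus{Total: n}
	}

	updates := make(chan update)
	done := make(chan struct{})
	go func() { // aggregator
		defer close(done)
		for u := range updates {
			rs := progress[u.Resource]
			rs.Processed++
			progress[u.Resource] = rs
			// A real implementation might periodically persist progress here.
		}
	}()

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for res := range jobs {
				updates <- update{Resource: res} // pretend the item was backed up
			}
		}()
	}
	for res, n := range totals {
		for i := 0; i < n; i++ {
			jobs <- res
		}
	}
	close(jobs)
	wg.Wait()      // all workers have sent their last update
	close(updates) // lets the aggregator loop end
	<-done         // after this, reading progress is safe
	return progress
}

func main() {
	p := track(map[string]int{"pods": 100, "storageclasses.storage.k8s.io": 3}, 4)
	fmt.Printf("pods: %d of %d\n", p["pods"].Processed, p["pods"].Total)
}
```

Because only the aggregator mutates the map, the workers never race on it, and there is a single place to batch writes to the Backup object.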

@rdodev

rdodev commented Feb 9, 2018

Regardless of implementation approach chosen, I would strongly suggest against using percentages to measure progress. I would recommend using "x of y" instead.

@ncdc
Contributor

ncdc commented Feb 9, 2018

Yeah, we won't. As I wrote above:

> And if we need to precalculate and store percentages, we could, although it's probably easy enough not to and just let consumers do it.

@ncdc
Contributor

ncdc commented Mar 9, 2018

cc @jbeda - another UX question

@jbeda

jbeda commented Mar 10, 2018

I think that status of a backup comes down to a set of questions:

  • Is this thing making progress?
  • How long is this going to take? Should I wait? Get coffee? Go home? Come back next week?

The problem with percentages is that it is hard to answer these if it will take a long time. If it takes 5 minutes to move one percent then you have to wait 5-10 minutes to get an idea. With that in mind, having the raw data helps. Doesn't have to be super accurate -- something like "tasks". Or some counter that moves regularly.

@nrb
Contributor

nrb commented Apr 17, 2018

> I like this idea in theory, but need to think about how it interacts with restore includes/excludes, label selectors

My initial impression here is that we'd store the total in the backup, but may have to recalculate when doing a selective restore. So a backup could hold 10 items, but we only want 6 in this restore. That does mean we're duplicating the logic, but currently I think that's not terrible.
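
A minimal sketch of that recalculation follows, assuming a deliberately simplified model in which a restore is filtered only by namespace. The `backedUpItem` type and `restoreTotals` helper are hypothetical; real restores would also need to account for excludes, resource filters, and label selectors.

```go
package main

import "fmt"

// backedUpItem is a hypothetical record of one item stored in a backup.
type backedUpItem struct {
	Namespace string
	Resource  string
}

// restoreTotals recounts items per resource for a restore limited to the
// given namespaces. An empty include set is treated as "restore everything".
func restoreTotals(items []backedUpItem, includeNamespaces map[string]bool) map[string]int {
	totals := make(map[string]int)
	for _, it := range items {
		if len(includeNamespaces) > 0 && !includeNamespaces[it.Namespace] {
			continue // filtered out of this restore
		}
		totals[it.Resource]++
	}
	return totals
}

func main() {
	var items []backedUpItem
	for i := 0; i < 6; i++ {
		items = append(items, backedUpItem{Namespace: "app", Resource: "pods"})
	}
	for i := 0; i < 4; i++ {
		items = append(items, backedUpItem{Namespace: "other", Resource: "pods"})
	}
	fmt.Println(restoreTotals(items, nil)["pods"])                          // prints 10
	fmt.Println(restoreTotals(items, map[string]bool{"app": true})["pods"]) // prints 6
}
```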

@jmontleon
Contributor

jmontleon commented Mar 20, 2019

This is pretty rough, although working pretty well.

It's using the JSON output in restic (built from master; JSON output was added after 0.9.4) to update the podvolumebackup CR with progress. It's also dumping the restic output to the pod logs, although that's an awful lot of output, so maybe that's not great.

openshift#4

I don't know if an approach like this, if cleaned up, would be of interest?

Unfortunately, at this time it looks like restic does not yet have output for restores, so a similar approach is not yet possible for restore.

@skriss
Member

skriss commented Mar 21, 2019

@jmontleon I really like the idea. Looking at it some more.

@jmontleon
Contributor

jmontleon commented Mar 21, 2019

@skriss if you'd like to see it in action, I have an image at docker.io/jmontleon/velero with the changes. It should work if you just update the image on the restic daemonset and velero deployment in a test environment and perform a backup with restic. It's updating at a 10-second interval, which could probably be changed to an optional parameter, so as long as the backup takes 30-60 seconds or so you should get an idea of what it looks like.

@skriss
Member

skriss commented Apr 2, 2019

@jmontleon sorry I've been slow in providing feedback here - hasn't fallen off my radar.

dymurray added a commit to dymurray/velero that referenced this issue Apr 16, 2019
Rebase against upstream again after SkipRestore feature accepted upstream
@skriss skriss self-assigned this Apr 6, 2020
@skriss skriss modified the milestones: v1.x, v1.4 Apr 27, 2020
@nrb nrb closed this as completed in #2440 May 7, 2020
@vishnuitta
Contributor

Thanks @skriss and @nrb for fixing this issue.
I have one question regarding backup progress when using plugins:
Does this PR also cover taking progress/feedback from the plugins that back up volume data?

@skriss
Member

skriss commented May 12, 2020

It does not - no information is collected from the plugins to inform progress. We could consider that, but it would be a separate enhancement.


9 participants