
Design proposal for two stage Backup API #1804

Closed
wants to merge 3 commits

Conversation

@satyamz commented Aug 26, 2019

This design proposal is the conclusion of the problems discussed in issue #1519.

Signed-off-by: Satyam Zode satyam.zode@mayadata.io

More discussion on this can be found in issue:
vmware-tanzu#1519

Signed-off-by: Satyam Zode <satyamzode@gmail.com>
@skriss (Member) left a comment:

@satyamz thanks a lot for putting this proposal together! I think it makes a lot of sense. I added a few initial comments and questions.

- Add `UploadSnapshot` method to `VolumeSnapshotter` interface.
- Generate GRPC code using codegen.
- Implement `UploadSnapshot` method in client and server.
- Call `UploadSnapshot` method as soon as `Post-Hook` is executed.
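For concreteness, the proposed interface change could be sketched as follows; the `UploadSnapshot` signature is hypothetical (the final API was never settled in this PR), and `fakeSnapshotter` exists only to make the sketch runnable:

```go
package main

import "fmt"

// VolumeSnapshotter is a trimmed-down sketch of Velero's volume
// snapshotter plugin interface; only the methods relevant to this
// proposal are shown, and the UploadSnapshot signature is hypothetical.
type VolumeSnapshotter interface {
	// CreateSnapshot takes a fast, local snapshot and returns its ID.
	CreateSnapshot(volumeID, volumeAZ string, tags map[string]string) (string, error)

	// UploadSnapshot (proposed) moves an existing local snapshot to
	// durable storage. Intended to run in the background.
	UploadSnapshot(snapshotID string) error

	// DeleteSnapshot removes a snapshot, e.g. for cleanup on failure.
	DeleteSnapshot(snapshotID string) error
}

// fakeSnapshotter is a toy implementation used only to illustrate the flow.
type fakeSnapshotter struct{}

func (f *fakeSnapshotter) CreateSnapshot(volumeID, volumeAZ string, tags map[string]string) (string, error) {
	return "snap-" + volumeID, nil
}
func (f *fakeSnapshotter) UploadSnapshot(snapshotID string) error { return nil }
func (f *fakeSnapshotter) DeleteSnapshot(snapshotID string) error { return nil }

func main() {
	var vs VolumeSnapshotter = &fakeSnapshotter{}
	id, _ := vs.CreateSnapshot("pv-1", "us-east-1a", nil)
	fmt.Println(id) // prints "snap-pv-1"
}
```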
Member:

I think we'll want to have the item_backupper call UploadSnapshot right after it calls CreateSnapshot, but it should do so in a goroutine. Then, we'll need to add a sync.WaitGroup (or similar) that it gets added to, and that the overall backup process waits on before completing the backup.
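That pattern could look roughly like this — a minimal, self-contained sketch with hypothetical names, not the actual item_backupper code:

```go
package main

import (
	"fmt"
	"sync"
)

// uploadSnapshot stands in for the plugin's proposed UploadSnapshot call.
func uploadSnapshot(snapshotID string) error {
	fmt.Println("uploading", snapshotID)
	return nil
}

func main() {
	var wg sync.WaitGroup

	for _, id := range []string{"snap-1", "snap-2"} {
		// CreateSnapshot has already returned for this item; start the
		// upload in the background so it does not block post hooks.
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := uploadSnapshot(id); err != nil {
				fmt.Println("upload failed:", err)
			}
		}(id)
	}

	// The overall backup process waits here before marking the backup
	// complete, so no upload is left dangling.
	wg.Wait()
	fmt.Println("backup complete")
}
```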

Author:

Sure! 👍
I will update the document for this.

design/two-stage-backup-api-proposal.md (comment on outdated diff; resolved)

```
Note: Backup progress should not be updated until `UploadSnapshot` has returned.
```
Member:

I think for now we probably don't want to add any new backup phases, but down the road we may want to add an "Uploading" (or similar) phase to cover when all the items have been backed up, and the snapshots have been taken, but we're waiting for the snapshot to finish uploading.

Author:

I think for now we probably don't want to add any new backup phases

Right. We should only use existing phases since we are not adding more to the current backup workflow.

The VolumeSnapshotter is designed around snapshots, which can be either cloud snapshots or local snapshots. As of now, most plugins use VolumeSnapshotLocations to decide the snapshot type.
We can add a new config parameter to the Velero command line to define the snapshot type; based on it, Velero will decide whether to invoke `UploadSnapshot` or not.

**Implementation plan for Velero:**
Member:

A few general questions/comments:

  • we probably want a timeout value for the UploadSnapshot calls. Do you think this should be configurable on the VolumeSnapshotLocation?
  • what should happen if 1+ UploadSnapshot calls fail? Does the backup become Failed or PartiallyFailed? Do we need some way to delete the local snapshot for cleanup? etc.

Author:

we probably want a timeout value for the UploadSnapshot calls. Do you think this should be configurable on the VolumeSnapshotLocation?

This is not clear to me. Can you be more specific about why we want a timeout? Is it a timeout, or a number of retries for the UploadSnapshot call?

what should happen if 1+ UploadSnapshot calls fail?

We should mark the backup as Failed, since the snapshot succeeded but the upload failed. If we are not able to reuse (re-upload) the snapshot, it doesn't make much sense to me to mark the entire backup as PartiallyFailed.
If we mark the backup as Failed in that case, then we must be smart enough to do the cleanup too; otherwise the snapshot will linger unnecessarily on the node.

Do we need some way to delete the local snapshot for cleanup?

Yes! For local snapshots we should provide a way to do cleanup. Cleanup can be triggered if UploadSnapshot returns an error, and it can run as a goroutine too (part of the same WaitGroup).

(Along similar lines, I was thinking of using a channel to pass a signal from UploadSnapshot to trigger a DeleteSnapshot call, but that would require changing the signature of UploadSnapshot, which wouldn't be good since there are third-party plugins that consume VolumeSnapshotter.)

Member:

This is not clear to me. Can you be more specific about why we want a timeout? Is it a timeout, or a number of retries for the UploadSnapshot call?

I was thinking we'd want the velero backup to terminate if the UploadSnapshot call exceeds a certain time limit, though maybe this isn't necessary. Open to thoughts from others.

Yes! For local snapshots we should provide a way to do cleanup. Cleanup can be triggered if UploadSnapshot returns an error, and it can run as a goroutine too (part of the same WaitGroup).

Does this just require calling the existing DeleteSnapshot(...) method in case of an error returned from UploadSnapshot?
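If so, a sketch of the upload goroutine with error-triggered cleanup might look like this; all names are hypothetical, and the real calls would go through the plugin client rather than a local struct:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// snapshotter is a toy stand-in for a VolumeSnapshotter plugin whose
// upload always fails, so the cleanup path is exercised.
type snapshotter struct{ deleted []string }

func (s *snapshotter) UploadSnapshot(id string) error { return errors.New("network down") }
func (s *snapshotter) DeleteSnapshot(id string) error {
	s.deleted = append(s.deleted, id)
	return nil
}

func main() {
	s := &snapshotter{}
	var wg sync.WaitGroup
	var mu sync.Mutex
	var backupErrs []error

	wg.Add(1)
	go func() {
		defer wg.Done()
		snapshotID := "snap-1"
		if err := s.UploadSnapshot(snapshotID); err != nil {
			// Upload failed: delete the now-orphaned local snapshot so
			// it does not linger on the node, then record the error
			// against the backup.
			_ = s.DeleteSnapshot(snapshotID)
			mu.Lock()
			backupErrs = append(backupErrs, err)
			mu.Unlock()
		}
	}()
	wg.Wait()

	fmt.Println("errors:", len(backupErrs), "cleaned up:", s.deleted)
	// prints "errors: 1 cleaned up: [snap-1]"
}
```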

Contributor:

So once the local snapshot has been taken, upload to stable storage is a second phase. At this point the backup is at risk, but you should be able to restore from it. If, for example, we're uploading to a cloud service from on-prem and the network goes down for a day, we shouldn't fail all of the local snapshots, but we should mark them as at-risk. When the network comes back, the uploads should finish and all of the snapshots are then fully protected. If you fail the backups because of a network partition, you'll lose the protection that you had and the ability to roll back to those points-in-time, even though we would have been able to.

I'd prefer to see some kind of a hook/call back/status where the plug-in reports on upload progress and Velero can track it as partially completed. Or, Velero needs to manage the whole upload part more closely with an upload manager that knows what has and has not happened.

BTW, in this proposal, how would we get parallelism on uploads?

Contributor:

@satyamz If you delete the local snapshot, how does restore work? Do we need a download method that pulls the snapshot out of the long term storage to use?

I think the restore side should be addressed in this proposal.

Author:

@dsu-igeek You have put very good points here.

@skriss I think we should either provide some configuration for cleanup too, or skip the cleanup part for now and make sure the backup is marked as PartiallyFailed if the upload fails in between. I need your help deciding this.

@nrb

If you delete the local snapshot, how does restore work? Do we need a download method that pulls the snapshot out of the long term storage to use?

I really haven't given this side of the problem much thought. However, your suggestion sounds right to me. I think we should first address the backup part; restore will come into the picture when we do cleanup of existing snapshots.

@skriss I think we should consider cleanup in the second phase. I see that a snapshot can still be used even after it has been uploaded to remote storage. This will require modifications on the restore side.

Contributor:

I think we should first address the backup part.

That's reasonable. I think we can certainly iterate on the backup design first.

My personal preference would be to not release the API until we're fairly confident in both backup and restore, to minimize the number of new releases plugins would have to make to conform.

@skriss skriss added the Area/Design Design Documents label Aug 26, 2019
Signed-off-by: Satyam Zode <satyamzode@gmail.com>
@satyamz (Author) commented Aug 27, 2019

Hi @skriss. I addressed the comments here and updated the proposal too. Let's settle on #1804 (comment), then I will incorporate those reviews too. Thank you 😄


In the above flow, the `CreateSnapshot` method both creates the snapshot and starts the upload. Hence, we split this into two APIs: `CreateSnapshot` and `UploadSnapshot`.

Proposed design: `Pre-hook` -> `CreateSnapshot` -> `Post-hook` -> `UploadSnapshot` (Background)
Member:

It would be nice to clarify that we'll actually trigger the UploadSnapshot call before calling post hooks, but it'll run in a separate goroutine that won't block running post hooks.

Author:

Sure!

Velero has a pretty neat feature for maintaining application-consistent backups/snapshots, called `Pre-Hooks` and `Post-Hooks`. However, there is a small change needed in Velero in order to achieve minimum downtime for an application, or more consistent application backups. The existing backup flow for a typical stateful application looks like: `Pod` -> `Pre-Hook` -> `PVC` -> `PV` -> `Post-Hook`. This is the mechanism followed by Velero block store plugins. In most of the plugins, the `CreateSnapshot` method of the `VolumeSnapshotter` interface implements the logic to upload the snapshot, and this is a blocking call for the `Post-Hook`. This adds time and keeps the application frozen for as long as the snapshot upload takes.


To solve the above problem, we can split the existing backup flow and introduce one more method, called `UploadSnapshot`, on the `VolumeSnapshotter` interface, which will execute once the `Post-Hook` has run and will keep running in the background. This will significantly reduce the time between `Pre-Hook` and `Post-Hook`, and it will not impact application consistency or downtime.
Member:

Similarly, let's clarify that UploadSnapshot is being called before post hook, but in a separate goroutine

- Add `UploadSnapshot` method to `VolumeSnapshotter` interface.
- Generate GRPC code using codegen.
- Implement `UploadSnapshot` method in client and server.
- Call the `UploadSnapshot` method from the `itemBackupper` as soon as the `CreateSnapshot` call returns. `UploadSnapshot` should run as a goroutine and should be tracked by the backup process.
Member:

It would be nice to detail exactly how the goroutines get tracked and waited for. If we're using a sync.WaitGroup, where is it instantiated, how does it get passed down to the item backupper, and where does it get waited on?
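One possible arrangement, sketched with hypothetical types loosely modeled on the backupper/itemBackupper split (not Velero's actual code): the WaitGroup is instantiated once per backup, handed to the item backupper as a field, and waited on before the backup finalizes.

```go
package main

import (
	"fmt"
	"sync"
)

// itemBackupper backs up individual items. It does not own the
// WaitGroup; it receives a pointer from the backupper that created it.
type itemBackupper struct {
	uploads  *sync.WaitGroup
	mu       sync.Mutex
	uploaded []string
}

func (ib *itemBackupper) backupItem(volumeID string) {
	snapshotID := "snap-" + volumeID // stand-in for CreateSnapshot
	ib.uploads.Add(1)
	go func() {
		defer ib.uploads.Done()
		// Stand-in for the plugin's UploadSnapshot call.
		ib.mu.Lock()
		ib.uploaded = append(ib.uploaded, snapshotID)
		ib.mu.Unlock()
	}()
}

// Backup instantiates the WaitGroup once per backup, hands it to the
// itemBackupper, and waits on it before finalizing the backup.
func Backup(volumes []string) []string {
	var uploads sync.WaitGroup
	ib := &itemBackupper{uploads: &uploads}
	for _, v := range volumes {
		ib.backupItem(v)
	}
	uploads.Wait() // completion blocks until every upload has finished
	return ib.uploaded
}

func main() {
	fmt.Println(len(Backup([]string{"pv-1", "pv-2"}))) // prints 2
}
```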

This change should be implemented in Velero as well as in the Velero plugins. However, plugins can continue to rely on the current implementation; the only way for a plugin to benefit from this approach is to implement the `UploadSnapshot` method.

The VolumeSnapshotter is designed around snapshots, which can be either cloud snapshots or local snapshots. As of now, most plugins use VolumeSnapshotLocations to decide the snapshot type.
We can add a new config parameter to the Velero command line to define the snapshot type; based on it, Velero will decide whether to invoke `UploadSnapshot` or not.
Member:

I'm not entirely clear on what you're proposing here -- where does this config parameter go? Is it part of the VolumeSnapshotLocation definition? Can you provide some more detail?

Contributor:

Is there ever a case where it would be desirable to call CreateSnapshot but not UploadSnapshot?

And, for thoroughness, is there a possibility for a need of the reverse, needing to call UploadSnapshot but not CreateSnapshot?

Contributor:

@carlisia I would think AWS, GCP, and Azure wouldn't use UploadSnapshot, since it's a combined operation in all of their create snapshot APIs.

I'm not sure about your second question. It would have to be some snapshot that already exists somehow, I would think.

Contributor:

This parameter is to decide whether Velero should execute uploadSnapshot or not.
There are a few storage providers, like OpenEBS and Portworx, that support both local snapshots and cloud-based snapshots. With the uploadSnapshot API, Velero will enable uploading of the snapshot (the cloud-based snapshot).

What if the user wants a local snapshot only?

I agree that this can also be achieved via the VolumeSnapshotLocation, where the plugin decides whether it is configured to upload the snapshot or not. But since Velero is providing the uploadSnapshot API, it would be better to have this control in Velero itself. This parameter will go in the BackupSpec.
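If the toggle does land in BackupSpec, it might look roughly like this; the field name `SnapshotUpload` and the defaulting behavior are assumptions for illustration, not an agreed API:

```go
package main

import "fmt"

// BackupSpec sketch: only the proposed field is shown. The name
// SnapshotUpload is hypothetical.
type BackupSpec struct {
	// SnapshotUpload controls whether Velero calls UploadSnapshot after
	// CreateSnapshot. A pointer lets "unset" default to true, which
	// preserves today's behavior of durable (uploaded) snapshots;
	// an explicit false keeps the snapshot local-only.
	SnapshotUpload *bool `json:"snapshotUpload,omitempty"`
}

// shouldUpload applies the default when the field is unset.
func shouldUpload(spec BackupSpec) bool {
	if spec.SnapshotUpload == nil {
		return true
	}
	return *spec.SnapshotUpload
}

func main() {
	f := false
	fmt.Println(shouldUpload(BackupSpec{}))                   // prints "true" (default)
	fmt.Println(shouldUpload(BackupSpec{SnapshotUpload: &f})) // prints "false" (local-only)
}
```

A pointer-to-bool is used so a backup that omits the field keeps the existing upload behavior, while plugins that only support combined create-and-upload (e.g. the cloud providers discussed above) can simply ignore it.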

And, for thoroughness, is there a possibility for a need of the reverse, needing to call UploadSnapshot but not CreateSnapshot?

For this, the snapshot would have to already exist. But it's up to the plugin, since uploadSnapshot consumes a snapshotID to upload the snapshot.


## Alternatives Considered

Let plugin developers implement their own upload mechanism on the storage orchestrator side. Once the plugin's `CreateSnapshot` is called, the plugin would take a snapshot of the volume, trigger its own upload method in the background, and return the `CreateSnapshot` call as a success to Velero. This is not as per the Velero norm and would not be implemented as part of Velero. In general, this approach would not benefit the larger part of Velero users who are using the restic plugin. Hence, it would be good to have an implementation that triggers the upload mechanism from Velero.
Contributor:

This is the Velero norm. AWS, GCP, and Azure plugins all assume that the uploads happen in the background, on the storage provider's system. I'm not sure I agree that the majority of users are using restic, but it's true that restic means we have 2 different paths for snapshotting.

@satyamz (Author) commented Nov 13, 2019

I am not sure I have more time to devote to this effort. Hi @mynktl, see if you can find time to do this. Feel free to open the PR.

@satyamz satyamz closed this Nov 13, 2019