
Make the BIA in CSI plugin adopt v2. #177

Merged
1 commit merged on Jun 8, 2023

Conversation

blackpiglet
Contributor

Move the polling VS ReadyToUse logic into Async operation.
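
To make the intent concrete, here is a minimal, self-contained sketch of the idea: the server stops blocking on a poll, and the plugin's async-operation progress check reports whether the VolumeSnapshot is ReadyToUse. The types below are simplified stand-ins for illustration, not the actual Velero BIA v2 interfaces.

```go
// Simplified stand-in types; not the real Velero BIA v2 interfaces.
package main

import (
	"fmt"
	"strings"
)

// OperationProgress is a stand-in for the progress report a BIA v2 plugin
// returns to the Velero server for an async operation.
type OperationProgress struct {
	Completed   bool
	Description string
}

// snapshotStatus is a stand-in for looking up VolumeSnapshot.Status.ReadyToUse.
type snapshotStatus func(namespace, name string) (bool, error)

// Progress parses a hypothetical "namespace/name" operationID and reports
// whether the VolumeSnapshot is ReadyToUse. The server decides how often to
// call this, so the plugin no longer needs a blocking poll.
func Progress(operationID string, status snapshotStatus) (OperationProgress, error) {
	parts := strings.SplitN(operationID, "/", 2)
	if len(parts) != 2 {
		return OperationProgress{}, fmt.Errorf("invalid operationID %q", operationID)
	}
	ready, err := status(parts[0], parts[1])
	if err != nil {
		return OperationProgress{}, err
	}
	return OperationProgress{
		Completed:   ready,
		Description: fmt.Sprintf("VolumeSnapshot %s ReadyToUse=%t", operationID, ready),
	}, nil
}

func main() {
	fake := func(namespace, name string) (bool, error) { return true, nil }
	p, _ := Progress("app-ns/snap-1", fake)
	fmt.Printf("completed=%v: %s\n", p.Completed, p.Description)
}
```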

@blackpiglet blackpiglet self-assigned this May 25, 2023
@blackpiglet blackpiglet force-pushed the 6099-fix branch 3 times, most recently from bad6dcc to 3e62746 on May 25, 2023 at 14:38
@blackpiglet
Contributor Author

blackpiglet commented May 26, 2023

@sseago
I ran into a problem while removing the polling logic from the Velero server code.
The Velero server's backup controller polls for the VolumeSnapshot status to turn ReadyToUse, and it also deletes the VolumeSnapshot once it becomes ReadyToUse.
https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_controller.go#L994-L1000

I'm not sure whether the VolumeSnapshot deletion logic should also move into BIA v2.
The existing deletion logic is bounded by a limit on concurrent connections to the kube-apiserver, but there seems to be no way to limit the number of concurrent BIA v2 actions, so deleting VolumeSnapshots from BIA v2 could overload the kube-apiserver when a backup contains a large number of VolumeSnapshots.

What's your opinion? Is it possible to limit the number of concurrent BIA v2 actions?
If not, where should the VolumeSnapshot deletion logic go? backup_finalizer_controller.go looks like one option to me.
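
For reference, a rough, self-contained sketch of the server-side flow described above (poll until ReadyToUse, then delete). The snapshotClient interface and waitAndDelete helper are hypothetical stand-ins, not the actual backup controller code.

```go
// Hypothetical stand-in for the server-side "poll, then delete" flow; not
// the actual backup controller code.
package main

import (
	"errors"
	"fmt"
	"time"
)

type snapshotClient interface {
	ReadyToUse(namespace, name string) (bool, error)
	Delete(namespace, name string) error
}

// waitAndDelete polls one VolumeSnapshot until it is ReadyToUse, then deletes
// it. Running this for many snapshots at once is where the concern about
// concurrent connections to the kube-apiserver comes from.
func waitAndDelete(c snapshotClient, ns, name string, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ready, err := c.ReadyToUse(ns, name)
		if err != nil {
			return err
		}
		if ready {
			return c.Delete(ns, name)
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for VolumeSnapshot to become ReadyToUse")
		}
		time.Sleep(interval)
	}
}

type fakeClient struct{ calls int }

func (f *fakeClient) ReadyToUse(ns, name string) (bool, error) {
	f.calls++
	return f.calls > 2, nil
}

func (f *fakeClient) Delete(ns, name string) error {
	fmt.Printf("deleted VolumeSnapshot %s/%s\n", ns, name)
	return nil
}

func main() {
	_ = waitAndDelete(&fakeClient{}, "app-ns", "snap-1", 10*time.Millisecond, time.Second)
}
```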

@reasonerjt @Lyndon-Li

@blackpiglet
Contributor Author

> I ran into a problem while removing the polling logic from the Velero server code.
> The Velero server's backup controller polls for the VolumeSnapshot status to turn ReadyToUse, and it also deletes the VolumeSnapshot once it becomes ReadyToUse.
> https://github.com/vmware-tanzu/velero/blob/main/pkg/controller/backup_controller.go#L994-L1000 ...

After discussing with @Lyndon-Li, we think it should be safe to move the VolumeSnapshot deletion logic into BIA v2.
The way BIA v2 actions are driven should be enough to avoid a large number of concurrent connections to the kube-apiserver.

@sseago
Collaborator

sseago commented May 26, 2023

@blackpiglet when you say "move VolumeSnapshot deletion logic into BIA v2", I'm not sure where you mean to put it. We don't currently have a plugin API call that executes during Finalize. We could, however, move this logic from the backup controller reconcile into the finalize controller reconcile: after persisting itemsToUpdate for the backup, delete the VolumeSnapshots.
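
A minimal sketch of that placement, using hypothetical stand-in types rather than Velero's actual finalize controller:

```go
// Hypothetical stand-in types; not Velero's actual finalize controller.
package main

import "fmt"

type snapshotRef struct{ Namespace, Name string }

type backupStore interface {
	PersistUpdatedItems(backupName string) error
}

type snapshotDeleter interface {
	Delete(ref snapshotRef) error
}

// finalizeReconcile persists the backup's updated items first (so the tarball
// keeps the final, ReadyToUse state of each VolumeSnapshot), then deletes the
// VolumeSnapshot objects from the cluster.
func finalizeReconcile(backupName string, store backupStore, deleter snapshotDeleter, snaps []snapshotRef) error {
	if err := store.PersistUpdatedItems(backupName); err != nil {
		return err
	}
	for _, s := range snaps {
		if err := deleter.Delete(s); err != nil {
			return fmt.Errorf("deleting VolumeSnapshot %s/%s: %w", s.Namespace, s.Name, err)
		}
	}
	return nil
}

type memStore struct{}

func (memStore) PersistUpdatedItems(backupName string) error { return nil }

type printDeleter struct{}

func (printDeleter) Delete(ref snapshotRef) error {
	fmt.Printf("deleted VolumeSnapshot %s/%s\n", ref.Namespace, ref.Name)
	return nil
}

func main() {
	snaps := []snapshotRef{{Namespace: "app-ns", Name: "snap-1"}}
	_ = finalizeReconcile("backup-1", memStore{}, printDeleter{}, snaps)
}
```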

@sseago
Collaborator

sseago commented May 26, 2023

@blackpiglet The backup operations controller isn't a good place for it, since that controller won't act on a backup if all async operations are already completed when it leaves the InProgress phase. If all async operations are done when the backup controller reconcile completes (i.e. upon uploading the tarball, logs, etc.), we go straight to finalize. Finalize is never skipped, so that's where we want this. We had to do exactly this in our OADP Velero fork to deal with the OADP data mover, since we needed the VS to not be deleted while data mover plugin operations were running -- see this commit on our fork:
openshift/velero@8e52bca

@Lyndon-Li
Contributor

Lyndon-Li commented May 26, 2023

@sseago
We can delete the VS in the finalize phase, e.g. inside the FinalizeBackup function.
On the other hand, since the VS is in the itemsToUpdate list, BIA Execute will be called against the VS object a second time, which is also an opportunity to delete the VS.

Ideally, we take the second approach: the snapshot is created by the plugin, so it makes more sense for the plugin to delete it as well. Moreover, we want to remove as much CSI-related logic from Velero as possible.
Let us know whether that works with the OADP data mover.
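
A minimal sketch of that second option, assuming a hypothetical phase check and stand-in types; the real plugin would inspect the Backup object passed to Execute, and the backup-time snapshot handling is elided.

```go
// Hypothetical phase check and stand-in types; not the actual CSI plugin code.
package main

import "fmt"

type backupPhase string

const (
	phaseInProgress backupPhase = "InProgress"
	phaseFinalizing backupPhase = "Finalizing"
)

type snapshotDeleter interface {
	DeleteSnapshot(namespace, name string) error
}

// executeOnVolumeSnapshot stands in for the plugin's Execute being invoked on
// a VolumeSnapshot item. During the backup it does its normal work (elided);
// when invoked again during finalization, because the item sits in
// itemsToUpdate, it deletes the now-captured VolumeSnapshot.
func executeOnVolumeSnapshot(phase backupPhase, d snapshotDeleter, ns, name string) error {
	if phase == phaseFinalizing {
		return d.DeleteSnapshot(ns, name)
	}
	// Backup-time behavior (snapshot creation/tracking) elided.
	return nil
}

type printDeleter struct{}

func (printDeleter) DeleteSnapshot(ns, name string) error {
	fmt.Printf("deleted VolumeSnapshot %s/%s\n", ns, name)
	return nil
}

func main() {
	_ = executeOnVolumeSnapshot(phaseInProgress, printDeleter{}, "app-ns", "snap-1")
	_ = executeOnVolumeSnapshot(phaseFinalizing, printDeleter{}, "app-ns", "snap-1")
}
```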

@blackpiglet blackpiglet force-pushed the 6099-fix branch 5 times, most recently from 08f26d3 to d9f7007 on May 31, 2023 at 08:44
@blackpiglet blackpiglet marked this pull request as ready for review May 31, 2023 09:13
@Lyndon-Li
Contributor

Do we have a Velero counterpart for this PR, since we have moved some code out of Velero?

@blackpiglet blackpiglet force-pushed the 6099-fix branch 7 times, most recently from db8852f to 1e6d63e on June 1, 2023 at 14:02
@blackpiglet
Contributor Author

blackpiglet commented Jun 1, 2023

> Do we have a Velero counterpart for this PR, since we have moved some code out of Velero?

Yes. It's PR vmware-tanzu/velero#6327

Move the polling VS ReadyToUse logic into Async operation.
Read the Velero server's ResourceTimeout setting from the backup annotation.
Set the operation started time in the operationID.
Also put VSC status polling into Async plugin.
Add unit test.
Patch annotations into VSC to make it work in restore.

Signed-off-by: Xun Jiang <jxun@vmware.com>
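
To illustrate two items from the commit message, here is a hedged, self-contained sketch of encoding the operation start time into the operationID and reading a resource timeout from a backup annotation. The operationID layout and the annotation key are illustrative guesses, not the plugin's actual format.

```go
// Illustrative operationID layout and annotation key; not the plugin's actual format.
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// buildOperationID packs namespace, name, and the start time into one string
// so Progress can later recover when the operation began.
func buildOperationID(ns, name string, started time.Time) string {
	return fmt.Sprintf("%s/%s/%d", ns, name, started.Unix())
}

// parseOperationID recovers the pieces packed by buildOperationID.
func parseOperationID(id string) (string, string, time.Time, error) {
	parts := strings.Split(id, "/")
	if len(parts) != 3 {
		return "", "", time.Time{}, fmt.Errorf("invalid operationID %q", id)
	}
	sec, err := strconv.ParseInt(parts[2], 10, 64)
	if err != nil {
		return "", "", time.Time{}, err
	}
	return parts[0], parts[1], time.Unix(sec, 0), nil
}

// resourceTimeoutFromAnnotations reads a timeout from backup annotations,
// falling back to a default. The annotation key here is a placeholder.
func resourceTimeoutFromAnnotations(annotations map[string]string, fallback time.Duration) time.Duration {
	if v, ok := annotations["velero.io/resource-timeout"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return fallback
}

func main() {
	id := buildOperationID("app-ns", "snap-1", time.Now())
	ns, name, started, _ := parseOperationID(id)
	fmt.Println(ns, name, started.Format(time.RFC3339))
	fmt.Println(resourceTimeoutFromAnnotations(map[string]string{"velero.io/resource-timeout": "10m"}, 5*time.Minute))
}
```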
@shubham-pampattiwar
Collaborator

@blackpiglet @Lyndon-Li Will there be a separate PR for the data mover changes related to the CSI plugin?

@blackpiglet
Contributor Author

blackpiglet commented Jun 6, 2023

> @blackpiglet @Lyndon-Li Will there be a separate PR for the data mover changes related to the CSI plugin?

@shubham-pampattiwar Sure, I'm working on it. The BIA part is also in progress; I will post a PR today, and most of the remaining work should be done this week.

Could you take another look at this PR? The data mover work depends on this change.
