Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-on fixes for BIAv2 controller work #5971

Merged
merged 1 commit into from Mar 14, 2023

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Mar 8, 2023

Thank you for contributing to Velero!

Please add a summary of your change

Follow-on PR for BIAv2 controller implementation.
This includes the changes discussed but not implemented in #5849

This includes several items:

  • Terminology/naming updates for certain params and controllers
  • Additional comments for the item collector changes
  • Refactored the BackupItemOperationsMap code to encapsulate it and to facilitate sharing between controllers
  • Converge finalize and non-finalize backupItem workflow -- no more exceptions for hooks, PVs, Pods
  • Fail backupItem if a plugin starts an operation during finalize
  • Fixed a few codespell errors which were causing PR failures

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • [x ] Accepted the DCO. Commits without the DCO will delay acceptance.
  • [x ] Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@codecov-commenter
Copy link

Codecov Report

Merging #5971 (a4f0ea8) into main (7d8cb99) will decrease coverage by 0.06%.
The diff coverage is 49.58%.

❗ Current head a4f0ea8 differs from pull request most recent head 038f239. Consider uploading reports for the commit 038f239 to get more accurate results

@@            Coverage Diff             @@
##             main    #5971      +/-   ##
==========================================
- Coverage   39.75%   39.70%   -0.06%     
==========================================
  Files         256      256              
  Lines       23237    23135     -102     
==========================================
- Hits         9239     9185      -54     
+ Misses      13300    13260      -40     
+ Partials      698      690       -8     
Impacted Files Coverage Δ
pkg/backup/item_collector.go 54.01% <ø> (ø)
pkg/cmd/server/server.go 6.14% <0.00%> (ø)
pkg/cmd/util/output/backup_describer.go 0.00% <0.00%> (ø)
pkg/controller/download_request_controller.go 68.22% <ø> (ø)
...k/backupitemaction/v2/backup_item_action_client.go 0.00% <0.00%> (ø)
pkg/backup/backup.go 52.58% <16.66%> (ø)
...k/backupitemaction/v2/backup_item_action_server.go 31.29% <33.33%> (ø)
pkg/backup/item_backupper.go 79.25% <63.33%> (+0.50%) ⬆️
pkg/controller/backup_operations_controller.go 52.03% <68.57%> (ø)
pkg/controller/backup_controller.go 55.28% <100.00%> (+0.12%) ⬆️
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -334,7 +249,7 @@ func (c *asyncBackupOperationsReconciler) updateBackupAndOperationsJSON(
removeIfComplete = false
return errors.Wrap(err, "error uploading backup json")
}
if err := operations.uploadProgress(backupStore, backup.Name); err != nil {
if err := operations.UploadProgress(backupStore, backup.Name); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel there is still a contest to upload the same object:

  • itemOperationsMap.PutOperationsForBackup update the cache when there are changes
  • When describing the same backup, it calls ItemOperationsMap.UpdateForBackup. If there are changes, the later calls UploadProgress to upload the object
  • At the meantime, another turn of reconciler may see the operation comes to terminal phase and call the UploadProgress here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li It's not a problem, though. While progress may be slightly out-of-date due to the time uploading/downloading, the value will be correct as of the time the user issued the command. For 1.12 we want to add a new CRD to pull status directly from the velero pod, which will eliminate this bit, but that is too large a change to fit into the 1.11 time frame.

Copy link
Contributor

@Lyndon-Li Lyndon-Li Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sseago I think the contest will cause some problem though the opportunity is very small:

  1. In the backup_operation controller, a backup is seen to get the terminal phase, so operations.UploadProgress is called here at line 252, therefore, the operation persisted in the backup store is with a terminal status
  2. Meanwhile, the operation should be removed because it has completed, but as you can see the deletion happens at line 228 itemOperationsMap.DeleteOperationsForBackup, which is inside a defer function, or means, it is not in the same atomic operation with operations.UploadProgress
  3. Suppose right before itemOperationsMap.DeleteOperationsForBackup is called, a backup describe is triggered, so it calls ItemOperationsMap.UpdateForBackup, the later can still get the operation from the cache map because it is not deleted yet, but the one it gets is not in a terminal phase
  4. Then itemOperationsMap.UpdateForBackup calls operations.UploadProgress and upload a operation in the backup store with a unterminated status, or out-of-date operation
  5. Then itemOperationsMap.DeleteOperationsForBackup is called
  6. Finally, when users call backup describe again, nothing will be uploaded again because the operation has been deleted from the cache
  7. As a result, the operation in the backup store will be in a unterminated status forever

Copy link
Contributor

@Lyndon-Li Lyndon-Li Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the fix effort, it should be very small: look like we miss one PutOperationsForBackup between operations.UploadProgress and itemOperationsMap.DeleteOperationsForBackup when the backup is detected to get the terminal status. But this Put should be atomic with either Upload or Delete somehow. For sure, not all cases need this way, only the last upload/deletion does

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lyndon-Li Ahh, I see the issue. So while the map maintains pointers to OperationsForBackup structs, GetOperationsForBackup gets a deep copy (so we don't need to hold the lock for long) -- but this means that there's a separation between upload of a list of operation and the map getting updated. I think the solution here may be to create BackupItemOperationsMap.UploadProgressAndPutOperationsForBackup which grabs the lock, uploads progress, and updates the map atomically. Then I can make OperationsForBackup.UploadProgress an unexported func and only call it from the method above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Now when the controller uploads progress, we also update the operations map in the same function call, and it's done while holding the operations map lock, so the describe call into the map won't happen in-between these actions.

@Lyndon-Li
Copy link
Contributor

@sseago
Besides the following up changes in the current PR, we also suggest to change the name of below two phases like below:
BackupPhaseFinalizingAfterPluginOperations to BackupPhaseFinalizing
BackupPhaseFinalizingAfterPluginOperationsPartiallyFailed to BackupPhaseFinalizingPartiallyFailed

Reasons:

  1. Every backup will go to this "finalizing" phase no matter it has backupOperation or not, so it is more like a generic backup phase
  2. We even want this phase to be a generic backup phase, because we may need to do other tasks in the "finalizing" phase
  3. The controller is already "backup_finalizer_controller", and from the controller logic, it is query easy to make it a generic controller

@sseago
Copy link
Collaborator Author

sseago commented Mar 10, 2023

@Lyndon-Li I agree with your suggestions for the finalize phase name change. I'll fix that when I rebase this PR to deal with the new conflicts with main branch.

@sseago sseago force-pushed the biav2-follow-on branch 3 times, most recently from 7a86eac to 53ddf1b Compare March 13, 2023 14:29
Signed-off-by: Scott Seago <sseago@redhat.com>
@reasonerjt
Copy link
Contributor

LGTM
Thanks @sseago for the follow up!

@reasonerjt reasonerjt merged commit eeee4e0 into vmware-tanzu:main Mar 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants