
Pod volume restores always run after all data mover restores complete for the same pod #6668

Closed · Lyndon-Li opened this issue Aug 17, 2023 · 5 comments · Fixed by #6946

@Lyndon-Li (Contributor) commented Aug 17, 2023

As in the scenario of #6658, when the restore of the same pod involves both PVR (pod volume restore) and data mover volumes, the following behaviors always occur:

  1. PVRs won't start until all the data mover restores complete
  2. If one or more data mover restores fail or are cancelled, none of the PVRs will start

The reason is:

  • PVRs won't start until the init-container of the pod is running
  • The init-container won't run until the pod's storage is ready, in other words, until all the PVCs are bound (see the sketch below)
  • The PVCs to be restored by data movers won't be bound until the corresponding data mover restore is ready
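
For illustration, here is a minimal Go sketch (assuming a client-go clientset and the restored pod object; this is not Velero's actual code) of the condition that stalls the chain:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// allPVCsBound reports whether every PVC-backed volume of the pod is Bound.
// A single Pending PVC (e.g. one still waiting on its data mover restore)
// makes this false for the whole pod, so the pod is not scheduled, its
// init-container never runs, and no PVR can start.
func allPVCsBound(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) (bool, error) {
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim == nil {
			continue // only PVC-backed volumes gate scheduling/attachment
		}
		pvc, err := c.CoreV1().PersistentVolumeClaims(pod.Namespace).Get(
			ctx, vol.PersistentVolumeClaim.ClaimName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pvc.Status.Phase != corev1.ClaimBound {
			return false, nil
		}
	}
	return true, nil
}
```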
@Lyndon-Li (Contributor, Author):

Why PVR ensures the init-container is running:

  • This is an indirect way to check that all the volumes are provisioned; if volumes are not provisioned, PVRs will fail
  • PVRs need to ensure the pod's application containers are not running at any time during the restore; otherwise, the restore and the application may perform conflicting operations on the same volumes

We need to review and refactor this init-container usage to fix the problem.
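
To make the indirect check concrete, here is a minimal sketch of what "the init-container is running" tells the PVR, reusing the corev1 types from the sketch above (the helper container name is an assumption; Velero's real implementation may differ):

```go
// restoreHelperRunning reports whether the restore helper init-container is
// currently running. Running implies (a) the pod was scheduled and all its
// volumes were provisioned and attached, and (b) the application containers
// have not started yet, so a PVR cannot conflict with the application.
func restoreHelperRunning(pod *corev1.Pod, helperName string) bool {
	for _, s := range pod.Status.InitContainerStatuses {
		if s.Name == helperName {
			return s.State.Running != nil
		}
	}
	return false
}
```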

@Lyndon-Li self-assigned this Aug 17, 2023
@Lyndon-Li added the 1.13-candidate label (issue/pr that should be considered to target v1.13 minor release) Aug 17, 2023
@reasonerjt (Contributor):

@Lyndon-Li We could probably customize the behavior of the init-container, such as passing the volume list as a parameter to the process?
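
A hypothetical sketch of that idea (the function, the helper container name, and the --wait-volumes flag are made up here to illustrate the suggestion; none of this exists in Velero, and it additionally needs the standard "strings" import):

```go
// buildRestoreWaitContainer sketches the suggestion: pass the fs-backup
// volume list to the helper so it only waits on those volumes, and mount
// only those volumes into it. The "--wait-volumes" flag is hypothetical.
func buildRestoreWaitContainer(image string, fsBackupVolumes []string, mounts []corev1.VolumeMount) corev1.Container {
	return corev1.Container{
		Name:         "restore-wait", // assumed helper container name
		Image:        image,
		Args:         []string{"--wait-volumes", strings.Join(fsBackupVolumes, ",")},
		VolumeMounts: mounts, // only fs-backup volumes, not data mover PVCs
	}
}
```

As the investigation below shows, this alone would not be enough: the pod's volumes are attached as a consequence of pod-level scheduling, not per init-container mount.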

@reasonerjt added the Reviewed Q3 2023 and Needs triage (We need discussion to understand problem and decide the priority) labels Aug 30, 2023
@reasonerjt removed the Needs triage and 1.13-candidate labels Sep 20, 2023
@reasonerjt added this to the v1.13 milestone Sep 20, 2023
@Lyndon-Li (Contributor, Author) commented Sep 28, 2023

After further investigation, we found:

  • A pod's volumes are not attached until all of its PVCs reach Bound status
  • However, the PVCs restored by the data mover won't turn Bound until the corresponding DataDownload CRs (DDCRs) complete
  • As a result, none of the volumes are attached until the data mover finishes, even though the PVCs not handled by the data mover are already Bound
  • Since the volumes are not attached, PVRs have no way to start: the volumes are not present on the host at all (see the sketch below)

Therefore, given the current behavior of PVRs (they need to access the restored workload pods' volumes on the host) and the current behavior of Kubernetes' provisioner, there is no way to solve this problem.
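
A sketch of the scheduling gate behind the first bullet above (again reusing the corev1 types from the earlier sketch): the scheduler keeps a pod Pending while any of its PVCs is unbound, and volumes are only attached for scheduled pods, so even the already-Bound PVCs' volumes never reach the host.

```go
// podScheduled reports whether the pod has been assigned to a node. A pod
// with any unbound PVC stays Pending with PodScheduled=False, so none of
// its volumes, Bound or not, get attached anywhere.
func podScheduled(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```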

@Lyndon-Li (Contributor, Author):

Let's add the scenarios to the fs-backup document.

@sseago (Collaborator) commented Sep 29, 2023

@Lyndon-Li I think that is fine. We should document it, but it's not a big deal. I suspect this is a bit of an edge case. Most "mixed datamover/fs-backup" scenarios will probably be some pods with datamover volumes, and others with fs backup volumes.
