DataMover: Restore is getting stuck in WaitingForPluginOperations phase for the StatefulSet application. #6813
Comments
There are 3 DDCRs (DataDownload CRs). The problem was with the third one, cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn. Here is the log:
The prepare step had not completed when the 30-minute timeout expired. |
@PrasadJoshi12
|
@Lyndon-Li I tried it again and this time it worked fine. I will give it a try again tomorrow. I have seen this issue happening multiple times. |
I double-checked the code. Most likely, when the problem happened, the restored pod was not scheduled successfully within 30 minutes, so Velero had nowhere to run the data mover. |
@Lyndon-Li I tried it again on my cluster. This time the DataDownload is stuck for the first PVC. It looks like an intermittent issue.
The test-restore-65znj pod has been running for the last 15 minutes.
The pod YAML is attached below.
The workload pod is still in Pending status.
The PVC is looking for a PV with the label velero.io/dynamic-pv-restore: test-oadp-223.cassandra-data-cassandra-0.pxc7r
There is no PV in the cluster with that label.
|
@PrasadJoshi12
Hopefully you are keeping the environment. If so, please collect the same pod and PVC info for the other two DDCRs as you've done for |
These logs were collected after the DataDownload had been running for almost 15 minutes. After 30 minutes, both DDCRs failed.
Re-collected it again. |
@PrasadJoshi12 Are you in the Velero Slack channel? Could we discuss this there? |
@Lyndon-Li I'm not in the Velero Slack channel. Can you please send the link? |
@PrasadJoshi12 |
@Lyndon-Li I'm not able to join the Kubernetes Slack.
|
@PrasadJoshi12 Use https://communityinviter.com/apps/kubernetes/community to get an invite to Slack. |
Thank you @shawn-hurley |
Attached are the restored PVC YAMLs. |
The issue occurred when the StorageClass (SC) volume binding mode was WaitForFirstConsumer. I tested again with the SC binding mode set to Immediate, and the restore completed successfully. Thank you so much @Lyndon-Li for helping with debugging.
|
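To illustrate why the binding mode matters here, the following is a simplified model, not Velero's actual code. It assumes only the behaviour described in this thread and the documented Kubernetes semantics: with WaitForFirstConsumer, a PVC is bound only once a pod that uses it has been scheduled, so the data mover has nowhere to run until that happens. The names `RestoredPVC` and `data_mover_can_start`, and the PVC names ending in -1 and -2, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RestoredPVC:
    name: str
    binding_mode: str             # "Immediate" or "WaitForFirstConsumer"
    consumer_pod_scheduled: bool  # has a pod using this PVC been scheduled?


def data_mover_can_start(pvc: RestoredPVC) -> bool:
    """Toy rule: can the restore for this PVC proceed?"""
    if pvc.binding_mode == "Immediate":
        # The volume is bound without waiting for any workload pod.
        return True
    if pvc.binding_mode == "WaitForFirstConsumer":
        # Binding is delayed until a consuming pod is scheduled.
        return pvc.consumer_pod_scheduled
    raise ValueError(f"unknown binding mode: {pvc.binding_mode}")


# Hypothetical state matching the report: only cassandra-0 exists.
pvcs = [
    RestoredPVC("cassandra-data-cassandra-0", "WaitForFirstConsumer", True),
    RestoredPVC("cassandra-data-cassandra-1", "WaitForFirstConsumer", False),
    RestoredPVC("cassandra-data-cassandra-2", "WaitForFirstConsumer", False),
]
for pvc in pvcs:
    print(pvc.name, data_mover_can_start(pvc))
```

In this model the two PVCs without a scheduled consumer never become eligible, which is consistent with their DataDownloads timing out during prepare.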
I checked the environment again. During the restore, only one pod was created:
However, according to Velero's restore logic, the restore order in this case is PVCs -> pods -> StatefulSet, so all 3 pods should have been created explicitly by Velero.
The one pod, cassandra-0, was actually created by the StatefulSet controller rather than by Velero, as the event below shows:
This event is emitted by the StatefulSet controller; see the Kubernetes code here |
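As a rough sketch of the ordering described above: the snippet below sorts backed-up items so that PVCs come before pods and pods before the StatefulSet. The priority list contains only the three kinds named in this comment; it is an assumption for illustration and not Velero's full restore priority list. `RESTORE_ORDER` and `order_for_restore` are hypothetical names.

```python
# Only the three kinds mentioned in the comment above (illustrative).
RESTORE_ORDER = ["PersistentVolumeClaim", "Pod", "StatefulSet"]


def order_for_restore(items):
    """Sort (kind, name) pairs by restore priority.

    Kinds not in RESTORE_ORDER go last. sorted() is stable, so items
    of the same kind keep their original relative order.
    """
    rank = {kind: i for i, kind in enumerate(RESTORE_ORDER)}
    return sorted(items, key=lambda item: rank.get(item[0], len(RESTORE_ORDER)))


backup_items = [
    ("StatefulSet", "cassandra"),
    ("Pod", "cassandra-0"),
    ("PersistentVolumeClaim", "cassandra-data-cassandra-0"),
]
for kind, name in order_for_restore(backup_items):
    print(kind, name)
```

With this order, the pods are expected to exist before the StatefulSet is restored. If they are skipped, the StatefulSet controller creates them itself, which matches the event seen for cassandra-0.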
@sseago Please help check why the pods were skipped by openshift-velero-plugin. This looks like an OADP-specific issue. |
Yes. We'll have to update our plugin |
@Lyndon-Li @kaovilai The intent is to discard pods we don't need restored because they have owners who manage them. Prior to native datamover, unless an owned pod either had hooks to run or had volumes to restore via fs-backup, the pod was skipped. It looks like we need an additional check for snapshot-move-data here. |
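The skip rule described above can be sketched as a small predicate. This is a hypothetical illustration of the logic as stated in the comment, not the code of openshift-velero-plugin. The type `PodInfo`, its field names, and the `check_data_mover` switch are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    has_owner: bool                       # managed by a controller, e.g. a StatefulSet
    has_restore_hooks: bool               # has hooks to run on restore
    has_fs_backup_volumes: bool           # has volumes to restore via fs-backup
    has_data_mover_volumes: bool = False  # has volumes restored via snapshot data movement


def should_restore_pod(pod: PodInfo, check_data_mover: bool = True) -> bool:
    """Return True if the pod should be restored rather than skipped."""
    if not pod.has_owner:
        # Unowned pods would not be recreated by anything else.
        return True
    if pod.has_restore_hooks or pod.has_fs_backup_volumes:
        return True
    # The additional check suggested in the comment above.
    if check_data_mover and pod.has_data_mover_volumes:
        return True
    return False


# An owned pod whose only reason to be restored is data movement.
cassandra_pod = PodInfo(
    has_owner=True,
    has_restore_hooks=False,
    has_fs_backup_volumes=False,
    has_data_mover_volumes=True,
)
print(should_restore_pod(cassandra_pod, check_data_mover=False))
print(should_restore_pod(cassandra_pod, check_data_mover=True))
```

Without the extra check, the owned pod is skipped, which is the situation reported in this issue. With the check, the pod is restored, so its PVC has a consumer.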
Alternatively, we could also consider removing the "don't restore pods by default" logic entirely in the plugin, but I know that at one point in the past, if the pod belonged to a DaemonSet, if we didn't discard the pod then after restore we ended up with duplicated pods -- a new one launched post-restore for the DaemonSet, and the restored one -- but that bug may not be relevant for current k8s versions -- last time I observed this happening was over 3 years ago. |
OK. Then let me close this issue from upstream. |
What steps did you take and what happened:
The DataDownload completed for one of the PVCs, but for the other two it got stuck in the Prepared and Accepted phases.
The restorer pod appears to be up and running.
I checked the node-agent pod logs but didn't find any errors. After some time, the DataDownload CR was marked as failed with the error
"timeout on preparing data download"
What did you expect to happen:
Restore should be successful.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help
bundle-2023-09-13-15-02-53.tar.gz
Anything else you would like to add:
Environment:
Velero version (use velero version):
Velero features (use velero client config get features):
Kubernetes version (use kubectl version):
OS (e.g. from /etc/os-release):
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.