DataMover: Restore is getting stuck in WaitingForPluginOperations phase for a StatefulSet application. #6813

Closed
PrasadJoshi12 opened this issue Sep 13, 2023 · 22 comments

What steps did you take and what happened:

  1. Deployed a StatefulSet application with 3 replicas.
  2. Executed a backup with the native DataMover; the backup completed successfully.
  3. Deleted the application namespace.
  4. Executed a restore, which ended in the following status:
status:
  errors: 1
  phase: WaitingForPluginOperationsPartiallyFailed
  progress:
    itemsRestored: 41
    totalItems: 41
  restoreItemOperationsAttempted: 3
  restoreItemOperationsCompleted: 1
  restoreItemOperationsFailed: 1
  startTimestamp: "2023-09-13T07:51:16Z"
  warnings: 12

The DataDownload completed for one of the PVCs, but the other two are stuck in the Prepared and Accepted phases.

$ oc get datadownload cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl -o yaml
apiVersion: velero.io/v2alpha1
kind: DataDownload
metadata:
  creationTimestamp: "2023-09-13T07:51:32Z"
  finalizers:
  - velero.io/data-upload-download-finalizer
  generateName: cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-
  generation: 3
  labels:
    velero.io/accepted-by: oadp-53250-cvgz9-worker-b-wqcv9.c.openshift-qe.internal
    velero.io/async-operation-id: dd-6d486051-9bf9-4c90-9026-7e2e270d04b0.1ef26d08-710f-469c7d519
    velero.io/restore-name: cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a
    velero.io/restore-uid: 6d486051-9bf9-4c90-9026-7e2e270d04b0
  name: cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl
  namespace: openshift-adp
  ownerReferences:
  - apiVersion: velero.io/v1
    controller: true
    kind: Restore
    name: cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a
    uid: 6d486051-9bf9-4c90-9026-7e2e270d04b0
  resourceVersion: "51066"
  uid: 30b94cb4-93a6-41ba-880e-74def4dc16d6
spec:
  backupStorageLocation: ts-dpa-1
  operationTimeout: 4h0m0s
  snapshotID: 69b0085467ebbb1b2e8d49ab26093c84
  sourceNamespace: test-oadp-223
  targetVolume:
    namespace: test-oadp-223
    pv: ""
    pvc: cassandra-data-cassandra-1
status:
  node: oadp-53250-cvgz9-worker-a-7spnn.c.openshift-qe.internal
  phase: Prepared
  progress: {}
  startTimestamp: "2023-09-13T07:51:32Z"

It seems that the restore pod is up and running.

$ oc get pod 
NAME                                                              READY   STATUS      RESTARTS   AGE
576f1bdd6ea22487866ac7a878e64beedc0605bfad29712a38c4f983bdf5khs   0/1     Completed   0          129m
**cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl          1/1     Running     0          95m**
node-agent-nrbpn                                                  1/1     Running     0          112m
node-agent-ql5m6                                                  1/1     Running     0          112m
node-agent-ztzzt                                                  1/1     Running     0          112m
openshift-adp-controller-manager-698b855b58-4pqpm                 1/1     Running     0          128m
ttl-sh-oadp-operator-bundle-562b573-1h                            1/1     Running     0          129m
velero-57c5cb799b-5d2sg                                           1/1     Running     0          112m

I checked the node-agent pod logs but didn't find any errors. After some time, the DataDownload CR was marked as Failed with the error
"timeout on preparing data download".

$ oc get datadownload 
NAME                                                       STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-5sn2j   Completed   99m       67209564     67209564      ts-dpa-1           99m   oadp-53250-cvgz9-worker-c-x5g6g.c.openshift-qe.internal
cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl   Prepared    99m                                  ts-dpa-1           99m   oadp-53250-cvgz9-worker-a-7spnn.c.openshift-qe.internal
cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn   Failed      99m                                  ts-dpa-1           99m  

What did you expect to happen:
Restore should be successful.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, see velero debug --help.
bundle-2023-09-13-15-02-53.tar.gz

Lyndon-Li (Contributor) commented Sep 13, 2023

There are 3 DDCRs (DataDownload CRs): cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-5sn2j, cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl and cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn.
The restore pods for cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-5sn2j and cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-76rnl were scheduled to the same node (oadp-5325), so their data downloads are handled sequentially by the same node-agent pod (node-agent-ql5m6). So there is no problem with these two DDCRs: one had just completed, and the other had not started yet.

The problem was with the 3rd DDCR -- cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn, here is the log:

time="2023-09-13T07:51:32Z" level=info msg="Data download starting" controller=datadownload datadownload=openshift-adp/cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn logSource="pkg/controller/data_download_controller.go:151"
time="2023-09-13T07:51:32Z" level=info msg="Accepting data download cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn" controller=DataDownload logSource="pkg/controller/data_download_controller.go:606"
time="2023-09-13T07:51:32Z" level=info msg="This datadownload has been accepted by oadp-53250-cvgz9-worker-a-7spnn.c.openshift-qe.internal" DataDownload=cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn controller=DataDownload logSource="pkg/controller/data_download_controller.go:632"
time="2023-09-13T07:51:32Z" level=info msg="Data download is accepted" controller=datadownload datadownload=openshift-adp/cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn logSource="pkg/controller/data_downlo

time="2023-09-13T08:21:44Z" level=info msg="Timeout happened for preparing datadownload" DataDownload=cassandra-e2e-218e6d52-5208-11ee-a137-845cf3eff33a-n4smn controller=DataDownload logSource="pkg/controller/data_download_controller.go:643"

The prepare had not completed by the timeout (30 minutes).
From the code logic, the exposer was waiting for the workload PVC cassandra-data-cassandra-2 to be used by the workload pod.

Lyndon-Li (Contributor) commented Sep 13, 2023

@PrasadJoshi12
To further troubleshoot the problem, we need the details of the workload PVC cassandra-data-cassandra-2:

  • Run kubectl get pvc cassandra-data-cassandra-2 -n test-oadp-223 -o yaml and kubectl describe pvc cassandra-data-cassandra-2 -n test-oadp-223 and share the outputs
  • Is PVC cassandra-data-cassandra-2 any different from the other two? Are they all using the same storage class?
  • Find the storage class of PVC cassandra-data-cassandra-2, run kubectl get storageclass <the sc name you found> -o yaml, and share the output

Lyndon-Li self-assigned this Sep 13, 2023
PrasadJoshi12 (Author)

@Lyndon-Li I tried it again and this time it worked fine. I will give it a try again tomorrow. I have seen this issue happening multiple times.

Lyndon-Li (Contributor)

I double-checked the code. Most likely, when the problem happened, the restored pod was not successfully scheduled within 30 minutes, so Velero had nowhere to run the data mover.
To confirm this, we need the describe output and the YAML of the restored pod and the restored PVC (as listed in #6813 (comment)).
As for why the restored pod was not scheduled within 30 minutes, we need to check further once this is confirmed. It may be a bug or an environment problem.
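For reference, a minimal set of commands to capture that information, assuming the restore pod and restore PVC carry the DataDownload name in the openshift-adp namespace, as in the outputs above:

$ oc describe pod <datadownload-name> -n openshift-adp
$ oc get pod <datadownload-name> -n openshift-adp -o yaml
$ oc describe pvc <datadownload-name> -n openshift-adp
$ oc get pvc <datadownload-name> -n openshift-adp -o yaml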

PrasadJoshi12 (Author) commented Sep 14, 2023

@Lyndon-Li I tried it again on my cluster; this time the DataDownload is stuck for the first PVC. It looks like an intermittent issue.

$ oc get datadownload
NAME                 STATUS     STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
test-restore-65znj   Prepared   14m                                  ts-dpa-1           14m   oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
test-restore-746pc   Accepted   14m                                  ts-dpa-1           14m   
test-restore-kvjsd   Accepted   14m                                  ts-dpa-1           14m   

The test-restore-65znj pod has been running for the last 15 minutes.

$ oc get pod
NAME                                                              READY   STATUS      RESTARTS   AGE
fc42f9c3c026688d97230269915bde9c4dd4757b14667055dd27d9b957swg7z   0/1     Completed   0          77m
node-agent-cnmjt                                                  1/1     Running     0          25m
node-agent-csxvm                                                  1/1     Running     0          25m
node-agent-kqwj8                                                  1/1     Running     0          25m
openshift-adp-controller-manager-77d7f8689f-79dxw                 1/1     Running     0          77m
**test-restore-65znj                                                1/1     Running     0          15m**
ttl-sh-oadp-operator-bundle-165d247-1h                            1/1     Running     0          78m
velero-5f55cf6788-k22t8                                           1/1     Running     0          25m

Attached pod yaml below.
test-restore-65znj.txt

$ oc describe  pvc test-restore-65znj
Name:          test-restore-65znj
Namespace:     openshift-adp
StorageClass:  ssd-csi
Status:        Bound
Volume:        pvc-ae9be1b9-98a6-4caf-adc7-9c6a1823c465
Labels:        app=cassandra
               velero.io/backup-name=test-backup
               velero.io/restore-name=test-restore
               velero.io/volume-snapshot-name=velero-cassandra-data-cassandra-0-vwhfr
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
               volume.kubernetes.io/selected-node: oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
               volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       test-restore-65znj
Events:
  Type    Reason                 Age                From                                                                                                           Message
  ----    ------                 ----               ----                                                                                                           -------
  Normal  ExternalProvisioning   18m (x2 over 18m)  persistentvolume-controller                                                                                    waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
  Normal  Provisioning           18m                pd.csi.storage.gke.io_oadp-5359-1-ghchk-master-1.c.openshift-qe.internal_b0d2fc13-46af-4449-a591-824a3dc27ff8  External provisioner is provisioning volume for claim "openshift-adp/test-restore-65znj"
  Normal  ProvisioningSucceeded  18m                pd.csi.storage.gke.io_oadp-5359-1-ghchk-master-1.c.openshift-qe.internal_b0d2fc13-46af-4449-a591-824a3dc27ff8  Successfully provisioned volume pvc-ae9be1b9-98a6-4caf-adc7-9c6a1823c465

The workload pod is still in Pending status.

$ oc describe pod cassandra-0 -n test-oadp-223 
Name:           cassandra-0
Namespace:      test-oadp-223
Priority:       0
Node:           <none>
Labels:         app=cassandra
                controller-revision-hash=cassandra-6cf44ff9d8
                statefulset.kubernetes.io/pod-name=cassandra-0
Annotations:    openshift.io/scc: privileged
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  StatefulSet/cassandra
Containers:
  cassandra:
    Image:       quay.io/migqe/cassandra:multiarch
    Ports:       7000/TCP, 7001/TCP, 7199/TCP, 9042/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      MAX_HEAP_SIZE:              512M
      HEAP_NEWSIZE:               512M
      CASSANDRA_SEEDS:            cassandra-0.cassandra.test-oadp-223.svc.cluster.local
      CASSANDRA_CLUSTER_NAME:     Cassandra
      CASSANDRA_DC:               DC1
      CASSANDRA_RACK:             Rack1
      CASSANDRA_AUTO_BOOTSTRAP:   false
      CASSANDRA_ENDPOINT_SNITCH:  GossipingPropertyFileSnitch
      POD_IP:                      (v1:status.podIP)
    Mounts:
      /var/lib/cassandra from cassandra-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xj2ff (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  cassandra-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cassandra-data-cassandra-0
    ReadOnly:   false
  kube-api-access-xj2ff:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  15m    default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
  Warning  FailedScheduling  5m11s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

The PVC is looking for a PV with the label velero.io/dynamic-pv-restore: test-oadp-223.cassandra-data-cassandra-0.pxc7r

$ oc get pvc -n test-oadp-223  cassandra-data-cassandra-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    backup.velero.io/must-include-additional-items: "true"
    volume.beta.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
    volume.kubernetes.io/selected-node: oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
    volume.kubernetes.io/storage-provisioner: pd.csi.storage.gke.io
  creationTimestamp: "2023-09-14T12:12:01Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: cassandra
    velero.io/backup-name: test-backup
    velero.io/restore-name: test-restore
    velero.io/volume-snapshot-name: velero-cassandra-data-cassandra-0-vwhfr
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    subresource: status
    time: "2023-09-14T10:40:00Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:pv.kubernetes.io/bind-completed: {}
          f:pv.kubernetes.io/bound-by-controller: {}
          f:volume.beta.kubernetes.io/storage-provisioner: {}
          f:volume.kubernetes.io/storage-provisioner: {}
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        f:accessModes: {}
        f:resources:
          f:requests:
            .: {}
            f:storage: {}
        f:volumeMode: {}
        f:volumeName: {}
    manager: kube-controller-manager
    operation: Update
    time: "2023-09-14T12:12:02Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:volume.kubernetes.io/selected-node: {}
    manager: kube-scheduler
    operation: Update
    time: "2023-09-14T12:12:02Z"
  name: cassandra-data-cassandra-0
  namespace: test-oadp-223
  resourceVersion: "140186"
  uid: 163c1f83-0c3f-49f9-9b3a-712f59d3c92e
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      velero.io/dynamic-pv-restore: test-oadp-223.cassandra-data-cassandra-0.pxc7r
  storageClassName: ssd-csi
  volumeMode: Filesystem
status:
  phase: Pending

There are no PVs in the cluster with that label:

$ oc get pv -l velero.io/dynamic-pv-restore=test-oadp-223.cassandra-data-cassandra-0.pxc7r

$ oc get sc
NAME                PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
ssd-csi (default)   pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   5h30m
standard-csi        pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   5h30m
$ oc get sc  ssd-csi -o yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2023-09-14T07:10:26Z"
  managedFields:
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:allowVolumeExpansion: {}
      f:parameters:
        .: {}
        f:replication-type: {}
        f:type: {}
      f:provisioner: {}
      f:reclaimPolicy: {}
      f:volumeBindingMode: {}
    manager: gcp-pd-csi-driver-operator
    operation: Update
    time: "2023-09-14T07:10:26Z"
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:storageclass.kubernetes.io/is-default-class: {}
    manager: e2e.test
    operation: Update
    time: "2023-09-14T08:18:43Z"
  name: ssd-csi
  resourceVersion: "49257"
  uid: 1dadc7e9-245a-494d-bd56-61cee6a287e8
parameters:
  replication-type: none
  type: pd-ssd
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Lyndon-Li (Contributor) commented Sep 14, 2023

@PrasadJoshi12
Could you share the latest Velero log bundle that matches the pods and PVCs captured above?
I need to check the log to confirm, but here is my guess:

  • If the restore is stuck, it is not due to test-restore-65znj, but due to either test-restore-746pc or test-restore-kvjsd
  • Either test-restore-746pc or test-restore-kvjsd is being prepared on the node oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
  • Since the prepare is a synchronous operation, test-restore-746pc or test-restore-kvjsd is taking up the controller's time on node oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
  • Then test-restore-65znj is scheduled to the same node oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
  • But the controller has no time to run data movement for test-restore-65znj, so it is stuck in the Prepared status
  • I guess that if you wait some more time (probably another 15 minutes), either test-restore-746pc or test-restore-kvjsd will go to the Failed status due to the same preparation timeout problem, and test-restore-65znj will resume and go to the InProgress status

Hopefully you have kept the environment; if so, please collect the same pod & PVC info for the other two DDCRs as you did for test-restore-65znj, and also collect the Velero log bundle. Otherwise, please collect the pod & PVC info for all DDCRs as well as the log bundle the next time the restore gets stuck.
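In the meantime, a simple way to watch the DDCR phase transitions while the restore runs (a sketch using the namespace from the outputs above):

$ oc get datadownload -n openshift-adp -w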

PrasadJoshi12 (Author) commented Sep 14, 2023

These logs were collected after the DataDownload had been running for almost 15 minutes.
bundle-2023-09-14-18-06-24.tar.gz

After 30 minutes, both DDCRs failed.

$ oc get pod -n openshift-adp
NAME                                                              READY   STATUS      RESTARTS   AGE
fc42f9c3c026688d97230269915bde9c4dd4757b14667055dd27d9b957swg7z   0/1     Completed   0          3h53m
node-agent-cnmjt                                                  1/1     Running     0          3h1m
node-agent-csxvm                                                  1/1     Running     0          3h1m
node-agent-kqwj8                                                  1/1     Running     0          3h1m
openshift-adp-controller-manager-77d7f8689f-79dxw                 1/1     Running     0          3h53m
test-restore-65znj                                                1/1     Running     0          171m
ttl-sh-oadp-operator-bundle-165d247-1h                            1/1     Running     0          3h53m
velero-5f55cf6788-k22t8                                           1/1     Running     0          3h1m
$ oc get datadownload -n openshift-adp
NAME                 STATUS     STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE    NODE
test-restore-65znj   Prepared   172m                                 ts-dpa-1           172m   oadp-5359-1-ghchk-worker-b-dxz9m.c.openshift-qe.internal
test-restore-746pc   Failed     172m                                 ts-dpa-1           172m   
test-restore-kvjsd   Failed     172m                                 ts-dpa-1           172m   

Re-collected the bundle:
bundle-2023-09-14-20-36-02.tar.gz

Lyndon-Li (Contributor)

@PrasadJoshi12 Are you in the Velero Slack channel? Could we discuss this a bit?

PrasadJoshi12 (Author)

@Lyndon-Li I'm not in the Velero Slack channel. Can you please send the link?

Lyndon-Li (Contributor)

Lyndon-Li (Contributor)

@PrasadJoshi12
From the log, I can confirm that test-restore-746pc and test-restore-kvjsd failed due to the same preparation timeout problem, and test-restore-65znj was blocked by the former two, as they had taken the only two controllers.
Next, to further troubleshoot what happened with test-restore-746pc and test-restore-kvjsd, we also need the info of the pods & PVCs in the restored namespace.
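For example, something like the following would capture that info (restored namespace taken from the earlier outputs):

$ oc get pod,pvc -n test-oadp-223 -o yaml
$ oc describe pod -n test-oadp-223
$ oc describe pvc -n test-oadp-223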

PrasadJoshi12 (Author) commented Sep 14, 2023

@Lyndon-Li I'm not able to join the Kubernetes Slack:

Don’t have an @get-an-invite-at.slack.kubernetes.io email address?
Contact the workspace administrator at Kubernetes for an invitation.

shawn-hurley (Contributor)

PrasadJoshi12 (Author)

Thank you @shawn-hurley

PrasadJoshi12 (Author) commented Sep 15, 2023

The issue was happening with the SC volume binding mode WaitForFirstConsumer.

Tested it with the SC volume binding mode Immediate, and the restore completed successfully.
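For reference, a minimal sketch of a storage class with Immediate binding, assuming it only differs from the ssd-csi class shown earlier in the binding mode (the name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-csi-immediate        # hypothetical name for the test class
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: none
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate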

Thank you so much @Lyndon-Li for helping with debugging.

$ oc get datadownload 
NAME                  STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
test-restore1-7md5t   Completed   2m12s     67279946     67279946      ts-dpa-1           2m12s   oadp-53730-p4lcm-worker-b-znjjm.c.openshift-qe.internal
test-restore1-bqkgh   Completed   2m12s     67209744     67209744      ts-dpa-1           2m12s   oadp-53730-p4lcm-worker-b-znjjm.c.openshift-qe.internal
test-restore1-pvtz2   Completed   2m12s     67209574     67209574      ts-dpa-1           2m12s   oadp-53730-p4lcm-worker-a-sqm68.c.openshift-qe.internal
status:
  completionTimestamp: "2023-09-15T08:40:29Z"
  phase: Completed
  progress:
    itemsRestored: 41
    totalItems: 41
  restoreItemOperationsAttempted: 3
  restoreItemOperationsCompleted: 3
  startTimestamp: "2023-09-15T08:38:20Z"
  warnings: 12

Lyndon-Li (Contributor) commented Sep 15, 2023

Checked the environment again; during the restore, only one pod was created:

$ oc get pod -n test-oadp-223
NAME          READY   STATUS    RESTARTS   AGE
cassandra-0   0/1     Pending   0          52m

However, according to Velero's restore logic, the restore order in this case is PVCs -> pods -> StatefulSet, so all 3 pods should have been created explicitly by Velero.
Checking the restore log, the 3 pods were indeed skipped by openshift-velero-plugin:

time="2023-09-14T12:12:01Z" level=info msg="[pod-restore] skipping restore of pod cassandra-0, has owner references, no restic backup, and no restore hooks" cmd=/plugins/velero-plugins logSource="/opt/app-root/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/pod/restore.go:126" pluginName=velero-plugins restore=openshift-adp/test-restore
time="2023-09-14T12:12:01Z" level=info msg="Skipping restore of Pod: cassandra-0 because a registered plugin discarded it" logSource="pkg/restore/restore.go:1263" restore=openshift-adp/test-restore

time="2023-09-14T12:12:01Z" level=info msg="[pod-restore] skipping restore of pod cassandra-1, has owner references, no restic backup, and no restore hooks" cmd=/plugins/velero-plugins logSource="/opt/app-root/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/pod/restore.go:126" pluginName=velero-plugins restore=openshift-adp/test-restore
time="2023-09-14T12:12:01Z" level=info msg="Skipping restore of Pod: cassandra-1 because a registered plugin discarded it" logSource="pkg/restore/restore.go:1263" restore=openshift-adp/test-restore

time="2023-09-14T12:12:01Z" level=info msg="[pod-restore] skipping restore of pod cassandra-2, has owner references, no restic backup, and no restore hooks" cmd=/plugins/velero-plugins logSource="/opt/app-root/src/github.com/konveyor/openshift-velero-plugin/velero-plugins/pod/restore.go:126" pluginName=velero-plugins restore=openshift-adp/test-restore
time="2023-09-14T12:12:01Z" level=info msg="Skipping restore of Pod: cassandra-2 because a registered plugin discarded it" logSource="pkg/restore/restore.go:1263" restore=openshift-adp/test-restore

The one pod, cassandra-0, was actually created by the StatefulSet controller rather than by Velero, because we can see the event below:

25m         Normal    SuccessfulCreate       statefulset/cassandra                              create Pod cassandra-0 in StatefulSet cassandra successful

This event is printed by the StatefulSet controller; see the Kubernetes code here.
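For anyone reproducing this check, a minimal sketch using the restore and namespace names from the outputs above (the Velero CLI namespace flag may need adjusting for your install):

$ velero restore logs test-restore -n openshift-adp | grep -i "skipping restore of pod"
$ oc get events -n test-oadp-223 --field-selector involvedObject.kind=StatefulSet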

Lyndon-Li (Contributor) commented Sep 15, 2023

@sseago Please help check why the pods were skipped by openshift-velero-plugin; this looks like an OADP-specific issue.

kaovilai (Contributor)

Yes, we'll have to update our plugin.

sseago (Collaborator) commented Sep 15, 2023

@Lyndon-Li @kaovilai The intent is to discard pods we don't need restored because they have owners that manage them. Prior to the native data mover, unless an owned pod either had hooks to run or had volumes to restore via fs-backup, the pod was skipped. It looks like we need an additional check for snapshot-move-data here.
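As a quick way to confirm that a given backup used data movement (so that its pods' volumes come back via DataDownloads rather than fs-backup), something like the following should show the flag and the resulting CRs -- assuming the spec field and label names match this Velero version:

$ oc get backup test-backup -n openshift-adp -o jsonpath='{.spec.snapshotMoveData}'
$ oc get dataupload -n openshift-adp -l velero.io/backup-name=test-backup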

sseago (Collaborator) commented Sep 15, 2023

Alternatively, we could consider removing the "don't restore pods by default" logic from the plugin entirely. However, at one point in the past, if a pod belonged to a DaemonSet and we didn't discard it, we ended up with duplicated pods after restore -- a new one launched post-restore by the DaemonSet, plus the restored one. That bug may not be relevant for current Kubernetes versions; the last time I observed it was over 3 years ago.

Lyndon-Li (Contributor)

OK, then let me close this issue on the upstream side.
From the data mover's logic, it is safe to restore pods under a StatefulSet even though the ownerReference is there.
