
Velero randomly restoring another namespace's PVCs on a backup #2570

Closed
arianitu opened this issue May 25, 2020 · 29 comments · Fixed by #3007
Labels: Bug, Volumes (relating to volume backup and restore)

Comments

arianitu commented May 25, 2020

What steps did you take and what happened:
We schedule a backup when creating our development environment. After the environment is created, velero somehow replaces the new PVC with a cloned PVC instead of keeping the new PVC.

This is a huge issue for us since we do not want PVCs randomly being attached. I am unsure why it's happening; we are not issuing any restore commands. The only command I see us issuing is a backup schedule.

At first I thought maybe our ingress was just serving the wrong environment, but when I run kubectl get pvc on the new environment, I see a velero-clone volume bound to one of my PVCs:

kubectl get pvc -n wp-bf083a8f-1ac7-416c-bac5-234f586c4440

NAME            STATUS   VOLUME                                              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mysql-mysql-0   Bound    velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190   5Gi        RWO            ssd            111m
site-code       Bound    pvc-b758f25c-9e94-11ea-9504-42010aa20135            5Gi        RWO            ssd            111m

I wanted to check the restore logs, but velero restore get is returning nothing for some reason. Is there another way to get a list of restores and their logs? Maybe I'm running into a bug?

What did you expect to happen:
I did not expect a backup schedule to trigger a PVC clone. I expected the backup schedule to just create a backup as usual.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
    https://pastebin.com/Z1MgNT20

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml

Name:         wp-bf083a8f-1ac7-416c-bac5-234f586c4440-20200525143337
Namespace:    velero
Labels:       velero.io/schedule-name=wp-bf083a8f-1ac7-416c-bac5-234f586c4440
              velero.io/storage-location=default
Annotations:  <none>

Phase:  Completed

Namespaces:
  Included:  wp-bf083a8f-1ac7-416c-bac5-234f586c4440
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Snapshot PVs:  true

TTL:  336h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-05-25 10:33:37 -0400 EDT
Completed:  2020-05-25 10:33:58 -0400 EDT

Expiration:  2020-06-08 10:33:37 -0400 EDT

Persistent Volumes:  2 of 2 snapshots completed successfully (specify --details for more information)
  • velero backup logs <backupname>

https://pastebin.com/sVuiWjGk

  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml

Cannot describe restore. Velero restore get is not showing any entries.

  • velero restore logs <restorename>

Cannot get restore logs. Velero restore get is not showing any entries.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version):
Client:
	Version: v1.3.1
	Git commit: -
Server:
	Version: v1.3.2
  • Velero features (use velero client config get features):
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.36", GitCommit:"34a615f32e9a0c9e97cdb9f749adb392758349a6", GitTreeState:"clean", BuildDate:"2020-04-06T16:33:17Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:
    GKE (Google Cloud)

  • Cloud provider or hardware configuration:
    Google Cloud

  • OS (e.g. from /etc/os-release):

arianitu (Author) commented:

Also, to assist in debugging, can I get some information on how it is decided which snapshot/PV gets attached?

So just to describe the steps on what happened:

  1. New namespace gets created
  2. New PVC gets created
  3. New Deployment that refers to new PVC gets created.
  4. New StatefulSet (MySQL) with PVCTemplate gets created
  5. Backup Schedule gets initiated after the above are created. (Also, why does a Backup Schedule instantly trigger a backup instead of waiting until its scheduled time?)

When this environment gets created, what is weird is that instead of the Deployment being attached to the new PVC, or the StatefulSet being attached to the PVC from its claim template, they are attached to the velero-clone volume.

So instead of a brand new environment, we get a random PVC snapshot attached (a pretty huge bug + security issue, since the volume contains data from another namespace).

arianitu (Author) commented May 26, 2020

Perhaps Velero is only comparing PVC name instead of namespace + PVC name. Each environment has the same PVC name.

Or perhaps:

  1. Velero was waiting for a snapshot for a previous clone/restore that did not finish.
  2. A backup was triggered, which triggers a snapshot.
  3. The restore process attaches to this unrelated backup instead of the original backup it was waiting on?
  4. Velero finishes its restore process by replacing/renaming the freshly created PVC.

Also, I believe Velero should have a security e-mail of some sort; I looked everywhere and can't find a way to report a security issue to the team.

skriss (Member) commented May 26, 2020

@arianitu my guess based on the info that you provided is that the PV named velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190 already existed on your cluster and was available/not bound by any existing PVC, so when you created a new PVC, Kubernetes bound it to this existing PV since it met the requirements.

To verify this - could you provide the full YAML for:

  • the PV named velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190 (particularly interested in the creationTimestamp relative to your deployment/statefulset/PVC)
  • the Deployment and/or StatefulSet you created
  • the PVC you created

arianitu (Author) commented May 26, 2020

@skriss

That could be possible, but how is it decided that the PV meets the requirements? This was a completely fresh environment, although the YAML of the clone looks really weird since it's targeting this new environment.

  1. Here is the YAML of the clone:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: gce-pd-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
    velero.io/original-pv-name: pvc-579abc6a-8b5e-11ea-9cac-42010aa200b6
  creationTimestamp: "2020-05-04T20:18:18Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: northamerica-northeast1
    failure-domain.beta.kubernetes.io/zone: northamerica-northeast1-b
    velero.io/backup-name: fd4ebbae-de28-4ea0-9964-b2eca9cb2ac4
    velero.io/restore-name: 9de3ae59-8a48-4035-bf46-19ecf699e03e
  name: velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190
  resourceVersion: "99581900"
  selfLink: /api/v1/persistentvolumes/velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190
  uid: 647c3031-8e44-11ea-9cac-42010aa200b6
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mysql-mysql-0
    namespace: wp-bf083a8f-1ac7-416c-bac5-234f586c4440
    resourceVersion: "99581898"
    uid: b7249ada-9e94-11ea-9745-42010aa20074
  gcePersistentDisk:
    fsType: ext4
    pdName: restore-f72bea94-1f24-467a-a1ea-7d662b04d2aa
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - northamerica-northeast1-b
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - northamerica-northeast1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ssd
  volumeMode: Filesystem
status:
  phase: Bound

Looking at that PV, it was created on "2020-05-04T20:18:18Z", but somehow has the namespace "wp-bf083a8f-1ac7-416c-bac5-234f586c4440".

How is that possible when the namespace "wp-bf083a8f-1ac7-416c-bac5-234f586c4440" was created yesterday?

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-05-25T14:33:34Z"
  name: wp-bf083a8f-1ac7-416c-bac5-234f586c4440
  resourceVersion: "99581878"
  selfLink: /api/v1/namespaces/wp-bf083a8f-1ac7-416c-bac5-234f586c4440
  uid: b6651cdb-9e94-11ea-9504-42010aa20135
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
  2. I cannot give the full Deployment and StatefulSet since they are sensitive, but I can give you the volumeClaimTemplates section of the MySQL StatefulSet, which was the one that was overridden in this case:
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: mysql
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5G
      storageClassName: ssd
      volumeMode: Filesystem
    status:
      phase: Pending
  3. Here are the PVCs:
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: "2020-05-25T14:33:35Z"
    finalizers:
    - kubernetes.io/pvc-protection
    labels:
      app: mysql
    name: mysql-mysql-0
    namespace: wp-bf083a8f-1ac7-416c-bac5-234f586c4440
    resourceVersion: "99581903"
    selfLink: /api/v1/namespaces/wp-bf083a8f-1ac7-416c-bac5-234f586c4440/persistentvolumeclaims/mysql-mysql-0
    uid: b7249ada-9e94-11ea-9745-42010aa20074
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 5G
    storageClassName: ssd
    volumeMode: Filesystem
    volumeName: velero-clone-1a2a1a6c-915f-4096-a693-cff87b547190
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 5Gi
    phase: Bound
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
    creationTimestamp: "2020-05-25T14:33:35Z"
    finalizers:
    - kubernetes.io/pvc-protection
    name: site-code
    namespace: wp-bf083a8f-1ac7-416c-bac5-234f586c4440
    resourceVersion: "99581963"
    selfLink: /api/v1/namespaces/wp-bf083a8f-1ac7-416c-bac5-234f586c4440/persistentvolumeclaims/site-code
    uid: b758f25c-9e94-11ea-9504-42010aa20135
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 5G
    storageClassName: ssd
    volumeMode: Filesystem
    volumeName: pvc-b758f25c-9e94-11ea-9504-42010aa20135
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 5Gi
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

By the way, both the Deployment and StatefulSet PVCs got replaced by cloned Velero volumes twice, and only the StatefulSet PVC got replaced the third time. Subsequent environments are no longer seeing this behaviour.

skriss (Member) commented May 26, 2020

There's some info on the binding process here. The basic idea is that if there's a PV that exists/is available, that has the same storage class as the PVC, and is at least as big as the PVC's requests, then it may be used to satisfy the PVC. Only if there are no existing PVs that match the PVC's spec is a new one dynamically provisioned.

The PV itself is a cluster-scoped resource, so it doesn't have a namespace; the namespace inside the PV's YAML just represents the two-way binding with the PVC (which happened when you created the PVC).

If you always want new PVs to be dynamically provisioned, then I'd ensure that your cluster contains no available PVs before creating the new PVCs.
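
As a minimal illustration of those matching rules (all names here are hypothetical), an Available PV left over from an earlier restore can satisfy a brand-new PVC in any namespace, as long as the storage class, access modes, and size line up:

# An existing PV that is Available (no claimRef), e.g. left over from an earlier restore
apiVersion: v1
kind: PersistentVolume
metadata:
  name: leftover-clone-pv            # hypothetical name
spec:
  storageClassName: ssd
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    fsType: ext4
    pdName: some-existing-disk       # hypothetical disk
---
# A brand-new PVC in any namespace: same storage class, RWO, request <= 5Gi,
# so the binder may bind it to the PV above instead of provisioning a new volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-mysql-0
  namespace: some-new-namespace      # hypothetical namespace
spec:
  storageClassName: ssd
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5G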

arianitu (Author) commented:

@skriss does this mean only one clone can happen at a time, since you're not guaranteed an order when matching a PV?

Seems error-prone, especially if a previous restore was pending/failed. Is there any way to make this consistent so the PV only binds to the intended namespace?

skriss (Member) commented May 26, 2020

When Velero clones a PV as part of a restore, it'll specify that specific PV name in the PVC spec, so the new PVC is always bound to the correct PV. So the Velero behavior should be safe as far as I know.

The behavior you're seeing isn't something Velero is triggering - it just happens that this old PV that was created by Velero a few weeks ago (not sure why) happens to meet the requirements of this new PVC that you're creating.
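
Roughly, a PVC restored by Velero comes in with spec.volumeName already pointing at the clone PV, along these lines (a sketch with hypothetical names):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-mysql-0
  namespace: restored-namespace      # hypothetical target namespace
spec:
  storageClassName: ssd
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  # Velero sets this to the specific clone PV it created, so the PVC can only bind to that PV
  volumeName: velero-clone-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee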

arianitu (Author) commented May 26, 2020

@skriss Okay I think I am understanding this a bit more, but isn't that still a big problem? If a restore fails for any reason, you will have Deployments/StatefulSets being attached to a PV that was cloned instead of a freshly made one.

There isn't a good way for us to prevent PVs from existing before creating Deployments/StatefulSets. We create Deployments/StatefulSets all the time and we're cloning environments all the time. If a clone fails, our deployment is going to fail.

Is there any solution that can work around this problem? It doesn't seem very intuitive and people will definitely run into this and be a bit shocked to notice that their Deployment is serving another random failed restore.

arianitu (Author) commented:

@skriss Can we instead manually create our PVCs with the volume name specified, like you said Velero does, to prevent Deployments/StatefulSets from grabbing an arbitrary PV?

arianitu (Author) commented:

Here's a race condition as well:

  1. A restore is triggered, creating a new cloned PV.
  2. Another developer applies a Deployment/StatefulSet shortly after the PV is created.
  3. The cloned PV binds to the Deployment's PVC instead of the PVC being restored, leading to both operations failing.

In a busy environment, this situation is going to happen often.

arianitu (Author) commented:

Closing this, since this isn't really an issue with Velero; it's more of an issue with how Kubernetes handles PVs/PVCs.

arianitu (Author) commented Jul 27, 2020

Hello @skriss, sorry to bring this up from the dead, but given that Velero is using ClaimRef, why did our PVCs match the Velero PV?

This issue says this should not happen, and it has been fixed: kubernetes/kubernetes#23615.

Can you maybe help me understand ClaimRef and what the idea behind it is? I thought ClaimRef was a deterministic way to find a PersistentVolume from a PersistentVolumeClaim. In the above case, why did the ClaimRef match? Is it because ClaimRef only uses .name, or does it use .namespace as well? Shouldn't it have to match on Name + Namespace? Why did a PVC in a separate namespace match the PV?

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/util/util.go#L214
https://github.com/kubernetes/kubernetes/blob/00da04ba23d755d02a78d5021996489ace96aa4d/pkg/controller/volume/persistentvolume/util/util.go#L154

It looks like ClaimRef has to match Namespace + Name, so why did our PVC attach to the PV?

Other attempts to solve this problem:

  1. Apply a label selector with a NOT EXISTS (DoesNotExist) match for the Velero labels attached to the PV (a sketch is included below). It failed saying: GKE: Selector not supported on PVC.

The only thing we can think of is to manually create the GKE disk, PersistentVolume, and PersistentVolumeClaim ourselves using label selectors. But that sucks since we lose the volume abstraction layer of Kubernetes. This seems very silly; why can't I get a PVC to just NOT bind to a specific PV created by Velero?

I've also requested more information on the above Kubernetes issue.
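
For reference, the selector we tried looked roughly like this (an illustrative sketch, not the exact manifest):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: site-code
  namespace: some-new-namespace      # hypothetical
spec:
  storageClassName: ssd
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5G
  selector:
    matchExpressions:
    - key: velero.io/backup-name
      operator: DoesNotExist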

arianitu (Author) commented Jul 27, 2020

When Velero clones a PV as part of a restore, it'll specify that specific PV name in the PVC spec, so the new PVC is always bound to the correct PV. So the Velero behavior should be safe as far as I know.

Can you point me to the logic for this? Are you using ClaimRef, LabelSelectors, or something else to guarantee this? Maybe this is a Velero bug?

Edit:

It appears Velero uses volumeName: to bind a restored PVC to its PV. I'm investigating whether there is a better way to guarantee a PVC binds to the intended PV without running into the above condition.

https://github.com/vmware-tanzu/velero/blob/master/pkg/restore/restore.go#L1051

arianitu (Author) commented Jul 27, 2020

@skriss Can we please also set ClaimRef instead of just volumeName? If we set ClaimRef, we should not run into the above race condition, since the Name + Namespace would have to match for the PVC to be able to bind to the PV.

At https://github.com/vmware-tanzu/velero/blob/master/pkg/restore/restore.go#L1051, we also need to set ClaimRef with the Name + Namespace.

@arianitu arianitu reopened this Jul 27, 2020
skriss (Member) commented Jul 27, 2020

@arianitu I'm not working on Velero anymore, but someone from @vmware-tanzu/velero-maintainers should be able to help answer your questions.

arianitu (Author) commented Jul 27, 2020

@carlisia can you assist in finding the right maintainer to fix this bug? If you look at the Kubernetes issue above, apparently both volumeName and ClaimRef are required for correct binding of an existing PV to a PVC.

edit:

Thank you @skriss for all your work on the Velero project, good luck on your future projects!

@nrb nrb added the Bug label Jul 27, 2020
nrb (Contributor) commented Jul 27, 2020

@arianitu Yeah, it looks like you're right; we need to make sure that the ClaimRef exists on restore too.

nrb (Contributor) commented Jul 27, 2020

@arianitu Are you doing namespace remapping during your restores? That's the line you linked to; I just want to be sure. I think we need to be checking this whether or not remapping is happening.

Searching through the restore/restore.go code, we do use the ClaimRef in spots, but don't necessarily use it in the actual restoration. We also don't reference it in the RestoreItemAction for PVs. I think, but don't know at the moment, that the ClaimRef shouldn't be cleared; if the PV is coming in and getting attached to the wrong PVCs, that would point to either the ClaimRef being cleared or something being unset.

@carlisia carlisia added Needs info Waiting for information Needs investigation labels Jul 28, 2020
arianitu (Author) commented:

@nrb Yes, I am doing a namespace remapping when I hit this case. Can you show me the code where ClaimRef is used?

arianitu (Author) commented:

Also @nrb, this is a pretty critical bug for us; do you know the timeline for a fix? I think PVC restores should definitely use ClaimRef, and I don't see that anywhere.

nrb (Contributor) commented Jul 29, 2020

@arianitu Are you hitting it consistently when doing namespace remapping? I agree we should be setting the ClaimRef on remap cases, though I think it's more or less handled for us when not doing namespace remapping.

I don't have a timeline on a fix at the moment. Depending on how easily it can be reproduced, we might be able to put out a patch release, but I can't make promises there at the moment.

arianitu (Author) commented:

@nrb Yes, this is consistent, but it is a race condition. You need another PVC waiting for a PV created by Velero to hit this case. But the bug seems major to me; what do you think? Velero will only restore things consistently as long as nothing else is happening on the cluster (which, on a busy cluster, is quite rare).

Do you want a reproducible case?

nrb (Contributor) commented Aug 6, 2020

@arianitu A reproducible case would be awesome!

ashish-amarnath (Contributor) commented:

@arianitu Checking in to see if you were able to get us a reproducible case for this?

arianitu (Author) commented:

@ashish-amarnath still working on this; we're a bit busy, but this issue is a high priority for us, so we hope to have a case soon.

ashish-amarnath (Contributor) commented:

@arianitu I am going to close this issue. Please re-open once you are able to provide a repro case.

arianitu (Author) commented Oct 9, 2020

@ashish-amarnath @nrb Please re-open, reproduction case below:

1. What is the problem?

Velero does not set a ClaimRef on the cloned PV created by --namespace-mappings. Failure to set ClaimRef leads to nondeterministic behavior when binding a PVC to a PV.

Please see kubernetes/kubernetes#23615 (comment)

Original Namespace

  1. Create namespace: kubectl create namespace clone-test-a
  2. Create PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
  namespace: clone-test-a
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ssd
  3. Create a sample deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: clone-test-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.19.3
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nginx-data
          mountPath: "/var/www/html"
      volumes:
      - name: nginx-data
        persistentVolumeClaim:
          claimName: nginx-pvc
  4. Create a backup via velero backup create velero-restore-race --include-namespaces=clone-test-a --snapshot-volumes

Do a namespace clone

  1. Create a namespace: kubectl create namespace clone-test-b
  2. Go to the Google Cloud dashboard and create a new disk from a snapshot (make sure the type is the same as the original and that it's in the same region + zone). Find the snapshot created in the steps above; for me it was gke-us-east1-c-1-1540c-pvc-0516e7df-c4ce-467f-930e-dcb932fe2c27. I named this disk velero-should-not-match.
  3. Create a PV from the velero-should-not-match disk (instead of running velero restore) to simulate the race condition where Velero would create a PV WITHOUT a ClaimRef:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: this-should-not-match
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  gcePersistentDisk:
    fsType: ext4
    pdName: velero-should-not-match
  persistentVolumeReclaimPolicy: Delete

Create a random namespace that is unrelated to the above (assume this is happening in the cluster at the same time as the restore, i.e. the above PV has already been created by Velero but the PVC has not yet been created in clone-test-b):

  1. kubectl create namespace unrelated-namespace
  2. Create a PVC which does not contain a volumeName:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
  namespace: unrelated-namespace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ssd
  3. Notice that the unrelated-namespace PVC got the PV, because the Velero PV does not have a ClaimRef.

kubectl get pv:

this-should-not-match                               1Gi        RWO            Delete           Bound    unrelated-namespace/nginx-pvc                           ssd                     33s

Solution and test

  1. Create a manual disk in GKE again to simulate a Velero restore. You can use the same snapshot as before (gke-us-east1-c-1-1540c-pvc-0516e7df-c4ce-467f-930e-dcb932fe2c27) and name it velero-does-not-match.
  2. Create a PV with a ClaimRef, setting it to the PVC name and target namespace that Velero would have intended for the restore (clone-test-b):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: this-does-not-match
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  gcePersistentDisk:
    fsType: ext4
    pdName: velero-does-not-match
  persistentVolumeReclaimPolicy: Delete
  claimRef:
    name: nginx-pvc
    namespace: clone-test-b
  3. Now repeat with another unrelated namespace: kubectl create namespace unrelated-namespace-2

  4. Create a PVC in it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
  namespace: unrelated-namespace-2
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ssd
  5. Notice that this PVC never binds to the new PV (which has a ClaimRef), leading to consistent behavior for Velero restores:
kubectl get pv
this-does-not-match                                 1Gi        RWO            Delete           Available   clone-test-b/nginx-pvc                                  ssd                     41s

This PVC got a brand new PV which is the correct behaviour:

NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nginx-pvc   Bound    pvc-e15b95f6-0a4f-11eb-9a60-42010a8e0076   1Gi        RWO            ssd            25s

arianitu (Author) commented Oct 9, 2020

Also, this bug is caused here: https://github.com/vmware-tanzu/velero/blob/master/pkg/restore/restore.go#L891

			if shouldRestoreSnapshot {
				// even if we're renaming the PV, obj still has the old name here, because the pvRestorer
				// uses the original name to look up metadata about the snapshot.
				ctx.log.Infof("Restoring persistent volume from snapshot.")
				updatedObj, err := ctx.pvRestorer.executePVAction(obj)
				if err != nil {
					errs.Add(namespace, fmt.Errorf("error executing PVAction for %s: %v", resourceID, err))
					return warnings, errs
				}
				obj = updatedObj
			}

In this case, this function calls executePVAction which does:

func (r *pvRestorer) executePVAction(obj *unstructured.Unstructured) (*unstructured.Unstructured, error) {
	pvName := obj.GetName()
	if pvName == "" {
		return nil, errors.New("PersistentVolume is missing its name")
	}

	// It's simpler to just access the spec through the unstructured object than to convert
	// to structured and back here, especially since the SetVolumeID(...) call below needs
	// the unstructured representation (and does a conversion internally).
	res, ok := obj.Object["spec"]
	if !ok {
		return nil, errors.New("spec not found")
	}
	spec, ok := res.(map[string]interface{})
	if !ok {
		return nil, errors.Errorf("spec was of type %T, expected map[string]interface{}", res)
	}

	delete(spec, "claimRef")

The claimRef is deleted here but never added back. I think that around https://github.com/vmware-tanzu/velero/blob/master/pkg/restore/restore.go#L864 we should capture oldClaimRefName:

oldName := obj.GetName()
oldClaimRefName := ... // get the old claimRef name from the PV's claimRef (before it is cleared)

then under https://github.com/vmware-tanzu/velero/blob/master/pkg/restore/restore.go#L864 we should set claimRef.namespace and claimRef.name accordingly.

			if shouldRenamePV {
				// ...
				ctx.renamedPVs[oldName] = pvName
				obj.SetName(pvName)
				// set claimRef.namespace to the (possibly remapped) namespace and claimRef.name to oldClaimRefName
				// ...
			}

Something along these lines.

@nrb nrb reopened this Oct 12, 2020
nrb (Contributor) commented Oct 12, 2020

@arianitu Thanks for the detailed description of the issue and the proposed fix.

The main thing that needs to happen is that the UID needs to be cleared out of the ClaimRef; the namespace and name can be kept in most instances. In the case that a namespace is being remapped, the namespace inside the ClaimRef will of course need to be updated.

The UID is removed because this is what makes any object unique within the running cluster, even if all other information is the same. Upon restore, the UID of the target PVC to rebind with will not be the same if the PVC was remade, so it needs to be removed in order to ensure that the PVC and PV are linked. As you showed in your example, it's not necessary in the Spec, and is usually used to indicate successful binding since it was an early Kubernetes type and best practices hadn't emerged by then.
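
To illustrate with placeholder values taken from your repro, here is the claimRef as captured in the backup versus what the restored PV should carry when remapping clone-test-a to clone-test-b:

# claimRef as captured in the backup (uid belongs to the original PVC)
claimRef:
  apiVersion: v1
  kind: PersistentVolumeClaim
  name: nginx-pvc
  namespace: clone-test-a
  uid: <uid-of-original-pvc>

# claimRef as the restored/renamed PV should carry it
claimRef:
  apiVersion: v1
  kind: PersistentVolumeClaim
  name: nginx-pvc
  namespace: clone-test-b            # remapped namespace
  # uid omitted: the recreated PVC has a new UID, which Kubernetes fills in when the claim binds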

I'm working on a fix that does this now, I hope to have it available for review tomorrow. I plan to get this in for v1.5.2.
