
Upgrading from Rook v1.6.0-v1.6.4 locks up all RBD and CephFS mounted volumes permanently #8085

Closed
sfynx opened this issue Jun 8, 2021 · 31 comments

Comments

@sfynx
Contributor

sfynx commented Jun 8, 2021

Deviation from expected behavior:
I noticed that when you mount a CephFS volume in a Pod using Rook 1.6.4 / cephcsi 3.3.1 and then restart or terminate the CSI CephFS plugin (e.g. by restarting or deleting its DaemonSet), all operations on the volume become blocked, even after the CSI pods are restarted.
With Rook 1.6.0 / cephcsi 3.3.0 I cannot trigger it initially, but I can when later forcing cephcsi 3.3.0 together with Rook 1.6.4. Rook v1.5 / cephcsi 3.2.x does not show this behavior; there the volumes remain accessible even while the CSI plugin is unavailable.

Expected behavior:
Volumes should at least become responsive again once the CSI plugin pods are back up.

How to reproduce it (minimal and precise):

  1. Install Rook v1.6.4 with its default cephcsi 3.3.1
  • I used Ceph 15.2.13 so I could keep the same Ceph version between Rook 1.5 and 1.6
  2. Create a CephCluster and CephFilesystem
  3. Create a CephFS PVC and mount it in a Pod
  4. Kill the csi-cephfsplugin DaemonSet: the volume becomes unresponsive (see the sketch after this list)
  5. Restart the csi-cephfsplugin DaemonSet: the volume stays unresponsive
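For reference, steps 4-5 amount to roughly the following; the rook-ceph namespace and the app=csi-cephfsplugin pod label are assumptions based on a default install, not details taken from this cluster:

kubectl -n rook-ceph delete daemonset csi-cephfsplugin    # I/O in the test Pod hangs immediately
# the Rook operator recreates the DaemonSet shortly afterwards
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin     # new plugin pods are Running, yet the volume stays hung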

Environment:

  • Cloud provider or hardware configuration: Google
  • Rook version (use rook version inside of a Rook Pod): 1.6.4
  • Storage backend version (e.g. for ceph do ceph -v): 15.2.13, kernel driver
  • Kubernetes version (use kubectl version): v1.19.10-gke.1700
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): GKE

Simply restarting the CSI plugin should not make volumes permanently unavailable this way; with other CSI plugins I find that service is always restored once they are back up. Is anyone else able to replicate this issue?

@sfynx sfynx added the bug label Jun 8, 2021
@sfynx sfynx changed the title Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cepfhsplugin locks up all mounted volumes permanently Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cepfsplugin locks up all mounted volumes permanently Jun 8, 2021
@sfynx sfynx changed the title Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cepfsplugin locks up all mounted volumes permanently Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cephfsplugin locks up all mounted volumes permanently Jun 8, 2021
@travisn
Member

travisn commented Jun 9, 2021

@Rakshith-R Can you repro this issue?

@Madhu-1 Madhu-1 added the csi label Jun 10, 2021
Rakshith-R added a commit to Rakshith-R/rook that referenced this issue Jun 10, 2021
This commit changes the CSI_ENABLE_HOST_NETWORK default value to true,
since it was observed that with ceph v15.2.12/13 the cephfs volume
gets blocked when the csi cephfs nodeplugin is restarted and
nodeunpublish also hangs when the cephfs nodeplugin is using pod networking.
This issue does not occur with host networking.

Updates: rook#8085

Signed-off-by: Rakshith R <rar@redhat.com>
Rakshith-R added a commit to Rakshith-R/rook that referenced this issue Jun 10, 2021
This commit changes CSI_ENABLE_HOST_NETWORK default value to true
since it was observed that with ceph v15.2.12/13 cephfs volume gets
blocked when csi cephfs nodeplugin is restarted and nodeunpublish
call also hangs when cephfs nodeplugin is using pod networking.
This issue does not occur with host networking or ceph v16.

Updates: rook#8085

Signed-off-by: Rakshith R <rar@redhat.com>
@Rakshith-R
Member

Rakshith-R commented Jun 10, 2021

@sfynx, can you please set the following to true and try it? (This worked in my local testing.)

# CSI_ENABLE_HOST_NETWORK: "true"

To summarize:

  • cephfs volumes are mounted by the cephfsplugin using the pod network
  • when the cephfsplugin pod is restarted, the cephfs volume hangs and blocks any activity even after the new cephfsplugin pod comes online
  • (I guess this is probably caused by the IP change when the pod is restarted)
  • the NodeUnpublish call also hangs, which blocks pod deletion

This issue is not observed with

  • host networking enabled

Workaround: enable host networking, e.g. as sketched below.
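A minimal sketch of applying that with kubectl, assuming the operator settings live in the rook-ceph-operator-config ConfigMap in the rook-ceph namespace as in the stock operator.yaml (adjust names to your install):

kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'
# the operator then redeploys the CSI plugin DaemonSets with hostNetwork: true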

cc: @Madhu-1 @travisn

Rakshith-R added a commit to Rakshith-R/rook that referenced this issue Jun 10, 2021
This commit changes CSI_ENABLE_HOST_NETWORK default value to true
since it was observed that cephfsvolume gets blocked when csi cephfs
nodeplugin is restarted and nodeunpublish call also hangs when
cephfs nodeplugin is using pod networking.

Updates: rook#8085

Signed-off-by: Rakshith R <rar@redhat.com>
@Madhu-1
Member

Madhu-1 commented Jun 10, 2021

@Rakshith-R Thanks for testing it out. @travisn @leseb for multus we use pod networking for the plugin pods. Does the same issue exist for multus as well?

@Madhu-1
Member

Madhu-1 commented Jun 10, 2021

A similar issue exists for RBD; for RBD, writes hang.

@Madhu-1
Member

Madhu-1 commented Jun 10, 2021

[root@dhcp53-170 rbd]# kubectl exec -it csirbd-demo-pod -- sh
# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          17G  6.5G  9.5G  41% /
tmpfs            64M     0   64M   0% /dev
tmpfs           2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/vda1        17G  6.5G  9.5G  41% /etc/hosts
shm              64M     0   64M   0% /dev/shm
/dev/rbd0       976M  2.6M  958M   1% /var/lib/www/html
tmpfs           2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           2.9G     0  2.9G   0% /proc/acpi
tmpfs           2.9G     0  2.9G   0% /proc/scsi
tmpfs           2.9G     0  2.9G   0% /sys/firmware
# cd /var/lib/www/html
# echo asddsf >a
# sync
# exit
[root@dhcp53-170 rbd]# kubectl get po -nrook-ceph
NAME                                            READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-5dqqh                          3/3     Running     0          45m
csi-cephfsplugin-provisioner-59499cbcdd-cdc2s   6/6     Running     0          76m
csi-rbdplugin-provisioner-857d65496c-tglvt      6/6     Running     0          76m
csi-rbdplugin-wjk5q                             3/3     Running     0          76m
rook-ceph-mds-myfs-a-685dc486c4-nkvtl           1/1     Running     0          75m
rook-ceph-mds-myfs-b-56985c4474-wk277           1/1     Running     0          75m
rook-ceph-mgr-a-5c65754f6f-546rb                1/1     Running     1          129m
rook-ceph-mon-a-688c7b6d9c-pm6rh                1/1     Running     0          129m
rook-ceph-operator-65b84f48f8-rql5w             1/1     Running     0          78m
rook-ceph-osd-0-76d86c868c-2dbw4                1/1     Running     0          77m
rook-ceph-osd-prepare-minicluster1-957ds        0/1     Completed   0          77m
rook-ceph-tools-5c69594764-fzqdm                1/1     Running     0          131m

[root@dhcp53-170 rbd]# kubectl delete po/csi-rbdplugin-wjk5q -nrook-ceph
pod "csi-rbdplugin-wjk5q" deleted
[root@dhcp53-170 rbd]# kubectl exec -it csirbd-demo-pod -- sh
# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          17G  6.5G  9.5G  41% /
tmpfs            64M     0   64M   0% /dev
tmpfs           2.9G     0  2.9G   0% /sys/fs/cgroup
/dev/vda1        17G  6.5G  9.5G  41% /etc/hosts
shm              64M     0   64M   0% /dev/shm
/dev/rbd0       976M  2.6M  958M   1% /var/lib/www/html
tmpfs           2.9G   12K  2.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           2.9G     0  2.9G   0% /proc/acpi
tmpfs           2.9G     0  2.9G   0% /proc/scsi
tmpfs           2.9G     0  2.9G   0% /sys/firmware
# cd /var/lib/www/html
# ls
a  lost+found
# cat a
asddsf
# echo fdsfsdfd >b   /// Hangs here

@Madhu-1
Member

Madhu-1 commented Jun 10, 2021

@idryomov @kotreshhr any idea what could be wrong with pod networking for plugin pods?

mergify bot pushed a commit that referenced this issue Jun 10, 2021
This commit changes CSI_ENABLE_HOST_NETWORK default value to true
since it was observed that cephfsvolume gets blocked when csi cephfs
nodeplugin is restarted and nodeunpublish call also hangs when
cephfs nodeplugin is using pod networking.

Updates: #8085

Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit f489aba)
@leseb leseb pinned this issue Jun 10, 2021
@travisn travisn changed the title Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cephfsplugin locks up all mounted volumes permanently Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cephfsplugin locks up all mounted volumes permanently when CSI driver does not have host networking enabled Jun 10, 2021
@leseb
Member

leseb commented Jun 10, 2021

@Rakshith-R Thanks for testing it out. @travisn @leseb for multus we use pod networking for the plugin pods. Does the same issue exist for multus as well?

I'd say yes, but now that we have host networking enabled I wonder if Multus is broken. @rohan47 please confirm.

@Madhu-1
Member

Madhu-1 commented Jun 10, 2021

Looks like if multus is enabled we set host networking to false, but we need to test this scenario and see whether we are able to read and write after a plugin restart.

travisn pushed a commit to travisn/rook that referenced this issue Jun 10, 2021
This commit changes CSI_ENABLE_HOST_NETWORK default value to true
since it was observed that cephfsvolume gets blocked when csi cephfs
nodeplugin is restarted and nodeunpublish call also hangs when
cephfs nodeplugin is using pod networking.

Updates: rook#8085

Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit f489aba)
(cherry picked from commit 2ffaebb)
@sfynx
Contributor Author

sfynx commented Jun 11, 2021

Thanks for all your effort so far. I can confirm that I can work around this issue with the latest Rook/CephCSI version by setting the CSI_ENABLE_HOST_NETWORK option (.Values.csi.enableCSIHostNetwork through the Helm chart) to true. Even when the CSI plugin is completely shut down, all volumes now stay functional.
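For Helm users, that change is roughly the following (a sketch assuming the chart was installed as release rook-ceph from the rook-release repo; adjust the release name, repo, and namespace to your setup):

helm upgrade rook-ceph rook-release/rook-ceph --namespace rook-ceph \
  --reuse-values --set csi.enableCSIHostNetwork=true
# the operator then rolls out new CSI plugin pods with hostNetwork: true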

@Rakshith-R
Member

Rakshith-R commented Jun 11, 2021

Rook v1.6.1 to v1.6.4 are affected by this issue.
Users on these versions should follow these steps (a rough sketch of the sequence follows this list):

  1. Unmount all volumes provisioned by the cephcsi driver running on the pod network
    (i.e. delete all pods using PVCs provisioned by the cephcsi driver).

  2. Only then set ROOK_CSI_ENABLE_HOST_NETWORK to true in the rook-ceph-operator-config configmap, because this change will still lead to a plugin restart and IP change.

Otherwise, a nodeplugin restart of the driver will block mounted volumes, and pod deletion will also hang at the NodeUnpublish call.

Step 1 should also be followed when upgrading from Rook v1.6.4 to v1.6.5, as host network is true by default in that version.
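A hypothetical outline of that sequence with kubectl (the deployment name is a placeholder, and the ConfigMap key shown is the CSI_ENABLE_HOST_NETWORK setting mentioned earlier in this thread; verify both against your cluster before running anything):

kubectl scale deployment my-app --replicas=0                      # step 1: stop pods that use cephcsi-provisioned PVCs
kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'   # step 2: switch the CSI plugins to host networking
kubectl -n rook-ceph rollout status daemonset/csi-rbdplugin       # wait for the replaced plugin pods
kubectl -n rook-ceph rollout status daemonset/csi-cephfsplugin
kubectl scale deployment my-app --replicas=1                      # bring the workloads back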

cc @Madhu-1 @travisn @leseb

@Antiarchitect

Antiarchitect commented Jun 11, 2021

@Rakshith-R What do you mean by step 1? I upgraded to 1.6.5 before reading this and everything stopped seeing its disks; I rolled back and somehow part of my cluster was restored. Please describe thoroughly what should be done before the 1.6.5 upgrade. Should I delete all pods with persistent volumes mounted?

P.S. I noticed problems with RBD volumes, not CephFS.

@Antiarchitect

Okay, my story for today:
I just upgraded to 1.6.5 from 1.6.4 and every pod that had a PV attached stopped working, as if the physical disk had become unavailable. I rolled back to 1.6.4 ASAP and noticed that multiple pods could not mount their PVs.

@Rakshith-R
Member

Rakshith-R commented Jun 11, 2021

@Rakshith-R What do you mean by step 1? I upgraded to 1.6.5 before reading this and everything stopped seeing its disks; I rolled back and somehow part of my cluster was restored. Please describe thoroughly what should be done before the 1.6.5 upgrade. Should I delete all pods with persistent volumes mounted?

P.S. I noticed problems with RBD volumes, not CephFS.

@Antiarchitect, all the volumes mounted at the time of the update will be blocked. Please delete the pods using these volumes before the update.

Everything should work normally after the update, once the new cephcsi drivers are deployed.

I've edited the description; I hope it's clearer now.

@Antiarchitect

Antiarchitect commented Jun 11, 2021

Pods will be recreated immediately if I just delete them. Is there any way to do it after the upgrade?

@Rakshith-R
Member

Rakshith-R commented Jun 11, 2021

Pods will be recreated immediately if I just delete them. Is there any way to do it after the upgrade?

@Antiarchitect
Mounts created by CSI drivers running on pod networking (Rook 1.6.1-1.6.4) will hang if the drivers are restarted.
Try scaling deployment replicas down to 0 before updating.

Pod deletion after the update will also hang. The only way to retrieve the volume and get it working again is to force delete the pod (according to my experiments; this may be risky). Both commands are sketched below.
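A minimal, hypothetical sketch of those two operations (the deployment and pod names are placeholders):

kubectl scale deployment my-app --replicas=0                          # scale the workload down before updating Rook
kubectl delete pod my-app-6d4f9c7b8-abcde --grace-period=0 --force    # last resort for a pod stuck at NodeUnpublish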

@billimek
Contributor

billimek commented Jun 11, 2021

Confirming similar 'hang' issues after upgrading from rook 1.6.4 to 1.6.5 with all rook-ceph backed workloads (only running cephrbd here).

Checking dmesg output from one of the hosts reveals a lot of,

[Fri Jun 11 10:41:37 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:37 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:38 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:38 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:39 2021] libceph: connect (1)10.43.21.150:6789 error -101

For reference, ceph chart config is here and ceph cluster config is here.

Rolling back to Rook 1.6.4 did not seem to immediately resolve the issues.

A forceful reboot of all nodes with ceph workloads was required in my case in order to restore storage capabilities.
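For anyone else hitting this, the per-node recovery amounted to something like the following (a sketch; the node name and SSH access are specific to my environment, adapt as needed):

kubectl drain node-1 --ignore-daemonsets    # move workloads off the node
ssh node-1 'sudo reboot'                    # clear the stuck kernel RBD/CephFS mounts
kubectl uncordon node-1                     # let workloads schedule back, then repeat for the next node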

billimek added a commit to billimek/k8s-gitops that referenced this issue Jun 11, 2021
... to maybe fix issues related to rook/rook#8085

Signed-off-by: Jeff Billimek <jeff@billimek.com>
@Antiarchitect

@billimek I confirm that a reboot of the nodes did the trick, and that's horrible.

@Raboo

Raboo commented Oct 4, 2021

Hi,

I hit this bug as I updated from v1.6.4 to v1.7.3. It was truly a horrific experience.
This is really something that should be mentioned in the upgrade documentation to prepare users.

So after 1.6.5, is CSI_ENABLE_HOST_NETWORK defaulted to true, or is it fixed in some other way?

@travisn travisn changed the title Rook 1.6.4 / cephcsi 3.3.1: shutting down csi-cephfsplugin locks up all mounted volumes permanently when CSI driver does not have host networking enabled Upgrading from Rook v1.6.0-v1.6.4 to newer releases locks up all mounted volumes permanently Oct 4, 2021
@travisn
Member

travisn commented Oct 4, 2021

Hi,

I hit this bug as I updated from v1.6.4 to v1.7.3. It was truly a horrific experience. This is really something that should be mentioned in the upgrade documentation to prepare users.

So after 1.6.5, is CSI_ENABLE_HOST_NETWORK defaulted to true, or is it fixed in some other way?

This is painful for sure. This issue was pinned in an attempt to make it more visible, but it is certainly still difficult to notice. I've updated the issue title to make it more obvious what the side effect is and which versions are affected when upgrading. We'll add a note to the upgrade guide as well.

@travisn travisn changed the title Upgrading from Rook v1.6.0-v1.6.4 to newer releases locks up all mounted volumes permanently Upgrading from Rook v1.6.0-v1.6.4 locks up all CephFS mounted volumes permanently Oct 4, 2021
@travisn travisn changed the title Upgrading from Rook v1.6.0-v1.6.4 locks up all CephFS mounted volumes permanently Upgrading from Rook v1.6.0-v1.6.4 locks up all RBD and CephFS mounted volumes permanently Oct 6, 2021
@github-actions github-actions bot removed the wontfix label Oct 6, 2021
@github-actions

github-actions bot commented Dec 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@laptimus

laptimus commented Dec 14, 2021

@Madhu-1

Looks like if multus is enabled we set host networking to false, but we need to test this scenario and see whether we are able to read and write after a plugin restart.

@Madhu-1 if one sets up a multus network without host networking (for security reasons), then creates a filesystem and the cephcsi pods restart, all cephcsi pods lose their mounts and everything is jammed until the nodes are rebooted. Sounds like a blocker.

Please re-open this issue, as it's blocking our production environment.

@Rakshith-R
Member

@laptimus, #8686 is being proposed as a fix and is in progress.

This issue was caused by the pod network being chosen as the default for the CSI daemon pods, which is what caused the trouble.

It would make more sense to open a fresh issue for the multus setup.

BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Jan 24, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Jan 25, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Jan 25, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Jan 25, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Feb 3, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
BlaineEXE added a commit to BlaineEXE/rook that referenced this issue Feb 3, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
mergify bot pushed a commit that referenced this issue Feb 4, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: #8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
(cherry picked from commit 2bf1e1f)
parth-gr pushed a commit to parth-gr/rook that referenced this issue Feb 22, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
parth-gr pushed a commit to parth-gr/rook that referenced this issue Feb 22, 2022
Add a section detailing a design to handle the volume-operations-blocked
on CSI restart issue when running with Multus.

See: rook#8085

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
@travisn travisn unpinned this issue Apr 26, 2022
@aneagoe

aneagoe commented May 4, 2022

I stumbled on this issue while going through the upgrade process. To make things easier and allow a certain degree of control, I decided to take a different approach.
I created a MutatingWebhookConfiguration and controller that modify the CSI pods so that they use hostNetwork: true. This allowed me to cordon/drain nodes in a rolling fashion (gracefully moving workloads to other nodes) and ensure that newly created CSI pods will not have this issue. After all nodes are processed, it's just a matter of following the normal upgrade procedure. The per-node pass is sketched below.
All the code can be found at https://github.com/aneagoe/rook-upgrade-webhook; I hope this can help people who haven't upgraded yet.
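The rolling per-node pass looks roughly like this (a sketch that assumes the webhook from the repo above is already installed; the node name, namespace, and the app=csi-rbdplugin / app=csi-cephfsplugin labels are assumptions to verify against your cluster):

kubectl cordon node-1                                            # keep new workloads off the node
kubectl drain node-1 --ignore-daemonsets                         # move existing workloads elsewhere
kubectl -n rook-ceph delete pod --field-selector spec.nodeName=node-1 \
  -l 'app in (csi-rbdplugin,csi-cephfsplugin)'                   # replacement CSI pods get hostNetwork: true from the webhook
kubectl uncordon node-1                                          # repeat for the next node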

PeeK1e pushed a commit to vanillastack/vanillastack that referenced this issue Aug 8, 2022
See rook/rook#8085 (comment)

Signed-off-by: Alexander Trost <galexrt@googlemail.com>