Upgrading from Rook v1.6.0-v1.6.4 locks up all RBD and CephFS mounted volumes permanently #8085
Comments
|
@Rakshith-R Can you repro this issue? |
This commit changes CSI_ENABLE_HOST_NETWORK default value to true since it was observed that with ceph v15.2.12/13 cephfs volume gets blocked when csi cephfs nodeplugin is restarted and nodeunpublish also hangs when cephfs nodeplugin is using pod networking. This issue does not occur with host networking. Updates: rook#8085 Signed-off-by: Rakshith R <rar@redhat.com>
This commit changes CSI_ENABLE_HOST_NETWORK default value to true since it was observed that with ceph v15.2.12/13 cephfs volume gets blocked when csi cephfs nodeplugin is restarted and nodeunpublish call also hangs when cephfs nodeplugin is using pod networking. This issue does not occur with host networking or ceph v16. Updates: rook#8085 Signed-off-by: Rakshith R <rar@redhat.com>
|
@sfynx, can you please set CSI_ENABLE_HOST_NETWORK to true and try? (This worked in my local testing.)
To summarize:
This issue is not observed with host networking or with Ceph v16.
Workaround: enable host networking. |
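For anyone applying that workaround, here is a minimal sketch of one way to flip the setting on a running cluster; the rook-ceph namespace and the rook-ceph-operator-config ConfigMap name are the usual Rook defaults and are assumptions here, so adjust them to your deployment:

```sh
# Enable host networking for the Rook CSI driver pods (assumed default
# namespace and operator ConfigMap name; verify before applying).
kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'
# The operator should then redeploy the CSI plugin DaemonSets with
# hostNetwork enabled.
```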
This commit changes CSI_ENABLE_HOST_NETWORK default value to true since it was observed that cephfs volume gets blocked when csi cephfs nodeplugin is restarted and nodeunpublish call also hangs when cephfs nodeplugin is using pod networking. Updates: rook#8085 Signed-off-by: Rakshith R <rar@redhat.com>
|
@Rakshith-R Thanks for testing it out. @travisn @leseb For Multus we use pod networking for the plugin pods. Does the same issue exist for Multus also? |
|
A similar issue exists for RBD. For RBD, writes hang. |
|
|
@idryomov @kotreshhr any idea what could be wrong with pod networking for plugin pods? |
This commit changes CSI_ENABLE_HOST_NETWORK default value to true since it was observed that cephfs volume gets blocked when csi cephfs nodeplugin is restarted and nodeunpublish call also hangs when cephfs nodeplugin is using pod networking. Updates: #8085 Signed-off-by: Rakshith R <rar@redhat.com> (cherry picked from commit f489aba)
I'd say yes, but now that we have host networking enabled I wonder if Multus is broken. @rohan47, please confirm. |
|
Looks like if Multus is enabled we set host networking to false, but we need to test this scenario and see whether we are able to read and write after a plugin restart. |
This commit changes CSI_ENABLE_HOST_NETWORK default value to true since it was observed that cephfs volume gets blocked when csi cephfs nodeplugin is restarted and nodeunpublish call also hangs when cephfs nodeplugin is using pod networking. Updates: rook#8085 Signed-off-by: Rakshith R <rar@redhat.com> (cherry picked from commit f489aba) (cherry picked from commit 2ffaebb)
|
Thanks for all your effort so far. I can indeed confirm that I can work around this issue with the latest Rook/CephCSI version when setting the CSI_ENABLE_HOST_NETWORK option to true. |
|
Rook v1.6.1 to 1.6.4 are affected by this issue.
Otherwise, a driver nodeplugin restart will block mounted volumes, and pod deletion will also hang at the nodeunpublish call. Step 1 should also be followed when upgrading from Rook v1.6.4 to v1.6.5, as host networking is true by default in that version. |
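As an illustration of that pre-upgrade step (nothing below is from the thread: the my-app namespace and deployment name are placeholders, and jq is just one convenient way to filter), one might locate the pods that mount PersistentVolumeClaims and scale their owners down before upgrading:

```sh
# List pods in a namespace that mount a PersistentVolumeClaim
# (placeholder namespace; requires jq).
kubectl get pods -n my-app -o json \
  | jq -r '.items[]
           | select(any(.spec.volumes[]?; .persistentVolumeClaim != null))
           | .metadata.name'

# Scale the owning workloads to zero so no Ceph-backed volume is mounted
# during the Rook upgrade, then scale back up once the new CSI driver
# pods are running.
kubectl -n my-app scale deployment my-app --replicas=0
# ... upgrade Rook ...
kubectl -n my-app scale deployment my-app --replicas=1
```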
|
@Rakshith-R What do you mean by step 1? I've just upgraded to 1.6.5 before reading this; everything stopped seeing disks. I rolled back, and somehow part of my cluster recovered. Please describe thoroughly what should be done before the 1.6.5 upgrade. Should I delete all pods with persistent volumes mounted? P.S. I noticed problems with RBD volumes, not CephFS. |
|
Okay, my story from today: |
@Antiarchitect, all the volumes mounted at the time of the update will be blocked. Please delete the pods using these volumes before the update. Everything should work normally after the update, once the new cephcsi drivers are deployed. I've edited the description; hope it's clear now. |
|
Pods will be recreated immediately if I just delete them. Is there any way to do it after the upgrade? |
@Antiarchitect Pod deletion after the update will also hang. The only way to release the volume and get it working again is to force delete the pod (according to my experiments, which may be risky). |
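For reference, a force delete of such a stuck pod looks like the sketch below; the pod and namespace names are placeholders, and as noted above skipping graceful termination carries some risk:

```sh
# Force delete a pod whose volume teardown is hanging (placeholder names).
# --grace-period=0 together with --force skips graceful termination.
kubectl -n <namespace> delete pod <pod-name> --grace-period=0 --force
```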
|
Confirming similar 'hang' issues after upgrading from Rook 1.6.4 to 1.6.5 with all rook-ceph backed workloads (only running cephrbd here). Checking the kernel log shows:

[Fri Jun 11 10:41:37 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:37 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:38 2021] libceph: connect (1)10.43.21.150:6789 error -101
[Fri Jun 11 10:41:38 2021] libceph: mon0 (1)10.43.21.150:6789 connect error
[Fri Jun 11 10:41:39 2021] libceph: connect (1)10.43.21.150:6789 error -101

For reference, ceph chart config is here and ceph cluster config is here. Rolling back to Rook 1.6.4 did not seem to immediately resolve the issues. A forceful reboot of all nodes with Ceph workloads was required in my case in order to restore storage capabilities. |
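As a side note on reading that log, error -101 is the Linux errno ENETUNREACH ("network is unreachable"), i.e. the in-kernel RBD/CephFS client can no longer reach the monitor address. One way to watch for these messages on an affected node (just one option, not taken from the comment above):

```sh
# Follow kernel log messages from the libceph client; -T prints
# human-readable timestamps and -w waits for new messages.
dmesg -Tw | grep libceph
```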
... to maybe fix issues related to rook/rook#8085 Signed-off-by: Jeff Billimek <jeff@billimek.com>
|
@billimek Confirming that a reboot of the nodes did the trick, and that's horrible. |
|
Hi, I hit this bug as I updated from v1.6.4 to v1.7.3. It was truly a horrific experience. So after 1.6.5, the CSI_ENABLE_HOST_NETWORK default is true? |
This is painful for sure. This issue was pinned in an attempt to make it more visible, but it is certainly still difficult to notice. I've updated the issue title to make it more obvious what the side effect is and which versions are affected when upgrading. We'll add a note to the upgrade guide as well. |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions. |
|
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation. |
@Madhu-1 If one sets up a Multus network without host networking (for security reasons), then creates a filesystem, and the cephcsi pods restart, all cephcsi pods lose their mounts and everything is jammed until the nodes are rebooted. Sounds like a blocker. Please re-open this issue as it's blocking our production environment. |
Add a section detailing a design to handle the volume-operations-blocked on CSI restart issue when running with Multus. See: rook#8085 Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
|
I've stumbled on this issue while going through the upgrade process. In order to make things easier and allow a certain degree of control, I've decided to take a different approach. |
See rook/rook#8085 (comment) Signed-off-by: Alexander Trost <galexrt@googlemail.com>
Deviation from expected behavior:
I noticed that when you mount a CephFS volume in a Pod using Rook 1.6.4 / cephcsi 3.3.1 and then restart or terminate the CSI CephFS plugin (e.g. by restarting or deleting its DaemonSet), all operations on the volume become blocked, even after restarting the CSI pods.
Using Rook 1.6.0 / cephcsi 3.3.0 I cannot cause it to happen initially; however, I can when forcing cephcsi 3.3.0 with Rook 1.6.4 later. Rook v1.5 / cephcsi 3.2.x does not show this behavior: there the volumes remain accessible even when the CSI plugin is not available.
Expected behavior:
Volumes at least becoming responsive again after CSI plugin pods are back up.
How to reproduce it (minimal and precise):
Environment:
Rook version (use rook version inside of a Rook Pod): 1.6.4
Ceph version (use ceph -v): 15.2.13, kernel driver
Kubernetes version (use kubectl version): v1.19.10-gke.1700

Simply restarting the CSI plugin should not cause things to become unavailable permanently this way; I find that with other CSI plugins, service is always restored after they are back up. Is anyone else able to replicate this issue?
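For anyone trying to reproduce the behavior described above, a minimal sketch of the restart step follows; the csi-cephfsplugin DaemonSet name and the rook-ceph namespace are the usual Rook defaults but are assumptions here, and the workload pod and mount path are placeholders:

```sh
# Restart the CephFS CSI nodeplugin DaemonSet (assumed default names).
kubectl -n rook-ceph rollout restart daemonset csi-cephfsplugin

# Then run any I/O against a CephFS-backed mount in a workload pod and
# observe whether it hangs (placeholder pod name and path).
kubectl exec -it <pod-with-cephfs-pvc> -- ls /data
```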