Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rbd: node fencing, skip pv when pv is not backed by csi #12563

Merged
merged 1 commit into from
Jul 26, 2023

Conversation

subhamkrai
Copy link
Contributor

Description of your changes:
when node fencing is triggered and a negative case where one deployment pod consuming pv not provisioned by csi and another deployment pod with rbd rwo are on the same node and node fencing is triggered.

In this case, rook operator panics since it is triggered to get field pv.Spec.CSI.Driver which will be nil

Which issue is resolved by this Pull Request:
Resolves #12558

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: If this is only a documentation change, add the label skip-ci on the PR.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Copy link
Member

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. we might need to handle static PVC also.

Copy link
Member

@Rakshith-R Rakshith-R left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks !

@subhamkrai
Copy link
Contributor Author

LGTM. we might need to handle static PVC also.

can you explain more? @Madhu-1

@Madhu-1
Copy link
Member

Madhu-1 commented Jul 21, 2023

LGTM. we might need to handle static PVC also.

can you explain more? @Madhu-1

Take a look at the doc https://github.com/ceph/ceph-csi/blob/devel/docs/static-pvc.md#create-rbd-static-pv. the RBD static PV doesn't contains some parameters like rbdImage in the volumeAttributes and volumeHandle looks different from regular volume. i suggest you also test node fencing with static PVC and see how its behavior

@subhamkrai
Copy link
Contributor Author

LGTM. we might need to handle static PVC also.

can you explain more? @Madhu-1

Take a look at the doc https://github.com/ceph/ceph-csi/blob/devel/docs/static-pvc.md#create-rbd-static-pv. the RBD static PV doesn't contain some parameters like rbdImage in the volumeAttributes and volumeHandle looks different from regular volume. I suggest you also test node fencing with static PVC and see how its behavior

okay, I'll test in this pr and update here, adding do-not-merge tag

@subhamkrai subhamkrai added the do-not-merge DO NOT MERGE :) label Jul 21, 2023
@subhamkrai
Copy link
Contributor Author

subhamkrai commented Jul 24, 2023

LGTM. we might need to handle static PVC also.

can you explain more? @Madhu-1

Take a look at the doc https://github.com/ceph/ceph-csi/blob/devel/docs/static-pvc.md#create-rbd-static-pv. the RBD static PV doesn't contains some parameters like rbdImage in the volumeAttributes and volumeHandle looks different from regular volume. i suggest you also test node fencing with static PVC and see how its behavior

I was testing with same example mentioned in the doc and noticed that pvc is not getting bound
with error
Warning VolumeMismatch 8s (x2 over 19s) persistentvolume-controller Cannot bind to requested volume "fs-static-pv": storageClassName does not match

Edit: nevermind, storageClass was not present

Also, in the example mentioned in the doc has driver name "driver":"rbd.csi.ceph.com" and in code we
update rbdPVlist only if

if pv.Spec.CSI.Driver == fmt.Sprintf("%s.rbd.csi.ceph.com", cluster.Namespace) {

so, I believe based on code nodeFencing will not be applied to static pv which I think make sense as well since static pv life time is same as pod and in case of nodeLoss pod is gone.

@Madhu-1
Copy link
Member

Madhu-1 commented Jul 24, 2023

LGTM. we might need to handle static PVC also.

can you explain more? @Madhu-1

Take a look at the doc https://github.com/ceph/ceph-csi/blob/devel/docs/static-pvc.md#create-rbd-static-pv. the RBD static PV doesn't contains some parameters like rbdImage in the volumeAttributes and volumeHandle looks different from regular volume. i suggest you also test node fencing with static PVC and see how its behavior

I was testing with same example mentioned in the doc and noticed that pvc is not getting bound with error Warning VolumeMismatch 8s (x2 over 19s) persistentvolume-controller Cannot bind to requested volume "fs-static-pv": storageClassName does not match

Edit: nevermind, storageClass was not present

Also, in the example mentioned in the doc has driver name "driver":"rbd.csi.ceph.com" and in code we update rbdPVlist only if

if pv.Spec.CSI.Driver == fmt.Sprintf("%s.rbd.csi.ceph.com", cluster.Namespace) {

You need to create static PVC with the same driver deployed by csi driver. you need to use rbd storageclass with what ever the csi driver is deployed by Rook.

so, I believe based on code nodeFencing will not be applied to static pv which I think make sense as well since static pv life time is same as pod and in case of nodeLoss pod is gone.

Sorry I didn't get above comment.

@subhamkrai
Copy link
Contributor Author

@Madhu-1 NodeFence cr was not created because the code failed to run command rbd status pool/imageName with static pv there is no ImageName in the spec

 volumeAttributes:
        clusterID: rook-ceph
        imageFeatures: layering
        pool: replicapool
        staticVolume: "true"
      volumeHandle: static-image

@Madhu-1
Copy link
Member

Madhu-1 commented Jul 24, 2023

You can add a check for rbdPV.Spec.CSI.VolumeAttributes["pool"] !="" || rbdPV.Spec.CSI.VolumeAttributes["imageName"] != "" provide a better warning/error message.

@subhamkrai subhamkrai force-pushed the fix-panic-nodeLoss branch 2 times, most recently from 415bf85 to 2cf9fca Compare July 24, 2023 08:33
Copy link
Member

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@subhamkrai subhamkrai added backport-release-1.12 and removed do-not-merge DO NOT MERGE :) labels Jul 25, 2023
@@ -262,11 +262,20 @@ func listRBDPV(listPVs *corev1.PersistentVolumeList, cluster *cephv1.CephCluster
var listRbdPV []corev1.PersistentVolume

for _, pv := range listPVs.Items {
// Ignore PVs that support multinode access (RWX, ROX), since they can be mounted on multiple nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was mistake, restored, thanks

// Ignore PVs that support multinode access (RWX, ROX), since they can be mounted on multiple nodes.
if pvSupportsMultiNodeAccess(pv.Spec.AccessModes) {
// Skip if pv is not provisioned by CSI
if pv.Spec.CSI == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add a function donotPrrovision and can move all this check there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this is comparably very small function, I think we can keep in a single function.

Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see these changes are functionally pretty small. Only a few fairly minor comments

pkg/operator/ceph/cluster/watcher.go Outdated Show resolved Hide resolved
pkg/operator/ceph/cluster/watcher.go Show resolved Hide resolved
pkg/operator/ceph/cluster/watcher.go Outdated Show resolved Hide resolved
@subhamkrai subhamkrai force-pushed the fix-panic-nodeLoss branch 2 times, most recently from 290171c to 64208a7 Compare July 26, 2023 06:51
when node fencing is triggered and a negative case where
one deployment pod consuming pv not provisioned by csi
and another deployment pod with rbd rwo are on the same
node and node fencing is triggered.

In this case rook operator panics since it is trigger to get
field `pv.Spec.CSI.Driver` which will be nil.

Also, skip static pv as there is no `imageName` in the spec
rbd status command will fail.

Signed-off-by: subhamkrai <srai@redhat.com>
Copy link
Member

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BlaineEXE BlaineEXE merged commit 45a1a6c into rook:master Jul 26, 2023
41 of 49 checks passed
@subhamkrai subhamkrai deleted the fix-panic-nodeLoss branch July 26, 2023 16:44
BlaineEXE added a commit that referenced this pull request Jul 26, 2023
rbd: node fencing, skip pv when pv is not backed by csi (backport #12563)
@subhamkrai subhamkrai restored the fix-panic-nodeLoss branch July 27, 2023 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic when operator is fencing a node
6 participants