New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: Continue processing PVs for network fencing when no node IPs found #13768
Conversation
The cephfs PVC might exist on the kubernetes node object but due to some timing issues the ip might not be visible on the ceph cluster or the client might already got evicted or disconnected from ceph cluster. In this case we will not be able to get IP details for the subvolume and we dont have any check for empty ip's in the code and rook tries to create NetworkFence CR with empty Ip's and the NetworkFence will get moved to the Failed state. This PR adds the necessary check and logging to prevent this one. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
to keep the code consistent and to propogate more details about error rook uses error wraping. updating current code whereever its required to wrap the details. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
66499e8
to
eb97390
Compare
pkg/operator/ceph/cluster/watcher.go
Outdated
|
||
if len(ips) == 0 { | ||
logger.Infof("no active mds clients found for cephfs volume %q", cephFSPV.Name) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By returning nil
, shouldn't this mean the fencing was successful? But the fencing is being skipped. How will the fencing be retried in the future if we are returning nil? Don't we need to return an error so the fencing will try again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes true, we are doing the same for RBD
rook/pkg/operator/ceph/cluster/watcher.go
Lines 404 to 409 in 64cf8f2
if len(ips) != 0 { | |
err = c.createNetworkFence(ctx, rbdPV, node, cluster, ips, rbdDriver) | |
if err != nil { | |
return pkgerror.Wrapf(err, "failed to create network fence for node %q", node.Name) | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both rbd and cephfs, the calling method is in a for
loop. If nil
is returned, the for
loop breaks out. In the case of no IPs found, should the for
loop keep iterating through other PVs instead of aborting? Or all PVs will have the same result of no IPs found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you are correct we should try for other PVC, added new comment for introducing new check. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a whitespace suggestion
pkg/operator/ceph/cluster/watcher.go
Outdated
if len(ips) == 0 { | ||
logger.Infof("no active rbd clients found for rbd volume %q", rbdPV.Name) | ||
return errActiveClientNotFound | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove blank line
retry other cephfs/rbd pvc if there are no active clients found on ceph cluster. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
4f23c17
to
ca21bd9
Compare
// continue to fence next rbd volume if active client not found | ||
if stderrors.Is(err, errActiveClientNotFound) { | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR title says cephfs. This seems to be operating on RBD PVs. Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR was originally just for cephfs, but also expanded for rbd. I've updated the title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BlaineEXE sorry i missed the ping, Thanks @travisn for updating it
core: Continue processing PVs for network fencing when no node IPs found (backport #13768)
The cephfs PVC might exist on the kubernetes node object but due to some timing issues the ip might not be visible on the ceph cluster or the client might have already been evicted or disconnected from the ceph cluster. In this case, we will not be able to get IP details for the subvolume and we don't have any check for empty ip's in the code rook tries to create NetworkFence CR with an empty Ip's, and the NetworkFence will get moved to the Failed state. This PR adds the necessary checks and logging to prevent this one.
Checklist: