Mon failover can cause mons to fall out of quorum if the operator is disrupted in the middle of the failover #7797

Closed
travisn opened this issue Apr 30, 2021 · 1 comment · Fixed by #7884

travisn commented Apr 30, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
There is a timing issue during mon failover such that if the operator dies at the wrong time and the original mon does not come back up, mon quorum can be lost. In a test cluster, the following sequence appears to have occurred.

  1. A cluster was installed
  2. Node maintenance was started and nodes were drained one at a time and rebooted.
  3. For each node:
    • Drain the node
    • Reboot the node, which would typically take 10-15 minutes

At some point in this cycle, the following occurred:

  1. mon-a had been down for more than 10 minutes waiting for its drained node to come back online, so the operator triggered mon failover.
  2. The operator started mon-j and was waiting for it to join quorum
  3. The operator was restarted before the mon failover was completed
  4. Upon restart, the operator still believes mon-a, mon-c, and mon-i are the three expected to be in quorum. Since the failover was not completed, the operator hadn't added mon-j to the list of mons expected to be in quorum.
  5. mon-c, mon-i, and mon-j are in quorum, with mon-a being out of quorum according to Ceph
  6. The operator tries to start mon-a again since it is expected to be in quorum
  7. At this point there are 4 mons in the monmap, so 3 must be online for quorum to be achieved (see the quorum sketch after this list).
  8. The operator deletes the mon-j deployment and attempts to remove mon-j from quorum. The removal fails silently, since a mon cannot be removed when only 2 of the remaining mons would be left to form quorum.
  9. Since mon-a isn't coming back up and mon-j is gone, the cluster remains out of quorum.
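
To make the quorum arithmetic in steps 7 and 8 explicit: Ceph needs a strict majority of the mons in the monmap to be up. A small illustrative snippet (not Rook code) of the majority calculation:

```go
package main

import "fmt"

// quorumSize returns the minimum number of mons that must be up to hold
// quorum: a strict majority of the mons in the monmap.
func quorumSize(monsInMap int) int {
	return monsInMap/2 + 1
}

func main() {
	// Step 7: a, c, i, and j are all in the monmap, so 3 mons must be up.
	fmt.Println(quorumSize(4)) // 3 -> c, i, and j can hold quorum while a is down

	// Step 8: after the mon-j deployment is deleted, only c and i are running
	// out of the 4 mons still in the monmap, below the majority of 3, so
	// quorum is lost and the removal of mon-j cannot complete.
	fmt.Println(quorumSize(4) <= 2) // false -> 2 running mons are not enough
}
```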

The operator log captured below starts after the operator restarted.

2021-04-26 22:07:21.417689 I | op-mon: checking for basic quorum with existing mons
2021-04-26 22:07:21.454720 I | op-mon: mon "c" endpoint are [v2:9.255.160.224:3300,v1:9.255.160.224:6789]
2021-04-26 22:07:21.476752 I | op-mon: mon "a" endpoint are [v2:9.255.50.95:3300,v1:9.255.50.95:6789]
2021-04-26 22:07:21.497920 I | op-mon: mon "i" endpoint are [v2:9.255.50.26:3300,v1:9.255.50.26:6789]
2021-04-26 22:07:21.546803 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:21.623840 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"openshift-storage","monitors":["9.255.160.224:6789","9.255.50.95:6789","9.255.50.26:6789"]}] data:c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789 mapping:{"node":{"a":{"Name":"e5n1.fbond","Hostname":"e5n1.fbond","Address":"9.0.32.20"},"c":{"Name":"e4n1.fbond","Hostname":"e4n1.fbond","Address":"9.0.32.19"},"i":{"Name":"e6n1.fbond","Hostname":"e6n1.fbond","Address":"9.0.32.21"}}} maxMonId:8]
2021-04-26 22:07:22.021492 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:22.021568 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:22.248309 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:22.422498 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:22.422594 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:22.442995 I | op-mon: deployment for mon rook-ceph-mon-c already exists. updating if needed
2021-04-26 22:07:22.639596 I | op-k8sutil: deployment "rook-ceph-mon-c" did not change, nothing to update
2021-04-26 22:07:22.639622 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:23.048100 I | op-mon: mon a is not yet running
2021-04-26 22:07:23.230214 I | op-mon: mons running: [c i]
2021-04-26 22:07:23.538462 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:23.543861 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2021-04-26 22:07:23.561624 I | op-k8sutil: deployment "rook-ceph-mon-a" did not change, nothing to update
2021-04-26 22:07:23.561642 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:23.830488 I | op-mon: mon a is not yet running
2021-04-26 22:07:23.883649 I | ceph-spec: ceph-object-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:24.030592 I | op-mon: mons running: [c i]
2021-04-26 22:07:24.383064 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:24.387353 I | op-mon: deployment for mon rook-ceph-mon-i already exists. updating if needed
2021-04-26 22:07:24.405460 I | op-k8sutil: deployment "rook-ceph-mon-i" did not change, nothing to update
2021-04-26 22:07:24.405481 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:24.628489 I | op-mon: mon a is not yet running
2021-04-26 22:07:24.828923 I | op-mon: mons running: [c i]
2021-04-26 22:07:24.881526 I | ceph-spec: ceph-block-pool-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.004260 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.004334 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.161335 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:25.161353 I | op-mon: mons created: 3
2021-04-26 22:07:25.501826 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:25.524146 I | op-mon: mon a is not yet running
2021-04-26 22:07:25.535405 I | op-mon: mons running: [c i]
2021-04-26 22:07:25.547875 I | ceph-spec: ceph-file-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.848394 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:26.983357 I | cephclient: getting or creating ceph auth key "client.csi-rbd-provisioner"
2021-04-26 22:07:27.307671 W | op-mon: mon "a" not found in quorum, waiting for timeout (599 seconds left) before failover
2021-04-26 22:07:27.317992 I | cephclient: getting or creating ceph auth key "client.csi-rbd-node"
2021-04-26 22:07:27.319894 W | op-mon: mon "j" not in source of truth but in quorum, removing
2021-04-26 22:07:27.319914 I | op-mon: ensuring removal of unhealthy monitor j
2021-04-26 22:07:27.654429 I | op-mon: removed monitor j
2021-04-26 22:07:27.660581 I | cephclient: getting or creating ceph auth key "client.csi-cephfs-provisioner"
2021-04-26 22:07:27.679240 I | op-mon: mon pvc did not exist "rook-ceph-mon-j"
2021-04-26 22:07:27.689525 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"openshift-storage","monitors":["9.255.160.224:6789","9.255.50.95:6789","9.255.50.26:6789"]}] data:c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789 mapping:{"node":{"a":{"Name":"e5n1.fbond","Hostname":"e5n1.fbond","Address":"9.0.32.20"},"c":{"Name":"e4n1.fbond","Hostname":"e4n1.fbond","Address":"9.0.32.19"},"i":{"Name":"e6n1.fbond","Hostname":"e6n1.fbond","Address":"9.0.32.21"}}} maxMonId:8]
2021-04-26 22:07:27.694485 I | ceph-spec: object "rook-ceph-mon-j" matched on delete, reconciling
2021-04-26 22:07:27.695740 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:27.695815 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:33.888776 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:33.938553 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:34.887770 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.011011 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.011100 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.011274 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.011283 I | ceph-object-store-user-controller: creating ceph object user "ocs-storagecluster-cephobjectstoreuser" in namespace "openshift-storage"
2021-04-26 22:07:35.198252 I | ceph-object-store-user-controller: created ceph object user "ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.218451 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.218474 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.230708 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.230786 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.230930 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.230939 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-04-26 22:07:35.394367 I | ceph-object-store-user-controller: created ceph object user "noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.404502 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.404519 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.488464 I | ceph-spec: object "rook-ceph-mon-j" matched on delete, reconciling
2021-04-26 22:07:35.554704 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:08:27.802640 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2021-04-26 22:08:48.760624 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2021-04-26 22:09:27.899707 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2021-04-26 22:10:03.898807 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1

This happened because there is a time window in which the operator must retain state indicating that a mon failover is in progress. With that state, mon-j would have been allowed to remain in quorum, and mon-a would have been removed instead since it was still unhealthy.
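
One possible shape of that state, sketched below for illustration (this is not claimed to be how PR #7884 implements it): record the in-progress failover durably, for example as an annotation on the mon endpoints ConfigMap, before creating the replacement mon, and check it on operator startup so the replacement mon is kept and the unhealthy mon is removed. The annotation key and helper functions are hypothetical; the client-go calls are real, and the rook-ceph-mon-endpoints ConfigMap name assumes a default Rook install.

```go
package failover

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical annotation key marking an in-progress failover, e.g. "a->j".
const failoverAnnotation = "example.rook.io/mon-failover-in-progress"

// markFailoverStarted records that oldMon is being replaced by newMon before
// the replacement mon is created, so a restarted operator can tell a failover
// was in flight instead of trusting only its expected-mon list.
func markFailoverStarted(ctx context.Context, c kubernetes.Interface, ns, oldMon, newMon string) error {
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, "rook-ceph-mon-endpoints", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations[failoverAnnotation] = oldMon + "->" + newMon
	_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// pendingFailover is checked on operator startup: if it reports a failover in
// progress, the operator should keep the new mon and finish removing the old
// one rather than removing the new mon from quorum.
func pendingFailover(cm *corev1.ConfigMap) (string, bool) {
	v, ok := cm.Annotations[failoverAnnotation]
	return v, ok
}
```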

In a simple minikube cluster where failover was tested, the critical time window is about 15 seconds; the issue can hit if the operator is restarted during that window.

2021-04-30 00:23:51.902693 W | op-mon: mon "a" NOT found in quorum and timeout exceeded, mon will be failed over
2021-04-30 00:23:51.902706 I | op-mon: prevent voluntary mon drain while failing over
2021-04-30 00:23:51.913886 I | op-mon: Failing over monitor "a"
2021-04-30 00:23:51.921559 I | op-mon: scaling the mon "a" deployment to replica 0
2021-04-30 00:23:51.935005 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-d DaemonName:d PublicIP: Port:6789 Zone: DataPathMap:0xc001263d70}
2021-04-30 00:23:51.948075 I | op-mon: sched-mon: created canary deployment rook-ceph-mon-d-canary
2021-04-30 00:23:51.959865 I | op-mon: sched-mon: waiting for canary pod creation rook-ceph-mon-d-canary
2021-04-30 00:23:56.968780 I | op-mon: sched-mon: canary monitor deployment rook-ceph-mon-d-canary scheduled to minikube
2021-04-30 00:23:56.968877 I | op-mon: assignmon: mon d assigned to node minikube
2021-04-30 00:23:56.974286 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-d-canary"
2021-04-30 00:23:57.037246 I | op-mon: mon "d" endpoint is [v2:10.111.243.151:3300,v1:10.111.243.151:6789]
2021-04-30 00:23:57.074172 I | op-mon: updating maxMonID from 2 to 3 after committing mon "d"
** START OF CRITICAL TIME WINDOW **
2021-04-30 00:23:57.090744 I | op-mon: waiting for mon quorum with [d]
2021-04-30 00:23:57.201921 I | op-mon: mons running: [d]
2021-04-30 00:23:57.707217 I | op-mon: Monitors in quorum: [b c]
2021-04-30 00:23:57.707233 I | op-mon: mons created: 1
2021-04-30 00:23:58.316255 I | op-mon: waiting for mon quorum with [d]
2021-04-30 00:23:58.326840 I | op-mon: mons running: [d]
2021-04-30 00:23:58.818884 W | op-mon: failed to find initial monitor d in mon map
2021-04-30 00:23:58.818965 W | op-mon: monitor d is not in quorum list
2021-04-30 00:24:03.826651 I | op-mon: mons running: [d]
2021-04-30 00:24:11.654366 I | op-mon: Monitors in quorum: [b c d]
2021-04-30 00:24:11.654390 I | op-mon: ensuring removal of unhealthy monitor a
2021-04-30 00:24:12.067308 I | op-mon: removed monitor a
2021-04-30 00:24:12.094870 I | op-mon: mon pvc did not exist "rook-ceph-mon-a"
2021-04-30 00:24:12.106864 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.111.243.151:6789","10.103.153.26:6789","10.100.13.195:6789"]}] data:b=10.103.153.26:6789,c=10.100.13.195:6789,d=10.111.243.151:6789 mapping:{"node":{"b":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"},"c":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"},"d":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"}}} maxMonId:3]
** END OF CRITICAL TIME WINDOW **

Expected behavior:
Automated mon failover must preserve the integrity of mon quorum, even if the operator is restarted in the middle of the failover.

How to reproduce it (minimal and precise):

Restart the operator in the middle of a mon failover and prevent the down mon from starting back up (the operator will try to start it again).
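
To make the restart easier to time in a test cluster, the operator can be scaled to zero right after the replacement mon's deployment appears, then scaled back up once the window has passed. A rough client-go sketch (kubectl scale works just as well; the rook-ceph namespace and rook-ceph-operator deployment name assume a default upstream install):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Scales the Rook operator deployment to zero to simulate the operator dying
// in the middle of a mon failover. Scale it back to 1 afterwards and keep the
// failed mon's node down so the old mon cannot come back.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	deployments := clientset.AppsV1().Deployments("rook-ceph")
	scale, err := deployments.GetScale(context.TODO(), "rook-ceph-operator", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 0
	if _, err := deployments.UpdateScale(context.TODO(), "rook-ceph-operator", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("operator scaled to 0; scale back to 1 after the failover window to observe the issue")
}
```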

travisn added the bug and ceph labels Apr 30, 2021
travisn added this to To do in v1.6 via automation Apr 30, 2021
travisn self-assigned this Apr 30, 2021
travisn moved this from To do to In progress in v1.6 May 10, 2021
v1.6 automation moved this from In progress to Done May 12, 2021