Mon failover can cause mons to fall out of quorum if the operator is disrupted in the middle of the failover #7797

Closed
travisn opened this issue Apr 30, 2021 · 1 comment · Fixed by #7884

travisn commented Apr 30, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
There is a timing issue during mon failover such that if the operator dies at the wrong time and the original mon does not come back up, mon quorum can be lost. In a test cluster, the following sequence appears to have occurred.

  1. A cluster was installed
  2. Node maintenance was started and nodes were drained one at a time and rebooted.
  3. For each node:
    • Drain the node
    • Reboot the node, which would typically take 10-15 minutes

At some point in this cycle, the following occurred:

  1. mon-a had been down for more than 10 minutes waiting for its drained node to come back online, so the operator triggered mon failover.
  2. The operator started mon-j and was waiting for it to join quorum
  3. The operator was restarted before the mon failover was completed
  4. Upon restart, the operator still believes mon-a, mon-c, and mon-i are the three expected to be in quorum. Since the failover was not completed, the operator hadn't added mon-j to the list of mons expected to be in quorum.
  5. mon-c, mon-i, and mon-j are in quorum, with mon-a being out of quorum according to Ceph
  6. The operator tries to start mon-a again since it is expected to be in quorum
  7. At this point there are 4 mons in the monmap, so 3 must be online for quorum to be achieved (see the quorum sketch after this list).
  8. The operator deletes the mon-j deployment and attempts to remove mon-j from quorum. The removal fails silently, since a mon cannot be removed when only 2 of the remaining mons would be left to form quorum.
  9. Since mon-a isn't coming back up and mon-j is gone, the cluster remains out of quorum.
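
To make the quorum arithmetic in steps 7 and 8 explicit: Ceph needs a strict majority of the mons in the monmap to be up. A small illustrative snippet (not Rook code) of the majority calculation:

```go
package main

import "fmt"

// quorumSize returns the minimum number of mons that must be up to hold
// quorum: a strict majority of the mons in the monmap.
func quorumSize(monsInMap int) int {
	return monsInMap/2 + 1
}

func main() {
	// Step 7: a, c, i, and j are all in the monmap, so 3 mons must be up.
	fmt.Println(quorumSize(4)) // 3 -> c, i, and j can hold quorum while a is down

	// Step 8: after the mon-j deployment is deleted, only c and i are running
	// out of the 4 mons still in the monmap, below the majority of 3, so
	// quorum is lost and the removal of mon-j cannot complete.
	fmt.Println(quorumSize(4) <= 2) // false -> 2 running mons are not enough
}
```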

The operator log captured below starts after the operator restarted.

2021-04-26 22:07:21.417689 I | op-mon: checking for basic quorum with existing mons
2021-04-26 22:07:21.454720 I | op-mon: mon "c" endpoint are [v2:9.255.160.224:3300,v1:9.255.160.224:6789]
2021-04-26 22:07:21.476752 I | op-mon: mon "a" endpoint are [v2:9.255.50.95:3300,v1:9.255.50.95:6789]
2021-04-26 22:07:21.497920 I | op-mon: mon "i" endpoint are [v2:9.255.50.26:3300,v1:9.255.50.26:6789]
2021-04-26 22:07:21.546803 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:21.623840 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"openshift-storage","monitors":["9.255.160.224:6789","9.255.50.95:6789","9.255.50.26:6789"]}] data:c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789 mapping:{"node":{"a":{"Name":"e5n1.fbond","Hostname":"e5n1.fbond","Address":"9.0.32.20"},"c":{"Name":"e4n1.fbond","Hostname":"e4n1.fbond","Address":"9.0.32.19"},"i":{"Name":"e6n1.fbond","Hostname":"e6n1.fbond","Address":"9.0.32.21"}}} maxMonId:8]
2021-04-26 22:07:22.021492 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:22.021568 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:22.248309 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:22.422498 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:22.422594 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:22.442995 I | op-mon: deployment for mon rook-ceph-mon-c already exists. updating if needed
2021-04-26 22:07:22.639596 I | op-k8sutil: deployment "rook-ceph-mon-c" did not change, nothing to update
2021-04-26 22:07:22.639622 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:23.048100 I | op-mon: mon a is not yet running
2021-04-26 22:07:23.230214 I | op-mon: mons running: [c i]
2021-04-26 22:07:23.538462 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:23.543861 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2021-04-26 22:07:23.561624 I | op-k8sutil: deployment "rook-ceph-mon-a" did not change, nothing to update
2021-04-26 22:07:23.561642 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:23.830488 I | op-mon: mon a is not yet running
2021-04-26 22:07:23.883649 I | ceph-spec: ceph-object-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:24.030592 I | op-mon: mons running: [c i]
2021-04-26 22:07:24.383064 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:24.387353 I | op-mon: deployment for mon rook-ceph-mon-i already exists. updating if needed
2021-04-26 22:07:24.405460 I | op-k8sutil: deployment "rook-ceph-mon-i" did not change, nothing to update
2021-04-26 22:07:24.405481 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:24.628489 I | op-mon: mon a is not yet running
2021-04-26 22:07:24.828923 I | op-mon: mons running: [c i]
2021-04-26 22:07:24.881526 I | ceph-spec: ceph-block-pool-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.004260 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.004334 I | ceph-spec: ceph-object-store-user-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.161335 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:25.161353 I | op-mon: mons created: 3
2021-04-26 22:07:25.501826 I | op-mon: waiting for mon quorum with [c a i]
2021-04-26 22:07:25.524146 I | op-mon: mon a is not yet running
2021-04-26 22:07:25.535405 I | op-mon: mons running: [c i]
2021-04-26 22:07:25.547875 I | ceph-spec: ceph-file-controller: CephCluster "ocs-storagecluster-cephcluster" found but skipping reconcile since ceph health is &{"HEALTH_ERR" map["error":{"Urgent" "failed to get status. . Error initializing cluster client: ObjectNotFound('error calling conf_read_file',): exit status 1"}] "2021-04-26T22:06:33Z" "2021-04-26T22:06:33Z" "HEALTH_WARN"}
2021-04-26 22:07:25.848394 I | op-mon: Monitors in quorum: [c i j]
2021-04-26 22:07:26.983357 I | cephclient: getting or creating ceph auth key "client.csi-rbd-provisioner"
2021-04-26 22:07:27.307671 W | op-mon: mon "a" not found in quorum, waiting for timeout (599 seconds left) before failover
2021-04-26 22:07:27.317992 I | cephclient: getting or creating ceph auth key "client.csi-rbd-node"
2021-04-26 22:07:27.319894 W | op-mon: mon "j" not in source of truth but in quorum, removing
2021-04-26 22:07:27.319914 I | op-mon: ensuring removal of unhealthy monitor j
2021-04-26 22:07:27.654429 I | op-mon: removed monitor j
2021-04-26 22:07:27.660581 I | cephclient: getting or creating ceph auth key "client.csi-cephfs-provisioner"
2021-04-26 22:07:27.679240 I | op-mon: mon pvc did not exist "rook-ceph-mon-j"
2021-04-26 22:07:27.689525 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"openshift-storage","monitors":["9.255.160.224:6789","9.255.50.95:6789","9.255.50.26:6789"]}] data:c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789 mapping:{"node":{"a":{"Name":"e5n1.fbond","Hostname":"e5n1.fbond","Address":"9.0.32.20"},"c":{"Name":"e4n1.fbond","Hostname":"e4n1.fbond","Address":"9.0.32.19"},"i":{"Name":"e6n1.fbond","Hostname":"e6n1.fbond","Address":"9.0.32.21"}}} maxMonId:8]
2021-04-26 22:07:27.694485 I | ceph-spec: object "rook-ceph-mon-j" matched on delete, reconciling
2021-04-26 22:07:27.695740 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2021-04-26 22:07:27.695815 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2021-04-26 22:07:33.888776 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:33.938553 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack0".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:1096} {StateName:active+undersized Count:40}]". waiting for PGs to be active+clean or PG health check to timeout
2021-04-26 22:07:34.887770 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.011011 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.011100 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.011274 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.011283 I | ceph-object-store-user-controller: creating ceph object user "ocs-storagecluster-cephobjectstoreuser" in namespace "openshift-storage"
2021-04-26 22:07:35.198252 I | ceph-object-store-user-controller: created ceph object user "ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.218451 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.218474 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstoreuser"
2021-04-26 22:07:35.230708 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:07:35.230786 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.230930 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-04-26 22:07:35.230939 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-04-26 22:07:35.394367 I | ceph-object-store-user-controller: created ceph object user "noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.404502 I | ceph-spec: updated ceph object "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.404519 I | ceph-object-store-user-controller: created ceph object user secret "rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user"
2021-04-26 22:07:35.488464 I | ceph-spec: object "rook-ceph-mon-j" matched on delete, reconciling
2021-04-26 22:07:35.554704 I | op-mon: parsing mon endpoints: c=9.255.160.224:6789,a=9.255.50.95:6789,i=9.255.50.26:6789
2021-04-26 22:08:27.802640 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2021-04-26 22:08:48.760624 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2021-04-26 22:09:27.899707 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2021-04-26 22:10:03.898807 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1

This happened because there is a time window in which the operator must retain state indicating that a mon failover is in progress. With that state, mon-j would have been allowed to remain in quorum, and mon-a would have been removed instead since it was still unhealthy.
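
One possible shape of that state, sketched below for illustration (this is not claimed to be how PR #7884 implements it): record the in-progress failover durably, for example as an annotation on the mon endpoints ConfigMap, before creating the replacement mon, and check it on operator startup so the replacement mon is kept and the unhealthy mon is removed. The annotation key and helper functions are hypothetical; the client-go calls are real, and the rook-ceph-mon-endpoints ConfigMap name assumes a default Rook install.

```go
package failover

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical annotation key marking an in-progress failover, e.g. "a->j".
const failoverAnnotation = "example.rook.io/mon-failover-in-progress"

// markFailoverStarted records that oldMon is being replaced by newMon before
// the replacement mon is created, so a restarted operator can tell a failover
// was in flight instead of trusting only its expected-mon list.
func markFailoverStarted(ctx context.Context, c kubernetes.Interface, ns, oldMon, newMon string) error {
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, "rook-ceph-mon-endpoints", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if cm.Annotations == nil {
		cm.Annotations = map[string]string{}
	}
	cm.Annotations[failoverAnnotation] = oldMon + "->" + newMon
	_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// pendingFailover is checked on operator startup: if it reports a failover in
// progress, the operator should keep the new mon and finish removing the old
// one rather than removing the new mon from quorum.
func pendingFailover(cm *corev1.ConfigMap) (string, bool) {
	v, ok := cm.Annotations[failoverAnnotation]
	return v, ok
}
```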

In a simple minikube cluster where failover was tested, the critical time window is about 15 seconds; the issue can hit if the operator is restarted during that window.

2021-04-30 00:23:51.902693 W | op-mon: mon "a" NOT found in quorum and timeout exceeded, mon will be failed over
2021-04-30 00:23:51.902706 I | op-mon: prevent voluntary mon drain while failing over
2021-04-30 00:23:51.913886 I | op-mon: Failing over monitor "a"
2021-04-30 00:23:51.921559 I | op-mon: scaling the mon "a" deployment to replica 0
2021-04-30 00:23:51.935005 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-d DaemonName:d PublicIP: Port:6789 Zone: DataPathMap:0xc001263d70}
2021-04-30 00:23:51.948075 I | op-mon: sched-mon: created canary deployment rook-ceph-mon-d-canary
2021-04-30 00:23:51.959865 I | op-mon: sched-mon: waiting for canary pod creation rook-ceph-mon-d-canary
2021-04-30 00:23:56.968780 I | op-mon: sched-mon: canary monitor deployment rook-ceph-mon-d-canary scheduled to minikube
2021-04-30 00:23:56.968877 I | op-mon: assignmon: mon d assigned to node minikube
2021-04-30 00:23:56.974286 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-d-canary"
2021-04-30 00:23:57.037246 I | op-mon: mon "d" endpoint is [v2:10.111.243.151:3300,v1:10.111.243.151:6789]
2021-04-30 00:23:57.074172 I | op-mon: updating maxMonID from 2 to 3 after committing mon "d"
** START OF CRITICAL TIME WINDOW **
2021-04-30 00:23:57.090744 I | op-mon: waiting for mon quorum with [d]
2021-04-30 00:23:57.201921 I | op-mon: mons running: [d]
2021-04-30 00:23:57.707217 I | op-mon: Monitors in quorum: [b c]
2021-04-30 00:23:57.707233 I | op-mon: mons created: 1
2021-04-30 00:23:58.316255 I | op-mon: waiting for mon quorum with [d]
2021-04-30 00:23:58.326840 I | op-mon: mons running: [d]
2021-04-30 00:23:58.818884 W | op-mon: failed to find initial monitor d in mon map
2021-04-30 00:23:58.818965 W | op-mon: monitor d is not in quorum list
2021-04-30 00:24:03.826651 I | op-mon: mons running: [d]
2021-04-30 00:24:11.654366 I | op-mon: Monitors in quorum: [b c d]
2021-04-30 00:24:11.654390 I | op-mon: ensuring removal of unhealthy monitor a
2021-04-30 00:24:12.067308 I | op-mon: removed monitor a
2021-04-30 00:24:12.094870 I | op-mon: mon pvc did not exist "rook-ceph-mon-a"
2021-04-30 00:24:12.106864 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.111.243.151:6789","10.103.153.26:6789","10.100.13.195:6789"]}] data:b=10.103.153.26:6789,c=10.100.13.195:6789,d=10.111.243.151:6789 mapping:{"node":{"b":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"},"c":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"},"d":{"Name":"minikube","Hostname":"minikube","Address":"192.168.99.117"}}} maxMonId:3]
** END OF CRITICAL TIME WINDOW **

Expected behavior:
Automated mon failover must preserve the integrity of mon quorum, even if the operator is restarted in the middle of the failover.

How to reproduce it (minimal and precise):

Restart the operator in the middle of a mon failover and prevent the down mon from starting back up (the operator will try to start it again).
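
To make the restart easier to time in a test cluster, the operator can be scaled to zero right after the replacement mon's deployment appears, then scaled back up once the window has passed. A rough client-go sketch (kubectl scale works just as well; the rook-ceph namespace and rook-ceph-operator deployment name assume a default upstream install):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Scales the Rook operator deployment to zero to simulate the operator dying
// in the middle of a mon failover. Scale it back to 1 afterwards and keep the
// failed mon's node down so the old mon cannot come back.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	deployments := clientset.AppsV1().Deployments("rook-ceph")
	scale, err := deployments.GetScale(context.TODO(), "rook-ceph-operator", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 0
	if _, err := deployments.UpdateScale(context.TODO(), "rook-ceph-operator", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("operator scaled to 0; scale back to 1 after the failover window to observe the issue")
}
```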

travisn added the bug and ceph labels Apr 30, 2021
travisn added this to To do in v1.6 via automation Apr 30, 2021
travisn self-assigned this Apr 30, 2021
travisn moved this from To do to In progress in v1.6 May 10, 2021
v1.6 automation moved this from In progress to Done May 12, 2021