
Run two MGRs to have one in standby mode #1796

Closed
galexrt opened this issue Jun 15, 2018 · 14 comments

Labels: ceph, feature

@galexrt (Member) commented Jun 15, 2018

Is this a bug report or feature request?

  • Feature Request

Are there any similar features already existing:
No.

What should the feature do:
Run two MGR instances again (see #1334).
See #1048.

What would be solved through this feature:
Running two MGRs would provide better HA for Ceph metrics: if one node goes NotReady in Kubernetes, it can take up to 5 minutes until Pod eviction starts, so the MGR Pod is unavailable for that whole time and metrics are lost; those metrics may be important for understanding what caused the failure.
(For more info on Kubernetes node down detection and pod eviction, see https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/)

Does this have an impact on existing features:
No; the code in #1779 is also implemented so that two MGRs can be run again.

@galexrt galexrt added the ceph main ceph tag label Jun 15, 2018
@galexrt galexrt self-assigned this Jun 15, 2018
@liewegas (Member) commented:

@jcsp Is there anything to worry about here with the way mgr failover happens? I seem to recall something about how failover across instances is slower than when the same instance name restarts.

Perhaps that doesn't actually matter here, though, since if you have mgr.x and mgr.y and x fails, it will restart just as quickly regardless of whether there is also a mgr.y, so this can only improve things (when mgr.y takes over quickly) but wouldn't make anything worse.

@jcsp (Member) commented Jun 18, 2018

Yeah, recovery with the same instance name is faster (no timeout), just like how the MDS works. However, in this case I think we're talking about using Ceph's timeout-based failover just because it's snappier than the k8s node failure handling, so either way it's a timeout.

I didn't realise the k8s timeout on a failed node was so long (5m) -- since mon_mgr_beacon_grace is only 30s, they'll clearly get a faster mgr recovery using Ceph's built-in failover. However, if a node is already NotReady from the k8s perspective, we shouldn't even wait for that timeout -- Rook could issue "ceph mgr fail" commands when it sees the active mgr's node go NotReady (or the active pod go away for any other reason).

These arguments also apply to MDS daemons -- I'd like Rook to be proactively issuing "ceph mds fail" commands when a pod (or its node) appears to have gone bad, when Ceph might still be waiting on a timeout.
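
For illustration only, here is a minimal sketch of that idea in Go (not Rook's actual implementation): an operator loop that, on seeing the active mgr's node go NotReady, immediately asks Ceph to fail over rather than waiting for mon_mgr_beacon_grace. The function name and the way the node state and active daemon are obtained are assumptions made for the sketch.

```go
// Sketch: proactively fail over the active mgr when its node goes NotReady,
// instead of waiting for Ceph's beacon timeout.
package main

import (
	"fmt"
	"os/exec"
)

// failActiveMgrIfNodeDown issues "ceph mgr fail <daemon>" when the node hosting
// the active mgr is reported NotReady. In a real operator, nodeReady would come
// from a Kubernetes node watch and activeMgr from the Ceph mgrmap; both are
// passed in here to keep the sketch self-contained.
func failActiveMgrIfNodeDown(nodeReady bool, activeMgr string) error {
	if nodeReady {
		return nil // node is healthy, nothing to do
	}
	// Ask Ceph to fail over immediately rather than waiting for the timeout.
	out, err := exec.Command("ceph", "mgr", "fail", activeMgr).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ceph mgr fail %s: %v (%s)", activeMgr, err, out)
	}
	return nil
}

func main() {
	// Example: the node running mgr.a was just observed NotReady.
	if err := failActiveMgrIfNodeDown(false, "a"); err != nil {
		fmt.Println(err)
	}
}
```

The same pattern would apply to "ceph mds fail" for MDS daemons.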

@travisn (Member) commented Jun 18, 2018

@jcsp Since the mgr and mds have built-in health monitoring and failover, what additional monitoring would be effective for Rook to add? I'm not sure we can do better than the K8s timeouts. When K8s determines a node is NotReady, its pods should already be evicted and started elsewhere immediately. Before that happens, wouldn't the mgr and mds have already performed their own failover of the active daemon, since they have their own health checks and timeouts?

@jcsp (Member) commented Jun 18, 2018

For an unexpected node failure, in general Ceph's timeouts would have already kicked in at the point that k8s timeouts do. However, I imagine in some cases the k8s node death might be somehow intentional/orchestrated, so we would like the cue from k8s that the node is dead, rather than waiting for the Ceph timeout. Conceivably, k8s might also have better knowledge of node failures, e.g. tipped off by power management, and it could pass that immediate non-timeout-driven information to ceph with the fail commands.

I'm totally hand waving though :-)

@travisn (Member) commented Jun 18, 2018

Got it, will definitely keep that in mind to look for ways to proactively monitor the daemons. And hand waving is acceptable at this point too. :)

@BlaineEXE (Member) commented:

There has been some discussion about this here: #2048 (comment)

As I suggested in #2048, I think we should deprioritize this feature for the 0.9 release, as there is already a full load of work targeted for 0.9. We can reevaluate for 0.10/1.0.

@galexrt galexrt removed their assignment Sep 13, 2018
stale bot commented Dec 12, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 12, 2018
stale bot commented Dec 19, 2018

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@stale stale bot closed this as completed Dec 19, 2018
@wangxiao86 commented Apr 22, 2019

I think I have found the cause of the issue (one of the two mgrs responds "Connection refused" because its module failed to start):
[18/Apr/2019:12:36:34] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0x7f978f94d050>>
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 197, in publish
output.append(listener(*args, **kwargs))
File "/usr/lib/python2.7/site-packages/cherrypy/_cpserver.py", line 176, in start
ServerAdapter.start(self)
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 180, in start
wait_for_free_port(*self.bind_addr)
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 423, in wait_for_free_port
raise IOError("Port %r not free on %r" % (port, host))
IOError: Port 9283 not free on '10.107.1.232'
"
Check the error message above, in mgr-a pod ( the module startup failed), i find the ip value of bind_addr is 10.107.1.232 , but it should be 10.107.1.248.

The assignment of server_addr traces back to CherryPy. In mgr/prometheus/module.py, both the StandbyModule and the MgrModule read the address, and server_addr itself is obtained via the get_localize_config function of ceph/mgr/mgr_module.py.

I then found that the command ("ceph config set mgr.a mgr/dashboard/server_addr $(ROOK_POD_IP)") is not working (it is executed by the makeSetServerAddrInitContainer function of mgr/spec.go): I did not find mgr.a in the output of "ceph config dump". Therefore, when there are two mgr pods (mgr-a-deployment and mgr-b-deployment), both pods obtain the same server_addr, and one of them fails at startup with "IOError: Port xxx not free on 'xxx'".

I ran two mgrs (mgr-a and mgr-b) by setting the mgr Replicas=2. I modified the makeSetServerAddrInitContainer function of mgr/spec.go, changing “cfgPath := fmt.Sprintf("mgr/%s/server_addr", mgrModule)” to “cfgPath := fmt.Sprintf("mgr/%s/%s/server_addr", mgrModule, mgrConfig.DaemonID)”. After the modification, the mgr-a pod gets the correct IP through mgr/<module>/a/server_addr and the mgr-b pod through mgr/<module>/b/server_addr, so both mgr pods run normally: the active mgr pod returns metrics normally, and the standby mgr pod returns an empty response instead of "Connection refused".

But I'm not sure if my change will have any other impact.
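
To make the described change concrete, here is a small, self-contained sketch of how the per-daemon config path could be built; it mirrors the one-line change quoted above but is not the exact code from mgr/spec.go (the argument names and the surrounding helper are assumptions).

```go
// Sketch: build the "ceph config set" arguments used by the mgr init container
// so that each mgr daemon gets its own server_addr key
// (mgr/<module>/<daemonID>/server_addr) instead of one shared key.
package main

import "fmt"

// setServerAddrArgs returns the ceph CLI arguments for one mgr daemon.
// mgrModule is e.g. "prometheus" or "dashboard"; daemonID is "a", "b", ...;
// podIPEnvVar is the env var holding the pod IP (e.g. ROOK_POD_IP).
func setServerAddrArgs(mgrModule, daemonID, podIPEnvVar string) []string {
	// Before the fix: fmt.Sprintf("mgr/%s/server_addr", mgrModule)
	// After the fix: include the daemon ID so mgr.a and mgr.b do not share it.
	cfgPath := fmt.Sprintf("mgr/%s/%s/server_addr", mgrModule, daemonID)
	return []string{
		"config", "set",
		fmt.Sprintf("mgr.%s", daemonID),   // target this specific mgr daemon
		cfgPath,
		fmt.Sprintf("$(%s)", podIPEnvVar), // expanded by Kubernetes at start
	}
}

func main() {
	fmt.Println(setServerAddrArgs("prometheus", "a", "ROOK_POD_IP"))
	// Roughly: [config set mgr.a mgr/prometheus/a/server_addr $(ROOK_POD_IP)]
}
```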

@galexrt galexrt reopened this Apr 22, 2019
@stale stale bot removed the wontfix label Apr 22, 2019
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 23, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.
@travisn travisn mentioned this issue Apr 23, 2019
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 24, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <1085038484@qq.com>
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <wangxiao1@sensetime.com>
wangxiao86 pushed a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <wangxiao1@sensetime.com>
stale bot commented Jul 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 22, 2019
@travisn travisn removed the wontfix label Jul 22, 2019
@BlaineEXE (Member) commented:

Commenting to remove staleness. This is still something we want to do, but the priority isn't as high as other things at this time.

@mmgaggle commented:

Relatedly, we should use anti-affinity placement by failure-domain labels. Basically the same issue as #3332, but for MGR instead of MDS.

@jmolmo (Contributor) commented Nov 19, 2019

After the Rook community meeting held on 11/15/2019, we decided not to continue with adding two managers to the Rook cluster, to avoid problems between the Kubernetes load balancer, Kubernetes resiliency policies, and the Ceph system.
However, we are continuing with the modification to improve the time needed to get a new mgr pod running when the node where the pod runs is "NotReady".
This modification is in:
Improve restarting time for mgr,mon,toolbox pods running in a k8's "NotReady" node

@jmolmo jmolmo closed this as completed Nov 19, 2019
@jshen28 (Contributor) commented Jan 12, 2021

Hi, I understand the mgr is not going to work behind load balancing, but I think we could still run 2 (or more) mgrs and register them all in Prometheus. By doing this we could associate an individual Service with each deployment, and I recall Prometheus can collect metrics by labels.
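
As a rough sketch of that idea (assuming the standard k8s.io API packages; the label keys and Service names are illustrative, not Rook's actual ones), a per-daemon metrics Service could look like this, which would let Prometheus scrape each mgr separately and distinguish them by labels:

```go
// Sketch: one metrics Service per mgr daemon, selected by a hypothetical
// per-daemon label, so each mgr can be scraped and labeled individually.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// mgrMetricsService builds a metrics Service for a single mgr daemon ("a", "b", ...).
func mgrMetricsService(namespace, daemonID string) *corev1.Service {
	labels := map[string]string{"app": "rook-ceph-mgr", "mgr": daemonID}
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "rook-ceph-mgr-" + daemonID + "-metrics",
			Namespace: namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			Selector: labels, // assumes each mgr pod carries a per-daemon label
			Ports: []corev1.ServicePort{{
				Name:       "http-metrics",
				Port:       9283, // default ceph-mgr prometheus module port
				TargetPort: intstr.FromInt(9283),
			}},
		},
	}
}

func main() {
	fmt.Println(mgrMetricsService("rook-ceph", "a").Name)
}
```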
