
Run two MGRs to have one in standby mode #1796

Closed
galexrt opened this issue Jun 15, 2018 · 14 comments

Labels: ceph, feature

@galexrt (Member) commented Jun 15, 2018

Is this a bug report or feature request?

  • Feature Request

Are there any similar features already existing:
No.

What should the feature do:
Run two MGR instances again (see #1334).
See #1048.

What would be solved through this feature:
Running two MGRs would provide better HA for Ceph metrics: if one node goes NotReady in Kubernetes, it can take up to 5 minutes until Pod eviction starts, so the MGR Pod is unavailable for that whole time and metrics are lost; those metrics may be important for understanding what caused the failure.
(For more info on Kubernetes node down detection and pod eviction, see https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/)

Does this have an impact on existing features:
No; the code in #1779 is also implemented so that two MGRs can be run again.

@galexrt galexrt added the ceph main ceph tag label Jun 15, 2018
@galexrt galexrt self-assigned this Jun 15, 2018
@liewegas (Member) commented:

@jcsp Is there anything to worry about here with the way mgr failover happens? I seem to recall something about how failover across instances is slower than when the same instance name restarts.

Perhaps that doesn't actually matter here, though, since if you have mgr.x and mgr.y and x fails, it will restart just as quickly regardless of whether there is also a mgr.y, so this can only improve things (when mgr.y takes over quickly) but wouldn't make anything worse.

@jcsp (Member) commented Jun 18, 2018

Yeah, recovery with the same instance name is faster (no timeout), just like how the MDS works. However, in this case I think we're talking about using Ceph's timeout-based failover just because it's snappier than the k8s node failure handling, so either way it's a timeout.

I didn't realise the k8s timeout on a failed node was so long (5m) -- since mon_mgr_beacon_grace is only 30s, they'll clearly get a faster mgr recovery using Ceph's built-in failover. However, if a node is already NotReady from the k8s perspective, we shouldn't even wait for that timeout -- Rook could issue "ceph mgr fail" commands when it sees the active mgr's node go NotReady (or the active pod go away for any other reason).

These arguments also apply to MDS daemons -- I'd like Rook to be proactively issuing "ceph mds fail" commands when a pod (or its node) appears to have gone bad, when Ceph might still be waiting on a timeout.
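
For illustration only, here is a minimal sketch of that idea in Go (not Rook's actual implementation): an operator loop that, on seeing the active mgr's node go NotReady, immediately asks Ceph to fail over rather than waiting for mon_mgr_beacon_grace. The function name and the way the node state and active daemon are obtained are assumptions made for the sketch.

```go
// Sketch: proactively fail over the active mgr when its node goes NotReady,
// instead of waiting for Ceph's beacon timeout.
package main

import (
	"fmt"
	"os/exec"
)

// failActiveMgrIfNodeDown issues "ceph mgr fail <daemon>" when the node hosting
// the active mgr is reported NotReady. In a real operator, nodeReady would come
// from a Kubernetes node watch and activeMgr from the Ceph mgrmap; both are
// passed in here to keep the sketch self-contained.
func failActiveMgrIfNodeDown(nodeReady bool, activeMgr string) error {
	if nodeReady {
		return nil // node is healthy, nothing to do
	}
	// Ask Ceph to fail over immediately rather than waiting for the timeout.
	out, err := exec.Command("ceph", "mgr", "fail", activeMgr).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ceph mgr fail %s: %v (%s)", activeMgr, err, out)
	}
	return nil
}

func main() {
	// Example: the node running mgr.a was just observed NotReady.
	if err := failActiveMgrIfNodeDown(false, "a"); err != nil {
		fmt.Println(err)
	}
}
```

The same pattern would apply to "ceph mds fail" for MDS daemons.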

@travisn (Member) commented Jun 18, 2018

@jcsp Since the mgr and mds have built-in health monitoring and failover, what additional monitoring would be effective for Rook to add? I'm not sure we can do better than the K8s timeouts. When K8s determines a node is NotReady, its pods should already be evicted and started elsewhere immediately. Before that happens, wouldn't the mgr and mds have already performed their own failover of the active daemon, since they have their own health checks and timeouts?

@jcsp (Member) commented Jun 18, 2018

For an unexpected node failure, in general Ceph's timeouts would have already kicked in at the point that k8s timeouts do. However, I imagine in some cases the k8s node death might be somehow intentional/orchestrated, so we would like the cue from k8s that the node is dead, rather than waiting for the Ceph timeout. Conceivably, k8s might also have better knowledge of node failures, e.g. tipped off by power management, and it could pass that immediate non-timeout-driven information to ceph with the fail commands.

I'm totally hand waving though :-)

@travisn (Member) commented Jun 18, 2018

Got it, will definitely keep that in mind to look for ways to proactively monitor the daemons. And hand waving is acceptable at this point too. :)

@BlaineEXE (Member) commented:

There has been some discussion about this here: #2048 (comment)

As I suggested in #2048, I think we should deprioritize this feature for the 0.9 release, as there is already a full load of work targeted for 0.9. We can reevaluate for 0.10/1.0.

@galexrt galexrt removed their assignment Sep 13, 2018
stale bot commented Dec 12, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Dec 12, 2018
stale bot commented Dec 19, 2018

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@stale stale bot closed this as completed Dec 19, 2018
@wangxiao86 commented Apr 22, 2019

I think I have found the cause of the issue (one of the two mgrs responds "Connection refused" because its module failed to start):
[18/Apr/2019:12:36:34] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0x7f978f94d050>>
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 197, in publish
output.append(listener(*args, **kwargs))
File "/usr/lib/python2.7/site-packages/cherrypy/_cpserver.py", line 176, in start
ServerAdapter.start(self)
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 180, in start
wait_for_free_port(*self.bind_addr)
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 423, in wait_for_free_port
raise IOError("Port %r not free on %r" % (port, host))
IOError: Port 9283 not free on '10.107.1.232'
"
Check the error message above, in mgr-a pod ( the module startup failed), i find the ip value of bind_addr is 10.107.1.232 , but it should be 10.107.1.248.

The assignment of server_addr traces back to CherryPy. In mgr/prometheus/module.py, both the StandbyModule and the MgrModule read the address, and server_addr itself is obtained via the get_localize_config function of ceph/mgr/mgr_module.py.

I then found that the command ("ceph config set mgr.a mgr/dashboard/server_addr $(ROOK_POD_IP)") is not working (it is executed by the makeSetServerAddrInitContainer function of mgr/spec.go): I did not find mgr.a in the output of "ceph config dump". Therefore, when there are two mgr pods (mgr-a-deployment and mgr-b-deployment), both pods obtain the same server_addr, and one of them fails at startup with "IOError: Port xxx not free on 'xxx'".

I ran two mgrs (mgr-a and mgr-b) by setting the mgr Replicas=2. I modified the makeSetServerAddrInitContainer function of mgr/spec.go, changing “cfgPath := fmt.Sprintf("mgr/%s/server_addr", mgrModule)” to “cfgPath := fmt.Sprintf("mgr/%s/%s/server_addr", mgrModule, mgrConfig.DaemonID)”. After the modification, the mgr-a pod gets the correct IP through mgr/<module>/a/server_addr and the mgr-b pod through mgr/<module>/b/server_addr, so both mgr pods run normally: the active mgr pod returns metrics normally, and the standby mgr pod returns an empty response instead of "Connection refused".

But I'm not sure if my change will have any other impact.
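
To make the described change concrete, here is a small, self-contained sketch of how the per-daemon config path could be built; it mirrors the one-line change quoted above but is not the exact code from mgr/spec.go (the argument names and the surrounding helper are assumptions).

```go
// Sketch: build the "ceph config set" arguments used by the mgr init container
// so that each mgr daemon gets its own server_addr key
// (mgr/<module>/<daemonID>/server_addr) instead of one shared key.
package main

import "fmt"

// setServerAddrArgs returns the ceph CLI arguments for one mgr daemon.
// mgrModule is e.g. "prometheus" or "dashboard"; daemonID is "a", "b", ...;
// podIPEnvVar is the env var holding the pod IP (e.g. ROOK_POD_IP).
func setServerAddrArgs(mgrModule, daemonID, podIPEnvVar string) []string {
	// Before the fix: fmt.Sprintf("mgr/%s/server_addr", mgrModule)
	// After the fix: include the daemon ID so mgr.a and mgr.b do not share it.
	cfgPath := fmt.Sprintf("mgr/%s/%s/server_addr", mgrModule, daemonID)
	return []string{
		"config", "set",
		fmt.Sprintf("mgr.%s", daemonID),   // target this specific mgr daemon
		cfgPath,
		fmt.Sprintf("$(%s)", podIPEnvVar), // expanded by Kubernetes at start
	}
}

func main() {
	fmt.Println(setServerAddrArgs("prometheus", "a", "ROOK_POD_IP"))
	// Roughly: [config set mgr.a mgr/prometheus/a/server_addr $(ROOK_POD_IP)]
}
```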

@galexrt galexrt reopened this Apr 22, 2019
@stale stale bot removed the wontfix label Apr 22, 2019
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 23, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.
@travisn travisn mentioned this issue Apr 23, 2019
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 24, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <1085038484@qq.com>
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.
wangxiao86 added a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <wangxiao1@sensetime.com>
wangxiao86 pushed a commit to wangxiao86/rook that referenced this issue Apr 26, 2019
…ormally : the active mgr pod accesses metrics normally, the standby mgr pod accesses metrics return empty.

Signed-off-by: wangxiao86 <wangxiao1@sensetime.com>
stale bot commented Jul 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 22, 2019
@travisn travisn removed the wontfix label Jul 22, 2019
@BlaineEXE (Member) commented:

Commenting to remove staleness. This is still something we want to do, but the priority isn't as high as other things at this time.

@mmgaggle commented:

Relatedly, we should use anti-affinity placement by failure-domain labels. Basically the same issue as #3332, but for MGR instead of MDS.

@jmolmo (Contributor) commented Nov 19, 2019

After the Rook community meeting held on 11/15/2019, we decided not to continue with adding two managers to the Rook cluster, to avoid problems between the Kubernetes load balancer, Kubernetes resiliency policies, and the Ceph system.
However, we are continuing with the modification to improve the time needed to get a new mgr pod running when the node where the pod runs is "NotReady".
This modification is in:
Improve restarting time for mgr,mon,toolbox pods running in a k8's "NotReady" node

@jmolmo jmolmo closed this as completed Nov 19, 2019
@jshen28 (Contributor) commented Jan 12, 2021

Hi, I understand the mgr is not going to work behind load balancing, but I think we could still run 2 (or more) mgrs and register them all in Prometheus. By doing this we could associate an individual Service with each deployment, and I recall Prometheus can collect metrics by labels.
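
As a rough sketch of that idea (assuming the standard k8s.io API packages; the label keys and Service names are illustrative, not Rook's actual ones), a per-daemon metrics Service could look like this, which would let Prometheus scrape each mgr separately and distinguish them by labels:

```go
// Sketch: one metrics Service per mgr daemon, selected by a hypothetical
// per-daemon label, so each mgr can be scraped and labeled individually.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// mgrMetricsService builds a metrics Service for a single mgr daemon ("a", "b", ...).
func mgrMetricsService(namespace, daemonID string) *corev1.Service {
	labels := map[string]string{"app": "rook-ceph-mgr", "mgr": daemonID}
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "rook-ceph-mgr-" + daemonID + "-metrics",
			Namespace: namespace,
			Labels:    labels,
		},
		Spec: corev1.ServiceSpec{
			Selector: labels, // assumes each mgr pod carries a per-daemon label
			Ports: []corev1.ServicePort{{
				Name:       "http-metrics",
				Port:       9283, // default ceph-mgr prometheus module port
				TargetPort: intstr.FromInt(9283),
			}},
		},
	}
}

func main() {
	fmt.Println(mgrMetricsService("rook-ceph", "a").Name)
}
```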
