Run two MGRs to have one in standby mode #1796
Comments
@jcsp Is there anything to worry about here with the way mgr failover happens? I seem to recall something about how failover across instances is slower than when the same instance name restarts. Perhaps that doesn't actually matter here, though, since if you have mgr.x and mgr.y and x fails, it will restart just as quickly regardless of whether there is also a mgr.y, so this can only improve things (when mgr.y takes over quickly) but wouldn't make anything worse.
Yeah, recovery with the same instance name is faster (no timeout), just like how the MDS works. However, in this case I think we're talking about using Ceph's timeout-based failover just because it's snappier than the k8s node failure handling, so either way it's a timeout. I didn't realise the k8s timeout on a failed node was so long (5m) -- since mon_mgr_beacon_grace is only 30s, they'll clearly get a faster mgr recovery using Ceph's built-in failover. However, if a node is already NotReady from the k8s perspective, we shouldn't even wait for that timeout -- Rook could issue "ceph mgr fail" commands when it sees the active mgr's node go NotReady (or the active pod goes away for any other reason). These arguments also apply to MDS daemons -- I'd like Rook to be proactively issuing "ceph mds fail" commands when a pod (or its node) appears to have gone bad, when Ceph might still be waiting on a timeout.
@jcsp Since the mgr and mds have built-in health monitoring and failover, what monitoring would be effective for Rook to add on top of that? I'm not sure we can do better than the K8s timeouts. When K8s determines a node is NotReady, its pods should already be evicted and started elsewhere immediately. Before that happens, wouldn't the mgr and mds have already performed their own failover of the active daemon, since they have their own health checks and timeouts?
For an unexpected node failure, in general Ceph's timeouts would have already kicked in at the point that k8s timeouts do. However, I imagine in some cases the k8s node death might be somehow intentional/orchestrated, so we would like the cue from k8s that the node is dead, rather than waiting for the Ceph timeout. Conceivably, k8s might also have better knowledge of node failures, e.g. tipped off by power management, and it could pass that immediate non-timeout-driven information to Ceph with the fail commands. I'm totally hand waving though :-)
Got it, will definitely keep that in mind to look for ways to proactively monitor the daemons. And hand waving is acceptable at this point too. :)
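To make the proactive-failover idea above concrete, here is a minimal sketch (not Rook's actual implementation) of an operator-side loop that watches the node hosting the active mgr and issues `ceph mgr fail` as soon as that node goes NotReady, instead of waiting for `mon_mgr_beacon_grace` to expire. The node and daemon names, the polling interval, and shelling out to the `ceph` CLI are illustrative assumptions; it assumes a recent client-go.

```go
// Hypothetical sketch of proactive mgr failover on node NotReady.
// Not Rook code; names and the polling approach are illustrative only.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeReady reports whether the node's Ready condition is True.
func nodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// In a real operator these would come from its knowledge of the mgr
	// deployments; hard-coded here for illustration (assumed values).
	activeMgrNode := "node-a"
	activeMgrName := "a"

	for {
		node, err := client.CoreV1().Nodes().Get(context.TODO(), activeMgrNode, metav1.GetOptions{})
		if err == nil && !nodeReady(node) {
			// Ask Ceph to promote the standby mgr right away instead of
			// waiting for the beacon grace timeout.
			out, cmdErr := exec.Command("ceph", "mgr", "fail", activeMgrName).CombinedOutput()
			fmt.Printf("ceph mgr fail %s: %s (err=%v)\n", activeMgrName, out, cmdErr)
		}
		time.Sleep(10 * time.Second)
	}
}
```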
There has been some discussion about this here: #2048 (comment) As I suggested in #2048, I think this is a feature we should opt to deprioritize for the 0.9 release, as I think there is already a full load of work targeted for 0.9. We can reevaluate for 0.10/1.0.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Commenting to remove staleness. This is still something we want to do, but the priority isn't as high as other things at this time.
Relatedly, we should use anti-affinity placement based on failure-domain labels. Basically the same issue as #3332, but for the MGR instead of the MDS.
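For reference, a minimal sketch of what anti-affinity by failure-domain labels could look like when an operator builds the mgr pod spec. The `app=rook-ceph-mgr` label and the pre-1.17 zone topology key are assumptions for illustration, not necessarily what Rook uses.

```go
// Hypothetical sketch: prefer spreading mgr pods across failure domains
// using pod anti-affinity keyed on the zone topology label.
package mgr

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mgrAntiAffinity returns an affinity spec that prefers scheduling mgr
// pods into different zones. Labels and the topology key are assumed.
func mgrAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"app": "rook-ceph-mgr"},
						},
						// Prefer placing the two mgr pods in different zones.
						TopologyKey: "failure-domain.beta.kubernetes.io/zone",
					},
				},
			},
		},
	}
}
```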
After the Rook community meeting held on 11/15/2019, we decided not to continue with adding two managers to the Rook cluster, to avoid problems between the k8s load balancer, k8s resiliency policies, and the Ceph system.
Hi, I understand the mgr is not going to work behind load balancing, but I think we could still have two or more mgrs and register them all in Prometheus. By doing this we could associate an individual Service with each Deployment, and I recall Prometheus can collect metrics by labels.
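A rough sketch of that idea: one Service per mgr deployment on the default mgr Prometheus module port (9283), with a per-daemon label that Prometheus could use to tell the instances apart. The names, namespace, and selectors here are illustrative assumptions, not Rook's actual ones.

```go
// Hypothetical sketch: build a per-mgr metrics Service so each mgr
// instance is individually scrapable and distinguishable by label.
package mgr

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// mgrMetricsService returns a Service for one mgr daemon (e.g. "a").
// Names, namespace, and labels are assumed for illustration.
func mgrMetricsService(daemonID string) *corev1.Service {
	name := "rook-ceph-mgr-" + daemonID
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name + "-metrics",
			Namespace: "rook-ceph",
			Labels:    map[string]string{"app": "rook-ceph-mgr", "mgr": daemonID},
		},
		Spec: corev1.ServiceSpec{
			// Select only the pod of this particular mgr daemon.
			Selector: map[string]string{"app": "rook-ceph-mgr", "mgr": daemonID},
			Ports: []corev1.ServicePort{{
				Name:       "http-metrics",
				Port:       9283,
				TargetPort: intstr.FromInt(9283),
			}},
		},
	}
}
```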
Is this a bug report or feature request?
Feature Request
Are there any similar features already existing:
No.
What should the feature do:
Run two MGR instances again (see #1334).
See #1048.
What would be solved through this feature:
Running two MGRs would allow better HA for Ceph metrics: if one node is NotReady for Kubernetes, it takes up to 5 minutes until Pod eviction starts, so the MGR Pod is unavailable for that time and metrics are lost, which may be important for seeing what the issue was during a failure. (For more info on Kubernetes node-down detection and pod eviction, see https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/)
Does this have an impact on existing features:
No; the code needed to run two MGRs again was already implemented in #1779.