check mongo status via host, not localhost to ensure remote accessibility #217
Conversation
@rhcarvalho @php-coder PTAL, I think this will address the mongo replication test flake we've been seeing, in which the petset test fails with the pod logs showing:
My theory is that although mongo is up, DNS isn't populated yet (or something along those lines), so the join attempt fails. Changing the "wait for mongo up" check to use the real host instead of localhost should avoid that problem. I've confirmed this at least passes the extended tests (but so does the existing logic, most of the time). @PI-Victor fyi.
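For illustration, here is a minimal sketch of that kind of check (the helper name and variable are hypothetical; the repo's actual script may differ): instead of polling mongod on localhost, poll it via the pod's real hostname, so that a successful check also implies the member is resolvable and reachable under the name the other replica-set members will use.

```bash
#!/bin/bash
# Sketch only: wait until mongod answers on the pod's real hostname rather than
# on localhost, so success also means the member is reachable under the name
# the other members will use when joining it.
wait_for_mongo_up_on_host() {
  # Default to the pod's FQDN; callers may pass an explicit host.
  local host="${1:-$(hostname -f)}"
  local i
  for i in $(seq 60); do
    # db.version() is a cheap command that only succeeds once mongod
    # accepts connections on the given host (and the name resolves).
    if mongo --host "$host" --quiet --eval "db.version()" &>/dev/null; then
      echo "=> MongoDB is up on $host"
      return 0
    fi
    echo "=> Waiting for MongoDB on $host (attempt $i)..."
    sleep 1
  done
  return 1
}
```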
Sounds like kubernetes/kubernetes#39363, but we don't have a readiness probe here. On the other hand: "The default state of Readiness before the initial delay is Failure. The state of Readiness for a container when no probe is provided is assumed to be Success." Is it possible that even though we don't have a readiness probe, the pod starts with a Failed readiness state and is only marked Success after some delay? If that's our case, we could try to set the corresponding annotation.
@bparees sounds reasonable to me. For future reference, could you please add some PR description, e.g., linking to a flake? Hmm... on second thought...
I think…
@php-coder this is also in the logs:
So I'm still thinking this is a DNS issue, not a readiness issue. I'm not even sure there are services at play here, since the pods connect directly to each other, no?
@rhcarvalho My logic is that since this forces mongo to resolve the hostname when it goes to connect, I would expect it to at least guarantee that the hostname can be resolved by DNS and that mongo is listening on the pod IP, which should be sufficient to guarantee remote accessibility. Per the above logs, I think DNS resolution may be the primary issue.
Per the log I've included at the top of this issue, the failure occurs when the pod actually cannot reach itself during startup, so I don't think readiness probes are going to help. I'm going to merge this and hope for the best; it shouldn't make things worse, anyway.
Hello @bparees, I confirm that the problem is due to either:
I have created a gist template that works: https://gist.github.com/jperville/faab63ab3704e5f2161c0feb81e82f06 . The template instantiates in a toy OpenShift v1.3.2 cluster created with … If the tolerate-unready-endpoints annotation is commented out, then issue #215 reproduces unless the readiness probe definition is also commented out in the pod template.
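For context, the annotation mentioned above goes on the PetSet's headless governing service. A rough sketch (illustrative names, not the exact gist contents) might look like this:

```yaml
# Illustrative sketch of a headless "governing" service for the PetSet.
apiVersion: v1
kind: Service
metadata:
  name: mongodb-internal
  annotations:
    # Publish pod DNS records even while the pods are still unready;
    # without this, a readiness probe delays DNS registration and the
    # members cannot resolve each other during initial replica-set setup.
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None   # headless: DNS records point directly at the pod IPs
  selector:
    name: mongodb
  ports:
  - name: mongodb
    port: 27017
```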
@jperville There is no readiness probe in the pod template that I see: https://github.com/sclorg/mongodb-container/blob/master/examples/petset/mongodb-petset-persistent.yaml#L98-L134 — I do see how having one could cause this problem, but we don't currently have one. Did you see a mongodb petset template somewhere that does have a readiness probe?
I am using my own template for the petset (which I shared in the above gist link).
@jperville So I'm trying to understand the relevance of your comment to this issue. The template that we saw the issue with does not have a readiness probe, so it should also not need the tolerate-unready-endpoints annotation. If we were to add a readiness probe, I agree we'd also need to add the annotation for things to continue working.
Using the current mongodb-petset-persistent.yaml template works for me right now (now that this PR has been merged). I was just reporting success with the current version of the mongodb-32-centos7 image.
@jperville Got it, thanks for the additional info!
Hopefully addresses the mongodb petset replica flakes as seen here:
https://ci.openshift.redhat.com/jenkins/view/Origin%20Test%20Jobs/job/origin_extended_image_tests/794/
in which one of the slave instances isn't able to reach itself (!!):