
check mongo status via host, not localhost to ensure remote accessibility #217

Merged: 1 commit into sclorg:master on Jan 4, 2017

Conversation

bparees (Collaborator) commented Jan 3, 2017

Hopefully addresses the mongodb petset replica flakes seen here:
https://ci.openshift.redhat.com/jenkins/view/Origin%20Test%20Jobs/job/origin_extended_image_tests/794/

in which one of the slave instances isn't able to access itself (!!):

Dec 22 06:10:06.677: INFO: Running 'oc logs --config=/tmp/extended-test-mongodb-petset-replica-hpydx-s1sgr-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-hpydx-s1sgr mongodb-replicaset-1 --timestamps'
pod logs for 2016-12-22T11:08:55.402173000Z => [Thu Dec 22 11:08:55] Waiting for local MongoDB to accept connections ...
2016-12-22T11:08:55.548729000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] MongoDB starting : pid=16 port=27017 dbpath=/var/lib/mongodb/data 64-bit host=mongodb-replicaset-1
2016-12-22T11:08:55.549036000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] db version v3.2.6
2016-12-22T11:08:55.549311000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] git version: 05552b562c7a0b3143a729aaa0838e558dc49b25
2016-12-22T11:08:55.549569000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2016-12-22T11:08:55.549793000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2016-12-22T11:08:55.550007000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] modules: none
2016-12-22T11:08:55.550213000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] build environment:
2016-12-22T11:08:55.550485000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten]     distarch: x86_64
2016-12-22T11:08:55.550733000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2016-12-22T11:08:55.550985000Z 2016-12-22T11:08:55.547+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { http: { enabled: false }, port: 27017 }, replication: { oplogSizeMB: 64, replSet: "rs0" }, security: { keyFile: "/var/lib/mongodb/keyfile" }, storage: { dbPath: "/var/lib/mongodb/data" }, systemLog: { quiet: true } }
2016-12-22T11:08:55.560117000Z 2016-12-22T11:08:55.555+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-12-22T11:08:55.637315000Z 2016-12-22T11:08:55.604+0000 I CONTROL  [initandlisten] 
2016-12-22T11:08:55.637609000Z 2016-12-22T11:08:55.604+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2016-12-22T11:08:55.637857000Z 2016-12-22T11:08:55.604+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-12-22T11:08:55.638087000Z 2016-12-22T11:08:55.604+0000 I CONTROL  [initandlisten] 
2016-12-22T11:08:55.638341000Z 2016-12-22T11:08:55.604+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2016-12-22T11:08:55.638590000Z 2016-12-22T11:08:55.606+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2016-12-22T11:08:55.638836000Z 2016-12-22T11:08:55.606+0000 I CONTROL  [initandlisten] 
2016-12-22T11:08:55.639094000Z 2016-12-22T11:08:55.612+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2016-12-22T11:08:55.639390000Z 2016-12-22T11:08:55.612+0000 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2016-12-22T11:08:55.639635000Z 2016-12-22T11:08:55.613+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-12-22T11:08:55.639868000Z 2016-12-22T11:08:55.613+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/var/lib/mongodb/data/diagnostic.data'
2016-12-22T11:08:55.640112000Z 2016-12-22T11:08:55.630+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
2016-12-22T11:08:55.689320000Z 2016-12-22T11:08:55.685+0000 I ACCESS   [conn1] note: no users configured in admin.system.users, allowing localhost access
2016-12-22T11:08:55.694596000Z => [Thu Dec 22 11:08:55] Adding mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local to replica set ...
2016-12-22T11:08:55.819897000Z 2016-12-22T11:08:55.814+0000 I NETWORK  [thread1] Starting new replica set monitor for rs0/mongodb-replicaset-0.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local:27017
2016-12-22T11:08:55.820203000Z 2016-12-22T11:08:55.815+0000 I NETWORK  [ReplicaSetMonitorWatcher] starting
2016-12-22T11:08:56.283975000Z {
2016-12-22T11:08:56.284267000Z 	"ok" : 0,
2016-12-22T11:08:56.284532000Z 	"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: mongodb-replicaset-0.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local:27017; the following nodes did not respond affirmatively: mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local:27017 failed with HostUnreachable",
2016-12-22T11:08:56.284767000Z 	"code" : 74
2016-12-22T11:08:56.284997000Z }
2016-12-22T11:08:56.290041000Z => [Thu Dec 22 11:08:56] ERROR: couldn't add host to replica set!

bparees (Collaborator, Author) commented Jan 3, 2017

@rhcarvalho @php-coder ptal, I think this will address the mongo replication test flake we've been seeing in which the petset test fails with the pod logs showing:

2016-12-22T11:08:56.284532000Z 	"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: mongodb-replicaset-0.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local:27017; the following nodes did not respond affirmatively: mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local:27017 failed with HostUnreachable",

My theory is that although mongo is up, the DNS record isn't populated yet (or something along those lines), so the join attempt fails. Changing the "wait for mongo up" check to use the real hostname instead of localhost should avoid that problem. I've confirmed this at least passes the extended tests (but so does the existing logic, most of the time).
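
A minimal sketch of the approach, for illustration only (the function name, port, and timeout here are assumptions, not copied from the repo):

```bash
# Wait for MongoDB to accept connections on the pod's FQDN instead of
# localhost, so success also implies the hostname resolves in cluster DNS
# and mongod is reachable on the pod IP.
wait_for_mongo_up() {
  # e.g. mongodb-replicaset-1.mongodb-replicaset.<namespace>.svc.cluster.local
  local host
  host="$(hostname -f)"

  local i
  for i in $(seq 60); do
    # The mongo shell exits non-zero until the server accepts connections.
    if mongo "${host}:27017" --eval "quit()" &>/dev/null; then
      return 0
    fi
    sleep 1
  done
  echo "=> ERROR: MongoDB is not accepting connections at ${host}" >&2
  return 1
}
```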

@PI-Victor fyi.

php-coder (Contributor) commented

> My theory is that although mongo is up, the DNS isn't populated yet...

Sounds like kubernetes/kubernetes#39363, but we don't have a readiness probe here.

On the other hand, "The default state of Readiness before the initial delay is Failure. The state of Readiness for a container when no probe is provided is assumed to be Success." Is it possible that even though we don't have a readiness probe, the pod starts in the Failure state and is only marked as Success after some delay?

If that's our case, we could try setting the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
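
For illustration, the annotation could be applied to the headless service like this (a sketch; the service name mongodb-replicaset is inferred from the hostnames in the logs above, not taken from the PR):

```bash
# Publish DNS/endpoint records even for pods that are not yet Ready, so
# replica set members can resolve each other during startup.
oc annotate service mongodb-replicaset \
    service.alpha.kubernetes.io/tolerate-unready-endpoints="true"
```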

rhcarvalho (Contributor) commented

@bparees sounds reasonable to me. For future reference, could you please add a PR description, e.g., linking to a flake?

Hmm... on second thought...

> check mongo status via host, not localhost to ensure remote accessibility

I think hostname -f doesn't guarantee remote accessibility, does it?

bparees (Collaborator, Author) commented Jan 4, 2017

@php-coder this is also in the logs:

2016-12-22T11:08:55.842195000Z 2016-12-22T11:08:55.841+0000 W NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local") failed: Name or service not known
2016-12-22T11:08:55.843453000Z 2016-12-22T11:08:55.843+0000 I NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-hpydx-s1sgr.svc.cluster.local") failed: Name or service not known

So I'm still thinking DNS issue, not a readiness issue. I'm not even sure there are services at play here, since the pods are connecting directly to each other, no?

> I think hostname -f doesn't guarantee remote accessibility, does it?

@rhcarvalho My logic is that since this forces mongo to resolve the hostname when it connects, it should at least guarantee that the hostname can be resolved in DNS, and also that mongo is listening on the pod IP, which should be sufficient to guarantee remote accessibility. Per the logs above, I think DNS resolution may be the primary issue.
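
As a rough illustration of that distinction (hypothetical debugging commands, not part of this PR), the two failure modes can be separated from inside the pod:

```bash
# 1. Does the pod's own DNS record resolve yet? This is what fails with
#    "Name or service not known" in the logs above.
getent hosts "$(hostname -f)"

# 2. Is mongod reachable on the pod IP, not just on 127.0.0.1?
mongo "$(hostname -f):27017" --eval 'db.adminCommand("ping")'
```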

bparees (Collaborator, Author) commented Jan 4, 2017

Per the log I've included at the top of this issue, the failure occurs when the pod actually cannot reach itself during startup, so I don't think readiness probes are going to help. I'm going to merge this and hope for the best; it shouldn't make things worse, anyway.

bparees merged commit 7f9aa5d into sclorg:master on Jan 4, 2017
jperville commented

Hello @bparees, I confirm that the problem is due to either:

  • the presence of a readiness probe in the pod template, or
  • the lack of the service.alpha.kubernetes.io/tolerate-unready-endpoints annotation in the service definition.

I have created a gist template that works: https://gist.github.com/jperville/faab63ab3704e5f2161c0feb81e82f06 . The template instantiates in a toy OpenShift v1.3.2 cluster created with oc cluster up.

If the tolerate-unready-endpoints annotation is commented out, then issue #215 reproduces, unless the readiness probe definition is also commented out in the pod template.

bparees (Collaborator, Author) commented Jan 9, 2017

> the presence of a readiness probe in the pod template

@jperville there is no readiness probe in the pod template that I see: https://github.com/sclorg/mongodb-container/blob/master/examples/petset/mongodb-petset-persistent.yaml#L98-L134

I do see how having one could cause this problem, but we don't currently. Did you see a mongodb petset template somewhere that does have a readiness probe?

jperville commented

I am using my own template for the petset (which I shared in the above gist link).

bparees (Collaborator, Author) commented Jan 9, 2017

> I am using my own template for the petset (which I shared in the above gist link).

@jperville so I'm trying to understand the relevance of your comment to this issue. The template we saw the issue with does not have a readiness probe, so it should also not need the tolerate-unready annotation.

If we were to add a readiness probe, I agree we'd also need to add tolerate-unready for things to continue working.

jperville commented

Using the current mongodb-petset-persistent.yaml template works for me right now (now that this PR has been merged). I was just reporting success with the current version of the mongodb-32-centos7 image.

bparees (Collaborator, Author) commented Jan 9, 2017

> Using the current mongodb-petset-persistent.yaml template works for me right now (now that this PR has been merged). I was just reporting success with the current version of the mongodb-32-centos7 image.

@jperville got it. thanks for the additional info!
