check mongo status via host, not localhost to ensure remote accessibility #217
Conversation
@rhcarvalho @php-coder PTAL, I think this will address the mongo replication test flake we've been seeing, in which the petset test fails with the pod logs showing:
My theory is that although mongo is up, DNS isn't populated yet (or something along those lines), so the join attempt fails. Changing the "wait for mongo up" check to use the real host instead of localhost should avoid that problem. I've confirmed this at least passes the extended tests (but so does the existing logic, most of the time). @PI-Victor fyi.
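For illustration, here is a minimal sketch of that kind of check (the helper name and variable are hypothetical; the repo's actual script may differ): instead of polling mongod on localhost, poll it via the pod's real hostname, so that a successful check also implies the member is resolvable and reachable under the name the other replica-set members will use.

```bash
#!/bin/bash
# Sketch only: wait until mongod answers on the pod's real hostname rather than
# on localhost, so success also means the member is reachable under the name
# the other members will use when joining it.
wait_for_mongo_up_on_host() {
  # Default to the pod's FQDN; callers may pass an explicit host.
  local host="${1:-$(hostname -f)}"
  local i
  for i in $(seq 60); do
    # db.version() is a cheap command that only succeeds once mongod
    # accepts connections on the given host (and the name resolves).
    if mongo --host "$host" --quiet --eval "db.version()" &>/dev/null; then
      echo "=> MongoDB is up on $host"
      return 0
    fi
    echo "=> Waiting for MongoDB on $host (attempt $i)..."
    sleep 1
  done
  return 1
}
```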
Sounds like kubernetes/kubernetes#39363, but we don't have a readiness probe here. On the other hand: "The default state of Readiness before the initial delay is Failure. The state of Readiness for a container when no probe is provided is assumed to be Success." Is it possible that even though we don't have a readiness probe, the pod starts with a Failed readiness state and is only marked Success after some delay? If that's our case, we could try to set the corresponding annotation.
@bparees sounds reasonable to me. For future reference, could you please add some PR description, e.g., linking to a flake? Hmm... on second thought...
I think…
@php-coder this is also in the logs:
So I'm still thinking this is a DNS issue, not a readiness issue. I'm not even sure there are services at play here, since the pods connect directly to each other, no?
@rhcarvalho My logic is that since this forces mongo to resolve the hostname when it goes to connect, I would expect it to at least guarantee that the hostname can be resolved by DNS and that mongo is listening on the pod IP, which should be sufficient to guarantee remote accessibility. Per the above logs, I think DNS resolution may be the primary issue.
Per the log I've included at the top of this issue, the failure occurs when the pod actually cannot reach itself during startup, so I don't think readiness probes are going to help. I'm going to merge this and hope for the best; it shouldn't make things worse, anyway.
Hello @bparees, I confirm that the problem is due to either:
I have created a gist template that works: https://gist.github.com/jperville/faab63ab3704e5f2161c0feb81e82f06 . The template instantiates in a toy OpenShift v1.3.2 cluster created with … If the tolerate-unready-endpoints annotation is commented out, then issue #215 reproduces unless the readiness probe definition is also commented out in the pod template.
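For context, the annotation mentioned above goes on the PetSet's headless governing service. A rough sketch (illustrative names, not the exact gist contents) might look like this:

```yaml
# Illustrative sketch of a headless "governing" service for the PetSet.
apiVersion: v1
kind: Service
metadata:
  name: mongodb-internal
  annotations:
    # Publish pod DNS records even while the pods are still unready;
    # without this, a readiness probe delays DNS registration and the
    # members cannot resolve each other during initial replica-set setup.
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None   # headless: DNS records point directly at the pod IPs
  selector:
    name: mongodb
  ports:
  - name: mongodb
    port: 27017
```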
@jperville There is no readiness probe in the pod template that I see: https://github.com/sclorg/mongodb-container/blob/master/examples/petset/mongodb-petset-persistent.yaml#L98-L134 — I do see how having one could cause this problem, but we don't currently have one. Did you see a mongodb petset template somewhere that does have a readiness probe?
I am using my own template for the petset (which I shared in the above gist link).
@jperville So I'm trying to understand the relevance of your comment to this issue. The template that we saw the issue with does not have a readiness probe, so it should also not need the tolerate-unready-endpoints annotation. If we were to add a readiness probe, I agree we'd also need to add the annotation for things to continue working.
Using the current mongodb-petset-persistent.yaml template works for me right now (now that this PR has been merged). I was just reporting success with the current version of the mongodb-32-centos7 image.
@jperville Got it, thanks for the additional info!
Hopefully addresses the mongodb petset replica flakes as seen here:
https://ci.openshift.redhat.com/jenkins/view/Origin%20Test%20Jobs/job/origin_extended_image_tests/794/
in which one of the slave instances isn't able to reach itself (!!):