Conversation

@bparees
Collaborator

@bparees bparees commented Jan 29, 2018

No description provided.

@centos-ci
Collaborator

Can one of the admins verify this patch?


@bparees
Collaborator Author

bparees commented Jan 29, 2018

we're seeing flakiness in the replication tests due to liveness probe failures, i'd like to loosen the checks a bit in hopes of resolving them.

@hhorak ptal.

@hhorak
Member

hhorak commented Jan 30, 2018

This makes sense to me. @pkubatrh @praiskup WDYT?

@hhorak
Member

hhorak commented Jan 30, 2018

[test-openshift]

@praiskup
Contributor

praiskup commented Jan 30, 2018

Looking at the liveness/readiness probe definition, shouldn't we actually use
pg_isready for the readiness check, and do some PID check for the liveness probe?

The timeout increase makes sense to me, +1, but I'm not sure why
"initialDelaySeconds" is increased in the liveness probe. The way it is
implemented, there's no safe default for the initial delay (it might take
several minutes for the server to start responding).
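The split suggested above could be sketched roughly as follows. This is only an illustration under assumed defaults (localhost, port 5432, `$PGDATA/postmaster.pid`), not the repository's actual probe scripts:

```shell
# Hypothetical probe sketch: pg_isready for readiness, a PID check for
# liveness. Host, port, and file paths here are assumptions.

# Readiness: pass only when the server is accepting connections.
readiness_probe() {
    pg_isready -q -h 127.0.0.1 -p 5432
}

# Liveness: check only that the postmaster process still exists, using
# the PID recorded on the first line of postmaster.pid.
liveness_probe() {
    pid=$(head -n 1 "$PGDATA/postmaster.pid" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

The liveness check here passes as long as the process is alive, even during a slow startup, which would sidestep the initialDelaySeconds guessing game discussed below (at the cost of not detecting a hung-but-running process).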

@bparees
Collaborator Author

bparees commented Jan 30, 2018

but I'm not sure why
"initialDelaySeconds" is increased in the liveness probe. The way it is
implemented, there's no safe default for the initial delay (it might take
several minutes for the server to start responding).

Well, that's why I made it longer :) As you say, I'm not sure what the safe value could be, so I'm just trying to raise it enough that we no longer get killed if psql takes a bit to start up. I'm happy to raise it even more if you think that's a good idea, but I'm hopeful that 2 minutes is going to be sufficient.

@praiskup
Contributor

What if we dropped the liveness probe then, or checked that PID (1?) is running ... and kept only the readiness probe?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

What if we dropped the liveness probe then, or checked that PID (1?) is running ... and kept only the readiness probe?

Checking that pid 1 is running doesn't guarantee pid 1 is not hung.

can we merge this for the short term (i'm hoping it will resolve some of my test flakes) and you guys can open a follow up issue to revisit how you want the liveness check to behave long term?

@praiskup
Contributor

Checking that pid 1 is running doesn't guarantee pid 1 is not hung.

This is rather hypothetical (in the case of PostgreSQL), right? If the PID is "running" (and it is
already the exec'd postgres process), it does not seem wise to force a container kill,
even if the container is hung somewhere (there must be a reason for the hang), and
end-user requests are blocked by the readiness check anyway, no?

can we merge this for the short term (i'm hoping it will resolve some of my test flakes) and you guys can open a follow up issue to revisit how you want the liveness check to behave long term?

I'm neutral; if @hhorak thinks it is OK then fine. To me this solves pretty much nothing; the
design should be pg_isready == 0 for the readiness probe, and pg_isready in [0, 1] for
the liveness probe. But yeah, it can be solved later.
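That design — interpreting pg_isready's exit status differently for the two probes — could look roughly like this (a sketch, not the repository's actual scripts; the comment's host/port values are assumptions):

```shell
# pg_isready exit statuses, per the PostgreSQL docs:
#   0 = server is accepting connections
#   1 = server is rejecting connections (e.g. still starting up)
#   2 = no response to the connection attempt
#   3 = no attempt was made (e.g. invalid parameters)

# Readiness: pass only on status 0 (the server accepts connections).
readiness_verdict() {
    [ "$1" -eq 0 ]
}

# Liveness: pass on status 0 or 1 (the server process is up, even if it
# is still starting or rejecting connections); anything else means dead.
liveness_verdict() {
    [ "$1" -eq 0 ] || [ "$1" -eq 1 ]
}

# A probe script would then do something like:
#   pg_isready -q -h 127.0.0.1 -p 5432; liveness_verdict $?
```

Under this split, a slow startup (status 1) keeps the pod alive but unready, so traffic is withheld without the container being killed.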

@bparees
Collaborator Author

bparees commented Jan 30, 2018

This is rather hypothetical (in the case of PostgreSQL), right? If the PID is "running" (and it is
already the exec'd postgres process), it does not seem wise to force a container kill,
even if the container is hung somewhere (there must be a reason for the hang), and
end-user requests are blocked by the readiness check anyway, no?

since you're (presumably) using a PV for your data, if the postgres process has become unresponsive, killing the container makes sense to me. Now whether the existing liveness check is an appropriate way to determine whether postgres is "hung" or not, I cannot say.

@praiskup
Contributor

since you're (presumably) using a PV for your data, if the postgres process has become unresponsive, killing the container makes sense to me.

What's meant by "has become unresponsive", for example? The question is: is kill && redeploy
helpful in such a case?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

What's meant by "has become unresponsive", for example?

process deadlocks due to a bug.

praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 30, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 30, 2018
@praiskup
Contributor

process deadlocks due to a bug.

Is this based on a real story? What about #227?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

Is this based on a real story? What about #227?

You're proposing that there is never and will never be a case where the psql process is running but no longer accepting connections on its port? That seems optimistic.

@praiskup
Contributor

I'm saying that either (a) PG is too busy, in which case the readiness check helps and there are thus good reasons to wait so as not to lose data, (b) the PID file stops existing, or (c) there's a bug. If there's nothing in PostgreSQL itself to detect (c) and restart, we can hardly find a heuristic that has no unwanted consequences.

praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 31, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Feb 1, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Feb 1, 2018
pkubatrh pushed a commit that referenced this pull request Feb 5, 2018
@pkubatrh
Member

pkubatrh commented Feb 5, 2018

Closed in favour of #227

@pkubatrh pkubatrh closed this Feb 5, 2018