Conversation

@bparees
Collaborator

@bparees bparees commented Jan 29, 2018

No description provided.

@centos-ci
Collaborator

Can one of the admins verify this patch?


@bparees
Collaborator Author

bparees commented Jan 29, 2018

we're seeing flakiness in the replication tests due to liveness probe failures, i'd like to loosen the checks a bit in hopes of resolving them.

@hhorak ptal.

@hhorak
Member

hhorak commented Jan 30, 2018

This makes sense to me. @pkubatrh @praiskup WDYT?

@hhorak
Member

hhorak commented Jan 30, 2018

[test-openshift]

@praiskup
Contributor

praiskup commented Jan 30, 2018

Looking at the liveness/readiness probe definition, shouldn't we actually use
pg_isready for the readiness check, and do some PID check for the liveness probe?

The timeout increase makes sense to me, +1, but I'm not sure why
"initialDelaySeconds" is increased in the liveness probe. The way it is
implemented, there's no safe default for the initial delay (it might take
several minutes for the server to start responding).
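The split suggested above could be sketched roughly as follows. This is only an illustration under assumed defaults (localhost, port 5432, `$PGDATA/postmaster.pid`), not the repository's actual probe scripts:

```shell
# Hypothetical probe sketch: pg_isready for readiness, a PID check for
# liveness. Host, port, and file paths here are assumptions.

# Readiness: pass only when the server is accepting connections.
readiness_probe() {
    pg_isready -q -h 127.0.0.1 -p 5432
}

# Liveness: check only that the postmaster process still exists, using
# the PID recorded on the first line of postmaster.pid.
liveness_probe() {
    pid=$(head -n 1 "$PGDATA/postmaster.pid" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}
```

The liveness check here passes as long as the process is alive, even during a slow startup, which would sidestep the initialDelaySeconds guessing game discussed below (at the cost of not detecting a hung-but-running process).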

@bparees
Collaborator Author

bparees commented Jan 30, 2018

but I'm not sure why
"initialDelaySeconds" is increased in the liveness probe. The way it is
implemented, there's no safe default for the initial delay (it might take
several minutes for the server to start responding).

Well, that's why I made it longer :) As you say, I'm not sure what the safe value could be, so I'm just trying to raise it enough that we no longer get killed if psql takes a bit to start up. I'm happy to raise it even more if you think that's a good idea, but I'm hopeful that 2 minutes is going to be sufficient.

@praiskup
Contributor

What if we dropped the liveness probe then, or checked that PID (1?) is running ... and kept only the readiness probe?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

What if we dropped the liveness probe then, or checked that PID (1?) is running ... and kept only the readiness probe?

Checking that pid 1 is running doesn't guarantee pid 1 is not hung.

can we merge this for the short term (i'm hoping it will resolve some of my test flakes) and you guys can open a follow up issue to revisit how you want the liveness check to behave long term?

@praiskup
Contributor

Checking that pid 1 is running doesn't guarantee pid 1 is not hung.

This is rather hypothetical (in the case of PostgreSQL), right? If the PID is "running" (and it is
already the exec'd postgres process), it does not seem wise to force a container kill,
even if the container is hung somewhere (there must be a reason for the hang), and
end-user requests are blocked by the readiness check anyway, no?

can we merge this for the short term (i'm hoping it will resolve some of my test flakes) and you guys can open a follow up issue to revisit how you want the liveness check to behave long term?

I'm neutral; if @hhorak thinks it is OK then fine. To me this solves pretty much nothing; the
design should be pg_isready == 0 for the readiness probe, and pg_isready in [0, 1] for
the liveness probe. But yeah, it can be solved later.
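That design — interpreting pg_isready's exit status differently for the two probes — could look roughly like this (a sketch, not the repository's actual scripts; the comment's host/port values are assumptions):

```shell
# pg_isready exit statuses, per the PostgreSQL docs:
#   0 = server is accepting connections
#   1 = server is rejecting connections (e.g. still starting up)
#   2 = no response to the connection attempt
#   3 = no attempt was made (e.g. invalid parameters)

# Readiness: pass only on status 0 (the server accepts connections).
readiness_verdict() {
    [ "$1" -eq 0 ]
}

# Liveness: pass on status 0 or 1 (the server process is up, even if it
# is still starting or rejecting connections); anything else means dead.
liveness_verdict() {
    [ "$1" -eq 0 ] || [ "$1" -eq 1 ]
}

# A probe script would then do something like:
#   pg_isready -q -h 127.0.0.1 -p 5432; liveness_verdict $?
```

Under this split, a slow startup (status 1) keeps the pod alive but unready, so traffic is withheld without the container being killed.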

@bparees
Collaborator Author

bparees commented Jan 30, 2018

This is rather hypothetical (in the case of PostgreSQL), right? If the PID is "running" (and it is
already the exec'd postgres process), it does not seem wise to force a container kill,
even if the container is hung somewhere (there must be a reason for the hang), and
end-user requests are blocked by the readiness check anyway, no?

since you're (presumably) using a PV for your data, if the postgres process has become unresponsive, killing the container makes sense to me. Now whether the existing liveness check is an appropriate way to determine whether postgres is "hung" or not, I cannot say.

@praiskup
Contributor

since you're (presumably) using a PV for your data, if the postgres process has become unresponsive, killing the container makes sense to me.

What's meant by "has become unresponsive", for example? The question is: is kill && redeploy
helpful in such a case?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

What's meant by "has become unresponsive", for example?

process deadlocks due to a bug.

praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 30, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 30, 2018
@praiskup
Contributor

process deadlocks due to a bug.

Is this based on a real story? What about #227?

@bparees
Collaborator Author

bparees commented Jan 30, 2018

Is this based on a real story? What about #227?

You're proposing that there is never and will never be a case where the psql process is running but no longer accepting connections on its port? That seems optimistic.

@praiskup
Contributor

I'm saying that either (a) PG is too busy, in which case the readiness check helps and there are thus good reasons to wait so as not to lose data, (b) the PID file stops existing, or (c) there's a bug. If there's nothing in PostgreSQL itself to detect (c) and restart, we can hardly find a heuristic that has no unwanted consequences.

praiskup added a commit to praiskup/postgresql-container that referenced this pull request Jan 31, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Feb 1, 2018
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Feb 1, 2018
pkubatrh pushed a commit that referenced this pull request Feb 5, 2018
@pkubatrh
Member

pkubatrh commented Feb 5, 2018

Closed in favour of #227

@pkubatrh pkubatrh closed this Feb 5, 2018