loosen liveness checks #226
Can one of the admins verify this patch?
we're seeing flakiness in the replication tests due to liveness probe failures, i'd like to loosen the checks a bit in hopes of resolving them. @hhorak ptal.
[test-openshift]
Looking at the liveness/readiness probe definition, shouldn't we actually use Prolonging the timeout makes sense to me, +1, but I'm not sure why the
well that's why i made it longer :) As you say, i'm not sure what the safe value could be, so i'm just trying to raise it enough that we no longer get killed if psql takes a bit to start up. I'm happy to raise it even more if you think that's a good idea, but i'm hopeful that 2 minutes is going to be sufficient.
What about if we dropped the liveness probe then, or checked that PID (1?) is running ... and kept only the readiness probe? |
Checking that pid 1 is running doesn't guarantee pid 1 is not hung. Can we merge this for the short term (i'm hoping it will resolve some of my test flakes) and you guys can open a follow-up issue to revisit how you want the liveness check to behave long term?
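To make the distinction above concrete, here is a minimal sketch of a "is the PID running" check. It assumes a POSIX system, and `pid_is_running` is a hypothetical helper for illustration, not something taken from this image. It only answers whether the process exists, so a process that is deadlocked or stuck would still pass.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; it only performs the existence
        # and permission checks.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

A check like this cannot tell a healthy server from a hung one, which is why it does not replace a probe that actually talks to the service.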
This is rather hypothetical (in case of PostgreSQL), right? If the pid is "running" (and it is
I'm neutral, if @hhorak thinks it is OK then fine. To me this solves pretty much nothing; the
since you're (presumably) using a PV for your data, if the postgres process has become unresponsive, killing the container makes sense to me. Now whether the existing liveness check is an appropriate way to determine whether postgres is "hung" or not, I cannot say. |
What's meant by "has become unresponsive", for example? The question: is kill&&redeploy
process deadlocks due to a bug. |
Is this based on a real story? What about #227?
you're proposing that there is never and will never be a case where the psql process is running, but no longer accepting connections on its port? That seems optimistic. |
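As a rough illustration of the "accepting connections on its port" condition, the sketch below tries to open a TCP connection with a timeout. The helper name `accepts_connections` and its parameters are assumptions made for this example, not the probe this image uses. Note the limitation: a successful TCP connect only shows that the kernel still holds a listening socket, so a real probe would also need to exchange data with the server (for example, run a trivial query).

```python
import socket


def accepts_connections(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

For PostgreSQL on its default port this would be called as `accepts_connections("127.0.0.1", 5432)`, assuming the server listens on that address.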
I'm saying that (a) either PG is too busy, where the readiness check helps, and there are thus good reasons to wait so as not to lose data, (b) the PID file stops existing, or (c) there's a bug. If there's nothing in PostgreSQL itself to detect (c) and restart, we can hardly find a heuristic which has no consequences.
Closed in favour of #227
No description provided.