Skip to content

Conversation

@praiskup
Copy link
Contributor

Related: #226

@praiskup
Copy link
Contributor Author

[test-openshift]

@praiskup praiskup force-pushed the separate-liveness-check branch from 0e609a0 to a47a49a Compare January 30, 2018 19:50
@praiskup praiskup mentioned this pull request Jan 30, 2018
@bparees
Copy link
Collaborator

bparees commented Jan 30, 2018

I don't know that a liveness check which simply ensures you can exec into the container, and that pid 1 exists, is worth having. (if pid1 didn't exist, the container would have terminated and been restarted, and if you can't exec into the container the container doesn't exist).

@praiskup
Copy link
Contributor Author

pg_isready is actually used for both checks, at least should be

@bparees
Copy link
Collaborator

bparees commented Jan 30, 2018

pg_isready is actually used for both checks, at least should be

my reading of the bash was that if --live is passed, it doesn't really matter what pg_isready returns (whether it returns 0 or 1, you'll exit 0 either way)

@praiskup
Copy link
Contributor Author

pg_isready should return 1 if the server is starting, 2 if the readiness check failed

@bparees
Copy link
Collaborator

bparees commented Jan 30, 2018

pg_isready should return 1 if the server is starting, 2 if the readiness check failed

ah, ok.

lgtm then.

@pkubatrh
Copy link
Member

Tried testing out locally, getting:

Readiness probe failed: rpc error: code = 13 desc = invalid header field value "oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"/usr/libexec/check-container\\\": stat /usr/libexec/check-container: no such file or directory\"\n"

@pkubatrh
Copy link
Member

pkubatrh commented Jan 31, 2018

Oh right that is because of the update test since the official docker images do not have the file in them yet...

@praiskup
Copy link
Contributor Author

praiskup commented Jan 31, 2018

I have to look closer, but can we somehow relax the tests (temporarily) so the tests are run only against git-version of the containers? (variable, option for jenkins...)

@pkubatrh
Copy link
Member

pkubatrh commented Jan 31, 2018

Well there would not be any reason to run the update test if we only used the git-built container. So the easiest temporary fix is to just not run it at all...

Long time we should just use the live Openshift templates instead of the local ones for this test. We have separate tests for local templates so should be ok.

@pkubatrh
Copy link
Member

Other test cases ran ok on my local box.

@praiskup
Copy link
Contributor Author

So basically nothing prevent's us to push this... right? Anyways, please during the review -- take attention to the timeouts. Previously, there was the (default) timeout for pg_isready of 3 seconds, I disabled this by --timeout 0. Also, previously pg_isready was only in liveness check, but now it is used on both places....

So to the limits (10s timeout for liveness check, checking after 30s after container start...). It is still not perfect... if the db initialization takes minutes (e.g. s2i, or slower storage), the script waits till we execute /bin/postgres (the /proc/1/exe check).... But openshift runs the script after 30s, gives it 10s time-frame ... and then kills it (somehow). @bparees is there documented what happens with the process, so we could "catch the signal" and return "success" in such case (initdb phase)?
To my interpretation, the container is considered to be "live at that signal time".

@praiskup
Copy link
Contributor Author

Basically, what I miss is that I'm still not sure that the liveness check won't kill the postgres pod
during the container initialization.. and I also don't want to set the initialDelaySeconds too
high, since I want to claim the pod is live as soon as possible. Is it possible to make
openshift more tolerant right after container start? The default threshold to claim container is
failed is 3 consecutive liveness check failures (times 10s), which is fine but not right after
container start.

@bparees
Copy link
Collaborator

bparees commented Jan 31, 2018

But openshift runs the script after 30s, gives it 10s time-frame ... and then kills it (somehow). @bparees is there documented what happens with the process, so we could "catch the signal" and return "success" in such case (initdb phase)?

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-command

you cannot catch the signal. once the liveness check fails, k8s will kill your container, period.

and I also don't want to set the initialDelaySeconds too high, since I want to claim the pod is live as soon as possible

claiming the pod is "live" doesn't serve any purpose i do not believe. readiness probes are what determine routeability. liveness probes are only for determining if the pod should be killed, so a loose liveness check with a long initialdelay just means it won't get killed quickly.

@praiskup praiskup force-pushed the separate-liveness-check branch from a47a49a to c43401a Compare January 31, 2018 14:01
@praiskup
Copy link
Contributor Author

@bparees aha, thanks, that's what docs say too ... sorry. In that case I understand
that you were right in #226 that we should delay the first liveness check. It might be
still be not enough ... but let's see. PTAL @pkubatrh.

@pkubatrh
Copy link
Member

pkubatrh commented Feb 1, 2018

@praiskup Please remove the openshift update test from test cases that are being run to make sure we have at least some of the openshift tests working. I will look into how to do the test properly in the meantime. Otherwise LGTM.

@bparees This change will only work with the centos versions of the images, since the check-container script will be missing from the rhel versions for now. Is this ok for you?

@praiskup praiskup changed the title check-container: {live,readi}ness checks moved from templates check-container: {live,ready}ness checks moved from templates Feb 1, 2018
@pkubatrh
Copy link
Member

pkubatrh commented Feb 1, 2018

I will look into how to do the test properly in the meantime.

I have created a PR that modifies the update test to use the template located in the openshift repository instead in #229

@praiskup praiskup force-pushed the separate-liveness-check branch from c43401a to 36a45c3 Compare February 1, 2018 12:05
praiskup added a commit to praiskup/postgresql-container that referenced this pull request Feb 1, 2018
@praiskup
Copy link
Contributor Author

praiskup commented Feb 1, 2018

[test-openshift]

@praiskup
Copy link
Contributor Author

praiskup commented Feb 1, 2018

So if you merge #229, I think I can drop the last patch and re-test. Otherwise feel free
to merge only the patch #1.

@bparees
Copy link
Collaborator

bparees commented Feb 1, 2018

@bparees This change will only work with the centos versions of the images, since the check-container script will be missing from the rhel versions for now. Is this ok for you?

No, the template is shared by both centos and rhel users. You need to publish a rhel image that works before changing the template to avoid breaking people who pick up the template change.

(yes most people get the template from origin but: 1) it's still not a good idea and 2) i don't want to risk us pulling this change into origin and breaking people using the rhel image)

@pkubatrh
Copy link
Member

pkubatrh commented Feb 1, 2018

@bparees Right. So the question now is are you ok with waiting to get the timeout changes merged in via this PR until a new rhel image containing these changes will be available or do you want to get at least the timeout changes (via #226) ASAP?

@praiskup I have merged #229 so feel free to rebase.

@bparees
Copy link
Collaborator

bparees commented Feb 1, 2018

@bparees Right. So the question now is are you ok with waiting to get the timeout changes merged in via this PR until a new rhel image containing these changes will be available or do you want to get at least the timeout changes (via #226) ASAP?

I don't really see a good reason to hold #226 so i guess i'd prefer to go ahead and merge it for now.

@praiskup praiskup force-pushed the separate-liveness-check branch from 36a45c3 to 0dc49c0 Compare February 1, 2018 14:04
@praiskup
Copy link
Contributor Author

praiskup commented Feb 1, 2018

[test-openshift]

@praiskup
Copy link
Contributor Author

praiskup commented Feb 1, 2018

No, the template is shared by both centos and rhel users. You need to publish a rhel image that works before changing the template to avoid breaking people who pick up the template change.

This is bad, I would rather prefer not to have such requirements on the source repository.

@praiskup
Copy link
Contributor Author

praiskup commented Feb 1, 2018

[test-openshift]

@praiskup
Copy link
Contributor Author

praiskup commented Feb 2, 2018

@bparees what's actually the problem with merging this? You are afraid that
people will use the templates from this source repo? Or is there some chance
this will be synced into openshift/origin? If the later, some CI should stop the
sync, or not?

@bparees
Copy link
Collaborator

bparees commented Feb 2, 2018

@bparees what's actually the problem with merging this? You are afraid that
people will use the templates from this source repo?

primarily this.

Or is there some chance
this will be synced into openshift/origin? If the later, some CI should stop the
sync, or not?

somewhat this. We don't currently have any CI that tests the rhel images I don't think, the CI tests that would run would confirm the template works when centos imagestreams are defined (which would work in this case).

All that said, i'm not going to block you guys on it. Moving forward I want to have the following ordering:

  1. the image repo template is the "upstream" and i guess can be broken
  2. openshift/library gets updated by you guys when you have something you want distributed (in this case you'd update it once the rhel image was available)
  3. origin pulls from openshift/library whenever we feel like it.

(Today we have a situation where we maintain the db-templates directly in origin(periodically updating from the image-repo manually), library pulls from origin).

@praiskup
Copy link
Contributor Author

praiskup commented Feb 2, 2018

Especially first step in the workflow is a must from our POV. Others we should do according
to your wishes. But I don't feel confident that we can setup some safe CI due to the
separate live-cycles of images and templates. It would be really nice if we could release
centos && rhel images && templates atomically, and everything was tested properly.
Since it is probably not possible:
It means that (a) we could theoretically release image which is not compatible with
released template, and we could (b) release incompatible template with released images.
CI protecting against (a) should be here in this repo, inherited into internal dist-git.
CI protecting against (b) should live in both openshift/library and openshift/origin, is
that right?

So I think the workflow described by Ben is fine, and Ben acked the merge, @hhorak
or @pkubatrh can you merge then?

@bparees
Copy link
Collaborator

bparees commented Feb 2, 2018

CI protecting against (b) should live in both openshift/library and openshift/origin, is
that right?

it's never going to live in library, there's just no way library can be responsible for CIing everyone's content.

But yes, having it live in origin is plausible. Even longer term i'd like to get to the point where origin is not even distributing these templates and if you want them, you install them as an "scl_templates_and_imagestreams" package or something.

@pkubatrh pkubatrh merged commit 28de1f1 into sclorg:master Feb 5, 2018
@praiskup
Copy link
Contributor Author

praiskup commented Feb 7, 2018

Just cc'ing @pvalena, does the "two-way" CI setup make sense to you?

@pvalena
Copy link
Member

pvalena commented Feb 8, 2018

But yes, having it live in origin is plausible. Even longer term i'd like to get to the point where origin is not even distributing these templates and if you want them, you install them as an "scl_templates_and_imagestreams" package or something.

Keeping developement templates in this repo and released templates in origin repo (with CI run in those respective repos) seems fine.

@bparees
Copy link
Collaborator

bparees commented Feb 27, 2018

@praiskup so when can we expect new rhel images that will include the check-container script so we can resume consuming the template that lives in this repo?

@pkubatrh
Copy link
Member

@bparees I will be pulling upstream changes (inc. this one) for the next RHAH release.

@praiskup praiskup deleted the separate-liveness-check branch March 28, 2018 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants