add support for custom TLS certificates #798

zimbatm · 2020-01-17T14:03:24Z

re-opened #690 from my account to allow pushes from maintainers.

erthalion · 2020-01-17T15:10:40Z

Well, tests are already better, since they do not fail with some obscure error, but rather with timeout. I'll check why.

FxKu · 2020-01-17T17:20:46Z

@zimbatm since we enabled CRD validation all new manifest fields must be covered by it. That's why the test is failing - validation doesn't know tls.

Have a look at the CRD and add the field there (alphabetical order).
Reflect it also in go code.

If you have questions feel free to ask. Had to deal with validation a lot in the past months.

zimbatm · 2020-01-25T15:34:01Z

thanks @FxKu. I just pushed updated to both schema files. Let's see what CI is saying now.

zimbatm · 2020-01-25T16:02:46Z

it looks like coveralls is unhappy. any idea how I could add test coverage for pkg/apis/acid.zalan.do/v1/zz_generated.deepcopy.go ?

cazter · 2020-02-06T19:04:19Z

stoked on this PR as it will be really helpful for what we have going.

but testing with the current spilo, and spilo (or maybe patroni), raises issue with the permissions:

2020-02-05T22:04:32.602243923Z 2020-02-05 22:04:32 UTC [31471]: [1-1] 5e3b3bf0.7aef 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2020-02-05T22:04:32.602296208Z 2020-02-05 22:04:32 UTC [31471]: [2-1] 5e3b3bf0.7aef 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

cazter · 2020-02-07T19:43:29Z

Disregard my comment. Issue is related to the operator, not this PR. Opened issue #821

abh · 2020-02-23T09:18:55Z

This will fix #633 (mentioning it here just so it gets linked to that issue)

zimbatm · 2020-02-23T13:25:35Z

Rebased to fix the merge conflicts.

@FxKu is it possible to schedule 30min next week to get this merged? You can reserve a slot at https://calendly.com/zimbatm/

pkg/cluster/k8sres.go

pkg/cluster/k8sres_test.go

docs/user.md

docs/reference/cluster_manifest.md

manifests/complete-postgres-manifest.yaml

zimbatm · 2020-03-03T15:21:31Z

@FxKu ok, let me know if that's better!

FxKu · 2020-03-03T15:59:33Z

pkg/cluster/k8sres.go

Just found another issue. In one of your documented examples caFile is left out. But still it is appended as env variable and Postgres tries to look it up, but:

FATAL: could not load root certificate file "/tls/ca.crt": No such file or directory

Maybe the unit test could also be extended then to check that the env variable is not created when the CA is left out

When I did my testing, Postgres complains but keeps going. Is that not the case for you?

I will add documentation for the cafile.

make the CAFile optional and documented it a bit in 662e7e68e90ed199b6fd6df1ac0388a1bb40407c

I wanted both secret types to work out of the box but I understand if the log line is undesirable.

FxKu · 2020-03-11T08:51:16Z

docs/reference/cluster_manifest.md

Defaults to "ca.crt". missing (?)

Not anymore, I had to remove it to avoid the Postgres runtime error.

FxKu · 2020-03-11T08:52:49Z

docs/user.md

hm, can't it be read from the secret if it's in there?

The secret resource is not inspectable during schema generation. The secret resource might not exist, or might be also changed at later points in time. This would require to add another hook in the operator to listen for changes on that secret and update the schema accordingly. I don't know if it's worth the extra complexity.

Another solution would be to change the spilo image to unset the environment variable if the file doesn't exist. This would hide errors if the file really is supposed to exist and is not there.

That's why initially I just passed the env to the spilo image and let postgres complain. It's unfortunate that the log line contains "FATAL" because it didn't seem to affect the running instance the last time I tested it.

I see. Thanks for the explanation.

pkg/cluster/k8sres.go

FxKu · 2020-03-11T13:42:00Z

👍

zimbatm · 2020-03-11T14:19:12Z

@FxKu last commit to improve the user documentation

docs/user.md

FxKu · 2020-03-11T17:28:02Z

docs/user.md

That's not very precise. Are changes ignored? Does it break the cluster?

It inherits the same issue that changing spilo_fsgroup has where the pod needs to be re-created.

I will remove that section for now.

Once the user reads-up on spilo_fsgroup they will probably find the issue.

zimbatm · 2020-03-11T17:28:46Z

I think this is ready

zimbatm · 2020-03-11T17:48:24Z

rebased and squashed

FxKu · 2020-03-12T10:26:06Z

Thanks @zimbatm. Can you address my last comment? We are already telling in the reference docs that changing the spilo_fsgroup requires a Pod restart.

zimbatm · 2020-03-12T19:54:44Z

fair enough, I will just remove that section then.

zimbatm · 2020-03-12T19:58:38Z

it looks like the e2e test failures are also there in master

FxKu · 2020-03-13T08:55:06Z

👍

FxKu · 2020-03-13T08:56:43Z

yes that's one part in the e2e test which probably needs more waiting time, since it's passing at times and sometimes not :-/

erthalion · 2020-03-13T10:28:39Z

👍

FxKu · 2020-03-13T10:45:26Z

Thanks @zimbatm for adding this cool feature and your patience in the review process.

zimbatm · 2020-03-13T10:51:25Z

yesssssss!! 🎉

thanks to you too for helping me out on this PR!

ReSearchITEng · 2020-03-27T18:01:06Z

@FxKu @zimbatm
I tried to use this functionality with latest image opr image: v1.4.0-15-gba9cf68-dirty, in openshift, and it does not seem to work.
In the request I have added this section:

spec:
   tls:
      secretName: "pg-tls"

The statefulSet does not get the mount point for the secret. Nothing seems to happen, and no errors in the opr logs.
Things seem ok in k8s. Could it be something related to fsGroup perms ? Any changes in the rbac ?
Am I missing something else?

zimbatm · 2020-03-27T18:17:44Z

It's hard to say without more details. I haven't used openshift before so I don't know what changes from a vanilla k8s installation. The first thing I would do is look at the event log to see if anything stands out. If it's an existing pg cluster, try creating a fresh one as pods don't really like the fsGroup to be changed.

ReSearchITEng · 2020-03-27T18:29:46Z

right, it's the new section that adds fsGroup automatically that is causing the issue:

  Warning  FailedCreate      23s (x21 over 3m)  statefulset-controller  create Pod acid-tls-1 in StatefulSet acid-tls-1 failed error: pods "acid-tls-1-0" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "persistentVolumeClaim": persistentVolumeClaim volumes are not allowed to be used fsGroup: Invalid value: []int64{103}: 103 is not an allowed group]

We must have it autodetect the group, or ideally simply remove it.

When we remove the fsGroup from the sts (for few minutes till opr puts it back), perms and all look like this:

ls -la /tls/..d*/*
-rw-r-----. 1 root 1000670000 1439 Mar 27 18:44 /tls/..data/tls.crt
-rw-r-----. 1 root 1000670000 1675 Mar 27 18:44 /tls/..data/tls.key

ls -la /tls/
total 0
drwxrwsrwt. 3 root 1000670000 120 Mar 27 18:44 .
drwxr-xr-x. 1 root root        62 Mar 27 18:44 ..
drwxr-sr-x. 2 root 1000670000  80 Mar 27 18:44 ..2020_03_27_18_44_06.799794438
lrwxrwxrwx. 1 root root        31 Mar 27 18:44 ..data -> ..2020_03_27_18_44_06.799794438
lrwxrwxrwx. 1 root root        14 Mar 27 18:44 tls.crt -> ..data/tls.crt
lrwxrwxrwx. 1 root root        14 Mar 27 18:44 tls.key -> ..data/tls.key

k8sres.go:

	// the gid of the postgres user in the default spilo image
	spiloPostgresGID = 103

and

		if effectiveFSGroup == nil {
			c.logger.Warnf("Setting the default FSGroup to satisfy the TLS configuration")
			fsGroup := int64(spiloPostgresGID)
			effectiveFSGroup = &fsGroup
		}

IMO adding tls should not automatically add functionality like fsGroup, when it was not specifically requested.
It's non-intuitive that adding tls will add securityContext.fsGroup, even when not requested.

This way, for these two reasons (not working on openshift as well as non-intuitive behaviour), I suggest to remove the above two section.

I have simulated removing that section in OpenShift, and all seems to work very nice (thanks for the code!)

If it's fsGroup is mandatory for k8s to fix perms, we should add it in the docs that user should add fsGroup, maybe change the docs:
from:

Before applying these changes, the operator must also be configured with the
`spilo_fsgroup` set to the GID matching the postgres user group. If the value
is not provided, the cluster will default to `103` which is the GID from the
default spilo image.

to something like:

Before applying these changes, in k8s (not in OpenShift) the operator must also 
be configured with the `spilo_fsgroup` set to the GID matching the postgres user 
group. If you don't know the value, use `103` which is the GID from the default 
spilo image.
If the value is not provided and you are in k8s (not OpenShift), the certificates 
will not have proper permissions and the server will not work as desired.

(ideally add more info in the last phrase).

@zimbatm @FxKu , WDYS?
If it makes it easier, here is a PR for solving it: #885

* solves #798 (comment) Co-authored-by: Felix Kunde <felix-kunde@gmx.de>

* solves zalando/postgres-operator#798 (comment) Co-authored-by: Felix Kunde <felix-kunde@gmx.de>

zimbatm requested review from avaczi, CyberDem0n, erthalion, FxKu, Jan-M, RafiaSabih and sdudoladov as code owners January 17, 2020 14:03

zimbatm mentioned this pull request Jan 17, 2020

Custom TLS secrets #690

Closed

zimbatm force-pushed the custom-tls-secrets branch from 4ac7c38 to 6b59a55 Compare January 25, 2020 15:34

FxKu added this to the 1.5 milestone Feb 21, 2020

zimbatm force-pushed the custom-tls-secrets branch from 6b59a55 to d152ea9 Compare February 23, 2020 13:23

FxKu reviewed Mar 2, 2020

View reviewed changes

pkg/cluster/k8sres.go Outdated Show resolved Hide resolved