Skip to content

Conversation

zimbatm
Copy link
Contributor

@zimbatm zimbatm commented Jan 17, 2020

re-opened #690 from my account to allow pushes from maintainers.

@erthalion
Copy link
Contributor

Well, tests are already better, since they do not fail with some obscure error, but rather with timeout. I'll check why.

@FxKu
Copy link
Member

FxKu commented Jan 17, 2020

@zimbatm since we enabled CRD validation all new manifest fields must be covered by it. That's why the test is failing - validation doesn't know tls.

Have a look at the CRD and add the field there (alphabetical order).
Reflect it also in go code.

If you have questions feel free to ask. Had to deal with validation a lot in the past months.

@zimbatm
Copy link
Contributor Author

zimbatm commented Jan 25, 2020

thanks @FxKu. I just pushed updated to both schema files. Let's see what CI is saying now.

@zimbatm zimbatm force-pushed the custom-tls-secrets branch from 4ac7c38 to 6b59a55 Compare January 25, 2020 15:34
@zimbatm
Copy link
Contributor Author

zimbatm commented Jan 25, 2020

it looks like coveralls is unhappy. any idea how I could add test coverage for pkg/apis/acid.zalan.do/v1/zz_generated.deepcopy.go ?

@cazter
Copy link

cazter commented Feb 6, 2020

stoked on this PR as it will be really helpful for what we have going.

but testing with the current spilo, and spilo (or maybe patroni), raises issue with the permissions:

2020-02-05T22:04:32.602243923Z 2020-02-05 22:04:32 UTC [31471]: [1-1] 5e3b3bf0.7aef 0 FATAL: data directory "/home/postgres/pgdata/pgroot/data" has invalid permissions
2020-02-05T22:04:32.602296208Z 2020-02-05 22:04:32 UTC [31471]: [2-1] 5e3b3bf0.7aef 0 DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

@cazter
Copy link

cazter commented Feb 7, 2020

Disregard my comment. Issue is related to the operator, not this PR. Opened issue #821

@FxKu FxKu added this to the 1.5 milestone Feb 21, 2020
@abh
Copy link
Contributor

abh commented Feb 23, 2020

This will fix #633 (mentioning it here just so it gets linked to that issue)

@zimbatm
Copy link
Contributor Author

zimbatm commented Feb 23, 2020

Rebased to fix the merge conflicts.

@FxKu is it possible to schedule 30min next week to get this merged? You can reserve a slot at https://calendly.com/zimbatm/

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 3, 2020

@FxKu ok, let me know if that's better!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found another issue. In one of your documented examples caFile is left out. But still it is appended as env variable and Postgres tries to look it up, but:

FATAL:  could not load root certificate file "/tls/ca.crt": No such file or directory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the unit test could also be extended then to check that the env variable is not created when the CA is left out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I did my testing, Postgres complains but keeps going. Is that not the case for you?

I will add documentation for the cafile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the CAFile optional and documented it a bit in 662e7e68e90ed199b6fd6df1ac0388a1bb40407c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted both secret types to work out of the box but I understand if the log line is undesirable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults to "ca.crt". missing (?)

Copy link
Contributor Author

@zimbatm zimbatm Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not anymore, I had to remove it to avoid the Postgres runtime error.

docs/user.md Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, can't it be read from the secret if it's in there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The secret resource is not inspectable during schema generation. The secret resource might not exist, or might be also changed at later points in time. This would require to add another hook in the operator to listen for changes on that secret and update the schema accordingly. I don't know if it's worth the extra complexity.

Another solution would be to change the spilo image to unset the environment variable if the file doesn't exist. This would hide errors if the file really is supposed to exist and is not there.

That's why initially I just passed the env to the spilo image and let postgres complain. It's unfortunate that the log line contains "FATAL" because it didn't seem to affect the running instance the last time I tested it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the explanation.

@FxKu
Copy link
Member

FxKu commented Mar 11, 2020

👍

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 11, 2020

@FxKu last commit to improve the user documentation

docs/user.md Outdated
Comment on lines 505 to 564
Copy link
Member

@FxKu FxKu Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not very precise. Are changes ignored? Does it break the cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It inherits the same issue that changing spilo_fsgroup has where the pod needs to be re-created.

I will remove that section for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the user reads-up on spilo_fsgroup they will probably find the issue.

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 11, 2020

I think this is ready

@zimbatm zimbatm force-pushed the custom-tls-secrets branch from daf62d1 to 0803b90 Compare March 11, 2020 17:47
@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 11, 2020

rebased and squashed

@FxKu
Copy link
Member

FxKu commented Mar 12, 2020

Thanks @zimbatm. Can you address my last comment? We are already telling in the reference docs that changing the spilo_fsgroup requires a Pod restart.

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 12, 2020

fair enough, I will just remove that section then.

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 12, 2020

it looks like the e2e test failures are also there in master

@FxKu
Copy link
Member

FxKu commented Mar 13, 2020

👍

@FxKu
Copy link
Member

FxKu commented Mar 13, 2020

yes that's one part in the e2e test which probably needs more waiting time, since it's passing at times and sometimes not :-/

@erthalion
Copy link
Contributor

👍

@FxKu FxKu merged commit 65fb2ce into zalando:master Mar 13, 2020
@FxKu
Copy link
Member

FxKu commented Mar 13, 2020

Thanks @zimbatm for adding this cool feature and your patience in the review process.

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 13, 2020

yesssssss!! 🎉

thanks to you too for helping me out on this PR!

@zimbatm zimbatm deleted the custom-tls-secrets branch March 13, 2020 10:52
@ReSearchITEng
Copy link
Contributor

ReSearchITEng commented Mar 27, 2020

@FxKu @zimbatm
I tried to use this functionality with latest image opr image: v1.4.0-15-gba9cf68-dirty, in openshift, and it does not seem to work.
In the request I have added this section:

spec:
   tls:
      secretName: "pg-tls"

The statefulSet does not get the mount point for the secret. Nothing seems to happen, and no errors in the opr logs.
Things seem ok in k8s. Could it be something related to fsGroup perms ? Any changes in the rbac ?
Am I missing something else?

@zimbatm
Copy link
Contributor Author

zimbatm commented Mar 27, 2020

It's hard to say without more details. I haven't used openshift before so I don't know what changes from a vanilla k8s installation. The first thing I would do is look at the event log to see if anything stands out. If it's an existing pg cluster, try creating a fresh one as pods don't really like the fsGroup to be changed.

@ReSearchITEng
Copy link
Contributor

ReSearchITEng commented Mar 27, 2020

right, it's the new section that adds fsGroup automatically that is causing the issue:

  Warning  FailedCreate      23s (x21 over 3m)  statefulset-controller  create Pod acid-tls-1 in StatefulSet acid-tls-1 failed error: pods "acid-tls-1-0" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "persistentVolumeClaim": persistentVolumeClaim volumes are not allowed to be used fsGroup: Invalid value: []int64{103}: 103 is not an allowed group]

We must have it autodetect the group, or ideally simply remove it.

When we remove the fsGroup from the sts (for few minutes till opr puts it back), perms and all look like this:

ls -la /tls/..d*/*
-rw-r-----. 1 root 1000670000 1439 Mar 27 18:44 /tls/..data/tls.crt
-rw-r-----. 1 root 1000670000 1675 Mar 27 18:44 /tls/..data/tls.key

ls -la /tls/
total 0
drwxrwsrwt. 3 root 1000670000 120 Mar 27 18:44 .
drwxr-xr-x. 1 root root        62 Mar 27 18:44 ..
drwxr-sr-x. 2 root 1000670000  80 Mar 27 18:44 ..2020_03_27_18_44_06.799794438
lrwxrwxrwx. 1 root root        31 Mar 27 18:44 ..data -> ..2020_03_27_18_44_06.799794438
lrwxrwxrwx. 1 root root        14 Mar 27 18:44 tls.crt -> ..data/tls.crt
lrwxrwxrwx. 1 root root        14 Mar 27 18:44 tls.key -> ..data/tls.key

k8sres.go:

	// the gid of the postgres user in the default spilo image
	spiloPostgresGID = 103

and

		if effectiveFSGroup == nil {
			c.logger.Warnf("Setting the default FSGroup to satisfy the TLS configuration")
			fsGroup := int64(spiloPostgresGID)
			effectiveFSGroup = &fsGroup
		}

IMO adding tls should not automatically add functionality like fsGroup, when it was not specifically requested.
It's non-intuitive that adding tls will add securityContext.fsGroup, even when not requested.

This way, for these two reasons (not working on openshift as well as non-intuitive behaviour), I suggest to remove the above two section.

I have simulated removing that section in OpenShift, and all seems to work very nice (thanks for the code!)

If it's fsGroup is mandatory for k8s to fix perms, we should add it in the docs that user should add fsGroup, maybe change the docs:
from:

Before applying these changes, the operator must also be configured with the
`spilo_fsgroup` set to the GID matching the postgres user group. If the value
is not provided, the cluster will default to `103` which is the GID from the
default spilo image.

to something like:

Before applying these changes, in k8s (not in OpenShift) the operator must also 
be configured with the `spilo_fsgroup` set to the GID matching the postgres user 
group. If you don't know the value, use `103` which is the GID from the default 
spilo image.
If the value is not provided and you are in k8s (not OpenShift), the certificates 
will not have proper permissions and the server will not work as desired.

(ideally add more info in the last phrase).

@zimbatm @FxKu , WDYS?
If it makes it easier, here is a PR for solving it: #885

FxKu added a commit that referenced this pull request Apr 1, 2020
* solves #798 (comment)
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
YuriSalesquw pushed a commit to YuriSalesquw/postgres-operator that referenced this pull request Dec 26, 2022
* solves zalando/postgres-operator#798 (comment)
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants