acme resolver not working with persistence enabled #396

Closed
marcofranssen opened this issue Mar 22, 2021 · 28 comments
Labels
kind/question Further information is requested

Comments

@marcofranssen

The acme resolver isn't working with persistence enabled due to file permissions.

See below the log.

time="2021-03-22T11:03:24Z" level=error msg="The ACME resolver \"le\" is skipped from the resolvers list because: unable to get ACME account: permissions 660 for /data/acme.json are too open, please use 600"

This is an excerpt of my values.yml:

additionalArguments:
  - "--serverstransport.insecureskipverify=true"
  - "--certificatesresolvers.le.acme.dnschallenge=true"
  - "--certificatesresolvers.le.acme.email=me@tld.com"
  - "--certificatesresolvers.le.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
  - "--certificatesresolvers.le.acme.storage=/data/acme.json"

persistence:
  enabled: true

serviceAccountAnnotations:
  eks.amazonaws.com/role-arn: <<my_arn_with_a_policy_for_route53>>
@SantoDE SantoDE added the kind/question Further information is requested label Mar 22, 2021
@marcofranssen
Author

Seems the lego provider is nicely writing with 600 permissions.

https://github.com/traefik/traefik/blob/1b21f0723fd985fe14bd357b5b9691d780d8378a/pkg/provider/acme/local_store.go#L111

@marcofranssen
Author

marcofranssen commented Mar 24, 2021

I have now figured out that on initial creation the file has 600 permissions.

When the pod is deleted, the deployment automatically creates a new one and mounts the already existing volume containing the acme.json file.

Now the file has 660 permissions and the resolver breaks.

Using a shell on the pod itself, I can run chmod 600 /data/acme.json.

When the pod is replaced again, the permissions are back to 660.

I am now running the latest version of the Helm chart, 9.18.

As soon as I disable persistence, ACME works again. However, without persistence it is very easy to hit the Let's Encrypt API rate limits.
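
For anyone who wants to confirm the state of the file before and after a pod restart, the "too open" condition from the log can be approximated in plain shell. This is only a sketch: it assumes "too open" means that any group or other permission bit is set, and the function name `acme_mode_ok` is made up for illustration, not something Traefik provides.

```shell
#!/bin/sh
# Returns 0 when FILE has no group/other permission bits, 1 otherwise.
# Assumption: "too open" means any bit outside the owner triplet is set.
acme_mode_ok() {
  # ls -ld prints e.g. "-rw-rw---- ..."; characters 5-10 are the
  # group and other permission bits.
  perms=$(ls -ld -- "$1" | cut -c5-10)
  [ "$perms" = "------" ]
}

# Small demonstration on a throwaway file.
demo=$(mktemp)
chmod 660 "$demo"
acme_mode_ok "$demo" || echo "660: too open"   # prints "660: too open"
chmod 600 "$demo"
acme_mode_ok "$demo" && echo "600: ok"         # prints "600: ok"
rm -f "$demo"
```

Inside the pod the same function could be pointed at /data/acme.json to see which state the file is in after the volume is remounted.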

@marcofranssen
Author

I found a solution by adding an initContainer.

persistence:
  enabled: true

# this is required to ensure the acme.json has the required 600 permissions when remounting the volume
deployment:
  initContainers:
    - name: fix-permissions-acme-json
      image: alpine:3.12.4
      command:
        - chmod
        - "600"
        - /data/acme.json
      volumeMounts:
        - name: data
          mountPath: /data

Would a PR be accepted that adds this initContainer by default when persistence is enabled?

@jakubhajek
Contributor

jakubhajek commented Mar 25, 2021

Hello @marcofranssen

As a workaround, you can try enabling the initContainers section:

 initContainers:
    - name: volume-permissions
      image: busybox:1.31.1
      command: ["sh", "-c", "chmod -Rv 600 /data/*"]
      volumeMounts:
        - name: data
          mountPath: /data

Each time a pod is recycled, the correct permissions (600) will be set.

@marcofranssen
Author

@jakubhajek thanks, I found the solution 2 minutes before you answered here :)

@sjoerd-dijkstra

sjoerd-dijkstra commented Mar 26, 2021

Experiencing the same problem here, posted it just a few minutes before in the wrong repo 👆.

I can confirm the workaround works 👍

@Haribo112

Neither of the proposed solutions actually works. @marcofranssen's solution leads to the error "no such file: /data/acme.json", and @jakubhajek's solution leads to "chmod: /data/lost+found: Operation not permitted".

@mloiseleur
Contributor

@Haribo112 See https://github.com/traefik/traefik-helm-chart/blob/master/EXAMPLES.md#use-traefik-native-lets-encrypt-integration-without-cert-manager for a detailed working example.

@Haribo112

When I follow those exact instructions I end up with a pod that cannot start, because the init container keeps crashing. The logs for the init container reveal:

touch: /data/acme.json: Permission denied
chmod: /data/acme.json: No such file or directory

Also, other issues on this repo and elsewhere seem to indicate that one should never touch or otherwise modify the acme.json file, because Traefik cannot handle the file being empty.
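
The two failure modes reported above (chmod on a file that does not exist yet, and a recursive chmod tripping over lost+found) suggest an init command that only touches acme.json when it is already there. A minimal sketch follows. The function name `fix_acme_mode` is hypothetical, and it assumes the init container runs as the file's owner or as root, since otherwise chmod itself is refused.

```shell
#!/bin/sh
# Tighten the mode of DIR/acme.json only when the file already exists.
# It never creates the file (so Traefik is not handed an empty acme.json)
# and never recurses, so lost+found is left alone.
fix_acme_mode() {
  target="$1/acme.json"
  if [ -f "$target" ]; then
    chmod 600 "$target"
  else
    # Nothing to fix on the first start; report it and succeed.
    echo "skip: $target does not exist yet" >&2
  fi
}
```

In an init container this would be called as `fix_acme_mode /data`. On the very first start it is a no-op, which leaves file creation to Traefik.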

@huedaya

huedaya commented Jul 2, 2023

Experiencing the same issue as @Haribo112. The odd thing is that /data/acme.json never exists; there is only a lost+found dir instead.

I also tried creating Persistent Volumes with ReadOnlyMany access modes on /data/ssl/, with this in the Traefik deployment:

# Persistent Storage
persistence:
  enabled: true
  name: ssl-certs2
  size: 1Gi
  path: /data/ssl

When I run without initContainers, /data/ssl is writable from inside the pod, but the logs say the file or directory was not found.

Then I tested creating acme.json and enabling initContainers, but that deleted all the files and left only a lost+found directory owned by root. The logs say:

... unable to get ACME account: open /data/ssl/acme-cloudflare.json: permission denied

I have tried many things, including putting the file in /tmp, which is technically writable, but nothing worked. It seems something changed with the Kubernetes (1.27.2) upgrade (?)

Traefik: 2.10.1

@monnierant

Hi.

I experienced the same issue as @huedaya and @Haribo112.

I found where the issue is, but I'm not able to fix it right now.

TL;DR: the /data folder (/ssl-certs for me) is owned by root:root and is not writable by other users, but the busybox and traefik containers run as non-root, so they don't have permission to create or access the JSON files.

Here are the steps I took to investigate.

I ran the following commands in the busybox init container:

$ id root
uid=0(root) gid=0(root) groups=0(root),10(wheel)
$ ls -l / | grep ssl-certs
drwxr-xr-x    3 root     root          4096 Jul  8 14:47 ssl-certs
$ ls -l /ssl-certs
drwx------    2 root     root         16384 Jul  8 14:47 lost+found
$ whoami 
whoami: unknown uid 65532
$ touch /ssl-certs/acme-staging.json
touch: /ssl-certs/acme-staging.json: Permission denied
$ chmod -v 600 /ssl-certs/acme-staging.json
chmod: /ssl-certs/acme-staging.json: No such file or directory

We can see that the issue comes from the running user not having permission to make changes in the /ssl-certs folder.

I tried forcing the init container to run as root:

deployment:
  initContainers:
    # The "volume-permissions" init container is required if you run into permission issues.
    # Related issue: https://github.com/containous/traefik/issues/6972
    - name: volume-permissions
      image: busybox:1.31.1
      command: ["sh", "-c", "id root;ls -l /; ls -l /ssl-certs; whoami; touch /ssl-certs/acme-staging.json; chmod -v 600 /ssl-certs/acme-staging.json; touch /ssl-certs/acme-production.json; chmod -v 600 /ssl-certs/acme-production.json"]
      securityContext:
        runAsUser: 0
        runAsNonRoot: false
        allowPrivilegeEscalation: false
      volumeMounts:
        - name: ssl-certs
          mountPath: /ssl-certs

Now the chmod commands run successfully in the init container:

mode of '/ssl-certs/acme-staging.json' changed to 0600 (rw-------)  
mode of '/ssl-certs/acme-production.json' changed to 0600 (rw-------)

But now the problem has moved to the traefik container:

time="2023-07-09T07:17:47Z" level=error msg="The ACME resolver \"staging\" is skipped from the resolvers list because: unable to get ACME account: open /ssl-certs/acme-staging.json: permission denied"  
time="2023-07-09T07:17:47Z" level=error msg="The ACME resolver \"production\" is skipped from the resolvers list because: unable to get ACME account: open /ssl-certs/acme-production.json: permission denied"

@icodeforyou-dot-net

I am just now experiencing the same issue as @huedaya and @Haribo112 and @monnierant before me. Trying to run the initContainer fails because touch does not have permission to create the file which subsequently does not exist.

  1. Please reopen this issue @SantoDE until a working solution is found.

  2. You made some progress @monnierant ... did you find a solution? I assume the issue comes from the fact that the uid differs between the traefik and the busybox containers after creating the file and setting permissions. So when we set the file permission to 0600 as uid 0, the file is not accessible to the user with uid 65532, because that user doesn't own the file. Does that sound right as a problem description?
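
The ownership mismatch described in that problem statement can be checked directly by comparing the file's numeric owner with the uid the Traefik container runs as. The sketch below uses two helper names, `owner_uid` and `owned_by`, that are invented for this illustration; it only compares uids and does not model group access or root's override.

```shell
#!/bin/sh
# Print the numeric owner uid of a path (third field of `ls -ldn`).
owner_uid() {
  ls -ldn -- "$1" | awk '{print $3}'
}

# Succeeds only if the given uid owns FILE. With mode 0600 the owner is
# the only non-root uid that can open the file.
# Usage: owned_by UID FILE
owned_by() {
  [ "$(owner_uid "$2")" = "$1" ]
}
```

For example, `owned_by 65532 /data/acme.json || echo "traefik cannot open this file"` run from an init container would flag a file that was created by root and never handed over with chown.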

@icodeforyou-dot-net

I just decided to do something else, I am not sure if it solved the issue though because I am not yet trying to actually save a certificate in the file. But the initContainer appears to be working now.

  1. I decided to use the traefik container instead of busybox under the assumption that it will have the appropriate user with uid 65532.

  2. Then I create the file as root, but change the ownership to uid 65532.

  initContainers:
    - name: volume-permissions
      image: traefik:v2.10.4
      command:
        [
          "sh",
          "-c",
          "touch /data/acme.json; chown 65532 /data/acme.json; chmod -v 600 /data/acme.json",
        ]
      securityContext:
        runAsNonRoot: false
        runAsGroup: 0
        runAsUser: 0
      volumeMounts:
        - name: traefik-data
          mountPath: /data

The initContainer no longer crashes at least.

I will let people know once I am actually able to verify that I can store my certs this way.

@huedaya

huedaya commented Aug 2, 2023

Hi @icodeforyou-dot-net, just in case you fail to solve the issue: I ended up using cert-manager instead of Traefik's built-in features.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-cluster-issuer
  namespace: cert-manager
spec:
  acme:
    email: email@example.com
    # Prod
    # server: https://acme-v02.api.letsencrypt.org/directory
    # We use the staging server here for testing to avoid hitting
    # the production rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # If the secret does not exist, a new account is registered and its key is stored here
      name: letsencrypt-acc-key
    solvers:
      - http01:
          # The ingressClass used to create the necessary ingress routes
          ingress:
            class: traefik

@icodeforyou-dot-net

icodeforyou-dot-net commented Aug 2, 2023

@huedaya

I am not totally sure because my experience with Traefik is still limited, but at least I am past the error from inside traefik container. I'd say it stores the cert now. I also killed the pod and it is still going.

I'd say it is working now.

Edit: My domain appears to have a working wildcard cert now. So I am even more inclined to say it is working now 😄

@tssgery

tssgery commented Aug 5, 2023

In reply to @icodeforyou-dot-net's init container workaround above (traefik:v2.10.4 image, run as root, chown 65532 on /data/acme.json):

This seems to be working for me, thanks for the suggestion

@icodeforyou-dot-net

I might create a pull request in a week or two to update the info and comments in the Helm chart itself. The old solution, which isn't working for everyone, is still in there. That way everyone will know in the future how to get around this. I just have to do some more tests.

@rsinha29

I am having the same issue. Is there any update on this one?

@icodeforyou-dot-net

@rsinha29 which is "the same issue"? Did you try my workaround? I did not have time to work on the helm chart itself so far.

@rsinha29

@icodeforyou-dot-net, I forgot to mention here that I used your workaround and it worked.
Thanks again!!

@danktankk

In reply to @icodeforyou-dot-net's init container workaround above (traefik:v2.10.4 image, run as root, chown 65532 on /data/acme.json):

Is this the answer to the issue currently; the "workaround?" It seems like a long time to have this issue, but it definitely still exists and has wasted a lot of my day so far.

@icodeforyou-dot-net

icodeforyou-dot-net commented Oct 22, 2023

Well it is my answer at least 😄

It is working, right? I believe it should technically also work with the busybox container, because chown with a numeric user ID changes the file's ownership regardless of whether that user exists in the container. I just used the traefik container because that one should actually have a user with ID 65532.

But I am using should here for a reason. I didn't actually check any of that.

I don't really know when I have the leisure to open a pull request on this one. I would need to test this with the busybox container.

@carpodiem

carpodiem commented Nov 1, 2023

In reply to @icodeforyou-dot-net's comment above that the workaround should also work with the busybox image:

I have tried it with busybox and it worked fine too. Probably you will want to change permissions for the whole directory (not only for acme.json)

✗ kubectl logs traefik-dsfkj874-hfjei -n kube-system -c volume-permissions
mode of '/data/acme.json' changed to 0600 (rw-------)
 initContainers:
    - name: volume-permissions
      image: busybox:latest
      command:
        [
          "sh",
          "-c",
          "touch /data/acme.json; chown 65532 /data/acme.json; chmod -v 600 /data/acme.json",
        ]
      securityContext:
        runAsNonRoot: false
        runAsGroup: 0
        runAsUser: 0
      volumeMounts:
        - name: data
          mountPath: /data

@badrgou

badrgou commented Dec 5, 2023

In reply to @icodeforyou-dot-net's init container workaround above (traefik:v2.10.4 image, run as root, chown 65532 on /data/acme.json):

it works perfectly when using traefik image itself in the initcontainers

  • k8s on top of Openstack
  • Cinder CSI as PV

@life5ign

life5ign commented Jan 1, 2024

TL;DR: the proposed workaround doesn't work in chart version 26.0.0 (possibly others too), since setting uid 0 violates the pod security context.

Per @icodeforyou-dot-net's workaround above (the volume-permissions init container running with runAsUser: 0 and runAsNonRoot: false):

This doesn't work for me, since the default chart values (which I haven't changed), for chart version 26.0.0, specify a pod security context, which makes the uid of 0 not allowed for the volume-permissions init container:

The values:

podSecurityContext:
  # /!\ When setting fsGroup, Kubernetes will recursively change ownership and
  # permissions for the contents of each volume to match the fsGroup. This can
  # be an issue when storing sensitive content like TLS Certificates /!\
  # fsGroup: 65532
  # -- Specifies the policy for changing ownership and permissions of volume contents to match the fsGroup.
  fsGroupChangePolicy: "OnRootMismatch"
  # -- The ID of the group for all containers in the pod to run as.
  runAsGroup: 65532
  # -- Specifies whether the containers should run as a non-root user.
  runAsNonRoot: true
  # -- The ID of the user for all containers in the pod to run as.
  runAsUser: 65532

The error:
k describe -n traefik pods <your_pod_id>

Normal Pulled 68s kubelet Successfully pulled image "busybox:latest" in 551ms (551ms including waiting)
Warning Failed 56s (x8 over 2m19s) kubelet Error: container's runAsUser breaks non-root policy (pod: "traefik-blah_traefik(blah)", container: volume-permissions)
Normal Pulled 56s kubelet Successfully pulled image "busybox:latest" in 544ms (544ms including waiting)
Normal Pulling 41s (x9 over 2m20s) kubelet Pulling image "busybox:latest"

@life5ign

life5ign commented Jan 1, 2024

Per my last comment, I just figured out the solution: fsGroup was commented out. It needs to be applied so that Kubernetes changes the ownership and permissions of the volume contents to match the specified ID before we try to use an initContainer running with that ID (with this in place, there is no need to run as root with ID 0 anymore). The final config that worked was:

uncomment fsGroup

podSecurityContext:
  # /!\ When setting fsGroup, Kubernetes will recursively change ownership and
  # permissions for the contents of each volume to match the fsGroup. This can
  # be an issue when storing sensitive content like TLS Certificates /!\
  fsGroup: 65532
  # -- Specifies the policy for changing ownership and permissions of volume contents to match the fsGroup.
  fsGroupChangePolicy: "OnRootMismatch"
  # -- The ID of the group for all containers in the pod to run as.
  runAsGroup: 65532
  # -- Specifies whether the containers should run as a non-root user.
  runAsNonRoot: true
  # -- The ID of the user for all containers in the pod to run as.
  runAsUser: 65532

don't use id 0 anymore; just use 65532

initContainers:
  # The "volume-permissions" init container is required if you run into permission issues.
  # Related issue: https://github.com/traefik/traefik-helm-chart/issues/396
  - name: volume-permissions
    image: busybox:latest
    command: ["sh", "-c", "touch /data/acme.json; chmod -v 600 /data/acme.json"]
    securityContext:
      runAsNonRoot: true
      runAsGroup: 65532
      runAsUser: 65532
    volumeMounts:
      - name: data
        mountPath: /data

Then, we can see success:

 k logs -n traefik pods/<your_pod> -c volume-permissions

outputs

mode of '/data/acme.json' changed to 0600 (rw-------)

More on pod security contexts and fsGroup: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods

action/PR follow-up

If a PR were submitted to add a comment right above the initContainer, such as "uncomment podSecurityContext.fsGroup above and make sure it matches the runAs.. users here", this would solve the problem for people.
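
The fsGroup change described above could also be kept in a small override file rather than edited into the chart's default values. The sketch below just writes such a file. The function name `write_fsgroup_values` and the output path are illustrative; the keys and the 65532 IDs are the ones quoted from the chart values in this thread.

```shell
#!/bin/sh
# Write a values override that enables fsGroup alongside the
# chart's default non-root IDs (all values taken from the thread above).
write_fsgroup_values() {
  cat > "$1" <<'EOF'
podSecurityContext:
  fsGroup: 65532
  fsGroupChangePolicy: "OnRootMismatch"
  runAsGroup: 65532
  runAsNonRoot: true
  runAsUser: 65532
EOF
}

# Example: write the override into the current directory.
write_fsgroup_values ./traefik-fsgroup-values.yaml
```

The resulting file can be passed to Helm with `-f` next to the existing values file. Because the chart's own comment warns that fsGroup makes Kubernetes change ownership and permissions on the volume contents, it is worth checking the mode of acme.json again after the next pod restart.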

@igor9silva

I could not get @life5ign's solution to work for me (chart 26.0.0), but I got it working with the configuration below.

For reference, I'm doing this for EKS Fargate:

traefik-values.yaml

persistence:
  # -- Enable persistence using Persistent Volume Claims
  # ref: http://kubernetes.io/docs/user-guide/persistent-volumes/
  # It can be used to store TLS certificates, see `storage` in certResolvers
  enabled: true
  size: 128Mi
  storageClass: efs
  accessMode: ReadWriteOnce
...
deployment:
   ...
  initContainers:
    # The "volume-permissions" init container is required if you run into permission issues.
    # Related issue: https://github.com/traefik/traefik-helm-chart/issues/396
    - name: volume-permissions
      image: busybox:1.36
      command:
        ["sh", "-c", "touch /data/acme.json; chmod -v 600 /data/acme.json"]
      securityContext:
        runAsNonRoot: false
        runAsGroup: 0
        runAsUser: 0
      volumeMounts:
        - name: data
          mountPath: /data
...
podSecurityContext:
  # /!\ When setting fsGroup, Kubernetes will recursively change ownership and
  # permissions for the contents of each volume to match the fsGroup. This can
  # be an issue when storing sensitive content like TLS Certificates /!\
  # fsGroup: 65532
  # -- Specifies the policy for changing ownership and permissions of volume contents to match the fsGroup.
  fsGroupChangePolicy: "OnRootMismatch"
  # -- The ID of the group for all containers in the pod to run as.
  runAsGroup: 65532
  # -- Specifies whether the containers should run as a non-root user.
  runAsNonRoot: true
  # -- The ID of the user for all containers in the pod to run as.
  runAsUser: 65532

file-system.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs
provisioner: efs.csi.aws.com
---
# Traefik will use this to store TLS certificates
apiVersion: v1
kind: PersistentVolume
metadata:
  name: traefik-efs
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce # Many is only available to dynamic volumes
  persistentVolumeReclaimPolicy: Retain # so the disk is not deleted after pod deletion
  storageClassName: efs
  csi:
    driver: efs.csi.aws.com
    # As of today, Amazon EFS CSI driver doesn't support dynamic provisioning on Fargate,
    # so we're manually deploying an EFS instance and hardcoding it here 😭
    #
    # Ref: https://github.com/kubernetes-sigs/aws-efs-csi-driver
    volumeHandle: fs-0cbf8b13198606563

@DrummyFloyd

DrummyFloyd commented Apr 4, 2024

This issue still seems to be present in the current Helm charts 26.1.0 and 27.0.0.

Never mind, I removed the initContainers part and just use podSecurityContext.fsGroup.
