etcd stuck in crashloopbackoff: permission denied to read config #1494

Closed
dajester2013 opened this issue Jul 30, 2021 · 14 comments

@dajester2013
dajester2013 commented Jul 30, 2021

Environmental Info:
RKE2 Version:
rke2 version v1.21.3+rke2r1 (2ed0b0d)
go version go1.16.6b7

Node(s) CPU architecture, OS, and Version:
I do not have access at the moment, but it is a VM running RHEL 7.9, FIPS mode.

Cluster Configuration:
3 servers, but this error is happening on the first server I'm trying to deploy to; I have not attempted the other servers yet.

Describe the bug:
etcd will not start with selinux: true and profile: cis-1.6. It gets stuck in a crash loop reporting "permission denied".

Steps To Reproduce:

  • Installed RKE2: I used the quick-start script curl -sfL https://get.rke2.io | sh -
  • Verified installed: rke2-selinux, rke2-common, rke2-server
    # yum list installed | grep rke2
    rke2-common.x86_64 ...
    rke2-selinux.noarch ...
    rke2-server.x86_64 ...
    
  • Copied sysctl config
    # sudo cp -f /usr/share/rke2/rke2-cis-sysctl.conf /etc/sysctl.d/60-rke2-cis.conf
    # sudo systemctl restart systemd-sysctl
    
  • Created etcd user+group
    # useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
    
  • Configuration for rke2:
    selinux: true
    profile: cis-1.6
    kube-apiserver-arg: tls-min-version=VersionTLS12
    kube-scheduler-arg: tls-min-version=VersionTLS12
    kubelet-arg: feature-gates=DynamicKubeletConfig=false
    disable: rke2-ingress-nginx
    
  • Start rke2 server
    # rke2 server
    ... many log statements until it gets stuck in a loop waiting for etcd to start ...
    

Expected behavior:
etcd and related containers start normally

Actual behavior:
etcd gets stuck in an error loop

Additional context / logs:
etcd container logs

# crictl logs <etcd container id>

[screenshot of the etcd container log output; image not preserved]
Verify etcd uid/gid:

# id etcd
uid=976(etcd) gid=976(etcd) groups=976(etcd)

etcd container security settings

# crictl inspect <etcd container id>

[screenshot of the crictl inspect output; image not preserved]

audit log search:
This one is interesting: it repeatedly shows these sync_file_range SYSCALL records with success=no.

# ausearch -x etcd
...
----
time->Thu Jul 29 14:50:32 2021
type=PROCTITLE msg=audit(1627584632.141:637479): proctitle=72756E6300696E6974
type=PATH msg=audit(1627584632.141:637479): item=2 name="/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.141:637479): item=1 name="/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.141:637479): item=0 name="/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1627584632.141:637479):  cwd="/"
type=SYSCALL msg=audit(1627584632.141:637479): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=198d9e0 a2=0 a3=0 items=3 ppid=31482 pid=2327 auid=1016 uid=976 gid=976 euid=976 suid=976 fsuid=976 egid=976 sgid=976 fsgid=976 tty=(none) ses=262 comm="etcd" exe="/usr/local/bin/etcd" subj=system_u:system_r:rke2_service_db_t:s0:c369,c904 key="access"
----
time->Thu Jul 29 14:50:32 2021
type=PROCTITLE msg=audit(1627584632.162:637480): proctitle=72756E6300696E6974
type=PATH msg=audit(1627584632.162:637480): item=2 name="/var/lib/rancher/rke2/server/db/etcd/config" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.162:637480): item=1 name="/var/lib/rancher/rke2/server/db/etcd/config" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.162:637480): item=0 name="/var/lib/rancher/rke2/server/db/etcd/config" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1627584632.162:637480):  cwd="/"
type=SYSCALL msg=audit(1627584632.162:637480): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c0000aede0 a2=80000 a3=0 items=3 ppid=31482 pid=2327 auid=1016 uid=976 gid=976 euid=976 suid=976 fsuid=976 egid=976 sgid=976 fsgid=976 tty=(none) ses=262 comm="etcd" exe="/usr/local/bin/etcd" subj=system_u:system_r:rke2_service_db_t:s0:c369,c904 key="access"
----
time->Thu Jul 29 14:50:32 2021
type=PROCTITLE msg=audit(1627584632.162:637481): proctitle=72756E6300696E6974
type=PATH msg=audit(1627584632.162:637481): item=2 name="/etc/localtime" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.162:637481): item=1 name="/etc/localtime" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1627584632.162:637481): item=0 name="/etc/localtime" objtype=UNKNOWN cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=CWD msg=audit(1627584632.162:637481):  cwd="/"
type=SYSCALL msg=audit(1627584632.162:637481): arch=c000003e syscall=257 success=no exit=-13 a0=ffffffffffffff9c a1=c000043760 a2=0 a3=0 items=3 ppid=31482 pid=2327 auid=1016 uid=976 gid=976 euid=976 suid=976 fsuid=976 egid=976 sgid=976 fsgid=976 tty=(none) ses=262 comm="etcd" exe="/usr/local/bin/etcd" subj=system_u:system_r:rke2_service_db_t:s0:c369,c904 key="access"
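
The raw records above are easier to compare once the fields are split out. What follows is a minimal Python sketch, not part of RKE2 or the audit tooling: the helper names `parse_fields`, `decode_proctitle`, and `exit_to_errno` are hypothetical, and the two sample lines are abridged copies of records from the output above. It splits a record into key/value pairs, decodes the hex-encoded proctitle, and translates the negative exit value into an errno name.

```python
import errno
import shlex


def parse_fields(record: str) -> dict:
    """Return the key=value pairs that follow the 'msg=audit(...):' prefix."""
    _, _, body = record.partition("): ")
    fields = {}
    # shlex keeps quoted values such as name="/etc/localtime" in one token
    # and strips the surrounding quotes.
    for token in shlex.split(body):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def decode_proctitle(hex_value: str) -> str:
    """Decode a hex-encoded proctitle; NUL bytes separate the arguments."""
    raw = bytes.fromhex(hex_value)
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace")


def exit_to_errno(exit_value: str) -> str:
    """Translate a negative syscall exit value into its errno name."""
    code = int(exit_value)
    if code >= 0:
        return "success"
    return errno.errorcode.get(-code, f"errno {-code}")


# Abridged copies of two records from the ausearch output above.
syscall_line = (
    "type=SYSCALL msg=audit(1627584632.162:637480): arch=c000003e syscall=257 "
    "success=no exit=-13 a0=ffffffffffffff9c a1=c0000aede0 a2=80000 a3=0 "
    'items=3 uid=976 gid=976 comm="etcd" exe="/usr/local/bin/etcd" key="access"'
)
path_line = (
    "type=PATH msg=audit(1627584632.162:637480): item=0 "
    'name="/var/lib/rancher/rke2/server/db/etcd/config" objtype=UNKNOWN'
)

syscall = parse_fields(syscall_line)
path = parse_fields(path_line)
print(decode_proctitle("72756E6300696E6974"))
print(syscall["comm"], exit_to_errno(syscall["exit"]), path["name"])
```

Read this way, each of the three events is the etcd process (uid 976) failing to open a path with EACCES: the transparent hugepage size file, the etcd config file, and /etc/localtime.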

@brandond added this to the v1.21.4+rke2r1 milestone on Jul 30, 2021

@dajester2013
Author

Additional note: if selinux is false AND profile is null, then rke2 starts with no issue. If either selinux is true OR profile is set (1.5 or 1.6), it gets stuck in this error loop.

@cjellick
Contributor

@briandowns to reproduce

@briandowns
Member

@dajester2013 I tried to reproduce what you've reported and am having some difficulty in finding similar behavior. Did you have RKE2 installed on the system previously? How did you enable FIPS mode (at install or after)? When was SELinux enabled?

@dajester2013
Author

@briandowns So, the VMs were provisioned by our customer's IT organization. They were FIPS-enabled / SELinux-enabled from the point of provisioning. I installed RKE2 on these freshly-provisioned systems. I do know they are also running McAfee on these systems, and there are other OS hardening guides they have applied.

@briandowns
Member

Would it be possible to get those additional hardening steps?

@briandowns
Member

@dajester2013 I'm closing this as I can't reproduce it in any form. Please feel free to reopen if you can acquire the additional hardening steps that have been applied to the nodes.

@dajester2013
Author

@briandowns

Sorry for the delay, I was assigned other work, but now am back on this. I updated to 1.21.4, but it is still not working.

I do not know specifically what hardening steps have been taken. I have followed the installation instructions exactly as documented, but I am still stuck with etcd in a crash loop. I have tried everything I know to do, including checking selinux contexts and file permissions. Everything seems to match with my CentOS deployment (which works). The only way I can get it to work is if I disable the selinux and profile options in the config.yaml. It is really odd to me that it only works if it runs without these security settings.

I can explain further if you want to take this offline, even see if it is possible for you to see what we are seeing.
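
A comparison of ownership, mode, and SELinux label between the working and failing hosts can be scripted. This is a minimal Python sketch under stated assumptions: `describe_path` is a hypothetical helper, the paths are taken from the audit log earlier in this issue, and the label is read from the `security.selinux` extended attribute, which is only available on Linux. Running it on both hosts and diffing the output is one way to spot a mismatch.

```python
import os
import pwd
import stat


def describe_path(path: str) -> dict:
    """Collect owner, mode string, and SELinux label for one path."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the number.
        owner = str(st.st_uid)
    try:
        raw = os.getxattr(path, "security.selinux")
        label = raw.rstrip(b"\x00").decode()
    except (OSError, AttributeError):
        # No label on this filesystem, or os.getxattr is unavailable here.
        label = None
    return {"owner": owner, "mode": stat.filemode(st.st_mode), "selinux": label}


# Paths that appear in the audit records; adjust for other installs.
for candidate in (
    "/var/lib/rancher/rke2/server/db/etcd",
    "/var/lib/rancher/rke2/server/db/etcd/config",
):
    if os.path.exists(candidate):
        print(candidate, describe_path(candidate))
```

The script needs enough privilege to traverse /var/lib/rancher, so on a hardened host it would typically be run as root.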

@briandowns
Member

I think we need to know what the additional hardening steps are that the customer is taking so we can possibly determine that gap.

@dajester2013
Author

dajester2013 commented Sep 24, 2021 via email

@dajester2013
Author

dajester2013 commented Sep 24, 2021

So apparently it is an SELinux issue. Placing the system into permissive mode allows everything to start, even with the selinux and profile options enabled. I will raise the issue over in the selinux project.
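
When repeating this test, it helps to record which mode the node was actually in for each attempt. A minimal Python sketch, assuming selinuxfs is mounted at its usual location `/sys/fs/selinux`; the function name `selinux_mode` is hypothetical, and a missing file is reported as "disabled" even though it could also mean selinuxfs is simply not mounted.

```python
from pathlib import Path


def selinux_mode(enforce_file: str = "/sys/fs/selinux/enforce") -> str:
    """Report enforcing, permissive, or disabled from the selinuxfs flag."""
    path = Path(enforce_file)
    if not path.exists():
        # Assumption: no selinuxfs means SELinux is not active on this host.
        return "disabled"
    # The kernel exposes "1" for enforcing and "0" for permissive.
    return "enforcing" if path.read_text().strip() == "1" else "permissive"


print(selinux_mode())
```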

@briandowns
Member

Can you link here to the new issue you raise?

@brandond
Member

FWIW, based on the audit logs, the denied syscall is openat, which shouldn't be something that's blocked by default on systems with selinux enforcing. I am guessing that part of the "STIG" hardening process adds additional syscalls to the restricted list.
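
The openat reading can be cross-checked from the `arch`, `syscall`, and `a2` fields of the audit records. This is a minimal Python sketch with loudly stated assumptions: the lookup tables are hand-copied subsets of the x86_64 syscall numbers and open flag values, and `describe_syscall` and `describe_open_flags` are hypothetical helpers. On the node itself, `ausearch -i` performs a similar interpretation.

```python
# arch=c000003e identifies x86_64. Hand-copied subset of its syscall table.
X86_64_SYSCALLS = {2: "open", 257: "openat", 277: "sync_file_range"}

# The low two bits of the open flags select the access mode.
ACCESS_MODES = {0: "O_RDONLY", 1: "O_WRONLY", 2: "O_RDWR"}

# Hand-copied subset of x86_64 open flag bits.
FLAG_BITS = {"O_CREAT": 0x40, "O_TRUNC": 0x200, "O_CLOEXEC": 0x80000}


def describe_syscall(arch: str, number: str) -> str:
    """Name a syscall number from an audit record (x86_64 only)."""
    if arch != "c000003e":
        return f"unknown arch {arch}"
    return X86_64_SYSCALLS.get(int(number), f"syscall {number}")


def describe_open_flags(a2_hex: str) -> str:
    """Name the open flags in the hex a2 field of an openat record."""
    flags = int(a2_hex, 16)
    names = [ACCESS_MODES.get(flags & 0b11, "?")]
    names += [name for name, bit in FLAG_BITS.items() if flags & bit]
    return "|".join(names)


# Values from the record for /var/lib/rancher/rke2/server/db/etcd/config.
print(describe_syscall("c000003e", "257"), describe_open_flags("80000"))
```

Under these tables, the failing call on the etcd config file is a read-only openat with the close-on-exec flag set, which is an ordinary way to open a file.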

@dweomer
Contributor

dweomer commented Sep 27, 2021

Possibly related to containers/container-selinux#147

@dajester2013
Author

I opened issue #4313, as I encountered it again on a freshly installed Rocky Linux 9 with the DoD STIG profile applied.
