
WIP: Operator-managed rabbitmq:3.8.21-management falls into CrashLoop on Arch Linux #959

Closed
ikavgo opened this issue Feb 14, 2022 · 13 comments
Labels: blocked (Waiting on other work or on 3rd party), bug (Something isn't working), stale (Issue or PR with long period of inactivity)

Comments


ikavgo commented Feb 14, 2022

Describe the bug

As part of ongoing work on knative-extensions/eventing-rabbitmq#415 I decided to install the Knative eventing RabbitMQ parts on my new Arch Linux machine. However, I wasn't able to get a single-node, operator-managed RabbitMQ cluster running. The rabbitmq node always goes into CrashLoop.

To Reproduce

Steps to reproduce the behavior:

  1. Install Arch
  2. Follow the Knative eventing RabbitMQ testing guide.
  3. Create a single-node RabbitmqCluster:
    kubectl apply -f - << EOF
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: rabbitmq
      namespace: default
    spec:
      replicas: 1
    EOF
  4. Watch the rabbitmq-server-0 pod go into CrashLoop with nothing more than the following output (a sketch of commands to pull it back out follows this list):
    WARNING: 'docker-entrypoint.sh' generated/modified the RabbitMQ configuration file, which will no longer happen in 3.9 and later! (https://github.com/docker-library/rabbitmq/pull/424)
    
    Generated end result, for reference:
    ------------------------------------
    loopback_users.guest = false
    listeners.tcp.default = 5672
    ------------------------------------
    Configuring logger redirection
    20:54:21.113 [warning] cluster_formation.randomized_startup_delay_range.min and 
    cluster_formation.randomized_startup_delay_range.max are deprecated
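
To retrieve the same information from the cluster, commands along these lines could be used (a sketch; pod and container names are the operator defaults seen in this issue):

    # container log of the crashing pod (add --previous after a restart)
    kubectl logs rabbitmq-server-0 -c rabbitmq
    # termination state and last exit code of the rabbitmq container
    kubectl describe pod rabbitmq-server-0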
    


Expected behavior
RabbitMQ node should start.


Version and environment information

  • RabbitMQ: rabbitmq:3.8.21-management
  • RabbitMQ Cluster Operator: rabbitmqoperator/cluster-operator:1.11.1
  • Kubernetes: v1.21.1
  • Cloud provider or hardware configuration: kind

Additional context

Operator logs:

{"level":"info","ts":1644872348.3994126,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-nodes of Type *v1.Service","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}
{"level":"info","ts":1644872348.4101553,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}
{"level":"info","ts":1644872363.464903,"logger":"controller.rabbitmqcluster","msg":"Start reconciling","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default","spec":"{\"replicas\":1,\"image\":\"rabbitmq:3.8.21-management\",\"service\":{\"type\":\"ClusterIP\"},\"persistence\":{\"storage\":\"10Gi\"},\"resources\":{\"limits\":{\"cpu\":\"2\",\"memory\":\"2Gi\"},\"requests\":{\"cpu\":\"1\",\"memory\":\"2Gi\"}},\"rabbitmq\":{},\"tls\":{},\"override\":{},\"terminationGracePeriodSeconds\":604800,\"secretBackend\":{}}"}
{"level":"info","ts":1644872363.478291,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-nodes of Type *v1.Service","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}
{"level":"info","ts":1644872363.5235639,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}
{"level":"info","ts":1644872378.5512302,"logger":"controller.rabbitmqcluster","msg":"Start reconciling","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default","spec":"{\"replicas\":1,\"image\":\"rabbitmq:3.8.21-management\",\"service\":{\"type\":\"ClusterIP\"},\"persistence\":{\"storage\":\"10Gi\"},\"resources\":{\"limits\":{\"cpu\":\"2\",\"memory\":\"2Gi\"},\"requests\":{\"cpu\":\"1\",\"memory\":\"2Gi\"}},\"rabbitmq\":{},\"tls\":{},\"override\":{},\"terminationGracePeriodSeconds\":604800,\"secretBackend\":{}}"}
{"level":"info","ts":1644872378.5839353,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-nodes of Type *v1.Service","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}
{"level":"info","ts":1644872378.6253545,"logger":"controller.rabbitmqcluster","msg":"updated resource rabbitmq-server of Type *v1.StatefulSet","reconciler group":"rabbitmq.com","reconciler kind":"RabbitmqCluster","name":"rabbitmq","namespace":"default"}

kubelet logs:

Feb 14 13:12:56 eventing-rabbitmq-e2e-worker2 kubelet[169]: E0214 13:12:56.991194     169 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"rabbitmq\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=rabbitmq pod=rabbitmq-server-0_default(f96b1841-da9b-4f18-95de-888cc18dab1a)\"" pod="default/rabbitmq-server-0" podUID=f96b1841-da9b-4f18-95de-888cc18dab1a
Feb 14 13:13:08 eventing-rabbitmq-e2e-worker2 kubelet[169]: I0214 13:13:08.991341     169 scope.go:111] "RemoveContainer" containerID="eea091a7b1fed806055dd7fc0482860360f56dd880ba07a91b41810eef045ba4"
Feb 14 13:13:08 eventing-rabbitmq-e2e-worker2 kubelet[169]: E0214 13:13:08.991731     169 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"rabbitmq\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=rabbitmq pod=rabbitmq-server-0_default(f96b1841-da9b-4f18-95de-888cc18dab1a)\"" pod="default/rabbitmq-server-0" podUID=f96b1841-da9b-4f18-95de-888cc18dab1a

Here is what has been found with the help of my great colleagues @Zerpet, @ChunyiLyu and @coro:

docker run -t -i --name bunny rabbitmq:3.8.21-management

works!

kubectl create deployment --image rabbitmq:3.8.21-management connor # ;-)

also works!

When we tried to log in to the container (by replacing the entrypoint with a sleep 1000000 command), we found out that the first run of
rabbitmq-server just produces the logs from above and exits with code 0. However, all subsequent runs of rabbitmq-server or any rabbitmq* command print "Killed" and exit with code 137 (OOM 🤤).

ikavgo added the bug (Something isn't working) label on Feb 14, 2022

ikavgo commented Feb 15, 2022

dmesg -He from the kind node:

[Feb15 13:35] kauditd_printk_skb: 30 callbacks suppressed
[  +0.000004] audit: type=1325 audit(1644932130.712:4621): table=filter family=2 entries=35 op=xt_replace pid=1914848 comm="iptables-restor"
[  +0.000071] audit: type=1300 audit(1644932130.712:4621): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=55ef68dba190 items=0 ppid=537141 pid=1914848 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables-restor" exe="/usr/sbin/xtables-legacy-multi" key=(null)
[  +0.000003] audit: type=1327 audit(1644932130.712:4621): proctitle=69707461626C65732D726573746F7265002D770035002D5700313030303030002D2D6E6F666C757368002D2D636F756E74657273
[  +0.001003] audit: type=1325 audit(1644932130.712:4622): table=filter family=2 entries=35 op=xt_replace pid=1914850 comm="iptables-restor"
[  +0.000147] audit: type=1300 audit(1644932130.712:4622): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=55ae12d16190 items=0 ppid=537175 pid=1914850 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables-restor" exe="/usr/sbin/xtables-legacy-multi" key=(null)
[  +0.000004] audit: type=1327 audit(1644932130.712:4622): proctitle=69707461626C65732D726573746F7265002D770035002D5700313030303030002D2D6E6F666C757368002D2D636F756E74657273
[  +0.000766] audit: type=1325 audit(1644932130.712:4623): table=nat family=2 entries=158 op=xt_replace pid=1914848 comm="iptables-restor"
[  +0.000146] audit: type=1300 audit(1644932130.712:4623): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=0 a2=40 a3=55ef68dd4d40 items=0 ppid=537141 pid=1914848 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iptables-restor" exe="/usr/sbin/xtables-legacy-multi" key=(null)
[  +0.000004] audit: type=1327 audit(1644932130.712:4623): proctitle=69707461626C65732D726573746F7265002D770035002D5700313030303030002D2D6E6F666C757368002D2D636F756E74657273
[  +0.001012] audit: type=1325 audit(1644932130.712:4624): table=nat family=2 entries=158 op=xt_replace pid=1914850 comm="iptables-restor"
[ +15.141296] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  +0.073439] kauditd_printk_skb: 10 callbacks suppressed
[  +0.000003] audit: type=1334 audit(1644932145.928:4629): prog-id=1323 op=LOAD
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.7 us,  3.4 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.8 hi,  0.5 si,  0.0 st
MiB Mem :  32351.5 total,   9569.0 free,  11773.8 used,  11008.8 buff/cache
[  +0.000005] audit: type=1300 audit(1644932145.928:4629): arch=c000003e syscall=321 success=yes exit=13 a0=5 a1=c000240290 a2=78 a3=0 items=0 ppid=1915366 pid=1915378 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
[  +0.000002] audit: type=1327 audit(1644932145.928:4629): proctitle=72756E63002D2D726F6F74002F72756E2F636F6E7461696E6572642F72756E632F6B38732E696F002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6B38732E696F2F3761326366363737333265353035353334636566323530653430363131
[  +0.000198] audit: type=1334 audit(1644932145.928:4630): prog-id=1324 op=LOAD
[  +0.000004] audit: type=1300 audit(1644932145.928:4630): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=c000240038 a2=78 a3=0 items=0 ppid=1915366 pid=1915378 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
[  +0.000002] audit: type=1327 audit(1644932145.928:4630): proctitle=72756E63002D2D726F6F74002F72756E2F636F6E7461696E6572642F72756E632F6B38732E696F002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6B38732E696F2F3761326366363737333265353035353334636566323530653430363131
[  +0.000003] audit: type=1334 audit(1644932145.928:4631): prog-id=0 op=UNLOAD
[  +0.098020] audit: type=1334 audit(1644932146.025:4632): prog-id=1325 op=LOAD
[  +0.000006] audit: type=1300 audit(1644932146.025:4632): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=c0001c4290 a2=78 a3=0 items=0 ppid=1915366 pid=1915414 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/local/sbin/runc" key=(null)
[  +0.000002] audit: type=1327 audit(1644932146.025:4632): proctitle=72756E63002D2D726F6F74002F72756E2F636F6E7461696E6572642F72756E632F6B38732E696F002D2D6C6F67002F72756E2F636F6E7461696E6572642F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6B38732E696F2F3461623330613266333835353539633061626235333039626435353838
[Feb15 13:37] beam.smp invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=937
[  +0.000006] CPU: 2 PID: 1917262 Comm: beam.smp Not tainted 5.16.8-arch1-1 #1 c249e266f58b465060ccebc4c6a7c18b3fb76f91
[  +0.000002] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18452719.B64.2108091906 08/09/2021
[  +0.000001] Call Trace:
[  +0.000002]  <TASK>
[  +0.000003]  dump_stack_lvl+0x48/0x5e
[  +0.000004]  dump_header+0x4a/0x1fd
[  +0.000002]  oom_kill_process.cold+0xb/0x10
[  +0.000002]  out_of_memory+0x234/0x4e0
[  +0.000003]  mem_cgroup_out_of_memory+0x136/0x150
[  +0.000003]  try_charge_memcg+0x74a/0x810
[  +0.000003]  charge_memcg+0x32/0xa0
[  +0.000002]  __mem_cgroup_charge+0x29/0x80
[  +0.000002]  __handle_mm_fault+0xb13/0x1540
[  +0.000003]  handle_mm_fault+0xb2/0x280
[  +0.000002]  do_user_addr_fault+0x1ba/0x690
[  +0.000002]  exc_page_fault+0x72/0x170
[  +0.000003]  ? asm_exc_page_fault+0x8/0x30
[  +0.000002]  asm_exc_page_fault+0x1e/0x30
[  +0.000002] RIP: 0033:0x563528cf1d50
[  +0.000002] Code: ef 48 89 54 24 10 e8 2f e9 e3 ff 48 8b 54 24 10 48 89 85 c0 00 00 00 4c 8d 2c 10 4c 39 e8 73 1a 66 2e 0f 1f 84 00 00 00 00 00 <48> c7 00 00 00 00 00 48 83 c0 08 49 39 c5 77 f0 8b 44 24 18 bf 07
[  +0.000001] RSP: 002b:00007fff220c90a0 EFLAGS: 00010206
[  +0.000002] RAX: 00007fca81d21000 RBX: 0000000008000000 RCX: 0000000000000004
[  +0.000001] RDX: 0000000040000000 RSI: 0000000000000000 RDI: 0000563528ecbea8
[  +0.000001] RBP: 0000563528ee2340 R08: 0000000000000000 R09: 0000563528b438f4
[  +0.000001] R10: 7ffffffffffffff8 R11: 000000001ffffe08 R12: 0000563528e23ce0
[  +0.000000] R13: 00007fcaa7ffc1c0 R14: 000000000000027a R15: 0000000008000000
[  +0.000004]  </TASK>
[  +0.000001] memory: usage 2097152kB, limit 2097152kB, failcnt 140
[  +0.000001] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[  +0.000000] Memory cgroup stats for /system.slice/docker-d4d721e87ad36ad5c0da4f079df1a0960ff579e2de78a43b4682ad7ed4f378c5.scope/kubelet/kubepods/burstable/pod48c84eb6-8146-40cf-94df-4bfc45a1a54a:
[  +0.000010] anon 2129944576
              file 8679424
              kernel_stack 540672
              pagetables 4751360
              percpu 1150464
              sock 0
              shmem 8679424
              file_mapped 8679424
              file_dirty 0
              file_writeback 0
              swapcached 0
              anon_thp 0
              file_thp 0
              shmem_thp 0
              inactive_anon 2138173440
              active_anon 438272
              inactive_file 0
              active_file 0
              unevictable 0
              slab_reclaimable 1609688
              slab_unreclaimable 626552
              slab 2236240
              workingset_refault_anon 0
              workingset_refault_file 0
              workingset_activate_anon 0
              workingset_activate_file 0
              workingset_restore_anon 0
              workingset_restore_file 0
              workingset_nodereclaim 0
              pgfault 535091
              pgmajfault 0
              pgrefill 13
              pgscan 474
              pgsteal 276
              pgactivate 100
              pgdeactivate 13
              pglazyfree 440
              pglazyfreed 260
              thp_fault_alloc 0
              thp_collapse_alloc 0
[  +0.000001] Tasks state (memory values in pages):
[  +0.000001] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  +0.000001] [1915388]   999 1915388      243        1    28672        0          -998 pause
[  +0.000003] [1915559]   999 1915559     1067      148    45056        0           937 sleep
[  +0.000001] [1916879]   999 1916879     1500      993    49152        0           937 bash
[  +0.000002] [1917148]   999 1917148      654      416    45056        0           937 rabbitmq-server
[  +0.000001] [1917152]   999 1917152  1173657   418599  3596288        0           937 beam.smp
[  +0.000002] [1917165]   999 1917165      626      131    40960        0           937 erl_child_setup
[  +0.000001] [1917262]   999 1917262   817302   109409   974848        0           937 beam.smp
[  +0.000002] [1917269]   999 1917269      973       23    45056        0           937 epmd
[  +0.000001] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=dadd4ae08b223bbc88cad329d723057c31b9a9696f9d5d8466398f78f02b0eec,mems_allowed=0,oom_memcg=/system.slice/docker-d4d721e87ad36ad5c0da4f079df1a0960ff579e2de78a43b4682ad7ed4f378c5.scope/kubelet/kubepods/burstable/pod48c84eb6-8146-40cf-94df-4bfc45a1a54a,task_memcg=/system.slice/docker-d4d721e87ad36ad5c0da4f079df1a0960ff579e2de78a43b4682ad7ed4f378c5.scope/kubelet/kubepods/burstable/pod48c84eb6-8146-40cf-94df-4bfc45a1a54a/dadd4ae08b223bbc88cad329d723057c31b9a9696f9d5d8466398f78f02b0eec,task=beam.smp,pid=1917152,uid=999
[  +0.000078] Memory cgroup out of memory: Killed process 1917152 (beam.smp) total-vm:4694628kB, anon-rss:1647628kB, file-rss:10116kB, shmem-rss:16652kB, UID:999 pgtables:3512kB oom_score_adj:937
[  +0.053409] oom_reaper: reaped process 1917152 (beam.smp), now anon-rss:0kB, file-rss:0kB, shmem-rss:16808kB


ikavgo commented Feb 15, 2022

/proc/meminfo:

MemTotal:       33127960 kB
MemFree:         9458728 kB
MemAvailable:   19719920 kB
Buffers:          664872 kB
Cached:         10103040 kB
SwapCached:            0 kB
Active:          3480080 kB
Inactive:       16357720 kB
Active(anon):      80252 kB
Inactive(anon):  9563456 kB
Active(file):    3399828 kB
Inactive(file):  6794264 kB
Unevictable:         544 kB
Mlocked:             544 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               356 kB
Writeback:             0 kB
AnonPages:       9012632 kB
Mapped:          2006412 kB
Shmem:            627784 kB
KReclaimable:     534996 kB
Slab:             852720 kB
SReclaimable:     534996 kB
SUnreclaim:       317724 kB
KernelStack:       37032 kB
PageTables:        97560 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16563980 kB
Committed_AS:   32703764 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       91060 kB
VmallocChunk:          0 kB
Percpu:           205824 kB
HardwareCorrupted:     0 kB
AnonHugePages:     94208 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1156496 kB
DirectMap2M:    20862976 kB
DirectMap1G:    13631488 kB


ikavgo commented Feb 16, 2022

Following @mkuratczyk's hint to https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/configmap.go#L209-L213, I played with a non-managed RabbitMQ deployment:

Non-managed RabbitMQ deployment:

2022-02-16 07:23:07.782 [info] <0.453.0> Memory high watermark set to 12940 MiB (13569212416 bytes) of 32351 MiB (33923031040 bytes) total

Non-managed RabbitMQ deployment with total_memory_available_override_value from the operator:

2022-02-16 07:23:58.473 [info] <0.416.0> Memory high watermark set to 655 MiB (687194767 bytes) of 1638 MiB (1717986919 bytes) total

So a non-managed RabbitMQ works with both the default/unset total_memory_available_override_value and the one set by the operator.
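
For context, the operator's override boils down to a single entry in the generated rabbitmq.conf; with the 2Gi limit from the spec above it comes out roughly as follows (value taken from the watermark log line, i.e. 80% of the 2Gi limit):

    total_memory_available_override_value = 1717986919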

Increasing the resource requests didn't help either. Increasing the memory limit from 2 GiB to 4 GiB made things work, so during startup Rabbit peaks at around 4 GiB and then settles at around 1.6 GiB.
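
For reference, raising the limit can be done directly on the RabbitmqCluster resource; a minimal sketch of what was tried here (values illustrative):

    kubectl apply -f - << EOF
    apiVersion: rabbitmq.com/v1beta1
    kind: RabbitmqCluster
    metadata:
      name: rabbitmq
      namespace: default
    spec:
      replicas: 1
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 4Gi
    EOF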


ikavgo commented Feb 17, 2022

After going back and forth between my two VMs, one Ubuntu and one Arch, I concluded that it's likely related to either

  1. RabbitMQ using Ubuntu as base image running on Arch
  2. Different defaults between Ubuntu and Arch

What has been ruled out/confirmed:

  1. make run-broker on Arch doesn't show any anomaly.
  2. docker run on Arch doesn't show any anomaly either.
  3. On the surface there is no difference in memory usage between a manual rabbit deployment and an operator-managed deployment. The manual deployment works simply because it has no k8s resource requests/limits set.

I decided to ask Google whether it's possible to see a memory-usage difference when an image is built on a different distro than the host, and this is what came up:

moby/moby#8231
https://answers.ros.org/question/336963/rosout-high-memory-usage/

Basically, some software/runtimes enumerate all possible fds to figure out the limit. That's why, by the way, the suggested best practice is to explicitly set the fd limit for the container.
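
As a quick illustration outside Kubernetes, the hard fd limit a container inherits can be inspected and pinned with docker's --ulimit flag (a sketch; the chosen values are arbitrary):

    # print the hard nofile limit inside the container
    docker run --rm rabbitmq:3.8.21-management bash -c 'ulimit -Hn'
    # pin it explicitly for this container
    docker run --rm --ulimit nofile=1048576:1048576 rabbitmq:3.8.21-management bash -c 'ulimit -Hn'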

This is what ulimit -Hn shows on different systems:

Ubuntu host:

rabbitmq@rabbitmq-server-0:/$ ulimit -Hn
1000000

Rabbit running on Ubuntu host inside k8s:

rabbitmq@rabbitmq-server-0:/$ ulimit -Hn
1048576

Arch host:

[dead@arch eventing-rabbitmq]$ ulimit -Hn
524288

Rabbit running on Arch inside k8s:

root@connor-54dc6dfcf7-ks9pv:/# ulimit -Hn
1073741816

I will keep digging, though. The plan:

  • Force the fd limit for k8s on Arch to 100% confirm this (see the post below).
  • Figure out what performs this enumeration in Rabbit. Note: running just erl works with a huge fd limit too.
  • Create an issue in docker-library to set the ulimit to a reasonable value: Explicitly set FD limit docker-library/rabbitmq#545
  • [Bonus] Figure out where these fd limits are even set (cgroups again?) and why they are OS-dependent.


ikavgo commented Feb 17, 2022

So it looks like I can confirm this problem is related to the unset fd limit in our rabbitmq container:

rabbitmq@rabbitmq-server-0:/$ ulimit -uH
bash: ulimit: H: invalid number
rabbitmq@rabbitmq-server-0:/$ ulimit -u
unlimited
rabbitmq@rabbitmq-server-0:/$ ulimit -nH
bash: ulimit: H: invalid number
rabbitmq@rabbitmq-server-0:/$ ulimit -Hn
1073741816
rabbitmq@rabbitmq-server-0:/$ ulimit -n 1024 
rabbitmq@rabbitmq-server-0:/$ ulimit -Hn
1024
rabbitmq@rabbitmq-server-0:/$ rabbitmq-server 
Configuring logger redirection
17:11:51.670 [warning] cluster_formation.randomized_startup_delay_range.min and cluster_formation.randomized_startup_delay_range.max are deprecated
2022-02-17 17:11:53.944 [debug] <0.288.0> Lager installed handler error_logger_lager_h into error_logger
2022-02-17 17:11:53.945 [debug] <0.291.0> Lager installed handler lager_forwarder_backend into error_logger_lager_event
2022-02-17 17:11:53.947 [debug] <0.294.0> Lager installed handler lager_forwarder_backend into rabbit_log_lager_event
2022-02-17 17:11:53.949 [debug] <0.297.0> Lager installed handler lager_forwarder_backend into rabbit_log_channel_lager_event
2022-02-17 17:11:53.951 [debug] <0.300.0> Lager installed handler lager_forwarder_backend into rabbit_log_connection_lager_event
2022-02-17 17:11:53.953 [debug] <0.303.0> Lager installed handler lager_forwarder_backend into rabbit_log_feature_flags_lager_event
2022-02-17 17:11:53.954 [debug] <0.306.0> Lager installed handler lager_forwarder_backend into rabbit_log_federation_lager_event
2022-02-17 17:11:53.956 [debug] <0.309.0> Lager installed handler lager_forwarder_backend into rabbit_log_ldap_lager_event
2022-02-17 17:11:53.957 [debug] <0.312.0> Lager installed handler lager_forwarder_backend into rabbit_log_mirroring_lager_event
2022-02-17 17:11:53.958 [debug] <0.315.0> Lager installed handler lager_forwarder_backend into rabbit_log_prelaunch_lager_event
2022-02-17 17:11:53.960 [debug] <0.318.0> Lager installed handler lager_forwarder_backend into rabbit_log_queue_lager_event
2022-02-17 17:11:53.961 [debug] <0.321.0> Lager installed handler lager_forwarder_backend into rabbit_log_ra_lager_event
2022-02-17 17:11:53.963 [debug] <0.324.0> Lager installed handler lager_forwarder_backend into rabbit_log_shovel_lager_event
2022-02-17 17:11:53.965 [debug] <0.327.0> Lager installed handler lager_forwarder_backend into rabbit_log_upgrade_lager_event
2022-02-17 17:11:53.982 [info] <0.44.0> Application lager started on node 'rabbit@rabbitmq-server-0.rabbitmq-nodes.default'
2022-02-17 17:11:54.360 [info] <0.44.0> Application mnesia started on node 'rabbit@rabbitmq-server-0.rabbitmq-nodes.default'
2022-02-17 17:11:54.361 [info] <0.273.0> 
 Starting RabbitMQ 3.8.21 on Erlang 24.0.5 [jit]
 Copyright (c) 2007-2021 VMware, Inc. or its affiliates.
 Licensed under the MPL 2.0. Website: https://rabbitmq.com

What is going on here:

  1. Paused reconciliation with the rabbitmq.com/pauseReconciliation=true label on my RabbitmqCluster.
  2. Set the StatefulSet command to command: ["sleep", "1000000"].
  3. Went to a shell in the new pod.
  4. Checked the fd limit.
  5. Reduced it.
  6. Ran Rabbit, and it runs (see the sketch after this list).
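
Roughly, those steps translate into something like this (a sketch; resource names and the container index are assumptions based on the defaults used in this issue):

    # stop the operator from reverting manual changes to this cluster
    kubectl label rabbitmqcluster rabbitmq rabbitmq.com/pauseReconciliation=true
    # replace the entrypoint with a long sleep so the pod stays up without starting RabbitMQ
    kubectl patch statefulset rabbitmq-server --type=json \
      -p '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sleep","1000000"]}]'
    kubectl delete pod rabbitmq-server-0
    # once the replacement pod is Running, poke around inside it
    kubectl exec -it rabbitmq-server-0 -- bash
    ulimit -Hn        # huge on the Arch host
    ulimit -n 1024    # lower it for this shell
    rabbitmq-server   # now starts instead of being OOM-killed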

ikavgo added the blocked (Waiting on other work or on 3rd party) label on Feb 17, 2022

ikavgo commented Feb 17, 2022

According to this mail thread: https://erlang.org/pipermail/erlang-questions/2009-August/046173.html Erlang preallocates the port table at startup, so memory consumption directly depends on the fd limit. Going to check whether the mechanics are still the same 13 years later.
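
This is easy to observe anywhere Erlang is installed; a small check (a sketch; the 50000 cap is arbitrary):

    # port table size, which tracks the fd limit when that is higher than the default
    erl -noshell -eval 'io:format("~p~n", [erlang:system_info(port_limit)]), halt().'
    # the same with an explicit cap, as used in the workaround below
    ERL_MAX_PORTS=50000 erl -noshell -eval 'io:format("~p~n", [erlang:system_info(port_limit)]), halt().'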


ikavgo commented Feb 18, 2022

I can confirm that limiting the number of ports via ERL_MAX_PORTS works.


ikavgo commented Feb 18, 2022

@coro suggested using overrides:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: override-erl-max-ports
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              env:
              - name: ERL_MAX_PORTS
                value: "50000"

@github-actions

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions bot added the stale (Issue or PR with long period of inactivity) label on Apr 20, 2022
@endeavour

Same issue here :(

@ChunyiLyu

@endeavour Does the statefulSet override solve the issue for you?

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: override-erl-max-ports
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - name: rabbitmq
              env:
              - name: ERL_MAX_PORTS
                value: "50000"

@endeavour

Yes, that seems to work around the issue :)

@ChunyiLyu

I'm going to close the issue now since there is a workaround. Please feel free to reopen if the above workaround does not work for anyone with a similar problem. Thanks all :)

knative-prow bot pushed a commit to knative-extensions/eventing-rabbitmq that referenced this issue Jun 24, 2022
openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Sep 12, 2022
* Update kolla-ansible from branch 'master'
  to 1b74b18c2eb4eff7c38e010965aa34f2a353c4c5
  - Merge "Add CentOS Stream 9 / Rocky Linux 9 host support"
  - Add CentOS Stream 9 / Rocky Linux 9 host support
    
    Added c9s jobs are non voting, as agreed on PTG to focus on Rocky Linux 9.
    Since both CS9 and RL9 have higher default fd limit (1073741816 vs
    1048576 in CS8) - lowering that for:
    * RMQ - because Erlang allocates memory based on this (see [1], [2], [3]).
    * MariaDB - because Galera cluster bootstrap failed
    
    Changed openvswitch_db healthcheck, because for unknown reason
    the usual check (using lsof on /run/openvswitch/db.sock) is hanging
    on "Bad file descriptor" (even with privileged: true).
    
    [1]: docker-library/rabbitmq#545
    [2]: rabbitmq/cluster-operator#959 (comment)
    [3]: systemd/systemd@a8b627a
    
    Depends-On: https://review.opendev.org/c/openstack/tenks/+/856296
    Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856328
    Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856443
    Needed-By: https://review.opendev.org/c/openstack/kolla/+/836664
    Co-Authored-By: Michał Nasiadka <mnasiadka@gmail.com>
    Change-Id: I3f7b480519aea38c3927bee7fb2c23eea178554d
openstack-mirroring pushed a commit to openstack/kolla-ansible that referenced this issue Sep 12, 2022
Added c9s jobs are non voting, as agreed on PTG to focus on Rocky Linux 9.
Since both CS9 and RL9 have higher default fd limit (1073741816 vs
1048576 in CS8) - lowering that for:
* RMQ - because Erlang allocates memory based on this (see [1], [2], [3]).
* MariaDB - because Galera cluster bootstrap failed

Changed openvswitch_db healthcheck, because for unknown reason
the usual check (using lsof on /run/openvswitch/db.sock) is hanging
on "Bad file descriptor" (even with privileged: true).

[1]: docker-library/rabbitmq#545
[2]: rabbitmq/cluster-operator#959 (comment)
[3]: systemd/systemd@a8b627a

Depends-On: https://review.opendev.org/c/openstack/tenks/+/856296
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856328
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856443
Needed-By: https://review.opendev.org/c/openstack/kolla/+/836664
Co-Authored-By: Michał Nasiadka <mnasiadka@gmail.com>
Change-Id: I3f7b480519aea38c3927bee7fb2c23eea178554d
openstack-mirroring pushed a commit to openstack/kolla-ansible that referenced this issue Feb 17, 2023
This is a backport from Zed.

cephadm bits to use package from distro backported from
I30f071865b9b0751f1336414a0ae82571a332530

Added c9s jobs are non voting, as agreed on PTG to focus on Rocky Linux 9.
Since both CS9 and RL9 have higher default fd limit (1073741816 vs
1048576 in CS8) - lowering that for:
* RMQ - because Erlang allocates memory based on this (see [1], [2], [3]).
* MariaDB - because Galera cluster bootstrap failed

Changed openvswitch_db healthcheck, because for unknown reason
the usual check (using lsof on /run/openvswitch/db.sock) is hanging
on "Bad file descriptor" (even with privileged: true).

Added kolla_base_distro_version helper var.

[1]: docker-library/rabbitmq#545
[2]: rabbitmq/cluster-operator#959 (comment)
[3]: systemd/systemd@a8b627a

Depends-On: https://review.opendev.org/c/openstack/ansible-collection-kolla/+/864993
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/864971
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/864973
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/870499

Co-Authored-By: Michał Nasiadka <mnasiadka@gmail.com>
Change-Id: I3f7b480519aea38c3927bee7fb2c23eea178554d