WIP: Operator-managed rabbitmq:3.8.21-management falls into CrashLoop on Arch Linux #959
Comments
dmesg -He from the kind node:
/proc/meminfo:
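For reference, kind nodes are just containers on the host, so these readings can be collected along the following lines (the node container name is the default for a single-node kind cluster and is an assumption here):

```sh
# Assumed node name: kind-control-plane (default for a single-node kind cluster)
docker exec kind-control-plane dmesg -He           # kernel log from the node; OOM-killer messages show up here
docker exec kind-control-plane cat /proc/meminfo   # memory figures as seen by the node
```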
Following @mkuratczyk's hint to https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/configmap.go#L209-L213 I played with a non-managed RabbitMQ deployment.
Non-managed RabbitMQ deployment:
Non-managed RabbitMQ deployment with
So non-managed RabbitMQ works with both - default/unset
After going back and forth between my two VMs, one Ubuntu and one Arch, I concluded that it's likely related to either
What has been ruled out/confirmed:
I decided to ask Google whether it's possible to see a memory-usage difference when an image is built on a different distro than the host, and this is what came up: moby/moby#8231. Basically, some software/runtimes enumerate all possible fds to figure out the limit. That's also why the suggested best practice is to explicitly set the fd limit in the Dockerfile. The four readings below compare the limits (a sketch of the commands involved follows them). This is the Ubuntu host:
Rabbit running on Ubuntu host inside k8s:
Arch host:
Rabbit running on Arch inside k8s:
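A sketch of how that comparison can be made, for anyone following along (the pod and container names are assumptions, and `kubectl exec` only works while the pod is alive):

```sh
# On the host:
ulimit -n    # soft limit on open file descriptors
ulimit -Hn   # hard limit

# Inside the RabbitMQ container running in Kubernetes (pod/container names assumed):
kubectl exec rabbitmq-server-0 -c rabbitmq -- sh -c 'ulimit -n; grep "open files" /proc/self/limits'
```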
I will keep digging; the plan:
So it looks like I can confirm this problem is related to the unset fd limit in our rabbitmq container:
What is going on here:
According to this mail thread, https://erlang.org/pipermail/erlang-questions/2009-August/046173.html, Erlang preallocates the ports table at startup, so memory consumption directly depends on the fd limit. Going to check if it's still the same mechanics 13 years later.
I can confirm that limiting the number of ports via
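A quick way to see that mechanic locally, assuming an Erlang/OTP installation is at hand (the VM may round the configured value up, so the numbers are indicative only):

```sh
# Without a cap, the VM sizes its port table from the process fd limit once that limit exceeds the default
erl -noshell -eval 'io:format("~p~n", [erlang:system_info(port_limit)]), halt().'

# With the same cap that the override suggested below sets via ERL_MAX_PORTS
ERL_MAX_PORTS=50000 erl -noshell -eval 'io:format("~p~n", [erlang:system_info(port_limit)]), halt().'
```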
@coro suggested to use overrides:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: override-erl-max-ports
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: ERL_MAX_PORTS
                    value: "50000"
This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.
Same issue here :(
@endeavour Does the statefulSet override solve the issue for you?

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: override-erl-max-ports
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: ERL_MAX_PORTS
                    value: "50000"
Yes, that seems to work around the issue :)
I'm going to close the issue now since there is a workaround. Please feel free to reopen if the above workaround does not work for anyone with a similar problem. Thanks all :)
* Update kolla-ansible from branch 'master' to 1b74b18c2eb4eff7c38e010965aa34f2a353c4c5
  - Merge "Add CentOS Stream 9 / Rocky Linux 9 host support"
  - Add CentOS Stream 9 / Rocky Linux 9 host support

    Added c9s jobs are non voting, as agreed on PTG to focus on Rocky Linux 9.

    Since both CS9 and RL9 have higher default fd limit (1073741816 vs 1048576 in CS8) - lowering that for:
    * RMQ - because Erlang allocates memory based on this (see [1], [2], [3]).
    * MariaDB - because Galera cluster bootstrap failed

    Changed openvswitch_db healthcheck, because for unknown reason the usual check (using lsof on /run/openvswitch/db.sock) is hanging on "Bad file descriptor" (even with privileged: true).

    [1]: docker-library/rabbitmq#545
    [2]: rabbitmq/cluster-operator#959 (comment)
    [3]: systemd/systemd@a8b627a

    Depends-On: https://review.opendev.org/c/openstack/tenks/+/856296
    Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856328
    Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/856443
    Needed-By: https://review.opendev.org/c/openstack/kolla/+/836664
    Co-Authored-By: Michał Nasiadka <mnasiadka@gmail.com>
    Change-Id: I3f7b480519aea38c3927bee7fb2c23eea178554d
This is a backport from Zed. cephadm bits to use package from distro backported from I30f071865b9b0751f1336414a0ae82571a332530

Added c9s jobs are non voting, as agreed on PTG to focus on Rocky Linux 9.

Since both CS9 and RL9 have higher default fd limit (1073741816 vs 1048576 in CS8) - lowering that for:
* RMQ - because Erlang allocates memory based on this (see [1], [2], [3]).
* MariaDB - because Galera cluster bootstrap failed

Changed openvswitch_db healthcheck, because for unknown reason the usual check (using lsof on /run/openvswitch/db.sock) is hanging on "Bad file descriptor" (even with privileged: true).

Added kolla_base_distro_version helper var.

[1]: docker-library/rabbitmq#545
[2]: rabbitmq/cluster-operator#959 (comment)
[3]: systemd/systemd@a8b627a

Depends-On: https://review.opendev.org/c/openstack/ansible-collection-kolla/+/864993
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/864971
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/864973
Depends-On: https://review.opendev.org/c/openstack/kolla-ansible/+/870499
Co-Authored-By: Michał Nasiadka <mnasiadka@gmail.com>
Change-Id: I3f7b480519aea38c3927bee7fb2c23eea178554d
Describe the bug
As part of ongoing work on knative-extensions/eventing-rabbitmq#415, I decided to install the Knative eventing RabbitMQ parts on my new Arch Linux machine. However, I wasn't able to get a single-node operator-managed RabbitMQ cluster running. The rabbitmq node always goes into CrashLoop.
To Reproduce
Steps to reproduce the behavior:
The `rabbitmq-server-0` pod goes into `CrashLoop` with nothing more than
Include any YAML or manifest necessary to reproduce the problem.
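The exact manifest isn't reproduced here; a minimal single-node cluster of the kind used would look roughly like the following (the name is chosen to match the `rabbitmq-server-0` pod above):

```sh
kubectl apply -f - <<EOF
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 1
EOF
```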
Expected behavior
RabbitMQ node should start.
Screenshots
If applicable, add screenshots to help explain your problem.
Version and environment information
Additional context
Operator logs:
kubelet logs:
What has been found with the help of my great colleagues @Zerpet, @ChunyiLyu, and @coro:
works!
also works!
When we tried to log in to the container (by replacing the entrypoint with a `sleep 1000000` command), we found out that the first run of `rabbitmq-server` just produces the logs from above and exits with code 0. However, all subsequent runs of `rabbitmq-server` or any `rabbitmq*` command print "Killed" and exit with code 137 (OOM 🤤).
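A sketch of that debugging session, for anyone who wants to repeat it: keep the container alive by overriding its command, then run the server by hand. The patch reuses the override structure shown earlier; the resource and pod names are assumptions.

```sh
# Keep the container alive instead of starting RabbitMQ on boot
kubectl patch rabbitmqcluster rabbitmq --type merge -p '
spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                command: ["sleep", "1000000"]
'

# Then exec in and run the server manually
kubectl exec -it rabbitmq-server-0 -c rabbitmq -- bash
# Inside the container:
#   rabbitmq-server   # first run exits 0; subsequent rabbitmq* commands get OOM-killed (exit 137)
```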