Exchange federation links not automatically restarted after rolling restart #11495

lfstuttgart · 2024-06-19T13:36:13Z

Describe the bug

Hi,

we seem to have a situation similar to #3148 on RabbitMQ 3.13.2 with Erlang 26.2.4. It is not consistent, but sometimes during a rolling cluster restart (one node at a time, without an upgrade) we lose the federation link. When this happens, the Federation Status page says "no links". Once in this state, the federation link almost never recovers; it did so only once in weeks.

In our case federation is used to transport messages from one vhost to another within the same cluster.

I have uploaded logs from our 2-node-cluster.

This cluster is not yet operational. We have another, older cluster running RabbitMQ 3.8.2 on Erlang 22.2.7 with the same configuration, except for the cluster names and the certificates added to the upstream URI, which are needed since Erlang 26. We plan to replace the old cluster with the new one.
Another change we made is to use separate URIs for each node within the same upstream instead of a load balancer, as we did in the old cluster. We did this to rule out load balancer issues, hoping it would solve the problem. It didn't.

Certificate data to show this should not be an issue:
CN: rabbit.stuttgart.de
SAN: DNS:rabbit.stuttgart.de, DNS:x17-rabbit-ha.stuttgart.de, DNS:x17-rabbit1.stuttgart.de, DNS:x17-rabbit2.stuttgart.de
IP Address:10.163.41.70, IP Address:10.163.41.71, IP Address:10.163.41.69
Validity
Not Before: Apr 8 13:12:00 2024 UTC
Not After : May 11 13:13:00 2025 UTC

There is one thing special in our environment:
We use an rsync mechanism to keep the OS and packages on our machines up to date every 4 hours. This process includes a daemon-reload and a restart of rabbitmq-server.service after each sync. At first I thought this might cause the issue.

To check whether the sync itself might be the culprit, I stopped the regular rsync and replaced it with a script that only restarts rabbitmq-server.service at regular intervals and does nothing else:

#!/bin/bash
# Restart both nodes one after the other, 60 times, pausing 60 s after each restart.
for i in {1..60}
do
    echo "Counter $i"
    echo "$(date) Restart x17-rabbit1"
    ssh x17-rabbit1 'systemctl restart rabbitmq-server.service'
    sleep 60
    echo "$(date) Restart x17-rabbit2"
    ssh x17-rabbit2 'systemctl restart rabbitmq-server.service'
    sleep 60
done

The federation link usually goes missing around the 20th repetition. In the most recent run it went missing after the 6th repetition.
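To pin down the repetition at which the link disappears, the restart loop could check the link status itself after every restart. The following is only a sketch under assumptions that are not part of the setup described above: the management plugin is reachable over plain HTTP on port 15672, `MGMT_USER` and `MGMT_PASS` hold management credentials (the `guest` defaults are placeholders), and `/api/federation-links` from rabbitmq_federation_management reports the links of the whole cluster with a `status` field per link.

```shell
#!/usr/bin/env bash
# Sketch: rolling restart loop that stops at the first iteration with no running link.
# Host names are taken from the script above; port and credentials are assumptions.

# Count the federation links reported as "running" in the JSON read from stdin.
count_running_links() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"running"' | wc -l | tr -d ' '
}

# Fetch the federation link list from the management HTTP API of host $1.
fetch_links() {
  curl -s -u "${MGMT_USER:-guest}:${MGMT_PASS:-guest}" "http://$1:15672/api/federation-links"
}

# Restart one node over SSH, then wait until the node reports that it has booted.
restart_node() {
  ssh "$1" 'systemctl restart rabbitmq-server.service && rabbitmqctl await_startup'
}

# Run up to $1 iterations; return 1 as soon as no running link is reported.
rolling_restart_until_link_lost() {
  max=$1
  i=1
  while [ "$i" -le "$max" ]; do
    for node in x17-rabbit1 x17-rabbit2; do
      echo "$(date) Restart $node (iteration $i)"
      restart_node "$node"
      sleep 60
      running=$(fetch_links "$node" | count_running_links)
      if [ "$running" -eq 0 ]; then
        echo "$(date) No running federation link after restarting $node in iteration $i"
        return 1
      fi
    done
    i=$((i + 1))
  done
  return 0
}
```

Calling `rolling_restart_until_link_lost 60` would mirror the 60 repetitions of the original script and leave the cluster in the failed state for inspection instead of continuing to restart it.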

Around the same time we regularly see:

2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {error,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {noproc,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {gen_server,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> call,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> [<0.2077.0>,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {command,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {open_channel,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> none,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> {amqp_selective_consumer,
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> []}}},

2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> 130000]}}}}
...

Solution attempts

Restarted the cluster after losing the federation link: the link did not come back
Set LimitNOFILE=infinity
Set LimitNPROC=infinity, because of the noproc in the logged error
Switched the upstream URI from one entry for the load balancer to two entries, one for each node (still only one upstream, with two space-separated URIs)
Ruled out rsync issues as explained above

The only workaround that works is deleting the policy and adding it again.
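For reference, that workaround could be scripted roughly as follows. This is a sketch, not the exact commands used here: `recreate_policy` is a hypothetical helper name, and the pattern and definition passed to it must match whatever the real policy contains.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: delete a federation policy, then declare it again.
# Arguments: vhost, policy name, pattern, JSON definition.
recreate_policy() {
  vhost=$1 name=$2 pattern=$3 definition=$4
  # Only re-declare the policy if deleting it succeeded.
  rabbitmqctl clear_policy -p "$vhost" "$name" &&
    rabbitmqctl set_policy -p "$vhost" --apply-to exchanges "$name" "$pattern" "$definition"
}
```

With the names that appear in the logs, a call might look like `recreate_policy infopoolBS kopla_objects-federation-2024 '^kopla_objects$' '{"federation-upstream":"kopla-upstream-2024"}'`, where the pattern and the definition are assumptions.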

I have attached logs and screenshots.

Is there any solution to this?

Thank you very much in advance for your help!

rabbit@x17-rabbit1.log
rabbit@x17-rabbit2.log

Reproduction steps

Create an Upstream (see screenshot)
Create a Policy for the upstream (see screenshot)
Do a rolling node restart multiple times (node 1, node 2, node 1, node 2, ...), giving each cluster node enough time to rejoin the cluster before restarting the other node. In this case the cluster is being prepared for production and has minimal workload.

Expected behavior

After one node goes down, the federation link is consistently picked up by the remaining node.

Additional context

No response

michaelklishin · 2024-06-19T15:42:08Z

Maintainer

I cannot reproduce (and this is not something that's been reported elsewhere). The details of my test are coming shortly in a separate comment.

michaelklishin · 2024-06-19T15:59:17Z

Maintainer

Given the following two standalone nodes (enough to test shovels and federation links, and since two-node clusters are explicitly recommended against), when I restart either of them, exchange federation links recover perfectly fine, 5 times out of 5.

Node 1

It can use all defaults with rabbitmq_federation and rabbitmq_federation_management plugins enabled.

The definitions from that node are attached. There are two upstream definitions to complicate the scenario: one to the same node but a different virtual host (vh1) and another to node 2 using port 5673 for AMQP locally.

Node 2

Node 2 largely uses all defaults, has the same plugins enabled and uses the following config file that avoids port conflicts:

# rabbitmq.conf
management.tcp.ip = ::1
management.tcp.port = 15674

# stream.listeners.tcp.1 = 127.0.0.1:5553
# stream.listeners.tcp.2 = ::1:5553

# prometheus.tcp.port = 15693

Steps Performed

1. Start node 1
2. Start node 2
3. Open two management UI tabs, one for each node; the 2nd node uses port 15674
4. Import the definitions or manually create two upstreams, to localhost:5672/vh1 and localhost:5673
5. Declare a policy using

   rabbitmqctl set_policy --apply-to exchanges fedx "^fed\." '{"federation-upstream-set":"all"}'

6. Declare an exchange called fed.fanouts.1 of type fanout
7. Observe that the exchange matches the policy on the exchange list
8. Observe that links have started and each node has connections from node 1 (node 1 connects to itself but a different virtual host)
9. Publish some messages to fed.fanouts.1 and observe some traffic flowing over the links
10. Restart node 1 a few times
11. Repeat steps 7 to 9
12. Conclude that the links are restarted
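The manual part of this setup (creating the upstreams, the policy and the exchange on node 1) could be scripted roughly as below. It is a sketch only: the upstream names `up-vh1` and `up-node2`, the use of the default `guest` user, and the `rabbitmqadmin` invocation (the Python-based tool shipped with the management plugin in 3.x) are assumptions, not taken from the attached definitions.

```shell
#!/usr/bin/env bash
# Sketch: declare the upstreams, policy and exchange used in the test on node 1.
setup_federation_test() {
  # Virtual host targeted by the same-node upstream, with access for the guest user.
  rabbitmqctl add_vhost vh1 &&
  rabbitmqctl set_permissions -p vh1 guest '.*' '.*' '.*' &&
  # Two upstreams: one to vh1 on the same node, one to node 2 (AMQP port 5673).
  rabbitmqctl set_parameter federation-upstream up-vh1 '{"uri":"amqp://localhost:5672/vh1"}' &&
  rabbitmqctl set_parameter federation-upstream up-node2 '{"uri":"amqp://localhost:5673"}' &&
  # Policy from the steps above: federate every exchange whose name starts with "fed."
  rabbitmqctl set_policy --apply-to exchanges fedx '^fed\.' '{"federation-upstream-set":"all"}' &&
  # Fanout exchange that matches the policy.
  rabbitmqadmin declare exchange name=fed.fanouts.1 type=fanout
}
```

Running `setup_federation_test` once against node 1 should leave only the observation and restart steps to do by hand.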

rabbit-1-discussions-11495.json

michaelklishin · 2024-06-19T16:04:18Z

Maintainer

In a multi-node cluster, a few things are different, namely that federation links will run on a single node that can be considered "a leader". Nothing in that area has changed in a while, and the same mechanism is used for shovels.

My best guess is that a restart in your case triggers a condition where federation links voluntarily stop, namely when the federated exchange on the upstream no longer exists. How can an exchange stop existing between node restarts? For example, it can be transient, or maybe the node is reset between restarts. We have seen this as part of the infamous grow-then-shrink upgrade strategy.

In any case, if a rolling restart prevented federation links from starting every time, we'd be flooded with similar reports by now, but that's not the case. Enable debug logging and check the logs from all nodes for clues. Federation links log quite a bit about their lifecycle at debug level.
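One possible way to do that from a shell is sketched below. The helper names are hypothetical, and the node names in the usage note are assumed from the attached log file names.

```shell
#!/usr/bin/env bash
# Sketch: switch nodes to debug logging and pull out the federation-related log lines.

# Raise the log level to debug on every node given as an argument.
enable_debug_logging() {
  for node in "$@"; do
    rabbitmqctl -n "$node" set_log_level debug || return 1
  done
}

# Print only the lines of a log file that mention federation (case-insensitive).
federation_log_lines() {
  grep -i 'federat' "$1"
}
```

For example, `enable_debug_logging rabbit@x17-rabbit1 rabbit@x17-rabbit2`, followed by `federation_log_lines` on each node's log file. A level set with `set_log_level` is a runtime setting and should not be expected to survive a node restart, so for a restart test `log.file.level = debug` in rabbitmq.conf is probably the more suitable option.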

michaelklishin · 2024-06-19T17:26:52Z

Maintainer

As for the restart frequency: 6 restarts a day may be well above average but does not seem very high.

20 restarts in a row will be time-consuming to test, but in theory, over the lifetime of a cluster, 20 restarts should be a matter of a few months (if not a couple of weeks) for most Kubernetes-based deployments, for example.

I cannot suggest anything without debug-level logs, but since you restart your cluster so often from a script, you might as well (re)declare the policy there or use definition import for policies and exchanges. Whether this happens at boot time or via CLI tools (or the HTTP API) from a script should not matter much: in modern versions definitions are imported after plugin startup, so federation will already be in place.

lfstuttgart · 2024-06-20T07:04:13Z

Author

@michaelklishin Thank you very much for your reconstruction efforts and detailed analysis. I really appreciate it.

Here are my thoughts after reading your comments:

Debug-level logs were included in the description of our situation. Something went wrong during the first upload. Maybe you read it before I edited the description and uploaded them again. Here are the links:

load_definitions will not work
See: https://www.rabbitmq.com/docs/definitions#:~:text=Definition%20import%20happens%20after%20plugin,anything%20already%20in%20the%20broker.

"The definitions in the file will not overwrite anything already in the broker."
That means that, since the policy already exists, it will not be recreated by load_definitions.

That leaves us with definition import by script. We share responsibility with our customer: the customer is in charge of changing the configuration through the management web UI. If they delete something, we should not reinstate it. Therefore we cannot export definitions once and re-import them on a regular basis via a script.

If we cannot determine and eliminate the root cause, I think our only workaround is to re-declare the policy within the restart script using the CLI or the HTTP API, as you suggested.
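A script-based workaround that respects the customer's changes could re-declare the policy only if it still exists, reusing whatever pattern and definition are currently stored. The following is a rough sketch, assuming that `rabbitmqctl list_policies` prints tab-separated columns in the order vhost, name, pattern, apply-to, definition, priority; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: re-declare a policy only if it currently exists, so that a policy
# deleted on purpose through the management UI is never reinstated.

# Print "pattern<TAB>apply-to<TAB>definition<TAB>priority" for one policy,
# or nothing if the policy does not exist (column order is an assumption).
policy_fields() {
  rabbitmqctl -q list_policies -p "$1" |
    awk -F'\t' -v n="$2" '$2 == n { print $3 "\t" $4 "\t" $5 "\t" $6 }'
}

# Arguments: vhost, policy name.
reapply_policy_if_present() {
  vhost=$1 name=$2
  fields=$(policy_fields "$vhost" "$name")
  # Policy is gone: assume it was removed deliberately and do nothing.
  [ -n "$fields" ] || return 0
  pattern=$(printf '%s\n' "$fields" | cut -f1)
  apply_to=$(printf '%s\n' "$fields" | cut -f2)
  definition=$(printf '%s\n' "$fields" | cut -f3)
  priority=$(printf '%s\n' "$fields" | cut -f4)
  # Delete and re-declare with the values that were just read back.
  rabbitmqctl clear_policy -p "$vhost" "$name" &&
    rabbitmqctl set_policy -p "$vhost" --apply-to "$apply_to" --priority "$priority" \
      "$name" "$pattern" "$definition"
}
```

In the restart script this could be called after each node has come back, for example as `reapply_policy_if_present infopoolBS kopla_objects-federation-2024`, using the vhost and policy name that appear in the logs.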

Root cause and possible further effects of it
It would still be great to know the root cause. We don't want it to affect the cluster in other ways once the cluster goes into production, with more and different loads and conditions. I am new to RabbitMQ and therefore not experienced enough to determine whether that could be the case.

As a clue, could you point out what situations (possibly environmental) can cause the following error? It might be a good starting point for my own further investigation.

2024-06-19 14:54:02.961117+02:00 [info] <0.1956.0> Stopping application 'amqp_client'
2024-06-19 14:54:02.961625+02:00 [info] <0.1549.0> Federation exchange 'kopla_objects' in vhost 'infopoolBS' disconnected from exchange 'kopla_objects' in vhost 'kopla' on amqps://x17-rabbit1.stuttgart.de:5671/kopla
2024-06-19 14:54:02.961625+02:00 [info] <0.1549.0> {upstream_channel_down,shutdown}
2024-06-19 14:54:02.961673+02:00 [warning] <0.1578.0> closing AMQP connection <0.1578.0> ([redacted IP of rabbit1]:56942 -> [redacted IP of rabbit1]:5671 - Federation link (upstream: kopla-upstream-2024, policy: kopla_objects-federation-2024), vhost: 'kopla', user: 'mqadmin'):
2024-06-19 14:54:02.961673+02:00 [warning] <0.1578.0> client unexpectedly closed TCP connection
2024-06-19 14:54:02.961832+02:00 [debug] <0.1549.0> Exchange federation: link is shutting down, resource cleanup mode: default
2024-06-19 14:54:02.961882+02:00 [debug] <0.1549.0> Federated exchange 'kopla_objects' link will delete its internal queue 'federation: kopla_objects -> rabbit@x17-rabbit1.stuttgart.de:infopoolBS:kopla_objects'
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {error,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {noproc,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {gen_server,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> call,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> [<0.1568.0>,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {command,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {open_channel,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> none,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> {amqp_selective_consumer,
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> []}}},
2024-06-19 14:54:02.962017+02:00 [error] <0.1549.0> 130000]}}}}
... see logs

Again, thanks a lot!
