Exchange federation links not automatically restarted after rolling restart #11495
Replies: 5 comments
-
I cannot reproduce (and this is not something that's been reported elsewhere). The details of my test are coming shortly in a separate comment.
-
Given the following two standalone nodes (enough to test shovels and federation links, and since two node clusters are explicitly recommended against),
Node 1
It can use all defaults with
The definitions from that node are attached. There are two upstream definitions to complicate the scenario: one to the same node but a different virtual host (
Node 2
Node 2 largely uses all defaults, has the same plugins enabled and uses the following config file that avoids port conflicts:
# rabbitmq.conf
management.tcp.ip = ::1
management.tcp.port = 15674
# stream.listeners.tcp.1 = 127.0.0.1:5553
# stream.listeners.tcp.2 = ::1:5553
# prometheus.tcp.ip = 15693
Steps Performed
-
In a multi-node cluster, a few things are different, namely that federation links will run on a single node that can be considered "a leader". Nothing in that area has changed in a while, and the same mechanism is used for shovels. My best guess is that a restart in your case triggers a condition where federation links In any case, if a rolling restart would prevent federation links from starting every time, we'd
-
As for 6 restarts a day: that may be well above average but does not seem to be very high. 20 restarts in a row will be time consuming to test, but in theory, over the lifetime of a cluster, 20 restarts should be a matter of a few months (if not a couple of weeks) for most Kubernetes-based deployments, for example. I cannot suggest anything without debug-level logs, but since you restart your cluster so often from a script, you might as well (re)declare the policy there or use definition import for policies and exchanges. Whether this happens at boot time or via CLI tools (or the HTTP API) from a script should not matter much: in modern versions definitions are imported after plugin startup, so federation will already be in place.
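For reference, (re)declaring a policy over the HTTP API from a restart script could look roughly like the sketch below. It only builds a `PUT /api/policies/{vhost}/{name}` request and does not send it. The base URL, credentials, policy name, pattern and the `federation-upstream-set` value are placeholders I made up for illustration, not values from this thread.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_policy_request(base_url, vhost, name, pattern, definition,
                         user, password, apply_to="exchanges", priority=0):
    """Build (but do not send) a PUT /api/policies/{vhost}/{name} request."""
    # Virtual host and policy names must be percent-encoded ("/" becomes "%2F").
    path = "/api/policies/{}/{}".format(
        urllib.parse.quote(vhost, safe=""), urllib.parse.quote(name, safe=""))
    body = json.dumps({
        "pattern": pattern,
        "definition": definition,
        "apply-to": apply_to,
        "priority": priority,
    }).encode("utf-8")
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + token},
    )


# Hypothetical values: adjust URL, vhost, policy name, pattern and credentials.
request = build_policy_request(
    "http://localhost:15672", "/", "federate-exchanges", "^federated\\.",
    {"federation-upstream-set": "all"}, "guest", "guest")
# urllib.request.urlopen(request)  # uncomment to actually send the request
```

The same can be done with `rabbitmqctl set_policy` if the script runs on a cluster node.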
-
@michaelklishin Thank you very much for your reconstruction efforts and detailed analysis. I really appreciate it. Here are my thoughts after reading your comments. Debug level logs were included within the description of our situation. Something went wrong during the first upload. Maybe you read it before my edit of the description, in which I did the upload again. Here are the links:
load_definitions will not work
That leaves us with definition import by script. We have shared responsibility with our customer. The customer is in charge of changing configuration via the Management web UI. If the customer deletes something, we should not reinstate it. Therefore we cannot export definitions once and reimport them on a regular basis via script. If we cannot determine and eliminate the root cause, I think our only workaround option is to do it within the restart script using CLI or HTTP API as you suggested.
Root cause and possible further effects of it
Maybe as a clue you can point out what - maybe environmental - situations can cause the following error. It might be a good starting point for my own further investigations.
Again, thanks a lot!
-
Describe the bug
Hi,
we seem to have a similar situation as in #3148 with RabbitMQ 3.13.2 and Erlang 26.2.4. It is not consistent, but sometimes during a rolling cluster restart (one node at a time, without upgrade) we lose the federation link. When this happens, the Federation Status page says "no links". Once in this state the federation link almost never recovers; it only did so once in weeks.
In our case federation is used to transport messages from one vhost to another within the same cluster.
I have uploaded logs from our 2-node-cluster.
This cluster is not yet operational. We have another, older cluster with RabbitMQ 3.8.2 and Erlang 22.2.7 with the same configuration, except for cluster names and the introduction of certificates within the upstream URI, which is needed since Erlang 26. We plan to replace the old cluster with the new one.
Another change we made is the use of separate URIs for each node within the same upstream instead of using a load balancer as we did in the old cluster. We did this to rule out load balancer issues, hoping it would solve the problem. It didn't.
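To illustrate what "separate URIs within the same upstream" means, here is a minimal sketch that builds the JSON body for a `federation-upstream` parameter whose `uri` is a list with one entry per node. The host names are the ones from our certificate; the vhost name, the port (5671, the usual AMQPS port) and the CA file path are placeholders, and the TLS query parameters shown (`cacertfile`, `verify`, `server_name_indication`) are only an example of what such a URI can carry, not our exact settings.

```python
import json
from urllib.parse import quote, urlencode


def upstream_uri(host, vhost, port=5671, **tls_options):
    """Build an amqps:// URI for one node, with TLS options as query parameters."""
    uri = "amqps://{}:{}/{}".format(host, port, quote(vhost, safe=""))
    if tls_options:
        # Keep "/" readable so file paths stay recognisable in the URI.
        uri += "?" + urlencode(tls_options, safe="/")
    return uri


def upstream_parameter(hosts, vhost, **tls_options):
    """Body for PUT /api/parameters/federation-upstream/{vhost}/{name}."""
    return {"value": {"uri": [upstream_uri(h, vhost, **tls_options) for h in hosts]}}


# Placeholder vhost and CA path; host names taken from the certificate SAN list.
body = upstream_parameter(
    ["x17-rabbit1.stuttgart.de", "x17-rabbit2.stuttgart.de"],
    "source-vhost",
    cacertfile="/path/to/ca_certificate.pem",
    verify="verify_peer",
)
print(json.dumps(body, indent=2))
```

With a list of URIs the link can connect to either node directly, so no load balancer sits between the link and its upstream.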
Certificate data to show this should not be an issue:
CN: rabbit.stuttgart.de
SAN: DNS:rabbit.stuttgart.de, DNS:x17-rabbit-ha.stuttgart.de, DNS:x17-rabbit1.stuttgart.de, DNS:x17-rabbit2.stuttgart.de
IP Address:10.163.41.70, IP Address:10.163.41.71, IP Address:10.163.41.69
Validity
Not Before: Apr 8 13:12:00 2024 UTC
Not After : May 11 13:13:00 2025 UTC
There is one thing special in our environment:
We use an rsync mechanism to keep our machines up to date concerning OS and Packages regularly every 4 hours. At first I thought this might cause the issue. This process includes daemon-reload and restart of rabbitmq-server.service after each sync.
To check if the sync in itself might be the culprit I stopped the regular rsync and replaced it with a script which only restarts the rabbitmq-server.service regularly, but does nothing else:
Mostly around the 20th repetition the federation link goes missing. In the most recent case it happened after the 6th repetition.
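To catch the repetition at which the link disappears, the restart script could check the federation link list after each restart. The sketch below works on already parsed JSON such as what `GET /api/federation-links` returns when the federation management plugin is enabled. The field names `vhost` and `status` and the value `running` are my assumptions about that response, and the sample data is made up.

```python
def running_links(links, vhost):
    """Return the links of the given vhost that report a running status."""
    return [link for link in links
            if link.get("vhost") == vhost and link.get("status") == "running"]


def links_missing(links, vhost):
    """True when the vhost has no running federation link ("no links")."""
    return not running_links(links, vhost)


# Made-up sample responses: one taken after a good restart, one after a bad one.
after_good_restart = [
    {"vhost": "target-vhost", "exchange": "events", "status": "running"},
]
after_bad_restart = []

print(links_missing(after_good_restart, "target-vhost"))
print(links_missing(after_bad_restart, "target-vhost"))
```

If `links_missing` returns true, the script can stop looping so that the logs and the node state can be inspected right at that point.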
Around the same time we regularly see:
2024-06-19 15:08:50.151602+02:00 [error] <0.2058.0> 130000]}}}}
...
Solution attempts
The only thing that works as a workaround is deleting the policy and adding it again.
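A sketch of how that workaround could be scripted: read the policy as it exists right now, delete it, and put the same definition back. If the policy no longer exists, nothing is recreated. The `api` object is a hypothetical thin wrapper around the HTTP API (`get` is assumed to return the parsed JSON body, or `None` on a 404); the in-memory class below only stands in for it so the example runs without a broker, and the vhost and policy names are made up.

```python
from urllib.parse import quote


def policy_path(vhost, name):
    """HTTP API path of a policy; vhost and name are percent-encoded."""
    return "/api/policies/{}/{}".format(quote(vhost, safe=""), quote(name, safe=""))


def redeclare_policy(api, vhost, name):
    """Delete and re-add a policy using whatever definition it has right now."""
    path = policy_path(vhost, name)
    current = api.get(path)
    if current is None:
        # The policy was removed on purpose (or never existed): do not reinstate it.
        return False
    body = {key: current[key]
            for key in ("pattern", "definition", "priority", "apply-to")
            if key in current}
    api.delete(path)
    api.put(path, body)
    return True


class InMemoryApi:
    """Stand-in for a real HTTP client, for demonstration only."""

    def __init__(self, policies):
        self.policies = dict(policies)
        self.calls = []

    def get(self, path):
        self.calls.append(("GET", path))
        return self.policies.get(path)

    def delete(self, path):
        self.calls.append(("DELETE", path))
        self.policies.pop(path, None)

    def put(self, path, body):
        self.calls.append(("PUT", path))
        self.policies[path] = body


api = InMemoryApi({policy_path("target-vhost", "federate"): {
    "vhost": "target-vhost", "name": "federate", "pattern": "^fed\\.",
    "apply-to": "exchanges", "priority": 0,
    "definition": {"federation-upstream-set": "all"}}})
print(redeclare_policy(api, "target-vhost", "federate"))             # True
print(redeclare_policy(api, "target-vhost", "deleted-by-customer"))  # False
```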
I have attached logs and screenshots.
Is there any solution to this?
Thank you very much in advance for your help!
rabbit@x17-rabbit1.log
rabbit@x17-rabbit2.log
m/user-attachments/files/15901538/rabbit%40x17-rabbit1.log)
Reproduction steps
Expected behavior
After one node goes down the federation link gets picked up on the remaining node consistently.
Additional context
No response