Exchange federation links not automatically restarted after a rolling upgrade to 3.8.15 or later release #3148

johanrhodin · 2021-06-25T04:49:48Z

Up to and including RabbitMQ 3.8.14, restarting one node in a multi-node cluster would cause the associated exchange federation links to move to another node in the cluster. (This is the documented behavior in https://www.rabbitmq.com/federation.html#clustering: "Exchange federation links will start on any node in the downstream cluster. They will fail over to other nodes if the node they are running on crashes or stops.")

In 3.8.15 and higher, if the node hosting the links goes down, the federation link is removed and the policy needs to be recreated for the links to reappear. While the node serving the link is stopping, the management interface shows a status of "Starting", followed by no links at all once the node has fully stopped.

In my example scenario I used a one-node cluster (3.8.16 with Erlang 24.0.2) as upstream and a three-node cluster (3.8.14/15 with Erlang 23.2.3) as downstream. After noting which node ran the link, I stopped that node.

The following policy was used:
curl -i -X PUT -H 'Content-Type: application/json' $HTTPS_DOWNSTREAM/api/policies/$DOWNSTREAM_VHOST/fedit/ -d '{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'

and the upstream is defined as:
curl -i -XPUT -H "content-type:application/json" -d'{"value":{"uri":"'$UPSTREAM_URL'","expires":3600000}}' $HTTPS_DOWNSTREAM/api/parameters/federation-upstream/$DOWNSTREAM_VHOST/upstream

This issue was reported both to us (CloudAMQP) and to RabbitMQ Slack: https://rabbitmq.slack.com/archives/C1EDN83PA/p1623755314386000

michaelklishin · 2021-06-25T05:07:00Z

Has any investigation into this been conducted by CloudAMQP?

The only changes relevant to Federation after 3.8.14 were

and the upgrade of Ranch to 2.0 which I doubt can matter here.

johanrhodin · 2021-06-26T03:42:32Z

No further investigation has been done apart from confirming the issue and constructing cases to test it.

I take it that one way to debug would be to build 3.8.x without 98724ef#diff-f478bae44be51a403ce03f61c0db27438df938766af16db7f2407c32b774eb8f and b9836cc#diff-aea6cb0036e8357870edde16f99c4b4dfd3d91e013d937ed66135da1703d18f6, respectively and see if the issue still remains, and go from there?

michaelklishin · 2021-06-28T14:01:57Z

Reverting those is one option but seeing what mirrored supervisor messages are produced at debug level might be enough.

johanrhodin · 2021-06-29T03:02:23Z

OK, attaching debug level logs (rabbitmqctl set_log_level debug) from the three downstream nodes.
rabbit@test-burly-silver-chamois-01.log
rabbit@test-burly-silver-chamois-02.log
rabbit@test-burly-silver-chamois-03.log

Node -02 was the one that had the federation link running, was stopped and then started again.

michaelklishin · 2021-07-02T11:28:59Z

I cannot reproduce this with three v3.8.x nodes. When I stop nodes that host some exchange federation links, the links migrate to one of the online nodes and recover their connections.

michaelklishin · 2021-07-02T11:41:40Z

@johanrhodin when you say

3.8.14/15 nodes

does this mean this is a mixed-version cluster? In a mixed-version cluster, 3.8.15 nodes use a different process group membership library, pg instead of pg2 (as pg2 was removed), so the mirrored supervisors that ultimately start
and monitor federation links cannot see old cluster members after the upgrade (the nodes remain clustered, sure,
but not as far as mirrored supervisors go).

It is expected that during a rolling upgrade to 3.8.15, features that rely on process groups will not observe
new cluster members on old nodes. This is the same problem as described in #3080 but in a different place.

As explained in #3080, there isn't much our team can do about this. pg2 was removed in Erlang 24 and replaced
with pg (the original name of the process group module), so even if we wanted to invest time into introducing
a module that would use both and merge their member sets, we cannot do that while still supporting Erlang 24.

For federation users, clearing and re-enabling the policy should be sufficient to bring back the links on the upgraded post-3.8.15 nodes.

michaelklishin · 2021-07-03T12:18:21Z

I could not reproduce this with 3.8.16, 3.8.18 but could with a mixed 3.8.14 cluster upgraded to 3.8.16 in a rolling fashion.

All the symptoms from #3080 were present: this is a side-effect of a changed process group module used by plugins
that are distributed one way or another: Federation, Shovel, management.

The bad news is that we cannot do anything about this without dropping (and never re-introducing) Erlang 24 support.
The good news is that there is a trivial workaround: after all nodes are upgraded to 3.8.16 or later, remove the policy that enables exchange federation
and re-create it. The link will be started and migrated between cluster nodes as you'd expect.
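
The workaround above could be scripted against the management HTTP API roughly as follows. This is a minimal sketch, not a tested procedure: it reuses the `$HTTPS_DOWNSTREAM` and `$DOWNSTREAM_VHOST` variables and the `fedit` policy from the original report, it assumes the policy definition has not changed since then, and the `run` and `recreate_policy` helpers and the `DRY_RUN` switch are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Sketch: delete and re-create the federation policy once ALL nodes are upgraded.
# Assumptions: HTTPS_DOWNSTREAM and DOWNSTREAM_VHOST are set as in the report,
# and the policy is named "fedit". run/recreate_policy/DRY_RUN are hypothetical
# helpers for this example, not part of RabbitMQ.

POLICY_NAME="fedit"
POLICY_BODY='{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'

# In dry-run mode print the command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

recreate_policy() {
  # Note: a vhost containing "/" must be percent-encoded (%2F) in the URL.
  url="$HTTPS_DOWNSTREAM/api/policies/$DOWNSTREAM_VHOST/$POLICY_NAME"
  # 1. Remove the policy that enables exchange federation.
  run curl -sS -f -X DELETE "$url" || return 1
  # 2. Re-create it with the same definition so the links are started again.
  run curl -sS -f -X PUT -H 'Content-Type: application/json' -d "$POLICY_BODY" "$url"
}
```

Running `DRY_RUN=1 recreate_policy` first prints the two requests without sending them, which makes it easy to check the URL before touching a live cluster.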

johanrhodin · 2021-07-19T21:53:21Z

@michaelklishin I should have been more explicit with 3.8.14/15. What I meant was that with 3.8.14 it worked and when the same cluster was upgraded to 3.8.14+ (3.8.15 and higher) it didn't work anymore. I didn't try with a mixed cluster.

I still see this with all nodes involved running Erlang 24.0.2 and RabbitMQ 3.8.19. I will try and create a minimal working example for reproduction.

johanrhodin · 2021-08-12T20:46:38Z

OK I can't reproduce with 3.8.21. I assume it is because of #3263.

I can reproduce with 3.8.16, with the following:

# 1. Two cluster definitions
UPSTREAM_URL="amqps://xcjvgoyg:PASSWD@test-myrtle-green-stingray.rmq2.cloudamqp.com/xcjvgoyg" # 1 node 3.8.21

DOWNSTREAM_VHOST=kjdjuxhr

HTTPS_DOWNSTREAM="https://kjdjuxhr:PASSWD@test-exotic-blond-duckbill.rmq2.cloudamqp.com" # 3 nodes 3.8.21

# 2. Create federation-upstream on downstream
curl -i -XPUT -H "content-type:application/json" -d'{"value":{"uri":"'$UPSTREAM_URL'","expires":3600000}}' $HTTPS_DOWNSTREAM/api/parameters/federation-upstream/$DOWNSTREAM_VHOST/upstream

# 3. Create a federation policy on downstream
curl -i -X PUT -H 'Content-Type: application/json' $HTTPS_DOWNSTREAM/api/policies/$DOWNSTREAM_VHOST/fedit/ -d '{"pattern":".", "definition": {"federation-upstream-set":"all"}, "priority":0, "apply-to": "exchanges"}'

# 4. Stop RabbitMQ on the node that has the link running on downstream.
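
To find out which downstream node currently hosts the link before step 4, the federation management plugin's link listing can be queried. The following is only a sketch under stated assumptions: it assumes the `/api/federation-links/<vhost>` endpoint is available (it is provided by the `rabbitmq_federation_management` plugin), that each returned link object carries a `node` field, and that the response is compact JSON. The `link_nodes` helper is a hypothetical name, and the grep-based extraction is a shortcut rather than a real JSON parser.

```shell
#!/bin/sh
# Sketch: list the node(s) hosting federation links, from a JSON response on stdin.
# Assumption: each link object has a "node" field and the JSON is compact
# (no whitespace around the colon). link_nodes is a hypothetical helper.
link_nodes() {
  grep -o '"node":"[^"]*"' | sed 's/^"node":"//; s/"$//' | sort -u
}

# Usage against a live downstream cluster (not executed here):
#   curl -sS "$HTTPS_DOWNSTREAM/api/federation-links/$DOWNSTREAM_VHOST" | link_nodes
```

Where `jq` is installed, piping the same response through `jq -r '.[].node'` would be a more robust alternative to the grep shortcut.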

lfstuttgart · 2024-06-17T15:34:15Z

Hi,

we seem to have the same situation with RabbitMQ 3.13.2 and Erlang 26.2.4. It is not consistent, but sometimes during a cluster restart (one node at a time) we lose the federation link.

In our case federation is used to transport messages from one vhost to another.

I have uploaded logs from our 2-node cluster. The last time I see anything about the federation link in there is at 2024-06-17 14:48:13.975211+02:00 and the lines immediately following. The federation upstream points to a load balancer address (x17-rabbit-ha.stuttgart.de).

This cluster is not yet operational. We have another, older cluster with RabbitMQ 3.8.2 and Erlang 22.2.7 with the same configuration, except for the cluster names, the address of the load balancer, and the certificates in the upstream URI that are needed since Erlang 26.

Certificate data to show this should not be an issue:
CN: rabbit.stuttgart.de
SAN: DNS:rabbit.stuttgart.de, DNS:x17-rabbit-ha.stuttgart.de, DNS:x17-rabbit1.stuttgart.de, DNS:x17-rabbit2.stuttgart.de
IP Address:10.163.41.70, IP Address:10.163.41.71, IP Address:10.163.41.69
Validity
Not Before: Apr 8 13:12:00 2024 UTC
Not After : May 11 13:13:00 2025 UTC

Is there any solution to this?

Thank you very much in advance for your help!

rabbit@x17-rabbit2.log

rabbit@x17-rabbit1.log

johanrhodin · 2024-06-17T15:40:40Z

@lfstuttgart start a new discussion and provide exact steps to reproduce (even if it only happens sometimes). We can then take a look at this (potential) issue.
In that new discussion you can link to this one, for context.

lfstuttgart · 2024-06-19T13:40:26Z

@johanrhodin thanks! Here is the new discussion: https://github.com/rabbitmq/rabbitmq-server/issues/11492

johanrhodin · 2024-06-20T19:22:30Z

Here is the new discussion: #11495

michaelklishin closed this as completed Jul 3, 2021

michaelklishin changed the title ~~Exchange federation links not automatically restarted in RabbitMQ >=3.8.15~~ Exchange federation links not automatically restarted after a rolling upgrade to 3.8.15 or later release Jul 3, 2021

michaelklishin added erlang-24 rabbitmq-federation labels Jul 3, 2021

michaelklishin added a commit that referenced this issue Jul 5, 2021

Mirrored supervisor: make it easier to keep track of group membership changes in the logs. References #3148.

65ccf7c

michaelklishin added a commit that referenced this issue Jul 5, 2021

Mirrored supervisor: make it easier to keep track of group membership changes in the logs. References #3148. (cherry picked from commit 65ccf7c)

ce40108
