Remove randomized startup delays #3075

Merged
michaelklishin merged 1 commit into master from remove-randomized-startup-delays on Jun 5, 2021

Conversation

ansd
Member

@ansd ansd commented Jun 2, 2021

On initial cluster formation, only one node in a multi-node cluster
should initialize the Mnesia database schema (i.e. form the cluster).
To ensure this for nodes starting up in parallel,
RabbitMQ peer discovery backends have used
either locks or randomized startup delays.

Locks work great: when a node holds the lock, it either starts a new
blank node (if there is no other node in the cluster) or it joins
an existing node. This makes it impossible for two nodes to form
the cluster at the same time.
Consul and etcd peer discovery backends use locks. The lock is acquired
in the consul and etcd infrastructure, respectively.

For other peer discovery backends (classic, K8s, AWS), randomized
startup delays were used. They work well enough in most cases.
However, in rabbitmq/cluster-operator#662 we
observed that in 1% - 10% of the cases (the more nodes or the
smaller the randomized startup delay range, the higher the chances), two
nodes decided to form the cluster. That's bad since it results in a
single Erlang cluster but two RabbitMQ clusters. Even worse, no
obvious alert is triggered and no error message is logged.

To solve this issue, one could increase the randomized startup delay
range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster
formation very slow since it takes up to 3 minutes until
every node is ready, and in rare cases we would still end up with two nodes
forming the cluster.

Another way to solve the problem is to designate a dedicated node to be the
seed node (the one forming the cluster). This was explored in
rabbitmq/cluster-operator#689 and works well.
Two minor downsides to this approach are: 1. If the seed node never
becomes available, the whole cluster won't be formed (which is okay),
and 2. it doesn't integrate with existing dynamic peer discovery backends
(e.g. K8s, AWS) since nodes are not yet known at deploy time.

In this PR, we take a better approach: we remove randomized startup
delays altogether and replace them with locks. However, instead of
implementing our own locking mechanism in an external system (e.g. in K8s),
we re-use Erlang's built-in locking mechanism, global:set_lock/3.

global:set_lock/3 has some convenient properties:

  1. It accepts a list of nodes to set the lock on.
  2. The nodes in that list connect to each other (i.e. create an Erlang
    cluster).
  3. The method is synchronous with a timeout (number of retries). It
    blocks until the lock becomes available.
  4. If a process that holds a lock dies, or the node goes down, the lock
    held by the process is deleted.

The list of nodes passed to global:set_lock/3 corresponds to the nodes
the peer discovery backend discovers (lists).
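A minimal sketch of how this can be used (module, function, and retry values below are made up for illustration; this is not the actual rabbit_peer_discovery code):

```
%% Acquire a cluster-wide lock across the discovered nodes before deciding
%% whether to form a new cluster or to join an existing one.
-module(startup_lock_example).
-export([with_startup_lock/2]).

with_startup_lock(DiscoveredNodes, Fun) ->
    %% The lock id is {ResourceId, LockRequesterId}. Using self() means the
    %% lock is released automatically if this process (or its node) dies.
    LockId = {rabbitmq_cluster_formation, self()},
    Retries = 80,  %% example retry count, not the actual default
    case global:set_lock(LockId, DiscoveredNodes, Retries) of
        true ->
            try
                Fun()  %% form the cluster or join an existing node
            after
                global:del_lock(LockId, DiscoveredNodes)
            end;
        false ->
            {error, failed_to_acquire_startup_lock}
    end.
```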

Two special cases worth mentioning:

  1. That list can be all desired nodes in the cluster
    (e.g. in classic peer discovery where nodes are known at
    deploy time) while only a subset of nodes is available.
    In that case, global:set_lock/3 still sets the lock without
    blocking until all nodes can be connected to. This is good since
    nodes might start sequentially (non-parallel).

  2. In dynamic peer discovery backends (e.g. K8s, AWS), this
    list can be just a subset of desired nodes since nodes might not start up
    in parallel. That's also not a problem as long as the following
    requirement is met: "The peer discovery backend does not list two disjoint
    sets of nodes (on different nodes) at the same time."
    For example, in a 2-node cluster, the peer discovery backend must not
    list only node 1 on node 1 and only node 2 on node 2.

Existing peer discovery backends fulfil that requirement because the
resource the nodes are discovered from is global.
For example, in K8s, once node 1 is part of the Endpoints object, it
will be returned on both node 1 and node 2.
Likewise, in AWS, once node 1 has started, the described list of instances
with the specific tag will include node 1 whether the AWS peer discovery backend
runs on node 1 or node 2.

Removing randomized startup delays also makes cluster formation
considerably faster (up to 1 minute faster if that was the
upper bound in the range).

How these changes were tested

K8s peer discovery backend:

Deployed cluster-operator v1.7.0 on a GKE cluster.

rabbitmq.yml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: cluster-((index))
spec:
  replicas: ((replicas))
  image: "pivotalrabbitmq/rabbitmq:remove-randomized-startup-delays"
  rabbitmq:
    additionalConfig: |
      log.console.level = debug
  resources:
    limits:
      cpu: "1"
      memory: 512Mi
    requests:
      cpu: "0.5"
      memory: 512Mi

Deployed 100 9-node RabbitMQ clusters in sequence, checking that all 9 nodes got clustered and that there were no container restarts (thanks @mkuratczyk):

The following script also stores all the pod logs:
#!/usr/bin/env bash
set -euxo pipefail

clusters=100
replicas=9

max_pod_index=$(( replicas - 1 ))
for i in $(seq 1 "$clusters"); do
    sed "s/((index))/$i/ ; s/((replicas))/$replicas/" rabbitmq.yml | kubectl apply -f -
    sleep 5
    printf "\nRun %d:\n" "$i" >> results.txt
    for pod in $(seq 0 $max_pod_index); do
        kubectl wait --for=condition=Ready=true pod/cluster-"$i"-server-"$pod" --timeout=5m
    done
    for pod in $(seq 0 $max_pod_index); do
        kubectl exec cluster-"$i"-server-"$pod" -c rabbitmq -- rabbitmqadmin list nodes | grep -c disc >> results.txt
    done
    kubectl get pods | grep server | awk '{print $4}' | xargs -I {} echo Container restarts: {} >> results.txt
    mkdir -p logs/"$i"
    for pod in $(seq 0 $max_pod_index); do
        # Save logs from all RabbitMQ nodes
        kubectl logs cluster-"$i"-server-"$pod" > logs/"$i"/cluster-"$i"-server-"$pod".log
        if [[ $(kubectl get pod cluster-"$i"-server-"$pod" | grep server | awk '{print $4}') != 0 ]]; then
            # There was a container restart => get logs from the previous container
            kubectl logs --previous cluster-"$i"-server-"$pod" > logs/"$i"/cluster-"$i"-server-"$pod"-previous.log
        fi
    done
    kubectl rabbitmq delete cluster-"$i"
    kubectl wait --for=delete pods -l=app.kubernetes.io/name=cluster-"$i" --timeout=5m
    sleep 30
done

Classic peer discovery backend:
Same as for the K8s peer discovery backend, but using https://github.com/ansd/cluster-operator/tree/peer-discovery-classic

AWS peer discovery backend:

  1. Create 3 EC2 instances with Ubuntu 20.04 and the same tag mykey: myvalue.
  2. For the security group, allow inbound ports 4369 and 25672 between nodes (i.e. from the same security group) and SSH from my IP address.

On all 3 instances, do the following:

  1. ssh ubuntu@<public IP> -i <pem file>
  2. Install Erlang and Elixir via apt-get.
  3. rabbit_common depends on python: sudo ln -s /usr/bin/python3 /usr/bin/python
  4. git clone rabbitmq-server and:
    make run-broker TEST_TMPDIR="/home/ubuntu/rabbitmq-server/tmp/test" RABBITMQ_CONFIG_FILE="/home/ubuntu/rabbitmq-server/tmp/rabbitmq.conf" RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-setcookie mycookie"

rabbitmq.conf:

cluster_formation.peer_discovery_backend = aws
cluster_formation.aws.region = eu-central-1
cluster_formation.aws.access_key_id = XXX
cluster_formation.aws.secret_key = XXX/aaaa
cluster_formation.aws.use_autoscaling_group = false
cluster_formation.aws.instance_tags.mykey = myvalue

TODOs

@ansd ansd force-pushed the remove-randomized-startup-delays branch from 08b5aca to e3bdf65 Compare June 2, 2021 10:59
@ansd ansd force-pushed the remove-randomized-startup-delays branch from 39de499 to 0876746 Compare June 3, 2021 06:02
@ansd ansd changed the title WIP: Remove randomized startup delays Remove randomized startup delays Jun 3, 2021
@ansd ansd marked this pull request as ready for review June 3, 2021 08:12
{undefined, undefined} ->
    ok;
_ ->
    cuttlefish:warn("cluster_formation.randomized_startup_delay_range.min and "
Member

FTR, Cuttlefish schema translation happens so early that this won't be visible in the regular or prelaunch log, at least that's what I observed. This is a general problem with logging at such early node boot stages; we shouldn't try to address it in this PR (or even for 3.9).

The point is that existing config files with these options will pass validation 👍
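
For illustration, an existing rabbitmq.conf carrying the now-deprecated options might look as follows (the values here are made up for the example); such a file still passes validation and only produces the deprecation warning during schema translation:

```
# deprecated options; still accepted, only a warning is emitted
cluster_formation.randomized_startup_delay_range.min = 0
cluster_formation.randomized_startup_delay_range.max = 60
```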

@michaelklishin
Member

Merging into master (will backport after 3.8.17 ships) because:

  • Code changes look reasonable
  • Kubernetes was extensively tested and will be tested more (as part of the cluster Operator, for example)
  • AWS test suite passes
  • Classic config suite passes and cluster formation works as expected

@michaelklishin michaelklishin merged commit ed0ba6a into master Jun 5, 2021
@michaelklishin michaelklishin deleted the remove-randomized-startup-delays branch June 5, 2021 12:47
ansd added a commit to rabbitmq/rabbitmq-website that referenced this pull request Jun 7, 2021
This is the doc change for
rabbitmq/rabbitmq-server#3075.

In K8s, the cluster-operator nowadays uses the Parallel (instead of OrderedReady)
pod management policy. Therefore, we delete some sentences recommending OrderedReady.

There is no need to document the new config value
`cluster_formation.internal_lock_retries` since it is too much of an
implementation detail for this doc and we don't expect users to change
this value.
michaelklishin added a commit that referenced this pull request Jun 10, 2021
Remove randomized startup delays

(cherry picked from commit ed0ba6a)

Conflicts:
	deps/rabbit/src/rabbit_mnesia.erl
	deps/rabbit/src/rabbit_peer_discovery.erl
	deps/rabbit/test/peer_discovery_classic_config_SUITE.erl
	deps/rabbitmq_peer_discovery_k8s/src/rabbit_peer_discovery_k8s.erl
@michaelklishin
Member

Backported to v3.8.x.

@michaelklishin michaelklishin added this to the 3.8.18 milestone Jun 14, 2021
michaelklishin pushed a commit to rabbitmq/rabbitmq-website that referenced this pull request Aug 9, 2021
ansd added a commit to rabbitmq/cluster-operator that referenced this pull request Aug 15, 2022
Since we bumped the minimum supported RabbitMQ version to v3.9.0 in
#1110, we can delete
the deprecated `cluster_formation.randomized_startup_delay_range`
configurations.

See rabbitmq/rabbitmq-server#3075.

Prior to this commit, the RabbitMQ logs contained the following warning:
```
2022-08-15 08:18:03.870480+00:00 [warn] <0.130.0> cluster_formation.randomized_startup_delay_range.min and cluster_formation.randomized_startup_delay_range.max are deprecated
```