redpanda: cluster will not form without a node with an empty seed server list #333
Probably related/prerequisite: #245
I remember @mmaslankaprv mentioning we had a restriction to form the Raft groups from one node, i.e. to bootstrap from one node. We needed to differentiate node joining vs. node bootstrap. I think we have more metadata tracking now, so we can make that distinction: basically, bootstrap if the node is not in the set.
Currently we operate with the following assumptions:
How does one restart the cluster root? Should it have seeds? What if it lost its data dir? Would it make sense to have a two-phase initialisation, where a tool, perhaps rpk, triggers cluster formation?
Restarting a node is not a problem. The problem is when it loses the data directory; then we have to change the configuration to point it at different nodes to join the cluster. I am wondering how this is solved in CockroachDB. They take a similar approach with seed servers.
I think CockroachDB does two-phase. The concern is that during a network partition, bootstrapping must happen within the majority partition. An unusual situation for sure, but it comes up if the cluster root loses its data and is restarted. The context here is Kubernetes. It's not so easy with a StatefulSet to have a node behave differently depending on whether it is the cluster root and whether it has lost its data. It could be punted to an operator, but it might make sense to have Redpanda perform the magic.
Yes, CockroachDB uses a two-phase init. Each server is brought up with the same "join" list. Once those nodes are up, if a cluster hasn't been formed in the past, they go into standby until a node gets an "init" command. https://www.cockroachlabs.com/docs/v20.2/cockroach-init.html
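The two-phase flow described above can be sketched as a toy model (all names here are hypothetical and illustrative, not actual Redpanda or CockroachDB code): every node starts with the same join list and sits in standby until an explicit init command designates the bootstrapper.

```python
# Toy model of two-phase cluster init (hypothetical names, sketch only).
class Node:
    def __init__(self, name, join_list):
        self.name = name
        self.join_list = join_list  # identical on every node
        self.state = "standby"      # no node auto-bootstraps

    def init(self):
        # Explicit operator action, analogous to `cockroach init`,
        # turns exactly one standby node into the bootstrapper.
        self.state = "bootstrapping"

names = ["n0", "n1", "n2"]
nodes = [Node(n, names) for n in names]

# Phase 1: all nodes are up, but no cluster has formed.
assert all(n.state == "standby" for n in nodes)

# Phase 2: the operator picks exactly one node to init.
nodes[0].init()
assert [n.state for n in nodes] == ["bootstrapping", "standby", "standby"]
```

The key property is that the identical configuration on every node never causes a split-brain bootstrap; cluster formation requires a deliberate out-of-band action.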
I think we can make the operation a two-step one without implementing centralized configuration. We can introduce centralized configuration as a follow-up.
I think the two-phase init makes sense -- that should probably be hidden behind an rpk setup command that runs it for the user after the daemons start. We should also retain the current behaviour that writing a config with `seed_servers=[]` causes a node to auto-init, so that single-node cluster init is still a trivial case of just running a binary.
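The retained behaviour described here amounts to a one-line decision rule; a minimal sketch (the function name is hypothetical, not the actual Redpanda code):

```python
def should_auto_init(seed_servers):
    # Sketch of the proposed rule: a node with an explicitly empty
    # seed_servers list bootstraps a new cluster; any node with seeds
    # configured joins an existing cluster instead.
    return len(seed_servers) == 0

# Single-node trivial case: empty list, so just running the binary
# auto-inits a fresh cluster.
assert should_auto_init([]) is True

# A joining node with seeds configured must never create a new cluster.
assert should_auto_init(["node-0:33145"]) is False
```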
Related to #2793 -- once both are done, a cluster could realistically use the same redpanda.yml on all nodes. |
Leaving this here as it may affect the solution to this issue. It turns out this is the case, except when the cluster has TLS and mutual authentication on the Kafka API endpoint.
So the seed server list needs to be completely empty for the initial cluster to be created.
…nitial raft group

This tries to solve the problem with empty seed_servers on node 0. With this change, all fresh clusters will be initially set to 1 replica (via `status.currentReplicas`) until a cluster is created and the operator can verify it via the admin API. Then the cluster is scaled to the number of instances desired by the user. After the cluster is initialized, and for the entire lifetime of the cluster, the `seed_servers` property will be populated with the full list of available servers, in every node of the cluster.

This overcomes redpanda-data#333. Previously, node 0 was always forced to have an empty seed_servers property, but this caused problems when it lost the data dir, as it tried to create a brand-new cluster. With this change, even if node 0 loses the data dir, the seed_servers property will always point to other nodes, so it will try to join the existing cluster.
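The operator change described in the commit message above reduces to a small reconcile rule; a sketch with a hypothetical function name (the real operator tracks this via `status.currentReplicas` and verifies initialization through the admin API):

```python
def current_replicas(desired, cluster_initialized):
    # Fresh cluster: pin to a single replica so exactly one node
    # bootstraps. Once the operator has verified via the admin API
    # that the cluster exists, scale out to the user's desired count.
    return desired if cluster_initialized else 1

# Before the cluster exists, only one replica runs regardless of spec.
assert current_replicas(3, cluster_initialized=False) == 1

# After initialization is verified, scale to the requested size.
assert current_replicas(3, cluster_initialized=True) == 3
```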
@jcsp am I reading the thread above correctly that we need the code to exist in redpanda first, and once redpanda handles an empty seed server list, we can change rpk to emit no seed servers? I'll move this to our own "awaiting other team" queue. cc @piyushredpanda
There will be a bit more to it than that. We haven't nailed this down yet, but probably:
Auto-selection of node_id is a separate but complementary thing: it enables orchestrators to avoid picking node IDs for redpanda nodes; just leave it out of the config file and redpanda will make one up.
Initializing a single-node cluster should not set seeds, in order to trigger auto-init per redpanda-data/redpanda#333 (comment). Also remove the apparently invalid `empty_seed_starts_cluster` flag, per:

```
INFO 2022-11-09 02:19:46,745 [shard 0] redpanda::main - application.cc:255 - Failure during startup: std::invalid_argument (Unknown property empty_seed_starts_cluster)
```
When setting up a cluster, you want to make sure all nodes have the same seed servers. This includes the initial node, since if it were to come back with an empty data directory, you would want it to be able to join the cluster automatically without user intervention. This does not work today. If you set up a three-node cluster with each node having all three nodes in its seed list, it will never form a cluster. You see the following from the node:
It seems no node knows who should be the bootstrap server, and thus the cluster never forms.
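The failure described above follows directly from the old rule (bootstrap only on an empty seed list): when every node carries the full seed list, no node ever elects to bootstrap. A toy illustration:

```python
# Every node is configured with the same, non-empty seed list.
seeds = {n: ["n0", "n1", "n2"] for n in ["n0", "n1", "n2"]}

# Old rule (sketch): only a node with an empty seed list bootstraps
# a brand-new cluster; all others wait to join an existing one.
would_bootstrap = [n for n, s in seeds.items() if not s]

# No node qualifies, so every node waits forever and the cluster
# never forms.
assert would_bootstrap == []
```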