
--exit-when-healthy does not wait for the node to join the cluster first #15818

Closed
layus opened this issue Dec 20, 2023 · 7 comments · Fixed by #16106
Labels: area/rpk, kind/bug (Something isn't working)

Comments

@layus
Contributor

layus commented Dec 20, 2023

Version & Environment

  • Redpanda version: (use rpk version): 23.2.17
  • Operating System (e.g. from /etc/os-release): NixOS

What went wrong?

We had to insert a sleep instruction after restarting redpanda to let it join (or at least start joining) its cluster. Otherwise, rpk cluster health --exit-when-healthy would happily return immediately, and the subsequent rpk cluster maintenance disable would run with no effect.

systemctl restart redpanda.service
sleep 2
rpk cluster health --exit-when-healthy
rpk cluster maintenance disable $node_id
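A fixed sleep 2 is fragile, since the join time can vary. One defensive alternative (a sketch only, not part of rpk; the function name wait_until_ok and its 30-second default timeout are made up here) is to poll the health command with a bounded retry loop:

```shell
#!/usr/bin/env sh
# wait_until_ok CMD [TIMEOUT_SECONDS]
# Retries CMD once per second until it succeeds, giving up after
# TIMEOUT_SECONDS (default 30). Both names are hypothetical helpers,
# not rpk features.
wait_until_ok() {
    cmd=$1
    timeout=${2:-30}
    elapsed=0
    until $cmd; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for: $cmd" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

Used in the script above it would replace both the sleep and the bare health check, e.g. `wait_until_ok "rpk cluster health --exit-when-healthy"`.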

What should have happened instead?

I would have expected rpk cluster health --exit-when-healthy to wait for the node to join the cluster before testing whether the cluster is healthy.

How to reproduce the issue?

See above snippet, without sleep.

Additional information

https://github.com/fornybar/redpanda.nix and #nix channel on slack

@r-vasquez
Contributor

r-vasquez commented Jan 15, 2024

Hello @layus, thanks for reporting this issue.

Currently, --exit-when-healthy only works properly if you also provide the --watch flag. However, this is not very intuitive, so I created #16106 to improve the UX here.

In the meantime, adding a --watch to the script above should fix the issue.
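Applied to the snippet from the issue report, the suggested workaround would look like this (same commands as the original, with only --watch added; it still needs a live cluster to run against):

```shell
systemctl restart redpanda.service
rpk cluster health --exit-when-healthy --watch
rpk cluster maintenance disable $node_id
```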

@layus
Contributor Author

layus commented Jan 15, 2024

Thanks, will give it a try with --watch in the meantime :-)

layus added a commit to fornybar/redpanda.nix that referenced this issue Jan 16, 2024
As per redpanda-data/redpanda#15818 it also
requires --watch to function properly.
@dotnwat
Member

dotnwat commented Jan 16, 2024

Thanks @r-vasquez. @layus let us know if you hit any other issues or --watch doesn't work.

@layus
Contributor Author

layus commented Feb 2, 2024

@dotnwat As you can see in fornybar/redpanda.nix#32, adding --watch and removing the hacky sleep 2 fails the test. I think the sleep 2 is still needed, and my diagnosis may still be valid: a node can report as healthy before it has even attempted to rejoin its cluster.

@layus
Contributor Author

layus commented Feb 5, 2024

@r-vasquez friendly ping regarding above message ^

@r-vasquez
Contributor

Hello @layus, is it possible to get the rpk logs as well?

I was looking into the logs, but the failure doesn't seem to be the typical rpk error ("error disabling maintenance mode..."). Instead, I found a couple of:

(server: waiting for success: rpk cluster health --exit-when-healthy --watch)
(finished: waiting for success: rpk cluster health --exit-when-healthy --watch, in 2.15 seconds)

Which may indicate that rpk is indeed waiting. But I'm not certain which specific test failure I should look into. Maybe it is:

client # Unable to request metadata from "authserver:9092": KafkaConnectionError: Connection at authserver:9092 closed
client # Unable to request metadata from "authserver:9092": KafkaConnectionError: Connection at authserver:9092 closed
authserver # [   21.620898] rpk[971]: WARN  2024-01-16 09:22:33,073 [shard 0] kafka - connection_context.cc:500 - Error processing request: kafka::kafka_authentication_exception (requests.cc:171 - Unexpected request during authentication: 3)
client # Unauthorized produce and consume: {} [KafkaConnectionError("Unable to bootstrap from [('authserver', 9092, <AddressFamily.AF_UNSPEC: 0>)]"), KafkaConnectionError("Unable to bootstrap from [('authserver', 9092, <AddressFamily.AF_UNSPEC: 0>)]")]
client # Unclosed AIOKafkaConsumer
client # consumer: <aiokafka.consumer.consumer.AIOKafkaConsumer object at 0x7f8fc16cb0a0>
authserver # [   21.630503] rpk[971]: WARN  2024-01-16 09:22:33,073 [shard 0] kafka - connection_context.cc:500 - Error processing request: kafka::kafka_authentication_exception (requests.cc:171 - Unexpected request during authentication: 3)
client # Unclosed AIOKafkaProducer
client # producer: <aiokafka.producer.producer.AIOKafkaProducer object at 0x7f8fc16cb1c0>

I also tried reproducing the rpk wait time with a simple script using rpk container:

#!/bin/bash

rpk container purge

rpk container start -n 3

rpk cluster maintenance enable 0 --wait

docker stop rp-node-0
sleep 5 # wait until the node down is reported.
docker start rp-node-0

rpk cluster health --exit-when-healthy --watch

rpk cluster maintenance disable 0

But it works: it waits for the cluster to be healthy before finishing:

Cluster started!
NODE-ID  STATUS   KAFKA-ADDRESS    ADMIN-ADDRESS    PROXY-ADDRESS    SCHEMA-REGISTRY-ADDRESS
0        running  127.0.0.1:37057  127.0.0.1:37789  127.0.0.1:35329  127.0.0.1:33167
1        running  127.0.0.1:41577  127.0.0.1:35451  127.0.0.1:35487  127.0.0.1:46403
2        running  127.0.0.1:39305  127.0.0.1:33295  127.0.0.1:37489  127.0.0.1:43701

Created "rpk-container" profile.

You can use rpk to interact with this cluster. E.g:

    rpk cluster info
    rpk cluster health

Successfully enabled maintenance mode for node 0. Waiting for node to drain...
NODE-ID  ENABLED  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
0        true     true      false   1           0         1             0

All partitions on node 0 have drained.
rp-node-0
rp-node-0
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          false
Unhealthy reasons:                [nodes_down]
Controller ID:                    1
All nodes:                        [0 1 2]
Nodes down:                       [0]
Leaderless partitions (0):        []
Under-replicated partitions (0):  []
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    1
All nodes:                        [0 1 2]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []

Let me know if I'm missing something from the logs or if I am looking at the wrong error here 😅

@layus
Contributor Author

layus commented Feb 8, 2024

@r-vasquez Thank you for your detailed review. The error is related to a VM crashing, probably due to lack of RAM. Redpanda seems okay now with the --watch flag 😀 .
