
--exit-when-healthy does not wait for the node to join the cluster first #15818

Closed
layus opened this issue Dec 20, 2023 · 7 comments · Fixed by #16106
Labels: area/rpk, kind/bug (Something isn't working)

Comments

@layus
Contributor

layus commented Dec 20, 2023

Version & Environment

  • Redpanda version: (use rpk version): 23.2.17
  • Operating System (e.g. from /etc/os-release): NixOS

What went wrong?

We had to insert a sleep instruction after restarting redpanda to let it join (or at least start joining) its cluster. Otherwise, rpk cluster health --exit-when-healthy would happily return immediately, and the subsequent rpk cluster maintenance disable would run with no effect.

systemctl restart redpanda.service
sleep 2
rpk cluster health --exit-when-healthy
rpk cluster maintenance disable $node_id
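A fixed sleep 2 is fragile, since the join time can vary. One defensive alternative (a sketch only, not part of rpk; the function name wait_until_ok and its 30-second default timeout are made up here) is to poll the health command with a bounded retry loop:

```shell
#!/usr/bin/env sh
# wait_until_ok CMD [TIMEOUT_SECONDS]
# Retries CMD once per second until it succeeds, giving up after
# TIMEOUT_SECONDS (default 30). Both names are hypothetical helpers,
# not rpk features.
wait_until_ok() {
    cmd=$1
    timeout=${2:-30}
    elapsed=0
    until $cmd; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for: $cmd" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

Used in the script above it would replace both the sleep and the bare health check, e.g. `wait_until_ok "rpk cluster health --exit-when-healthy"`.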

What should have happened instead?

I would have expected rpk cluster health --exit-when-healthy to wait for the node to join the cluster before testing whether the cluster is healthy.

How to reproduce the issue?

See above snippet, without sleep.

Additional information

https://github.com/fornybar/redpanda.nix and #nix channel on slack

@r-vasquez
Contributor

r-vasquez commented Jan 15, 2024

Hello @layus, thanks for reporting this issue.

Currently, --exit-when-healthy only works properly if you also provide the --watch flag. However, this is not very intuitive, so I created #16106 to improve the UX here.

In the meantime, adding a --watch to the script above should fix the issue.
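Applied to the snippet from the issue report, the suggested workaround would look like this (same commands as the original, with only --watch added; it still needs a live cluster to run against):

```shell
systemctl restart redpanda.service
rpk cluster health --exit-when-healthy --watch
rpk cluster maintenance disable $node_id
```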

@layus
Contributor Author

layus commented Jan 15, 2024

Thanks, will give it a try with --watch in the meantime :-)

layus added a commit to fornybar/redpanda.nix that referenced this issue Jan 16, 2024
As per redpanda-data/redpanda#15818 it also
requires --watch to function properly.
@dotnwat
Member

dotnwat commented Jan 16, 2024

Thanks @r-vasquez. @layus let us know if you hit any other issues or --watch doesn't work.

@layus
Contributor Author

layus commented Feb 2, 2024

@dotnwat As you can see in fornybar/redpanda.nix#32, adding --watch and removing the hacky sleep 2 fails the test. I think the sleep 2 is still needed, and my diagnosis may still be valid: a node can report as healthy before it has even attempted to rejoin its cluster.

@layus
Contributor Author

layus commented Feb 5, 2024

@r-vasquez friendly ping regarding above message ^

@r-vasquez
Contributor

Hello @layus, is it possible to get the rpk logs as well?

I was looking into the logs, but the failure doesn't seem to be the typical rpk error ("error disabling maintenance mode..."). Instead, I found a couple of:

(server: waiting for success: rpk cluster health --exit-when-healthy --watch)
(finished: waiting for success: rpk cluster health --exit-when-healthy --watch, in 2.15 seconds)

Which may indicate that rpk is indeed waiting. But I'm not certain which specific test failure I should look into. Maybe it is:

client # Unable to request metadata from "authserver:9092": KafkaConnectionError: Connection at authserver:9092 closed
client # Unable to request metadata from "authserver:9092": KafkaConnectionError: Connection at authserver:9092 closed
authserver # [   21.620898] rpk[971]: WARN  2024-01-16 09:22:33,073 [shard 0] kafka - connection_context.cc:500 - Error processing request: kafka::kafka_authentication_exception (requests.cc:171 - Unexpected request during authentication: 3)
client # Unauthorized produce and consume: {} [KafkaConnectionError("Unable to bootstrap from [('authserver', 9092, <AddressFamily.AF_UNSPEC: 0>)]"), KafkaConnectionError("Unable to bootstrap from [('authserver', 9092, <AddressFamily.AF_UNSPEC: 0>)]")]
client # Unclosed AIOKafkaConsumer
client # consumer: <aiokafka.consumer.consumer.AIOKafkaConsumer object at 0x7f8fc16cb0a0>
authserver # [   21.630503] rpk[971]: WARN  2024-01-16 09:22:33,073 [shard 0] kafka - connection_context.cc:500 - Error processing request: kafka::kafka_authentication_exception (requests.cc:171 - Unexpected request during authentication: 3)
client # Unclosed AIOKafkaProducer
client # producer: <aiokafka.producer.producer.AIOKafkaProducer object at 0x7f8fc16cb1c0>

I also tried reproducing the rpk wait time with a simple script using rpk container:

#!/bin/bash

rpk container purge

rpk container start -n 3

rpk cluster maintenance enable 0 --wait

docker stop rp-node-0
sleep 5 # wait until the node down is reported.
docker start rp-node-0

rpk cluster health --exit-when-healthy --watch

rpk cluster maintenance disable 0

But it works: it waits for the cluster to be healthy before finishing:

Cluster started!
NODE-ID  STATUS   KAFKA-ADDRESS    ADMIN-ADDRESS    PROXY-ADDRESS    SCHEMA-REGISTRY-ADDRESS
0        running  127.0.0.1:37057  127.0.0.1:37789  127.0.0.1:35329  127.0.0.1:33167
1        running  127.0.0.1:41577  127.0.0.1:35451  127.0.0.1:35487  127.0.0.1:46403
2        running  127.0.0.1:39305  127.0.0.1:33295  127.0.0.1:37489  127.0.0.1:43701

Created "rpk-container" profile.

You can use rpk to interact with this cluster. E.g:

    rpk cluster info
    rpk cluster health

Successfully enabled maintenance mode for node 0. Waiting for node to drain...
NODE-ID  ENABLED  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
0        true     true      false   1           0         1             0

All partitions on node 0 have drained.
rp-node-0
rp-node-0
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          false
Unhealthy reasons:                [nodes_down]
Controller ID:                    1
All nodes:                        [0 1 2]
Nodes down:                       [0]
Leaderless partitions (0):        []
Under-replicated partitions (0):  []
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    1
All nodes:                        [0 1 2]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []

Let me know if I'm missing something from the logs or if I am looking at the wrong error here 😅

@layus
Contributor Author

layus commented Feb 8, 2024

@r-vasquez Thank you for your detailed review. The error is related to a VM crashing, probably due to lack of RAM. Redpanda seems okay now with the --watch flag 😀 .
