sensuctl cannot get cluster health when an etcd node is down and quorum is lost #3398

gtarnaras · 2019-11-18T15:52:44Z

Expected Behavior

Running sensuctl cluster health against a healthy node of the cluster should return the cluster status, regardless of whether one or more other nodes are down.

Current Behavior

I have a 3-node cluster that I am currently testing. When I terminate a node and then run sensuctl cluster health, I get:
Error: GET "/health": Get https://x.x.x.x:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Or, using the API:
++ curl -k https://x.x.x.x:8080/health
++ jq -r '.ClusterHealth[] | select(.Healthy==false) | .MemberID'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:24:57 --:--:--     0

The request has been running for 25 minutes without returning anything.

Since I cannot detect the unhealthy cluster, I cannot add new nodes:

sensuctl cluster member-add <new_node> https://x.x.x.x:2380
Error: couldn't add cluster member: etcdserver: unhealthy cluster

Nor can I detect which node is unhealthy in order to delete it. In other words, once a node has failed, I cannot restore the cluster.

Steps to Reproduce (for bugs)

Create a 3-node cluster using embedded etcd
Run sensuctl cluster health
Terminate a node
Run sensuctl cluster health

Context

I am trying to build a robust Sensu cluster on AWS using autoscaling groups, and I am currently checking how sensuctl behaves in case of unexpected failures. I am trying to follow the "remove-first" practice described at https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine . I am able to run all other sensuctl commands (authenticate, etc.), but I cannot get the health status.

Your Environment

The OS is RHEL 7.7.
I am hosting the packages on a custom yum repo. The following packages are installed:
sensu-go-agent.x86_64 5.14.2-7022
sensu-go-backend.x86_64 5.14.2-7022
sensu-go-cli.x86_64 5.14.2-7022

P.S. I am using the embedded etcd version.


gtarnaras changed the title ~~sensuctl cannot get cluster health when a node an etcd node is down~~ sensuctl cannot get cluster health when an etcd node is down and quorum is lost Nov 18, 2019

palourde added the bug label Nov 19, 2019

palourde self-assigned this Nov 19, 2019

palourde mentioned this issue Nov 19, 2019

Pass context with timeout to etcd health requests #3402

Merged

palourde closed this as completed in #3402 Nov 19, 2019
