sensuctl cannot get cluster health when an etcd node is down and quorum is lost #3398

Closed
gtarnaras opened this issue Nov 18, 2019 · 0 comments · Fixed by #3402

gtarnaras commented Nov 18, 2019

Expected Behavior

Running sensuctl cluster health against a healthy node of the cluster should return the cluster status regardless of whether one or more nodes are down.

Current Behavior

I have a 3-node cluster that I am currently testing. When I terminate a node and then run sensuctl cluster health, I get:
Error: GET "/health": Get https://x.x.x.x:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Or, using the API:
++ curl -k https://x.x.x.x:8080/health
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0++ jq -r '.ClusterHealth[] | select(.Healthy==false) | .MemberID'
0 0 0 0 0 0 0 0 --:--:-- 0:24:57 --:--:-- 0

The request has been running for 25 minutes without returning.
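For comparison, here is a minimal Python sketch of the same check with a bounded wait, so a hung /health request fails fast instead of blocking. It assumes the response shape implied by the jq filter above (a ClusterHealth array whose entries carry Healthy and MemberID). The function names, the 10-second default, and the sample payload are hypothetical, not part of Sensu.

```python
import json
import ssl
import urllib.request


def unhealthy_member_ids(health: dict) -> list:
    """Return the MemberID of every member reporting Healthy == false."""
    members = health.get("ClusterHealth") or []
    return [m.get("MemberID") for m in members if m.get("Healthy") is False]


def fetch_health(base_url: str, timeout_s: float = 10.0) -> dict:
    """GET <base_url>/health and parse the JSON body.

    timeout_s bounds each blocking socket operation, so a backend that
    never answers raises an error instead of hanging indefinitely.
    """
    # Mirrors `curl -k`: certificate verification is disabled.
    # Acceptable for a test cluster only.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = base_url + "/health"
    with urllib.request.urlopen(url, timeout=timeout_s, context=ctx) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical payload: the IDs are made up, and only the field names
    # used by the jq filter in this report are assumed.
    sample = {
        "ClusterHealth": [
            {"MemberID": 1, "Healthy": True},
            {"MemberID": 2, "Healthy": False},
            {"MemberID": 3, "Healthy": True},
        ]
    }
    print(unhealthy_member_ids(sample))
```

Against a live backend, the call would be something like unhealthy_member_ids(fetch_health("https://x.x.x.x:8080")). In the failure described here it would be expected to raise a timeout error rather than return a member list.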

Since I cannot detect the unhealthy cluster, I cannot add new nodes because of:

sensuctl cluster member-add <new_node> https://x.x.x.x:2380
Error: couldn't add cluster member: etcdserver: unhealthy cluster

Nor can I detect which node is unhealthy in order to delete it. That is, if a node is in a failed state, I cannot restore the cluster.

Steps to Reproduce (for bugs)

  1. Create a 3-node cluster using embedded etcd
  2. Run sensuctl cluster health
  3. Terminate a node
  4. Run sensuctl cluster health

Context

I am trying to build a robust Sensu cluster on AWS using autoscaling groups, and I am currently checking how sensuctl reacts to unexpected failures. I am trying to follow the "remove-first" practice described here: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine . I am able to run all other sensuctl commands, authenticate, etc., but I cannot get the health status.
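To illustrate why the "remove-first" order matters, here is a small arithmetic sketch. It assumes the standard etcd majority rule (quorum = floor(n/2) + 1); the helper names are hypothetical, and the member counts model the 3-node scenario in this report with one node terminated.

```python
def quorum(members: int) -> int:
    """Majority size for a cluster with the given number of members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    # 3 registered members, 1 terminated: 2 healthy, quorum is 2.
    print(has_quorum(3, 2))  # True

    # Add-first: the new member is registered before it is up, so there
    # are 4 members, quorum rises to 3, and only 2 are healthy.
    print(has_quorum(4, 2))  # False

    # Remove-first: drop the failed member (2 members, quorum 2), then
    # add the replacement (3 members, quorum 2). Quorum holds throughout.
    print(has_quorum(2, 2))  # True
    print(has_quorum(3, 2))  # True
```

Removing the failed member first never raises the quorum size above the number of healthy members, but it requires knowing which member has failed. That is the information the health check is supposed to provide.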

Your Environment

The OS is RHEL 7.7.
I am hosting the packages on a custom yum repo. The following packages are installed:
sensu-go-agent.x86_64 5.14.2-7022
sensu-go-backend.x86_64 5.14.2-7022
sensu-go-cli.x86_64 5.14.2-7022

P.S. I am using the embedded etcd version.

gtarnaras changed the title from "sensuctl cannot get cluster health when a node an etcd node is down" to "sensuctl cannot get cluster health when an etcd node is down and quorum is lost" Nov 18, 2019
palourde added the bug label Nov 19, 2019
palourde self-assigned this Nov 19, 2019