30s down-moratorium before allowing suspension #14455

Merged

@hakonhall hakonhall commented Sep 18, 2020

If all services of a node are down, we used to allow suspension. If those services are monitored with /state/v1/health, we have the timestamp at which the service became unhealthy - the "since" timestamp. Now we will require the services to have been down for at least 30s before allowing suspension based on unhealthiness.
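
The rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `DownMoratorium`, `ServiceHealth`, and `allowSuspensionBecauseDown` are hypothetical names. It also assumes that a service without a "since" timestamp (not monitored via /state/v1/health) is not held back by the moratorium, and that a node with no services is not suspended on this basis; the description does not state either case.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the orchestrator code.
final class DownMoratorium {
    // The 30s threshold stated in the PR description.
    static final Duration MORATORIUM = Duration.ofSeconds(30);

    // Health of one service: up or down, plus the time it entered that state if known.
    static final class ServiceHealth {
        final boolean up;
        final Optional<Instant> since;

        ServiceHealth(boolean up, Optional<Instant> since) {
            this.up = up;
            this.since = since;
        }
    }

    // True only if every service is down and each one with a known "since"
    // timestamp has been down for at least MORATORIUM.
    static boolean allowSuspensionBecauseDown(List<ServiceHealth> services, Instant now) {
        if (services.isEmpty()) return false; // assumption: nothing to base the decision on
        for (ServiceHealth service : services) {
            if (service.up) return false;
            if (service.since.isPresent()) {
                Duration downFor = Duration.between(service.since.get(), now);
                if (downFor.compareTo(MORATORIUM) < 0) return false;
            }
            // Assumption: no "since" timestamp means the moratorium cannot be applied.
        }
        return true;
    }
}
```

With this sketch, a node whose only service went down 10s ago is not yet suspendable on unhealthiness alone, while one that has been down for 30s or more is.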

Also, for config servers only, we will log all healthiness transitions to track down some orchestrator issues.
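
One way to picture the transition logging is the sketch below. It is not taken from the PR: `HealthTransitionLogger` and `update` are hypothetical names, and it assumes that the first observation of a service does not count as a transition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; not the orchestrator's actual classes.
final class HealthTransitionLogger {
    private static final Logger logger = Logger.getLogger(HealthTransitionLogger.class.getName());

    // Last observed state per service id: true means up, false means down.
    private final Map<String, Boolean> lastKnownUp = new HashMap<>();

    // Records the new state. Returns true if this was an up/down change
    // on a config server, in which case the change is also logged.
    boolean update(String serviceId, boolean isConfigServer, boolean up) {
        Boolean previous = lastKnownUp.put(serviceId, up);
        boolean changed = previous != null && previous != up;
        if (changed && isConfigServer) {
            logger.info(serviceId + " changed from " + (previous ? "UP" : "DOWN")
                    + " to " + (up ? "UP" : "DOWN"));
            return true;
        }
        return false;
    }
}
```

Remembering the previous state per service keeps the log limited to actual changes, rather than one entry per health poll.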

@hmusum hmusum left a comment

One suggested change, otherwise LGTM

…el/ClusterId.java

Co-authored-by: Harald Musum <musum@verizonmedia.com>
@hakonhall
Member Author

I'm unable to reproduce the Travis failure (ControllerTest, testDevDeployment), and I no longer have the ability to trigger another run on Travis. I'll merge and make a revert just in case.

@hakonhall hakonhall merged commit 3a6bcde into master Sep 18, 2020
@hakonhall hakonhall deleted the hakonhall/30s-down-moratorium-before-allowing-suspension branch September 18, 2020 21:41
2 participants