validator: default health check slot distance #33553

Closed
diman-io opened this issue Oct 5, 2023 · 2 comments · Fixed by #33568
Labels
community Community contribution

Comments

@diman-io
Contributor

diman-io commented Oct 5, 2023

Problem

Here's another concern I have about v1.17.

The accounts hash interval is used to determine the health of a node, and that health status is in turn used by solana-validator wait-for-restart-window, a command operators rely on heavily during updates.

A node is considered unhealthy when the gap between the highest accounts hash slot reported by known validators and the one this node has sent in gossip is greater than the health check slot distance, which defaults to just 150 slots.
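
To make the rule concrete, here is a minimal sketch of the comparison as described above. It is not the validator's actual code: the function and parameter names are hypothetical, and only the default of 150 comes from this issue.

```python
# Hypothetical sketch of the health rule described in this issue.
# Names are illustrative and do not mirror the validator's source.
DEFAULT_HEALTH_CHECK_SLOT_DISTANCE = 150  # default cited in this issue


def is_healthy(
    known_validators_latest_slot: int,
    own_latest_slot: int,
    health_check_slot_distance: int = DEFAULT_HEALTH_CHECK_SLOT_DISTANCE,
) -> bool:
    """Return False when this node lags known validators by more than the allowed distance."""
    lag = known_validators_latest_slot - own_latest_slot
    return lag <= health_check_slot_distance


if __name__ == "__main__":
    print(is_healthy(1_000, 900))  # lag of 100 slots
    print(is_healthy(1_000, 800))  # lag of 200 slots
```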

Since the accounts hash interval is now essentially equal to the incremental snapshot interval, it may be worth defaulting the health check slot distance, when it is not specified, to accounts hash interval + 50. The magic 50 comes from the current defaults: health check slot distance = 150 and incremental snapshot archive slots = 100.
Alternatively, it might be worth simply removing the health check from wait-for-restart-window.
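
The first option could be sketched roughly as follows. This is only an illustration of the proposed arithmetic, not a patch: the function name and its parameters are hypothetical, and the two defaults are the ones quoted in this issue.

```python
from typing import Optional

# Defaults quoted in this issue.
DEFAULT_INCREMENTAL_SNAPSHOT_ARCHIVE_SLOTS = 100
DEFAULT_HEALTH_CHECK_SLOT_DISTANCE = 150

# The "magic 50": the headroom between the two current defaults.
HEADROOM_SLOTS = (
    DEFAULT_HEALTH_CHECK_SLOT_DISTANCE - DEFAULT_INCREMENTAL_SNAPSHOT_ARCHIVE_SLOTS
)


def effective_health_check_slot_distance(
    accounts_hash_interval_slots: int,
    configured_distance: Optional[int] = None,
) -> int:
    """Hypothetical helper: honour an explicit setting, else derive the default."""
    if configured_distance is not None:
        return configured_distance
    return accounts_hash_interval_slots + HEADROOM_SLOTS


if __name__ == "__main__":
    # With the default interval the result matches today's default distance.
    print(effective_health_check_slot_distance(100))
    # A larger interval widens the allowed distance automatically.
    print(effective_health_check_slot_distance(500))
```

With the current default interval this keeps today's behaviour, and an explicitly configured distance always wins.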

Otherwise, the node will almost always be unhealthy, and wait-for-restart-window will not complete. Of course, I understand that an operator who overrides the incremental snapshot interval can also override the health check slot distance. However, that second override will likely be needed in almost 100% of cases, so it seems more convenient to make it the default behaviour and reduce the number of misconfigurations.
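
A toy model shows how quickly this bites. It assumes, purely for illustration, that every node publishes an accounts hash exactly on multiples of its interval, with no gossip or computation delay. The function name and the example intervals are hypothetical.

```python
def healthy_fraction(
    own_interval: int,
    known_interval: int,
    distance: int,
    horizon: int,
) -> float:
    """Fraction of slots in [0, horizon) at which the node would pass the check.

    Toy assumption: each side's latest accounts hash slot is the most recent
    multiple of its interval, with no propagation or computation delay.
    """
    healthy = 0
    for slot in range(horizon):
        known_latest = (slot // known_interval) * known_interval
        own_latest = (slot // own_interval) * own_interval
        if known_latest - own_latest <= distance:
            healthy += 1
    return healthy / horizon


if __name__ == "__main__":
    # Operator raises the interval to 5000 but leaves the distance at 150.
    print(healthy_fraction(5_000, 100, 150, 5_000))
    # Same interval with the proposed default of interval + 50.
    print(healthy_fraction(5_000, 100, 5_050, 5_000))
```

Under these assumptions, the larger the overridden interval, the smaller the share of slots at which the node looks healthy unless the distance is raised with it.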

@diman-io diman-io added the community Community contribution label Oct 5, 2023
@diman-io
Contributor Author

diman-io commented Oct 6, 2023

Since the accounts hash interval is now essentially equal to the incremental snapshot interval, it may be worth defaulting the health check slot distance, when it is not specified, to accounts hash interval + 50. The magic 50 comes from the current defaults: health check slot distance = 150 and incremental snapshot archive slots = 100.

It might be worth simply removing the health check from the wait-for-restart-window.

@steviez
Contributor

steviez commented Oct 6, 2023

Issues with getHealth are known: #16957

3 participants