Catchup & RPC getHealth Results Do Not Match #16957

Closed · brianlong opened this issue Apr 29, 2021 · 13 comments · Fixed by #33651

Comments

brianlong (Contributor) commented Apr 29, 2021

Problem

Administrators of RPC pools depend on the getHealth RPC call to determine the health of a node. We dynamically move unhealthy nodes out of the pool when they fall behind. On v1.6.6, we see disagreements between the CLI catchup command and the RPC getHealth result.

$ solana catchup ~/keypair.json http://127.0.0.1:8899/ &&
    curl http://localhost:8899/ -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getHealth"}' &&
    curl http://localhost:8899/ -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","id":1, "method":"getSlot"}'

4MCpdngneZhRWB1EzX1hcmnb3GPEAQSY6saqS4H1TmSa has caught up (us:76048793 them:76048769)
{"jsonrpc":"2.0","error":{"code":-32005,"message":"Node is behind by 213 slots","data":{"numSlotsBehind":213}},"id":1}
{"jsonrpc":"2.0","result":76048761,"id":1}

In the example above, it looks like we are both caught up and 213 slots behind at the same time. The bigger problem is that the 'unhealthy' result from getHealth might keep healthy RPC nodes out of the pool.
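
For reference, here is a rough sketch of the kind of cross-check a pool operator could run by hand: compare the node's getHealth verdict against its actual slot distance from a reference RPC endpoint. The reference endpoint, the jq parsing, and the output format are illustrative assumptions, not part of our tooling.

#!/usr/bin/env bash
# Illustrative cross-check: print the node's getHealth verdict next to its
# slot distance from a reference RPC endpoint.
NODE=${NODE:-http://127.0.0.1:8899}
REFERENCE=${REFERENCE:-https://api.mainnet-beta.solana.com}

health=$(curl -s "$NODE" -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}' | jq -r '.result // .error.message')
local_slot=$(curl -s "$NODE" -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r '.result')
ref_slot=$(curl -s "$REFERENCE" -X POST -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r '.result')

echo "getHealth:                  $health"
echo "slots behind the reference: $((ref_slot - local_slot))"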

Proposed Solution

Please investigate the source of the problem. The getHealth endpoint is a critical RPC admin tool.

mvines (Member) commented Apr 30, 2021

The current getHealth result is very coarse: it compares the latest slot in the local validator against the highest snapshot slot seen over gossip, which is typically ~100-200 slots behind.

I've been thinking that a better health check is to compare the highest slot that the validator has received data for and the current confirmed slot of the validator:

$ curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]
' | jq
[
  {
    "jsonrpc": "2.0",
    "result": 76095287,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76095285,
    "id": 1
  }
]

These should be very close, within single digits of each other normally. Would you be able to prototype a health check based on these two values? If that looks better overall in the real world, I'm game to reimplement getHealth with this logic instead of what it does currently.
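
A minimal prototype of that check could look like the following sketch; the threshold and the id-based matching of the batch responses are illustrative assumptions rather than a proposed final design.

#!/usr/bin/env bash
# Prototype health check based on the two values above: flag the node
# unhealthy if getMaxRetransmitSlot runs ahead of the confirmed slot by more
# than an (arbitrary, placeholder) threshold.
RPC_URL=${RPC_URL:-http://localhost:8899}
THRESHOLD=${THRESHOLD:-10}

resp=$(curl -s "$RPC_URL" -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1,"method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":2,"method":"getSlot","params":[{"commitment":"confirmed"}]}]')

max_retransmit=$(echo "$resp" | jq '.[] | select(.id == 1) | .result')
confirmed=$(echo "$resp" | jq '.[] | select(.id == 2) | .result')
gap=$((max_retransmit - confirmed))

if [ "$gap" -le "$THRESHOLD" ]; then
  echo "healthy (gap: $gap slots)"
else
  echo "unhealthy (gap: $gap slots)"
  exit 1
fi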

mvines added this to Needs triage in JSON RPC API Improvements via automation Apr 30, 2021
mvines added this to the v1.6.6 milestone Apr 30, 2021
linuskendall commented May 1, 2021

I added it to our health checks, but I am not sure how to interpret the results.

From this it would look like our nodes are terribly behind: about a 20k-slot difference from getMaxRetransmitSlot.

id  rpcNode                          minSlot   curSlot   maxRetransmitSlot  slotsStored  prevEpochBlocks  curEpochBlocks
0   http://fra2.rpcpool.com:8899     75154550  76335047  76315666           1161116      0                0
1   http://api.stakeconomy.com       76161940  76335047  76315850           153910       0                0
2   http://fsn1.rpcpool.com:8899     76161940  76335047  76315850           153910       0                0
3   http://ams1.rpcpool.com:8899     75743484  76335047  76315829           572345       0                0
4   http://fra3.rpcpool.com:8899     75153906  76335047  76315693           1161787      0                0
5   http://nyc1.rpcpool.com:8899     74909394  76335047  76315664           1406270      0                0
6   http://mia1.rpcpool.com:8899     75743977  76335047  76315850           571873       0                0
7   http://nyc1.rpcpool.com:8899     74909394  76335047  76315664           1406270      0                0
8   http://mia2.rpcpool.com:8899     75743485  76335047  76315834           572349       0                0
9   http://bl-lax1.rpcpool.com:8899  75743485  76335047  76315850           572365       0                0
10  http://lax2.rpcpool.com:8899     75743487  76335047  76315772           572285       0                0
11  http://sfo1.rpcpool.com:8899     75320189  76335047  76315850           995661       0                0
12  http://hk1.rpcpool.com:8899      75743346  76335047  76315784           572438       0                0

The same from the command line:

linuskendall@workstation:~/work/solana-rpc-node$ curl -s http://mia2.rpcpool.com:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]'
[{"jsonrpc":"2.0","result":76335047,"id":1},{"jsonrpc":"2.0","result":76316891,"id":1}]

mvines (Member) commented May 2, 2021

Hmm, I see the same here now too:

$ curl -s http://localhost:8899 -X POST -H "Content-Type: application/json" -d '
  [{"jsonrpc":"2.0","id":1, "method":"getMaxRetransmitSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getMaxShredInsertSlot"},
   {"jsonrpc":"2.0","id":1, "method":"getSlot", "params":[{"commitment": "confirmed"}]}]
' | jq
[
  {
    "jsonrpc": "2.0",
    "result": 76568995,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76568995,
    "id": 1
  },
  {
    "jsonrpc": "2.0",
    "result": 76530218,
    "id": 1
  }
]

The leader for slot 76568995 is Dfc73Czi5K3xa6yKMq98LHJx69PDsSzUvSdCSSqr8NDQ, and that node is currently delinquent. I think what happened is that Dfc73C... forked off and is now running ahead of the cluster, transmitting shreds for the fork it's on whenever it encounters a slot it is the leader for.

This generally means that my proposal at #16957 (comment) is 👎🏼

mvines self-assigned this May 2, 2021
mvines modified the milestones: v1.6.7, v1.6.8 May 4, 2021
mvines (Member) commented May 4, 2021

Note to self: perhaps something vote related for a health check instead. A node should see vote transactions being sent over gossip, and then shortly after see most of those vote transactions land in blocks. If the vote transactions in gossip aren't observed to be landing in blocks promptly then the node knows it's not at the tip of the cluster yet.

t-nelson (Contributor) commented May 5, 2021

Note to self: perhaps something vote related for a health check instead. A node should see vote transactions being sent over gossip, and then shortly after see most of those vote transactions land in blocks. If the vote transactions in gossip aren't observed to be landing in blocks promptly then the node knows it's not at the tip of the cluster yet.

Could maybe generalize the restart leader slot skipping logic Stephen added in #15607

machacekch (Contributor) commented:

Just a quick note that I ran into this issue while setting up a v1.6.7 non-voting validator on mainnet. I was getting RPC errors like "Node is behind by 375 slots", while getSlot showed my node to be at the same slot as api.mainnet-beta.solana.com.
(I'm using the trusted validators from https://docs.solana.com/clusters.)

As a workaround, I added "--health-check-slot-distance 1000", though I suppose that defeats the purpose of the health check.
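
For anyone else applying the same workaround, the flag goes on the solana-validator command line; this is a trimmed, hypothetical excerpt (placeholder keypair path and known-validator identity, most other flags omitted):

exec solana-validator \
  --identity ~/validator-keypair.json \
  --no-voting \
  --trusted-validator <KNOWN_VALIDATOR_PUBKEY> \
  --rpc-port 8899 \
  --health-check-slot-distance 1000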

CriesofCarrots modified the milestones: v1.6.9, v1.6.10 May 15, 2021
CriesofCarrots modified the milestones: v1.6.10, v1.6.11, v1.6.12 May 26, 2021
mvines removed their assignment Jun 7, 2021
mvines modified the milestones: v1.6.12, v1.8.0 Jun 7, 2021
ryoqun (Member) commented Oct 25, 2021

It seems this naughty error message (Node is behind by {100-1000} slots) can also be returned by the sendTransaction RPC method, which is a fairly important method. (I got a support inquiry about this.)

ruuda (Contributor) commented Mar 22, 2023

We are also running into this: with --health-check-slot-distance 100, nodes regularly report that they are behind, even though the metrics we added in #27665 show that they are perfectly in sync.

The current getHealth result is very coarse: it compares the latest slot in the local validator against the highest snapshot slot seen over gossip, which is typically ~100-200 slots behind.

I wonder about this for the gossip network in general, since Solana relies on the latest snapshot in a few more places: what happens when a malicious node starts advertising a snapshot slot that is too high? Will all valid nodes report themselves as behind?

mvines (Member) commented Mar 22, 2023

but what happens when a malicious node starts advertising a snapshot slot that is too high

The --known-validator flag prevents this: when it is specified, only the snapshot hashes from the whitelisted identities are considered.

brooksprumo (Contributor) commented Jun 6, 2023

Here's my "attempt" to "fix" getHealth: #31989

The implementation of this PR is tongue-in-cheek, but it does feel almost as arbitrary as the current version.

getHealth feels pretty optimistic, and mostly trusts the known validators already; can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient? To really know a validator's health, the operator has to do more monitoring anyway, right?
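
As a sketch of that idea, assuming the node we compare against exposes a JSON-RPC endpoint (REFERENCE_URL below is a placeholder), the rooted-slot comparison could be as simple as:

# Compare our finalized ("rooted") slot against a reference node's.
LOCAL_URL=http://localhost:8899
REFERENCE_URL=<KNOWN_VALIDATOR_RPC_URL>   # placeholder

payload='{"jsonrpc":"2.0","id":1,"method":"getSlot","params":[{"commitment":"finalized"}]}'
local_root=$(curl -s "$LOCAL_URL" -X POST -H "Content-Type: application/json" -d "$payload" | jq '.result')
ref_root=$(curl -s "$REFERENCE_URL" -X POST -H "Content-Type: application/json" -d "$payload" | jq '.result')

echo "root distance: $((ref_root - local_root))"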

mvines (Member) commented Jun 6, 2023

can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient

Perhaps the latest gossip vote from known validators could be used instead. That should give much better resolution than using the account hashes, and it is information already traveling over gossip (for now).

steviez (Contributor) commented Jun 6, 2023

can we query our latest rooted slot and the known validator's latest rooted slot and call it sufficient?

So I think the current implementation is trying to approximate this. As for directly using the latest roots, how those latest roots are queried matters.

  • If we grab the latest roots from our latest Bank, the way getVoteAccounts does, we're completely dependent on our local view of the network. For example, while catching up after a restart, our node would be recognizing roots as it proceeds through blocks. Those roots would likely be only a few slots off of our known validators', even if we are 10k+ slots behind the tip of the cluster.
  • I think I mentioned it in my other issue, and mvines just reiterated it: looking at gossip votes seems to be a more robust option with much better resolution. As Brooks and I discussed in DMs, relying on any item that is published at a set interval (snapshot / bank / etc. hashes) is prone to these sampling glitches and is non-ideal.

To really know a validator's health, the operator has to do more monitoring anyway, right?

Arguably yes, and if something goes wrong, the operator probably needs to dig in deeper. But I think an ideal getHealth provides an indication of "is my node keeping up with the cluster".
