
calling /eth/v1/validator/duties/proposer/ for old epochs is very slow #3770

Closed
wmitsuda opened this issue Dec 2, 2022 · 8 comments
Labels
question Further information is requested

Comments

@wmitsuda

wmitsuda commented Dec 2, 2022

LH 3.3, macOS

Current epoch is 164537.

Calling the /eth/v1/validator/duties/proposer/ REST API for recent epochs replies in a reasonable time:

http http://10.147.18.96:5052/eth/v1/validator/duties/proposer/164536 0.23s user 0.05s system 34% cpu 0.814 total
http http://10.147.18.96:5052/eth/v1/validator/duties/proposer/164535 0.23s user 0.05s system 16% cpu 1.646 total

However, calling it for some older epochs (older than 4 or 5 epochs, I guess, from my tests) is incredibly slow:

http http://10.147.18.96:5052/eth/v1/validator/duties/proposer/164533 0.22s user 0.05s system 0% cpu 3:03.33 total
http http://10.147.18.96:5052/eth/v1/validator/duties/proposer/164532 0.22s user 0.05s system 0% cpu 2:08.00 total

@michaelsproul
Member

What do you have the --slots-per-restore-point value set to? Any query that requires loading historic state is going to be slow with the default of 8192. More info here: https://lighthouse-book.sigmaprime.io/advanced_database.html

The best way to make these queries faster is to set --slots-per-restore-point=32, but that currently uses heaps of space (4TB+). We have an experimental branch that reduces the size of a SPRP=32 node to ~250GB, which we've started using for our own internal analytics, see: #3206, #3626. It makes extensive changes to the database schema, so we want to make sure we get it right before releasing it (we also need to re-write the schema migration for it, as that's broken currently). I've been working on this for more than a year, but I'm hoping to finally get it across the line soon. If you'd like to run the tree-states branch let me know and I can help you build it. The gist of it is that you need to clone this repo alongside Lighthouse.
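As a rough illustration of the explanation above (plain arithmetic, not Lighthouse internals; the constant names are mine), the cost of serving a historic state scales with how far the requested slot sits from the previous restore point:

```python
# Rough sketch of the replay cost described above, assuming full states are
# stored every --slots-per-restore-point slots and intermediate states are
# rebuilt by replaying blocks from the previous restore point.
SLOTS_PER_EPOCH = 32            # mainnet constant
SLOTS_PER_RESTORE_POINT = 8192  # Lighthouse default

def blocks_to_replay(slot: int, sprp: int = SLOTS_PER_RESTORE_POINT) -> int:
    """Upper bound on the number of blocks replayed to rebuild the state at `slot`."""
    return slot % sprp

# State at the start of epoch 164533 (one of the slow queries in this issue):
slot = 164533 * SLOTS_PER_EPOCH
print(blocks_to_replay(slot))       # 5792 blocks with the default SPRP=8192
print(blocks_to_replay(slot, 32))   # 0 here (at most 31) with SPRP=32
```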

@wmitsuda
Author

wmitsuda commented Dec 8, 2022

Hi, I'm using all default settings.

Thanks for the info. Just giving more context: my use case is getting and displaying the validators elected to propose in a certain epoch. For regular slots that's a no-brainer: this info is returned by querying the block.

However, that is not available for (a) missed slots and (b) scheduled slots in the current epoch. Those two use cases are covered by calling the /validator/duties/proposer API. (b) is fast since the data is hot; (a) is slow because of what you explained.

Is there any other API where I could get this info in a faster way? As a workaround measure for now.
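For what it's worth, here is a minimal sketch of the pattern described above, assuming the standard Beacon API endpoints (/eth/v1/beacon/headers/{slot} and /eth/v1/validator/duties/proposer/{epoch}); the node URL is a placeholder. Filled slots get their proposer from the block header, which is fast; only missed slots fall back to the slow duties call:

```python
# Sketch: resolve the proposer for every slot of an epoch, reading block headers
# where a block exists and falling back to the duties endpoint for missed slots.
import requests

NODE = "http://localhost:5052"  # placeholder beacon node URL
SLOTS_PER_EPOCH = 32

def proposers_for_epoch(epoch: int) -> dict[int, int]:
    proposers: dict[int, int] = {}
    missing = []
    for slot in range(epoch * SLOTS_PER_EPOCH, (epoch + 1) * SLOTS_PER_EPOCH):
        r = requests.get(f"{NODE}/eth/v1/beacon/headers/{slot}")
        if r.status_code == 200:
            msg = r.json()["data"]["header"]["message"]
            proposers[slot] = int(msg["proposer_index"])
        else:
            missing.append(slot)  # missed slot: no block to read the proposer from
    if missing:
        # Slow for old epochs, as discussed in this issue, but only hit when needed.
        duties = requests.get(f"{NODE}/eth/v1/validator/duties/proposer/{epoch}")
        for duty in duties.json()["data"]:
            if int(duty["slot"]) in missing:
                proposers[int(duty["slot"])] = int(duty["validator_index"])
    return proposers
```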

@michaelsproul
Member

Is there any other API where I could get this info in a faster way? As a workaround measure for now.

There's no other Lighthouse or standard API. You could use beaconcha.in's API, but I'm not sure if that would meet your criteria (as it's an external dependency).

@wmitsuda
Author

wmitsuda commented Dec 8, 2022

I'm doing something like beaconcha.in does, so yeah, I'm hoping to get everything from the CL node, not from another external provider.

But that's fine for now: I can put this info in the docs (around ~2-3 min on a standard LH node), and it is lazily loaded on the page. If users want it faster, they can trade off disk space or keep an eye on the future improvements.

@djbakerman

I might be having a similar issue. I'm running 3 validators and noticed an anomaly around epoch 170801.

2 of the 3 validators (presumably) queried eth1/attester/170801 and returned in microseconds:
attester/170801, status: 200 OK, elapsed: 93.987µs, module: http_api:164

the other validator, which missed an attestation due to an incorrect block head, returned in milliseconds:
/attester/170801, status: 200 OK, elapsed: 10.931236ms, module: http_api:164

the head of the working validator:
(epoch: 170801, delay_ms: 36, index: 17, head: 0xb39e80fa54124195ae110fbc34015f48388a81ba2805d7e62869c3129c278725, service: val_mon, module: beacon_chain::validator_monitor:772)

but the slower validator, the one that missed, had a different head:
epoch: 170801, delay_ms: 116, index: 36, head: 0x5bee4c0ed8be85262f22df1bd58c1e2deb650fc3342474156eb05b9a88b6165b, service: val_mon, module: beacon_chain::validator_monitor:887

@michaelsproul
Member

Hey @djbakerman, I think there's something else going on with your node. Delays on the order of 10ms close to the head have a different cause from the multi-second delays that @wmitsuda was getting for historic queries.

The main reason your validators attested to different blocks is that they were attesting at different slots. The validator that voted correctly on block 0xb39e80fa54124195ae110fbc34015f48388a81ba2805d7e62869c3129c278725 must have been attesting at slot 5465646. The validator that attested incorrectly voted on the block from slot 5465656, which probably means it was attesting at slot 5465657 and voted for the parent block. This will happen if it's slow to receive the block at slot 5465657, or slow to process it. You can grep your debug logs for Delayed head block.*5465657 to see if that block was delayed, or DM me your logs from around this time and I can take a look (I'm @sproul#3907 on Discord). The reason for the 10ms delay might be that the attestation routine was trying to obtain a lock that was held by block processing. These locks can be held and released on the order of 1-30ms.
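For reference, the slot numbers above can be sanity-checked with a bit of arithmetic (32 slots per epoch on mainnet):

```python
# Quick check of the slot arithmetic: epoch 170801 covers slots 5465632..5465663.
SLOTS_PER_EPOCH = 32
first_slot = 170801 * SLOTS_PER_EPOCH
print(first_slot, first_slot + SLOTS_PER_EPOCH - 1)   # 5465632 5465663
# The two validators attested 11 slots apart within the epoch:
print(5465646 - first_slot, 5465657 - first_slot)     # 14 25
```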

@djbakerman

djbakerman commented Dec 31, 2022

I didn't see a delayed head block for that slot; however, I do seem to get those errors sometimes, with delays in the seconds:

beacon.log.1:Dec 30 14:59:27.538 DEBG Delayed head block, set_as_head_delay: Some(45.496071ms), imported_delay: Some(2.520048868s), observed_delay: Some(1.973336235s), block_delay: 4.538881174s, slot: 5465695, proposer_index: 189875, block_root: 0xd710bea2603fd38adb6b1838677c55b69f6a23b0be502fa54686cce34b2b37ef, service: beacon, module: beacon_chain::canonical_head:1351

It's interesting that 1 of 3 validators (all on the same host) would experience a delay, because there is no resource contention on the host. Why would there be a delay? From a slow/non-responding peer?

Another interesting observation: while those "delayed head block" errors appear from time to time, occasionally the events occur close together (within a minute), while most are around 10 min apart. For example, these two are far apart:

Dec 31 08:00:28.617 DEBG Delayed head block, set_as_head_delay: Some(49.14801ms), imported_delay: Some(465.096983ms), observed_delay: Some(5.103477108s), block_delay: 5.617722101s, slot: 5470800, proposer_index: 213763, block_root: 0xb4394aacb9c5525a2672aa0450228041559d74a7599fbc97fc4e40c58d9d150a, service: beacon, module: beacon_chain::canonical_head:1351
Dec 31 08:20:42.475 DEBG Delayed head block, set_as_head_delay: Some(36.256421ms), imported_delay: Some(2.815461452s), observed_delay: Some(4.624245651s), block_delay: 7.475963524s, slot: 5470901, proposer_index: 466341, block_root: 0x0709580fbd0b60d9a2719e0f5a525471c6f460b23293729be8c95f9e4a0ce0ba, service: beacon, module: beacon_chain::canonical_head:1351

These two are close:
Dec 31 10:54:27.917 DEBG Delayed head block, set_as_head_delay: Some(35.483702ms), imported_delay: Some(269.247221ms), observed_delay: Some(4.612639594s), block_delay: 4.917370517s, slot: 5471670, proposer_index: 223740, block_root: 0x9b43f2c186b2fafa1297092150eb809df5a4a4268449f33dcb735fc99c8f1e8f, service: beacon, module: beacon_chain::canonical_head:1351
Dec 31 10:55:40.086 DEBG Delayed head block, set_as_head_delay: Some(37.089849ms), imported_delay: Some(388.781598ms), observed_delay: Some(4.660107541s), block_delay: 5.085978988s, slot: 5471676, proposer_index: 425901, block_root: 0x39fd2bef701b55e22d29b28e3b8ed9321181ce4523addb5a97136d11807df035, service: beacon, module: beacon_chain::canonical_head:1351

EDIT: this seems to coincide with a late gossip block.
Dec 31 10:55:39.666 DEBG Gossip block arrived late, block_delay: 4.660107541s, slot: 5471676, proposer_index: 425901, block_root: 0x39fd2bef701b55e22d29b28e3b8ed9321181ce4523addb5a97136d11807df035, module: network::beacon_processor::worker::gossip_methods:741

Can this be solved by not accepting late blocks?

Couldn't find your username on Discord (case sensitive?). Mine is djbakerman#7065.

@michaelsproul
Member

@djbakerman

It's interesting that 1 of 3 validators (all same host) would experience a delay because there is no resource contention on the host. Why would there be delay, from a slow/non-responding peer?

Each of the validators gets assigned to attest at a different slot, and different things can happen in each slot. So it's not so much that your validators are performing differently despite running on the same machine, but that they are all performing about the same, with a bit of variance due to how your machine performs at each slot/block.

As for why blocks in some slots are delayed, the two main components are observation delay and import delay. The observation delay is determined by the proposer of the block: they can publish late if they are faulty or malicious. There's not much your node can do about this once it happens; the only hope is that the next proposer chooses to re-org the late block. We are rolling out more aggressive re-orgs in our next release, which should further disincentivise late blocks (see #2860).

As for the import delay, that's determined by your node's software & hardware. Even if your node doesn't show signs of being maxed out, the hardware still limits how fast blocks can be processed. In particular, Lighthouse and Geth have several single-threaded and I/O-bound functions that won't appear to max out your machine but will consume time during block import. We are constantly optimising the clients to reduce these times, but it's delicate work and it takes time to ship meaningful gains.
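The "Delayed head block" lines posted above break the total down into exactly these components; observed_delay + imported_delay + set_as_head_delay adds up to block_delay. A small sketch (assuming the log format shown earlier in this thread):

```python
# Split a "Delayed head block" log line into its delay components and check
# that they sum to block_delay (observation delay dominates in this example).
import re

LINE = ("Dec 31 08:00:28.617 DEBG Delayed head block, set_as_head_delay: Some(49.14801ms), "
        "imported_delay: Some(465.096983ms), observed_delay: Some(5.103477108s), "
        "block_delay: 5.617722101s, slot: 5470800, proposer_index: 213763, ...")

def seconds(value: str) -> float:
    """Convert values like '49.14801ms' or '5.103477108s' to seconds."""
    return float(value[:-2]) / 1000 if value.endswith("ms") else float(value[:-1])

fields = dict(re.findall(r"(\w+_delay): (?:Some\()?([\d.]+m?s)", LINE))
parts = {name: seconds(value) for name, value in fields.items()}
total = parts["observed_delay"] + parts["imported_delay"] + parts["set_as_head_delay"]
print(round(total, 9), parts["block_delay"])   # 5.617722101 5.617722101
```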

I've cross-checked some of the delayed head block messages you posted against my own nodes and they match. This rules out clock sync being an issue for your node. I also think the imported_delays that you're seeing are within the expected range. You might see them spike to 1s+ when the execution node has a particularly large amount of work to do verifying the block.

@michaelsproul michaelsproul added the question Further information is requested label Aug 14, 2023