Slow performance of validator client PATCH API with hundreds of keys #3968

Closed
michaelsproul opened this issue Feb 13, 2023 · 7 comments
Labels
optimization Something to make Lighthouse run more efficiently. val-client Relates to the validator client binary

Comments

@michaelsproul
Member

michaelsproul commented Feb 13, 2023

@michaelsproul The team has been testing the v3.4.0 version of Lighthouse and encountered an issue with the PATCH lighthouse/validators/{pubkey} request, which is being performed every 2 minutes. Despite the fact that Lighthouse is not being killed every 2 minutes, it is still being restarted around every 2 hours by Kubernetes due to CPU throttling when updating 6 validators concurrently.

This could be a major problem in production, where there could be more than a thousand public keys in the same service, potentially leading to even more frequent restarts.

Any ideas on what may be causing this excessive CPU usage during the PATCH request?

Originally posted by @ricardolyn in #3795 (comment)

@michaelsproul michaelsproul added val-client Relates to the validator client binary optimization Something to make Lighthouse run more efficiently. labels Feb 13, 2023
@michaelsproul
Member Author

Summarising the other comments from that issue:

  • @ricardolyn observed the slowdown with 500 validators of which 6 are active, and all 6 are patched every 2 minutes.
  • The PATCH body is { builder_proposals: <true|false> }. Note that many more patch bodies are possible, including disabling/enabling the validators, but that is not the issue here.
  • CPU usage is observed to spike when the requests are made.
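
To make the request shape concrete, here is a minimal sketch of the PATCH body as a Rust type. The field names are taken from the `set_validator_definition_fields` call quoted in this thread; the struct name `PatchBody`, the `to_json` helper and the exact wire encoding of each field are assumptions for illustration, not the Lighthouse implementation.

```rust
/// Hypothetical mirror of the PATCH body. Field names follow the call quoted
/// in this thread (`enabled`, `gas_limit`, `builder_proposals`).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct PatchBody {
    pub enabled: Option<bool>,
    pub gas_limit: Option<u64>,
    pub builder_proposals: Option<bool>,
}

impl PatchBody {
    /// Serialise only the fields that are set, so unset fields stay untouched.
    /// The JSON encoding used here is an assumption, not the confirmed API format.
    pub fn to_json(&self) -> String {
        let mut parts: Vec<String> = Vec::new();
        if let Some(v) = self.enabled {
            parts.push(format!("\"enabled\":{}", v));
        }
        if let Some(v) = self.gas_limit {
            parts.push(format!("\"gas_limit\":{}", v));
        }
        if let Some(v) = self.builder_proposals {
            parts.push(format!("\"builder_proposals\":{}", v));
        }
        format!("{{{}}}", parts.join(","))
    }
}
```

In the case reported here only `builder_proposals` is set, so the other two fields would be `None` and omitted from the body.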

A few observations from me:

The main work happens in set_validator_definition_fields which is called here:

initialized_validators.set_validator_definition_fields(
    &validator_pubkey,
    body.enabled,
    body.gas_limit,
    body.builder_proposals,
),

The two steps most likely to be slow are update_validators, called here:

self.update_validators().await?;

And writing the updated definitions to disk here:

self.definitions
    .save(&self.validators_dir)
    .map_err(Error::UnableToSaveDefinitions)?;

Due to the increased CPU (rather than I/O or memory) I suspect update_validators is the source of the slowness. It could be the repeated decryption of the key cache, although with web3signer validators this should be close to a no-op (I strongly suspect you're using all web3signer validators, @ricardolyn?). We would need to add some metrics and do some profiling/benchmarking under similar conditions to determine the exact operations that are taking the longest.
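
As a rough idea of the kind of measurement meant above, the sketch below times individual steps with `std::time::Instant` and picks out the slowest one. The helpers `timed` and `slowest` are hypothetical names, and this is plain wall-clock timing rather than Lighthouse's metrics framework.

```rust
use std::time::{Duration, Instant};

/// Run `step` once and return its result together with the elapsed wall-clock time.
pub fn timed<T>(step: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = step();
    (out, start.elapsed())
}

/// Return the label of the step with the largest duration, if any.
pub fn slowest<'a>(timings: &[(&'a str, Duration)]) -> Option<&'a str> {
    timings
        .iter()
        .max_by_key(|(_, elapsed)| *elapsed)
        .map(|(name, _)| *name)
}
```

Wrapping the `update_validators` call and the definitions `save` call in `timed` and logging both durations per request would show which of the two dominates.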

Debug logs from your VC would also help @ricardolyn. They'd allow us to quickly check which branches are being taken, e.g. do we hit the efficient case here?

debug!(log, "Key cache not modified");

It's possible that @paulhauner has already done some optimisations of these routines for the validator manager (#3502) and we could merge that to improve performance.

@ricardolyn

I strongly suspect you're using all web3signer validators

yes, we are.

do we hit the efficient case here?

I believe so. However, the metrics endpoint stops handling requests (timing out): Get "http://X.X.X.X:8008/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers).

Here are the logs from when the validators were updated by the cronjob (the one that runs every 2 minutes) and the service was stopped (SIGTERM) because it was not answering the metrics endpoint:

lighthouse-validator Feb 15 10:52:03.013 DEBG Validator without index                 fee_recipient: [retracted], pubkey: [retracted],
lighthouse-validator Feb 15 10:52:03.019 DEBG Validator without index                 fee_recipient: [retracted], pubkey: [retracted],
lighthouse-validator Feb 15 10:52:04.000 DEBG No local validators in current sync committee, slot: 4995860, service: sync_committee
lighthouse-validator Feb 15 10:52:04.583 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:05.909 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:07.296 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:08.725 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:10.286 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:11.629 DEBG Key cache not modified
lighthouse-validator Feb 15 10:52:11.677 DEBG Fetching subscription duties            current_slot: 4995860, duty_slot: 4995860, service: sync_committee
lighthouse-validator Feb 15 10:52:11.677 DEBG No sync subscriptions to send           slot: 4995860, service: sync_committee
lighthouse-validator Feb 15 10:52:11.678 INFO Shutting down..                         reason: Success("Received SIGTERM")
lighthouse-validator Feb 15 10:52:11.678 INFO Connected to beacon node(s)             synced: 1, available: 1, total: 1, service: notifier
lighthouse-validator Feb 15 10:52:11.678 INFO Some validators active                  slot: 4995860, epoch: 156120, total_validators: 500, active_validators: 7, current_epoch_proposers: 0, service: notifier
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: validator_notifier, service: notifier
lighthouse-validator Feb 15 10:52:11.678 DEBG Couldn't spawn task. Runtime shutting down, service: attestation
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: fallback
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: duties_service_indices, service: duties
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: duties_service_proposers, service: duties
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: duties_service_attesters, service: duties
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: duties_service_sync_committee, service: duties
lighthouse-validator Feb 15 10:52:11.678 DEBG Block service shutting down             service: block
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: attestation_service, service: attestation
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: sync_committee_service, service: sync_committee
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: preparation_service, service: preparation
lighthouse-validator Feb 15 10:52:11.678 DEBG Async task shutdown, exit received      task: validator_registration_service, service: preparation

It seems that while it is running that code branch on each PATCH request (i.e. "Key cache not modified"), the metrics endpoint stops handling requests? Only a thought.
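
To put numbers on the log above, here is a small sketch that computes the gaps between the six "Key cache not modified" lines. The helper names are hypothetical, the parser assumes the `HH:MM:SS.mmm` format seen in these logs, and it is only an assumption that each of those lines corresponds to one PATCH request.

```rust
/// Parse an `HH:MM:SS.mmm` timestamp into milliseconds since midnight.
/// Assumes exactly three fractional digits, as in the log lines above.
pub fn parse_millis(ts: &str) -> Option<u64> {
    let (hms, millis) = ts.split_once('.')?;
    let mut fields = hms.split(':');
    let h: u64 = fields.next()?.parse().ok()?;
    let m: u64 = fields.next()?.parse().ok()?;
    let s: u64 = fields.next()?.parse().ok()?;
    let ms: u64 = millis.parse().ok()?;
    Some(((h * 60 + m) * 60 + s) * 1000 + ms)
}

/// Gaps in milliseconds between consecutive timestamps.
/// Returns `None` if a timestamp is malformed or the sequence goes backwards.
pub fn gaps(timestamps: &[&str]) -> Option<Vec<u64>> {
    let parsed: Option<Vec<u64>> = timestamps.iter().map(|t| parse_millis(t)).collect();
    let parsed = parsed?;
    parsed
        .windows(2)
        .map(|w| w[1].checked_sub(w[0]))
        .collect()
}

/// Timestamps of the six "Key cache not modified" lines in the log above.
pub const KEY_CACHE_LINES: [&str; 6] = [
    "10:52:04.583",
    "10:52:05.909",
    "10:52:07.296",
    "10:52:08.725",
    "10:52:10.286",
    "10:52:11.629",
];
```

Calling `gaps(&KEY_CACHE_LINES)` gives 1326, 1387, 1429, 1561 and 1343 ms, about 7 seconds from the first line to the last, with SIGTERM arriving roughly 50 ms after the final one.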

@ricardolyn

As you may know, Teku have implemented a standard way of loading this information in bulk from a file or a remote URL, using the same JSON schema. The schema used is defined here: https://docs.teku.consensys.net/HowTo/Configure/Proposer-Configuration.

In addition, the MEV-Boost project is also planning to use this approach to load relays per validator from a file or remote URL, as outlined in this Pull Request: flashbots/mev-boost#456.

Currently, Lighthouse requires one HTTP request for each validator that needs updating, which may not scale well. Additionally, there is no GET endpoint to check whether builder_proposals is enabled for a given validator. This setup complexity could be reduced by adopting the Teku and MEV-Boost approach, which would let Lighthouse handle bulk updates more efficiently and address the slow performance raised in this GitHub issue.

Therefore, I recommend that Lighthouse consider adopting this standard approach for syncing fee recipient and builder proposal configuration. @michaelsproul could the LH team consider this improvement?
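
To illustrate why a bulk update could scale better, the toy model below counts how often an expensive post-update step runs under each approach. Everything in it is hypothetical: `Store`, `patch_one`, `patch_many` and `refresh` are illustrative names, and `refresh` only stands in for whatever work Lighthouse does after a definition changes.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory store. `refreshes` counts how often the expensive
/// post-update step ran.
#[derive(Default)]
pub struct Store {
    pub builder_proposals: HashMap<String, bool>,
    pub refreshes: u32,
}

impl Store {
    /// Stand-in for the expensive work done after definitions change.
    fn refresh(&mut self) {
        self.refreshes += 1;
    }

    /// One update per request: the expensive step runs every time.
    pub fn patch_one(&mut self, pubkey: &str, value: bool) {
        self.builder_proposals.insert(pubkey.to_string(), value);
        self.refresh();
    }

    /// Bulk update: apply every change first, then run the expensive step once.
    pub fn patch_many(&mut self, updates: &[(&str, bool)]) {
        for (pubkey, value) in updates {
            self.builder_proposals.insert(pubkey.to_string(), *value);
        }
        if !updates.is_empty() {
            self.refresh();
        }
    }
}
```

With six validators patched every two minutes, the per-request path would run the expensive step six times per cycle and the bulk path once.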

@ricardolyn

@paulhauner would it make sense to implement something like the proposal in the comment above for the Validator Manager? Thanks

@beetrootkid

@paulhauner @michaelsproul - I think we at Consensys Staking have the capacity to fix this problem ourselves. Are you and your team open to us doing that?

@michaelsproul
Member Author

@beetrootkid That would be great! My preference would be to not add any more types of config file, so either improving the perf of the HTTP API or adding re-loading for the existing validator definitions YAML.

bors bot pushed a commit that referenced this issue Mar 29, 2023
…ary (#4126)

## Title

Optimise `update_validators` by decrypting key cache only when necessary

## Issue Addressed

Resolves [#3968: Slow performance of validator client PATCH API with hundreds of keys](#3968)

## Proposed Changes

1. Add a check to determine if there is at least one local definition before decrypting the key cache.
2. Assign an empty `KeyCache` when all definitions are of the `Web3Signer` type.
3. Perform cache-related operations (e.g., saving the modified key cache) only if there are local definitions.
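
The three changes listed above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the `SigningDefinition` enum is reduced to two unit variants, and `KeyCache`, `open_and_decrypt`, `load_key_cache` and `should_save_key_cache` are hypothetical stand-ins.

```rust
/// Simplified stand-in for how a validator's signing key is provided.
#[derive(Debug, Clone, PartialEq)]
pub enum SigningDefinition {
    LocalKeystore,
    Web3Signer,
}

/// Hypothetical stand-in for the key cache. `decrypted` records whether the
/// expensive decryption step ran.
#[derive(Debug, Default, PartialEq)]
pub struct KeyCache {
    pub decrypted: bool,
}

impl KeyCache {
    /// Stand-in for the expensive open-and-decrypt path.
    fn open_and_decrypt() -> Self {
        KeyCache { decrypted: true }
    }
}

/// Change 1: check whether at least one definition uses a local keystore.
pub fn has_local_definitions(defs: &[SigningDefinition]) -> bool {
    defs.iter()
        .any(|d| matches!(d, SigningDefinition::LocalKeystore))
}

/// Change 2: decrypt only when needed, otherwise use an empty cache.
pub fn load_key_cache(defs: &[SigningDefinition]) -> KeyCache {
    if has_local_definitions(defs) {
        KeyCache::open_and_decrypt()
    } else {
        KeyCache::default()
    }
}

/// Change 3: persist the cache only if there are local definitions and it changed.
pub fn should_save_key_cache(defs: &[SigningDefinition], modified: bool) -> bool {
    has_local_definitions(defs) && modified
}
```

For a setup like the one reported in this issue, where every definition is `Web3Signer`, `load_key_cache` returns the empty cache and the decryption path is never taken.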

## Additional Info

This PR addresses the excessive CPU usage and slow performance experienced when using the `PATCH lighthouse/validators/{pubkey}` request with a large number of keys. The issue was caused by the key cache using cryptography to decipher and cipher the cache entities every time the request was made. This operation called `scrypt`, which was very slow and required a lot of memory when there were many concurrent requests.

These changes have no impact on the overall functionality but can lead to significant performance improvements when working with remote signers. Importantly, the key cache is never used when there are only `Web3Signer` definitions, avoiding the expensive operation of decrypting the key cache in such cases.

Co-authored-by: Maksim Shcherbo <max.shcherbo@consensys.net>
@michaelsproul
Member Author

Closed by #4126! Please open a new issue if there are performance issues after the most recent change

ghost pushed a commit to oone-world/lighthouse that referenced this issue Jul 13, 2023
…ary (sigp#4126)
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this issue Jan 6, 2024
…ary (sigp#4126)