
SIMD-0132: Dynamic Block Limits #132

Open · wants to merge 12 commits into base: main

Conversation


@cavemanloverboy cavemanloverboy commented Mar 25, 2024

SIMD Proposal: Dynamic Block Limits

Summary

This proposal introduces dynamic adjustments to the compute unit (CU) limit of Solana blocks based on network utilization, evaluated at the end of each epoch. If the average block utilization exceeds 75%, the CU limit increases by 20%; if it falls below 25%, the limit decreases by 20%. A second metric based on vote slot latency is used to preserve protocol liveness and a responsive UX. The goal is to optimize network performance by adapting to demonstrated compute capacity and demand, without centralized decisions about limits and without voting. Although the adjustment rate is arbitrary (and can be discussed in this PR), the block limit will be determined by the demonstrated capacity of the network.
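For concreteness, a minimal sketch of the end-of-epoch rule described above. The thresholds and step mirror the 75% / 25% / 20% figures in the summary; the function and constant names are illustrative, and the vote-latency guard is omitted.

```rust
// Illustrative constants; the SIMD treats the step size as open for discussion.
const UTILIZATION_INCREASE_THRESHOLD: f64 = 0.75;
const UTILIZATION_DECREASE_THRESHOLD: f64 = 0.25;
const ADJUSTMENT_RATE: f64 = 0.20;

/// Next epoch's block CU limit, given this epoch's average block utilization
/// (CUs consumed per block divided by the current limit, averaged over the epoch).
fn next_block_cu_limit(current_limit: u64, avg_utilization: f64) -> u64 {
    if avg_utilization > UTILIZATION_INCREASE_THRESHOLD {
        (current_limit as f64 * (1.0 + ADJUSTMENT_RATE)) as u64
    } else if avg_utilization < UTILIZATION_DECREASE_THRESHOLD {
        (current_limit as f64 * (1.0 - ADJUSTMENT_RATE)) as u64
    } else {
        current_limit
    }
}
```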

@blasrodri

Shouldn't this also have a mechanism to ensure that block producers are able to keep the pace during spike times?

@cavemanloverboy
Author

> Shouldn't this also have a mechanism to ensure that block producers are able to keep the pace during spike times?

CU limits will only move up if the vast majority of the network is nearly filling blocks, and move down if the vast majority of the network is struggling to fill blocks. Outliers will either be throttled or pruned.

@Tamgros

Tamgros commented Mar 25, 2024

There are other considerations:

  • Delinquency: the protocol should only increase the limit if the delinquency rate is below x%. I.e., if validators are already struggling, we don't want to make it even harder to validate.
  • If this update can happen every epoch, the CU changes can be a much smaller % and still produce pretty significant movement in a short period.

I think it'd also be worth thinking about incentives long term. Validators want more tx fees, but they also face hardware switching costs. This is why I think an incremental approach is better: it allows validators to plan and see how their current setups are or aren't performing.
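A minimal sketch of the variant suggested above: gate increases on a healthy delinquency rate and use a smaller per-epoch step. The "x%" threshold and the 5% step are placeholder values, not numbers from the thread.

```rust
// Placeholder values; "x%" in the comment above is deliberately unspecified.
const MAX_DELINQUENT_STAKE_FRACTION: f64 = 0.05;
const SMALL_ADJUSTMENT_RATE: f64 = 0.05; // smaller than the 20% in the draft

fn next_limit_with_delinquency_gate(
    current_limit: u64,
    avg_utilization: f64,
    delinquent_stake_fraction: f64,
) -> u64 {
    if avg_utilization > 0.75 && delinquent_stake_fraction < MAX_DELINQUENT_STAKE_FRACTION {
        // Only raise the limit when the cluster is healthy.
        (current_limit as f64 * (1.0 + SMALL_ADJUSTMENT_RATE)) as u64
    } else if avg_utilization < 0.25 {
        // Decreases are not gated, so a struggling cluster can still back off.
        (current_limit as f64 * (1.0 - SMALL_ADJUSTMENT_RATE)) as u64
    } else {
        current_limit
    }
}
```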

@cavemanloverboy
Author

> There are other considerations:
>
>   • Delinquency: the protocol should only increase the limit if the delinquency rate is below x%. I.e., if validators are already struggling, we don't want to make it even harder to validate.
>   • If this update can happen every epoch, the CU changes can be a much smaller % and still produce pretty significant movement in a short period.
>
> I think it'd also be worth thinking about incentives long term. Validators want more tx fees, but they also face hardware switching costs. This is why I think an incremental approach is better: it allows validators to plan and see how their current setups are or aren't performing.

  • Love the delinquency check. Gets at the previous concern about potatoes.
  • 20% was arbitrary and chosen as a conversation starter. Can definitely go for something smaller.

@7layermagik

7layermagik commented Mar 25, 2024

> There are other considerations:
>
>   • Delinquency: the protocol should only increase the limit if the delinquency rate is below x%. I.e., if validators are already struggling, we don't want to make it even harder to validate.
>   • If this update can happen every epoch, the CU changes can be a much smaller % and still produce pretty significant movement in a short period.
>
> I think it'd also be worth thinking about incentives long term. Validators want more tx fees, but they also face hardware switching costs. This is why I think an incremental approach is better: it allows validators to plan and see how their current setups are or aren't performing.

Using vote latency might be better than delinquency? Hopefully it doesn't get to the point where you have high delinquency... if you have a target vote latency range, that measure will always be pretty directly tied to UX and confirmation-time consistency. You could take the median vote latency of the top 80% of validators, for example, and just make sure it falls within a certain range. You could even use vote latency instead of CUs as the way to know when to increase or decrease block size. With timely vote credits, validators will already be searching for ways to improve their APY by improving vote latency.

Skip rate is also something worth considering.
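A rough sketch of the vote-latency gate floated above, assuming "top 80% of validators" means dropping the worst 20% by latency and taking the median of the rest; the target band and helper names are placeholders.

```rust
/// Median vote latency (in slots) of the best 80% of validators by latency.
fn median_latency_of_best_80_pct(mut latencies_in_slots: Vec<f64>) -> Option<f64> {
    if latencies_in_slots.is_empty() {
        return None;
    }
    latencies_in_slots.sort_by(|a, b| a.total_cmp(b));
    let kept = ((latencies_in_slots.len() * 8) / 10).max(1); // drop the worst 20%
    Some(latencies_in_slots[kept / 2]) // median of the retained set
}

/// True if the retained median falls inside the target range [min, max].
fn latency_within_target(latencies_in_slots: Vec<f64>, min: f64, max: f64) -> bool {
    match median_latency_of_best_80_pct(latencies_in_slots) {
        Some(median) => median >= min && median <= max,
        None => false,
    }
}
```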

@cavemanloverboy
Author

Things I was hoping to discuss that need to be specified more precisely:

  • Upper and lower bounds. The initial 48M CU limit seems like a natural suggestion for the lower bound. Setting the upper bound too low renders this SIMD useless, because a centralized/arbitrary decision would often need to be made to raise the upper bound. Perhaps the upper bound can instead be a maximum increase over some number of epochs, i.e. the limit cannot be raised more than 2x within 10 epochs ≈ O(1 month) (see the sketch after this list).
  • Whether per-account CU limits or other block parameters are to be included in this SIMD, or whether it is best to leave them for a future SIMD. I am in favor of the latter.
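On the first bullet, a sketch of one way the bounds could be expressed: a 48M CU floor plus a cap on how fast the limit may grow, here no more than 2x over any 10-epoch window. The window length and factor come from the bullet above; the names and data shape are invented for illustration.

```rust
const LOWER_BOUND_CU: u64 = 48_000_000;
const GROWTH_WINDOW_EPOCHS: usize = 10;
const MAX_GROWTH_FACTOR: u64 = 2;

/// `recent_limits` holds the limits of recent epochs, oldest first;
/// `proposed` is the output of the utilization rule.
fn clamp_limit(recent_limits: &[u64], proposed: u64) -> u64 {
    // The smallest limit seen in the last window bounds how far we may have grown.
    let window_floor = recent_limits
        .iter()
        .rev()
        .take(GROWTH_WINDOW_EPOCHS)
        .copied()
        .min()
        .unwrap_or(LOWER_BOUND_CU);
    let growth_cap = window_floor
        .saturating_mul(MAX_GROWTH_FACTOR)
        .max(LOWER_BOUND_CU);
    proposed.clamp(LOWER_BOUND_CU, growth_cap)
}
```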

@bji
Contributor

bji commented Mar 25, 2024

> Shouldn't this also have a mechanism to ensure that block producers are able to keep the pace during spike times?

> CU limits will only move up if the vast majority of the network is nearly filling blocks, and move down if the vast majority of the network is struggling to fill blocks. Outliers will either be throttled or pruned.

Why does it make sense to adjust CU based on utilization? Shouldn't it be adjusted based on capacity? CU should always be set to the greatest value the network supports, not whatever happens to be in use.

@cavemanloverboy
Author

cavemanloverboy commented Mar 25, 2024

> Shouldn't this also have a mechanism to ensure that block producers are able to keep the pace during spike times?

> CU limits will only move up if the vast majority of the network is nearly filling blocks, and move down if the vast majority of the network is struggling to fill blocks. Outliers will either be throttled or pruned.

> Why does it make sense to adjust CU based on utilization? Shouldn't it be adjusted based on capacity? CU should always be set to the greatest value the network supports, not whatever happens to be in use.

Utilization is both demonstrated capacity and demand. Increasing CU limits far beyond demonstrated capacity adds risk to the system because it opens a vector for validators to create fat blocks that the rest of the network may struggle to replay and which the network has not yet demonstrated it is capable of handling.

@ripatel-fd
Contributor

This proposal needs a lot more research. Do you have evidence that increasing limits without validator interaction won't just introduce irrecoverable instability and crash the network?

@cavemanloverboy
Author

cavemanloverboy commented Apr 9, 2024

> This proposal needs a lot more research. Do you have evidence that increasing limits without validator interaction won't just introduce irrecoverable instability and crash the network?

What form of evidence would you like to see?

The mechanism is self-correcting: if the network demonstrates that it cannot keep up with a higher CU limit while preserving low latency for the entire supermajority, the CU limit decreases. If the blocks (whose schedule is sampled by stake weight) are full, replayed, and voted on, and the supermajority of the network is highly responsive, why would the network go down?

@bw-solana

A couple of things:

  1. I think using the median (or some percentile, like the OC %) would be better/safer than using the average. E.g. we move up the CU limit when 67% of the blocks were packed >= 80% (a sketch follows this list). I'm thinking of some diabolical case where half the cluster stake is super nodes and half is potatoes. The super nodes are packing 100% and the potatoes are struggling just to pack 60%. We average to 80% and move up the limits again. The potatoes die even more, skipped slots go wild, machines go delinquent, we don't have enough stake to confirm anything. Much RIP. If we use the OC %ile, that should guarantee OC% of the stake is able to keep up (+/- some small std deviation).
  2. We'll need some way of computing the average/median CU cost per block for nodes that come online in the middle of an epoch. This probably means adding a small amount of metadata to the snapshot.
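A sketch of the stake-weighted trigger from item 1: raise the limit only if at least 67% of the stake produced blocks packed to at least 80%. Both numbers are the examples from the comment, and the data shape is an assumption.

```rust
const TRIGGER_STAKE_FRACTION: f64 = 0.67; // roughly an optimistic-confirmation quorum
const TRIGGER_FILL: f64 = 0.80;

/// Each entry is (leader's stake weight, average fill of that leader's blocks this epoch).
fn should_raise_limit(per_leader_fill: &[(u64, f64)]) -> bool {
    let total_stake: u64 = per_leader_fill.iter().map(|(stake, _)| *stake).sum();
    if total_stake == 0 {
        return false;
    }
    let packed_stake: u64 = per_leader_fill
        .iter()
        .filter(|(_, fill)| *fill >= TRIGGER_FILL)
        .map(|(stake, _)| *stake)
        .sum();
    packed_stake as f64 / total_stake as f64 >= TRIGGER_STAKE_FRACTION
}
```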

@bw-solana

For testing, I bet we could start with just a local cluster running with shortened epochs (maybe 5 minutes) to prove out the idea. Set the starting CU limit to something like 1M and see what terminal value is hit when spamming bench-tps. Then kill bench-tps and see what the CU limit falls to.
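A toy offline analogue of that experiment (not the proposed test harness): start at a low limit and iterate the 75% / 25% / 20% rule under constant demand to see where it settles, then do the same with the demand removed.

```rust
fn simulate(start_limit: u64, demand_cus: u64, epochs: usize) -> u64 {
    let mut limit = start_limit;
    for _ in 0..epochs {
        // Blocks fill with whatever demand exists, capped by the current limit.
        let utilization = demand_cus.min(limit) as f64 / limit as f64;
        limit = if utilization > 0.75 {
            (limit as f64 * 1.2) as u64
        } else if utilization < 0.25 {
            (limit as f64 * 0.8) as u64
        } else {
            limit
        };
    }
    limit
}

fn main() {
    // Under saturating spam the limit grows until demand no longer fills 75% of it.
    println!("with spam:    {}", simulate(1_000_000, 48_000_000, 50));
    // With the spam killed it keeps shrinking, which is why a lower bound matters.
    println!("without spam: {}", simulate(48_000_000, 0, 50));
}
```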

@CriesofCarrots changed the title from SIMD-0130: Dynamic Block Limits to SIMD-0132: Dynamic Block Limits on May 21, 2024
@0xSol

0xSol commented May 21, 2024

@cavemanloverboy to share testing outcomes with metrics once ready.
