Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor BlockHistoryEstimator check to halt excessive bumping #13297

Conversation

amit-momin
Copy link
Contributor

  • Refactored the existing check to halt excessive bumping
    • Previously, CheckInclusionBlocks amount of blocks had to pass since broadcast to assess an attempt
    • Now, the check is an ongoing enforcement using the most recent CheckInclusionBlocks amount of blocks to calculate the CheckInclusionPercentile price
  • Fixed an issue where the oldest blocks in the cache were used for calculations instead of the latest ones. Depending on the configs, this could have had a negative affect on the gas estimation or the check to halt bumping.
    • If theCheckInclusionBlocks config was set greater than BlockHistorySize, the latest gas estimation used would be CheckInclusionBlocks - BlockHistorySize blocks behind
    • If the BlockHistorySize config was set greater than CheckInclusionBlocks, the check to halt bumping would not use all of the blocks available to it until BlockHistorySize - CheckInclusionBlocks passed

@amit-momin amit-momin marked this pull request as ready for review May 23, 2024 14:57
@amit-momin amit-momin requested a review from a team as a code owner May 23, 2024 14:57
}
// Return error to prevent bumping if gas price is nil or if EIP1559 is enabled and tip cap is nil
if maxGasPrice == nil || (b.eConfig.EIP1559DynamicFees() && maxTipCap == nil) {
errorMsg := fmt.Sprintf("%d percentile price is not set. This is likely because there aren't any valid transactions to estimate from. Preventing bumping until valid price is available to compare", percentile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this safe? This means that if head tracker is not working we can't bump txes. That seems potentially dangerous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Think this was ok with a previous version of the proposal when we'd use the same block range for the gas estimation and this halt bumping check. Since we landed on using different number of blocks, it's possible we have a gas price set but don't have the cap set which would block bumping. I've updated this to return nil to allow bumping to continue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing with Dimitris, I don't think halting bumping if max gas price or max tip cap are set is dangerous so I'm reverting back to erroring on this condition. This issue can only arise on startup when the initial value for this have not been set yet. If there is an issue with the head tracker, we likely have bigger problems with the node. If the issue is the lack of transactions to calculate these values, we don't think it's a problem to hold on bumping until we have usable values. It'd be an extreme edge case to get consistent blocks without usable transactions after startup to reach a point that bumping needs to be blocked for a missing value. But it seems safer to block than blindly (and potentially aggressively) bump the transaction(s).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see these values can be null only if we failed to calculate things on startup, which itself means BlockHistoryEstimator cannot work.
If you don't have blocks, you can't calculate %iles, thus not set gas prices at all.

If at a later time the HeadTracker stops, we will still have these values set, just to a stale value.

I'm ok with halting bumping in such cases. There is a risk, but then we already see HeadTracker not working, no no point going crazy with bumping.
I've seen many cases where our gas_prices are way way higher than the gas fees on a block, causing the nodes to go out of funds. So not halting is actually already causing production problems.

promBlockHistoryEstimatorSetGasPrice = promauto.NewGaugeVec(prometheus.GaugeOpts{
Name: "gas_updater_set_gas_price",
Help: "Gas updater set gas price (in Wei)",
},
[]string{"percentile", "evmChainID"},
)

promBlockHistoryEstimatorSetMaxPercentileGasPrice = promauto.NewGaugeVec(prometheus.GaugeOpts{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit sceptical on adding more metrics like these. Do we even know if anyone uses them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with removing these but I was thinking it was harmless in case we wanted to every create dashboards in the future. I'll remove these since they can be added later if we ever do need them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in the following commit

blockRange := mathutil.Min(len(blockHistory), int(b.bhConfig.CheckInclusionBlocks()))
startIdx := len(blockHistory) - blockRange
checkInclusionBlocks := blockHistory[startIdx:]
// Check each attempt for any with a gas price or tip cap (if EIP1559 type) exceeds the latest CheckInclusionPercentile prices
for _, attempt := range attempts {
if attempt.BroadcastBeforeBlockNum == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check relevant anymore? What does this accomplish ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this check in here as part of a belt and braces approach since it's still an assumption violation. Since we don't use attempt.BroadcastBeforeBlockNum anymore, you're right that this check isn't really protecting anything except halting bumping if a tx has an attempt without BroadcastBeforeBlockNum set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had a major suspect this code is causing problems.

Inside our confirmer, we are setting BroadcastBeforeBlockNum to null here:

previousAttempt.BroadcastBeforeBlockNum = nil

I couldn't find where we do again set the value back to a valid block.
Could you check that, and ensure it isn't set to null forever?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I'm in favor of removing this check

Copy link
Contributor Author

@amit-momin amit-momin Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe after this code the Confirmer tries broadcast the transaction. Whether it's successful or fails, it saves the attempt without a BroadcastBeforeBlockNum. Then on the next call of processHead, we set the block num for all broadcast attempts here:

if err := ec.txStore.SetBroadcastBeforeBlockNum(ctx, head.BlockNumber(), ec.chainID); err != nil {
return fmt.Errorf("SetBroadcastBeforeBlockNum failed: %w", err)
}
The only time an attempt wouldn't have a BroadcastBeforeBlockNum is if it is either in_progress or insufficient_funds state. But txs with an insufficient_funds attempt are never passed to bumping and attempts aren't marked as in_progress anywhere in the call stack before bumping.

If we want to ensure we don't bump for txs with either in_progress or insufficient_funds attempts, I can change this validation to check that all attempts are broadcast state. Seems like it would be functionally the same as the BroadcastBeforeBlockNum check but much more straight forward. I'm also ok removing this check but seems like a good check to have for a belts and braces approach since we would want to block bumping if a tx has a non-broadcast state attempt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the validation. Looks good.

Comment on lines 613 to 616
gweiMultiplier := big.NewFloat(1_000_000_000)
float := new(big.Float).SetInt(maxPercentileGasPrice.ToInt())
gwei, _ := big.NewFloat(0).Quo(float, gweiMultiplier).Float64()
maxPercentileGasPriceGwei := fmt.Sprintf("%.2f", gwei)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you take a look at the assets package and see if you can reuse something from there to simplify this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Found an existing method

Comment on lines 631 to 633
float = new(big.Float).SetInt(maxPercentileTipCap.ToInt())
gwei, _ = big.NewFloat(0).Quo(float, gweiMultiplier).Float64()
maxPercentileTipCapGwei := fmt.Sprintf("%.2f", gwei)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Contributor

@prashantkumar1982 prashantkumar1982 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change, this simplifies the logic, and will hopefully fix all the runaway bumping bugs that we see.
Just added 1 comment around BroadcastBeforeBlockNum, which I suspect is a problem.
Otherwise looks good.

blockRange := mathutil.Min(len(blockHistory), int(b.bhConfig.CheckInclusionBlocks()))
startIdx := len(blockHistory) - blockRange
checkInclusionBlocks := blockHistory[startIdx:]
// Check each attempt for any with a gas price or tip cap (if EIP1559 type) exceeds the latest CheckInclusionPercentile prices
for _, attempt := range attempts {
if attempt.BroadcastBeforeBlockNum == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had a major suspect this code is causing problems.

Inside our confirmer, we are setting BroadcastBeforeBlockNum to null here:

previousAttempt.BroadcastBeforeBlockNum = nil

I couldn't find where we do again set the value back to a valid block.
Could you check that, and ensure it isn't set to null forever?

blockRange := mathutil.Min(len(blockHistory), int(b.bhConfig.CheckInclusionBlocks()))
startIdx := len(blockHistory) - blockRange
checkInclusionBlocks := blockHistory[startIdx:]
// Check each attempt for any with a gas price or tip cap (if EIP1559 type) exceeds the latest CheckInclusionPercentile prices
for _, attempt := range attempts {
if attempt.BroadcastBeforeBlockNum == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I'm in favor of removing this check

}
// Return error to prevent bumping if gas price is nil or if EIP1559 is enabled and tip cap is nil
if maxGasPrice == nil || (b.eConfig.EIP1559DynamicFees() && maxTipCap == nil) {
errorMsg := fmt.Sprintf("%d percentile price is not set. This is likely because there aren't any valid transactions to estimate from. Preventing bumping until valid price is available to compare", percentile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see these values can be null only if we failed to calculate things on startup, which itself means BlockHistoryEstimator cannot work.
If you don't have blocks, you can't calculate %iles, thus not set gas prices at all.

If at a later time the HeadTracker stops, we will still have these values set, just to a stale value.

I'm ok with halting bumping in such cases. There is a risk, but then we already see HeadTracker not working, no no point going crazy with bumping.
I've seen many cases where our gas_prices are way way higher than the gas fees on a block, causing the nodes to go out of funds. So not halting is actually already causing production problems.

Copy link
Collaborator

@dimriou dimriou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this scope I'm approving this but we need to tackle the issues discussed internally in slack.

@prashantkumar1982 prashantkumar1982 added this pull request to the merge queue Jun 10, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 10, 2024
@dimriou dimriou added this pull request to the merge queue Jun 10, 2024
Merged via the queue into develop with commit e55e042 Jun 10, 2024
109 checks passed
@dimriou dimriou deleted the BCI-2883-Refactor-Block-History-Estimator-s-checkConnectivity-method branch June 10, 2024 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants