Refactor BlockHistoryEstimator check to halt excessive bumping #13297
}
// Return error to prevent bumping if gas price is nil or if EIP1559 is enabled and tip cap is nil
if maxGasPrice == nil || (b.eConfig.EIP1559DynamicFees() && maxTipCap == nil) {
	errorMsg := fmt.Sprintf("%d percentile price is not set. This is likely because there aren't any valid transactions to estimate from. Preventing bumping until valid price is available to compare", percentile)
Is this safe? This means that if head tracker is not working we can't bump txes. That seems potentially dangerous.
Good point. I think this was ok with a previous version of the proposal, when we'd use the same block range for the gas estimation and this halt-bumping check. Since we landed on using a different number of blocks for each, it's possible to have a gas price set but not the cap, which would block bumping. I've updated this to return nil to allow bumping to continue.
After discussing with Dimitris, I don't think halting bumping when the max gas price or max tip cap is not set is dangerous, so I'm reverting to erroring on this condition. This issue can only arise on startup, when the initial values have not been set yet. If there is an issue with the head tracker, we likely have bigger problems with the node. If the issue is a lack of transactions to calculate these values from, we don't think it's a problem to hold off on bumping until we have usable values. It'd be an extreme edge case to get consistent blocks without usable transactions after startup, to the point that bumping needs to be blocked for a missing value. But it seems safer to block than to blindly (and potentially aggressively) bump the transaction(s).
Yes, I see these values can be null only if we failed to calculate things on startup, which itself means BlockHistoryEstimator cannot work.
If you don't have blocks, you can't calculate percentiles, and thus can't set gas prices at all.
If at a later time the HeadTracker stops, we will still have these values set, just to a stale value.
I'm ok with halting bumping in such cases. There is a risk, but at that point we already see the HeadTracker not working, so there's no point going crazy with bumping.
I've seen many cases where our gas_prices are far higher than the gas fees on a block, causing the nodes to run out of funds. So not halting is actually already causing production problems.
promBlockHistoryEstimatorSetGasPrice = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "gas_updater_set_gas_price",
	Help: "Gas updater set gas price (in Wei)",
},
	[]string{"percentile", "evmChainID"},
)

promBlockHistoryEstimatorSetMaxPercentileGasPrice = promauto.NewGaugeVec(prometheus.GaugeOpts{
I'm a bit sceptical on adding more metrics like these. Do we even know if anyone uses them?
I'm ok with removing these, but I was thinking they were harmless in case we ever wanted to create dashboards in the future. I'll remove them, since they can be added later if we ever do need them.
Removed in the following commit
blockRange := mathutil.Min(len(blockHistory), int(b.bhConfig.CheckInclusionBlocks()))
startIdx := len(blockHistory) - blockRange
checkInclusionBlocks := blockHistory[startIdx:]
// Check each attempt for any with a gas price or tip cap (if EIP1559 type) exceeds the latest CheckInclusionPercentile prices
for _, attempt := range attempts {
	if attempt.BroadcastBeforeBlockNum == nil {
Is this check relevant anymore? What does it accomplish?
I left this check in here as part of a belt and braces approach since it's still an assumption violation. Since we don't use attempt.BroadcastBeforeBlockNum anymore, you're right that this check isn't really protecting anything except halting bumping if a tx has an attempt without BroadcastBeforeBlockNum set.
Actually, I strongly suspect this code is causing problems.
Inside our confirmer, we are setting BroadcastBeforeBlockNum to null here:
chainlink/common/txmgr/confirmer.go
Line 857 in d675d86
previousAttempt.BroadcastBeforeBlockNum = nil
I couldn't find where we set the value back to a valid block again.
Could you check that, and ensure it isn't set to null forever?
Otherwise, I'm in favor of removing this check
I believe after this code the Confirmer tries to broadcast the transaction. Whether it succeeds or fails, it saves the attempt without a BroadcastBeforeBlockNum. Then on the next call of processHead, we set the block num for all broadcast attempts here:
chainlink/common/txmgr/confirmer.go
Lines 293 to 295 in 58d73ec
if err := ec.txStore.SetBroadcastBeforeBlockNum(ctx, head.BlockNumber(), ec.chainID); err != nil {
	return fmt.Errorf("SetBroadcastBeforeBlockNum failed: %w", err)
}
The only case in which an attempt lacks BroadcastBeforeBlockNum is if it is in either the in_progress or insufficient_funds state. But txs with an insufficient_funds attempt are never passed to bumping, and attempts aren't marked as in_progress anywhere in the call stack before bumping.
If we want to ensure we don't bump txs with either in_progress or insufficient_funds attempts, I can change this validation to check that all attempts are in the broadcast state. It seems like it would be functionally the same as the BroadcastBeforeBlockNum check but much more straightforward. I'm also ok with removing this check, but it seems like a good one to have for a belt and braces approach, since we would want to block bumping if a tx has an attempt in a non-broadcast state.
Thanks for the validation. Looks good.
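The alternative validation proposed in this thread (require every attempt to be in the broadcast state instead of checking BroadcastBeforeBlockNum) might look roughly like the sketch below. The attempt and attemptState types and the ensureAllBroadcast name are hypothetical stand-ins, not the txmgr types; only the three state names come from the discussion.

```go
package main

import "fmt"

// attemptState is a hypothetical stand-in for the tx attempt state type.
type attemptState string

const (
	stateInProgress        attemptState = "in_progress"
	stateInsufficientFunds attemptState = "insufficient_funds"
	stateBroadcast         attemptState = "broadcast"
)

// attempt is a minimal, illustrative attempt record.
type attempt struct {
	ID    int64
	State attemptState
}

// ensureAllBroadcast returns an error for the first attempt that is not in
// the broadcast state. The caller is assumed to halt bumping on error.
func ensureAllBroadcast(attempts []attempt) error {
	for _, a := range attempts {
		if a.State != stateBroadcast {
			return fmt.Errorf("attempt %d is in state %q, expected %q; halting bumping", a.ID, a.State, stateBroadcast)
		}
	}
	return nil
}
```

Checking the state directly makes the intent explicit, whereas the nil check on BroadcastBeforeBlockNum only implies it.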
gweiMultiplier := big.NewFloat(1_000_000_000)
float := new(big.Float).SetInt(maxPercentileGasPrice.ToInt())
gwei, _ := big.NewFloat(0).Quo(float, gweiMultiplier).Float64()
maxPercentileGasPriceGwei := fmt.Sprintf("%.2f", gwei)
Can you take a look at the assets package and see if you can reuse something from there to simplify this?
Good call! Found an existing method
float = new(big.Float).SetInt(maxPercentileTipCap.ToInt())
gwei, _ = big.NewFloat(0).Quo(float, gweiMultiplier).Float64()
maxPercentileTipCapGwei := fmt.Sprintf("%.2f", gwei)
Same as above
Nice change, this simplifies the logic, and will hopefully fix all the runaway bumping bugs that we see.
Just added 1 comment around BroadcastBeforeBlockNum, which I suspect is a problem.
Otherwise looks good.
For this scope I'm approving this, but we need to tackle the issues discussed internally in Slack.
CheckInclusionBlocks amount of blocks had to pass since broadcast to assess an attempt
CheckInclusionBlocks amount of blocks to calculate the CheckInclusionPercentile price
If CheckInclusionBlocks config was set greater than BlockHistorySize, the latest gas estimation used would be CheckInclusionBlocks - BlockHistorySize blocks behind
If BlockHistorySize config was set greater than CheckInclusionBlocks, the check to halt bumping would not use all of the blocks available to it until BlockHistorySize - CheckInclusionBlocks passed
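The notes above describe taking the latest CheckInclusionBlocks blocks from the block history and computing the CheckInclusionPercentile price over them. A minimal sketch of that idea follows. The block type, the int64 prices, and the percentile index formula are assumptions made for illustration; the estimator's real types and percentile calculation may differ.

```go
package main

import "sort"

// block is a hypothetical, simplified block: a number plus the gas prices
// of its transactions.
type block struct {
	Number    int64
	GasPrices []int64
}

// latestBlocks returns the most recent n blocks of history (history is
// assumed to be ordered oldest first), or all of it if fewer are available.
func latestBlocks(history []block, n int) []block {
	if n < 0 {
		n = 0
	}
	if n > len(history) {
		n = len(history)
	}
	return history[len(history)-n:]
}

// percentilePrice returns the price at the given percentile (0..100) across
// all transactions in blocks. The second result is false when there are no
// prices to estimate from or the percentile is out of range.
func percentilePrice(blocks []block, percentile int) (int64, bool) {
	if percentile < 0 || percentile > 100 {
		return 0, false
	}
	var prices []int64
	for _, b := range blocks {
		prices = append(prices, b.GasPrices...)
	}
	if len(prices) == 0 {
		return 0, false
	}
	sort.Slice(prices, func(i, j int) bool { return prices[i] < prices[j] })
	// Assumed index formula: scale the last index by the percentile.
	idx := (len(prices) - 1) * percentile / 100
	return prices[idx], true
}
```

Because the window is always taken from the end of the history, the comparison price follows the latest blocks and does not depend on when an attempt was broadcast.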