TxMempool rewrite (CON-305)#3476
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
| // only add new transaction if checkTx passes and is not pending | ||
| if !txmp.isPending(wtx) { | ||
| if err := txmp.addNewTransaction(wtx); err != nil { | ||
| return nil, err |
There was a problem hiding this comment.
Recheck only covers ready txs, misses pending ones
Medium Severity
The Update recheck loop iterates only over txmp.txStore.ReadyTxs(), so pending (not-yet-ready) EVM transactions are never rechecked. If a pending transaction becomes invalid after a block (e.g. the sender's balance dropped), it won't be detected and will remain in the mempool indefinitely until it either expires or becomes ready and gets rechecked later.
Reviewed by Cursor Bugbot for commit d9b1814. Configure here.
There was a problem hiding this comment.
that's a backward compatible behavior.
| l.head = nil | ||
| l.tail = nil | ||
| l.len = 0 | ||
| } |
There was a problem hiding this comment.
CList.Clear deadlocks calling Next under write lock
High Severity
Clear holds l.mtx.Lock() and calls el.Next(), which acquires el.mtx.RLock(). Meanwhile, another goroutine calling Remove(e) holds l.mtx.Lock() and calls e.setRemoved() which acquires e.mtx.Lock(). But the real issue is that setRemoved() is called within Clear() while l.mtx is already held — and setRemoved also acquires e.mtx.Lock(). If a concurrent broadcastTxRoutine is calling NextWait() which holds e.mtx.RLock() and then tries to call some list operation that needs l.mtx, a lock-ordering deadlock occurs.
Reviewed by Cursor Bugbot for commit d9b1814. Configure here.
There was a problem hiding this comment.
that's plain bullshit
| if !wtx.readyEl.IsPresent() { | ||
| wtx.readyEl = utils.Some(s.readyTxs.PushBack(wtx.Tx())) | ||
| } | ||
| } |
There was a problem hiding this comment.
Ready count double-incremented for newly promoted replacement tx
Medium Severity
In insert(), when replacing a tx with the same nonce where the old tx was NOT ready (oldReady is false) but the new tx satisfies the balance requirement, the ready count is only incremented once by the promotion loop at line 339. This is correct. However, when the old tx WAS ready and evm.nonce equals account.nextNonce - 1, the replacement code at lines 324–326 increments state.ready for the new tx. Then the promotion loop at lines 332–343 uses a shadowed wtx variable (line 334) and re-finds the same just-inserted tx at its nonce if account.nextNonce was decremented during the Dec/Inc — but account.nextNonce is not modified in the replacement path, so the promotion loop starts past this nonce. This is actually safe, so the ready count is correct for this path. The actual issue: when the old tx was NOT ready (pending) and is being replaced, the old tx's bytes are NOT decremented from state.ready (correctly, since it wasn't ready). But the old tx is deleted from byHash at line 317 without decrementing byNonce-related ready state. Then the promotion loop may find the new tx and promote it — but state.total was already decremented for the old tx at line 319 and re-incremented at line 329. The byNonce map only has one entry (replaced in place at line 330). Actually, on closer re-analysis, the counts appear correct. Disregard the double-count concern.
Reviewed by Cursor Bugbot for commit b4d042b. Configure here.
| s.metrics.EvictedTxs.Add(1) | ||
| if el, ok := wtx.readyEl.Get(); ok { | ||
| s.readyTxs.Remove(el) | ||
| } |
There was a problem hiding this comment.
Compact pre-checks total without accounting for insert's effect
Medium Severity
In compact(), the limit check at line 436–437 creates a local total copy, increments it by wtx.Size(), and checks against softLimit. But this prospective check does not account for the possibility that insert() might replace an existing tx (nonce collision), which would decrease the total. During compact, byNonce is cleared so nonce collisions shouldn't occur. However, there's a more subtle issue: the prospective total.Inc(wtx.Size()) includes the tx's size, but if insert() fails (e.g., errOldNonce after re-fetching nonces), the actual state is never incremented. Yet the eviction path removes the tx from cache regardless. This means a tx that fails insert due to stale nonces is evicted AND removed from cache, even though the limit check (which assumed the tx would be inserted) was the wrong criterion for eviction.
Reviewed by Cursor Bugbot for commit b4d042b. Configure here.
| s.metrics.EvictedTxs.Add(1) | ||
| if el, ok := wtx.readyEl.Get(); ok { | ||
| s.readyTxs.Remove(el) | ||
| } |
There was a problem hiding this comment.
Compact evicts valid txs using wrong limit criterion
Medium Severity
In compact(), the prospective limit check (total.Inc(wtx.Size()) then total.LessEqual(&inner.softLimit)) evaluates whether adding this tx would exceed the soft limit. But due to short-circuit evaluation in if !limitOk || s.insert(...) != nil, when limitOk is false, insert() is never called. The eviction path then removes the tx from cache. However, if a previous large tx caused the total to approach the limit, many subsequent smaller high-priority txs could be incorrectly evicted because the prospective check double-counts — it adds the current tx's size to a total that already includes successfully inserted txs, but those txs may have replaced older ones (reducing the actual total). Since compact clears byNonce first, nonce collisions won't occur during reinsertion, so this particular scenario doesn't apply. The real concern is that once a single tx pushes the prospective total over the limit, ALL subsequent txs in wtxs will also fail the check (since the actual state only grows), causing potentially valid lower-priority txs to be evicted unnecessarily.
Reviewed by Cursor Bugbot for commit b4d042b. Configure here.
…ning should be enabled as soon as we start evicting transactions, no matter what kind
| } | ||
|
|
||
| // Defer cancelling as the last so that it is called first during unwinding. | ||
| defer cancel() |
There was a problem hiding this comment.
Removed context cancel changes shutdown ordering
Low Severity
Removing the second defer cancel() that was placed at the end of the function changes context cancellation timing during shutdown. Previously, cancel() ran first during defer unwinding (LIFO order), signalling all goroutines using goCtx to stop before other resources were torn down. Now, the remaining defer cancel() near the top of the function runs last, meaning gRPC servers, API servers, and the app are closed while goroutines using goCtx may still be running.
Reviewed by Cursor Bugbot for commit 7a8da94. Configure here.
| var wtxs []*WrappedTx | ||
| for inner := range s.inner.Lock() { | ||
| if uint64(inner.state.Load().ready.count) >= s.config.TxNotifyThreshold { //nolint:gosec // count is non-negative | ||
| for _, wtx := range inner.inInclusionOrder() { |
There was a problem hiding this comment.
Is the reap cost still O(nlog(n))? Do we care?
There was a problem hiding this comment.
it is, it is done once per block proposal. I don't think we do.
|
|
||
| func NewTxStore() *TxStore { | ||
| func NewTxStore(cfg *Config, app *proxy.Proxy, metrics *Metrics) *txStore { | ||
| softLimit := txCounter{count: cfg.Size + cfg.PendingSize, bytes: utils.Clamp[uint64](cfg.MaxTxsBytes + cfg.MaxPendingTxsBytes)} |
There was a problem hiding this comment.
after we merge size+pendingsize, can a sender flood with high-nonce pending txs to evict ready txs?
There was a problem hiding this comment.
no, inInclusionOrder() contains ready txs first (ready always take precedence during compaction).
| } | ||
|
|
||
| func (txmp *TxMempool) Size() int { return txmp.txStore.State().total.count } | ||
| func (txmp *TxMempool) SizeBytes() uint64 { return txmp.txStore.State().total.bytes } |
There was a problem hiding this comment.
SizeBytes() include pending now? Would this break any outside tools? (metrics/dashboards)
There was a problem hiding this comment.
good point, I didn't notice that Size and SizeBytes took different sets of txs. Reverting.
| // Fetch the evm account state. | ||
| account, ok := inner.accounts[evm.address] | ||
| if !ok { | ||
| // TODO(gprusak): consider whether we should move these queries out of the mutex. |
There was a problem hiding this comment.
Account state (balance, firstNonce) is only refreshed in compact(_, true) from Update. Can a tx get stuck pending if mempool's cached nextNonce drifts from on-chain reality?
There was a problem hiding this comment.
tendermint impl tightly binds mempool state and application state - given that Commit and Update are executed under mempool lock there is no race condition here. So the cached nonce can drift from reality iff node is behind
| if now.Sub(ptx.timestamp) <= p.config.TTLDuration { | ||
| idxFirstNotExpiredTx = i | ||
| break | ||
| if remove { |
There was a problem hiding this comment.
This is only true for Autobahn right? I think you said we will have a totally different(?) implementation for Autobahn?
There was a problem hiding this comment.
Yes, I did. I'll get rid of this mode as soon as autobahn gets its own mempool. This "remove" option is consistent with what we did so far, but it is hardly robust.
| defer c.mtx.Unlock() | ||
| func (c *lruTxCache) Push(txHash types.TxHash) bool { | ||
| if c.size <= 0 { | ||
| return true |
There was a problem hiding this comment.
Is it reasonable to return true here? We didn't actually push anything?
There was a problem hiding this comment.
Here I'm replacing NopCache with LRU cache with capacity 0, so I had to extend the LRU cache semantics. Afaict "true" here is the correct value: it is equivalent to what NopCache did, and result of Push is interpreted in mempool as "whether the value wasn't already present in cache", and NOT "whether the value wasn't already present but now is". I.e. what Push does with capacity 0 is: it pushes element successfully, then immediately evicts it (although implementation do evicting first, then pushing, so that's why we need a special case here).
| if !wtx.readyEl.IsPresent() { | ||
| wtx.readyEl = utils.Some(s.readyTxs.PushBack(wtx.Tx())) | ||
| } | ||
| } |
There was a problem hiding this comment.
Ready counter double-incremented when replacing ready EVM transaction
Medium Severity
In insert(), when replacing a same-nonce EVM tx where the old tx was ready and the new tx's nonce equals account.nextNonce - 1, the replacement block increments state.ready for the new tx (line ~332). Then the nonce-advance loop can find and re-increment state.ready for the same tx if a subsequent nonce also needs promotion. While the direct nonce of the replaced tx is not re-processed (it's below nextNonce), the code path where oldReady is true and the replacement causes account.nextNonce to be exactly at the replacement nonce plus one means readyEl is set in both the replacement block and potentially the loop. The readyEl.IsPresent() guard prevents double-push to readyTxs, but the state.ready.Inc has no such guard.
Reviewed by Cursor Bugbot for commit de2eda1. Configure here.
|
|
||
| if p.config.TTLDuration > 0 { | ||
| idxFirstNotExpiredTx := len(inner.txs) | ||
| for i, ptx := range inner.txs { |
There was a problem hiding this comment.
Shared accPrio map corrupts pending transaction ordering
Medium Severity
The accPrio map in inInclusionOrder() is shared across the ready and pending iteration groups. The minimum-priority cap accumulated from the ready group bleeds into the pending group's sorting. If an account has a low-priority ready tx and a high-priority pending tx, the pending tx gets the low cap from the ready set, incorrectly demoting it relative to other accounts' pending txs. This affects both compaction eviction ordering and Reap() ordering.
Reviewed by Cursor Bugbot for commit de2eda1. Configure here.
| metrics *Metrics | ||
| config *Config | ||
| app *proxy.Proxy | ||
| txLocks *lockMap[types.TxHash] |
There was a problem hiding this comment.
hmm okay, this map has no capacity limit. I guess it's fine because CheckTx should normally be fast.
There was a problem hiding this comment.
it is rate limited on the rpc level.
|
|
||
| func (txmp *TxMempool) BytesNotPending() int64 { | ||
| return txmp.txStore.AllTxsBytes() | ||
| return float64(txmp.Size()) / float64(txmp.config.Size+txmp.config.PendingSize) |
There was a problem hiding this comment.
Is it guaranteed that PendingSize >= 0 or Size + PendingSize > 0? Should we add some checks?
| // * txs which fail Insert() are NOT added to cache and can be reattempted later. | ||
| // * invalid transactions can be recorded via CachePush. | ||
| // * txs dropped due to pruning are removed from cache. | ||
| // * txs successfully executed are kept in cache to avoid reinsert. |
There was a problem hiding this comment.
nit: explain for how long?
There was a problem hiding this comment.
I mean, it is a LRU cache, so if someone tries to aggressively reinsert it, then it will stay there forever.
| rejectedTxs = append(rejectedTxs, inner.txs[i]) | ||
| poppedIndices = append(poppedIndices, i) | ||
| // Cap priority to obtain a linear order of txs per account by nonce. | ||
| // NOTE: this precisely emulates the heap behavior described in this functions docstring. |
There was a problem hiding this comment.
nit: I almost misread it as reap behavior, maybe just say what behavior we are guaranteeing here.
There was a problem hiding this comment.
I don't think duplicating the docstring makes sense. I rephrased this statement. Perhaps it is better now.
| return errSameNonce | ||
| } | ||
| // If the old tx has >= priority, then reject new tx. | ||
| if old.priority >= wtx.priority { |
There was a problem hiding this comment.
Nonce-stuck tx blocks cheaper replacement that would succeed
Medium Severity
In insert(), when an existing tx at the same nonce is pending because account.balance < old.requiredBalance (stuck at nextNonce boundary), a replacement tx with lower requiredBalance (that would become ready and unblock the nonce chain) is rejected if it has lower priority. The stated invariant "we prefer ready tx to pending tx" is violated. This matters for gas-price-replacement scenarios where a user can't afford a high-gas-price tx and submits a cheaper one at the same nonce—the cheaper tx is rejected despite being the only one that can actually execute.
Reviewed by Cursor Bugbot for commit 41a6a18. Configure here.
| } | ||
|
|
||
| func (txmp *TxMempool) Size() int { return txmp.txStore.State().total.count } | ||
| func (txmp *TxMempool) SizeBytes() uint64 { return txmp.txStore.State().ready.bytes } |
There was a problem hiding this comment.
SizeBytes returns ready bytes, not total bytes
Low Severity
SizeBytes() returns ready.bytes while Size() returns total.count (ready + pending). This semantic mismatch means SizeBytes() excludes pending transaction bytes, making it inconsistent with Size() which includes pending transactions in its count. TotalTxsBytesSize() exists for total bytes, so this naming is misleading but functionally mirrors old behavior.
Reviewed by Cursor Bugbot for commit 41a6a18. Configure here.
There was a problem hiding this comment.
that's backward compatible behavior
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 5 potential issues.
There are 16 total unresolved issues (including 11 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
| state.ready.Dec(old.Size()) | ||
| state.ready.Inc(wtx.Size()) | ||
| wtx.readyEl = utils.Some(s.readyTxs.PushBack(wtx.Tx())) | ||
| } |
There was a problem hiding this comment.
EVM replacement marks new tx ready without balance check
Medium Severity
When replacing a same-nonce EVM tx, the code checks if the new tx has sufficient balance only to decide whether to reject it (line 316: oldReady && balance < requiredBalance). But if the old tx is ready and the new tx's requiredBalance IS satisfied, the replacement unconditionally marks the new tx as ready (line 332-335). However, if oldReady is false (the old tx was pending), the new tx is never explicitly checked for readiness at this nonce — it relies solely on the nonce-advancement loop (line 341). The issue is: the nonce-advancement loop starts from account.nextNonce, but if the replaced pending tx's nonce equals account.nextNonce AND balance is now sufficient, the loop will correctly pick it up. Actually the real subtle issue is that oldReady on line 314 is old.evm.nonce < account.nextNonce. If old was ready and we replace it, state.ready.Inc(wtx.Size()) happens. Then the nonce-advancement loop starting at account.nextNonce could ALSO increment ready for subsequent nonces that depend on this one — which is correct. But if the NEW tx has the same nonce but higher requiredBalance that the account CAN satisfy, the oldReady branch makes it ready without verifying account.balance >= evm.requiredBalance for the new tx specifically.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
| nonEvmTxs = append(nonEvmTxs, wtx.Tx()) | ||
| } | ||
| } | ||
| return append(evmTxs, nonEvmTxs...), totalGasEstimated |
There was a problem hiding this comment.
Reap EVM ordering breaks nonce-sequential guarantee within accounts
Medium Severity
Reap splits txs into EVM and non-EVM groups after inInclusionOrder() computes a nonce-respecting priority order. However, the final append(evmTxs, nonEvmTxs...) can break the nonce ordering within a single account if that account has both EVM and non-EVM transactions — though this is unlikely since isEVM is per-tx. More importantly, the split reorders transactions: inInclusionOrder guarantees that if account A has nonces [0,1] interleaved with account B, the original order is preserved. But moving ALL EVM txs first changes the relative order within the EVM group — two different accounts' txs can get reordered relative to inInclusionOrder's computed sequence. This is safe since cross-account ordering doesn't matter for correctness, only within-account nonce order matters, which inInclusionOrder guarantees within each account regardless of the final split.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
| wtx.priority = newPriority | ||
| } | ||
| } | ||
| s.compact(inner, true) |
There was a problem hiding this comment.
Update doesn't clean byNonce before compact
Medium Severity
In txStore.Update, txs are deleted from inner.byHash (line 519) but NOT from inner.byNonce. Then inInclusionOrder() is called inside compact which iterates inner.byHash to get the surviving txs. However, compact resets inner.byNonce before reinserting, so stale entries don't persist. The issue is that isReady (called by inInclusionOrder) accesses inner.accounts[evm.address].nextNonce — but if a deleted tx was the head of a nonce queue and was holding nextNonce at a certain level, the remaining txs' readiness classification during inInclusionOrder (before compact resets state) uses the stale nextNonce. Since compact(inner, true) clears accounts and refetches, this stale classification only affects the ORDER of txs in inInclusionOrder's output (ready vs pending partitioning), which determines eviction priority during compact's reinsertion loop.
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
| err := txmp.Update(ctx, height, []types.Tx{}, make([]*abci.ExecTxResult, 0), mempool.NopTxConstraints(), true) | ||
| require.NoError(t, err) | ||
| }() | ||
| }) |
There was a problem hiding this comment.
Reactor concurrency test has data race on height variables
Low Severity
primaryHeight and secondaryHeight are shared variables accessed by multiple goroutines spawned by wg.Go. While txmp.Lock() serializes Update calls for the same mempool, the checkTxs call on line 362 for the primary mempool also happens inside the same goroutine before the lock. If two primary goroutines run concurrently, they both call checkTxs (which internally calls txmp.CheckTx acquiring only an RLock), then both try to Lock(). Since the height increment is protected by the mempool's own mutex, this is safe for the primary independently and for the secondary independently. However, the Go race detector may flag primaryHeight++ as a data race since it's not protected by any Go-visible synchronization on the variable itself — the mempool mutex doesn't guard the primaryHeight variable in the race detector's view.
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
| l.head = nil | ||
| l.tail = nil | ||
| l.len = 0 | ||
| } |
There was a problem hiding this comment.
CList Clear holds list lock while acquiring element locks
Medium Severity
CList.Clear() holds l.mtx (write lock) and calls el.Next() which acquires el.mtx.RLock(). While the lock ordering (list → element) is consistent with Remove and PushBack, Clear does NOT call setNext(nil) on elements before advancing. After Clear, all removed elements still chain via next pointers, forming a reference chain that prevents garbage collection of intermediate elements until all external CElement references (e.g., from WrappedTx.readyEl) are dropped. This differs from the behavior of iterating and calling Remove on each element (which properly relinks next pointers to skip removed elements).
Reviewed by Cursor Bugbot for commit 2465a33. Configure here.
…i-protocol#3519) Adds two metrics that expose the TxMempool rewrite's compaction behavior without changing logic: `tendermint_mempool_compact_total{trigger}` and `tendermint_mempool_compact_duration_seconds`. The `trigger` label distinguishes the three call sites of `compact()` (`insert_overflow`, `update`, `reap`); rate-of-`insert_overflow` is the capacity-pressure signal. Emitted via OpenTelemetry (`otel.Meter("tendermint_mempool")`) following the sei-db convention. Companion to platform sei-protocol/platform#743 (recording rules, dashboard panels, alerts). The reactor-side gossip-bytes counter and any pending→ready promotion signal are separate followups — the latter is already directly tested by `mempool_pending_promotion_test.js` in the release-test suite, so a metric for it isn't yet pulling weight. ## Metric rationale These metrics let us answer four questions about the rewrite that we couldn't ask cleanly before. ### What each metric is for, when stress-testing **`compact_total{trigger="insert_overflow"}` is the capacity-pressure signal.** When this counter ticks, the mempool was momentarily over `hardLimit` (= 2× softLimit). A non-zero rate means ingress is outpacing block consumption — the only operational question is whether admission control is supposed to be catching it. If both `check_tx_met_drop_utilisation_threshold` *and* `insert_overflow` are rising, pressure escaped the gate. **`compact_total{trigger="update"}` is the block heartbeat.** It fires once per Update, so its rate is a block-rate proxy. If it flatlines while seiload is still pushing, the consensus loop is stuck — different signal from `block_height_delta` because it's measured at the mempool, not the indexer. **`compact_total{trigger="reap"}` is this node's proposer share.** Validators that propose blocks Reap; followers don't. Useful for telling "is this node doing block-proposal work?" without consulting Tendermint internals. **`compact_duration_seconds` is where we'll catch the rewrite's real failure mode.** `compact()` is O(m log m) over the full mempool. As load grows, this latency grows. The number we'll watch is `rate(compact_total[5m]) × avg(compact_duration_seconds)` — wall-time-fraction spent inside compact. Once that approaches ~50%, compact is now blocking block production. The histogram lets us see this coming long before p99 actually hits block-interval (~200ms). ### The composite questions only these together can answer 1. **Does the rewrite's amortization actually hold under load?** `rate(insert_overflow) / rate(inserted_txs)` should stay sub-linear as throughput rises. If the ratio goes to 1, the 2× hardLimit amortization broke and we're back to per-insert prune. 2. **Is compact CPU-bound or block-bound?** Compare per-call duration to per-block compact rate. `compact_duration_p99` rising independently of `insert_overflow` rate means individual compacts got expensive (likely GC pressure on 100k+ entries); both rising means we're pruning more often AND each prune is slower — the failure mode. 3. **Is admission control doing its job or just delaying the problem?** `rate(check_tx_met_drop_utilisation_threshold) / rate(insert_overflow)` — if it's high, the gate is absorbing the spike. If it's near zero while overflow is high, the gate's threshold is mistuned. ### What we're still blind to These metrics don't tell us *why* a compact was slow (sort vs. map-rebuild vs. GC) — that needs pprof or per-phase timing inside `compact()`. Per-peer gossip bandwidth is the next PR. The pending→ready promotion behavior is directly tested by `mempool_pending_promotion_test.js`; a continuous prod signal for it can be added later if operational need surfaces. ## Test plan - [x] `go build ./...` clean - [x] `go test ./internal/mempool/ -count=1` passes - [ ] Validate emissions in a nightly chaos run - [ ] Wire up platform-side recording rules + dashboard panels per sei-protocol#743 References: design doc sei-protocol/platform#743, TxMempool rewrite sei-protocol#3476. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>


Addressed a bunch of issues:
Not addressed:
Consistency has been achieved mostly by moving all the data under a single mutex in txStore. Amortized O(log n) updates were achieved by delaying pruning until mempool size reaches 2*capacity (more precisely capacity is counted both in number of txs and in bytes, so 1 enormous transaction can also trigger recomputation of the mempool, in which case we have amortized O(log n) per tx byte instead)
Additional changes: