mempool: Change data structure #2147

Closed
ValarDragon opened this issue Aug 5, 2018 · 6 comments
Labels
C:mempool Component: Mempool S:proposal Status: Proposal T:enhancement Type: Enhancement
Comments


ValarDragon commented Aug 5, 2018

Currently our mempool uses a linked list with a cache implemented as a hashmap (default size 100k).
The operations involved have the following asymptotics (assume the mempool already has N items in it):

- Add Tx: O(1) check in the cache, locks on the cache. (Note this involves a non-negligible hash, so the constant here matters!) It doesn't check whether the tx is in the main mempool linked list, which means we can add duplicate txs if the mempool size in the config is greater than the cache size.
- Update k Txs: O(N), progresses serially. This doesn't parallelize because we iterate through every tx and block on the recheck-Tx response. We also lock on the cache here, so parallelizing this part would still cause the threads to hang at the last step.
- Reap k Txs: O(k) locks, unsorted. The cache locks and unlocks here.
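For concreteness, here's a minimal sketch of what that shape looks like; the names and eviction details are illustrative, not the actual clist/cache code:

```go
// Illustrative sketch only (not the real tendermint mempool code): a FIFO
// linked list of pending txs plus a bounded "seen" cache keyed by tx hash.
package mempool

import (
	"container/list"
	"crypto/sha256"
	"sync"
)

type txKey [sha256.Size]byte

type Mempool struct {
	mtx      sync.Mutex
	txs      *list.List         // pending txs in insertion order
	cache    map[txKey]struct{} // bounded duplicate-suppression cache
	cacheCap int
}

// AddTx is O(1): one hash plus a map lookup. The duplicate check only
// consults the cache, so if the cache has evicted a tx that is still in
// the list, the same tx can be appended a second time.
func (mp *Mempool) AddTx(tx []byte) {
	key := txKey(sha256.Sum256(tx))

	mp.mtx.Lock()
	defer mp.mtx.Unlock()

	if _, seen := mp.cache[key]; seen {
		return
	}
	if len(mp.cache) >= mp.cacheCap {
		// Evict an entry here (the real cache keeps LRU order); omitted.
	}
	mp.cache[key] = struct{}{}
	mp.txs.PushBack(tx)
}
```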

The cache locking and unlocking on every operation is a huge barrier to scaling. We also want our mempool to be sorted, which with the current list would make our Add operation O(N). (Sorting is a property required of basically all blockchain mempools, to handle periods of high tx volume.)

I propose that we switch to the following concurrent AVL tree, which doesn't require locks: http://ddana.cswp.cs.technion.ac.il/wp-content/uploads/sites/19/2015/12/logicalorderingavl.pdf. Looking at table 1, it handles large numbers of threads quite well. Our expected number of txs in the mempool is between 10^3 and 10^5, I'd guess. No locking means we don't have to fear deadlocks in our implementation, and the implementation is significantly simpler. The other implementations suggested there either have locks or "virtual nodes", which opens up fears of memory leaks. (Note this suggestion presumes implementation of an ABCI "priority" function for txs.)

This gives us the sorting property we want and gets rid of all locks, so we can easily support parallelism in Add Tx and Update Tx. (No need to even lock between them!) (It's not fully clear to me whether Reap can be parallelized as well, but we don't need parallelism in Reap at all.) We also use lots of extra memory in our CList, so we would get a significantly more memory-efficient implementation. (This gets rid of the cache as well.)

Our operations are all now either O(log n), or O(k log n) for the batch variants. We get the advantages that we can execute them all in parallel safely, the tree is self-balancing, and there are no constant hash overheads.
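As a rough illustration of the shape this would take (hypothetical names; the ABCI priority field and all of these methods are assumptions, not existing Tendermint APIs):

```go
// Hypothetical interface for a priority-ordered mempool backed by a balanced
// (ideally lock-free) search tree keyed by (priority, tx hash).
package mempool

type PriorityMempool interface {
	// AddTx inserts a tx by priority; O(log n), safe to call concurrently
	// with Update and other AddTx calls.
	AddTx(tx []byte, priority int64) error

	// Update removes the k txs committed in the last block and triggers
	// recheck for the remainder; O(k log n) rather than a full O(n) scan.
	Update(committed [][]byte) error

	// Reap returns up to max txs in descending priority order; O(k log n).
	// This can stay single-threaded; we don't need parallelism in Reap.
	Reap(max int) [][]byte
}
```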

@ValarDragon

cc @jack


jaekwon commented Sep 4, 2018

The purpose of the mempool keeping its order is to keep determinism in CheckTx. This gives us the property that a proposer, as long as it is running the existing mempool logic, is guaranteed to propose a block whose txs all pass CheckTx. That's maybe not a property that the Cosmos Hub cares about, but it might be important for other applications.

There may be no need to optimize this part, because blocks shouldn't take 100% of the available CPU time; otherwise nodes that are behind will never be able to catch up. Ideally each block only requires 10% or so of the processing time available, so this extra synchronous CheckTx step isn't particularly onerous, as far as I can tell. The block has already been committed, it's time to execute it, and after we do so we'll probably want to wait a bit before the next block anyway.

I'm not sure that concurrent AVL tree is sufficient for determinism... we should discuss strategies for enabling parallel computation while still preserving determinism, and yeah, there are a number of algorithms that give us the ability to merge trees together after somewhat isolated processing. In any case, the design space of how to parallelize transactions on a single chain is large and warrants its own mega-discussion thread, because there's a lot to discuss and explore there... mostly after launch :)


ValarDragon commented Sep 4, 2018

Totally agree that this is all post-launch :)

> The purpose of the mempool keeping its order is to keep determinism in CheckTx. This gives us the property that a proposer, as long as it is running the existing mempool logic, is guaranteed to propose a block whose txs all pass CheckTx.

I believe that when building a block it is reasonable to have the software default to requiring that CheckTx passes for all txs. I agree that this set of CheckTx calls can be synchronous (this is what I meant when I said we don't need parallelism in Reap at all). This may also just be safer in the situation where we go with my recheck proposal and a chain has a MsgChangePubkey but forgets to account for it in recheck.

> There may be no need to optimize this part, because blocks shouldn't take 100% of the available CPU time; otherwise nodes that are behind will never be able to catch up.

I wrote this from the perspective that mempool sorting is something we eventually want. If we were to go with a scheme similar to Vitalik Buterin's min fee (which is very similar to the min fee you've described in person), then we would still need mempool sorting. GitHub issue describing it: zcash/zcash#3473, paper: https://ethresear.ch/uploads/default/original/2X/1/197884012ada193318b67c4b777441e4a1830f49.pdf, ethresearch post: https://ethresear.ch/t/draft-position-paper-on-resource-pricing/2838. The TL;DR is that it makes for a more socially optimal fee mechanism by having a consensus-determined min fee. This min fee gets burned and doesn't go to the miner. The min fee adapts based on network conditions to target blocks being 50% full (the threshold can be altered). A tx then carries a miner reward on top of the fee to incentivize inclusion on-chain. He does a significant amount of analysis to show why this is better.
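To illustrate the kind of update rule involved (only a rough sketch of the idea, not the exact formula from the paper; all names here are made up):

```go
// Rough sketch of an adaptive min fee: nudge the consensus-determined min
// fee toward a target block fullness, e.g. 50%.
package minfee

// NextMinFee raises the min fee when the last block used more gas than the
// target and lowers it otherwise. adjustQuotient bounds the per-block move
// (e.g. 8 allows at most ~12.5% change per block).
func NextMinFee(prevMinFee, gasUsed, gasLimit int64, targetFull float64, adjustQuotient int64) int64 {
	target := int64(float64(gasLimit) * targetFull)
	if target <= 0 || adjustQuotient <= 0 {
		return prevMinFee
	}
	delta := prevMinFee * (gasUsed - target) / target / adjustQuotient
	next := prevMinFee + delta
	if next < 1 {
		next = 1 // keep a floor so the fee can rise again after empty blocks
	}
	return next
}
```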

Given that we want mempool sorting, we need to adapt our data structure anyway. (A linked list is too slow.) I'd even venture that a linked list is too slow without the need for sorting; the part that is particularly concerning performance-wise right now is our Update function (filtering out txs that have already appeared). It is O(N), and each update involves a mutex lock, so it can't be parallelized.

My concern is mostly not with the above, though. (As you say, as long as we limit the amount of CPU that takes, it alone wouldn't affect block production / verification.) My concern is primarily that we currently can't keep receiving / broadcasting txs while updating / rechecking. This limits our p2p network, especially with low block times, and it will be true of any data structure that isn't concurrency safe. Once we add a concurrency-safe data structure, we can receive a new block and new txs in parallel (while staying sorted and balanced!). This means that our p2p network will be more robust.

> I'm not sure that concurrent AVL tree is sufficient for determinism... we should discuss strategies for enabling parallel computation while still preserving determinism, and yeah, there are a number of algorithms that give us the ability to merge trees together after somewhat isolated processing. In any case, the design space of how to parallelize transactions on a single chain is large and warrants its own mega-discussion thread, because there's a lot to discuss and explore there... mostly after launch :)

Totally agree that on-chain parallelization is its own mega-discussion, haha. I'm not really sure what determinism you are worried about here. (I think my paragraph above may address your concern?) If it's w.r.t. Reap, I do agree with keeping Reap unparallelized, and I even think enforcing CheckTx there is reasonable.

@ValarDragon

Jae and I had an in-person conversation about this a while ago. I forgot to write up the results of it.

The idea we had converged on by the end of the conversation was basically to have a concurrent tree, with sorting enabled, storing the bulk of the mempool, with txs verified independently. However, each validator should also maintain a few "threads" of txs. The idea of a "thread" is that its txs are all verified to pass CheckTx when run sequentially. (So the proposer can basically instantly propose its highest-value thread.) This is unlike the mempool, whose ordering constantly changes.

When initially building a thread, you make it sorted by value. If you detect that you are proposing soon, you check whether your mempool contains any txs with greater fees than what you have in the thread. Suppose there are n txs in the mempool better than what's in the thread; then you remove the last n txs from the thread and put the new ones in.
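A small sketch of that refresh step under one interpretation (hypothetical types and names; the real design would also re-run CheckTx sequentially over the modified suffix):

```go
// Illustrative only: replace the lowest-fee tail of a "thread" with
// higher-fee txs from the mempool right before proposing.
package mempool

import "sort"

type Tx struct {
	Bytes []byte
	Fee   int64
}

// RefreshThread assumes mempoolByFee is sorted by descending fee. It counts
// how many mempool txs beat the thread's current worst tx, drops that many
// txs from the thread's tail, and splices the better ones in. The caller
// would then re-check the modified suffix sequentially before proposing.
func RefreshThread(thread, mempoolByFee []Tx) []Tx {
	if len(thread) == 0 || len(mempoolByFee) == 0 {
		return thread
	}
	// Keep the thread sorted by fee, highest first.
	sort.Slice(thread, func(i, j int) bool { return thread[i].Fee > thread[j].Fee })

	worst := thread[len(thread)-1].Fee
	n := 0
	for _, tx := range mempoolByFee {
		if tx.Fee <= worst || n >= len(thread) {
			break
		}
		n++
	}

	kept := thread[:len(thread)-n]
	out := append([]Tx{}, kept...)
	return append(out, mempoolByFee[:n]...)
}
```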

The idea behind multiple threads is to reduce re-computation when you receive another person's block. There is a lot of design work for how to do this thread concept optimally. Please note that the thread design can be done incrementally: phase 1 may just have a single thread and a slightly slower update process right before you Reap; later phases can incorporate more threads, more improvements for incremental updates, and detection of when you will be proposing soon.


xla commented Sep 25, 2018

It would be extremely valuable to write this up in an ADR; created #2484 to keep track.

@tac0turtle

The idea is encapsulated in #2484, and there is also a link from that issue to this one. The information will not be lost.
