
Mempool: tx cache can be filled with invalid txs until memory is exhausted. #2180

Closed
1 of 2 tasks
ValarDragon opened this issue Aug 7, 2018 · 14 comments
Labels
C:mempool Component: Mempool T:bug Type Bug (Confirmed) T:security Type: Security (specify priority)
Milestone

Comments

@ValarDragon
Contributor

ValarDragon commented Aug 7, 2018

Found through HackerOne!

In mempool.CheckTx() we add any incoming transaction to the mempool and cache, and only after that do we run the application's CheckTx. CheckTx has a callback that deletes the tx from the mempool and cache if CheckTx failed. However, it's not deleting the tx from the mempool cache's linked list. This means that an attacker can spam garbage txs at the p2p layer, and they will clog up everyone's mempool. Under our default settings, each of these invalid txs can be up to 1 MB, and the default mempool cache holds 100k of them. This means you can cause out-of-memory errors through the p2p layer, as this allows up to 100 GB of RAM to be used up, given enough time.
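For context, a minimal sketch of the cache shape being described (field and method names are illustrative, not the exact Tendermint code): membership lives in a map, eviction order in a linked list, and the failure-callback path only cleans the map.

```go
package mempool

import (
	"container/list"
	"sync"
)

// txCache sketch: a bounded FIFO cache. The map answers "have we seen
// this tx?", the list tracks insertion order for eviction.
type txCache struct {
	mtx  sync.Mutex
	size int
	map_ map[string]struct{}
	list *list.List // oldest entry at the front
}

func newTxCache(size int) *txCache {
	return &txCache{size: size, map_: make(map[string]struct{}), list: list.New()}
}

// Push adds tx, evicting the oldest list entry (and its map key) when full.
func (c *txCache) Push(tx []byte) bool {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, exists := c.map_[string(tx)]; exists {
		return false
	}
	if c.list.Len() >= c.size {
		front := c.list.Front()
		c.list.Remove(front)
		delete(c.map_, front.Value.(string))
	}
	c.map_[string(tx)] = struct{}{}
	c.list.PushBack(string(tx))
	return true
}

// Remove reflects the bug: the map entry goes away, but the raw tx
// bytes stay pinned in the linked list indefinitely.
func (c *txCache) Remove(tx []byte) {
	c.mtx.Lock()
	delete(c.map_, string(tx))
	// BUG: the matching list element is never unlinked here.
	c.mtx.Unlock()
}
```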

  • This can be quickly patched by making the callback delete from the linked list as well, and writing a test for callbacks deleting from all three locations. We should iterate from the back of the linked list to delete the element from there.

  • Additionally, we should lower the default max tx size from 1 MB; I suggest 100KB. This should be a parameter in the config, though.

Additionally, we should change the CheckTx functionality to only add the tx to the cache after CheckTx passes. This could be done #postlaunch, but it needs to be done. If not, this same wasted-space vector exists for txs that take forever to pass through CheckTx (e.g. a multisig with many accounts). That attack vector is much less impactful and would likely never crash a node, but there's no reason to let someone make you waste extra memory/computation.
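A rough sketch of that reordering, heavily hedged: the interface and every method name below are hypothetical simplifications of the real (asynchronous) ABCI flow, just to show where the cache insertion would move.

```go
package mempool

import "errors"

// txChecker abstracts the pieces involved (hypothetical interface,
// purely for this sketch).
type txChecker interface {
	CacheHas(tx []byte) bool
	CachePush(tx []byte)
	AppCheckTx(tx []byte) error // stand-in for the async ABCI CheckTx call
	MempoolAdd(tx []byte)
}

// checkTxSketch shows the proposed ordering: a tx enters the cache
// (and the mempool) only after the application accepts it.
func checkTxSketch(m txChecker, tx []byte) error {
	if m.CacheHas(tx) {
		return errors.New("tx already in cache")
	}
	if err := m.AppCheckTx(tx); err != nil {
		return err // rejected txs never occupy cache memory
	}
	m.CachePush(tx)
	m.MempoolAdd(tx)
	return nil
}
```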

@ValarDragon ValarDragon added C:mempool Component: Mempool T:security Type: Security (specify priority) labels Aug 7, 2018
@ValarDragon ValarDragon added this to the launch milestone Aug 7, 2018
@ebuchman ebuchman added T:bug Type Bug (Confirmed) C:mempool Component: Mempool and removed C:mempool Component: Mempool labels Aug 8, 2018
@melekes
Contributor

melekes commented Aug 8, 2018

> However, it's not deleting the tx from the mempool cache's linked list.

I raised this issue a long time ago. I will double-check, but this should not be a problem, since old txs get evicted when we push new txs (the linked list has a fixed size).

> Additionally, we should change the CheckTx functionality to only add the tx to the cache after CheckTx passes.

The sole purpose of the cache is to prevent the same txs from entering app.CheckTx(tx []byte).

@melekes
Contributor

melekes commented Aug 8, 2018

And yes, we should enforce

```go
MaxBytes: 10240, // 10kB
```

@ValarDragon
Contributor Author

ValarDragon commented Aug 8, 2018

> I raised this issue a long time ago. I will double-check, but this should not be a problem, since old txs get evicted when we push new txs (the linked list has a fixed size).

The linked list's fixed size is the same as the cache size, which is 100k. This also causes bugs with the linked list trying to evict stuff from the hashmap that isn't in the cache. (Meaning you could make the cache enforce effectively nothing.)

> The sole purpose of the cache is to prevent the same txs from entering app.CheckTx(tx []byte).

If that's the goal, we're failing at it. We would need to not remove the tx from the hashmap, nor from the linked list. Currently we're removing it from the hashmap (which is what we use to actually check this), and not from the linked list. Even without removing from the hashmap, this still lets an adversary fill up your RAM quite aggressively: they could fill up 10KB * 2 * 100k = 2 GB under default params. That would make this problem less critical, but I think still quite subpar. We could cache the tx hash in the hashmap + linked list. That actually makes a lot of sense to me, and removes a lot of this concern. (The amount of maliciously wasted RAM is then in the megabytes.)
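A sketch of the hash-keyed variant (assuming SHA-256 purely for illustration; the hash choice was still open at this point): with a fixed 32-byte key on both sides, the worst case becomes roughly 32 B * 2 * 100k ≈ 6.4 MB instead of gigabytes.

```go
package mempool

import (
	"container/list"
	"crypto/sha256"
)

// txKey returns the fixed-size digest used as the cache key, so the
// cache holds 32 bytes per entry instead of the full (up to 10KB) tx.
func txKey(tx []byte) [sha256.Size]byte {
	return sha256.Sum256(tx)
}

// hashedTxCache sketch: same map + list shape as before, but both
// sides store digests, bounding memory regardless of tx size.
type hashedTxCache struct {
	size int
	map_ map[[sha256.Size]byte]struct{}
	list *list.List // of [sha256.Size]byte values, oldest first
}
```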

Storing txs only in the cache is very problematic, as our current implementation only checks for duplicate txs through the cache. The alternative (until we do my proposal in #2147, which will make searches in the tree log(n)) is to do an O(N) search in CheckTx to make sure we aren't adding a duplicate. This is too expensive. For prelaunch, I'd be fine with raising the mempool cache size by a factor of 10 relative to the actual mempool size, and we just hope no one tries to exploit mempool cache eviction. (It would take 1 million tx receives if we raise the factor, so it's probably only a #postlaunch concern.) The proper solution will come once we alter the data structure, so that we can properly check duplicates in the BST.

We shouldn't be adding things to the mempool until they've passed CheckTx, though. But that can be changed #postlaunch.

@melekes
Contributor

melekes commented Aug 8, 2018

> This also causes bugs with the linked list trying to evict stuff from the hashmap that isn't in the cache. (Meaning you could make the cache enforce effectively nothing.)

You mean this?

```go
// NOTE: the tx may have already been removed from the map
// but deleting a non-existent element is fine
delete(cache.map_, string(poppedTx))
```

How exactly can somebody make it enforce nothing?

> I suggest 100KB

It's 10KB currently, but we're not enforcing it.

> We could cache the tx hash in the hashmap + linked list.

👍 If we do that, I think we don't need the linked list. As for possible duplicates, that's fine, since our docs explicitly say the app should implement replay protection.

@ValarDragon
Contributor Author

The linked list is what handles evicting things from the hashmap. So consider the following sequence (sketched in code after the list):

  1. The hashmap + linked list are filled with good txs.
  2. Spam bad txs: these get removed from the hashmap (since the callback removes them), but each additional push to the linked list evicts a good tx from the end of the linked list and causes it to be evicted from the hashmap.
  3. The linked list now only has bad spam txs; all good txs have been evicted from the hashmap.
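Walking that through against the buggy sketch cache from the issue description (newTxCache, Push, and Remove are the illustrative names used there, assumed in scope), with a tiny capacity of 3:

```go
package main

import "fmt"

func main() {
	cache := newTxCache(3) // illustrative cache from the earlier sketch

	// Step 1: fill with good txs.
	for _, tx := range []string{"good1", "good2", "good3"} {
		cache.Push([]byte(tx))
	}

	// Step 2: spam bad txs. Each Push evicts the oldest good tx from
	// both the list and the map; the failure callback's Remove then
	// cleans only the map, leaving the spam pinned in the list.
	for i := 0; i < 3; i++ {
		bad := []byte(fmt.Sprintf("bad%d", i))
		cache.Push(bad)
		cache.Remove(bad)
	}

	// Step 3: the map has forgotten every good tx, and the list holds
	// only spam: the cache now enforces effectively nothing.
}
```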

> 👍 If we do that, I think we don't need the linked list. As for possible duplicates, that's fine, since our docs explicitly say the app should implement replay protection.

Not sure why this lets us remove the linked list? We use the linked list to figure out, once the cache is too big, which element we need to remove from the hashmap (i.e. the order in which txs came through).

@melekes
Contributor

melekes commented Aug 8, 2018

Oh, right. Thanks for the explanation. Yeah, we need to fix this asap.

@ValarDragon
Contributor Author

I'll start working on the fix for this then!

@fullyallocated

Related: #2185

@ValarDragon
Contributor Author

ValarDragon commented Aug 8, 2018

I didn't realize earlier that syncing the linked list and the hashmap doesn't have to be O(N). There's an easy O(1) solution: just make the hashmap store the linked list element pointer.
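A minimal sketch of that fix (illustrative names, not the exact PR): the map value becomes the *list.Element, so Remove can unlink the list node in O(1) instead of scanning.

```go
package mempool

import (
	"container/list"
	"sync"
)

// txCacheFixed sketch: the map now points at the list element, so a
// removal can update both structures in O(1).
type txCacheFixed struct {
	mtx  sync.Mutex
	size int
	map_ map[string]*list.Element
	list *list.List
}

// Push adds tx, evicting the oldest entry from both structures when full.
func (c *txCacheFixed) Push(tx []byte) bool {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, exists := c.map_[string(tx)]; exists {
		return false
	}
	if c.list.Len() >= c.size {
		front := c.list.Front()
		c.list.Remove(front)
		delete(c.map_, front.Value.(string))
	}
	c.map_[string(tx)] = c.list.PushBack(string(tx))
	return true
}

// Remove now deletes from both the map and the list, in O(1).
func (c *txCacheFixed) Remove(tx []byte) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if e, ok := c.map_[string(tx)]; ok {
		c.list.Remove(e)
		delete(c.map_, string(tx))
	}
}
```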

So the solution we're at now is (AFAIK):

  1. Make Remove delete from the linked list, by making the hashmap store the element pointer. (PR in progress)
  2. Enforce a max tx size. (Being done in the MaxBytes PR)
  3. Make the cache use the tx hash. Blocked on deciding what hash algorithm to use here; we want something fast. Ref: mempool cache: use a fast hash #2187
  4. Consider changing the cache size. (I think we should reduce it until we make the cache use a tx hash)
  5. Make callbacks no longer delete from the cache.

@ValarDragon
Contributor Author

ValarDragon commented Aug 9, 2018

Re item 5: currently we remove bad txs from the cache because they may become valid later. (Thanks for pointing this out, Anton.) We probably need to be more nuanced and check what the ABCI error is before we decide to remove a tx from the cache. An alternative that may be simpler is to have the hashmap also store the block height at which invalid txs were seen, so they keep getting rejected until the subsequent block.
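A rough sketch of that alternative (hypothetical, not what was merged): the cache remembers the height at which a tx failed CheckTx and rejects it only until the chain moves past that height.

```go
package mempool

// rejectCache sketch (hypothetical): remember the height at which a tx
// failed CheckTx and reject it only until the chain advances past it.
type rejectCache struct {
	rejectedAt map[[32]byte]int64 // tx hash -> height of failed CheckTx
}

// ShouldReject reports whether the tx is still within its rejection window.
func (c *rejectCache) ShouldReject(key [32]byte, curHeight int64) bool {
	h, ok := c.rejectedAt[key]
	return ok && curHeight <= h
}

// MarkRejected records the height at which the tx failed CheckTx.
func (c *rejectCache) MarkRejected(key [32]byte, height int64) {
	c.rejectedAt[key] = height
}
```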

#2185 will take care of a lot of this particular problem, but we should probably still have a solution at the mempool level.

@ValarDragon
Contributor Author

@melekes @ebuchman How do you guys feel about reducing the cache size until we use a tx hash? I think we should make the cache size something like 10k until we implement the cache using a tx hash. (Per #2187, that will be postlaunch.) SHA2 is too slow here to make sense (imo), unless we used AVX2 SHA2.

Then, once the MaxBytes and linked-list-syncing PRs are merged, we're done with this issue prelaunch.

@ebuchman
Contributor

> SHA2 is too slow here to make sense

What is this based on?

@ValarDragon
Contributor Author

Thanks for asking! It turns out the number I was basing that on was for the 1MB case; I had forgotten to adjust it now that we've reduced the max tx size to 10KB.

```
BenchmarkSHA256_1mb-8        500    2432877 ns/op
BenchmarkSHA256_10kb-8     50000      24605 ns/op
```

SHA2 should be fine at 10KB :)
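For reference, one plausible way numbers like these could be produced (a sketch; the actual benchmark code isn't shown in the thread):

```go
package mempool

import (
	"crypto/sha256"
	"testing"
)

// benchSHA256 hashes a zero-filled payload of the given size once per
// iteration, mirroring the cost of hashing a max-size tx.
func benchSHA256(b *testing.B, size int) {
	payload := make([]byte, size)
	b.SetBytes(int64(size))
	for i := 0; i < b.N; i++ {
		sha256.Sum256(payload)
	}
}

func BenchmarkSHA256_1mb(b *testing.B)  { benchSHA256(b, 1<<20) }
func BenchmarkSHA256_10kb(b *testing.B) { benchSHA256(b, 10*1024) }
```

Run with `go test -bench SHA256` to reproduce comparable per-op timings on your own hardware.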

@melekes
Contributor

melekes commented Aug 29, 2018

Merged to develop.

@melekes melekes closed this as completed Aug 29, 2018