mempool lock can induce large adversarial delay in block reap functions #2972

Open
ValarDragon opened this Issue Dec 5, 2018 · 2 comments


ValarDragon commented Dec 5, 2018

Currently the mempool takes a mutex lock in every CheckTx (https://github.com/tendermint/tendermint/blob/master/mempool/mempool.go#L300) and in Reap (https://github.com/tendermint/tendermint/blob/master/mempool/mempool.go#L467).

However, Reap gets no priority in acquiring the lock once it is free: it contends with every queued CheckTx, and there can be a large number of queued CheckTxs. In the cosmos-sdk, tendermint is run in-process and CheckTxs execute synchronously, so the mempool lock is held until that computation (i.e. signature verification) completes. The golang secp256k1 implementation takes ~250 microseconds to verify a signature, and a cosmos tx can be crafted at ~150 bytes while still triggering signature verification. A 1 Mbit/s line (~131 KB/s, i.e. ~873 such txs per second) therefore lets an adversary induce 873 * 250 microseconds ≈ 218 ms of verification time on the validator per second. Scaling linearly with throughput, a 10 Mbit/s connection induces ~2.18 seconds of delay per second. (This assumes exponential back-off on the mutexes doesn't further increase the time.) Note that the attacker can keep scaling the number of txs they send with network throughput, since the check for the mempool already being full happens after the lock is taken.
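For concreteness, a quick sketch of that arithmetic (the 250 µs and 150 byte figures are the estimates above, not measured constants):

```go
package main

import "fmt"

func main() {
	const (
		verifySecs = 250e-6 // ~250 µs per golang secp256k1 verification (estimate above)
		txBytes    = 150.0  // smallest cosmos tx that still triggers sig verification (estimate above)
	)
	for _, mbit := range []float64{1, 10} {
		bytesPerSec := mbit * 1024 * 1024 / 8 // Mbit/s -> bytes/s
		txsPerSec := bytesPerSec / txBytes
		fmt.Printf("%2.0f Mbit/s -> %4.0f txs/s -> %.2f s of verification per second\n",
			mbit, txsPerSec, txsPerSec*verifySecs)
	}
}
```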

Supposing no new CheckTx contentions arrived, since Reap has no priority over the queued CheckTxs, one would expect it to be served around the halfway mark (so at 10 Mbit/s, with ~2.18 seconds of induced validator delay per second, Reap waits ~1.09 seconds in expectation). In practice the attacker would continue their CheckTx spam, pushing this bound further.* Additionally, the newer CheckTxs will have faster "refresh rates", since their exponential back-off is smaller, whereas Reap's back-off interval will have grown; this increases the delay still further.

There are numerous solutions and stopgaps. The fundamental problem is the need for locking in the mempool, which can be addressed via #2147 and #2484; however, those are longer-term solutions.

The maximum mempool size was recently decreased, also to mitigate large delays from rechecking within Update. However, we don't check the maximum mempool size before taking the lock, and we should (see the sketch below). The attacker's power then no longer scales with the bandwidth of their connection to you; it is bounded by your mempool size less the number of txs you already have. This is still highly relevant: even a small mempool of 5000 txs opens you up to 5000 * 250 microseconds = 1.25 seconds of delay, and at least ~0.6 seconds of expected delay to Reap. It also doesn't stop an attacker from continually sending more txs as slots free up, so that ~0.6 can probably still be pushed higher.
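A minimal sketch of that pre-lock size check; the field names here are simplified stand-ins, not the real struct's:

```go
package mempool

import (
	"errors"
	"sync"
	"sync/atomic"
)

var ErrMempoolIsFull = errors.New("mempool is full")

// Mempool is a simplified stand-in for the real struct; only the
// fields this sketch needs are shown.
type Mempool struct {
	mtx     sync.Mutex
	maxSize int64
	size    int64 // atomically updated count of txs currently held
}

// CheckTx rejects txs from a full mempool *before* contending on the
// mutex, so spam against a full mempool no longer queues work ahead of Reap.
func (mem *Mempool) CheckTx(tx []byte) error {
	// Cheap lock-free pre-check. It can race with concurrent CheckTxs,
	// so it is repeated under the lock, but it keeps the flood of
	// goroutines off the mutex entirely.
	if atomic.LoadInt64(&mem.size) >= mem.maxSize {
		return ErrMempoolIsFull
	}

	mem.mtx.Lock()
	defer mem.mtx.Unlock()
	if atomic.LoadInt64(&mem.size) >= mem.maxSize {
		return ErrMempoolIsFull
	}

	// ... run the app's CheckTx and append to the list (elided) ...
	atomic.AddInt64(&mem.size, 1)
	return nil
}
```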

Making CheckTx actually asynchronous, such that the mempool doesn't hold its lock over the app's execution time, would also remove this problem, though the delays that may introduce into normal execution may not be worth it (sketched below).
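A minimal sketch of that idea, under the assumption that the expensive app-side check can run outside the lock (the names here are illustrative, not the real API):

```go
package mempool

import "sync"

// asyncMempool illustrates running the expensive app-side check
// outside the lock; the mutex is only held for the cheap list append
// once the app's response arrives.
type asyncMempool struct {
	mtx sync.Mutex
}

// appCheckTx stands in for the (possibly slow, e.g. sig-verifying)
// application CheckTx call.
func appCheckTx(tx []byte) bool { return len(tx) > 0 }

func (mem *asyncMempool) CheckTx(tx []byte, done func(ok bool)) {
	go func() {
		ok := appCheckTx(tx) // expensive work: no lock held here
		if ok {
			mem.mtx.Lock()
			// ... append tx to the list (elided) ...
			mem.mtx.Unlock()
		}
		// Note: txs may now be admitted out of arrival order, which is
		// part of the normal-execution cost mentioned above.
		if done != nil {
			done(ok)
		}
	}()
}
```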

In a similar theme to the recent deliver tx vs. deliver block discussions, a batched CheckTx would strongly alleviate this problem, provided the mempool buffered its CheckTxs before taking the lock (sketched below).
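A sketch of such buffering; the channel-based queue and the maxBatch constant are assumptions for illustration, not the actual proposal:

```go
package mempool

import "sync"

// batchingMempool queues incoming txs on a channel; one worker drains
// up to maxBatch of them per lock acquisition, so Reap contends with
// one lock-holder per batch instead of one per tx.
type batchingMempool struct {
	mtx     sync.Mutex
	pending chan []byte
}

const maxBatch = 256

func newBatchingMempool() *batchingMempool {
	mem := &batchingMempool{pending: make(chan []byte, 4096)}
	go mem.checkTxLoop()
	return mem
}

// CheckTx just enqueues; it never touches the mutex.
func (mem *batchingMempool) CheckTx(tx []byte) {
	mem.pending <- tx
}

func (mem *batchingMempool) checkTxLoop() {
	for tx := range mem.pending {
		batch := [][]byte{tx}
		// Drain whatever else is already queued, up to maxBatch.
	drain:
		for len(batch) < maxBatch {
			select {
			case next := <-mem.pending:
				batch = append(batch, next)
			default:
				break drain
			}
		}
		mem.mtx.Lock()
		for range batch {
			// ... run the app's CheckTx for each tx (elided) ...
		}
		mem.mtx.Unlock()
	}
}
```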

* We should redo the math for the continually-refreshed case; it may turn out that the above mitigations don't actually solve it, in which case the removal of the mutexes should happen relatively soon.


ValarDragon commented Dec 5, 2018

Just realized: we can probably remove the lock between CheckTx and Reap, and that would solve the problem. Reap in the current implementation is non-mutative and begins its search from the head of the mempool's linked list, while CheckTx appends to the tail. (CheckTx may mutate txs in between, but it's okay if the mempoolTx that Reap pulls is not the latest one from CheckTx, since CheckTx only changes metadata.) Reap doesn't actually remove txs. Thus Reap and CheckTx could run concurrently. What can't run concurrently are Update and CheckTx, and Update and Reap.
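A sketch of one way to express that split, assuming the underlying tx list (like the clist.CList the mempool uses) remains safe for concurrent append and iteration:

```go
package mempool

import "sync"

// mempool replaces the single mutex with an RWMutex: CheckTx and Reap
// take only a read lock, so they can run concurrently with each other;
// Update, which removes txs, takes the write lock and excludes both.
type mempool struct {
	updateMtx sync.RWMutex
	// txs would be a concurrency-safe linked list (e.g. clist.CList).
}

func (mem *mempool) CheckTx(tx []byte) error {
	mem.updateMtx.RLock() // concurrent with Reap and other CheckTxs
	defer mem.updateMtx.RUnlock()
	// ... verify the tx and append it to the tail of the list (elided) ...
	return nil
}

func (mem *mempool) Reap(maxTxs int) [][]byte {
	mem.updateMtx.RLock() // no longer queued behind pending CheckTxs
	defer mem.updateMtx.RUnlock()
	// ... iterate from the head of the list, copying txs (elided) ...
	return nil
}

func (mem *mempool) Update(committed [][]byte) {
	mem.updateMtx.Lock() // exclusive: mutates and removes list elements
	defer mem.updateMtx.Unlock()
	// ... remove committed txs and recheck the rest (elided) ...
}
```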

ValarDragon added this to the launch milestone Dec 10, 2018


gamarin2 commented Jan 9, 2019

This was exploited during GoS (Game of Stakes). With a mempool of 5000 txs and a block time of ~10 sec, we missed 90% of blocks. Constantly flushing the mempool did not help.

Note that our mempool grew very large because we set min fees in gaiad.toml, which was not working at the time. This has since been fixed. The problem is not apparent right now because everyone is running with high min fees, so mempool sizes stay small.

If we expect a large mempool on mainnet (thousands of txs), which I think is a reasonable expectation, this indeed needs to be fixed.

ebuchman modified the milestones: launch, v0.29.0 Jan 13, 2019
