Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mempool: rework lock discipline to mitigate callback deadlocks #9033

Merged
merged 1 commit into from
Jul 19, 2022

Conversation

creachadair
Copy link
Contributor

@creachadair creachadair commented Jul 18, 2022

This is a manual backport of #9030.

@creachadair creachadair force-pushed the mjf/outer-lock-34 branch 4 times, most recently from f3a11d2 to f2dacf2 Compare July 19, 2022 20:29
…ort #9030)

(manual cherry-pick of commit 22ed610)

The priority mempool has a stricter synchronization requirement than the legacy
mempool. Under sufficiently-heavy load, exclusive access can lead to deadlocks
when processing a large batch of transaction rechecks through an out-of-process
application using the socket client.

By design, a socket client stalls when its send buffer fills, during which time
it holds a lock shared with the receive thread.  While blocked in this state, a
response read by the receive thread waits for the shared lock so the callback
can be invoked.

If we're lucky, the server will then read the next request and make enough room
in the buffer for the sender to proceed. If not however (e.g., if the next
request is bigger than the one just consumed), the receive thread is blocked:
It is waiting on the lock and cannot read a response.  Once the server's output
buffer fills, the system deadlocks.

This can happen with any sufficiently-busy workload, but is more likely during
a large recheck in the v1 mempool, where the callbacks need exclusive access to
mempool state.  As a workaround, process rechecks for the priority mempool in
their own goroutines outside the mempool mutex.  Responses still head-of-line
block, but will no longer get pushback due to contention on the mempool itself.
@creachadair
Copy link
Contributor Author

The upload-coverage-report failure is not related; it looks like codecov has an expired cert. As far as I can tell nobody ever looks at those reports anyway.

@creachadair creachadair merged commit 64dfeb7 into v0.34.x Jul 19, 2022
@creachadair creachadair deleted the mjf/outer-lock-34 branch July 19, 2022 20:48
thanethomson added a commit that referenced this pull request Aug 16, 2022
Signed-off-by: Thane Thomson <connect@thanethomson.com>
thanethomson added a commit that referenced this pull request Aug 16, 2022
Signed-off-by: Thane Thomson <connect@thanethomson.com>
thanethomson added a commit that referenced this pull request Aug 17, 2022
Signed-off-by: Thane Thomson <connect@thanethomson.com>
thanethomson added a commit that referenced this pull request Aug 18, 2022
* Make reindex-event cmd docs consistent with other commands

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add warning regarding DiscardABCIResponses to BlockResults Go API

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Update OpenAPI spec to reflect discard_abci_responses change

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add release highlights to CHANGELOG_PENDING

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add pending changelog entry for #9033

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Format pending changelog entries consistently

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Correct and simplify comment wording

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Remove changelog entry regarding storage section

Signed-off-by: Thane Thomson <connect@thanethomson.com>

Signed-off-by: Thane Thomson <connect@thanethomson.com>
cmwaters pushed a commit to cmwaters/tendermint that referenced this pull request Dec 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants