Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: deadlock #4674

Merged
merged 1 commit into from
Sep 14, 2022
Merged

Conversation

SWvheerden
Copy link
Collaborator

@SWvheerden SWvheerden commented Sep 13, 2022

Description

This fixes a deadlock in the code.

The RwLock allows a mutex to exist that allows concurrent reads but blocks concurrent writes, this behaves differently depending on the OS.
On Linux systems: The lock favors reads, meaning that as long as there exists a reader lock, it will allow a new reader lock to open. This means that the system can cause starvation of writers.
On Mac/Win the lock has equal ordering, this means that as soon as a writer queues for a lock, all additional readers will be blocked till after the writer has acquired and released its lock.

This behavior can be dangerous if recursive locks are used, as was the case here.

At about the same time, a block was submitted, and a template was constructed for a new miner.
The add_block process requires a write lock, while the block template process requires a read lock.
The template process was first in acquiring a lock on the read, followed shortly by the add_block on the blocking for a write. But the deadlock was caused after the add_block blocked for a write, the block template required an additional read_lock on the calculation of the mmr roots. And thus, the entire block_chain db class is deadlocked.

Fixes: #4668

Copy link
Member

@sdbondi sdbondi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find, the rust docs now (not sure since when) have an example of this

https://doc.rust-lang.org/stable/std/sync/struct.RwLock.html

@sdbondi
Copy link
Member

sdbondi commented Sep 14, 2022

Took the liberty of proving/understanding the spurious deadlock, running this on playpen sometimes deadlocks (timeout), but on my local Linux machine it always deadlocks.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c8d94e50a665a4c15d13e3b3886703d8

Same behaviour for tokio async locks.

This only happens when obtaining a read lock more than once at the same time from the same thread.

Copy link
Collaborator

@stringhandler stringhandler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find. I feel like there might be more

@stringhandler stringhandler merged commit 996a88a into tari-project:development Sep 14, 2022
@SWvheerden SWvheerden deleted the sw_fix_deadlock branch September 14, 2022 08:36
sdbondi added a commit to sdbondi/tari that referenced this pull request Sep 16, 2022
* development: (72 commits)
  fix: reinsert transactions from failed block (tari-project#4675)
  fix: stray clippy error (tari-project#4685)
  fix(wallet): mark mined_height as null when pending outputs are cancelled (tari-project#4686)
  chore: updated dependancies (tari-project#4684)
  fix(p2p): remove DETACH flag usage (tari-project#4682)
  fix(comms): simplify and remove possibility of deadlock from pipelines and substream close (tari-project#4676)
  feat(ci): add default CI and FFI testing with custom dispatch (tari-project#4672)
  chore: remove broken test (tari-project#4678)
  fix: fix potential race condition between add_block and sync (tari-project#4677)
  fix deadlock (tari-project#4674)
  fix: add burn funds command to console wallet (see issue tari-project#4547) (tari-project#4655)
  v0.38.3
  fix: fee estimate (tari-project#4656)
  fix(comms/messaging): fix possible deadlock in outbound pipeline (tari-project#4657)
  fix: replace Luhn checksum with DammSum (tari-project#4639)
  fix(core/sync): handle deadline timeouts by changing peer (tari-project#4649)
  fix(ci): libtor build on Ubuntu (tari-project#4644)
  chore: fix log (tari-project#4634)
  v0.38.2
  fix(comms/rpc): detect early close in all cases (tari-project#4647)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[base node] Block chain db locks up with read-write lock
3 participants