Deadlock in Berkeley DB between msghand and txnotify threads #4502
**Labels:** A-wallet-database (Area: Wallet database and serialization), C-bug (Category: This is a bug), thread safety
### Describe the issue

As observed in a fork of Zcash last synced to Zcash 2.1.0-1 upstream, when running a node with an unusually large `wallet.dat` (of 2+ GB in this case), the node would deadlock every few days (or every few thousand blocks) with `cs_main` taken and never released.

### Can you reliably reproduce the issue?

Unfortunately, it kept recurring, but we've since worked around it and don't want to see it again.
### If so, please list the steps to reproduce below

Something like this will likely trigger it again: let `wallet.dat` grow to 2+ GB, then observe the node start to lock up every few days.

### Expected behaviour
The node should continue working without locking up.
### Actual behaviour + errors

The node locks up. When the metrics screen is enabled, it stops being updated. There's no relevant error in `debug.log`, and nothing new gets written there while the node is locked up.

### The version of Zcash you were using

Resistance Core 2.1.0-2 (last synced to Zcash 2.1.0-1 upstream).
### Machine specs

### Any extra information that might be useful in the debugging process
My analysis of the problem is as follows:

The `msghand` and `txnotify` threads both proceeded through a chain of calls eventually ending up in Berkeley DB. There they deadlocked against each other, presumably (I didn't analyze this in detail) on BDB taking two mutexes in a different order in each thread. `msghand` has `cs_lock` taken; `txnotify` does not.

A workaround, which seems to have worked, is to have `txnotify` take `cs_lock` before proceeding into those calls, which prevents `msghand` from also going there simultaneously.

Another workaround (untested) could be to introduce locks in Zcash's BDB wrapper functions. That would cover all potential code paths leading to the deadlock (not just the specific scenario above) and would possibly be cheaper performance-wise (no single global lock held for long), but would result in a more invasive patch.
One similar (but not identical) bug in BDB is documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1349779
Reviewing the upstream BDB 6.2.x fixes listed at https://download.oracle.com/otndocs/products/berkeleydb/html/changelog_6_2.html, the closest match I see is "Fixed a bug that may cause self-deadlock during database compaction. [#23725]". That looks like the same issue as the Red Hat bug above, but not exactly the same as ours (Red Hat's backtraces include database compaction; ours don't). So upgrading BDB to the latest 6.2.x isn't very promising.
One thing I'm confused about is why Zcash even uses BDB, given that Bitcoin Core seems to have migrated to LevelDB before Zcash forked from it. Moreover, Zcash brings its own choice of BDB version, different from what Bitcoin had used. Was this a deliberate preference over both LevelDB and Bitcoin's older BDB?
### Do you have a backup of the `~/.zcash` directory and/or a VM snapshot?

I have `gdb` backtraces, here: