RPC stops working and http queue fills up #4970

Closed
josef-v opened this issue Jan 29, 2021 · 35 comments
Assignees
Labels
A-rpc-interface Area: RPC interface C-bug Category: This is a bug S-committed Status: Planned work in a sprint

Comments

@josef-v

josef-v commented Jan 29, 2021

version: v4.1.1, v4.1.0 (not sure about the v4.0.0)

WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting

We are encountering a problem with the RPC interface. At some point, RPC calls stop being serviced by the RPC server and the HTTP queue fills up. This state is recoverable only by restarting zcashd.

We are encountering this issue approximately once a day.

The node is used together with the wallet; coinbase outputs are being shielded and z_sendmany calls are being made on this node. (Full list of methods used on this node: backupwallet, getinfo, getmempoolinfo, getnetworkinfo, gettransaction, getunconfirmedbalance, getwalletinfo, listunspent, settxfee, validateaddress, z_getbalance, z_getoperationresult, z_getoperationstatus, z_listoperationids, z_sendmany, z_shieldcoinbase).

There is no mining happening on this node, so no getblocktemplate calls are made (important when reading the log, as getblocktemplate is not logged by the RPC server for some reason).

A similar issue was reported on Zcash's tech support Discord channel by another party.

Snippet of the debug log: debug.log

@LarryRuane
Collaborator

LarryRuane commented Feb 4, 2021

Some RPCs are probably getting blocked and never completing. By default there are four zcash-httpworker threads. Once all of them become stuck, the system queues up to 16 more RPCs; after that, any further RPC request is rejected with that error.
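
As a stopgap while we investigate (not a fix, since the stuck RPCs never complete), those limits can be raised at startup; a sketch, assuming the defaults of 4 worker threads and a 16-deep queue, and assuming -rpcthreads is available here as in upstream Bitcoin Core:

$ zcashd -rpcthreads=8 -rpcworkqueue=64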

Could you please start zcashd with the -debug=rpc command line argument, allow the problem to happen, and attach interesting parts of debug.log here?

That may not give us enough information. If you're building from source, would you mind applying this patch and reproducing the problem (specifying -debug=rpc) and attaching the log file? This patch adds logging when each RPC completes, in addition to when each RPC begins. We should be able to see which RPCs are never completing (I expect it will be one specific RPC). We still won't know why they're not completing, but at least this will narrow it down to a particular RPC.

--- a/src/rpc/server.cpp
+++ b/src/rpc/server.cpp
@@ -493,13 +493,22 @@ UniValue CRPCTable::execute(const std::string &strMethod, const UniValue &params
 
     g_rpcSignals.PreCommand(*pcmd);
 
+    LogPrint("rpc", "enter method=%s\n", SanitizeString(strMethod));
     try
     {
         // Execute
-        return pcmd->actor(params, false);
+        UniValue ret = pcmd->actor(params, false);
+        LogPrint("rpc", "leave method=%s\n", SanitizeString(strMethod));
+        return ret;
+    }
+    catch (const UniValue& objError)
+    {
+        LogPrint("rpc", "failed method=%s\n", SanitizeString(strMethod));
+        throw objError;
     }
     catch (const std::exception& e)
     {
+        LogPrint("rpc", "failed method=%s\n", SanitizeString(strMethod));
         throw JSONRPCError(RPC_MISC_ERROR, e.what());
     }
 

(You can apply this patch by saving it to a file and running patch -p1 <patchfile.)

Thank you!

(Just FYI, the reason getblocktemplate isn't logged is that mining nodes call that RPC very often, and logging it would cause the log file to grow very quickly, and, depending on what type of log rotation is being used, may result in useful log messages being lost.)

@LarryRuane LarryRuane self-assigned this Feb 4, 2021
@r3ld3v r3ld3v added this to the Core Sprint 2021-04 milestone Feb 4, 2021
@josef-v
Author

josef-v commented Feb 5, 2021

Here is the newest log

Feb  4 18:04:34 zcash[8470]: DEBUG rpc: enter method=z_getoperationstatus
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: leave method=getwalletinfo
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: leave method=z_getoperationstatus
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: ThreadRPCServer method=z_getoperationstatus
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: enter method=z_getoperationstatus
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: leave method=z_getoperationstatus
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: ThreadRPCServer method=listunspent
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: enter method=listunspent
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: leave method=listunspent
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: ThreadRPCServer method=z_shieldcoinbase
Feb  4 18:04:40 zcash[8470]: DEBUG rpc: enter method=z_shieldcoinbase
Feb  4 18:04:41 zcash[8470]: DEBUG rpc: leave method=z_shieldcoinbase
Feb  4 18:04:43 zcash[8470]: DEBUG rpc: ThreadRPCServer method=z_getoperationstatus
Feb  4 18:04:43 zcash[8470]: DEBUG rpc: enter method=z_getoperationstatus
Feb  4 18:05:07 zcash[8470]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo
Feb  4 18:05:07 zcash[8470]: DEBUG rpc: enter method=getnetworkinfo
Feb  4 18:05:11 zcash[8470]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo
Feb  4 18:05:11 zcash[8470]: DEBUG rpc: enter method=getnetworkinfo
Feb  4 18:05:56 zcash[8470]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo
Feb  4 18:05:56 zcash[8470]: DEBUG rpc: enter method=getnetworkinfo

There doesn't seem to be anything interesting around it.

One of the logs we had gathered has a different order of the last four methods (the patch wasn't applied in this case; note the timestamp):

Jan 22 02:05:02 zcash[1036]: DEBUG rpc: ThreadRPCServer method=gettransaction
Jan 22 02:05:04 zcash[1036]: DEBUG rpc: ThreadRPCServer method=z_getoperationstatus
Jan 22 02:05:04 zcash[1036]: DEBUG rpc: ThreadRPCServer method=z_getoperationstatus
Jan 22 02:05:04 zcash[1036]: DEBUG rpc: ThreadRPCServer method=z_getoperationresult
Jan 22 02:05:04 zcash[1036]: DEBUG rpc: ThreadRPCServer method=listunspent
Jan 22 02:05:04 zcash[1036]: DEBUG rpc: ThreadRPCServer method=z_shieldcoinbase
Jan 22 02:05:06 zcash[1036]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo
Jan 22 02:05:07 zcash[1036]: DEBUG rpc: ThreadRPCServer method=z_getoperationstatus
Jan 22 02:05:25 zcash[1036]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo
Jan 22 02:06:10 zcash[1036]: DEBUG rpc: ThreadRPCServer method=getnetworkinfo

What remains the same is that z_shieldcoinbase is the last method that finishes. (We had another log where z_shieldcoinbase is also the 5th-to-last RPC method called.)

Edit: Removed color formatting artifacts from logs.

@LarryRuane
Collaborator

Thanks, that helps, but I'd like to get more information. Could I ask you to do two more things:

First, delete debug.log (or rename it), start zcashd with -debug=rpc, wait for the problem to happen, and then attach here the output of:

$ egrep 'rpc: .*(enter|leave|failed)' debug.log

Second, and this would be even more helpful, do the following: When zcashd is in this bad state, attach to it with gdb:

$ pgrep zcashd
12345
$ gdb /proc/12345/exe 12345
(gdb) thr apply all bt

Please attach the output here; it will be around 600 lines.
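
If it's easier, gdb can capture that output to a file for you (a sketch; the option names may differ slightly between gdb versions):

(gdb) set logging file gdb_zcashd.txt
(gdb) set logging on
(gdb) thr apply all bt
(gdb) set logging off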

@josef-v
Author

josef-v commented Feb 11, 2021

First, delete debug.log (or rename it), start zcashd with -debug=rpc, wait for the problem to happen, and then attach here the output of:

$ egrep 'rpc: .*(enter|leave|failed)' debug.log

So you want the full log from a single run of zcashd, but filtered down to just the log messages added by the patch, right?
I can do this with the log from Feb 4 (the one a snippet of which was attached to my previous comment).
Here:
debug_filtered.log

@LarryRuane LarryRuane added A-rpc-interface Area: RPC interface C-bug Category: This is a bug labels Feb 11, 2021
@LarryRuane
Collaborator

Hi josef-v, yes, please do that first; perfect. What I suspect we're going to see is 20 enter lines in the log without matching leave (or failed) lines (and those 20 pending RPCs are preventing any more from running). Probably the first 4 with unmatched leave lines are the culprits, and most likely they'll all be the same RPC method. That will narrow the problem down to that method; it's getting stuck somewhere, though it may not be clear where!
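
A rough way to tally the unmatched enter lines per method (just a sketch, assuming the log lines look like the snippets above):

$ egrep -o 'rpc: (enter|leave|failed) method=[A-Za-z_]+' debug.log |
    awk '{ split($3, a, "="); if ($2 == "enter") c[a[2]]++; else c[a[2]]-- }
         END { for (m in c) if (c[m] != 0) print m, c[m] }'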

Getting the stack traces with gdb should give us a clear indication of where they're stuck, so we may need to do that as well.

But if you could do just the easy one (use the debug.log file that you have saved) now, that would be a good first step. One caution -- please be sure to extract the log lines only from the last zcashd restart; it will confuse things greatly if you give me results that span multiple run instances.

Thank you for all this help! Sorry you've run into this problem.

@LarryRuane
Collaborator

LarryRuane commented Feb 11, 2021

Oh sorry, I see you already did that, thanks! I'll look it over now...

Update: it looks like z_getoperationstatus isn't completing (right near the bottom of the file). The subsequent getnetworkinfo calls are just getting stuck on cs_main, which is held by z_getoperationstatus, so they're innocent bystanders.

I'll look into what can hold up z_getoperationstatus, but the gdb investigation may still be needed, sorry.

@LarryRuane
Collaborator

Here's another really easy thing you could do with that existing debug.log file, and it may shed some light. Could you attach here a window of the logging around the timestamp Feb  4 18:04:43 (which is when the z_getoperationstatus enters but never exits)? Maybe other logging will give us a hint about what's gone wrong. Maybe 50 or so lines before and after that rpc: enter method=z_getoperationstatus line. Thanks.
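
One way to grab that window from the saved file (a sketch; adjust the pattern if your timestamps are formatted differently):

$ grep -B50 -A50 'Feb  4 18:04:43' debug.log > snippet.log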

@LarryRuane
Collaborator

One more thing I just thought of: when you're in gdb to get the stack traces, please also do this (I'll show what it looks like for me):

(gdb) p cs_main
$2 = {
  <boost::recursive_mutex> = {
    m = pthread_mutex_t = {
      Type = Recursive,
      Status = Acquired, possibly with waiters,
      Owner ID = 18351,
      Robust = No,
      Shared = No,
      Protocol = None
    }
  }, <No data fields>}
(gdb) p pwalletMain->cs_wallet
$3 = {
  <boost::recursive_mutex> = {
    m = pthread_mutex_t = {
      Type = Recursive,
      Status = Acquired, possibly with no waiters,
      Owner ID = 18351,
      Robust = No,
      Shared = No,
      Protocol = None
    }
  }, <No data fields>}
(gdb) 

This will show us which threads (as shown by the stack traces) are holding these locks. Thanks!

@josef-v
Author

josef-v commented Feb 11, 2021

Could you attach here a window of the logging around the timestamp Feb 4 18:04:43 (which is when the z_getoperationstatus enters but never exits). Maybe other logging will give us a hint about what's gone wrong. Maybe 50 or so lines before and after that rpc: enter method=z_getoperationstatus line.

snippet.log
(I have removed all the color formatting, which makes the log look much less intimidating.)

Update, looks like z_getoperationstatus isn't completing (right near the bottom of the file). The subsequent getnetworkinfo are just getting stuck on cs_main, which is held by z_getoperationstatus, so they're innocent bystanders.

Couldn't it be z_shieldcoinbase that holds cs_main even after leaving? This is just speculation and I know nothing about how mutex locks in libboost work, but the other log I posted has a different order of the last 4 methods. The similarity is that they are preceded by z_shieldcoinbase. (Not sure whether the real order of the logged methods is guaranteed?)

Anyway, I'll let you know once we have the output from gdb.

Thanks for your help.

@LarryRuane
Collaborator

The LOCK and LOCK2 calls automatically unlock when they go out of scope. Each of the two enter method=z_shieldcoinbase lines is followed by a leave method=z_shieldcoinbase, and the leave couldn't happen without the lock (actually two locks in the case of that method, LOCK2) being released. So it still looks to me like z_getoperationstatus is the one getting hung up, and there are multiple places it could be waiting, so I'm more convinced than ever that the gdb output (the stack traces and lock variables) would be very useful.
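
For illustration, here is the RAII behaviour being described, sketched with standard-library locks standing in for zcashd's LOCK/LOCK2 macros (not the actual zcashd code):

#include <mutex>

std::recursive_mutex cs_main_like, cs_wallet_like;  // stand-ins for cs_main / pwalletMain->cs_wallet

void shieldcoinbase_like_work()
{
    std::scoped_lock both(cs_main_like, cs_wallet_like);  // analogous to LOCK2(cs_main, cs_wallet)
    // ... build, sign, and commit the transaction here ...
}   // both locks are released automatically here, even if an exception is thrown;
    // the "leave method=..." line is logged by the caller only after this point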

There is some randomness with the order of the logging when more than one thread is logging "at the same time." That may explain why the order isn't always what you'd expect.

Thanks for the additional log lines; I don't see anything unusual there.

@josef-v
Author

josef-v commented Feb 14, 2021

Hi Larry, here's the gdb backtrace
gdb_zcashd.txt

@LarryRuane
Collaborator

Thanks, this is very helpful. Can you send me the first debug.log line that's printed when zcashd starts up? It should look something like:

Feb 16 12:28:20.804  INFO main: Zcash version v4.2.0-f2315aceb-dirty (2021-02-12 10:28:57 -0700)

Or you can run src/zcash-cli --version (should show the same version). Thanks!

@LarryRuane
Collaborator

Here's an interesting stack trace:

Thread 10 (Thread 0x7f8f9f37e700 (LWP 15)):
#0  0x00007f8fc439f00c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000562a19c3fecc in __db_hybrid_mutex_suspend ()
#2  0x0000562a19c3f58e in __db_tas_mutex_lock_int ()
#3  0x0000562a19c6622b in __lock_get_internal ()
#4  0x0000562a19c678fc in __lock_get ()
#5  0x0000562a19ca8fdb in __db_lget ()
#6  0x0000562a19ca8871 in __db_new ()
#7  0x0000562a19cb1a6f in __db_poff ()
#8  0x0000562a19c44407 in __bam_iitem ()
#9  0x0000562a19c343a7 in __bamc_put ()
#10 0x0000562a19c8deb0 in __dbc_iput ()
#11 0x0000562a19c8c3d6 in __dbc_put ()
#12 0x0000562a19c77e5f in __db_put ()
#13 0x0000562a19ca390f in __db_put_pp ()
#14 0x0000562a19c1358b in Db::put(DbTxn*, Dbt*, Dbt*, unsigned int) ()
#15 0x0000562a1988d2cb in CDB::Write<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, uint256>, CWalletTx> (this=0x7f8f401161c0, key=..., value=..., fOverwrite=<optimized out>) at ./wallet/db.h:172
#16 0x0000562a1987e32e in CWalletDB::WriteTx (this=0x562a1aa388e0, hash=..., wtx=...) at wallet/walletdb.cpp:61
#17 0x0000562a19843892 in CWalletTx::WriteToDisk (this=<optimized out>, pwalletdb=<optimized out>) at wallet/wallet.cpp:2680
#18 CWallet::AddToWallet (this=0x562aa2106b80, wtxIn=..., fFromLoadWallet=<optimized out>, pwalletdb=0x7f8f401161c0) at wallet/wallet.cpp:1679
#19 0x0000562a1983c90f in CWallet::CommitTransaction (this=0x562aa2106b80, wtxNew=..., reservekey=...) at wallet/wallet.cpp:3892
#20 0x0000562a19816885 in SendTransaction (tx=..., reservekey=..., testmode=<optimized out>) at wallet/asyncrpcoperation_common.cpp:14
#21 0x0000562a19830b08 in ShieldToAddress::operator() (this=<optimized out>, zaddr=...) at wallet/asyncrpcoperation_shieldcoinbase.cpp:255
(.... skipping irrelevant stack frames ...)
#29 std::__1::visit<ShieldToAddress, std::__1::variant<libzcash::InvalidEncoding, libzcash::SproutPaymentAddress, libzcash::SaplingPaymentAddress>&> (__visitor=..., __vs=...) at /zcash/depends/x86_64-unknown-linux-gnu/native/bin/../include/c++/v1/variant:1543
#30 AsyncRPCOperation_shieldcoinbase::main_impl (this=<optimized out>) at wallet/asyncrpcoperation_shieldcoinbase.cpp:201
#31 0x0000562a1982b262 in AsyncRPCOperation_shieldcoinbase::main (this=0x7f8f9004eb20) at wallet/asyncrpcoperation_shieldcoinbase.cpp:122
#32 0x0000562a196ee894 in AsyncRPCQueue::run (this=0x562a1a799ab8, workerId=<optimized out>) at asyncrpcqueue.cpp:68

This thread, which is shielding coinbase, is holding the cs_main lock, which causes the other RPCs to become stuck. This thread is trying to write to the wallet database (which is BDB). It seems to be waiting for a very low-level lock. There's another thread that is also stuck down in BDB:

Thread 17 (Thread 0x7f8f6e2be700 (LWP 69)):
#0  0x00007f8fc439f00c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000562a19c3fecc in __db_hybrid_mutex_suspend ()
#2  0x0000562a19c3f58e in __db_tas_mutex_lock_int ()
#3  0x0000562a19c6622b in __lock_get_internal ()
#4  0x0000562a19c678fc in __lock_get ()
#5  0x0000562a19ca8fdb in __db_lget ()
#6  0x0000562a19c473b7 in __bam_search ()
#7  0x0000562a19c36a4e in __bamc_search ()
#8  0x0000562a19c34007 in __bamc_put ()
#9  0x0000562a19c8deb0 in __dbc_iput ()
#10 0x0000562a19c8c3d6 in __dbc_put ()
#11 0x0000562a19c77e5f in __db_put ()
#12 0x0000562a19ca390f in __db_put_pp ()
#13 0x0000562a19c1358b in Db::put(DbTxn*, Dbt*, Dbt*, unsigned int) ()
#14 0x0000562a1988d2cb in CDB::Write<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, uint256>, CWalletTx> (this=0x7f8f6e2bc7a8, key=..., value=..., fOverwrite=<optimized out>) at ./wallet/db.h:172
#15 0x0000562a1987e32e in CWalletDB::WriteTx (this=0x562a1a7dfd40, hash=..., wtx=...) at wallet/walletdb.cpp:61
#16 0x0000562a19868031 in CWallet::SetBestChainINTERNAL<CWalletDB> (this=0x562aa2106b80, walletdb=..., loc=...) at ./wallet/wallet.h:838
#17 0x0000562a1983b92b in CWallet::SetBestChain (this=0x562aa2106b80, loc=...) at wallet/wallet.cpp:683
#18 0x0000562a1983a000 in CWallet::ChainTipAdded (this=0x562aa2106b80, pindex=<optimized out>, pblock=0x7f8f6e2bd2e0, sproutTree=..., saplingTree=...) at wallet/wallet.cpp:612
#19 0x0000562a1983bbf1 in CWallet::ChainTip (this=0x562aa2106b80, pindex=0x7f8f390fbcf0, pblock=0x7f8f6e2bd2e0, added=...) at wallet/wallet.cpp:621

This thread is updating the wallet db as a result of the node receiving a new block. It's very possible that these two threads have acquired some pair of BDB-internal locks in opposite orders (a lock-ordering bug).
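
To make "lock-ordering bug" concrete, here is the generic pattern, purely illustrative (the real locks are internal to BDB, not named like this):

#include <mutex>

std::mutex page_lock, record_lock;  // hypothetical stand-ins for two BDB-internal locks

void wallet_write_path()            // e.g. the z_shieldcoinbase thread above
{
    std::lock_guard<std::mutex> a(page_lock);
    std::lock_guard<std::mutex> b(record_lock);  // blocks if the other thread already holds record_lock
}

void chain_tip_path()               // e.g. the SetBestChain thread above
{
    std::lock_guard<std::mutex> b(record_lock);
    std::lock_guard<std::mutex> a(page_lock);    // blocks waiting for page_lock: classic deadlock
}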

I just discovered #4502, which may be the same problem. How large is your wallet.dat file? According to #4502, a large wallet seems to make the problem more likely to occur. That ticket has been open for a while; I'd say this problem needs more priority (especially if it's the same problem). Let me tag @daira.

A simple workaround (which doesn't deserve to be called a solution) might be to single-thread all calls to CWalletDB::WriteTx(), which is common to both of these threads. It would reduce wallet db write throughput a little, but we don't write to that database often, so I doubt the difference would be noticeable.

I should try to reproduce the problem reliably (I think I have a large testnet wallet), code up a patch, and verify that it fixes the problem. If you're amenable, you could also try the patch to save time (it should be only a couple of lines). The worst that can happen is that zcashd locks up, but that's happening anyway.

@LarryRuane
Collaborator

LarryRuane commented Feb 17, 2021

Here's a patch you can try that I believe will fix the problem:

--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -111,6 +111,7 @@ public:
 private:
     CDB(const CDB&);
     void operator=(const CDB&);
+    CCriticalSection cs_walletdb;
 
 protected:
     template <typename K, typename T>
@@ -169,7 +170,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());
 
         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }
 
         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());

In case you're not familiar, you can apply this patch by saving it to a file and then running patch -p1 <patchfile at the top of the source directory.

@josef-v
Author

josef-v commented Feb 18, 2021

Hi Larry, to answer your questions:

Can you send me the first debug.log line that's printed when zcashd starts up?

Feb  4 15:49:34 zcash[8470]: INFO main: Zcash version v4.2.0-b812c3ad9-dirty (2020-12-21 20:08:33 +0000)

How large is your wallet.dat file?

The current one is 760MB, the largest one we had (at least what I see in our backups) was 1050MB.

Thanks for the patch, we will definitely try it.

@josef-v
Author

josef-v commented Feb 24, 2021

Hi Larry, it seems that the patch didn't help. We have already encountered the same issue with the patched zcashd. I have asked my colleagues who have access to the machine to collect the backtrace again. Are there any other things you would like to get from gdb?

@LarryRuane
Collaborator

LarryRuane commented Feb 25, 2021

Thanks, I'm surprised that didn't work. Please send the same gdb information as last time (stack traces and the contents of those locks), and one more thing would be helpful; it should look like this:

(gdb) p &cs_main
$2 = (CCriticalSection *) 0x555556276ef8 <cs_main>
(gdb) p &pwalletMain->cs_wallet
$3 = (CCriticalSection *) 0x5555defa8458
(gdb) 

In other words, it would help to see both the contents of those two locks and their addresses. That way I can correlate the addresses in the stack traces with these locks.

@LarryRuane
Collaborator

LarryRuane commented Mar 4, 2021

Instead of waiting for another round of gathering information and sending out a patch, why don't I just give you another patch to try, based on what I think is most likely the problem? As I explained in an earlier comment, the first set of stack traces you provided shows two threads in the "write" path, and that fix single-threaded the write path, which was a minimal patch. That's evidently not enough -- there must be another entry point (or points) into the BDB code that can deadlock. So here's a patch that single-threads all access to BDB. (It includes the previous patch.) I'd still like to see the information from the most recent deadlock -- that will show which other method or methods can deadlock besides write, which will be good to know.

--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -111,6 +111,7 @@ public:
 private:
     CDB(const CDB&);
     void operator=(const CDB&);
+    CCriticalSection cs_walletdb;
 
 protected:
     template <typename K, typename T>
@@ -128,7 +129,11 @@ protected:
         // Read
         Dbt datValue;
         datValue.set_flags(DB_DBT_MALLOC);
-        int ret = pdb->get(activeTxn, &datKey, &datValue, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->get(activeTxn, &datKey, &datValue, 0);
+        }
         memory_cleanse(datKey.get_data(), datKey.get_size());
         if (datValue.get_data() != NULL) {
             BOOST_SCOPE_EXIT_TPL(&datValue) {
@@ -169,7 +174,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());
 
         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }
 
         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -192,7 +201,11 @@ protected:
         Dbt datKey(&ssKey[0], ssKey.size());
 
         // Erase
-        int ret = pdb->del(activeTxn, &datKey, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->del(activeTxn, &datKey, 0);
+        }
 
         // Clear memory
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -212,7 +225,11 @@ protected:
         Dbt datKey(&ssKey[0], ssKey.size());
 
         // Exists
-        int ret = pdb->exists(activeTxn, &datKey, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->exists(activeTxn, &datKey, 0);
+        }
 
         // Clear memory
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -224,7 +241,11 @@ protected:
         if (!pdb)
             return NULL;
         Dbc* pcursor = NULL;
-        int ret = pdb->cursor(NULL, &pcursor, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->cursor(NULL, &pcursor, 0);
+        }
         if (ret != 0)
             return NULL;
         return pcursor;

@josef-v
Author

josef-v commented Mar 4, 2021

Hi Larry, thanks for the new patch!
Sorry for not sending any more backtraces. The colleague who is in charge of the server (it's a production machine with very limited access) was on vacation, and his substitutes probably weren't brave enough to do anything more than restart the service (the official story is that they were too busy).
We will try to get the backtrace (the brave colleague is back from vacation) and then switch to the new patch.

@josef-v
Author

josef-v commented Mar 8, 2021

Here is the backtrace I promised.
gdb_zcashd.txt
We are now running zcashd with the new patch. I'll let you know if there are any problems (or if there aren't any problems for a long enough time).
Thanks again.

@LarryRuane
Collaborator

@josef-v, please try this patch (by itself, not in addition to the previous ones). The locking was at the wrong level in the previous attempts.

--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -169,7 +169,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());
 
         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(bitdb.cs_db);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }
 
         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());

@r3ld3v r3ld3v removed this from the Core Sprint 2021-08 milestone Mar 15, 2021
@r3ld3v r3ld3v added S-committed Status: Planned work in a sprint and removed S-committed Status: Planned work in a sprint labels Jul 6, 2021
@str4d str4d added S-committed Status: Planned work in a sprint and removed S-committed Status: Planned work in a sprint labels Jul 19, 2021
@str4d str4d removed the S-committed Status: Planned work in a sprint label Jul 30, 2021
@miodragpop
Contributor

@LarryRuane I wonder if this issue is related to the fact that the cs_wallet lock is not being held in SetBestChainINTERNAL()? Shouldn't mapWallet usage be guarded by cs_wallet?
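
For context, a minimal sketch of the guard being suggested, with hypothetical stand-in types (the real code uses CCriticalSection and CWalletTx, and may differ):

#include <map>
#include <mutex>
#include <string>

struct MiniWallet {
    std::recursive_mutex cs_wallet;          // stand-in for the real cs_wallet
    std::map<std::string, int> mapWallet;    // stand-in for the real mapWallet

    void SetBestChainINTERNAL() {
        std::lock_guard<std::recursive_mutex> lock(cs_wallet);  // hold cs_wallet while iterating mapWallet
        for (const auto& entry : mapWallet) {
            (void)entry;  // write each transaction record to the wallet db here
        }
    }
};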

@LarryRuane
Collaborator

Are you still seeing this problem? If not, can this issue be closed? We think there is a good chance that this was fixed by #5280 (included in 4.5.0), although we're not able to definitively verify this since we can't reproduce the problem reliably.

@r3ld3v r3ld3v added S-committed Status: Planned work in a sprint and removed S-committed Status: Planned work in a sprint labels Oct 13, 2021
@josef-v
Author

josef-v commented Oct 19, 2021

We're not seeing this problem anymore. However, we now use a build with an upgraded BDB and have reduced our RPC calls significantly.
The issue can probably be closed for now and reopened if someone encounters the problem again.

@r3ld3v
Contributor

r3ld3v commented Oct 20, 2021

Closing since the problem isn't occurring anymore.

@r3ld3v r3ld3v closed this as completed Oct 20, 2021