RPC stops working and http queue fills up #4970
Some RPCs are probably getting blocked and never completing. There are, by default, four RPC worker threads; if all of them block, no further requests can be serviced and the HTTP queue backs up behind them.

Could you please start zcashd with the `-debug=rpc` option? That may not give us enough information, though. If you're building from source, would you mind applying this patch and reproducing the problem (again specifying `-debug=rpc`)?

```diff
--- a/src/rpc/server.cpp
+++ b/src/rpc/server.cpp
@@ -493,13 +493,22 @@ UniValue CRPCTable::execute(const std::string &strMethod, const UniValue &params
     g_rpcSignals.PreCommand(*pcmd);

+    LogPrint("rpc", "enter method=%s\n", SanitizeString(strMethod));
     try
     {
         // Execute
-        return pcmd->actor(params, false);
+        UniValue ret = pcmd->actor(params, false);
+        LogPrint("rpc", "leave method=%s\n", SanitizeString(strMethod));
+        return ret;
+    }
+    catch (const UniValue& objError)
+    {
+        LogPrint("rpc", "failed method=%s\n", SanitizeString(strMethod));
+        throw objError;
+    }
     catch (const std::exception& e)
     {
+        LogPrint("rpc", "failed method=%s\n", SanitizeString(strMethod));
         throw JSONRPCError(RPC_MISC_ERROR, e.what());
     }
```

(You can apply this patch by saving it to a file and running `git apply <file>`.) Thank you! (Just FYI, the reason …)
Here is the newest log
There doesn't seem to be anything interesting around it. One of the logs we gathered has a different order of the last four methods (the patch wasn't applied in this case; notice the timestamp). What remains the same is that …

Edit: Removed color-formatting artifacts from the logs.
Thanks, that helps, but I'd like to get more information. Can I ask you to do two more things? First, delete …

Second, and this would be even more helpful, do the following: when …

Please attach the output here; it will be around 600 lines.
So you want a full log from a single run of zcashd, but filtered down to only the log messages added by the patch, right?
Hi josef-v,

Yes, please first do that, perfect. What I suspect we're going to see is 20 … So, getting the stack traces with … But if you could do just the easy one (use the …).

Thank you for all this help! Sorry you've run into this problem.
Oh sorry, I see you already did that, thanks! I'll look it over now... Update: looks like … I'll look into what can hold up …
Here's another really easy thing you could do with that existing …
One more thing I just thought of: when you're in … This will show us which threads (as identified by the stack traces) are holding these locks. Thanks!
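For reference, this kind of information is typically gathered from a running zcashd with gdb along the following lines. This is a sketch of a generic session, not the exact commands from this thread, and `cs_main` is just an example lock name; substitute whatever lock is relevant to your traces:

```
$ gdb -p $(pidof zcashd)
(gdb) thread apply all bt    # stack trace of every thread
(gdb) print cs_main          # contents of a lock object
(gdb) print &cs_main         # its address, to match against the traces
(gdb) detach
(gdb) quit
```

Note that gdb pauses the process while attached, so on a production node it's best to detach promptly.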
snippet.log
Couldn't it be the …? Anyway, I'll let you know once we have the output from … Thanks for your help.
The … There is some randomness in the order of the logging when more than one thread is logging "at the same time." That may explain why the order isn't always what you'd expect. Thanks for the additional log lines; I don't see anything unusual there.
Hi Larry, here's the gdb backtrace.
Thanks, this is very helpful. Can you send me the first …? Or you can run …
Here's an interesting stack trace:

This thread, which is shielding coinbase, is holding the …

This thread is updating the wallet db as a result of the node receiving a new block. It's very possible that these two threads have locked some BDB-internal pair of locks in opposite orders (a lock-ordering bug). I just discovered #4502, which may be the same problem. How large is your wallet.dat?

A simple workaround for this problem (which doesn't deserve to be called a solution) might be to single-thread all calls into the BDB write path. I should try to reliably reproduce the problem (I think I have a large testnet wallet), then code up a patch and verify that it fixes the problem. If you're amenable, to save time, you could try the patch too (it should be only a couple of lines). The worst that can happen is that …
Here's a patch you can try that I believe will fix the problem:

```diff
--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -111,6 +111,7 @@ public:
 private:
     CDB(const CDB&);
     void operator=(const CDB&);
+    CCriticalSection cs_walletdb;

 protected:
     template <typename K, typename T>
@@ -169,7 +170,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());

         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }

         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());
```

In case you're not familiar: you can apply this patch by saving it to a file and then running `git apply <file>`.
Hi Larry, to answer your questions: the current one is 760MB; the largest we had (at least in our backups) was 1050MB. Thanks for the patch, we will definitely try it.
Hi Larry, it seems that the patch didn't help. We have already encountered the same issue with the patched zcashd. I have asked my colleagues who have access to the machine to collect the backtrace again. Are there any other things you would like to get from gdb?
Thanks, I'm surprised that didn't work. Please send the same gdb information as last time (stack traces and the contents of those locks). One more thing would also be helpful; it should look like this: …

In other words, it would help to see both the contents of those two locks and their addresses. That way I can correlate the addresses in the stack traces with these locks.
Instead of waiting for another round of gathering information and sending out a patch, why don't I just give you another patch to try, based on what I think is most likely the problem?

As I explained in an earlier comment, the first set of stack traces you provided shows two threads in the "write" path, and the fix single-threads the write path, which was a minimal patch. That's evidently not enough -- there must be another path (or paths) into the BDB code that can deadlock. So here's a patch that single-threads all access to BDB. (It includes the previous patch.) I'd still like to see the information from the most recent deadlock -- that will show which other method or methods can deadlock besides write, which will be good to know.

```diff
--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -111,6 +111,7 @@ public:
 private:
     CDB(const CDB&);
     void operator=(const CDB&);
+    CCriticalSection cs_walletdb;

 protected:
     template <typename K, typename T>
@@ -128,7 +129,11 @@ protected:
         // Read
         Dbt datValue;
         datValue.set_flags(DB_DBT_MALLOC);
-        int ret = pdb->get(activeTxn, &datKey, &datValue, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->get(activeTxn, &datKey, &datValue, 0);
+        }
         memory_cleanse(datKey.get_data(), datKey.get_size());
         if (datValue.get_data() != NULL) {
             BOOST_SCOPE_EXIT_TPL(&datValue) {
@@ -169,7 +174,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());

         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }

         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -192,7 +201,11 @@ protected:
         Dbt datKey(&ssKey[0], ssKey.size());

         // Erase
-        int ret = pdb->del(activeTxn, &datKey, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->del(activeTxn, &datKey, 0);
+        }

         // Clear memory
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -212,7 +225,11 @@ protected:
         Dbt datKey(&ssKey[0], ssKey.size());

         // Exists
-        int ret = pdb->exists(activeTxn, &datKey, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->exists(activeTxn, &datKey, 0);
+        }

         // Clear memory
         memory_cleanse(datKey.get_data(), datKey.get_size());
@@ -224,7 +241,11 @@ protected:
         if (!pdb)
             return NULL;
         Dbc* pcursor = NULL;
-        int ret = pdb->cursor(NULL, &pcursor, 0);
+        int ret;
+        {
+            LOCK(cs_walletdb);
+            ret = pdb->cursor(NULL, &pcursor, 0);
+        }
         if (ret != 0)
             return NULL;
         return pcursor;
```
Hi Larry, thanks for the new patch! |
Here is the backtrace I promised. |
@josef-v, please try this patch (by itself, not in addition to the previous ones). The locking was at the wrong level in the previous attempts.

```diff
--- a/src/wallet/db.h
+++ b/src/wallet/db.h
@@ -169,7 +169,11 @@ protected:
         Dbt datValue(&ssValue[0], ssValue.size());

         // Write
-        int ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        int ret;
+        {
+            LOCK(bitdb.cs_db);
+            ret = pdb->put(activeTxn, &datKey, &datValue, (fOverwrite ? 0 : DB_NOOVERWRITE));
+        }

         // Clear memory in case it was a private key
         memory_cleanse(datKey.get_data(), datKey.get_size());
```
@LarryRuane I wonder if this issue is related to the fact that the `cs_wallet` lock is not being held in `SetBestChainINTERNAL()`? Shouldn't `mapWallet` usage be guarded by `cs_wallet`?
Are you still seeing this problem? If not, can this issue be closed? We think there is a good chance that this was fixed by #5280 (included in 4.5.0), although we're not able to definitively verify this since we can't reproduce the problem reliably.
We're not seeing this problem anymore. However, we now use a build with an upgraded BDB and have reduced our RPC call volume significantly.
Closing since the problem isn't occurring anymore. |
version: v4.1.1, v4.1.0 (not sure about v4.0.0)

We are encountering a problem with the RPC interface. At some point, RPC calls stop being serviced by the RPC server and the HTTP queue fills up. This state is recoverable only by restarting zcashd. We are hitting this issue approximately once a day.

The node is used together with a wallet; shielding of coinbase outputs and z_sendmany calls are performed on this node. (Full list of methods used on this node: backupwallet, getinfo, getmempoolinfo, getnetworkinfo, gettransaction, getunconfirmedbalance, getwalletinfo, listunspent, settxfee, validateaddress, z_getbalance, z_getoperationresult, z_getoperationstatus, z_listoperationids, z_sendmany, z_shieldcoinbase.)

There is no mining happening on this node, so no getblocktemplate calls are made (important when reading the log, as getblocktemplate is not logged on the RPC server side for some reason).

A similar issue was reported on Zcash's tech-support Discord channel by another party.
snippet of debug log: debug.log