rocksdb occasionally segfaults in tests #25941
Comments
Also cc @jbiseda if you want to take a look.
One change that might make segfaults more frequent, and so make bisecting faster, is applying the patch below to generate more shreds. The patch is not necessary to observe segfaults, though.

```diff
diff --git a/ledger/src/shredder.rs b/ledger/src/shredder.rs
index 28e4ab1f0a..51969ff3b8 100644
--- a/ledger/src/shredder.rs
+++ b/ledger/src/shredder.rs
@@ -185,7 +185,7 @@ impl Shredder {
             .checked_add(
                 u32::try_from(i)
                     .unwrap()
-                    .checked_mul(MAX_DATA_SHREDS_PER_FEC_BLOCK)
+                    .checked_mul(7 * MAX_DATA_SHREDS_PER_FEC_BLOCK)
                     .unwrap(),
             )
             .unwrap();
@@ -233,13 +233,10 @@ impl Shredder {
             && shred.version() == version
             && shred.fec_set_index() == fec_set_index));
         let num_data = data.len();
-        let num_coding = if is_last_in_slot {
-            (2 * MAX_DATA_SHREDS_PER_FEC_BLOCK as usize)
+        let num_coding =
+            (8 * MAX_DATA_SHREDS_PER_FEC_BLOCK as usize)
                 .saturating_sub(num_data)
-                .max(num_data)
-        } else {
-            num_data
-        };
+                .max(num_data);
         let data = data.iter().map(Shred::erasure_shard_as_slice);
         let data: Vec<_> = data.collect::<Result<_, _>>().unwrap();
         let mut parity = vec![vec![0u8; data[0].len()]; num_coding];
```
I'll take a look at this.
To move some conversation out of DM and into the open, I have:
I'm seeing about 1 minute / iteration (similar to what behzad observed), so I will report back later.
From the trace it looks like it's not from the rocksdb core.
UserComparator is where rocksdb invokes the callback defined by the rocksdb user (i.e., us) to compare keys. Do we have any recent changes to the key format, or to how keys are compared, in any column families?
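For context, here is roughly how a custom comparator would be registered through the Rust `rocksdb` crate (a sketch only; the exact `set_comparator` signature varies across crate versions, and the path and name here are made up). rocksdb stores a pointer to the callback and invokes it, via UserComparator, on every key comparison, which is why a dangling callback pointer would crash at exactly the spot in the trace.

```rust
use rocksdb::{Options, DB};
use std::cmp::Ordering;

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // rocksdb holds on to this closure and calls back into it (through
    // UserComparator) every time it needs to order two keys.
    opts.set_comparator(
        "reverse", // hypothetical name; must stay stable across reopens
        Box::new(|a: &[u8], b: &[u8]| -> Ordering {
            b.cmp(a) // sort keys in descending lexicographic order
        }),
    );
    let db = DB::open(&opts, "/tmp/comparator-demo").unwrap();
    db.put(b"key1", b"v1").unwrap();
    db.put(b"key2", b"v2").unwrap();
    // Iteration order is now governed by the custom comparator.
}
```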
Based on a code search, it looks like we don't have any custom comparator. The other line that is close to where the trace pointed is

In addition, if PERF_COUNTER_ADD were the root cause, then we should also see crashes at other PERF_COUNTER_ADD locations, as this is not the only place where rocksdb collects PerfContext. These two observations together make PerfContext a less likely cause. I will continue investigating this issue with @steviez.
To start narrowing this down to a date range:

failure:

no failure:
You wouldn't see any metrics reported for this; behzad was just running a unit test in a loop, not the full validator with metrics configured |
I see. Then it's disabled, since rocks_perf_sample_interval is 0 by default.
One possibility is that some earlier invalid memory write corrupts i.e. |
Yes, that's likely the case, especially since we don't have any custom comparator (a minimal illustration of that failure mode is sketched below).
Meanwhile, I am running a bisect to see whether I can get more clues about this issue.
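To make the suggested failure mode concrete, here is a purely hypothetical sketch (not code from this repo, and deliberately containing undefined behavior): an out-of-bounds write in one place can silently corrupt an unrelated allocation, so the crash surfaces later at a site that is itself correct, such as a key comparison.

```rust
// Deliberately buggy, for illustration only: shows how a wild write can
// trash an unrelated allocation so the crash appears far from the bug.
fn main() {
    // Pretend this allocation plays the role of state that rocksdb will
    // dereference later (e.g. a comparator object).
    let later_used = Box::new(String::from("comparator state"));

    let mut scratch = vec![0u8; 16];
    unsafe {
        // The actual bug: writing 64 bytes into a 16-byte buffer. This is
        // undefined behavior; if the allocator placed `later_used`'s data
        // nearby, its internals can be silently overwritten.
        std::ptr::write_bytes(scratch.as_mut_ptr(), 0x41, 64);
    }

    // The failure (garbage output or a segfault) shows up here, at code
    // that has no bug of its own.
    println!("{later_used}");
}
```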
I also did not see failures at the following commit:
Some other ideas worth trying in parallel to the bisect:
Thanks for the information. Each bisect iteration takes quite a long time in order to make sure a commit really has no repro.
RocksDB itself does use valgrind and other sanitizers such as asan, tsan, and ubsan, but it's definitely worth trying on our side, as RocksDB's checks and tests might not capture all the issues.
I actually haven't been able to reproduce the issue at all; my runs thus far have been with branches as-is (not using the patch behzad supplied above).
I applied the patch to generate more shreds on one of the nodes on top of master; going to let that spin for a bit.
Here are my current data points. I am able to see a repro at the following commits:
And no repro at the following commits (though that doesn't mean they are 100% problem-free, as the issue is not easily reproducible):
I will continue the semi-bisect process. The way I am bisecting is the following:
In the above process, I use the following two commits as the begin and end commits; I've verified that the begin commit reproduces the issue, while I haven't seen any repro at the end commit so far:
I must have been doing something dumb; I was able to reproduce, although I'm getting
Looking at my dump, I got the top line of the backtrace that was unknown in the backtraces behzad supplied; the rest of the backtrace looks to be the same:
I don't have any other ideas besides bisecting at the moment, and since it takes just a few minutes to start, I'll bisect in parallel with @yhchiang-sol. Also, I combined all of the test data points into the problem description at the top. Let's update the list there for the sake of having all results in one spot.
The following commit has no repro on my side, and I've added it to the bisect data points.
Am currently testing |
I think I have honed in on the offending commit, e263be2. Inspecting test output a little more closely, I noticed some instances of the following error:
Not all instances of the error were followed by
I then cherry-picked e263be2 onto

Edit: The following comment makes me unsure now.
Commit 0820065 (Tue May 17 21:02:43 2022 +0200) crashes once out of 597 runs. Didn't see the

Something worth mentioning is that 0820065 (Tue May 17 21:02:43 2022 +0200) is before e263be2 (Fri May 20 17:59:23 2022 -0700). If e263be2 is innocent, then it brings us to the possibility that the problem already exists in
032a2b8 (Sun May 8 10:11:10 2022 -0700) has no repro after 800+ iterations.
Currently testing commit fc793de (Thu May 12 14:48:29 2022 -0500) |
Finished bisecting:

```
git bisect start
git bisect good cda3d66b21367bd8fda16e6265fa61e7fb4ba6c9
git bisect bad 655b40a2b7ae43f5575c2141ceb0fdb322f22bf1
git bisect good a829ddc922751b909a5922845b6cdd46e984487b
git bisect good 97efbdc303072d55b2a1cdcf34eb591218336c1c
git bisect bad d4e7ebf4f8821dfa59a1f278898cf9a7ad70ebd9
git bisect bad 41f30a2383ef58b76b2938a8914c908f62d8c180
git bisect bad e263be2000a237e694a07c498151122c2169ce95
git bisect good 467431de8946e3cd4453c2ffbf06038e1d5d9f96
git bisect good a5792885ca3af699737ae81c95347cdef54fc471
git bisect good f584b249dd94056fe2760973cbf8f50080a0951b
git bisect good 8caf0aabd11a6c1d6dc035fa9633667caaa1ff80
git bisect good e02537671963cd5c8f2ad7d4b70df96776f8dfd5

e263be2000a237e694a07c498151122c2169ce95 is the first bad commit
```
Commit fc793de (Thu May 12 14:48:29 2022 -0500) has no repro after 1000 iterations.
Commit 6c10515 (Sun May 15 20:04:17 2022 +0800) has no repro after 1000 iterations.
I think I have ample evidence that these two commits are causing the segfaults: e105547, e263be2. I am using this branch for testing:

With the two reverts, I have not observed any segfaults despite running the tests ~6k times. We do observe segfaults on the master branch as well, so the first 4 commits are not the issue. Both #25400 and #24111 add abort/panics at validator exit, so maybe that is causing some illegal memory access in rocksdb code which results in segfaults. This has become a blocker for the shreds redesign work. @jbiseda @HaoranYi, can we revert #25400 and #24111 and instead do something cleaner which does not cause an abort/panic?
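To make the suspected mechanism concrete, here is a minimal, hypothetical sketch (not the actual validator shutdown path): if the process exits or aborts while an unjoined thread is still inside a DB call, process teardown, including C++ static destructors on the rocksdb side, can race that thread.

```rust
use std::{sync::Arc, thread, time::Duration};

// Hypothetical stand-in for the blockstore/rocksdb handle.
struct Db;
impl Db {
    fn get(&self) {
        // In the real case this would descend into rocksdb C++ code,
        // including callbacks such as the key comparator.
    }
}

fn main() {
    let db = Arc::new(Db);
    let worker_db = Arc::clone(&db);
    // A background thread that is never signaled or joined before exit.
    thread::spawn(move || loop {
        worker_db.get();
    });
    thread::sleep(Duration::from_millis(10));
    // An exit/abort here skips all joins and Rust destructors; process
    // teardown (including any native atexit/static destructors) races the
    // worker thread, which can segfault inside the DB code it is running.
    std::process::exit(1);
}
```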
#24111
@behzadnouri |
@HaoranYi I have already explained the diagnosis process here: #25941 (comment)
OK. By reverting #24111, we hide the segfault issue and go back to the hang issue. Here is the issue to track the validator_exit hang: #25933.
@behzadnouri |
The problem seems to be that

From a cursory examination of the rocksdb code, it looks like background threads are cleaned up when the DB object is torn down. One solution is to have
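A rough sketch of that kind of fix (hypothetical names such as `BlockstoreService`; the real Solana types and shutdown sequence may differ): signal the service thread to exit and join it before the struct owning the DB handle is dropped.

```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    thread::{self, JoinHandle},
};

// Hypothetical stand-ins for the real types.
struct Blockstore; // owns the rocksdb handle
impl Blockstore {
    fn process_pending(&self) { /* reads/writes that enter rocksdb */ }
}

struct BlockstoreService {
    exit: Arc<AtomicBool>,
    thread: Option<JoinHandle<()>>,
}

impl BlockstoreService {
    fn new(blockstore: Arc<Blockstore>) -> Self {
        let exit = Arc::new(AtomicBool::new(false));
        let exit_flag = Arc::clone(&exit);
        let thread = thread::spawn(move || {
            while !exit_flag.load(Ordering::Relaxed) {
                blockstore.process_pending();
            }
            // `blockstore`'s Arc is released here, so the DB can only be
            // torn down after this thread has stopped touching it.
        });
        Self { exit, thread: Some(thread) }
    }

    // Must be called before the last Arc<Blockstore> is dropped.
    fn join(mut self) -> thread::Result<()> {
        self.exit.store(true, Ordering::Relaxed);
        self.thread.take().unwrap().join()
    }
}

fn main() {
    let blockstore = Arc::new(Blockstore);
    let service = BlockstoreService::new(Arc::clone(&blockstore));
    // ... run the test ...
    service.join().unwrap(); // join BEFORE the blockstore is dropped
    drop(blockstore); // only now is the rocksdb handle torn down
}
```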
Thanks @jbiseda for spotting this and working on the PR! This seems very likely to be the cause, as rocksdb requires all threads that access it to be joined before rocksdb terminates. As we previously observed in the crash stacktrace, it crashes when a thread tries to access the rocksdb custom comparator pointer, which becomes invalid because the process and rocksdb are terminating.
This should be resolved by:
Problem
For tests which spin up a LocalCluster instance, rocksdb occasionally segfaults.
Procedure to reproduce:
Then monitor with
It will catch output like:
The frequency of segfaults might be machine-dependent. I have not been able to reproduce locally, but it does segfault on gce machines.
On master commit 655b40a, it segfaults 2-3 times out of 1000 iterations.
I am attaching some backtraces as well:
bt-master-00.txt
bt-master-01.txt
bt-master-02.txt
bt-master-full-00.txt
bt-master-full-01.txt
bt-master-full-02.txt
All point to the same spot:
Bisect Progress (Update in Place)
Below, `GOOD` means the issue did NOT reproduce, whereas `BAD` means it did reproduce with the given commit. Note that this method isn't completely fool-proof: there is obviously some nondeterministic behavior, and as such, there is potential for a false `GOOD` rating if the issue doesn't occur in the 1,000 iterations we decided on.

Proposed Solution
So far I have not seen a segfault on `v1.9.28` or `v1.10.24`, so some `git bisect`ing may help to find the culprit. The process is very slow though; the tests have to run ~1000 times to get a few segfaults.