
rocksdb occasionally segfaults in tests #25941

Closed
behzadnouri opened this issue Jun 13, 2022 · 38 comments
@behzadnouri
Contributor

behzadnouri commented Jun 13, 2022

Problem

For tests which spin up a LocalCluster instance, Rocksdb occasionally segfaults.

Procedure to reproduce:

for i in $(seq 1000); do
  RUSTFLAGS='-g -C force-frame-pointers=yes -C target-cpu=native' \
  cargo +1.60.0 test --jobs 29 --package solana-cli \
    --test program --test stake --test transfer -- --nocapture
  echo "REP: $(date +%H:%M), ${i}"
done &>> ~/cargo-test-output.txt &

Then monitor with

tail -f ~/cargo-test-output.txt | grep '\(REP:\|SIGSEGV\)'

It will catch output like:

process didn't exit successfully: `~/solana/target/debug/deps/stake-c0e03b44927935e9 --nocapture` (sign
al: 11, SIGSEGV: invalid memory reference)

The frequency of segfaults might be machine dependent. I have not been able to reproduce locally, but it does segfault on GCE machines.
On master commit 655b40a it segfaults roughly 2-3 times out of 1000 iterations.

I am attaching some backtraces as well:
bt-master-00.txt
bt-master-01.txt
bt-master-02.txt
bt-master-full-00.txt
bt-master-full-01.txt
bt-master-full-02.txt

All point to the same spot:

#0  0x0000000000000000 in ?? ()
#1  0x000055a5023be0ec in rocksdb::UserComparatorWrapper::Compare (this=0x7efd1b1da4e0, a=..., b=...)
    at rocksdb/util/user_comparator_wrapper.h:33

Bisect Progress (Update in Place)

Below, GOOD means the issue did NOT reproduce whereas BAD means it did reproduce with the given commit. Note that this method isn't completely foolproof: there is obviously some nondeterministic behavior, so there is potential for a false GOOD rating if the issue simply doesn't occur in the 1,000 iterations we decided on.
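Rough back-of-the-envelope (assuming the ~2-3 segfaults per 1000 iterations reported above, not a measured rate): with p ≈ 0.0025 per iteration, the probability of zero failures in 1000 iterations is (1 - 0.0025)^1000 ≈ e^-2.5 ≈ 0.08, i.e. roughly an 8% chance that a genuinely bad commit still comes out rated GOOD.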

commit   - result (verifiers where B = Behzad, J = Jeff, S = Steve, Y = Yueh Hsuan) - commit date
v1.9.18  - GOOD (B)
v1.10.24 - GOOD (B)
v1.10.25 - GOOD (S)
a829ddc922751b909a5922845b6cdd46e984487b - GOOD (J) - Sun May 1 13:11:45 2022 +0000
032a2b82157cc6792507ad096093bd50ab5f0065 - GOOD (Y) - Sun May 8 10:11:10 2022 -0700
0514e83fa2e7444bbaeb11cdc3567dc02bb320b8 - GOOD (Y) - Wed May 11 19:08:11 2022 -0500
fc793de2ddb9b97100e6f00c2f69a1a78f871de5 - GOOD (Y) - Thu May 12 14:48:29 2022 -0500
295df294de04e2daf0744820af0ffa4f39af7130 - GOOD (Y) - Thu May 12 20:28:50 2022 -0600
6c10515ccf86f4404ac0889da06c1b0c7746c17c - GOOD (Y) - Sun May 15 20:04:17 2022 +0800
0820065c98fc6166008bf32e84d4adfaf12584c8 - BAD (Y) - Tue May 17 21:02:43 2022 +0200 
dcce90555bb26493b3379801001bf59b46a3a3fe - GOOD? (Y) - Wed May 18 11:44:35 2022 -0700
4e9b1891544227aa6330efb3586b54fc5a18102e - GOOD? (J) - Fri May 20 16:05:42 2022 -0700
e02537671963cd5c8f2ad7d4b70df96776f8dfd5 - GOOD? (S) - Fri May 20 17:00:13 2022 -0700
d65aab4d275b498b43bf5f087f0b58202b383911 - BAD  (S) - Fri May 20 20:56:06 2022 -0600
7fb0ef1fa5b6815d67d74f4ef19ad4c91bb81584 - BAD  (S) - Fri May 20 21:20:47 2022 -0700
773b2f23f4b327b96e7c0a28a2d47ee516f3f048 - BAD  (Y) - Sun May 22 01:09:26 2022 +0800
0376ab41a9b714b4792740556bb77590c6576876 - BAD  (J,Y) - Tue May 24 10:39:33 2022 -0700

Proposed Solution

So far I have not seen a segfault on v1.9.28 or v1.10.24,
so some git bisecting may help find the culprit.
The process is very slow though; the tests have to run ~1000 times to get a few segfaults.

@behzadnouri
Contributor Author

Also cc @jbiseda if you want to take a look.

@behzadnouri
Contributor Author

One change that might make segfaults more frequent, and hence bisecting faster, is applying the patch below to generate more shreds; though the patch is not necessary to observe segfaults.

diff --git a/ledger/src/shredder.rs b/ledger/src/shredder.rs
index 28e4ab1f0a..51969ff3b8 100644
--- a/ledger/src/shredder.rs
+++ b/ledger/src/shredder.rs
@@ -185,7 +185,7 @@ impl Shredder {
                         .checked_add(
                             u32::try_from(i)
                                 .unwrap()
-                                .checked_mul(MAX_DATA_SHREDS_PER_FEC_BLOCK)
+                                .checked_mul(7 * MAX_DATA_SHREDS_PER_FEC_BLOCK)
                                 .unwrap(),
                         )
                         .unwrap();
@@ -233,13 +233,10 @@ impl Shredder {
             && shred.version() == version
             && shred.fec_set_index() == fec_set_index));
         let num_data = data.len();
-        let num_coding = if is_last_in_slot {
-            (2 * MAX_DATA_SHREDS_PER_FEC_BLOCK as usize)
+        let num_coding =
+            (8 * MAX_DATA_SHREDS_PER_FEC_BLOCK as usize)
                 .saturating_sub(num_data)
-                .max(num_data)
-        } else {
-            num_data
-        };
+                .max(num_data);
         let data = data.iter().map(Shred::erasure_shard_as_slice);
         let data: Vec<_> = data.collect::<Result<_, _>>().unwrap();
         let mut parity = vec![vec![0u8; data[0].len()]; num_coding];

@jbiseda
Contributor

jbiseda commented Jun 13, 2022

Also cc @jbiseda if you want to take a look.

I'll take a look at this.

@steviez
Contributor

steviez commented Jun 13, 2022

To move some conversation out of DM and into the open, I have:

  • One node running tip of master with above script
  • One node running v1.10.25 with above script with higher loop count to ensure that we don't see the issue

I'm seeing about 1 minute / iteration (similar to what behzad observed), so will report back later

@yhchiang-sol
Contributor

0x000055a5023be0ec in rocksdb::UserComparatorWrapper::Compare (this=0x7efd1b1da4e0, a=..., b=...)
at rocksdb/util/user_comparator_wrapper.h:33

From the trace it looks like it's not from the rocksdb core.

   int Compare(const Slice& a, const Slice& b) const override {
     PERF_COUNTER_ADD(user_key_comparison_count, 1);
     return user_comparator_->Compare(a, b);
   }

UserComparator is where rocksdb invokes the callback defined by the rocksdb user (i.e., us) to compare keys.

Do we have any recent changes to the key format or to how keys are compared in any column family?

@yhchiang-sol
Contributor

yhchiang-sol commented Jun 14, 2022

UserComparator is where rocksdb invokes the callback defined by the rocksdb user (i.e., us) to compare keys.
Do we have any recent changes to the key format or to how keys are compared in any column family?

Looks like we might not have any custom comparator based on the code search.

The other line close to where the trace points is PERF_COUNTER_ADD(user_key_comparison_count, 1); (line 32). My previous change on perf-context is disabled by default, and I also didn't see any reported samples in our chronograf (the metric names are blockstore_rocksdb_read_perf and blockstore_rocksdb_write_perf).

In addition, if PERF_COUNTER_ADD were the root cause, then we should also see crashes at other PERF_COUNTER_ADD locations, as this is not the only place where rocksdb collects PerfContext. These two observations together make PerfContext less likely to be the cause.

I will continue investigating with @steviez on this issue.

@jbiseda
Contributor

jbiseda commented Jun 14, 2022

To start narrowing this down to a date range:

failure:

commit 0376ab41a9b714b4792740556bb77590c6576876 (HEAD)
Author: Jack May <jack@solana.com>
Date:   Tue May 24 10:39:33 2022 -0700

no failure:

commit a829ddc922751b909a5922845b6cdd46e984487b (HEAD)
Author: behzad nouri <behzadnouri@gmail.com>
Date:   Sun May 1 13:11:45 2022 +0000

@steviez
Contributor

steviez commented Jun 14, 2022

My previous change on perf-context is disabled by default, and I also didn't see any reported samples in our chronograf (the metric names are blockstore_rocksdb_read_perf and blockstore_rocksdb_write_perf).

You wouldn't see any metrics reported for this; behzad was just running a unit test in a loop, not the full validator with metrics configured

@yhchiang-sol
Contributor

You wouldn't see any metrics reported for this; behzad was just running a unit test in a loop, not the full validator with metrics configured

I see. Then it's disabled as the rocks_perf_sample_interval is 0 by default.

@behzadnouri
Contributor Author

One possibility is that some earlier invalid memory write corrupts user_comparator_ causing this line to segfault:
https://github.com/facebook/rocksdb/blob/f105e1a50/util/user_comparator_wrapper.h#L33

i.e. user_comparator_ may not be the root cause but just a symptom of an invalid memory reference.

@yhchiang-sol
Contributor

One possibility is that some earlier invalid memory write corrupts user_comparator_ causing this line to segfault:
https://github.com/facebook/rocksdb/blob/f105e1a50/util/user_comparator_wrapper.h#L33
i.e. user_comparator_ may not be the root cause but just a symptom of an invalid memory reference.

Yes, that's likely the case, especially since we don't have any custom comparator.
Another possibility is that the memory behind slice a or b becomes invalid, but given that it crashes exactly at L33 instead of inside the comparator, it's more likely that the comparator itself has become invalid.

@yhchiang-sol
Contributor

Meanwhile, I am in the process of bisecting to see whether I can get more clues about this issue.

@jbiseda
Contributor

jbiseda commented Jun 14, 2022

Also did not see failures at the following commit:

commit 4e9b1891544227aa6330efb3586b54fc5a18102e (HEAD)
Author: Lijun Wang <83639177+lijunwangs@users.noreply.github.com>
Date:   Fri May 20 16:05:42 2022 -0700

@behzadnouri
Contributor Author

Some other ideas worth trying in parallel to the git bisect work:

  • using valgrind: https://jvns.ca/blog/2017/12/23/segfault-debugging/
  • sanitizers: https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html

@yhchiang-sol
Contributor

Thanks for the information. Each bisect iteration does take quite a long time to make sure a commit has no repro.

using valgrind: https://jvns.ca/blog/2017/12/23/segfault-debugging/
sanitizers: https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html

RocksDB itself does use valgrind and other sanitizers such as asan, tsan, and ubsan. But it's definitely worth a try on our side, as RocksDB's own checks and tests might not capture all the issues.

@yhchiang-sol
Contributor

yhchiang-sol commented Jun 15, 2022

@steviez, @jbiseda: Btw, for the bisect, does the run seem to finish early when a commit is problematic (i.e., able to repro the issue) on your side?

@steviez
Contributor

steviez commented Jun 15, 2022

I actually haven't been able to reproduce the issue at all; my runs thus far have been with branches as-is (not using the patch behzad supplied above)

  • Lack of reproduction on v1.10.25 is consistent with behzad's initial report of not seeing the issue on the (then) latest v1.10 and v1.9 releases.
  • Lack of reproduction on master (I'm two commits newer at c419845 - neither affects Rocks) is inconsistent. I'm running on a (fairly beefy N2) GCE node.

Applied the patch to generate more shreds on one of the nodes on top of master, going to let that spin for a bit

@yhchiang-sol
Contributor

Here are my current data points. I am able to see repro in the following commits:

And no repro in the following commits (though that doesn't mean they're 100% problem-free, as the issue might not be easily reproducible):

  • commit dcce905
    Author: Dmitri Makarov dmakarov@alumni.stanford.edu
    Date: Wed May 18 11:44:35 2022 -0700

  • commit 295df29
    Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Date: Thu May 12 20:28:50 2022 -0600

I will continue with this semi-bisect process. The way I am doing the bisect is the following (a rough sketch follows the list):

  • Bisect between commit B (begin) and E (end).
  • If commit (B+E)/2 is reproducible, then assign B = (B+E)/2 as in the usual bisect process, but also reset E to the initial end commit, since we aren't 100% sure that the unreproducible commits are really problem-free.
  • If commit (B+E)/2 isn't reproducible, assign E = (B+E)/2.
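
A rough sketch of this modified bisect (hypothetical code, not from the repo: commits are modeled as plain indices and repro() stands in for running the test loop on that commit), trusting BAD results but not GOOD ones:

// Hypothetical sketch of the bisect variant above: BAD (reproducible) results
// are trusted, GOOD results are not, so the GOOD endpoint is reset to the
// initial end commit whenever a new BAD commit is found.
fn repro(commit: usize) -> bool {
    // Placeholder for "run the cargo test loop on this commit and watch for
    // SIGSEGV"; any monotonic predicate works for the sketch.
    commit >= 42
}

fn conservative_bisect(mut bad: usize, good_initial: usize) -> usize {
    let mut good = good_initial; // oldest commit, assumed problem-free
    while bad - good > 1 {
        let mid = good + (bad - good) / 2;
        if repro(mid) {
            bad = mid;           // trusted: mid definitely reproduces
            good = good_initial; // reset: earlier GOODs may be false negatives
        } else {
            good = mid;          // tentative: may be a false GOOD
        }
    }
    bad // first commit observed to reproduce
}

fn main() {
    println!("first bad commit index: {}", conservative_bisect(100, 0));
}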

In the above process, I use the following two commits as the begin and end commit, and I've verified the begin-commit is reproducible while I haven't seen any repro in the end-commit so far:

@steviez
Contributor

steviez commented Jun 16, 2022

I must have been doing something dumb; I was able to reproduce, although I'm getting SIGILL instead of SIGSEGV.

 process didn't exit successfully: `/home/steven_solana_com/solana/target/debug/deps/stake-c0e03b44927935e9
--nocapture` (signal: 4, SIGILL: illegal instruction)

Looking at my dump, I got the top line of the backtrace that was unknown in the backtraces behzad supplied; the rest of the backtrace looks the same:

#0  rocksdb::Customizable::~Customizable (this=0x5577f1cc5f98 <rocksdb::BytewiseComparator()::bytewise>) at rocksdb/include/rocksdb/customizable.h:58
#1  0x00005577ed6ad5b5 in rocksdb::UserComparatorWrapper::Compare (this=0x7f087c155f70, a=..., b=...) at rocksdb/util/user_comparator_wrapper.h:33
...

I don't have any other ideas besides bisect at the moment and since it takes just a few minutes to start, I'll bisect in parallel with @yhchiang-sol.

Also, I combined all of the test data-points into the problem description at the top. Let's update the list there for the sake of having all results in one spot.

@yhchiang-sol
Contributor

yhchiang-sol commented Jun 16, 2022

The following commit has no repro on my side, and I've added it to the bisect data points.

commit 0514e83fa2e7444bbaeb11cdc3567dc02bb320b8
Author: beeman <beeman@beeman.nl>
Date:   Wed May 11 19:08:11 2022 -0500

    chore: move sinon-chai to devDependencies (#25151)

Am currently testing 0820065c98fc6166008bf32e84d4adfaf12584c8 (Tue May 17 21:02:43 2022 +0200)

@steviez
Contributor

steviez commented Jun 17, 2022

I think I have honed in on the offending commit, e263be2. Inspecting the test output a little closer, I noticed some instances of the following error: Failed to create JSON RPC Service: .... Here is a full snippet:

running 9 tests
[2022-06-16T19:33:27.174742666Z ERROR solana_core::validator] Failed to create JSON RPC Service: Address already in use (os error 98)
Validator process aborted. The validator log may contain further details
error: test failed, to rerun pass '-p solana-cli --test program'

Caused by:
  process didn't exit successfully: `.../solana/target/debug/deps/program-9e902f7be8a5d082 --nocapture` (signal: 4, SIGILL: illegal instruction)
REP: 19:33, 61
    Finished test [unoptimized + debuginfo] target(s) in 0.35s
     Running tests/program.rs (target/debug/deps/program-9e902f7be8a5d082)

Not all instances of the error were followed by SIGILL, but all instances of the SIGILL were preceded by the error. Bisect also got me down to this commit (bisect results reflected in the table at the top too):

$ git log --pretty=oneline
BAD d65aab4d275b498b43bf5f087f0b58202b383911 (HEAD) Fix ledger-tool slot to output tx even when tx status is missing (#25434)
SUSPECT e263be2000a237e694a07c498151122c2169ce95 handle start_http failure in rpc_service (#25400)
GOOD e02537671963cd5c8f2ad7d4b70df96776f8dfd5 Fix packet accounting after dedup (#25357)

I then cherry-picked e263be2 onto the v1.10 branch and saw a SIGILL; we had not previously seen the failure on v1.10, so this seems fairly definitive. I've seen some issues on the rust GH about SIGILL getting raised in strange program-exit situations (such as a panic within a panic). Given that the commit added another case where we abort, this seems plausible.
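
For reference, a minimal standalone sketch (illustrative only, not code from the repo) of the "panic within a panic" scenario mentioned above; a second panic raised while the first is still unwinding forces an immediate abort of the whole process, which can surface as SIGABRT or SIGILL depending on how the abort is lowered:

struct PanicsOnDrop;

impl Drop for PanicsOnDrop {
    fn drop(&mut self) {
        // Runs while the panic in main() is already unwinding, producing
        // "thread panicked while panicking. aborting." and a process abort.
        panic!("second panic during unwinding");
    }
}

fn main() {
    let _guard = PanicsOnDrop;
    panic!("first panic");
}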

Edit: The following comment makes me unsure now

@yhchiang-sol
Contributor

yhchiang-sol commented Jun 17, 2022

Commit 0820065 (Tue May 17 21:02:43 2022 +0200) crashed once out of 597 runs. I didn't see the Failed to create JSON RPC Service error prior to the crash. I'm going to test 032a2b8 (Sun May 8 10:11:10 2022 -0700).

Something worth mentioning is that 0820065 (Tue May 17 21:02:43 2022 +0200) comes before e263be2 (Fri May 20 17:59:23 2022 -0700). If e263be2 is innocent, that raises the possibility that the problem already exists in v1.10?

@yhchiang-sol
Contributor

032a2b8 (Sun May 8 10:11:10 2022 -0700) has no repro after 800+ iterations.

@yhchiang-sol
Contributor

Currently testing commit fc793de (Thu May 12 14:48:29 2022 -0500)

@behzadnouri
Contributor Author

Finished git bisect, and ended up with the same commit as @steviez: #25941 (comment)

git bisect start
git bisect good cda3d66b21367bd8fda16e6265fa61e7fb4ba6c9
git bisect bad 655b40a2b7ae43f5575c2141ceb0fdb322f22bf1
git bisect good a829ddc922751b909a5922845b6cdd46e984487b
git bisect good 97efbdc303072d55b2a1cdcf34eb591218336c1c
git bisect bad d4e7ebf4f8821dfa59a1f278898cf9a7ad70ebd9
git bisect bad 41f30a2383ef58b76b2938a8914c908f62d8c180
git bisect bad e263be2000a237e694a07c498151122c2169ce95
git bisect good 467431de8946e3cd4453c2ffbf06038e1d5d9f96
git bisect good a5792885ca3af699737ae81c95347cdef54fc471
git bisect good f584b249dd94056fe2760973cbf8f50080a0951b
git bisect good 8caf0aabd11a6c1d6dc035fa9633667caaa1ff80
git bisect good e02537671963cd5c8f2ad7d4b70df96776f8dfd5

e263be2000a237e694a07c498151122c2169ce95 is the first bad commit

@behzadnouri
Contributor Author

Another commit that seems problematic is e105547, which added more unwraps at exit time: #24111

I don't know why it didn't show up in git bisect, though. It does seem like rocksdb may segfault if there is an unwrap/abort in rust code.

@yhchiang-sol
Contributor

Commit fc793de (Thu May 12 14:48:29 2022 -0500) has no repro after 1000 iterations.

@yhchiang-sol
Contributor

Commit 6c10515 (Sun May 15 20:04:17 2022 +0800) has no repro after 1000 iterations.

@behzadnouri
Contributor Author

I think I have ample evidence that these two commits are causing segfaults: e105547, e263be2

I am using this branch for testing:
https://github.com/behzadnouri/solana/commits/rm-shred-buffer-dbg-1655855148

With the two reverts I have not observed any segfaults despite running the tests ~6k times.
Removing either of the reverts, I do see segfaults, in particular with #24111 kept there.

We do observe segfaults on master branch as well, so the first 4 commits are not the issue.

Both #25400 and #24111 are adding abort/panics at validator exit so maybe that is causing some illegal memory access in rocksdb code which results in segfaults.

This has become a blocker for the shreds redesign work. @jbiseda @HaoranYi can we revert #25400 and #24111 and instead do something cleaner which does not cause abort/panic?

@jbiseda
Contributor

jbiseda commented Jun 22, 2022

This has become a blocker for the shreds redesign work. @jbiseda @HaoranYi can we revert #25400 and #24111 and instead do something cleaner which does not cause abort/panic?

I can revert #25400 and follow up with a new fix. Need to look into the test panic! handler.

@HaoranYi
Contributor

#24111
OK. We will revert it. I will need to spend more time looking into the hang when the validator is shutting down.

@HaoranYi
Contributor

Both #25400 and #24111 are adding abort/panics at validator exit so maybe that is causing some illegal memory access in rocksdb code which results in segfaults.

@behzadnouri
Are you sure about this? Doesn't #24111 wait 10s for threads to join before forcing exit? Is there code that still accesses rocksdb 10s after receiving the shutdown signal? Do you see any hang in your tests after reverting #24111?

@behzadnouri
Contributor Author

@behzadnouri Are you sure about this? Doesn't #24111 wait 10s for threads to join before forcing exit? Is there code that still accesses rocksdb 10s after receiving the shutdown signal? Do you see any hang in your tests after reverting #24111?

@HaoranYi I have already explained the diagnosis process here: #25941 (comment)

With the two reverts I have not observed any segfaults despite running the tests ~6k times.
Removing either of the reverts, I do see segfaults, in particular with #24111 kept there.

We do observe segfaults on master branch as well, so the first 4 commits are not the issue.

@HaoranYi
Contributor

HaoranYi commented Jun 22, 2022

Do you actually see the timeout log in the tests when you have seen the segmentation fault with this change?

Yes, please also see this comment: #25941 (comment)

OK. By reverting #24111, we hide the segfault issue and go back to the hang issue. Here is the issue tracking the validator_exit hang: #25933.
@behzadnouri Can you take a look? I think it's worth having fresh eyes on this long-standing problem.

@HaoranYi
Contributor

@behzadnouri Are you sure about this? Doesn't #24111 wait 10s for threads to join before forcing exit? Is there code that still accesses rocksdb 10s after receiving the shutdown signal? Do you see any hang in your tests after reverting #24111?

@HaoranYi I have already explained the diagnosis process here: #25941 (comment)

With the two reverts I have not observed any segfaults despite running the tests ~6k times.
Removing either of the reverts, I do see segfaults, in particular with #24111 kept there.
We do observe segfaults on master branch as well, so the first 4 commits are not the issue.

@behzadnouri
Let me repeat. I don't think reverting #24111 correctly solves the segfault issue. The real underlying issue is to find out why, 10s after shutting down, there is still code accessing rocksdb, which causes the corruption. By reverting #24111, you are masking the rocksdb corruption with a hang.

@jbiseda
Contributor

jbiseda commented Aug 10, 2022

The problem seems to be that Validator::new fails using abort(), which in tests calls panic(). This is handled by the test framework and can leave rocksdb background threads accessing partially torn-down state. It is exacerbated by tests in program.rs and commands.rs running parallel tests which call Validator::new. This can be mitigated by running tests serially and using rocksdb's DB::cancel_all_background_work.

From a cursory examination of the rocksdb code, it looks like background threads are cleaned up when the DB object is torn down. One solution is to have Validator::new return a Result and fail without using abort(), which will tear down state cleanly.
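
A hypothetical sketch of that direction (the names and error variant here are illustrative, not the actual solana API): returning an error lets the caller bail out while Drop impls, including the blockstore/rocksdb teardown, still run on the failure path.

use std::io;

struct Validator; // stand-in for the real validator, which owns the blockstore

#[derive(Debug)]
enum ValidatorError {
    RpcServiceFailed(io::Error),
}

impl Validator {
    // Instead of abort()/panic!() on failure, surface the error to the caller.
    fn new() -> Result<Self, ValidatorError> {
        // e.g. the "Address already in use" failure from the JSON RPC service
        Err(ValidatorError::RpcServiceFailed(io::Error::new(
            io::ErrorKind::AddrInUse,
            "Address already in use",
        )))
    }
}

fn main() {
    match Validator::new() {
        Ok(_validator) => { /* run the node */ }
        // Returning here lets owned resources drop cleanly rather than leaving
        // rocksdb background threads racing a half-torn-down process.
        Err(err) => eprintln!("Failed to start validator: {err:?}"),
    }
}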

@yhchiang-sol
Contributor

The problem seems to be that Validator::new fails using abort(), which in tests calls panic(). This is handled by the test framework and can leave rocksdb background threads accessing partially torn-down state. It is exacerbated by tests in program.rs and commands.rs running parallel tests which call Validator::new. This can be mitigated by running tests serially and using rocksdb's DB::cancel_all_background_work.

Thanks @jbiseda for spotting and working on the PR!

This seems to be a very likely cause, as rocksdb requires all threads that access it to be joined before it is torn down. As we previously observed in the crash stacktrace, the crash happens when a thread tries to access the rocksdb comparator pointer, which in this case has become invalid because the process and rocksdb are terminating.
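
To illustrate the join-before-teardown point (purely illustrative, std-only code with a stand-in type, no actual rocksdb): every thread that can touch the DB handle should be joined before the handle is dropped, which is exactly the step an abort() skips.

use std::sync::Arc;
use std::thread;

struct Db; // stand-in for the rocksdb/blockstore handle

impl Db {
    fn compare(&self, a: &[u8], b: &[u8]) -> std::cmp::Ordering {
        a.cmp(b) // rough analogue of the bytewise comparator in the backtrace
    }
}

fn main() {
    let db = Arc::new(Db);

    let worker = {
        let db = Arc::clone(&db);
        thread::spawn(move || {
            // Background work that keeps touching the DB, similar in spirit to
            // rocksdb's flush/compaction threads calling the comparator.
            let _ = db.compare(b"a", b"b");
        })
    };

    // Joining before the handle goes away guarantees no thread still uses it
    // when it is destroyed; aborting the process instead of unwinding skips
    // this join (and every Drop), which is the hazard described above.
    worker.join().unwrap();
}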

@jbiseda
Contributor

jbiseda commented Aug 26, 2022

This should be resolved by:

Validator::new now unwinds gracefully on failure with a Result. abort() (an inline panic) has been removed from this codepath.
