
Built-in profiling can cause a segmentation fault #9765

Closed
sticnarf opened this issue Mar 8, 2021 · 9 comments · Fixed by #9858 or #12201
Labels: affects-5.3 (This bug affects 5.3.x versions.), affects-5.4, affects-6.0, severity/critical, type/bug (Type: Issue - Confirmed a bug)

sticnarf (Contributor) commented Mar 8, 2021

Bug Report

What version of TiKV are you using?

Latest nightly: dc8ce2c

What operating system and CPU are you using?

Linux. It happens on both CentOS 7 (3.10.0-957.27.2.el7.x86_64) and Arch Linux (5.11.2-arch1-1).

Steps to reproduce

Run YCSB workloada on a 3-node TiKV cluster. Meanwhile, get http://{TiKV_IP}:20180/debug/pprof/profile?seconds=30.
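
For reference, a minimal sketch (in Rust, using only the standard library) of the second step. This is an illustration only, not a tool from the report; the address 127.0.0.1:20180 is a placeholder for one TiKV node's status address.

// Fetch a 30-second CPU profile from the TiKV status server over plain HTTP.
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Replace with the status address of a TiKV node that is under load.
    let mut stream = TcpStream::connect("127.0.0.1:20180")?;
    // The handler samples CPU for `seconds` and then returns the profile in the body.
    stream.write_all(
        b"GET /debug/pprof/profile?seconds=30 HTTP/1.1\r\n\
          Host: 127.0.0.1:20180\r\n\
          Connection: close\r\n\r\n",
    )?;
    let mut body = Vec::new();
    stream.read_to_end(&mut body)?;
    println!("received {} bytes", body.len());
    Ok(())
}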

What did you expect?

Get a CPU profiling flamegraph.

What happened?

A segmentation fault occurs. Backtrace from the core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (Thread 0x7ff7f83fe640 (LWP 162560))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /data/deploy/tikv-20160/bin/tikv-server.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0  x86_64_fallback_frame_state (context=0x7ff7f83d6a50, context=0x7ff7f83d6a50, fs=0x7ff7f83d6b40) at ./md-unwind-support.h:63
#1  uw_frame_state_for (context=0x7ff7f83d6a50, fs=0x7ff7f83d6b40) at /build/gcc/src/gcc/libgcc/unwind-dw2.c:1271
#2  0x00007ff83337433b in _Unwind_Backtrace (trace=0x556d67bdcd00 <backtrace::backtrace::libunwind::trace::trace_fn>, trace_argument=0x7ff7f83d6f40) at /build/gcc/src/gcc/libgcc/unwind.inc:302
#3  0x0000556d688c5d97 in perf_signal_handler ()
#4  <signal handler called>
#5  0x0000556d6832d941 in crossbeam_deque::Stealer<T>::steal_batch_and_pop ()
#6  0x0000000000010830 in ?? ()
#7  0x0000000000000000 in ?? ()

Frame 5 could be many things; it may not actually be crossbeam_deque::Stealer<T>::steal_batch_and_pop.

@sticnarf added the severity/minor and type/bug (Type: Issue - Confirmed a bug) labels on Mar 8, 2021
@github-actions bot added this to Need Triage in Question and Bug Reports on Mar 8, 2021
sticnarf (Contributor, Author)

Probably due to incorrect DWARF info during stack probing.

pprof-rs issue: tikv/pprof-rs#57
rust issue: rust-lang/rust#83139
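
For context, a sketch of the kind of code that involves stack probing. On x86_64 Linux, rustc emits a stack-probe sequence (for example a call to __rust_probestack, or inline probes) in the prologue of functions whose stack frames exceed a page; the crash described here happens when the profiling signal interrupts that sequence while the unwind info does not yet describe the frame correctly. The function below is purely illustrative and not taken from TiKV:

// Illustration only: a large local buffer forces the compiler to emit stack
// probes in this function's prologue on x86_64 Linux.
#[inline(never)]
fn big_frame() -> u8 {
    // A frame much larger than one 4 KiB page => the prologue probes the
    // stack page by page before the frame is fully set up.
    let mut buf = [0u8; 64 * 1024];
    buf[0] = 1;
    buf[buf.len() - 1]
}

fn main() {
    println!("{}", big_frame());
}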

cosven (Member) commented Mar 16, 2021

/severity critical
/remove-severity minor

@jebter added this to the v5.0.0 ga milestone on Mar 21, 2021
Question and Bug Reports automation moved this from Need Triage to Closed(This Week) Mar 23, 2021
breezewish (Member)

Maybe we need to add unit tests for this? @YangKeao

sticnarf (Contributor, Author) commented Mar 24, 2021

Maybe we need to add unit tests for this? @YangKeao

It's hard. It only happens when a signal hits the stack probing part in the prologue of a function :(

I think an integration test may be more practical.
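
For illustration, a minimal sketch of what such an integration test might look like, assuming the pprof crate's public ProfilerGuard API as a dependency; the actual test added to pprof-rs may differ, and the 999 Hz frequency and the fib workload are arbitrary choices:

// Sketch only: start the sampling profiler at a high frequency and run a
// CPU-heavy workload with non-trivial call depth for a while, so that signals
// are likely to land in function prologues (including stack-probe sequences).
use std::time::{Duration, Instant};

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    let guard = pprof::ProfilerGuard::new(999).expect("failed to start profiler");

    let start = Instant::now();
    let mut sink = 0u64;
    while start.elapsed() < Duration::from_secs(60) {
        sink = sink.wrapping_add(fib(25));
    }

    // If no signal handler segfaulted the process, building a report succeeds.
    let report = guard.report().build().expect("failed to build report");
    println!("collected {} distinct stacks (sink = {})", report.data.len(), sink);
}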

YangKeao (Member)

I have prepared an integration test for pprof-rs. Once the Rust team fixes this issue, I will submit and merge it. (And schedule a daily CI run against every nightly Rust.)

YangKeao (Member)

The related PR (rust-lang/rust#83412) was merged 14 hours ago, so I can test it today or tomorrow.

@breezewish reopened this on Mar 15, 2022
Question and Bug Reports automation moved this from Closed(This Week) to In Progress: under investigation Mar 15, 2022
breezewish (Member) commented Mar 15, 2022

I experienced yet another segfault, this time on the master branch:

Release Version:   6.0.0-alpha
Edition:           Community
Git Commit Hash:   e1fae9469b33ff1ec4fd66aa143a21a5f30f5aa3
Git Commit Branch: heads/refs/tags/v6.0.0-alpha
UTC Build Time:    2022-03-12 15:16:24
Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)
Enable Features:   jemalloc mem-profiling portable sse test-engines-rocksdb cloud-aws cloud-gcp cloud-azure
Profile:           dist_release
(gdb) bt full
#0  x86_64_fallback_frame_state (context=0x7f78451a19c0, context=0x7f78451a19c0, fs=0x7f78451a1ab0) at ./md-unwind-support.h:63
        pc = 0x1760ff90 <error: Cannot access memory at address 0x1760ff90>
        sc = <optimized out>
        new_cfa = <optimized out>
        pc = <optimized out>
        sc = <optimized out>
        new_cfa = <optimized out>
        uc_ = <optimized out>
#1  uw_frame_state_for (context=context@entry=0x7f78451a19c0, fs=fs@entry=0x7f78451a1ab0) at ../../../src/libgcc/unwind-dw2.c:1265
        fde = 0x0
        cie = <optimized out>
        aug = <optimized out>
        insn = <optimized out>
        end = <optimized out>
#2  0x00007f7898f43098 in _Unwind_Backtrace (trace=0x563b236b05f0 <backtrace::backtrace::libunwind::trace::trace_fn>, trace_argument=0x7f78451a2610) at ../../../src/libgcc/unwind.inc:302
        fs = {regs = {reg = {{loc = {reg = 0, offset = 0, exp = 0x0}, how = REG_UNSAVED} <repeats 18 times>}, prev = 0x0, cfa_offset = 0, cfa_reg = 0, cfa_exp = 0x0, 
            cfa_how = CFA_UNSET}, pc = 0x0, personality = 0x0, data_align = 0, code_align = 0, retaddr_column = 0, fde_encoding = 0 '\000', lsda_encoding = 0 '\000', saw_z = 0 '\000', 
          signal_frame = 0 '\000', eh_ptr = 0x0}
        context = {reg = {0x7f78451a3350, 0x7f78451a3348, 0x7f78451a3358, 0x7f78451a3ff0, 0x7f78451a3330, 0x7f78451a3328, 0x7f78451a4000, 0x0, 0x7f78451a32e8, 0x7f78451a32f0, 
            0x7f78451a32f8, 0x7f78451a3300, 0x7f78451a3ff8, 0x7f78451a3310, 0x7f78451a3318, 0x7f78451a3320, 0x7f78451a4008, 0x0}, cfa = 0x7f78451a4010, ra = 0x1760ff90, lsda = 0x0, 
          bases = {tbase = 0x0, dbase = 0x0, func = 0x7fff2c7db950 <clock_gettime>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
        code = <optimized out>
#3  0x0000563b23b5504d in backtrace::backtrace::libunwind::trace (cb=...) at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.61/src/backtrace/libunwind.rs:90
No locals.
#4  backtrace::backtrace::trace_unsynchronized (cb=...) at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.61/src/backtrace/mod.rs:66
No locals.
#5  pprof::profiler::perf_signal_handler (_signal=<optimized out>, _siginfo=<optimized out>, ucontext=0x7f78451a32c0)
    at /rust/git/checkouts/pprof-rs-e0fcf6c5ee7a42e1/400456e/src/profiler.rs:251
        index = 0
        bt = smallvec::SmallVec<[backtrace::backtrace::Frame; 32]> {capacity: 4, data: smallvec::SmallVecData<[backtrace::backtrace::Frame; 32]>}
#6  <signal handler called>
No locals.
#7  0x00007fff2c7db6d8 in ?? ()
No symbol table info available.
#8  0x00007fff2c7db982 in clock_gettime ()
No symbol table info available.
#9  0x000000001760ff90 in ?? ()
No symbol table info available.
#10 0x00000000000057f2 in ?? ()
No symbol table info available.
#11 0xd7d6e3d38ce94500 in ?? ()
No symbol table info available.
#12 0x00007f78451a40c0 in ?? ()
No symbol table info available.
#13 0x0000563b2552c558 in rocksdb::WriteThread::AwaitState(rocksdb::WriteThread::Writer*, unsigned char, rocksdb::WriteThread::AdaptationContext*) ()
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/1494bb5/librocksdb_sys/rocksdb/db/write_thread.cc:151
        now = <optimized out>
        spin_begin = <optimized out>
        state = <optimized out>
        perf_step_timer_write_thread_wait_nanos = <error reading variable perf_step_timer_write_thread_wait_nanos (Cannot access memory at address 0x5782)>
        kMaxSlowYieldsWhileSpinning = 3
        yield_credit = <optimized out>

sticnarf (Contributor, Author)

@breeswish Can you provide some more information? What is your Linux distribution and what is your libgcc version? And do you have an estimate of how frequently the segfault occurs?

breezewish (Member) commented Mar 17, 2022

@breeswish Can you provide some more information? What is your Linux distribution and what is your libgcc version? And do you have an estimate of how frequently the segfault occurs?

@sticnarf

I experienced this crash under:

  • Instance: AWS r5.2xlarge
  • Base image: Ubuntu 18.04 LTS

The crash can happen after applying some workloads for several hours with Conprof enabled. Conprof fetches profiles for 10s every 1 minute. The last reproduction took 5 hours for one TiKV and 3 hours for another. I was told that a long profiling run (e.g. 120s) may also reproduce the problem easily, but I have not tried that yet. I also reproduced the crash after profiling for 50+ minutes at 499 Hz.

@mornyx is currently working on this crash. Detailed material will be shared later this week.

Update: There is no crash in CentOS 7.9.

Update 2: dwarfdump reports some harmless errors when checking the DWARF info on Ubuntu 18.04 (CentOS 7.9 passes the check), so I'm still not sure what's wrong with the vdso.so in Ubuntu 18.04:

➜ ~ dwarfdump -ka vdso.so 

*** HARMLESS ERROR COUNT: 2 ***

*** DWARF CHECK: DW_DLE_DEBUG_FRAME_LENGTH_NOT_MULTIPLE len=0x00000050, len size=0x00000004, extn size=0x00000000, totl length=0x00000054, addr size=0x00000008, mod=0x00000004 must be zero in fde, offset 0x00000018. ***

*** DWARF CHECK: DW_DLE_DEBUG_FRAME_LENGTH_NOT_MULTIPLE len=0x00000020, len size=0x00000004, extn size=0x00000000, totl length=0x00000024, addr size=0x00000008, mod=0x00000004 must be zero in fde, offset 0x0000011c. ***

Question and Bug Reports automation moved this from In Progress: under investigation to Closed(This Week) Mar 19, 2022
ti-chi-bot pushed a commit that referenced this issue Mar 19, 2022
close #9765

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
ti-chi-bot pushed a commit that referenced this issue Mar 22, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
ti-chi-bot added a commit that referenced this issue Apr 28, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@glorv added the affects-5.3 (This bug affects 5.3.x versions.) label on Jun 14, 2022
ti-chi-bot added a commit that referenced this issue Jun 15, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>