
Built-in profiling can cause a segmentation fault #9765

Closed
sticnarf opened this issue Mar 8, 2021 · 9 comments · Fixed by #9858 or #12201
Labels: affects-5.3 (This bug affects 5.3.x versions.), affects-5.4, affects-6.0, severity/critical, type/bug (Type: Issue - Confirmed a bug)

sticnarf (Contributor) commented Mar 8, 2021

Bug Report

What version of TiKV are you using?

Latest nightly: dc8ce2c

What operating system and CPU are you using?

Linux. It happens on both CentOS 7 (3.10.0-957.27.2.el7.x86_64) and Arch Linux (5.11.2-arch1-1).

Steps to reproduce

Run YCSB workloada on a 3-node TiKV cluster. Meanwhile, get http://{TiKV_IP}:20180/debug/pprof/profile?seconds=30.
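
For reference, a minimal sketch (in Rust, using only the standard library) of the second step. This is an illustration only, not a tool from the report; the address 127.0.0.1:20180 is a placeholder for one TiKV node's status address.

// Fetch a 30-second CPU profile from the TiKV status server over plain HTTP.
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Replace with the status address of a TiKV node that is under load.
    let mut stream = TcpStream::connect("127.0.0.1:20180")?;
    // The handler samples CPU for `seconds` and then returns the profile in the body.
    stream.write_all(
        b"GET /debug/pprof/profile?seconds=30 HTTP/1.1\r\n\
          Host: 127.0.0.1:20180\r\n\
          Connection: close\r\n\r\n",
    )?;
    let mut body = Vec::new();
    stream.read_to_end(&mut body)?;
    println!("received {} bytes", body.len());
    Ok(())
}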

What did you expect?

Get a CPU profiling flamegraph.

What happened?

A segmentation fault occurs. Backtrace from the core dump:

Program terminated with signal SIGSEGV, Segmentation fault.
[Current thread is 1 (Thread 0x7ff7f83fe640 (LWP 162560))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /data/deploy/tikv-20160/bin/tikv-server.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0  x86_64_fallback_frame_state (context=0x7ff7f83d6a50, context=0x7ff7f83d6a50, fs=0x7ff7f83d6b40) at ./md-unwind-support.h:63
#1  uw_frame_state_for (context=0x7ff7f83d6a50, fs=0x7ff7f83d6b40) at /build/gcc/src/gcc/libgcc/unwind-dw2.c:1271
#2  0x00007ff83337433b in _Unwind_Backtrace (trace=0x556d67bdcd00 <backtrace::backtrace::libunwind::trace::trace_fn>, trace_argument=0x7ff7f83d6f40) at /build/gcc/src/gcc/libgcc/unwind.inc:302
#3  0x0000556d688c5d97 in perf_signal_handler ()
#4  <signal handler called>
#5  0x0000556d6832d941 in crossbeam_deque::Stealer<T>::steal_batch_and_pop ()
#6  0x0000000000010830 in ?? ()
#7  0x0000000000000000 in ?? ()

Frame 5 could be many things; it may not actually be crossbeam_deque::Stealer<T>::steal_batch_and_pop.

@sticnarf added the severity/minor and type/bug (Type: Issue - Confirmed a bug) labels on Mar 8, 2021
@github-actions bot added this to Need Triage in Question and Bug Reports on Mar 8, 2021
sticnarf (Contributor, Author)

Probably due to incorrect DWARF info during stack probing.

pprof-rs issue: tikv/pprof-rs#57
rust issue: rust-lang/rust#83139
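
For context, a sketch of the kind of code that involves stack probing. On x86_64 Linux, rustc emits a stack-probe sequence (for example a call to __rust_probestack, or inline probes) in the prologue of functions whose stack frames exceed a page; the crash described here happens when the profiling signal interrupts that sequence while the unwind info does not yet describe the frame correctly. The function below is purely illustrative and not taken from TiKV:

// Illustration only: a large local buffer forces the compiler to emit stack
// probes in this function's prologue on x86_64 Linux.
#[inline(never)]
fn big_frame() -> u8 {
    // A frame much larger than one 4 KiB page => the prologue probes the
    // stack page by page before the frame is fully set up.
    let mut buf = [0u8; 64 * 1024];
    buf[0] = 1;
    buf[buf.len() - 1]
}

fn main() {
    println!("{}", big_frame());
}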

cosven (Member) commented Mar 16, 2021

/severity critical
/remove-severity minor

@jebter added this to the v5.0.0 ga milestone on Mar 21, 2021
Question and Bug Reports automation moved this from Need Triage to Closed(This Week) Mar 23, 2021
breezewish (Member)

Maybe we need to add unit tests for this? @YangKeao

sticnarf (Contributor, Author) commented Mar 24, 2021

Maybe we need to add unit tests for this? @YangKeao

It's hard. It only happens when a signal hits the stack probing part in the prologue of a function :(

I think an integration test may be more practical.
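
For illustration, a minimal sketch of what such an integration test might look like, assuming the pprof crate's public ProfilerGuard API as a dependency; the actual test added to pprof-rs may differ, and the 999 Hz frequency and the fib workload are arbitrary choices:

// Sketch only: start the sampling profiler at a high frequency and run a
// CPU-heavy workload with non-trivial call depth for a while, so that signals
// are likely to land in function prologues (including stack-probe sequences).
use std::time::{Duration, Instant};

fn fib(n: u64) -> u64 {
    if n < 2 { n } else { fib(n - 1) + fib(n - 2) }
}

fn main() {
    let guard = pprof::ProfilerGuard::new(999).expect("failed to start profiler");

    let start = Instant::now();
    let mut sink = 0u64;
    while start.elapsed() < Duration::from_secs(60) {
        sink = sink.wrapping_add(fib(25));
    }

    // If no signal handler segfaulted the process, building a report succeeds.
    let report = guard.report().build().expect("failed to build report");
    println!("collected {} distinct stacks (sink = {})", report.data.len(), sink);
}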

YangKeao (Member)

I have prepared an integration test for pprof-rs. Once the Rust team fixes this issue, I will submit and merge it. (And schedule a daily CI run against every nightly Rust.)

YangKeao (Member)

The related PR (rust-lang/rust#83412) was merged 14 hours ago, so I can test it today or tomorrow.

@breezewish reopened this on Mar 15, 2022
Question and Bug Reports automation moved this from Closed(This Week) to In Progress: under investigation Mar 15, 2022
breezewish (Member) commented Mar 15, 2022

I experienced yet another segfault, this time on the master branch:

Release Version:   6.0.0-alpha
Edition:           Community
Git Commit Hash:   e1fae9469b33ff1ec4fd66aa143a21a5f30f5aa3
Git Commit Branch: heads/refs/tags/v6.0.0-alpha
UTC Build Time:    2022-03-12 15:16:24
Rust Version:      rustc 1.60.0-nightly (1e12aef3f 2022-02-13)
Enable Features:   jemalloc mem-profiling portable sse test-engines-rocksdb cloud-aws cloud-gcp cloud-azure
Profile:           dist_release
(gdb) bt full
#0  x86_64_fallback_frame_state (context=0x7f78451a19c0, context=0x7f78451a19c0, fs=0x7f78451a1ab0) at ./md-unwind-support.h:63
        pc = 0x1760ff90 <error: Cannot access memory at address 0x1760ff90>
        sc = <optimized out>
        new_cfa = <optimized out>
        pc = <optimized out>
        sc = <optimized out>
        new_cfa = <optimized out>
        uc_ = <optimized out>
#1  uw_frame_state_for (context=context@entry=0x7f78451a19c0, fs=fs@entry=0x7f78451a1ab0) at ../../../src/libgcc/unwind-dw2.c:1265
        fde = 0x0
        cie = <optimized out>
        aug = <optimized out>
        insn = <optimized out>
        end = <optimized out>
#2  0x00007f7898f43098 in _Unwind_Backtrace (trace=0x563b236b05f0 <backtrace::backtrace::libunwind::trace::trace_fn>, trace_argument=0x7f78451a2610) at ../../../src/libgcc/unwind.inc:302
        fs = {regs = {reg = {{loc = {reg = 0, offset = 0, exp = 0x0}, how = REG_UNSAVED} <repeats 18 times>}, prev = 0x0, cfa_offset = 0, cfa_reg = 0, cfa_exp = 0x0, 
            cfa_how = CFA_UNSET}, pc = 0x0, personality = 0x0, data_align = 0, code_align = 0, retaddr_column = 0, fde_encoding = 0 '\000', lsda_encoding = 0 '\000', saw_z = 0 '\000', 
          signal_frame = 0 '\000', eh_ptr = 0x0}
        context = {reg = {0x7f78451a3350, 0x7f78451a3348, 0x7f78451a3358, 0x7f78451a3ff0, 0x7f78451a3330, 0x7f78451a3328, 0x7f78451a4000, 0x0, 0x7f78451a32e8, 0x7f78451a32f0, 
            0x7f78451a32f8, 0x7f78451a3300, 0x7f78451a3ff8, 0x7f78451a3310, 0x7f78451a3318, 0x7f78451a3320, 0x7f78451a4008, 0x0}, cfa = 0x7f78451a4010, ra = 0x1760ff90, lsda = 0x0, 
          bases = {tbase = 0x0, dbase = 0x0, func = 0x7fff2c7db950 <clock_gettime>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
        code = <optimized out>
#3  0x0000563b23b5504d in backtrace::backtrace::libunwind::trace (cb=...) at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.61/src/backtrace/libunwind.rs:90
No locals.
#4  backtrace::backtrace::trace_unsynchronized (cb=...) at /rust/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.61/src/backtrace/mod.rs:66
No locals.
#5  pprof::profiler::perf_signal_handler (_signal=<optimized out>, _siginfo=<optimized out>, ucontext=0x7f78451a32c0)
    at /rust/git/checkouts/pprof-rs-e0fcf6c5ee7a42e1/400456e/src/profiler.rs:251
        index = 0
        bt = smallvec::SmallVec<[backtrace::backtrace::Frame; 32]> {capacity: 4, data: smallvec::SmallVecData<[backtrace::backtrace::Frame; 32]>}
#6  <signal handler called>
No locals.
#7  0x00007fff2c7db6d8 in ?? ()
No symbol table info available.
#8  0x00007fff2c7db982 in clock_gettime ()
No symbol table info available.
#9  0x000000001760ff90 in ?? ()
No symbol table info available.
#10 0x00000000000057f2 in ?? ()
No symbol table info available.
#11 0xd7d6e3d38ce94500 in ?? ()
No symbol table info available.
#12 0x00007f78451a40c0 in ?? ()
No symbol table info available.
#13 0x0000563b2552c558 in rocksdb::WriteThread::AwaitState(rocksdb::WriteThread::Writer*, unsigned char, rocksdb::WriteThread::AdaptationContext*) ()
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/1494bb5/librocksdb_sys/rocksdb/db/write_thread.cc:151
        now = <optimized out>
        spin_begin = <optimized out>
        state = <optimized out>
        perf_step_timer_write_thread_wait_nanos = <error reading variable perf_step_timer_write_thread_wait_nanos (Cannot access memory at address 0x5782)>
        kMaxSlowYieldsWhileSpinning = 3
        yield_credit = <optimized out>

sticnarf (Contributor, Author)

@breeswish Can you provide some more information? What is your Linux distribution and what is your libgcc version? And do you have an estimate of how frequently the segfault occurs?

breezewish (Member) commented Mar 17, 2022

@breeswish Can you provide some more information? What is your Linux distribution and what is your libgcc version? And do you have an estimate of how frequently the segfault occurs?

@sticnarf

I experienced this crash under:

  • Instance: AWS r5.2xlarge
  • Base image: Ubuntu 18.04 LTS

The crash can happen after applying some workloads for several hours with Conprof enabled. Conprof fetches profiles for 10s every 1 minute. The last reproduction took 5 hours for one TiKV and 3 hours for another. I was told that a long profiling run (e.g. 120s) may also reproduce the problem easily, but I have not tried that yet. I also reproduced the crash after profiling for 50+ minutes at 499 Hz.

@mornyx is currently working on this crash. Detailed material will be shared later this week.

Update: There is no crash in CentOS 7.9.

Update 2: dwarfdump reports some harmless errors when checking the DWARF info on Ubuntu 18.04 (CentOS 7.9 passes the check), so I'm still not sure what's wrong with the vdso.so in Ubuntu 18.04:

➜ ~ dwarfdump -ka vdso.so 

*** HARMLESS ERROR COUNT: 2 ***

*** DWARF CHECK: DW_DLE_DEBUG_FRAME_LENGTH_NOT_MULTIPLE len=0x00000050, len size=0x00000004, extn size=0x00000000, totl length=0x00000054, addr size=0x00000008, mod=0x00000004 must be zero in fde, offset 0x00000018. ***

*** DWARF CHECK: DW_DLE_DEBUG_FRAME_LENGTH_NOT_MULTIPLE len=0x00000020, len size=0x00000004, extn size=0x00000000, totl length=0x00000024, addr size=0x00000008, mod=0x00000004 must be zero in fde, offset 0x0000011c. ***

Question and Bug Reports automation moved this from In Progress: under investigation to Closed(This Week) Mar 19, 2022
ti-chi-bot pushed a commit that referenced this issue Mar 19, 2022
close #9765

Signed-off-by: YangKeao <yangkeao@chunibyo.icu>
ti-chi-bot pushed a commit that referenced this issue Mar 22, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
ti-chi-bot added a commit that referenced this issue Apr 28, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@glorv added the affects-5.3 (This bug affects 5.3.x versions.) label on Jun 14, 2022
ti-chi-bot added a commit that referenced this issue Jun 15, 2022
close #9765, ref #12201

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>

Co-authored-by: YangKeao <yangkeao@chunibyo.icu>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>