Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault while running tests against ASAN build (Clang-16) #321

Closed
ylobankov opened this issue Aug 8, 2023 · 3 comments · Fixed by tarantool/tarantool#8977
Closed

Segfault while running tests against ASAN build (Clang-16) #321

ylobankov opened this issue Aug 8, 2023 · 3 comments · Fixed by tarantool/tarantool#8977
Assignees
Labels
bug Something isn't working

Comments

@ylobankov
Copy link
Member

Tarantool

Tarantool 3.0.0-alpha1-66-g10870343a
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=TRUE
Compiler: Clang-16.0.6
C_FLAGS: -fexceptions -funwind-tables -fasynchronous-unwind-tables -fno-common -msse2 -fsanitize=fuzzer-no-link -fsanitize=address -fsanitize-blacklist=/tarantool/asan/asan.supp  -fmacro-prefix-map=/tarantool=. -std=c11 -Wall -Wextra -fsanitize=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -fno-sanitize-recover=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fasynchronous-unwind-tables -fno-common -msse2 -fsanitize=fuzzer-no-link -fsanitize=address -fsanitize-blacklist=/tarantool/asan/asan.supp  -fmacro-prefix-map=/tarantool=. -std=c++11 -Wall -Wextra -fsanitize=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -fno-sanitize-recover=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

Steps to reproduce

docker run -it --rm tarantool/testing:ubuntu-jammy-clang16
git clone https://github.com/tarantool/tarantool.git
cd tarantool && git submodule update --recursive --init --jobs $(nproc)

Open .test.mk file and remove run-luajit-test target from L112. Save and exit.

CC=clang-16 CXX=clang++-16 TEST_RUN_EXTRA_PARAMS=app/fiber.test.lua make -f .test.mk test-release-asan

The command above runs one test only. If you want to get all failures, remove TEST_RUN_EXTRA_PARAMS=app/fiber.test.lua from the command.

Expected result: Test passed.

Actual result:

======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[001]
[001] [Instance "app"] Failed to start: Broken tarantool console handshake
[001]
[001] [test-run server "app"] Last 15 lines of the log file /tmp/t/001_app/app.log:
[001]   fs       0x0                0
[001]   cr2      0x7f9294552f60     140267530563424
[001]   err      0x4                4
[001]   oldmask  0x0                0
[001]   trapno   0xe                14
[001] Current time: 1691493315
[001] Please file a bug at https://github.com/tarantool/tarantool/issues
[001] Attempting backtrace... Note: since the server has already crashed,
[001] this may fail as well
[001] #1  0x55e5a1f1581f in crash_collect+575
[001] #2  0x55e5a1f155a8 in crash_signal_cb+424
[001] #3  0x7f9298145520 in __sigaction+80
[001] #4  0x55e5a1f260f9 in fiber_yield+1113
[001] #5  0x55e5a1f2d973 in fiber_loop+2259
[001] #6  0x55e5a2860273 in coro_init+307
[001]
[001] [Instance "app"] Failed to start: Broken tarantool console handshake
---------------------------------------------------------------------------------
@locker
Copy link
Member

locker commented Aug 8, 2023

To reproduce the issue, it's suffice to start tarantool in the interactive mode and then exit. The problem is buried somewhere in the on_shutdown triggers because disabling them helps.

diff --git a/src/main.cc b/src/main.cc
index 5c7587135915..40b89f6c9368 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -162,7 +162,7 @@ on_shutdown_f(va_list ap)
 	while (!is_shutting_down)
 		fiber_yield();
 
-	if (trigger_fiber_run(&box_on_shutdown_trigger_list, NULL,
+	if (0 && trigger_fiber_run(&box_on_shutdown_trigger_list, NULL,
 			      on_shutdown_trigger_timeout) != 0) {
 		say_error("on_shutdown triggers failed");
 		diag_log();

@Gumix
Copy link

Gumix commented Aug 8, 2023

This change also helps...

diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index 958e4a02f..4ab7bd282 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -710,6 +710,7 @@ fiber_yield(void)
        cord->fiber = callee;
        callee->flags = (callee->flags & ~FIBER_IS_READY) | FIBER_IS_RUNNING;

+       __asm__ __volatile__("": : :"memory");
        ASAN_START_SWITCH_FIBER(asan_state,
                                (caller->flags & FIBER_IS_DEAD) == 0,
                                callee->stack,

This also helps:

diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index 958e4a02f..3e717d64e 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -711,7 +711,7 @@ fiber_yield(void)
        callee->flags = (callee->flags & ~FIBER_IS_READY) | FIBER_IS_RUNNING;

        ASAN_START_SWITCH_FIBER(asan_state,
-                               (caller->flags & FIBER_IS_DEAD) == 0,
+                               true,
                                callee->stack,
                                callee->stack_size);
        coro_transfer(&caller->ctx, &callee->ctx);

A similar crash: google/sanitizers#189 (comment)

@Gumix
Copy link

Gumix commented Aug 9, 2023

So, what happens when Ctrl+D is pressed:

tarantool> ^D
fiber_recycle(103/interactive)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
==2744649==T0: FakeStack destroyed: 0: 594/8192; 1: 153/4096; 2: 74/2048; 3: 0/1024; 4: 30/512; 5: 0/256; 6: 4/128; 7: 0/64; 8: 87/32; 9: 0/16; 10: 0/8;
fiber_call_impl(1/sched -> 102/on_shutdown)
==2744649==T0: FakeStack created: 0x7f1c5027b000 -- 0x7f1c50800000 stack_size_log: 19; mmapped 5652K, noreserve=0
fiber_call_impl(102/on_shutdown -> 104/trigger_fiber0)
tarantool: src/lib/core/fiber.c:728: void fiber_yield(void): Assertion `asan_state != NULL' failed.
  1. The 103/interactive fiber is recycled and fiber_yield() is called to switch to the scheduler.
  2. In fiber_yield() ASAN allocates a fake stack by __asan_stack_malloc_0() for the stack-use-after-return detection.
  3. __sanitizer_start_switch_fiber() is called with fake_stack_save == NULL, because the fiber is marked as dead.
  4. __sanitizer_start_switch_fiber() destroys the fake stack.
  5. coro_transfer() switches to the scheduler, which switches to 102/on_shutdown.
  6. In trigger_fiber_run() ASAN allocates a new fake stack at the same addresses as [2].
  7. A new fiber 104/trigger_fiber0 is created from the recycled 103/interactive.
  8. 104/trigger_fiber0 is started, so now we are just after coro_transfer(), see [5].
  9. At the end of fiber_yield() ASAN tries to access the fake stack and crashes, because it was unmapped at [4] and recreated at [6].

With ASAN_START_SWITCH_FIBER(asan_state, true, ...) it doesn't crash, because the fake stack is not destroyed:

tarantool> ^D
fiber_recycle(103/interactive)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
fiber_call_impl(1/sched -> 102/on_shutdown)
==2750102==T0: FakeStack created: 0x7fb22f771000 -- 0x7fb22fcf6000 stack_size_log: 19; mmapped 5652K, noreserve=0
fiber_call_impl(102/on_shutdown -> 104/trigger_fiber0)
fiber_yield(104/session.shutdown, FIBER_IS_DEAD) -> 102/on_shutdown
fiber_recycle(104/session.shutdown)
fiber_recycle(102/on_shutdown)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
cord_destroy()

TLDR: We said to ASAN that we wouldn't switch back to a fiber, but switched to its recycled incarnation.

Gumix added a commit to Gumix/tarantool that referenced this issue Aug 11, 2023
The `__sanitizer_start_switch_fiber()` function takes a pointer as the
first argument to store the current fake stack if there is one (it is
necessary when stack-use-after-return detection is enabled). When leaving a
fiber definitely, NULL must be passed so that the fake stack is destroyed.

Before this patch, NULL was passed for dead fibers, however this is wrong
for dead fibers that are recycled and resumed. In such cases ASAN destroys
the fake stack, and the fiber crashes trying to use it in `fiber_yield()`
upon return from `coro_transfer()`.

Closes tarantool/tarantool-qa#321

NO_DOC=bugfix
NO_TEST=tested by test-release-asan workflow
Gumix added a commit to Gumix/tarantool that referenced this issue Aug 14, 2023
The `__sanitizer_start_switch_fiber()` function takes a pointer as the
first argument to store the current fake stack if there is one (it is
necessary when stack-use-after-return detection is enabled). When leaving a
fiber definitely, NULL must be passed so that the fake stack is destroyed.

Before this patch, NULL was passed for dead fibers, however this is wrong
for dead fibers that are recycled and resumed. In such cases ASAN destroys
the fake stack, and the fiber crashes trying to use it in `fiber_yield()`
upon return from `coro_transfer()`.

Closes tarantool/tarantool-qa#321

NO_DOC=bugfix
NO_TEST=tested by test-release-asan workflow
locker pushed a commit to tarantool/tarantool that referenced this issue Aug 15, 2023
The `__sanitizer_start_switch_fiber()` function takes a pointer as the
first argument to store the current fake stack if there is one (it is
necessary when stack-use-after-return detection is enabled). When leaving a
fiber definitely, NULL must be passed so that the fake stack is destroyed.

Before this patch, NULL was passed for dead fibers, however this is wrong
for dead fibers that are recycled and resumed. In such cases ASAN destroys
the fake stack, and the fiber crashes trying to use it in `fiber_yield()`
upon return from `coro_transfer()`.

Closes tarantool/tarantool-qa#321

NO_DOC=bugfix
NO_TEST=tested by test-release-asan workflow
locker pushed a commit to tarantool/tarantool that referenced this issue Aug 15, 2023
The `__sanitizer_start_switch_fiber()` function takes a pointer as the
first argument to store the current fake stack if there is one (it is
necessary when stack-use-after-return detection is enabled). When leaving a
fiber definitely, NULL must be passed so that the fake stack is destroyed.

Before this patch, NULL was passed for dead fibers, however this is wrong
for dead fibers that are recycled and resumed. In such cases ASAN destroys
the fake stack, and the fiber crashes trying to use it in `fiber_yield()`
upon return from `coro_transfer()`.

Closes tarantool/tarantool-qa#321

NO_DOC=bugfix
NO_TEST=tested by test-release-asan workflow

(cherry picked from commit 72a6abe)
locker pushed a commit to tarantool/tarantool that referenced this issue Aug 15, 2023
The `__sanitizer_start_switch_fiber()` function takes a pointer as the
first argument to store the current fake stack if there is one (it is
necessary when stack-use-after-return detection is enabled). When leaving a
fiber definitely, NULL must be passed so that the fake stack is destroyed.

Before this patch, NULL was passed for dead fibers, however this is wrong
for dead fibers that are recycled and resumed. In such cases ASAN destroys
the fake stack, and the fiber crashes trying to use it in `fiber_yield()`
upon return from `coro_transfer()`.

Closes tarantool/tarantool-qa#321

NO_DOC=bugfix
NO_TEST=tested by test-release-asan workflow

(cherry picked from commit 72a6abe)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants