Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic while formatting panic message + always_abort leads to stack overflow #122940

Open
RalfJung opened this issue Mar 23, 2024 · 24 comments
Open
Labels
A-panic Area: Panicking machinery C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@RalfJung
Copy link
Member

RalfJung commented Mar 23, 2024

Example and initial analysis by @correabuscar:

#![feature(panic_always_abort)]

use std::fmt::{Display, self};

struct MyStruct;

impl Display for MyStruct {
    fn fmt(&self, _: &mut fmt::Formatter<'_>) -> fmt::Result {
        panic!("this is bad: {}", MyStruct);
    }
}

fn main() {
    std::panic::always_abort();
    println!("{}", MyStruct);
}

The reason for this is that if both "panic while processing panic" and "panic after always_abort" apply, then we use the AlwaysAbort code path which does format the panic message.

This could be fixed by moving this check further down below the in_panic_hook check:

if global_count & ALWAYS_ABORT_FLAG != 0 {
return Some(MustAbort::AlwaysAbort);
}

But I don't know if this has other adverse side-effects. Can we even access thread-local variables after fork (i.e., when always-abort is set)?
Cc @Amanieu

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Mar 23, 2024
@correabuscar
Copy link

correabuscar commented Mar 23, 2024

Can we even access thread-local variables after fork?

we can't due to // Accessing LOCAL_PANIC_COUNT in a child created bylibc::fork would lead to a memory allocation. from here: https://github.com/rust-lang/rust/blob/9f25a04498fbe30dcf6a9c764953704c333a0137/library/std/src/panicking.rs#L372C4-L373C19

(Presumably thread locals cannot be done without allocating in rust, but can in C ?)

Unless we do thread local in a .c library which is NOT guaranteed to not allocate on all platforms (but on linux and windows shouldn't allocate)
I've an example of this being done here, reproduced below (but I doubt this will even be considered even for just the few platforms that is guaranteed to not alloc):

Cargo.toml

[package]
name = "c_thread_local_in_rust"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

[build-dependencies]
#cc = { version="1", path = "/tmp/cc-rs"  } #used this because it doesn't do this warn: warning: c_thread_local_in_rust@0.1.0: Compiler version doesn't include clang or GCC: "cc" "--version"
cc = "1.0.90" #otherwise, using this is preferred

build.rs

fn main() {
    // Compile the C code into a shared library using cc crate
    cc::Build::new()
        .file("tl.c")
        .shared_flag(true)
        .compile("tl");
}

tl.c

// example.c
//#include <stdio.h>

// Define a thread-local variable
__thread int thread_local_var;

// Function to access the thread-local variable
int *get_thread_local_var() {
    return &thread_local_var;
}

// Function to set the value of the thread-local variable
void set_thread_local_var(int value) {
    thread_local_var = value;
}

/*
   In C, thread-local storage (TLS) can be implemented in different ways, and whether it involves heap allocation or new allocations depends on the specific implementation and the platform.

However, the most common and efficient way to implement thread-local storage in C is by using platform-specific mechanisms such as compiler extensions (`__thread` specifier) or standard library functions (`pthread_key_create`, `TlsAlloc`, etc.), which typically allocate thread-local storage directly from the system or using thread-specific data structures managed by the runtime environment.

For example, in the case of the `__thread` specifier in GCC/Clang, thread-local variables are often allocated directly within the thread's control block or using thread-specific registers, avoiding heap allocation entirely.

Similarly, with `pthread_key_create` or Windows-specific functions like `TlsAlloc`, the thread-local storage is managed by the operating system or the runtime environment, typically without involving heap allocation.

That said, it's crucial to understand that the exact implementation details can vary across platforms, compilers, and runtime environments. Therefore, while thread-local storage in C is often implemented efficiently without heap allocation, it's essential to consult the documentation and consider the specific context and requirements of your application to ensure optimal performance and behavior.

*/

main.rs

//mod ffi;
use std::thread;
use std::time::Duration;

//so, unlike thread_local!() in rust which supposedly somehow does memory allocation(s) on heap
//upon first access of the var, this in .c file thread local does not, but cannot guarantee it
//doesn't on all other platforms

fn main() {
    println!("Hello, world!");
    let value = access_thread_local_var();
    println!("Thread-local value: {}", value);
    set_thread_local_var2(20);
    let value = access_thread_local_var();
    println!("Thread-local value: {}", value);

    // Spawn a new thread
    let handle = thread::spawn(|| {
        // This closure represents the code that will run in the new thread
        for i in 1..=5 {
            let value = access_thread_local_var();
            println!("Hello from the spawned thread! Message {} ltvar={}", i,value);
            thread::sleep(Duration::from_millis(500)); // Sleep for 500 milliseconds
            set_thread_local_var2(10*i);
        }
    });

    // Main thread continues executing concurrently with the spawned thread
    for i in 1..=3 {
        let value = access_thread_local_var();
        println!("Hello from the main thread! Message {} ltvar={}", i,value);
        thread::sleep(Duration::from_millis(1000)); // Sleep for 1 second
        set_thread_local_var2(i);
    }

    // Wait for the spawned thread to finish
    handle.join().unwrap();

    println!("Main thread exiting.");
}
//
// Link against the C shared library
//#[link(name = "tl")] //looks like this isn't needed!
//extern {}

// Rust code (lib.rs)
extern {
    // Define an external function to access the thread-local variable
    fn get_thread_local_var() -> *mut i32;
    fn set_thread_local_var(value: i32);
}

// Function to access the thread-local variable from Rust
pub fn access_thread_local_var() -> i32 {
    unsafe {
        // Call the external function to get a pointer to the thread-local variable
        let ptr = get_thread_local_var();
        // Dereference the pointer to access the value of the thread-local variable
        *ptr
    }
}

// Function to set the value of the thread-local variable from Rust
pub fn set_thread_local_var2(value: i32) {
    unsafe {
        // Call the external function to set the value of the thread-local variable
        set_thread_local_var(value);
    }
}

output:

     Running `target/debug/c_thread_local_in_rust`
Hello, world!
Thread-local value: 0
Thread-local value: 20
Hello from the main thread! Message 1 ltvar=20
Hello from the spawned thread! Message 1 ltvar=0
Hello from the spawned thread! Message 2 ltvar=10
Hello from the main thread! Message 2 ltvar=1
Hello from the spawned thread! Message 3 ltvar=20
Hello from the spawned thread! Message 4 ltvar=30
Hello from the main thread! Message 3 ltvar=2
Hello from the spawned thread! Message 5 ltvar=40
Main thread exiting.

As an aside, if I do figure out how to get thread local storage on rust without allocating, then I might be able to fix a bug in libtest whereby tests ran in-process as threads (except for the case of --test-threads=1 which somehow works) will cause harness to exit because the test is using exit() or test infinitely panics and other variants, which would all be catch_unwind-able after the fix.

@RalfJung
Copy link
Member Author

RalfJung commented Mar 23, 2024

(Presumably thread locals cannot be done without allocating in rust, but can in C ?)

This certainly used to be true, but nowadays LOCAL_PANIC_COUNT is a const TLS variable with no destructor. So that comment may be outdated. But I am not sure whether const TLS variable with no destructor are alloc-free on all platforms. @m-ou-se or @joboet might know.

EDIT: Looking at the code, fast_local does not need allocations but os_local does. So it depends on cfg(target_thread_local). At least on one of the Windows GNU MSVC targets we don't have target_thread_local so we need allocations.

Not sure how else we could detect reentrancy in the panic handler though. Maybe we can make it depend on cfg(target_thread_local) so that at least on those platforms where we don't need to allocate, we can avoid the stack overflow?

@RalfJung
Copy link
Member Author

RalfJung commented Mar 23, 2024

As an aside, if I do figure out how to get thread local storage on rust without allocating, then I might be able to fix a bug in libtest whereby tests ran in-process as threads (except for the case of --test-threads=1 which somehow works) will cause harness to exit because the test is using exit() or test infinitely panics and other variants, which would all be catch_unwind-able after the fix.

I don't know what you mean, but if there is a bug here, please report it as a new issue. It's not related to this issue, so please let's avoid getting off-topic.

@correabuscar
Copy link

As an aside, if I do figure out how to get thread local storage on rust without allocating, then I might be able to fix a bug in libtest whereby tests ran in-process as threads (except for the case of --test-threads=1 which somehow works) will cause harness to exit because the test is using exit() or test infinitely panics and other variants, which would all be catch_unwind-able after the fix.

I don't know what you mean, but if there is a bug here, please report it as a new issue. It's not related to this issue, so please let's avoid getting off-topic.

You're right. (though I didn't yet scope the whole thing about that issue, I'm only aware of 3 ways that test would break out of the test harness that aren't fixed yet, so I'm postponing reporting it until I do understand it and can formulate it properly - also, perfectionism sux)

But what I should've said, that is relevant to this issue(or so I think), is the following:
that if we have thread locals that do not alloc, in rust (or particularly at that point in rust_panic_with_hook function), then,
we could keep track of and back off of each path on each recursive panic call.
Say for example we initially try to show the 'message' (mark this path as being in progress, eg. 1 bit in some U64 maybe) then if that panics, we get back to the same place, see that the path was taken(ie. we're already in it, but panic-ed inside it, thus recursed back to it) then we take a lesser path next which is to not show 'message', so show location only, and if even that path recurses (ie. panics) then an even lesser path of not showing location but just a normal hardcoded message and even this path fails somehow and so on ...

parens

(that being said, 'message' can still allocate since it's user-controlled and they may choose to do so, it breaks the 'fork' case of needing to avoid any allocations which is why we've used panic::always_abort in the first place - but this is another issue)

however this is just the panic::always_abort() case branch of paths, this could possibly be done for the whole tree of paths inside the rust_panic_with_hook function. If we had thread locals that don't alloc on first access(maybe we do, I just don't know it yet - you mentioned fast_local which I know nothing of, atm).
Either way I plan to do it(or something like it) locally(local patch) for my own rustc usage(as I assume upstream won't be interested if it's hacky - eg. uses __thread int a_thread_local_var; from a .c lib), as soon as I find a way to get a thread local without allocating (to ensure the fork case is covered).

@RalfJung
Copy link
Member Author

that if we have thread locals that do not alloc, in rust (or particularly at that point in rust_panic_with_hook function), then,
we could keep track of and back off of each path on each recursive panic call.

I'm not sure it's worth the effort to distinguish whether it's the panic message formatting or the panic hook that caused the recursive panic. But either way that possible improvement is orthogonal to this issue as well.

@correabuscar
Copy link

correabuscar commented Mar 23, 2024

that if we have thread locals that do not alloc, in rust (or particularly at that point in rust_panic_with_hook function), then,
we could keep track of and back off of each path on each recursive panic call.

I'm not sure it's worth the effort to distinguish whether it's the panic message formatting or the panic hook that caused the recursive panic. But either way that possible improvement is orthogonal to this issue as well.

Well no, it would prevent any infinite recursion because eventually at 3rd or whatever panic recursion it would either print a harcoded message, or if even that one panic-ed, then it wouldn't print anything just reach the abort call.

That was the point, progressively back off of paths that we've been on, until the only path left is one that doesn't panic or the last one which is abort.

EDIT: that being said, maybe I'm misunderstanding something and that's why the above makes sense to me but is in fact wrong (i'm attempting to figure that out and update this after)
EDIT2: confirmed works, here's an equivalent(but wrongly implemented way) sample:

output:

    Running `target/debug/panic_in_display`
must_abort=Some(AlwaysAbort)
in panic_count::MustAbort::AlwaysAbort
panicked at /home/user/sandbox/rust/05_sandbox/tests/panic_in_display/src/main.rs:21:9:
oh2 no, 'must_abort=Some(AlwaysAbort)
in panic_count::MustAbort::AlwaysAbort
panicked after panic::always_abort(), aborting. (and recursion prevented) loc='Location { file: "/home/user/sandbox/rust/05_sandbox/tests/panic_in_display/src/main.rs", line: 14, col: 9 }'
about to abort_internal()
./gmy: line 5: 143389 Aborted                 (core dumped) cargo run
exit code: 134

for this code:

#![feature(panic_always_abort)]
//
//src: https://github.com/rust-lang/rust/issues/97181
//this double panic no longer shows stacktrace due to https://github.com/rust-lang/rust/pull/110975

use std::fmt::{Display, self};

struct MyStruct;
struct MyStruct2;
impl Display for MyStruct2 {
    fn fmt(&self, _: &mut fmt::Formatter<'_>) -> fmt::Result {
        //todo!()
        let instance = MyStruct;
        panic!("oh1 no, '{}' was unexpected", instance);
    }
}

impl Display for MyStruct {
    fn fmt(&self, _: &mut fmt::Formatter<'_>) -> fmt::Result {
        let instance2 = MyStruct2;
        panic!("oh2 no, '{}' was unexpected", instance2);
        //todo!()
    }
}

fn main() {
    let instance = MyStruct;

    std::panic::always_abort();//issue: https://github.com/rust-lang/rust/issues/122940
    println!("{}", instance);
    //assert!(false, "oh no, '{}' was unexpected", instance);
}

With these rustc changes
(ignore the irrelevant parts in the patch which I've included so the output above makes sense)
The only relevant in this patch is the block of panic_count::MustAbort::AlwaysAbort => { which emulates a path (this is the wrong var(atomicbool) to track this, but is ok for the current example)

Index: /var/tmp/portage/dev-lang/rust-1.75.0-r1/work/rustc-1.75.0-src/library/std/src/panicking.rs
===================================================================
--- .orig/var/tmp/portage/dev-lang/rust-1.75.0-r1/work/rustc-1.75.0-src/library/std/src/panicking.rs
+++ /var/tmp/portage/dev-lang/rust-1.75.0-r1/work/rustc-1.75.0-src/library/std/src/panicking.rs
@@ -739,6 +739,7 @@ fn rust_panic_with_hook(
     force_no_backtrace: bool,
 ) -> ! {
     let must_abort = panic_count::increase(true);
+    rtprintpanic!("must_abort={:?}\n", must_abort);
 
     // Check if we need to abort immediately.
     if let Some(must_abort) = must_abort {
@@ -746,11 +747,27 @@ fn rust_panic_with_hook(
             panic_count::MustAbort::PanicInHook => {
                 // Don't try to print the message in this case
                 // - perhaps that is causing the recursive panics.
-                rtprintpanic!("thread panicked while processing panic. aborting.\n");
-            }
-            panic_count::MustAbort::AlwaysAbort => {
+                //rtprintpanic!("thread panicked while processing panic. aborting. msg='{:?}' loc='{:?}'\n",message, location);
+                //FIXME: I need thread local(that doesn't do allocations when accessing it) and
+                //which will let me progressively remove things from the output which might panic
+                //like the 'message' at first, then 'location', then the backtrace dump?
+                rtprintpanic!("thread panicked while processing panic. aborting. msg='not shown' loc='{:?}'\n", location);
+                //let panicinfo = PanicInfo::internal_constructor(
+                //    message,//this would infinite panic
+                //    location,
+                //    can_unwind,
+                //    force_no_backtrace,
+                //);
+                //rtprintpanic!("{panicinfo}\nhere's a stacktrace attempt:\n");
+                rtprintpanic!("{}\n", crate::backtrace::Backtrace::force_capture());
+             }
+             panic_count::MustAbort::AlwaysAbort => {
+                rtprintpanic!("in panic_count::MustAbort::AlwaysAbort\n");
                 // Unfortunately, this does not print a backtrace, because creating
                 // a `Backtrace` will allocate, which we must to avoid here.
+               static BEEN_HERE:AtomicBool=AtomicBool::new(false);
+               //FIXME: this is the wrong way; need a thread local var, but one that doesn't alloc (for the fork case)
+               if false==BEEN_HERE.swap(true, Ordering::SeqCst) {
                 let panicinfo = PanicInfo::internal_constructor(
                     message,
                     location,
@@ -758,13 +775,27 @@ fn rust_panic_with_hook(
                     force_no_backtrace,
                 );
                 rtprintpanic!("{panicinfo}\npanicked after panic::always_abort(), aborting.\n");
+                //XXX: we'd restore this even if thread local because this whole thing may end up in a
+                //landing pad and some future new panic(like if this is ran within libtest via
+                //'cargo test' with --test-threads 1, it could be catch_unwind-ed aka
+                //caught, then the next test that panics gets back here)
+                //might get itself here and have an unclean path state from this old panic we're doing now.
+                //even the abort below might get caught by atexit hook(iirc) and end up in landing
+                //pad (well, it would with my hacky patch that's about that, iirc)
+                BEEN_HERE.store(false, Ordering::SeqCst);
+               } else {//been here, prevent recursion
+                    rtprintpanic!("panicked after panic::always_abort(), aborting. (and recursion prevented) loc='{:?}'\n", location);
+               }
             }
         }
+        rtprintpanic!("about to abort_internal()\n");
         crate::sys::abort_internal();
     }
 
+    rtprintpanic!("about to mut info\n");
     let mut info =
         PanicInfo::internal_constructor(message, location, can_unwind, force_no_backtrace);
+    rtprintpanic!("about to hook read\n");
     let hook = HOOK.read().unwrap_or_else(PoisonError::into_inner);
     match *hook {
         // Some platforms (like wasm) know that printing to stderr won't ever actually
@@ -773,31 +804,48 @@ fn rust_panic_with_hook(
         // methods, this means we avoid formatting the string at all!
         // (The panic runtime might still call `payload.take_box()` though and trigger
         // formatting.)
-        Hook::Default if panic_output().is_none() => {}
+        Hook::Default if panic_output().is_none() => {
+            rtprintpanic!("1\n");
+        }
         Hook::Default => {
-            info.set_payload(payload.get());
+            rtprintpanic!("2\n");
+            let p=payload.get();
+            rtprintpanic!("22\n");
+            info.set_payload(p);
+            rtprintpanic!("222\n");
             default_hook(&info);
         }
         Hook::Custom(ref hook) => {
-            info.set_payload(payload.get());
+            rtprintpanic!("3\n");
+            let p=payload.get();
+            rtprintpanic!("33\n");
+            info.set_payload(p);
+            //info.set_payload(payload.get());
+            rtprintpanic!("333\n");
             hook(&info);
         }
     };
+    rtprintpanic!("4\n");
     drop(hook);
 
+    rtprintpanic!("5\n");
     // Indicate that we have finished executing the panic hook. After this point
     // it is fine if there is a panic while executing destructors, as long as it
     // it contained within a `catch_unwind`.
     panic_count::finished_panic_hook();
 
+    rtprintpanic!("6\n");
     if !can_unwind {
+        rtprintpanic!("7\n");
         // If a thread panics while running destructors or tries to unwind
         // through a nounwind function (e.g. extern "C") then we cannot continue
         // unwinding and have to abort immediately.
         rtprintpanic!("thread caused non-unwinding panic. aborting.\n");
+        rtprintpanic!("8ai\n");
         crate::sys::abort_internal();
     }
 
+    rtprintpanic!("9rp\n");
     rust_panic(payload)
 }
 

@jieyouxu jieyouxu added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-panic Area: Panicking machinery labels Mar 23, 2024
@correabuscar
Copy link

(Presumably thread locals cannot be done without allocating in rust, but can in C ?)

This certainly used to be true, but nowadays LOCAL_PANIC_COUNT is a const TLS variable with no destructor. So that comment may be outdated. But I am not sure whether const TLS variable with no destructor are alloc-free on all platforms. @m-ou-se or @joboet might know.

const was there since 23rd of April 2022,
then the comment

    // Accessing LOCAL_PANIC_COUNT in a child created by `libc::fork` would lead to a memory 
    // allocation.

was added at line 315 on October 6th 2022,
so that's 5 months later, but they might not have been aware that 'const' was there at line 300 when they made the comment ??

Would be nice if more people would please confirm whether or not that comment still applies even though it's const like thread_local! { static LOCAL_PANIC_COUNT: Cell<usize> = const { Cell::new(0) } } which suggests it doesn't ? or depends on platform (as per above find) ?

I'm not too sure about what the docs for thread_local! say; they appear to imply the comment is outdated(which is what RalfJung says and I too thought it to be so yesterday when I tried to ask about it on libera chat, but I guess it wasn't the right place to ask; and changed my mind later on, for some reason, and believed the comment still applies):

This macro supports a special const {} syntax that can be used when the initialization expression can be evaluated as a constant. This can enable a more efficient thread local implementation that can avoid lazy initialization. For types that do not need to be dropped, this can enable an even more efficient implementation that does not need to track any additional state.

use std::cell::Cell;
thread_local! {
    pub static FOO: Cell<u32> = const { Cell::new(1) };
}

assert_eq!(FOO.get(), 1);

(feel free to mark this comment as off-topic)

@Amanieu
Copy link
Member

Amanieu commented Mar 23, 2024

Rather than messing with thread locals, we could just add a second bit to global_count to abort immediate but without printing the panic payload. There is some potential for a race condition if 2 threads are panicking at the same time, but I think this is acceptable since always_abort is currently only used after fork which means there are no other threads in the new process.

@correabuscar
Copy link

correabuscar commented Mar 23, 2024

EDIT2: actually I think I missed something when I said what I said below, if by without printing the panic payload. you meant without printing the message (i must've gotten confused due to "payload" and "message" being two different args in that function; however there's no "payload" in the always abort branch, so you could've only meant "message" there, so what I said below is only valid because I wrongly assumed that you still meant to print the "message", but this doesn't make sense given what you've said, so my bad) so, basically we both agreed that we shouldn't print the 'message' which does solve 2 problems at once. The "EDIT" at the end still applies though.
EDIT3: oh wait actually I remember, I thought you meant to abort only if this is the second time we're in the always abort branch, as if, because the first try was to print the 'message' and this itself panic-ed again so we're back here, we're not gonna try and print it again. But this means, we still tried to print 'message' in first try which leaves the user could've allocated in the Display impl, problem on the table.

Rather than messing with thread locals, we could just add a second bit to global_count to abort immediate but without printing the panic payload. There is some potential for a race condition if 2 threads are panicking at the same time, but I think this is acceptable since always_abort is currently only used after fork which means there are no other threads in the new process.

If indeed it is so, as even the comments re-iterate:

// panic::always_abort() is usually called to prevent memory allocations done by
// the panic handling in the child created by libc::fork.
// Memory allocations performed in a child created with libc::fork are undefined
// behavior in most operating systems.

then I posit that message(line 755 below) should never be shown because just like the panic! calls in the fmt of impl Display, the user may allocate there, so to be sure allocation(s) don't happen in a fork ... message cannot be used, therefore recursion is averted, thus 2 problems solved.

for reference:

panic_count::MustAbort::AlwaysAbort => {
// Unfortunately, this does not print a backtrace, because creating
// a `Backtrace` will allocate, which we must to avoid here.
let panicinfo = PanicInfo::internal_constructor(
message,
location,
can_unwind,
force_no_backtrace,
);
rtprintpanic!("{panicinfo}\npanicked after panic::always_abort(), aborting.\n");

EDIT: that usually though, if others are gonna use it(like after it's stabilized?) for other more benign purposes they won't be seeing a (proper) message ...

@RalfJung
Copy link
Member Author

RalfJung commented Mar 23, 2024

An alternative would be to set the message to message.and_then(fmt::Arguments::as_str).map(|s| &fmt::Arguments::new_const(&[s])) (well, not quite, the lifetimes don't work out, but you get the idea). That way panic messages that are simple strings will still work but no custom formatting code will be invoked.

In fact I could probably have done that in #122930 as well.

@joboet
Copy link
Contributor

joboet commented Mar 23, 2024

Would be nice if more people would please confirm whether or not that comment still applies even though it's const like thread_local! { static LOCAL_PANIC_COUNT: Cell<usize> = const { Cell::new(0) } } which suggests it doesn't ? or depends on platform (as per above find) ?

I'm not too sure about what the docs for thread_local! say; they appear to imply the comment is outdated(which is what RalfJung says and I too thought it to be so yesterday when I tried to ask about it on libera chat, but I guess it wasn't the right place to ask; and changed my mind later on, for some reason, and believed the comment still applies)

The comment still applies if you replace "would" by "could". On platforms like x86_64 Linux, const-initialized, !Drop variables in thread_local! do not allocate. But on platforms like GNU/Windows, we always use keys, which need to allocate.

We could get around this by using pthread_getspecific/pthread_setspecific directly on the relevant platforms, as long as the value fits into a pointer, we don't need to allocate. However, I'm not convinced this is correct (even on platforms with native TLS). fork requires all functions called after it to be async-signal safe, which both pthread_key_* and #[thread_local] are not documented to be, so all TLS accesses after fork risk running into deadlocks if the platform has not considered this extremely niche use-case. In practice, this works on GNU/Linux, but as for other platforms, who knows...

An alternative would be to set the message to message.and_then(fmt::Arguments::as_str).map(|s| &fmt::Arguments::new_const(&[s])) (well, not quite, the lifetimes don't work out, but you get the idea). That way panic messages that are simple strings will still work but no custom formatting code will be invoked.

In fact I could probably have done that in #122930 as well.

I'd much prefer that solution.

@RalfJung
Copy link
Member Author

RalfJung commented Mar 23, 2024

But on platforms like GNU/Windows, we always use keys, which need to allocate.

Not always. 32bit GNU/Windows uses #[thead_local] right now, 64bit uses keys. (Or vice versa? Something like that.) But this keeps changing as bugs in the GNU toolchain are discovered and fixed.

(EDIT: Actually it's i686-pc-windows-msvc that uses keys.)

fork requires all functions called after it to be async-signal safe, which both pthread_key_* and #[thread_local] are not documented to be,

#[thead_local] is not a function so where would one even check if it is async-signal-safe...?

@joboet
Copy link
Contributor

joboet commented Mar 23, 2024

#[thead_local] is not a function so where would one even check if it is async-signal-safe...?

Hah, I found some documentation: C++ specifies it as not-signal-safe. As we rely on the same platform support as C++, I guess that makes it not-signal-safe for us as well.

@RalfJung
Copy link
Member Author

That way panic messages that are simple strings will still work but no custom formatting code will be invoked.

That still seems like an unfortunate restriction -- most our formatting code does not need to allocate, so it seems reasonable to say that post-fork code must ensure to not use a panic message that needs formatting. And this way we can at least produce nice panic messages.

@workingjubilee
Copy link
Contributor

workingjubilee commented Mar 24, 2024

@joboet That is because it assumes only access to the "lowest-common-denominator" parts of the pthread model... something we rarely actually are forced to use. But as you noted, rarely is not never (e.g. GNU Windows).

I don't know of a more exhaustive or up-to-date treatment of how TLS variables are handled than https://www.akkadia.org/drepper/tls.pdf unfortunately...

@joboet
Copy link
Contributor

joboet commented Mar 24, 2024

Oh actually, we already rely on thread-locals being async-signal-safe in practice cause we store the stack overflow guard page location in TLS and access it in the SIGSEGV handler. In that case, the no-allocation solution would work.

@joboet
Copy link
Contributor

joboet commented Mar 24, 2024

I don't know of a more exhaustive or up-to-date treatment of how TLS variables are handled than https://www.akkadia.org/drepper/tls.pdf unfortunately...

The problem is that in the general/local dynamic model, we call __tlv_get_addr, which can allocate. I guess what saves us in the stack overflow case is that we access the variable upon thread creation, which means that the storage is already allocated.

@RalfJung
Copy link
Member Author

An alternative would be to set the message to message.and_then(fmt::Arguments::as_str).map(|s| &fmt::Arguments::new_const(&[s])) (well, not quite, the lifetimes don't work out, but you get the idea). That way panic messages that are simple strings will still work but no custom formatting code will be invoked.

I brought this up in a recent adjacent PR and the feedback was that we should first investigate sacrificing another bit of the global count to track "is there a panic in progress in any thread".

@correabuscar
Copy link

correabuscar commented Mar 24, 2024

An alternative would be to set the message to message.and_then(fmt::Arguments::as_str).map(|s| &fmt::Arguments::new_const(&[s])) (well, not quite, the lifetimes don't work out, but you get the idea). That way panic messages that are simple strings will still work but no custom formatting code will be invoked.

I brought this up in a recent adjacent PR and the feedback was that we should first investigate sacrificing another bit of the global count to track "is there a panic in progress in any thread".

This means, we'll still try to print the 'message'(ie. formatted) on the first try, right?(hover mouse for tooltip(s)) and if that panics, the bit would've been set and we just abort instead of trying to print 'message' again.

If so, then this still leaves the problem of user could've allocated1 in its Display impl, which is bad for the 'fork' case. Would we(well, hopefully you) make a different issue (on github) to solve this one? it kinda makes sense this way, to have them be as two different issues, but this latter one may undo the fix that will be done here for the current issue.

I like your alternative(show unformatted 'message' to avoid both a possible panic and any user-made allocs, in their Display impl), but I imagine after std::panic::always_abort(); gets stabilized and it gets used for other non-'fork' purposes, then having an unformatted 'message' for those panics might not be appreciated.

(click me to expand)

might be just "better" to detect if it's a fork? then we would know that abort must be used without a formatted message(to avoid panic recursion and user-made allocs), but if it's not a fork then format the 'message' even if it's an always_abort() (which could still recurse if not guarded by a been-here-already flag)

I wonder if somehow saving the process id within the current process before calling main(), and then later in rust_panic_with_hook we would test if it's different than the current process id, would be a better(?) way2 to know whether or not we're within the fork (or the parent) process, and thus take appropriate abort action, without the need to use std::panic::always_abort();.

Or some other way to know it's a fork:

The child process is an exact duplicate of the parent process except for the following points:

  • The child has its own unique process ID, and this PID does not match the ID of any existing process group (setpgid(2)) or session.

(this ^ is the variant that I mentioned above)

  • The child’s parent process ID is the same as the parent’s process ID.
  • The child does not inherit its parent’s memory locks (mlock(2), mlockall(2)).

Footnotes

  1. here I mean user allocated anything for any reason, like made a vector, resized a vector, any heap allocation for whatever reason user wanted - we can't anticipate all the reasons for which a user wants to alloc there in Display impl, but can't assume user won't(although we currently do), however it might be true that users may not alloc in Display impls in most cases, which i dno if true or false, i've not enough experience, but they might just indirectly alloc by using other rust things that alloc internally

  2. EDIT100 no it's not reliable enough, think: main->fork1->fork2, fork2 may already have main's pid due to pid recycling and assuming main already exited, and even if I also used a pending signal in main that's no longer detected in fork(s) that's still not good enough because user can delete all pending signals if wanted to. But there's man pthread_atfork which can install 3 hooks that get executed on any fork but not man vfork, thus can just set a "'bool'" in the child hook to say it's a fork.

@RalfJung
Copy link
Member Author

If so, then this still leaves the problem of user could've allocated in its Display impl, which is bad for the 'fork' case.

When you eprintln! after a fork, you must make sure that that does not allocate. I think it is reasonable to require the same for panicking. All code after fork must be under your control anyway to ensure correctness.

@workingjubilee
Copy link
Contributor

...huh?

if you fork, until you execve, while in a multithreaded program, you cannot call anything that is not async-signal-safe. my understanding is that our eprintln! impl is not async signal safe, as it is locking:

#[must_use]
#[stable(feature = "rust1", since = "1.0.0")]
#[cfg_attr(not(test), rustc_diagnostic_item = "io_stderr")]
pub fn stderr() -> Stderr {
// Note that unlike `stdout()` we don't use `at_exit` here to register a
// destructor. Stderr is not buffered, so there's no need to run a
// destructor for flushing the buffer
static INSTANCE: ReentrantLock<RefCell<StderrRaw>> =
ReentrantLock::new(RefCell::new(stderr_raw()));
Stderr { inner: &INSTANCE }
}

my understanding is that making the lock reentrant with respect to its own thread doesn't actually make it reentrant with respect to the sense that "async signal safe" means.

@RalfJung
Copy link
Member Author

RalfJung commented Mar 24, 2024

Ah, I didn't think stderr would be buffered locked.

Well then you can still write! to a file descriptor after a fork. If you are sure that formatting won't allocate.

@workingjubilee
Copy link
Contributor

indeed, that is "async signal "safe"". :^)

/// Unbuffered, unsynchronized writer to stderr.
///
/// Only acceptable because everything will end soon anyways.
struct RawStderr(());
impl fmt::Write for RawStderr {
fn write_str(&mut self, s: &str) -> Result<(), fmt::Error> {
let ret = unsafe { libc::write(libc::STDERR_FILENO, s.as_ptr().cast(), s.len()) };
if ret == -1 { Err(fmt::Error) } else { Ok(()) }
}
}
/// We don't really care how many bytes we actually get out. SIGSEGV comes for our head.
/// Splash stderr with letters of our own blood to warn our friends about the monster.
macro raw_errln($tokens:tt) {
let _ = ::core::fmt::Write::write_fmt(&mut RawStderr(()), format_args!($tokens));
let _ = ::core::fmt::Write::write_char(&mut RawStderr(()), '\n');
}
/// Signal handler installed for SIGSEGV
extern "C" fn print_stack_trace(_: libc::c_int) {

@RalfJung
Copy link
Member Author

FWIW I was wrong about which Windows target does not have #[thread_local] -- it's 32bit MSVC, not GNU. This is to work around #95429.

@jieyouxu jieyouxu added the C-bug Category: This is a bug. label Apr 15, 2024
@jieyouxu jieyouxu added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Apr 15, 2024
@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-panic Area: Panicking machinery C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants