Great stack overflow error messages #51405

gilescope · 2018-06-07T00:02:58Z

As a coder
Given I have a fn main() { main() }
Then I expect the following output:

Exception in thread "main" java.lang.StackOverflowError
	at X.main(X.java:3)
	at X.main(X.java:3)
	at X.main(X.java:3)
	at X.main(X.java:3)

Obviously this is an example of how Java handles stack overflows, but you get the idea.

There didn't seem to be a tracking issue for this, so here is one.
(Some discussion here: https://users.rust-lang.org/t/how-to-diagnose-a-stack-overflow-issues-cause/17320/9 )

I've had a little look and I naively think we need to do something like this in sys_common/util.rs

    #[allow(dead_code)] // stack overflow detection not enabled on all platforms    
    pub unsafe fn report_overflow() {
        dumb_print(format_args!("\nthread '{}' has overflowed its stack\n",
                            thread::current().name().unwrap_or("<unknown>")));

        #[cfg(feature = "backtrace")]
        {
            let log_backtrace = backtrace::log_enabled();

            use sync::atomic::{AtomicBool, Ordering};

            static FIRST_PANIC: AtomicBool = AtomicBool::new(true);

            if let Some(format) = log_backtrace {
                if let Ok(mut stderr) = Stderr::new() {
                    let _ = backtrace::print(&mut stderr, format);
                }
            } else if FIRST_PANIC.compare_and_swap(true, false, Ordering::SeqCst) {
                dumb_print(format_args!("note: Run with `RUST_BACKTRACE=1` for a backtrace."));
            }
        }
    }

Quite possibly one can ditch the first panic checks. I'm sure there's lots of concerns here, e.g. have we got enough stack headroom to report without going pop.

The text was updated successfully, but these errors were encountered:

Techcable · 2018-06-07T04:55:24Z

I often wish that this was the case, given that most managed languages have this feature and preventing segfaults is a major goal of rust.

Unfortunately, this could break panic safety and introduce undefined behavior.
It is very common for unsafe code to assume that certain functions can never panic based on both documented or undocumented guarantees. For example ptr::copy_nonovoerlapping ptr.offset and ptr::write are often assumed to never panic.

vec.reserve(1);
unsafe {
    let i = vec.len();
    let p = vec.as_mut_ptr();
    // Right now `vec[i]` is uninitialized and should never be observed by user code
    vec.set_len(vec.len() + 1);
    /*
     * Keeping `vec[i]` uninitlaized is safe as long as untrusted user code
     * can never read the uninitialized memory because it's never called.
     * Since we never directly invoke untrusted code,
     * our only risk is from panicking causing user destructors to run.  
     * Because `ptr::copy_nonoverlapping` and `ptr.offset` never panic we
     * should be safe and users will never see our uninitialized memory.
     */
    ptr::copy_nonoverlapping(existingPointer, p.offset(i), 1));
}

If stack overflows could trigger a panic then both ptr::copy_nonoverlapping and ptr.offset are turned into potentially panicking functions. The stack could unwind and user code could observe the uninitialized memory when running their destructors.
This could trigger a large amount of potential unsafety across the stdlib and entire ecosystem. Furthermore, it makes it nigh on impossible to write correct unsafe code in the future because crucial functions like ptr::copy_nonoverlapping and ptr::write can now panic.

First of all, we could unwind the stack without calling the destructors, which should avoid most cases of untrusted code seeing unsafe states.
The downside of this approach is that it leaks memory and may be unsound.

The other alternative is a system to mark functions as #[nopanic].
Then we could statically know which functions are 'allowed' to panic without impacting memory safety. If a #[nopanic] function is on the stack during stack overflow, it'd be forbidden to panic and we would trigger an abort to avoid undefined behavior. Otherwise, the function is allowed to panic and the runtime would prefer that when dealing with a stack overflow.
This would force unsafe code to statically mark all their assumptions about which functions can and can not panic in order to retain memory safety. However, this would still be a major barrier to adoption and would probably require an RFC before it could be relied on.

gilescope · 2018-06-07T08:03:13Z

Thank you for the pointers. I'm working on the assumption that any thread that stack overflows is going to trigger an abort in the process. I.e. we're not planning on having processes that can survive a stack overflow in one thread. Please disabuse me of this notion if not correct!

If an abort is the only outcome of a stack overflow, then leaking memory is probably ok as the OS is just about to tear down the process.

I suspect full 'unwinding' is unnecessary - can we walk the stack's return addresses given we're really just after a good stack trace to show how the overflow occurred. For the first cut a list of addresses would show how many frames were in the loop which is more info than none.
(Let - me rename the issue to be more focused on the outcome rather than the implementation)

jonas-schievink · 2018-06-07T09:35:30Z

So what you want is just a backtrace, not actual unwinding. That would probably be useful for debugging, yeah.

gilescope · 2018-06-07T10:36:59Z

Ok so instead of using libunwind if we use glibc's backtrace in this case we should be able to have both safety and debug info on platforms where glibc backtrace is supported. I get that we might not get such a good stacktrace in release mode versus debug mode due to optimisations, but anything's better than nothing.

I couldn't see unwinding in glibc's backtrace source:
https://github.com/lattera/glibc/blob/master/debug/backtrace.c

sfackler · 2018-06-07T16:05:44Z

Our current backtrace implementation doesn't unwind either.

The main complexity in doing this is making sure we have enough stack space to create the backtrace even though we've already run out of stack space.

jonas-schievink · 2018-06-07T17:19:49Z

On Unix, this is running in signal context, which uses the alternative signal stack, which is usually a few KB in size (smallest size on supported Linux platforms seems to be 8K). It's possible to reconfigure this size using sigaltstack.

EDIT:

On Windows, Rust uses Vectored Exception Handling to register a handler that detects and reports a stack overflow. The stack size guarantee can be set with SetThreadStackGuarantee, and this is in fact already done here (it's set to 20 KB to make space for printing and formatting).

Now whether it's even possible to use libbacktrace/libunwind or StackWalk64 from the handler while the stack is in this overflowed state yet remains to be seen, but capturing a stacktrace on segfaults seems possible at least.

sfackler · 2018-06-07T18:46:29Z

If we'd be running the backtrace in the signal handler, we'd need to switch back to libbacktrace's mmap allocator rather than malloc/free which I seem to remember having some pretty severe perf issues. Not sure if the allocator can be configured at runtime.

jonas-schievink · 2018-06-07T18:48:39Z

Relevant: #51408 (removes libbacktrace)

gilescope · 2018-06-08T08:33:50Z

Thanks @jonas-schievink, that's great instead of relying on backtrace_symbols or backtrace_symbols_fd. We'd still need to walk the stack frames for a pure rust solution - maybe I'm missing something but I couldn't see anything currently built / being built that tries to walk the stack in pure rust.

I tested the above code on OSX and it worked a treat. But it depends on libunwind and looks to me like libunwind does indeed unwind when creating the backtrace so as @Techcable mentioned it won't be panic safe. I'm going to try and see if I can get glibc's backtrace working (no luck yet).

(Just thinking outside the box, one could spawn a separate thread to output the backtrace and pass it a reference to the overflowed thread's stack and join on the thread. It sounds like the major OSes give you enough stack during a stack overflow that we don't need to go there.)

sfackler · 2018-06-09T03:48:29Z

But it depends on libunwind and looks to me like libunwind does indeed unwind when creating the backtrace so as @Techcable mentioned it won't be panic safe.

Using libunwind is not the same thing as unwinding the stack. Taking a backtrace with libbacktrace or glibc backtrace or backtrace_symbols are all the same in that they do not change the state of the stack.

retep998 · 2018-06-09T11:28:58Z

It's more accurate to say that libunwind is merely being used to walk the stack to build up a stacktrace (hence why the windows equivalent is called StackWalk)

Techcable · 2018-06-10T19:36:56Z

The java virtual machine (and others) typically reserve a separate 'yellow zone' and 'red zone' in the stack space which is used for the JVM as reserved space for handling stack overflows [1].
When normal code runs out of room it will try and read/write in the yellow zone which will trigger a SEGFAULT that the JVM will catch.
In order to give the user a nice pretty StackOverflowError the JVM will remove the protection of the yellow zone to give enough stack space to handle the error and unwind the stack. In case that's not enough room, the functions will run into the red zone and the JVM will issue an internal error and give a crash dump.

While this approach is often taken by fast managed languages like V8, C#, and the JVM, reserving an extra zone of stack space is not necessarily an appropriate choice for Rust. While it has exactly zero overhead in the common case, this approach requires a few KB of reserved stack space for each thread and extra system calls on thread creation. This is kind of uncharted territory since rust is both a hardcore system programming language like C/C++ but also has much of the safety and convenience of a managed language.
If we can't figure out how to efficiently print backtraces in a signal handling context, I believe a similar approach may be worthwhile (at the minimum for debug builds) in order to provide nice pretty error messages to make it easier to track down where the heck we had a stack overflow.

Techcable · 2018-06-10T20:20:03Z

@sfackler libbacktrace's allocator can't be configured at runtime.However, the mmap allocator is enabled by default when mmap is present so it should already be enabled on all the major unix systems. The presense of the mmap allocator and signal safety can be tested at compile time by the C define flag BACKTRACE_USES_MALLOC which is 0 on Debian (signal safe). Since we already use the system libbacktrace library, we're already stuck with their defaults so this shouldn't make backtrace performance any worse 👿

Also, we may not even need to use libbacktrace at all, since libunwind provides a signal-safe unw_get_proc_name and unw_get_proc_info although we'd need to do some trickery to get line number info.

sfackler · 2018-06-10T23:13:09Z

Since we already use the system libbacktrace library

We do not use the system libbacktrace. We use our fork: https://github.com/rust-lang-nursery/libbacktrace/tree/f4d02bbdbf8a2c5a31f0801dfef597a86caad9e3, which in particular has the mmap allocator disabled: 2f3c412

da-x · 2019-07-19T15:00:10Z

If it's anyone's interest, I've opened rust-lang/compiler-builtins#304 . Essentially if we add CFI to the probestack part of compiler-builtins, it would enable backtrace mechanisms that are dependent on debuginfo to work, and make stack overflows a joy in Rust.

gilescope · 2019-07-30T10:17:47Z

@da-x totally still want joyful stack overflows. I got as far as getting a stack trace on OSX, but got bogged down trying to find a cross-OS way for https://github.com/gilescope/findshlibs to deal with sections and segments as they're a bit different in OSX and Linux.

I don't mind how we do it, it would be great to have a good stacktrace.

jyn514 · 2019-12-30T04:51:39Z

@gilescope would using a library like object or faerie help with cross-platform compatibility? I know object also supports Windows.

philipc · 2019-12-30T07:15:07Z

@jyn514 findshlibs needs to deal with the image after it is already loaded in memory, so it's a bit different from the problem that object solves (and faerie is only for writing, not reading), and findshlibs is already cross-platform too.

However, I'm not sure how findshlibs is relevant to this issue. rust can already get stack traces without needing to use findshlibs. As I understand it, the problem is:

__rust_probestack needed CFI annotations (and this is now fixed and works for generating the backtrace in gdb)
we need to make sure backtrace-rs is safe to use from the SIGSEGV handler
the SIGSEGV handler needs to call backtrace-rs to generate and print the stack trace

findshlibs is only needed if you want to rewrite backtrace-rs in rust. Now maybe that is the only way to make backtrace-rs safe to use in signal handlers, but I'm doubtful of that.

retep998 · 2019-12-30T07:47:51Z

On Windows this is reasonably simple. There are few special restrictions on what you can do inside an exception handler, so it would be trivial to get a backtrace and print it out. The exception record provides a CONTEXT we can pass to StackWalkEx to get the stack trace at the point of the overflow. We'd also have to call SetThreadStackGuarantee to make sure there is enough space on the stack for that backtrace generating code (and if we stack overflow again Windows just ends us so it's fine). To avoid panic unsafety issues, just wrap the code in a catch_unwind and print out a dumber message if the fancy backtrace failed.

sjep · 2020-04-09T14:32:15Z

Is there any update on this? I just ran into this issue in a graphics application I'm working on (MacOS) and I'm still not sure how to get a debug print out.

rustc --version: rustc 1.41.0 (5e1a799 2020-01-27)

gilescope · 2020-04-09T14:36:03Z

Not that I’m aware of. I backed off this a year ago ostensibly because I had bit off a bit more than I could chew...

_{Sent with GitHawk}

philipc · 2020-04-09T21:55:26Z

@sjep Run it in a debugger and use the debugger's backtrace facility instead.

sjep · 2020-04-10T14:52:54Z

As I said, this is a graphics application on MacOS, I can't easily virtualize (into docker for instance) and it's particularly difficult to set up gdb natively on Mac. I do agree that this seems to be the best workaround. I mainly upvote @gilescope in the need for a proper debug stack peek.

sfackler · 2020-04-10T14:54:59Z

lldb is available on MacOS.

sjep · 2020-04-10T15:06:53Z

Thanks for the tip - this is certainly the best workaround 👍

matklad · 2020-11-04T13:59:04Z

I've published a yolo workaround for those who don't want to fire gdb: https://docs.rs/backtrace-on-stack-overflow/0.1.0/backtrace_on_stack_overflow/fn.enable.html

Rust produces bad error messages on stack overflow, like "thread 'foo' has overflowed its stack" which provides very little insight into where the recursion that caused the stack to overflow occurred. See rust-lang/rust#51405 for details. This commit adds a SIGSEGV handler that attempts to print a backtrace, following the approach in the backtrace-on-stack-overflow crate. I copied the code from that crate into Materialize and tweaked it because it's a very small amount of code that we'll likely need to modify, and I wanted to improve its error handling. In my manual testing this produces a nice backtrace when Materialize overflows its stack.

hamirmahal · 2023-11-11T22:16:55Z

I just wanted to bump this feature because I'm also interested in it.

gilescope changed the title ~~Unwindable stack overflows~~ Great stack overflow error messages Jun 7, 2018

frewsxcv added the A-diagnostics Area: Messages for errors, warnings, and lints label Jun 8, 2018

gilescope mentioned this issue Jun 12, 2018

[WIP Do not merge] Helpful stack overflow messages. #51514

Closed

estebank added A-debuginfo Area: Debugging information in compiled programs (DWARF, PDB, etc.) and removed A-diagnostics Area: Messages for errors, warnings, and lints labels Apr 29, 2019

da-x mentioned this issue Jul 19, 2019

__rust_probestack needs CFI annotations to assist in backtracing rust-lang/compiler-builtins#304

Closed

benesch mentioned this issue Feb 1, 2021

materialized: attempt to produce nice backtraces on stack overflow MaterializeInc/materialize#5549

Merged

kosta mentioned this issue Jan 27, 2022

panic handling middleware tokio-rs/axum#728

Closed

Noratrieb added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Apr 5, 2023

CertainLach mentioned this issue Jun 14, 2023

Tailstrict causes stack overflow CertainLach/jrsonnet#115

Closed

theRookieCoder mentioned this issue Jun 1, 2024

Adding mods from GitHub releases reports "thread 'main' has overflowed its stack" gorilla-devs/ferium#377

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Great stack overflow error messages #51405

Great stack overflow error messages #51405

gilescope commented Jun 7, 2018 •

edited by nikomatsakis

Loading

Techcable commented Jun 7, 2018 •

edited

Loading

gilescope commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018

gilescope commented Jun 7, 2018

sfackler commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018 •

edited

Loading

sfackler commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018

gilescope commented Jun 8, 2018

sfackler commented Jun 9, 2018

retep998 commented Jun 9, 2018

Techcable commented Jun 10, 2018 •

edited

Loading

Techcable commented Jun 10, 2018

sfackler commented Jun 10, 2018

da-x commented Jul 19, 2019

gilescope commented Jul 30, 2019

jyn514 commented Dec 30, 2019

philipc commented Dec 30, 2019

retep998 commented Dec 30, 2019 •

edited

Loading

sjep commented Apr 9, 2020

gilescope commented Apr 9, 2020

philipc commented Apr 9, 2020

sjep commented Apr 10, 2020

sfackler commented Apr 10, 2020

sjep commented Apr 10, 2020

matklad commented Nov 4, 2020

hamirmahal commented Nov 11, 2023

Great stack overflow error messages #51405

Great stack overflow error messages #51405

Comments

gilescope commented Jun 7, 2018 • edited by nikomatsakis Loading

Techcable commented Jun 7, 2018 • edited Loading

gilescope commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018

gilescope commented Jun 7, 2018

sfackler commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018 • edited Loading

sfackler commented Jun 7, 2018

jonas-schievink commented Jun 7, 2018

gilescope commented Jun 8, 2018

sfackler commented Jun 9, 2018

retep998 commented Jun 9, 2018

Techcable commented Jun 10, 2018 • edited Loading

Techcable commented Jun 10, 2018

sfackler commented Jun 10, 2018

da-x commented Jul 19, 2019

gilescope commented Jul 30, 2019

jyn514 commented Dec 30, 2019

philipc commented Dec 30, 2019

retep998 commented Dec 30, 2019 • edited Loading

sjep commented Apr 9, 2020

gilescope commented Apr 9, 2020

philipc commented Apr 9, 2020

sjep commented Apr 10, 2020

sfackler commented Apr 10, 2020

sjep commented Apr 10, 2020

matklad commented Nov 4, 2020

hamirmahal commented Nov 11, 2023

gilescope commented Jun 7, 2018 •

edited by nikomatsakis

Loading

Techcable commented Jun 7, 2018 •

edited

Loading

jonas-schievink commented Jun 7, 2018 •

edited

Loading

Techcable commented Jun 10, 2018 •

edited

Loading

retep998 commented Dec 30, 2019 •

edited

Loading