WaitForSingleObjectEx returns WAIT_ABANDONED when collecting backtraces #399

Closed · Tracked by #2 · Fixed by #518
Jasper-Bekkers opened this issue Dec 22, 2020 · 13 comments

Jasper-Bekkers commented Dec 22, 2020

OS Version: windows None (None)
Report Version: 104


Application Specific Information:
assertion failed: `(left == right)`
  left: `128`,
 right: `0`

Thread 0 Crashed:
0   std                             0x7ff7e6d46b7c      std::panicking::begin_panic_fmt (panicking.rs:437)
1   backtrace                       0x7ff7e695cbdc      backtrace::dbghelp::init (dbghelp.rs:315)
2   backtrace                       0x7ff7e695b8f4      backtrace::symbolize::dbghelp::resolve (dbghelp.rs:76)
3   backtrace                       0x7ff7e695995a      [inlined] backtrace::symbolize::resolve_frame_unsynchronized (mod.rs:178)
4   backtrace                       0x7ff7e695995a      backtrace::symbolize::resolve_frame<T> (mod.rs:105)
5   backtrace                       0x7ff7e6958ee8      backtrace::capture::Backtrace::resolve (capture.rs:232)
6   backtrace                       0x7ff7e6958db5      backtrace::capture::Backtrace::new (capture.rs:141)
7   gpu_allocator                   0x7ff7e6875c5f      gpu_allocator::VulkanAllocator::allocate (lib.rs:688)
8   breda                           0x7ff7e67c7a01      breda::renderer::vulkan::allocator::gpu_allocator_wrapper::{{impl}}::alloc (gpu_allocator_wrapper.rs:100)

...

The failing assertion is:

debug_assert_eq!(r, 0);

(The 128 in the panic message is 0x80, i.e. WAIT_ABANDONED.)

First seen here: Traverse-Research/gpu-allocator#13

We've encountered it twice but haven't been able to trap it in the debugger; I'll update this issue if we have more information.

MarijnS95 commented Dec 22, 2020

This must be the Drop that is not synchronized. ReleaseMutex(self.lock) releases the lock without ever clearing the atomic in static LOCK. Subsequent calls to init() (after a prior allocated Init is dropped) will pick up the released lock and return WAIT_ABANDONED when WaitForSingleObjectEx is called on it.

EDIT: Hold up, that observation is wrong. ReleaseMutex releases the lock without abandoning it (that would be CloseHandle), so it's fine to keep it alive in LOCK.

I cannot find the rules surrounding releasing/closing handles in shared namespaces. I would assume CreateMutexA simply increments the refcount of the shared mutex and retrieves a "view" on it, and CloseHandle removes the view and decrements the refcount without cleaning up the shared resource (except if the refcount is 0).
Perhaps that is not the case, and running CloseHandle on the lock actually destroys it in the shared namespace as well, meaning that the mutex retrieved from LOCK is now referencing an abandoned mutex in the shared namespace.

Jasper-Bekkers (Author) commented Dec 22, 2020

I think the call to CloseHandle is incorrect, since we're creating a session-shared mutex and CreateMutexA returns the existing shared mutex if it already exists:

From https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createmutexa

If the mutex is a named mutex and the object existed before this function call, the return value is a handle to the existing object, and the GetLastError function returns ERROR_ALREADY_EXISTS.

Edit: Actually it looks like that behavior would return WAIT_FAILED instead.
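
For illustration, a minimal sketch (using the winapi crate; the function name is made up, this is not code from backtrace-rs) of detecting that CreateMutexA opened a pre-existing named mutex, per the docs quoted above:

use winapi::shared::winerror::ERROR_ALREADY_EXISTS;
use winapi::um::errhandlingapi::GetLastError;
use winapi::um::synchapi::CreateMutexA;

unsafe fn open_or_create_mutex() {
    let h = CreateMutexA(
        std::ptr::null_mut(),
        0, // bInitialOwner = FALSE
        "Local\\RustBacktraceMutex\0".as_ptr() as _,
    );
    // Per the documentation: if the named mutex already existed, the call
    // returns a handle to the existing object and GetLastError reports
    // ERROR_ALREADY_EXISTS.
    if GetLastError() == ERROR_ALREADY_EXISTS {
        eprintln!("opened existing mutex: {:?}", h);
    }
}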

MarijnS95 commented Dec 24, 2020

I get a unique handle value for every call to CreateMutexA with the same shared mutex name. Closing that handle does not seem to affect the shared mutex in any sort of way.

The only way to repro WAIT_ABANDONED from WaitForSingleObjectEx is by locking the mutex from a thread that terminates without first unlocking it via ReleaseMutex (note that CloseHandle will not unlock it).

use std::ptr;

use winapi::shared::minwindef::FALSE;
use winapi::um::handleapi::CloseHandle;
use winapi::um::synchapi::{CreateMutexA, ReleaseMutex, WaitForSingleObjectEx};
use winapi::um::winbase::INFINITE;

#[test]
fn wait_abandoned() {
    unsafe {
        // Open (or create) the named mutex; every call returns a distinct
        // handle to the same underlying kernel object.
        let l1 = CreateMutexA(
            ptr::null_mut(),
            FALSE,
            "Local\\RustBacktraceMutex\0".as_ptr() as _,
        );
        dbg!(l1);

        let t = std::thread::spawn(|| {
            let l2 = CreateMutexA(
                ptr::null_mut(),
                FALSE,
                "Local\\RustBacktraceMutex\0".as_ptr() as _,
            );
            dbg!(l2);

            // Acquire the mutex, then let the thread exit while still
            // holding it, abandoning the mutex.
            dbg!(WaitForSingleObjectEx(l2, INFINITE, FALSE));
            // CloseHandle does not unlock the mutex
            // dbg!(CloseHandle(l2));
        });

        // Give the spawned thread time to acquire the mutex and exit.
        std::thread::sleep(std::time::Duration::from_millis(200));

        // Prints WAIT_ABANDONED (0x80): the owning thread terminated
        // without calling ReleaseMutex.
        dbg!(WaitForSingleObjectEx(l1, INFINITE, FALSE));
        dbg!(ReleaseMutex(l1));
        dbg!(CloseHandle(l1));
        t.join().unwrap();
    }
}

That'd mean the Drop handler is not called for a panicked and terminated thread, while another thread was waiting in init() for the lock. There seems to be no sign of another thread panicking during backtrace handling (while holding the lock) though.

https://devblogs.microsoft.com/oldnewthing/20050912-14/?p=34253 According to common sense and this article, treating WAIT_ABANDONED as a successful lock is potentially dangerous when protecting data structures.

alexcrichton (Member) commented

Thanks for the report! It looks like it should be fine to handle WAIT_ABANDONED here since it gives us the lock anyway; it's just telling us that something else was zapped. Can you confirm, though, that your application has a thread that's disappearing? I'm curious how the thread is exiting without releasing the lock.
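
For what it's worth, a minimal sketch (winapi crate; hypothetical helper, not the actual backtrace-rs code) of what treating WAIT_ABANDONED as an acquired lock could look like:

use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::synchapi::WaitForSingleObjectEx;
use winapi::um::winbase::{INFINITE, WAIT_ABANDONED, WAIT_OBJECT_0};

// Hypothetical helper: both WAIT_OBJECT_0 and WAIT_ABANDONED grant
// ownership of the mutex; WAIT_ABANDONED additionally signals that a
// previous owner terminated while holding it.
unsafe fn acquire(lock: HANDLE) -> bool {
    match WaitForSingleObjectEx(lock, INFINITE, FALSE) {
        WAIT_OBJECT_0 | WAIT_ABANDONED => true,
        _ => false,
    }
}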

MarijnS95 commented

@alexcrichton I'm hesitant to accept WAIT_ABANDONED as "success" for the aforementioned reasons: who knows what state dbghelp is in, and how much that might break.

We definitely have to add some tracing, as the assertion above comes from a code path that's not called when our application is in total despair: it's rather unlikely for threads to disappear there. It is, however, possible (quite likely, even) that the application panicked on another thread, but shouldn't that exit gracefully by dropping Init?

alexcrichton (Member) commented

Yeah, it's true that dbghelp may not be in a great spot, but it depends on where the thread exited. It may have exited in Rust code without being inside dbghelp, for example. It's true, though, that panics should run destructors and unlock the mutex here.

I think it might be good to try to minimize this to figure out what's causing it. There may be interactions with other languages or something like that, or maybe some system library is zapping a thread? Hard to say unfortunately :(

repi commented Aug 14, 2021

Ran into exactly this as well today in our app; it happened on our main thread, and I don't think any other threads were being destroyed.

@MarijnS95 did y'all track this any further?

Jasper-Bekkers (Author) commented

@repi Nope, we just started doing much less frequent callstack tracing instead 🙈

MarijnS95 commented

@repi Unfortunately we have not been able to repro this anymore, bar the threading example above. Maybe there are short-lived, panicking threads spawned after all? That seems to be the only way to run into WAIT_ABANDONED.

I'd suggest adding logging/tracing code around here to keep track of which threads lock and (potentially forget to) unlock this mutex, in the hope of finding the cause; something like the sketch below.
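
A hedged sketch of such tracing (winapi crate; the wrapper names are made up), logging the thread ID on every lock/unlock so an abandoning thread would show up as a lock without a matching release:

use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::processthreadsapi::GetCurrentThreadId;
use winapi::um::synchapi::{ReleaseMutex, WaitForSingleObjectEx};
use winapi::um::winbase::INFINITE;

// Hypothetical wrappers around the existing lock/unlock calls.
unsafe fn traced_lock(lock: HANDLE) {
    let r = WaitForSingleObjectEx(lock, INFINITE, FALSE);
    eprintln!("thread {} locked dbghelp mutex (r = {:#x})", GetCurrentThreadId(), r);
}

unsafe fn traced_unlock(lock: HANDLE) {
    let r = ReleaseMutex(lock);
    eprintln!("thread {} released dbghelp mutex (ok = {})", GetCurrentThreadId(), r != 0);
}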

Jasper-Bekkers (Author) commented

@MarijnS95 I had just disabled our intensive callstack recording to work around it since it was mostly a debug-only feature anyway. It's still there.

MarijnS95 commented

I count that as not repro'ing it anymore!

For us it was probably a callstack being recorded right as a panic was being handled, which did not end up releasing the mutex before closing it (and/or terminating the entire thread).


p--b commented Mar 31, 2023

We think we've just encountered this too. I'm curious about the use of a session-wide mutex with a fixed name: the aim described in the comment is to synchronise all accesses to dbghelp.dll across the process, but my understanding of a named kernel object in the session namespace is that it actually synchronises access across the entire session, i.e. every process in the session shares the same kernel mutex object. That doesn't seem to be the original intent of the code, and could perhaps be related to these symptoms: if a thread in another (entirely unrelated) process takes this lock and then aborts / is terminated, I think that could lead to what we're seeing here...

I think one way around this would be to include the PID in the name of the created mutex: the PID is stable across the entire execution of the process and would disambiguate the created mutex between different processes, as in the sketch below. I was going to try this to test my theory, but I believe testing it would require rebuilding the standard library...
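
A minimal sketch of that naming scheme (winapi crate; purely illustrative, not an actual patch):

use std::ffi::CString;
use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::processthreadsapi::GetCurrentProcessId;
use winapi::um::synchapi::CreateMutexA;

// Hypothetical: appending the PID gives every process its own kernel
// mutex instead of one object shared by the whole session.
unsafe fn create_process_local_mutex() -> HANDLE {
    let name = CString::new(format!(
        "Local\\RustBacktraceMutex{}",
        GetCurrentProcessId()
    ))
    .unwrap();
    CreateMutexA(std::ptr::null_mut(), FALSE, name.as_ptr())
}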

ChrisDenton (Contributor) commented Mar 31, 2023

Huh, that is odd. There's indeed no reason I can see to synchronize between processes. The docs only say that threads need to be synchronized.

I think it would be more appropriate to use a critical section rather than mess about in the session namespace.

EDIT: Oh, I guess the problem is that if the standard library's backtrace and another instance of backtrace are used in different DLLs but loaded within the same process, it becomes kind of awkward to share a global symbol without some way to coordinate.
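
For reference, a rough sketch of the critical-section alternative (winapi crate; illustrative only, and, per the edit above, a lock like this lives in one module's image, so two separately loaded copies of backtrace would not share it):

use std::mem::MaybeUninit;
use winapi::um::minwinbase::CRITICAL_SECTION;
use winapi::um::synchapi::{
    EnterCriticalSection, InitializeCriticalSection, LeaveCriticalSection,
};

unsafe fn with_dbghelp_lock() {
    // In real code this would be a process-global, initialized once.
    let mut lock = MaybeUninit::<CRITICAL_SECTION>::uninit();
    InitializeCriticalSection(lock.as_mut_ptr());

    EnterCriticalSection(lock.as_mut_ptr());
    // ... call into dbghelp.dll here ...
    LeaveCriticalSection(lock.as_mut_ptr());
}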
