WaitForSingleObjectEx returns WAIT_ABANDONED when collecting backtraces #399

Closed · Tracked by #2 · Fixed by #518
Jasper-Bekkers opened this issue Dec 22, 2020 · 13 comments

Jasper-Bekkers commented Dec 22, 2020

OS Version: windows None (None)
Report Version: 104


Application Specific Information:
assertion failed: `(left == right)`
  left: `128`,
 right: `0`

Thread 0 Crashed:
0   std                             0x7ff7e6d46b7c      std::panicking::begin_panic_fmt (panicking.rs:437)
1   backtrace                       0x7ff7e695cbdc      backtrace::dbghelp::init (dbghelp.rs:315)
2   backtrace                       0x7ff7e695b8f4      backtrace::symbolize::dbghelp::resolve (dbghelp.rs:76)
3   backtrace                       0x7ff7e695995a      [inlined] backtrace::symbolize::resolve_frame_unsynchronized (mod.rs:178)
4   backtrace                       0x7ff7e695995a      backtrace::symbolize::resolve_frame<T> (mod.rs:105)
5   backtrace                       0x7ff7e6958ee8      backtrace::capture::Backtrace::resolve (capture.rs:232)
6   backtrace                       0x7ff7e6958db5      backtrace::capture::Backtrace::new (capture.rs:141)
7   gpu_allocator                   0x7ff7e6875c5f      gpu_allocator::VulkanAllocator::allocate (lib.rs:688)
8   breda                           0x7ff7e67c7a01      breda::renderer::vulkan::allocator::gpu_allocator_wrapper::{{impl}}::alloc (gpu_allocator_wrapper.rs:100)

...

The failing assertion is:

debug_assert_eq!(r, 0);

(The 128 in the panic message is 0x80, i.e. WAIT_ABANDONED.)

First seen here: Traverse-Research/gpu-allocator#13

We've encountered it twice but haven't been able to trap it in the debugger; I'll update this issue if we have more information.

MarijnS95 commented Dec 22, 2020

This must be the Drop that is not synchronized. ReleaseMutex(self.lock) releases the lock without ever clearing the atomic in static LOCK. Subsequent calls to init() (after a prior allocated Init is dropped) will pick up the released lock and return WAIT_ABANDONED when WaitForSingleObjectEx is called on it.

EDIT: Hold up, that observation is wrong. ReleaseMutex releases the lock without abandoning it (that would be CloseHandle), so it's fine to keep it alive in LOCK.

I cannot find the rules surrounding releasing/closing handles in shared namespaces. I would assume CreateMutexA simply increments the refcount of the shared mutex and retrieves a "view" on it, and CloseHandle removes the view and decrements the refcount without cleaning up the shared resource (except if the refcount is 0).
Perhaps that is not the case, and running CloseHandle on the lock actually destroys it in the shared namespace as well, meaning that the mutex retrieved from LOCK is now referencing an abandoned mutex in the shared namespace.

Jasper-Bekkers (Author) commented Dec 22, 2020

I think the call to CloseHandle is incorrect, since we're creating a session-shared mutex and CreateMutexA returns the existing shared mutex if it already exists:

From https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-createmutexa

If the mutex is a named mutex and the object existed before this function call, the return value is a handle to the existing object, and the GetLastError function returns ERROR_ALREADY_EXISTS.

Edit: Actually it looks like that behavior would return WAIT_FAILED instead.
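
For illustration, a minimal sketch (using the winapi crate; the function name is made up, this is not code from backtrace-rs) of detecting that CreateMutexA opened a pre-existing named mutex, per the docs quoted above:

use winapi::shared::winerror::ERROR_ALREADY_EXISTS;
use winapi::um::errhandlingapi::GetLastError;
use winapi::um::synchapi::CreateMutexA;

unsafe fn open_or_create_mutex() {
    let h = CreateMutexA(
        std::ptr::null_mut(),
        0, // bInitialOwner = FALSE
        "Local\\RustBacktraceMutex\0".as_ptr() as _,
    );
    // Per the documentation: if the named mutex already existed, the call
    // returns a handle to the existing object and GetLastError reports
    // ERROR_ALREADY_EXISTS.
    if GetLastError() == ERROR_ALREADY_EXISTS {
        eprintln!("opened existing mutex: {:?}", h);
    }
}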

MarijnS95 commented Dec 24, 2020

I get a unique handle value for every call to CreateMutexA with the same shared mutex name. Closing that handle does not seem to affect the shared mutex in any sort of way.

The only way to repro WAIT_ABANDONED from WaitForSingleObjectEx is by locking the mutex from a thread that terminates without first unlocking it via ReleaseMutex (note that CloseHandle will not unlock it).

use std::ptr;

use winapi::shared::minwindef::FALSE;
use winapi::um::handleapi::CloseHandle;
use winapi::um::synchapi::{CreateMutexA, ReleaseMutex, WaitForSingleObjectEx};
use winapi::um::winbase::INFINITE;

#[test]
fn wait_abandoned() {
    unsafe {
        // Open (or create) the named mutex; every call returns a distinct
        // handle to the same underlying kernel object.
        let l1 = CreateMutexA(
            ptr::null_mut(),
            FALSE,
            "Local\\RustBacktraceMutex\0".as_ptr() as _,
        );
        dbg!(l1);

        let t = std::thread::spawn(|| {
            let l2 = CreateMutexA(
                ptr::null_mut(),
                FALSE,
                "Local\\RustBacktraceMutex\0".as_ptr() as _,
            );
            dbg!(l2);

            // Acquire the mutex, then let the thread exit while still
            // holding it, abandoning the mutex.
            dbg!(WaitForSingleObjectEx(l2, INFINITE, FALSE));
            // CloseHandle does not unlock the mutex
            // dbg!(CloseHandle(l2));
        });

        // Give the spawned thread time to acquire the mutex and exit.
        std::thread::sleep(std::time::Duration::from_millis(200));

        // Prints WAIT_ABANDONED (0x80): the owning thread terminated
        // without calling ReleaseMutex.
        dbg!(WaitForSingleObjectEx(l1, INFINITE, FALSE));
        dbg!(ReleaseMutex(l1));
        dbg!(CloseHandle(l1));
        t.join().unwrap();
    }
}

That'd mean the Drop handler is not called for a panicked and terminated thread, while another thread was waiting in init() for the lock. There seems to be no sign of another thread panicking during backtrace handling (while holding the lock) though.

https://devblogs.microsoft.com/oldnewthing/20050912-14/?p=34253 According to common sense and this article, treating WAIT_ABANDONED as a successful lock is potentially dangerous when protecting data structures.

alexcrichton (Member) commented

Thanks for the report! It looks like it should be fine to handle WAIT_ABANDONED here since it gives us the lock anyway; it's just telling us that something else was zapped. Can you confirm, though, that your application has a thread that's disappearing? I'm curious how the thread is exiting without releasing the lock.
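
For what it's worth, a minimal sketch (winapi crate; hypothetical helper, not the actual backtrace-rs code) of what treating WAIT_ABANDONED as an acquired lock could look like:

use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::synchapi::WaitForSingleObjectEx;
use winapi::um::winbase::{INFINITE, WAIT_ABANDONED, WAIT_OBJECT_0};

// Hypothetical helper: both WAIT_OBJECT_0 and WAIT_ABANDONED grant
// ownership of the mutex; WAIT_ABANDONED additionally signals that a
// previous owner terminated while holding it.
unsafe fn acquire(lock: HANDLE) -> bool {
    match WaitForSingleObjectEx(lock, INFINITE, FALSE) {
        WAIT_OBJECT_0 | WAIT_ABANDONED => true,
        _ => false,
    }
}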

MarijnS95 commented

@alexcrichton I'm hesitant to accept WAIT_ABANDONED as "success" for the aforementioned reasons: who knows what state dbghelp is in, and how much that might break.

We definitely have to add some tracing, as the assertion above comes from a code path that's not called when our application is in total despair: it's rather unlikely for threads to disappear there. It is, however, possible (quite likely, even) that the application panicked on another thread, but shouldn't that exit gracefully by dropping Init?

alexcrichton (Member) commented

Yeah, it's true that dbghelp may not be in a great spot, but it depends on where the thread exited. It may have exited in Rust code without being inside dbghelp, for example. It's true, though, that panics should run destructors and unlock the mutex here.

I think it might be good to try to minimize this to figure out what's causing it. There may be interactions with other languages or something like that, or maybe some system library is zapping a thread? Hard to say unfortunately :(

repi commented Aug 14, 2021

Ran into exactly this as well today in our app; it happened on our main thread, and I don't think any other threads were being destroyed.

@MarijnS95 did y'all track this any further?

Jasper-Bekkers (Author) commented

@repi Nope, we just started doing much less frequent callstack tracing instead 🙈

MarijnS95 commented

@repi Unfortunately we have not been able to repro this anymore, bar the threading example above. Maybe there are short-lived, panicking threads spawned after all? That seems to be the only way to run into WAIT_ABANDONED.

I'd suggest adding logging/tracing code around here to keep track of which threads lock and (potentially forget to) unlock this mutex, in the hope of finding the cause; something like the sketch below.
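
A hedged sketch of such tracing (winapi crate; the wrapper names are made up), logging the thread ID on every lock/unlock so an abandoning thread would show up as a lock without a matching release:

use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::processthreadsapi::GetCurrentThreadId;
use winapi::um::synchapi::{ReleaseMutex, WaitForSingleObjectEx};
use winapi::um::winbase::INFINITE;

// Hypothetical wrappers around the existing lock/unlock calls.
unsafe fn traced_lock(lock: HANDLE) {
    let r = WaitForSingleObjectEx(lock, INFINITE, FALSE);
    eprintln!("thread {} locked dbghelp mutex (r = {:#x})", GetCurrentThreadId(), r);
}

unsafe fn traced_unlock(lock: HANDLE) {
    let r = ReleaseMutex(lock);
    eprintln!("thread {} released dbghelp mutex (ok = {})", GetCurrentThreadId(), r != 0);
}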

Jasper-Bekkers (Author) commented

@MarijnS95 I had just disabled our intensive callstack recording to work around it since it was mostly a debug-only feature anyway. It's still there.

MarijnS95 commented

I count that as not repro'ing it anymore!

For us it was probably a callstack being recorded right as a panic was being handled, which did not end up releasing the mutex before closing it (and/or terminating the entire thread).


p--b commented Mar 31, 2023

We think we've just encountered this too. I'm curious about the use of a session-wide mutex with a fixed name: the aim described in the comment is to synchronise all accesses to dbghelp.dll across the process, but my understanding of a named kernel object in the session namespace is that it actually synchronises access across the entire session, i.e. every process in the session shares the same kernel mutex object. That doesn't seem to be the original intent of the code, and could perhaps be related to these symptoms: if a thread in another (entirely unrelated) process takes this lock and then aborts / is terminated, I think that could lead to what we're seeing here...

I think one way around this would be to include the PID in the name of the created mutex: the PID is stable across the entire execution of the process and would disambiguate the created mutex between different processes, as in the sketch below. I was going to try this to test my theory, but I believe testing it would require rebuilding the standard library...
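
A minimal sketch of that naming scheme (winapi crate; purely illustrative, not an actual patch):

use std::ffi::CString;
use winapi::shared::minwindef::FALSE;
use winapi::shared::ntdef::HANDLE;
use winapi::um::processthreadsapi::GetCurrentProcessId;
use winapi::um::synchapi::CreateMutexA;

// Hypothetical: appending the PID gives every process its own kernel
// mutex instead of one object shared by the whole session.
unsafe fn create_process_local_mutex() -> HANDLE {
    let name = CString::new(format!(
        "Local\\RustBacktraceMutex{}",
        GetCurrentProcessId()
    ))
    .unwrap();
    CreateMutexA(std::ptr::null_mut(), FALSE, name.as_ptr())
}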

ChrisDenton (Contributor) commented Mar 31, 2023

Huh, that is odd. There's indeed no reason I can see to synchronize between processes. The docs only say that threads need to be synchronized.

I think it would be more appropriate to use a critical section rather than mess about in the session namespace.

EDIT: Oh, I guess the problem is that if the standard library's backtrace and another instance of backtrace are used in different DLLs but loaded within the same process, it becomes kind of awkward to share a global symbol without some way to coordinate.
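
For reference, a rough sketch of the critical-section alternative (winapi crate; illustrative only, and, per the edit above, a lock like this lives in one module's image, so two separately loaded copies of backtrace would not share it):

use std::mem::MaybeUninit;
use winapi::um::minwinbase::CRITICAL_SECTION;
use winapi::um::synchapi::{
    EnterCriticalSection, InitializeCriticalSection, LeaveCriticalSection,
};

unsafe fn with_dbghelp_lock() {
    // In real code this would be a process-global, initialized once.
    let mut lock = MaybeUninit::<CRITICAL_SECTION>::uninit();
    InitializeCriticalSection(lock.as_mut_ptr());

    EnterCriticalSection(lock.as_mut_ptr());
    // ... call into dbghelp.dll here ...
    LeaveCriticalSection(lock.as_mut_ptr());
}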
