Joining thread in thread local storage destructor never returns on Windows #74875

Closed
emgre opened this issue Jul 28, 2020 · 14 comments
Labels
A-thread-locals Area: Thread local storage (TLS) O-windows Operating system: Windows T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@emgre

emgre commented Jul 28, 2020

The following code never returns on Windows: the console never prints Done, and sometimes Start thread and End thread are not printed either. On Linux, it runs as expected.

use std::thread;
use std::time::Duration;

struct Test {
    thread: Option<thread::JoinHandle<()>>,
}

impl Test {
    fn new() -> Self {
        let thread = std::thread::spawn(move || {
            println!("Start thread");
            thread::sleep(Duration::from_millis(500));
            println!("End thread");
        });

        Self {
            thread: Some(thread),
        }
    }
}

impl Drop for Test {
    fn drop(&mut self) {
        if let Some(thread) = self.thread.take() {
            thread.join().unwrap();
        }
    }
}

fn main() {
    thread_local!(
        static R: Test = Test::new();
    );

    thread::spawn(|| {
        R.with(|_| {});
    })
    .join()
    .unwrap();

    println!("Done");
}

(Playground)

Output:

Start thread
End thread
Done

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.20s
     Running `target/debug/playground`

My guess is that this wizardry is the cause:

use crate::mem;
use crate::ptr;
use crate::sync::atomic::AtomicPtr;
use crate::sync::atomic::Ordering::SeqCst;
use crate::sys::c;
pub type Key = c::DWORD;
pub type Dtor = unsafe extern "C" fn(*mut u8);
// Turns out, like pretty much everything, Windows is pretty close the
// functionality that Unix provides, but slightly different! In the case of
// TLS, Windows does not provide an API to provide a destructor for a TLS
// variable. This ends up being pretty crucial to this implementation, so we
// need a way around this.
//
// The solution here ended up being a little obscure, but fear not, the
// internet has informed me [1][2] that this solution is not unique (no way
// I could have thought of it as well!). The key idea is to insert some hook
// somewhere to run arbitrary code on thread termination. With this in place
// we'll be able to run anything we like, including all TLS destructors!
//
// To accomplish this feat, we perform a number of threads, all contained
// within this module:
//
// * All TLS destructors are tracked by *us*, not the windows runtime. This
//   means that we have a global list of destructors for each TLS key that
//   we know about.
// * When a thread exits, we run over the entire list and run dtors for all
//   non-null keys. This attempts to match Unix semantics in this regard.
//
// This ends up having the overhead of using a global list, having some
// locks here and there, and in general just adding some more code bloat. We
// attempt to optimize runtime by forgetting keys that don't have
// destructors, but this only gets us so far.
//
// For more details and nitty-gritty, see the code sections below!
//
// [1]: http://www.codeproject.com/Articles/8113/Thread-Local-Storage-The-C-Way
// [2]: https://github.com/ChromiumWebApps/chromium/blob/master/base
// /threading/thread_local_storage_win.cc#L42
// -------------------------------------------------------------------------
// Native bindings
//
// This section is just raw bindings to the native functions that Windows
// provides, There's a few extra calls to deal with destructors.
#[inline]
pub unsafe fn create(dtor: Option<Dtor>) -> Key {
    let key = c::TlsAlloc();
    assert!(key != c::TLS_OUT_OF_INDEXES);
    if let Some(f) = dtor {
        register_dtor(key, f);
    }
    key
}

#[inline]
pub unsafe fn set(key: Key, value: *mut u8) {
    let r = c::TlsSetValue(key, value as c::LPVOID);
    debug_assert!(r != 0);
}

#[inline]
pub unsafe fn get(key: Key) -> *mut u8 {
    c::TlsGetValue(key) as *mut u8
}

#[inline]
pub unsafe fn destroy(_key: Key) {
    rtabort!("can't destroy tls keys on windows")
}

#[inline]
pub fn requires_synchronized_create() -> bool {
    true
}
// -------------------------------------------------------------------------
// Dtor registration
//
// Windows has no native support for running destructors so we manage our own
// list of destructors to keep track of how to destroy keys. We then install a
// callback later to get invoked whenever a thread exits, running all
// appropriate destructors.
//
// Currently unregistration from this list is not supported. A destructor can be
// registered but cannot be unregistered. There's various simplifying reasons
// for doing this, the big ones being:
//
// 1. Currently we don't even support deallocating TLS keys, so normal operation
//    doesn't need to deallocate a destructor.
// 2. There is no point in time where we know we can unregister a destructor
//    because it could always be getting run by some remote thread.
//
// Typically processes have a statically known set of TLS keys which is pretty
// small, and we'd want to keep this memory alive for the whole process anyway
// really.
//
// Perhaps one day we can fold the `Box` here into a static allocation,
// expanding the `StaticKey` structure to contain not only a slot for the TLS
// key but also a slot for the destructor queue on windows. An optimization for
// another day!
static DTORS: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

struct Node {
    dtor: Dtor,
    key: Key,
    next: *mut Node,
}

#[cfg(miri)]
extern "Rust" {
    /// Miri-provided extern function to mark the block `ptr` points to as a "root"
    /// for some static memory. This memory and everything reachable by it is not
    /// considered leaking even if it still exists when the program terminates.
    ///
    /// `ptr` has to point to the beginning of an allocated block.
    fn miri_static_root(ptr: *const u8);
}

unsafe fn register_dtor(key: Key, dtor: Dtor) {
    let mut node = Box::new(Node { key, dtor, next: ptr::null_mut() });

    let mut head = DTORS.load(SeqCst);
    loop {
        node.next = head;
        match DTORS.compare_exchange(head, &mut *node, SeqCst, SeqCst) {
            Ok(_) => {
                #[cfg(miri)]
                miri_static_root(&*node as *const _ as *const u8);

                mem::forget(node);
                return;
            }
            Err(cur) => head = cur,
        }
    }
}
// -------------------------------------------------------------------------
// Where the Magic (TM) Happens
//
// If you're looking at this code, and wondering "what is this doing?",
// you're not alone! I'll try to break this down step by step:
//
// # What's up with CRT$XLB?
//
// For anything about TLS destructors to work on Windows, we have to be able
// to run *something* when a thread exits. To do so, we place a very special
// static in a very special location. If this is encoded in just the right
// way, the kernel's loader is apparently nice enough to run some function
// of ours whenever a thread exits! How nice of the kernel!
//
// Lots of detailed information can be found in source [1] above, but the
// gist of it is that this is leveraging a feature of Microsoft's PE format
// (executable format) which is not actually used by any compilers today.
// This apparently translates to any callbacks in the ".CRT$XLB" section
// being run on certain events.
//
// So after all that, we use the compiler's #[link_section] feature to place
// a callback pointer into the magic section so it ends up being called.
//
// # What's up with this callback?
//
// The callback specified receives a number of parameters from... someone!
// (the kernel? the runtime? I'm not quite sure!) There are a few events that
// this gets invoked for, but we're currently only interested on when a
// thread or a process "detaches" (exits). The process part happens for the
// last thread and the thread part happens for any normal thread.
//
// # Ok, what's up with running all these destructors?
//
// This will likely need to be improved over time, but this function
// attempts a "poor man's" destructor callback system. Once we've got a list
// of what to run, we iterate over all keys, check their values, and then run
// destructors if the values turn out to be non null (setting them to null just
// beforehand). We do this a few times in a loop to basically match Unix
// semantics. If we don't reach a fixed point after a short while then we just
// inevitably leak something most likely.
//
// # The article mentions weird stuff about "/INCLUDE"?
//
// It sure does! Specifically we're talking about this quote:
//
//      The Microsoft run-time library facilitates this process by defining a
//      memory image of the TLS Directory and giving it the special name
//      “__tls_used” (Intel x86 platforms) or “_tls_used” (other platforms). The
//      linker looks for this memory image and uses the data there to create the
//      TLS Directory. Other compilers that support TLS and work with the
//      Microsoft linker must use this same technique.
//
// Basically what this means is that if we want support for our TLS
// destructors/our hook being called then we need to make sure the linker does
// not omit this symbol. Otherwise it will omit it and our callback won't be
// wired up.
//
// We don't actually use the `/INCLUDE` linker flag here like the article
// mentions because the Rust compiler doesn't propagate linker flags, but
// instead we use a shim function which performs a volatile 1-byte load from
// the address of the symbol to ensure it sticks around.
#[link_section = ".CRT$XLB"]
#[allow(dead_code, unused_variables)]
#[used] // we don't want LLVM eliminating this symbol for any reason, and
        // when the symbol makes it to the linker the linker will take over
pub static p_thread_callback: unsafe extern "system" fn(c::LPVOID, c::DWORD, c::LPVOID) =
    on_tls_callback;

#[allow(dead_code, unused_variables)]
unsafe extern "system" fn on_tls_callback(h: c::LPVOID, dwReason: c::DWORD, pv: c::LPVOID) {
    if dwReason == c::DLL_THREAD_DETACH || dwReason == c::DLL_PROCESS_DETACH {
        run_dtors();
    }

    // See comments above for what this is doing. Note that we don't need this
    // trickery on GNU windows, just on MSVC.
    reference_tls_used();
    #[cfg(target_env = "msvc")]
    unsafe fn reference_tls_used() {
        extern "C" {
            static _tls_used: u8;
        }
        crate::intrinsics::volatile_load(&_tls_used);
    }
    #[cfg(not(target_env = "msvc"))]
    unsafe fn reference_tls_used() {}
}

#[allow(dead_code)] // actually called above
unsafe fn run_dtors() {
    let mut any_run = true;
    for _ in 0..5 {
        if !any_run {
            break;
        }
        any_run = false;
        let mut cur = DTORS.load(SeqCst);
        while !cur.is_null() {
            let ptr = c::TlsGetValue((*cur).key);

            if !ptr.is_null() {
                c::TlsSetValue((*cur).key, ptr::null_mut());
                ((*cur).dtor)(ptr as *mut _);
                any_run = true;
            }

            cur = (*cur).next;
        }
    }
}
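
As a rough illustration of the bounded loop in run_dtors above, the sketch below models TLS slots as a plain slice and a destructor as a closure that may re-fill a slot. The name run_dtors_model and the u32 slot values are hypothetical and not part of libstd. Only the shape of the loop is taken from the quoted code: at most five passes, stopping early once a pass runs nothing.

```rust
/// Hypothetical model of the pass loop in `run_dtors`; not libstd code.
/// Returns how many passes were made over the slots.
fn run_dtors_model<F>(slots: &mut [Option<u32>], dtor: F) -> usize
where
    F: Fn(u32, &mut [Option<u32>]),
{
    let mut passes = 0;
    let mut any_run = true;
    for _ in 0..5 {
        if !any_run {
            break;
        }
        any_run = false;
        passes += 1;
        for i in 0..slots.len() {
            // Clear the slot before running its destructor, mirroring the
            // TlsSetValue(key, null) call in the quoted code.
            if let Some(value) = slots[i].take() {
                dtor(value, &mut *slots);
                any_run = true;
            }
        }
    }
    passes
}

fn main() {
    // A destructor that re-initializes slot 0 once: the loop settles.
    let mut slots: [Option<u32>; 2] = [Some(1), None];
    let passes = run_dtors_model(&mut slots, |value, slots| {
        if value == 1 {
            slots[0] = Some(2);
        }
    });
    let cleared = slots.iter().all(|s| s.is_none());
    println!("passes: {}, all cleared: {}", passes, cleared);

    // A destructor that always re-fills its slot never settles, so the
    // loop gives up after five passes and a value is left behind.
    let mut stubborn: [Option<u32>; 1] = [Some(1)];
    let passes = run_dtors_model(&mut stubborn, |value, slots| {
        slots[0] = Some(value + 1);
    });
    println!("passes: {}, leftover: {:?}", passes, stubborn[0]);
}
```

The second case corresponds to the "we just inevitably leak something" remark in the quoted comments: a value that keeps getting re-created is eventually abandoned rather than destroyed.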

@RalfJung
Member

RalfJung commented Jul 29, 2020

This is specifically about TLS destructors of the main thread, right? That one is kind of dubious anyway: #28129.

Does this also happen when it's not the main thread's dtors?

@RalfJung
Member

Cc @mati865

@RalfJung RalfJung added the O-windows Operating system: Windows label Jul 29, 2020
@emgre
Author

emgre commented Jul 29, 2020

In my example, the TLS destructor is called from within the spawned thread AFAIK, not the main thread. Even if I move the thread_local! inside the spawned thread, it has the same faulty behaviour.

@emgre
Author

emgre commented Jul 29, 2020

I stepped a bit through libstd from my example. The thread spawned from Test reaches the end and properly returns. The join in main then blocks on this WaitForSingleObject: https://github.com/rust-lang/rust/blob/1.45.0/src/libstd/sys/windows/thread.rs#L73

It looks like Windows isn't aware that the thread actually ended... 🤔

@RalfJung
Member

Hm good point, the main thread never uses R.

@mati865
Contributor

mati865 commented Jul 29, 2020

I can only confirm that this hangs with both MinGW and MSVC at the line linked by emgre.

@RalfJung you should ping the Windows WG.

@RalfJung
Member

@rustbot ping windows

@rustbot
Collaborator

rustbot commented Jul 29, 2020

Hey Windows Group! This bug has been identified as a good "Windows candidate".
In case it's useful, here are some instructions for tackling these sorts of
bugs. Maybe take a look?
Thanks! <3

cc @arlosi @danielframpton @gdr-at-ms @kennykerr @luqmana @lzybkr @retep998 @rylev @sivadeilra

@retep998
Member

join() does not return until the thread has actually and completely finished running. If it is still running TLS destructors then the thread is still running. Doing a join() on the current thread will always deadlock, at least on Windows anyway.
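
Where a self-join is a concern, one possible guard is to compare thread IDs before calling join(). The helper names is_self_join and self_join_detected are hypothetical; the sketch relies only on the standard JoinHandle::thread, Thread::id and thread::current calls. It illustrates the check only and does not claim that a self-join is what happens in the reported program.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Hypothetical helper: true if `handle` refers to the calling thread,
/// in which case calling `join()` on it could never complete.
fn is_self_join<T>(handle: &JoinHandle<T>) -> bool {
    handle.thread().id() == thread::current().id()
}

/// Hands a thread its own JoinHandle over a channel and reports what the
/// check says when run from inside that thread.
fn self_join_detected() -> bool {
    let (handle_tx, handle_rx) = mpsc::channel::<JoinHandle<()>>();
    let (result_tx, result_rx) = mpsc::channel::<bool>();

    let handle = thread::spawn(move || {
        let own = handle_rx.recv().unwrap();
        result_tx.send(is_self_join(&own)).unwrap();
        // `own` is dropped here, which detaches the thread instead of joining it.
    });

    handle_tx.send(handle).unwrap();
    result_rx.recv().unwrap()
}

fn main() {
    let other = thread::spawn(|| {});
    println!("joining another thread is a self-join: {}", is_self_join(&other));
    other.join().unwrap();

    println!("thread holding its own handle: {}", self_join_detected());
}
```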

@RalfJung
Member

@retep998 I don't think there's a join on the current thread anywhere here?

@retep998
Member

Hmmm, I'll have to investigate further into what is going on...

@retep998
Member

There are definitely race conditions involved, as inserting sleeps at certain points changes the behavior.

@retep998
Member

retep998 commented Aug 7, 2020

This excerpt from MSDN might be relevant:

The ExitProcess, ExitThread, CreateThread, CreateRemoteThread functions, and a process that is starting (as the result of a CreateProcess call) are serialized between each other within a process. Only one of these events can happen in an address space at a time. This means the following restrictions hold:

  • During process startup and DLL initialization routines, new threads can be created, but they do not begin execution until DLL initialization is done for the process.
  • Only one thread in a process can be in a DLL initialization or detach routine at a time.
  • ExitProcess does not return until no threads are in their DLL initialization or detach routines.
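
The serialization described in this excerpt can be modelled portably. In the sketch below, a plain Mutex stands in for the lock that serializes detach routines; it is a hypothetical stand-in, not a Windows API, and the function name join_times_out_under_lock is likewise made up. One thread holds the lock, as if it were inside its detach routine, and waits for a worker whose own exit path needs the same lock. A timeout replaces the real join so the sketch terminates instead of hanging.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Returns true if waiting for the worker timed out while the lock was held.
fn join_times_out_under_lock() -> bool {
    // Hypothetical stand-in for the lock that serializes detach routines.
    let loader_lock = Arc::new(Mutex::new(()));
    let (exited_tx, exited_rx) = mpsc::channel::<()>();

    // Take the lock first, as if this thread were running a TLS destructor
    // during its own detach.
    let guard = loader_lock.lock().unwrap();

    let worker_lock = Arc::clone(&loader_lock);
    let worker = thread::spawn(move || {
        // The worker's exit path needs the same lock before it can finish.
        let _held = worker_lock.lock().unwrap();
        let _ = exited_tx.send(());
    });

    // A real join would block forever here; the timeout keeps the model finite.
    let timed_out = exited_rx
        .recv_timeout(Duration::from_millis(200))
        .is_err();

    // Releasing the lock lets the worker finish, so the real join now succeeds.
    drop(guard);
    worker.join().unwrap();
    timed_out
}

fn main() {
    println!("wait timed out under lock: {}", join_times_out_under_lock());
}
```

In this model the wait only completes once the lock is released, which matches the observation that the spawned thread appears to finish its work while the join still never returns.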

@jyn514 jyn514 added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Dec 8, 2020
@ChrisDenton
Contributor

The destructors are being run during the thread local callback function. This in turn will run during thread or module exit. During this time the OS holds a loader lock which, to cut a long story short, makes using any kind of thread synchronization prone to deadlock. For more details see Dynamic-Link Library Best Practices (it's talking about DllMain but it's mostly applicable to the TLS callback too).

I don't think this is fixable by Rust other than to say "don't do that".
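
One way to follow the "don't do that" advice is to keep only the JoinHandle in the thread local and join it explicitly before the owning thread returns, so that no join happens inside a TLS destructor. The names WORKER, start_worker and shutdown_worker below are hypothetical, and this is a minimal sketch of the pattern rather than a recommendation from the maintainers. Dropping a JoinHandle detaches the thread without blocking, so the destructor that still runs at thread exit does no synchronization.

```rust
use std::cell::RefCell;
use std::thread::{self, JoinHandle};
use std::time::Duration;

thread_local! {
    // Holds the worker's handle; dropping an unjoined handle only detaches.
    static WORKER: RefCell<Option<JoinHandle<()>>> = RefCell::new(None);
}

/// Spawns a worker and stores its handle in this thread's local slot.
fn start_worker() {
    let handle = thread::spawn(|| {
        thread::sleep(Duration::from_millis(50));
    });
    WORKER.with(|w| *w.borrow_mut() = Some(handle));
}

/// Joins the worker from ordinary code, before the owning thread exits.
/// Returns false if there was no worker left to join.
fn shutdown_worker() -> bool {
    let handle = WORKER.with(|w| w.borrow_mut().take());
    match handle {
        Some(h) => {
            h.join().unwrap();
            true
        }
        None => false,
    }
}

fn main() {
    let joined = thread::spawn(|| {
        start_worker();
        // Explicit shutdown while the thread is still running normally.
        shutdown_worker()
    })
    .join()
    .unwrap();

    println!("worker joined before thread exit: {}", joined);
    println!("Done");
}
```

The trade-off is that cleanup is no longer automatic: every thread that calls start_worker has to call shutdown_worker itself, or the worker is simply detached.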

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Dec 3, 2021
…olnay

Document Windows TLS drop behaviour

The way Windows TLS destructors are run has some "interesting" properties. They should be documented.

Fixes rust-lang#74875
@bors bors closed this as completed in 25474ed Dec 3, 2021