Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std: Avoid locks during TLS destruction on Windows #41512

Merged
merged 1 commit into from
May 6, 2017

Conversation

alexcrichton
Copy link
Member

Gecko recently had a bug reported 1 with a deadlock in the Rust TLS
implementation for Windows. TLS destructors are implemented in a sort of ad-hoc
fashion on Windows as it doesn't natively support destructors for TLS keys. To
work around this the runtime manages a list of TLS destructors and registers a
hook to get run whenever a thread exits. When a thread exits it takes a look at
the list and runs all destructors.

Unfortunately it turns out that there's a lock which is held when our "at thread
exit" callback is run. The callback then attempts to acquire a lock protecting
the list of TLS destructors. Elsewhere in the codebase while we hold a lock over
the TLS destructors we try to acquire the same lock held first before our
special callback is run. And as a result, deadlock!

This commit sidesteps the issue with a few small refactorings:

  • Removed support for destroying a TLS key on Windows. We don't actually ever
    exercise this as a public-facing API, and it's only used during lazy_init
    during racy situations. To handle that we just synchronize lazy_init
    globally on Windows so we never have to call destroy.

  • With no need to support removal the global synchronized Vec was tranformed
    to a lock-free linked list. With the removal of locks this means that
    iteration no long requires a lock and as such we won't run into the deadlock
    problem mentioned above.

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup. Unfortunately
I don't know of a great way around that, but this at least fixes the immediate
problem that Gecko was seeing which is that with "well behaved" destructors the
system would still deadlock!

@rust-highfive
Copy link
Collaborator

r? @BurntSushi

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Copy link
Member Author

cc @brson

@alexcrichton
Copy link
Member Author

cc @vadimcn

@mzji
Copy link

mzji commented Apr 24, 2017

@retep998 Is there a common solution for this problem on Windows?

@vadimcn
Copy link
Contributor

vadimcn commented Apr 25, 2017

Well, the standard Windows solution is "Just don't call any Windows APIs (with a few exceptions) while the loader lock is taken". Which means that if one would like to run arbitrary code on library load/unload, they must provide some kind of Initialize/Shutdown calls in said library.

@vadimcn
Copy link
Contributor

vadimcn commented Apr 25, 2017

In this case, though, this isn't about running destructors on shutdown, but rather an unfortunate interaction of Widnows XP fallback code in the implementation of Mutex, which may invoke GetProcAddress if this is the first usage of Mutex in the process. So another way to fix this particular problem would be to use a spin-lock in TLS code.

@alexcrichton
Copy link
Member Author

@vadimcn note that this PR fixes the TLS issue by just removing all locks. I tweaked the interface a bit so the locks weren't needed, and the public facing API hasn't changed at all.

My main question would be that this is a pretty unfortunate drawback that affects all TLS destructors in Rust, so standard destructors that, say, grab a lock or otherwise go into the Windows runtime may be affected? That seems... bad :(

@vadimcn
Copy link
Contributor

vadimcn commented Apr 26, 2017

Yeah, I don't know of a 100% safe way of running arbitrary code on DLL unload ☹️ .

@aidanhs aidanhs added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 26, 2017
@alexcrichton
Copy link
Member Author

cc @brson (you may be able to review as well)

c::TlsSetValue(key, ptr::null_mut());
dtor(ptr as *mut _);
c::TlsSetValue((*cur).key, ptr::null_mut());
((*cur).dtor)(ptr as *mut _);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents this section of code from racing? That is, can more than one thread be here trying to run the dtor of the same key? (This is likely for my own edification, since I can see the old code did this as well. One of my assumptions is that run_dtors is running every time a thread exits, right?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct yeah this can run concurrently, but the dtor is just a raw function pointer so it's up to the dtor to have internal synchronization if neessary.

The list of global dtors is only prepended to so taking any location and iterating over it should always be valid in terms of avoiding use-after-free.

@BurntSushi
Copy link
Member

BurntSushi commented May 4, 2017

This LGTM and the approach seems sensible. I had one question about the code and one more:

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup.

Could you unpack this a bit? In particular, I don't know enough to understand how the second sentence follows from the first.

@alexcrichton
Copy link
Member Author

Sure yeah, so apparently what's happening is that when a TLS destructor is run the Windows runtime holds at least one lock, the dynamic library lookup lock. I'm not sure if other locks are held. This means that if a TLS dtor attempts to grab lock A, whereas elsewhere in the code you grab lock A and then do something that grabs the dynamic library lookup lock you'll deadlock (as seen here).

I'm basically just not sure there's a way to run TLS dtors on Windows in a situation where "almost arbitrary code is allowed" in the same sense that arbitrary code is allowed in the main function.

@BurntSushi
Copy link
Member

Seems sensible, thanks for explaining! @bors r+

@bors
Copy link
Contributor

bors commented May 4, 2017

📌 Commit fffe254 has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented May 4, 2017

⌛ Testing commit fffe254 with merge 91ddbd3...

@bors
Copy link
Contributor

bors commented May 4, 2017

💔 Test failed - status-appveyor

@Mark-Simulacrum
Copy link
Member

Tidy failed? This is odd, since I don't see any error message...

I'm going to retry since it feels spurious but not opening an issue yet since, well, I'm not sure:

@bors retry

command did not execute successfully: "C:\\projects\\rust\\build\\x86_64-pc-windows-msvc\\stage0-tools\\x86_64-pc-windows-msvc\\release\\tidy.exe" "C:\\projects\\rust\\src" "--no-vendor"
expected success, got: exit code: 3221225477

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 5, 2017
…ck, r=BurntSushi

std: Avoid locks during TLS destruction on Windows

Gecko recently had a bug reported [1] with a deadlock in the Rust TLS
implementation for Windows. TLS destructors are implemented in a sort of ad-hoc
fashion on Windows as it doesn't natively support destructors for TLS keys. To
work around this the runtime manages a list of TLS destructors and registers a
hook to get run whenever a thread exits. When a thread exits it takes a look at
the list and runs all destructors.

Unfortunately it turns out that there's a lock which is held when our "at thread
exit" callback is run. The callback then attempts to acquire a lock protecting
the list of TLS destructors. Elsewhere in the codebase while we hold a lock over
the TLS destructors we try to acquire the same lock held first before our
special callback is run. And as a result, deadlock!

This commit sidesteps the issue with a few small refactorings:

* Removed support for destroying a TLS key on Windows. We don't actually ever
  exercise this as a public-facing API, and it's only used during `lazy_init`
  during racy situations. To handle that we just synchronize `lazy_init`
  globally on Windows so we never have to call `destroy`.

* With no need to support removal the global synchronized `Vec` was tranformed
  to a lock-free linked list. With the removal of locks this means that
  iteration no long requires a lock and as such we won't run into the deadlock
  problem mentioned above.

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup. Unfortunately
I don't know of a great way around that, but this at least fixes the immediate
problem that Gecko was seeing which is that with "well behaved" destructors the
system would still deadlock!

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151
@bors
Copy link
Contributor

bors commented May 5, 2017

⌛ Testing commit fffe254 with merge 090ec35...

@bors
Copy link
Contributor

bors commented May 5, 2017

💔 Test failed - status-appveyor

@Mark-Simulacrum
Copy link
Member

Hmm, no, not a spurious failure, most likely. Maybe it's something to do with thread local storage on Windows? No idea if tidy uses thread locals, though.

@alexcrichton
Copy link
Member Author

Yeah definitely looks like a segfault related to this, I'll take a look

@alexcrichton
Copy link
Member Author

@bors: r=BurntSushi

@bors
Copy link
Contributor

bors commented May 5, 2017

📌 Commit bdf6194 has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented May 5, 2017

⌛ Testing commit bdf6194 with merge 51175c9...

@bors
Copy link
Contributor

bors commented May 5, 2017

💔 Test failed - status-appveyor

@frewsxcv
Copy link
Member

frewsxcv commented May 5, 2017

Failure seems legit:

fatal runtime error: can't destroy tls keys on windows

Gecko recently had a bug reported [1] with a deadlock in the Rust TLS
implementation for Windows. TLS destructors are implemented in a sort of ad-hoc
fashion on Windows as it doesn't natively support destructors for TLS keys. To
work around this the runtime manages a list of TLS destructors and registers a
hook to get run whenever a thread exits. When a thread exits it takes a look at
the list and runs all destructors.

Unfortunately it turns out that there's a lock which is held when our "at thread
exit" callback is run. The callback then attempts to acquire a lock protecting
the list of TLS destructors. Elsewhere in the codebase while we hold a lock over
the TLS destructors we try to acquire the same lock held first before our
special callback is run. And as a result, deadlock!

This commit sidesteps the issue with a few small refactorings:

* Removed support for destroying a TLS key on Windows. We don't actually ever
  exercise this as a public-facing API, and it's only used during `lazy_init`
  during racy situations. To handle that we just synchronize `lazy_init`
  globally on Windows so we never have to call `destroy`.

* With no need to support removal the global synchronized `Vec` was tranformed
  to a lock-free linked list. With the removal of locks this means that
  iteration no long requires a lock and as such we won't run into the deadlock
  problem mentioned above.

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup. Unfortunately
I don't know of a great way around that, but this at least fixes the immediate
problem that Gecko was seeing which is that with "well behaved" destructors the
system would still deadlock!

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151
@alexcrichton
Copy link
Member Author

@bors: r=BurntSushi

@bors
Copy link
Contributor

bors commented May 5, 2017

📌 Commit 495c998 has been approved by BurntSushi

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 5, 2017
…ck, r=BurntSushi

std: Avoid locks during TLS destruction on Windows

Gecko recently had a bug reported [1] with a deadlock in the Rust TLS
implementation for Windows. TLS destructors are implemented in a sort of ad-hoc
fashion on Windows as it doesn't natively support destructors for TLS keys. To
work around this the runtime manages a list of TLS destructors and registers a
hook to get run whenever a thread exits. When a thread exits it takes a look at
the list and runs all destructors.

Unfortunately it turns out that there's a lock which is held when our "at thread
exit" callback is run. The callback then attempts to acquire a lock protecting
the list of TLS destructors. Elsewhere in the codebase while we hold a lock over
the TLS destructors we try to acquire the same lock held first before our
special callback is run. And as a result, deadlock!

This commit sidesteps the issue with a few small refactorings:

* Removed support for destroying a TLS key on Windows. We don't actually ever
  exercise this as a public-facing API, and it's only used during `lazy_init`
  during racy situations. To handle that we just synchronize `lazy_init`
  globally on Windows so we never have to call `destroy`.

* With no need to support removal the global synchronized `Vec` was tranformed
  to a lock-free linked list. With the removal of locks this means that
  iteration no long requires a lock and as such we won't run into the deadlock
  problem mentioned above.

Note that it's still a general problem that you have to be extra super careful
in TLS destructors. For example no code which runs a TLS destructor on Windows
can call back into the Windows API to do a dynamic library lookup. Unfortunately
I don't know of a great way around that, but this at least fixes the immediate
problem that Gecko was seeing which is that with "well behaved" destructors the
system would still deadlock!

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=1358151
bors added a commit that referenced this pull request May 5, 2017
Rollup of 9 pull requests

- Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763
- Failed merges:
@bors bors merged commit 495c998 into rust-lang:master May 6, 2017
@alexcrichton alexcrichton deleted the fix-windows-tls-deadlock branch June 14, 2017 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants