Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'solaris' runtime stack guard conflicts with OS stack-clash mitigation #52577

Closed
pfmooney opened this issue Jul 20, 2018 · 18 comments
Closed

'solaris' runtime stack guard conflicts with OS stack-clash mitigation #52577

pfmooney opened this issue Jul 20, 2018 · 18 comments
Assignees
Labels
O-solaris Operating system: Solaris regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@pfmooney
Copy link
Contributor

pfmooney commented Jul 20, 2018

We have found the stack guard logic present in rust 1.27 conflicts with the OS-provided mitigation present in illumos distributions. This became clear on SmartOS when the 2018Q2 pkgsrc update moved from rust 1.24 to 1.27. Those binaries, compiled on an old platform without the mitigation, bail immediately on machines that do possess it

# rustc
thread 'main' panicked at 'failed to allocate a guard page', libstd/sys/unix/thread.rs:337:17
note: Run with `RUST_BACKTRACE=1` for a backtrace.
# RUST_BACKTRACE=full rustc
thread 'main' panicked at 'failed to allocate a guard page', libstd/sys/unix/thread.rs:337:17
stack backtrace:
   0: 0xfffffd7fde402f3e - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::ha22b267de1618f10
   1: 0xfffffd7fde3cdee6 - std::sys_common::backtrace::print::h9be6459ca953ea86
   2: 0xfffffd7fde410a1b - std::panicking::default_hook::{{closure}}::h11d519068135b02a
   3: 0xfffffd7fde4106e1 - std::panicking::default_hook::he7d6fa94ad994305
   4: 0xfffffd7fde41110c - std::panicking::rust_panic_with_hook::hef4aa8f9edd49f7f
   5: 0xfffffd7fde410ef6 - std::panicking::begin_panic::h350ed1a97f2676c6
   6: 0xfffffd7fde3e7b24 - std::sys::unix::thread::guard::init::h66e847be7b85328f
   7: 0xfffffd7fde3e44fd - std::rt::update_stack_guard::h3749f1240dd6decc
   8: 0xfffffd7fde786185 - rustc_driver::run::hf99f9e1bb314cfb8
   9: 0xfffffd7fde7954da - rustc_driver::main::he45a01066d380276
  10:           0x401842 - std::rt::lang_start::{{closure}}::h50ce2fc38c241613
  11: 0xfffffd7fde410be2 - std::panicking::try::do_call::h260c23fa696aee03
  12: 0xfffffd7fde417919 - __rust_maybe_catch_panic
  13: 0xfffffd7fde3e4325 - std::rt::lang_start_internal::hef1606e7d03cbe2a
  14:           0x4018a3 - main
  15:           0x40167b - _start

The details of how the two guards are conflicting can be found written up in OS-7059. The stack-clash mitigation was added to SmartOS and OmniOS back in October 2017. It was merged into illumos-gate this past week, marking its availability for any remaining downstream distributions.

Our temporary mitigation has been to float a patch in pkgsrc to treat the 'solaris' platform like 'linux', disabling the stack-guard mapping:

--- src/libstd/sys/unix/thread.rs.orig 2018-06-24 22:42:29.203295357 +0000
+++ src/libstd/sys/unix/thread.rs
@@ -309,7 +309,7 @@ pub mod guard {

         let stackaddr = get_stack_start_aligned()?;

-        if cfg!(target_os = "linux") {
+        if cfg!(any(target_os = "linux", target_os = "solaris")) {
             // Linux doesn't allocate the whole stack right away, and
             // the kernel has its own stack-guard mechanism to fault
             // when growing too close to an existing mapping.  If we map
@@ -345,7 +354,7 @@ pub mod guard {
     }

     pub unsafe fn deinit() {
-        if !cfg!(target_os = "linux") {
+        if cfg!(not(any(target_os = "linux", target_os = "solaris"))) {
             if let Some(stackaddr) = get_stack_start_aligned() {
                 // Remove the protection on the guard page.
// FIXME: we cannot unmap the page, because when we mmap()

Considering the rest of the illumos ecosystem is (or soon will be) effected by this issue, it would be nice to get that stack-guard exclusion moved into upstream rust.

@nagisa
Copy link
Member

nagisa commented Jul 21, 2018

Please submit a PR with the patch.

@nagisa nagisa added the O-solaris Operating system: Solaris label Jul 21, 2018
@nagisa
Copy link
Member

nagisa commented Jul 21, 2018

The most important question here probably concerns the backwards compatibility. "Just" disabling the code path for solaris will also remove the stack guard for all the older/other solarises that do not have their built-in stack mitigation implemented. So in the best case we end up with code that checks the mmap call result and does not panic if it fails (potentially ignoring the genuine errors).

Seems to me that "fixing" this in illumos would be better if we care about back-compat.

@pfmooney
Copy link
Contributor Author

Please submit a PR with the patch.

I planned to wait until consensus was reached about the right path forward before submitting a PR with anything.

"Just" disabling the code path for solaris will also remove the stack guard for all the older/other solarises that do not have their built-in stack mitigation implemented.

The same can be said about older deployments of Linux which lack the enhanced stack-clash protections. Is it inappropriate to rely on its simpler OS-provided guard page on systems without stack-clash protection?

The mapping of ld.so on older illumos platforms which lack the stack-clash protections is deterministic. It still results in a page gap between the end of the stack and the next valid mapping. Programs overrunning the stack limit will still take a fault.

Seems to me that "fixing" this in illumos would be better if we care about back-compat.

Considering that there are over 9 months of illumos platforms in the wild on which rust's latest bits will abort, "fixing" illumos does not seem so attractive. How would it differ from Linux's change to its own stack handling?

@nagisa
Copy link
Member

nagisa commented Jul 25, 2018

Is it true that this bug has only begun with 1.27 (I’ve only noticed this point now)? Given that the only changes present in 1.27 in that file are from #48575 perhaps reverting that PR would fix the issue altogether?

The only other approach I’m comfortable with is ignoring the errors from syscalls on solaris, which will hide the actual bugs if they ever happen, but at least both new and old solarises will be properly supported.

@nagisa nagisa added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jul 25, 2018
@nagisa
Copy link
Member

nagisa commented Jul 25, 2018

cc @ishitatsuyuki

@ishitatsuyuki
Copy link
Contributor

@nagisa I think stack guard is always enabled so it has nothing to do with my PR?

@nagisa
Copy link
Member

nagisa commented Jul 25, 2018

This became clear on SmartOS when the 2018Q2 pkgsrc update moved from rust 1.24 to 1.27.

seems to imply that 1.24 was working but 1.27 is not, everything else being the same?

@pfmooney is that right? Is the "2018Q2 update" something resembling an update between, say, Ubuntu releases? If so, can you please check versions 1.26 to 1.24 with other parts of your system being the same?

@cuviper
Copy link
Member

cuviper commented Jul 25, 2018

@pfmooney

Is it inappropriate to rely on its simpler OS-provided guard page on systems without stack-clash protection?

My argument for #43072 was that this is fine. Rust was only adding a simple single-page guard, so if the OS already has this, we're not protecting it any better. Then with the additional protections for Stack Clash, Rust's guard page made things actually worse.

Do the older illumos systems still have some basic protection for overflow? If so, then I think you can use the same logic I did. Trust that the older OS guard is at least as good as our own, and that the newer OS will do even better with Stack Clash. So perhaps illumos can follow what linux is doing since #43072.

@ishitatsuyuki

I think stack guard is always enabled so it has nothing to do with my PR?

If you look at the backtrace, this error is happening from the guard::init within rust_driver::main -- i.e. this is the re-initialization added in your PR. The initial guard that rt::lang_start would have added was apparently OK, since it got past that.

@pfmooney
Copy link
Contributor Author

seems to imply that 1.24 was working but 1.27 is not, everything else being the same?

Correct. The rustc 1.24.1 shipped in 2018Q1 works fine on a system bearing the stack-clash protections. In pre-release testing of 1.27.0 for 2018Q2, it was found to abort due to the added guard. (In order to make it useful, 2018Q2 bears the included patch)

@pfmooney is that right? Is the "2018Q2 update" something resembling an update between, say, Ubuntu releases?

SmartOS ships images based on branches/snapshots of pkgsrc. The OS platform image, consisting of the kernel and system libraries, is independent of these images and can vary from machine to machine. My test environment, for example, happens to be running a joyent_20171215T051107Z platform to host instances based on both 2018Q1 and 2018Q2 images.

If that is right, can you please check versions 1.26 to 1.24 with other parts of your system being the same?

I only have the 1.24.0 binaries presently at hand, but I can look into building a few of the intermediary versions.

Do the older illumos systems still have some basic protection for overflow? If so, then I think you can use the same logic I did.

On older platforms, the deterministic layout of the address space results in an unoccupied page between the upper limit of the stack and the last mapping of ld.so.

So perhaps illumos can follow what linux is doing since #43072.

Yep, that sounds ideal to me.

@pfmooney
Copy link
Contributor Author

Binaries from 1.26.2 appear to work fine on a system with the stack-clash protection.

@pnkfelix
Copy link
Member

discussed at T-compiler meeting. Assigning to self to take point on this. Not 100% sure whether we will treat this as P-medium (which is my current inclination) or P-high, so I'm leaving the P- label unset for now until I can investigate further

@pnkfelix pnkfelix self-assigned this Nov 29, 2018
@pfmooney
Copy link
Contributor Author

Thanks for looking at this @pnkfelix.

The pkgsrc builds still maintain the floated patch from above (Seen here) although it would be nice to get it upstream on the assumption that the illumos platform will have rustup builds one day.

@nagisa
Copy link
Member

nagisa commented Nov 29, 2018

By the way @pfmooney are there any instances of illumos getting their own target triplet?

@pfmooney
Copy link
Contributor Author

I've heard nothing besides what you wrote up in #55553.

@jasonbking
Copy link
Contributor

Just to follow up on this, we're in the process of creating and testing a new illumos OS target for rust. The current proposed plan is that the illumos target will disable rust stack probes in its spec file. Newer versions of the rust toolchain in pkgsrc will be updated to use this new target as the default (instead of x86_64-sun-solaris).

This should I believe fix the problem, while allowing anyone using rust on actual Solaris platforms to benefit from the rust stack guards while allowing illumos platforms to use the built-in stack-clash mitigation. Once ready, we would also like to contribute back the code for the new target (essentially the spec files + a small amount of updates other crates to make them aware of the illumos target as necessary, plus a few bits fo code to fill in a few gaps).

@pnkfelix
Copy link
Member

There has been significant changes to the code is hypothesized to have injected this bug.

We've gone back to running rustc in a spawned thread by default (in PR #56813; though that did not restore all aspects of the code from before PR #48575 landed). And then well after that, PR #56732 got rid of rustc_driver::driver entirely.

In parallel, I have to imagine that much has happened on the Illumos side.

So the question now is: Is there still a bug to resolve here? cc @pfmooney and @jasonbking

@pfmooney
Copy link
Contributor Author

@pnkfelix ,
I took a closer look at this today. With the updates to rustc, floated patch to negate the stack guard behavior on illumos does not appear to be necessary. I reran a build of the latest version in pkgsrc (1.35) and tested it on a SmartOS machine with the OS stack-clash mitigation present. The binaries (both rustc, and a subsequent test program) ran without issue. Thanks for pointing out those changes in this area.

@jasonbking You might want to follow up with @jperkin about confirming the behavior and pulling that patch out of the pkgsrc build.

@pfmooney
Copy link
Contributor Author

Additional testing seems to suggest this is not currently an issue. Absent evidence to the contrary, I'm closing this out. Thanks for your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-solaris Operating system: Solaris regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants