Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.30 -> 1.31 dylib late-binding regression with GNU binutils 2.28 or older. #61539

Open
eddyb opened this issue Jun 5, 2019 · 19 comments
Open

1.30 -> 1.31 dylib late-binding regression with GNU binutils 2.28 or older. #61539

eddyb opened this issue Jun 5, 2019 · 19 comments

Comments

@eddyb
Copy link
Member

@eddyb eddyb commented Jun 5, 2019

Originally reported as #60593, and appears to be caused by #54592, this is my reduction:

  • common.rs
pub static STATIC: u8 = 0;
pub fn get() -> *const u8 { &STATIC }
  • plugin.rs
extern crate common;
#[no_mangle]
pub fn run() {
    println!("plugin: {:p}", common::get as *const ());
    assert_eq!(common::get(), &common::STATIC as *const _);
}
  • driver.rs
extern crate common;
fn main() {
    println!("driver: {:p}", common::get as *const ());
    // You can comment this line to make it work.
    assert_eq!(common::get(), &common::STATIC as *const _);
    unsafe {
        extern "C" {
            fn dlopen(filename: *const u8, flag: i32) -> *mut u8;
            fn dlsym(handle: *mut u8, symbol: *const u8) -> *const u8;
        }
        std::mem::transmute::<*const u8, fn()>(dlsym(
            dlopen(b"./libplugin.so\0".as_ptr(), 1),
            b"run\0".as_ptr(),
        ))()
    }
}

How to run it:

RELEASE=nightly-2018-10-11
rustup toolchain add $RELEASE

echo 'System toolchain versions:'
ldd --version | head -n1 | sed 's/^ldd //'
ld.bfd --version | head -n1
ld.gold --version | head -n1
echo

rustc +$RELEASE common.rs --crate-type=lib
# You can change this from dylib to cdylib to make it work.
rustc +$RELEASE plugin.rs --crate-type=dylib -L.
# You can uncomment -Z plt=yes to make it work.
rustc +$RELEASE driver.rs -L. # -Z plt=yes
./driver

I was able to try this out with NixOS 18.03 and 18.09 releases, via:

NIX_PATH=nixpkgs=https://github.com/NixOS/nixpkgs-channels/archive/nixos-18.03.tar.gz \
    nix-shell -p stdenv --run ./test.sh
NIX_PATH=nixpkgs=https://github.com/NixOS/nixpkgs-channels/archive/nixos-18.09.tar.gz \
    nix-shell -p stdenv --run ./test.sh

Example of failure (using a NixOS 18.03 toolchain):

info: syncing channel updates for 'nightly-2018-10-12-x86_64-unknown-linux-gnu'

  nightly-2018-10-12-x86_64-unknown-linux-gnu unchanged - rustc 1.31.0-nightly (77af31408 2018-10-11)

System toolchain versions:
(GNU libc) 2.26
GNU ld (GNU Binutils) 2.28.1
GNU gold (GNU Binutils 2.28.1) 1.14

driver: 0x55d244a55820
plugin: 0x55d244a55820
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0x55d244a9e7ac`,
 right: `0x7f87a69b3f19`', plugin.rs:5:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Example of success (by switching to NixOS 18.09):

info: syncing channel updates for 'nightly-2018-10-12-x86_64-unknown-linux-gnu'

  nightly-2018-10-12-x86_64-unknown-linux-gnu unchanged - rustc 1.31.0-nightly (77af31408 2018-10-11)

System toolchain versions:
(GNU libc) 2.27
GNU ld (GNU Binutils) 2.30
GNU gold (GNU Binutils 2.30) 1.15

driver: 0x5649e7dbf4f0
plugin: 0x7fca6df40930

Other ways to get it to succeed:

  • use a Rust version before #54592 (e.g. RELEASE=nightly-2018-10-11)
  • pass -Z plt=yes when compiling driver
    • restores pre-#54592 behavior
  • change the crate type of plugin to cdylib (instead of dylib)
    • presumably common::get is private in that case?
    • not a general fix AFAIK, but I'm not sure how to repro with cdylib
  • don't call common::get in driver
    • common::get won't get looked up early, which means driver's copy won't override plugin's (within the context of affected toolchains)
  • set RTLD_DEEPBIND when calling dlopen
    • common::get from driver is ignored when looking up the use from plugin

cc @alexcrichton @rust-lang/compiler

@eddyb eddyb changed the title 1.30 -> 1.31 regression with less recent Linux distro toolchains. 1.30 -> 1.31 dylib late-binding regression with less recent Linux distro toolchains. Jun 5, 2019
@eddyb

This comment has been minimized.

Copy link
Member Author

@eddyb eddyb commented Jun 5, 2019

Looks like relro support was added in GNU binutils 2.28, so maybe it had some bugs?
There isn't any 2.29 binutils build on the official NixOS build servers so I can't easily test that version.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jun 5, 2019

How confident are we that PLT entirely negates this issue other than just accidentally preventing it from triggering in this particular case?

To me it seems that exactly what happens with PLT here is prevention and the core issue still remains regardless of whether PLT is enabled or not.

Not opposed to flipping back the default for now, though.

@eddyb

This comment has been minimized.

Copy link
Member Author

@eddyb eddyb commented Jun 6, 2019

I agree, I have no idea what the actual bug is yet, just what affects this specific reduction.
Note that in all cases that work, common::get is not deduplicated, as opposed to both common::get and common::STATIC being deduplicated.

What I suspect is happening with -Z plt=yes is bypassing a buggy linker feature.
But I'd have to bisect binutils to prove that. Which I might still do, I just hope it's easier than glibc.

@pnkfelix pnkfelix added the O-linux label Jun 6, 2019
@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jun 11, 2019

@eddyb an easy way to (dis-)prove linker fault is to attempt linking with lld.

@pnkfelix

This comment has been minimized.

Copy link
Member

@pnkfelix pnkfelix commented Jun 13, 2019

triage: we need to discuss this at today's meeting before I'll be comfortable assigning a priority to it.

@pnkfelix

This comment has been minimized.

Copy link
Member

@pnkfelix pnkfelix commented Jun 13, 2019

discussed in triage meeting. P-high. Removing nomination tag. Assigning to @nagisa and myself.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jun 30, 2019

Linking with lld works. Using clang instead of gcc as a ld driver also fails. This does indeed point all the fingers to ld. I’ll try to close down on the ld version now.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jun 30, 2019

This issue is no longer present with binutils 2.29.0.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jun 30, 2019

@eddyb eddyb changed the title 1.30 -> 1.31 dylib late-binding regression with less recent Linux distro toolchains. 1.30 -> 1.31 dylib late-binding regression with GNU binutils 2.28. Jul 1, 2019
@eddyb

This comment has been minimized.

Copy link
Member Author

@eddyb eddyb commented Jul 1, 2019

@nagisa @alexcrichton Is there precedent for checking the version of the linker to work around bugs?
I also wonder what happens with GNU binutils 2.27 and earlier, since I don't think it supports relro, do we check for that somehow, and fallback, or are we asking for relro in a way that gets ignored?

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jul 1, 2019

At least 2.27 also fails so not relro related. Waiting for 2.26 to build right now.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jul 1, 2019

2.26 and 2.23.1 also fail (although 2.23.1 only prints the addresses and exits with exit code 0 before calling into the plugin), so it is probably safe to say that this issue is present in all relevant versions of the GNU toolchain before 2.29.

@eddyb

This comment has been minimized.

Copy link
Member Author

@eddyb eddyb commented Jul 1, 2019

@nagisa Oh, right, I forgot, the bug was fixed after 2.28 but I never tested earlier versions.

@eddyb eddyb changed the title 1.30 -> 1.31 dylib late-binding regression with GNU binutils 2.28. 1.30 -> 1.31 dylib late-binding regression with GNU binutils 2.28 or older. Jul 1, 2019
@alexcrichton

This comment has been minimized.

Copy link
Member

@alexcrichton alexcrichton commented Jul 1, 2019

We have only one rudimentary check for support of a c compiler flag (-pie I think) but other than that we don't have any precedence for c compiler or linker version checking in rustc itself

@cuviper

This comment has been minimized.

Copy link
Member

@cuviper cuviper commented Jul 1, 2019

I also wonder what happens with GNU binutils 2.27 and earlier, since I don't think it supports relro,

I think 2.27 only added a way to configure the default:

* Add a configure option --enable-relro to decide whether -z relro should
  be enabled in ELF linker by default.  Default to yes for all Linux
  targets except FRV, HPPA, IA64 and MIPS.  Note that -z relro can increase
  disk and memory size.

But -z relro has existed much longer -- I see references way back in ChangeLog-2004.

@pnkfelix

This comment has been minimized.

Copy link
Member

@pnkfelix pnkfelix commented Jul 11, 2019

pre-triage stray thought: While I worry about trying to change compiler behavior based on a dynamically detected linker version, I would have fewer objections to use such detection just to drive the emission of a diagnostic warning. Where said diagnostic tells people to take one of the aforementioned steps to work around the bug.

nominating for discussion at T-compiler meeting.

@nagisa

This comment has been minimized.

Copy link
Contributor

@nagisa nagisa commented Jul 18, 2019

This has been discussed briefly during the meeting last week, but no real conclusion was arrived at.

@nagisa nagisa removed the I-nominated label Jul 18, 2019
@pnkfelix

This comment has been minimized.

Copy link
Member

@pnkfelix pnkfelix commented Sep 19, 2019

Some points made during aforementioned meeting that I thought were worth transcribing here:

  • nagisa: "the warning message based on just linker version would lead to false positives more often than not, but I don’t know of any other solution that’s not reenabling plt by default."
  • pnkfelix: " is there particular pattern of codegen we could associate the diagnostic with, to try to lower the false positive rate?"
    • nagisa: "Maybe? Ultimately to trigger that particular bug a dynamic library needs to be loaded at runtime (via e.g. dlopen)"
  • nagisa: "Most relevant case where this almost always happen are compiler plugins"
@pnkfelix

This comment has been minimized.

Copy link
Member

@pnkfelix pnkfelix commented Nov 14, 2019

I haven't taken any time to look into this since I last commented here (about two months ago). Its tempting to downgrade this to P-medium with justification "this is a bug in ld that is fixed in a later version." But I don't want to do that without at least surveying the current popular GNU-based distributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.