Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bevy segfaults on some Windows systems after Rust 1.79 update #126442

Closed
mockersf opened this issue Jun 13, 2024 · 24 comments
Closed

Bevy segfaults on some Windows systems after Rust 1.79 update #126442

mockersf opened this issue Jun 13, 2024 · 24 comments
Labels
C-gub Category: the reverse of a compiler bug is generally UB

Comments

@mockersf
Copy link
Contributor

The issue was first noticed in CI: https://github.com/bevyengine/bevy/actions/runs/9505180010/job/26199467296

The same job works on Rust 1.78: https://github.com/mockersf/bevy/actions/runs/9506123706/job/26202601792

A few users reported the same error, so it's not just on virtualised hardware. It's also not on all Windows machines

It seems to be triggered when rendering. Running an example from the Bevy repository that renders something reliably causes a segfault on some systems when using DirectX12. Running directly examples from wgpu don't seem to cause the crash.

A user report that it fails on nightly-2024-04-18, and works on nightly-2024-04-17

Sorry that's not much to go on, but I don't have an affected system available.

@mockersf mockersf added the C-bug Category: This is a bug. label Jun 13, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jun 13, 2024
@lqd
Copy link
Member

lqd commented Jun 13, 2024

A few users reported the same error, so it's not just on virtualised hardware. It's also not on all Windows machines

https://github.com/rust-lang/cargo-bisect-rustc would help locating the nightly (and validate it’s nightly-2024-04-18) and in particular the PR where this issue appeared.

@janhohenheim
Copy link
Contributor

janhohenheim commented Jun 13, 2024

I'm a bit sleepy, but I'll try cargo-bisect-rustc now and see how far I get
Update 1: Running the following now

$env:WGPU_BACKEND='dx12'; cargo-bisect-rustc --start=2024-04-17 --end=2024-04-18 --prompt --with-src -- run --example alien_cake_addict

Update 2: Can confirm the reported range (fails on nightly-2024-04-18, and works on nightly-2024-04-17) is correct
Update 3: Let's hope it's not a rollup PR, since that won't get us the exact failing commit on a Windows-only bug:

And even further, if the regression is in a rollup PR, then it will bisect the individual PRs within the rollup. This final bisection is only available for x86_64-unknown-linux-gnu since it is using the builds made for the rustc performance tracker.

@janhohenheim
Copy link
Contributor

janhohenheim commented Jun 13, 2024

Per another user running the tools, the offending commit is 38104f3
I'm still running the bisect to confirm this.

@lqd
Copy link
Member

lqd commented Jun 13, 2024

cc PR author @Mark-Simulacrum on that bisection

@janhohenheim
Copy link
Contributor

janhohenheim commented Jun 13, 2024

My bisection is done and I can confirm the result. This is the output:

searched nightlies: from nightly-2024-04-17 to nightly-2024-04-18
regressed nightly: nightly-2024-04-18
searched commit range: 1cec373...becebb3
regressed commit: 38104f3

bisected with cargo-bisect-rustc v0.6.8

Host triple: x86_64-pc-windows-msvc
Reproduce with:

cargo bisect-rustc --end=2024-04-18 --prompt --with-src -- run --example alien_cake_addict

note by me: also requires $env:WGPU_BACKEND='dx12'; to actually get the crash

@alice-i-cecile
Copy link

The offending PR: #123936

@workingjubilee
Copy link
Contributor

If that is the offending commit, then it seems very unlikely that this has not unveiled a bug inside wgpu or whatever else is busy talking to DX12.

@workingjubilee
Copy link
Contributor

workingjubilee commented Jun 14, 2024

Can you get a backtrace in WinDbg or whatever's fashionable?

Or can a "minimalized"1 repro case be produced?

Footnotes

  1. where "minimalized" is "slightly less than the entire .exe there", I am aware it is not going to be small

@Brezak
Copy link

Brezak commented Jun 14, 2024

When looking at WinDbg the issue seems to arise here. wgpu-hal was converting a normal rust str into a c_char ptr and passing that across ffi to a function expecting a null terminated string. Unluckily for us this string was a empty string "". Before #123936 that meant we'd get a pointer to somewhere we can read from. After the PR we get a 0x1 pointer that will cause a segfault when read from.

@workingjubilee
Copy link
Contributor

Agh, that is the classic error that I predicted in my remarks here.

@workingjubilee
Copy link
Contributor

workingjubilee commented Jun 14, 2024

This passes, so changing to c"" should be fine.

#[test]
fn cstrs_dont_dangle() {
    let c_str = c"";
    assert_ne!(c_str.as_ptr(), std::ptr::NonNull::dangling().as_ptr());
}

However... @Brezak Are you sure that's the issue? It seems that this string was created from a CString::new()?

https://github.com/gfx-rs/wgpu/blob/d7a35ecda06742acb4cdd5a83c5b9737f375a70b/wgpu-hal/src/dx12/device.rs#L281-L288

Does that CString actually survive the lifetime of that function call?

@Brezak
Copy link

Brezak commented Jun 14, 2024

The debugger said that source_name had the address 0x1.

@workingjubilee
Copy link
Contributor

Oh, I was squinting at the wrong line it seems.

@Brezak
Copy link

Brezak commented Jun 14, 2024

I'm not at my computer right now but in the code calling said borked function there was a unwrap_or_default creating a that would get passed into the function.

@workingjubilee
Copy link
Contributor

yep.

@Brezak
Copy link

Brezak commented Jun 14, 2024

@workingjubilee
Copy link
Contributor

It seems that the reason this isn't hit on the other path is that D3DCompiler is called pretty much directly by wgpu, whereas the hassle_rs bindings of the DxcCompiler convert strings into null terminated wide strings first.

@RalfJung
Copy link
Member

RalfJung commented Jun 14, 2024

So IIUC this line is unsound; if the callee expects a null-terminated string then you can't just pass a str::as_ptr(). compile_fxc should probably take a &ffi::CStr to represent the fact that the string must be null-terminated, or it needs to allocate a CString itself to guarantee null-termination.

When looking at WinDbg the issue seems to arise here. wgpu-hal was converting a normal rust str into a c_char ptr and passing that across ffi to a function expecting a null terminated string. Unluckily for us this string was a empty string "". Before #123936 that meant we'd get a pointer to somewhere we can read from. After the PR we get a 0x1 pointer that will cause a segfault when read from.

So seems like previously you were lucky that the pointer was pointing to a zero byte and so the c string was considered empty -- but that was still UB since you were outside the bounds of that zero-sized allocation.
Now this OOB is detected by hardware and leads to a segfault.

CString can never be backed by a zero-sized allocation, so looks like stage.module.raw_name is created incorrectly if it has pointer value 0x1. Where is that value set? Is it using a safe constructor?
EDIT: I missed the unwrap_or_default, which creates an &"" here. That then later gets passed as a pointer to a function that wants a null-terminated string, which indeed will go Very Wrong.

FWIW, using CString::to_str() and then accessing the trailing 0 is UB under the Stacked Borrows aliasing rules, since the 0 is outside the bounds of the str. To get a pointer to the entire string including the 0, you need to use CStr::as_ptr(). It seems unlikely that that would be the issue though, AFAIK we do not make use of this UB anywhere in the compiler.

@workingjubilee
Copy link
Contributor

@RalfJung Note that hassle_rs::DxcCompiler::compile is somewhat to "blame" here: its API design requests a &str and then "helpfully" converts this to the requisite string, allocating to do so. Thus the compile_fxc API implemented by wgpu matches the signature that is used by compile_dxc, which partially matches the signature of DxcCompiler::compile.

@RalfJung
Copy link
Member

hassle_rs is providing a nice Rusty API and doing so correctly. I don't think they have to take any blame here.

wgpu tried and failed to provide the same nice Rusty API, and used an &str in a way that is not backed by its invariant.

(For anyone reading along, also see the discussion here.)

@workingjubilee
Copy link
Contributor

It is a Rusty API, it's just slightly perplexing to me because the caller is holding on to a CString anyways, it seems.

@RalfJung
Copy link
Member

RalfJung commented Jun 14, 2024 via email

@workingjubilee
Copy link
Contributor

workingjubilee commented Jun 14, 2024

Hopefully!

Actually what's more confusing, now that I look closer, is that the underlying compiler API for DXC takes a wide string (thus it does necessitate allocation, it seems...)... and this is the newer one? It seems like that should be the other way around, knowing Microsoft API version history, and the new one should accept a UTF-8 string, and the older one should ask for a wide string...

...oh wait, both APIs are deprecated, actually, and Microsoft is on to IDxcCompiler3::Compile now. Well!

@mockersf
Copy link
Contributor Author

Thank you for the help and the investigation! It now has been fixed in wgpu, waiting for a backport and a release

@saethlin saethlin added C-discussion Category: Discussion or questions that doesn't represent real issues. and removed C-bug Category: This is a bug. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jun 15, 2024
@workingjubilee workingjubilee added C-gub Category: the reverse of a compiler bug is generally UB and removed C-discussion Category: Discussion or questions that doesn't represent real issues. labels Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-gub Category: the reverse of a compiler bug is generally UB
Projects
None yet
Development

No branches or pull requests

9 participants