Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report I/O errors from rmeta encoding with emit_fatal #119510

Merged
merged 1 commit into from Jan 3, 2024

Conversation

saethlin
Copy link
Member

@saethlin saethlin commented Jan 2, 2024

#119456 reminded me that I never did systematic testing to provoke the out-of-disk ICEs so I grepped through a recent crater run (#119440 (comment)) for more out-of-disk ICEs on current master and yep there's 2 in there.

So I finally cooked up a way to provoke for these crashes. I wrote a little cdylib crate that has a #[no_mangle] pub extern "C" fn write which occasionally reports ENOSPC, and prints a backtrace when it does.

code for the dylib

// cargo add libc rand backtrace
use rand::Rng;

#[no_mangle]
pub extern "C" fn write(
    fd: libc::c_int,
    buf: *const libc::c_void,
    count: libc::size_t,
) -> libc::ssize_t {
    if fd > 2 && rand::thread_rng().gen::<u8>() == 0 {
        let mut count = 0;
        backtrace::trace(|frame| {
            backtrace::resolve_frame(frame, |symbol| {
                if let Some(name) = symbol.name() {
                    if count > 3 {
                        eprintln!("{}", name);
                    }
                }
                count += 1;
            });
            true
        });

        unsafe {
            *libc::__errno_location() = libc::ENOSPC;
        }
        return -1;
    } else {
        unsafe {
            let res =
                libc::syscall(libc::SYS_write, fd as usize, buf as usize, count as usize) as isize;
            if res < 0 {
                *libc::__errno_location() = -res as i32;
                -1
            } else {
                res
            }
        }
    }
}

Then LD_PRELOAD that dylib and repeatedly build a big project until it ICEs, such as with this:

while true; do
    cargo clean
    LD_PRELOAD=/home/ben/evil/target/release/libevil.so cargo +stage1 check 2> errors
    if grep "thread 'rustc' panicked" errors; then
        break
    fi
done

My "big project" for testing was an otherwise-empty project with cargo add axum.

Before this PR, the above procedure finds a crash in between 1 and 15 minutes. With this PR, I have not found a crash in 30 minutes, and I'll be leaving this to run overnight (starting now). (A night has now passed, no crashes were found)

I believe the problem is that even though since #117301 we correctly check FileEncoder for errors on all paths, we use emit_err, so there is a window of time between the call to emit_err and the full error reporting where rustc believes it has emitted a valid rmeta file and will permit Cargo to launch a build for a dependent crate. Changing these calls to emit_fatal closes that window.

I think there are a number of other cases where emit_err has been used instead of the more-correct emit_fatal such as

sess.dcx().emit_err(errors::CopyPath::new(from, path, e));
but unlike rmeta encoding I am not aware of those cases of those causing problems.

r? @WaffleLapkin

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 2, 2024
Copy link
Member

@Nilstrieb Nilstrieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally try to avoid fatal errors when possible, but here it looks very reasonable.
I'm not waffle and you asked for it but r=me if that works for you too

@saethlin
Copy link
Member Author

saethlin commented Jan 2, 2024

I've been sending the out-of-disk PRs to Waffle because I keep getting good reviews back, so in this case thank you for taking a look, but I will wait for Waffle until the tree opens (I have no idea when that would be, but any kind of rush in these circumstances seems silly).

Copy link
Member

@WaffleLapkin WaffleLapkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me too, r=WaffleLapkin,Nilstrieb once the tree feels better.

@@ -56,16 +56,14 @@ where
}
Err(err) if err.kind() == io::ErrorKind::NotFound => (),
Err(err) => {
sess.dcx().emit_err(errors::DeleteOld { name, path: path_buf, err });
return;
sess.dcx().emit_fatal(errors::DeleteOld { name, path: path_buf, err });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can drop the ; here, but it does not matter. (I prefer using expressions instead of blocks where possible, but yeah, not important...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking about this as I made the change

@WaffleLapkin WaffleLapkin added S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 2, 2024
@WaffleLapkin
Copy link
Member

Marking as S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. , since this awaits neither an action from the author or the reviewer

@Nilstrieb
Copy link
Member

Hm, it's S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. for someone to approve it, and then it will be S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. . I see no reason to wait for approving until the tree is open, it can sit in the bors queue just fine.

@saethlin
Copy link
Member Author

saethlin commented Jan 2, 2024

@bors r=WaffleLapkin,Nilstrieb rollup

@bors
Copy link
Contributor

bors commented Jan 2, 2024

📌 Commit 94c43cc has been approved by WaffleLapkin,Nilstrieb

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Jan 2, 2024

🌲 The tree is currently closed for pull requests below priority 100. This pull request will be tested once the tree is reopened.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. labels Jan 2, 2024
fmease added a commit to fmease/rust that referenced this pull request Jan 3, 2024
…Lapkin,Nilstrieb

Report I/O errors from rmeta encoding with emit_fatal

rust-lang#119456 reminded me that I never did systematic testing to provoke the out-of-disk ICEs so I grepped through a recent crater run (rust-lang#119440 (comment)) for more out-of-disk ICEs on current master and yep there's 2 in there.

So I finally cooked up a way to provoke for these crashes. I wrote a little `cdylib` crate that has a `#[no_mangle] pub extern "C" fn write` which occasionally reports `ENOSPC`, and prints a backtrace when it does.
<details><summary><strong>code for the dylib</strong></summary>
<p>

```rust
// cargo add libc rand backtrace
use rand::Rng;

#[no_mangle]
pub extern "C" fn write(
    fd: libc::c_int,
    buf: *const libc::c_void,
    count: libc::size_t,
) -> libc::ssize_t {
    if fd > 2 && rand::thread_rng().gen::<u8>() == 0 {
        let mut count = 0;
        backtrace::trace(|frame| {
            backtrace::resolve_frame(frame, |symbol| {
                if let Some(name) = symbol.name() {
                    if count > 3 {
                        eprintln!("{}", name);
                    }
                }
                count += 1;
            });
            true
        });

        unsafe {
            *libc::__errno_location() = libc::ENOSPC;
        }
        return -1;
    } else {
        unsafe {
            let res =
                libc::syscall(libc::SYS_write, fd as usize, buf as usize, count as usize) as isize;
            if res < 0 {
                *libc::__errno_location() = -res as i32;
                -1
            } else {
                res
            }
        }
    }
}
```

</p>
</details>

Then `LD_PRELOAD` that dylib and repeatedly build a big project until it ICEs, such as with this:
```bash
while true; do
    cargo clean
    LD_PRELOAD=/home/ben/evil/target/release/libevil.so cargo +stage1 check 2> errors
    if grep "thread 'rustc' panicked" errors; then
        break
    fi
done
```
My "big project" for testing was an otherwise-empty project with `cargo add axum`.

Before this PR, the above procedure finds a crash in between 1 and 15 minutes. With this PR, I have not found a crash in 30 minutes, and I'll be leaving this to run overnight (starting now). (A night has now passed, no crashes were found)

I believe the problem is that even though since rust-lang#117301 we correctly check `FileEncoder` for errors on all paths, we use `emit_err`, so there is a window of time between the call to `emit_err` and the full error reporting where rustc believes it has emitted a valid rmeta file and will permit Cargo to launch a build for a dependent crate. Changing these calls to `emit_fatal` closes that window.

I think there are a number of other cases where `emit_err` has been used instead of the more-correct `emit_fatal` such as https://github.com/rust-lang/rust/blob/e51e98dde6a60637b6a71b8105245b629ac3fe77/compiler/rustc_codegen_ssa/src/back/write.rs#L542 but unlike rmeta encoding I am not aware of those cases of those causing problems.

r? `@WaffleLapkin`
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 3, 2024
Rollup of 21 pull requests

Successful merges:

 - rust-lang#119086 (Query panic!() to useful diagnostic)
 - rust-lang#119239 (Remove unnecessary arm in `check_expr_yield`)
 - rust-lang#119298 (suppress change-tracker warnings in CI containers)
 - rust-lang#119319 (Document that File does not buffer reads/writes)
 - rust-lang#119434 (rc: Take *const T in is_dangling)
 - rust-lang#119444 (Rename `TyCtxt::is_closure` to `TyCtxt::is_closure_or_coroutine`)
 - rust-lang#119474 (Update tracking issue of naked_functions)
 - rust-lang#119476 (Pretty-print always-const trait predicates correctly)
 - rust-lang#119477 (rustdoc ui: adjust tooltip z-index to be above sidebar)
 - rust-lang#119479 (Remove two unused feature gates from rustc_query_impl)
 - rust-lang#119487 (Minor improvements in comment on `freshen.rs`)
 - rust-lang#119492 (Update books)
 - rust-lang#119494 (Deny defaults for higher-ranked generic parameters)
 - rust-lang#119498 (Update deadlinks of `strict_provenance` lints)
 - rust-lang#119505 (Don't synthesize host effect params for trait associated functions marked const)
 - rust-lang#119510 (Report I/O errors from rmeta encoding with emit_fatal)
 - rust-lang#119512 (Mark myself as back from leave)
 - rust-lang#119514 (coverage: Avoid a query stability hazard in `function_coverage_map`)
 - rust-lang#119523 (llvm: Allow `noundef` in codegen tests)
 - rust-lang#119534 (Update `thread_local` examples to use `local_key_cell_methods`)
 - rust-lang#119544 (Fix: Properly set vendor in i686-win7-windows-msvc target)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit a34754e into rust-lang:master Jan 3, 2024
11 checks passed
@rustbot rustbot added this to the 1.77.0 milestone Jan 3, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 3, 2024
Rollup merge of rust-lang#119510 - saethlin:fatal-io-errors, r=WaffleLapkin,Nilstrieb

Report I/O errors from rmeta encoding with emit_fatal

rust-lang#119456 reminded me that I never did systematic testing to provoke the out-of-disk ICEs so I grepped through a recent crater run (rust-lang#119440 (comment)) for more out-of-disk ICEs on current master and yep there's 2 in there.

So I finally cooked up a way to provoke for these crashes. I wrote a little `cdylib` crate that has a `#[no_mangle] pub extern "C" fn write` which occasionally reports `ENOSPC`, and prints a backtrace when it does.
<details><summary><strong>code for the dylib</strong></summary>
<p>

```rust
// cargo add libc rand backtrace
use rand::Rng;

#[no_mangle]
pub extern "C" fn write(
    fd: libc::c_int,
    buf: *const libc::c_void,
    count: libc::size_t,
) -> libc::ssize_t {
    if fd > 2 && rand::thread_rng().gen::<u8>() == 0 {
        let mut count = 0;
        backtrace::trace(|frame| {
            backtrace::resolve_frame(frame, |symbol| {
                if let Some(name) = symbol.name() {
                    if count > 3 {
                        eprintln!("{}", name);
                    }
                }
                count += 1;
            });
            true
        });

        unsafe {
            *libc::__errno_location() = libc::ENOSPC;
        }
        return -1;
    } else {
        unsafe {
            let res =
                libc::syscall(libc::SYS_write, fd as usize, buf as usize, count as usize) as isize;
            if res < 0 {
                *libc::__errno_location() = -res as i32;
                -1
            } else {
                res
            }
        }
    }
}
```

</p>
</details>

Then `LD_PRELOAD` that dylib and repeatedly build a big project until it ICEs, such as with this:
```bash
while true; do
    cargo clean
    LD_PRELOAD=/home/ben/evil/target/release/libevil.so cargo +stage1 check 2> errors
    if grep "thread 'rustc' panicked" errors; then
        break
    fi
done
```
My "big project" for testing was an otherwise-empty project with `cargo add axum`.

Before this PR, the above procedure finds a crash in between 1 and 15 minutes. With this PR, I have not found a crash in 30 minutes, and I'll be leaving this to run overnight (starting now). (A night has now passed, no crashes were found)

I believe the problem is that even though since rust-lang#117301 we correctly check `FileEncoder` for errors on all paths, we use `emit_err`, so there is a window of time between the call to `emit_err` and the full error reporting where rustc believes it has emitted a valid rmeta file and will permit Cargo to launch a build for a dependent crate. Changing these calls to `emit_fatal` closes that window.

I think there are a number of other cases where `emit_err` has been used instead of the more-correct `emit_fatal` such as https://github.com/rust-lang/rust/blob/e51e98dde6a60637b6a71b8105245b629ac3fe77/compiler/rustc_codegen_ssa/src/back/write.rs#L542 but unlike rmeta encoding I am not aware of those cases of those causing problems.

r? ``@WaffleLapkin``
@saethlin saethlin deleted the fatal-io-errors branch January 15, 2024 21:15
workingjubilee added a commit to workingjubilee/rustc that referenced this pull request Mar 12, 2024
…etrochenkov

Detect truncated DepGraph files

I suspect that the following issues are caused by truncated incr comp files:

* rust-lang#120582
* rust-lang#121499
* rust-lang#122210

We fail with an allocation failure or capacity overflow in this case because we assume that the ending bytes of an DepGraph file are the lengths of arrays. If the file has somehow been truncated then the ending bytes are probably some of our varint encoding, which tries to eliminate zero bytes, so interpreting a random 8 bytes as an array length has a very high chance of producing a byte capacity over `isize::MAX`.

Now theoretically since rust-lang#119510 merged I have fixed the out-of-disk issues and yet in rust-lang#120894 (comment) I still see some decoding failures that look like out-of-disk ICEs, for example https://crater-reports.s3.amazonaws.com/beta-1.77-1/beta-2024-02-10/gh/scottfones.aoc_2022/log.txt

So this PR should ensure that we get an ICE that clearly identifies if the file in question is truncated.
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 12, 2024
Rollup merge of rust-lang#122245 - saethlin:check-dep-graph-size, r=petrochenkov

Detect truncated DepGraph files

I suspect that the following issues are caused by truncated incr comp files:

* rust-lang#120582
* rust-lang#121499
* rust-lang#122210

We fail with an allocation failure or capacity overflow in this case because we assume that the ending bytes of an DepGraph file are the lengths of arrays. If the file has somehow been truncated then the ending bytes are probably some of our varint encoding, which tries to eliminate zero bytes, so interpreting a random 8 bytes as an array length has a very high chance of producing a byte capacity over `isize::MAX`.

Now theoretically since rust-lang#119510 merged I have fixed the out-of-disk issues and yet in rust-lang#120894 (comment) I still see some decoding failures that look like out-of-disk ICEs, for example https://crater-reports.s3.amazonaws.com/beta-1.77-1/beta-2024-02-10/gh/scottfones.aoc_2022/log.txt

So this PR should ensure that we get an ICE that clearly identifies if the file in question is truncated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

range start index 351645 out of range for slice of length 16384 getting the resolver for lowering
5 participants