Description
The original issue is chatmail/core#2032.
I have prepared a minimal example demonstrating the bug, which depends only on async-std 1.6.5.
An example which you can unpack and run with `cargo run`: filedrop.tar.gz
The source code for reference:
use async_std::prelude::*;

async fn create_file() {
    let mut file = async_std::fs::OpenOptions::new()
        .create(true)
        .write(true)
        .open("foo.txt")
        .await
        .unwrap();
    file.write_all(b"foobarbaz").await.unwrap();
    //file.flush().await.unwrap();
    eprintln!("before drop");
}

async fn test() {
    let tsk = async_std::task::spawn(async move {
        create_file().await;
        eprintln!("after drop");
    });
    tsk.await;
}

fn main() {
    async_std::task::block_on(test());
}
I built this example in the following configuration:
- Debian sid as a host machine.
- NetBSD 9.1 as a guest, installed in qemu (kvm)
- Rust 1.47.0 installed on NetBSD via rustup.
When I execute `cargo run`, the program prints `before drop` and gets stuck. Apparently the problem is that, despite what the comment says, the executor does not handle the blocking operation in `Drop` correctly:
Line 314 in 11196c8
After waiting a minute and terminating the program, the file `foo.txt` is created, but is empty.
When I uncomment the line `file.flush().await.unwrap();`, the program prints
before drop
after drop
and exits.
On my Linux machine this works correctly. It is not a QEMU bug, as the same problem is experienced on some Android phones; see the original issue chatmail/core#2032.
Activity
link2xt commented on Oct 22, 2020
Collected the backtraces of the stuck process:
link2xt commented on Oct 22, 2020
Sidenote: there is a third alternative to dropping unwritten data and flushing it in a blocking way. It's possible to spawn a new task that will do the flushing sometime later. If the user did not care to flush the file, the flush can be postponed.
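The "flush later" alternative can be sketched with the standard library alone. This is only an illustration under stated assumptions, not async-std code: `DeferredFile`, `start_flusher` and `demo` are hypothetical names, and a plain background thread stands in for the "new task" that would do the flushing.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical buffered file: writes only fill `cache`.
struct DeferredFile {
    file: Option<File>,
    cache: Vec<u8>,
    flusher: Sender<(File, Vec<u8>)>,
}

impl DeferredFile {
    fn write(&mut self, data: &[u8]) {
        self.cache.extend_from_slice(data);
    }
}

impl Drop for DeferredFile {
    fn drop(&mut self) {
        // Do not write here: hand the file and the pending bytes to the flusher.
        if let Some(file) = self.file.take() {
            let cache = std::mem::take(&mut self.cache);
            // If the flusher is already gone, the data is lost in this sketch.
            let _ = self.flusher.send((file, cache));
        }
    }
}

/// Starts a background thread that performs the postponed writes.
fn start_flusher() -> (Sender<(File, Vec<u8>)>, JoinHandle<()>) {
    let (tx, rx) = channel::<(File, Vec<u8>)>();
    let handle = thread::spawn(move || {
        for (mut file, cache) in rx {
            // Nobody is left to handle an error, so it is only reported.
            if let Err(e) = file.write_all(&cache) {
                eprintln!("deferred flush failed: {}", e);
            }
        }
    });
    (tx, handle)
}

/// Writes without flushing, drops the file, then waits for the flusher.
fn demo(path: &Path) -> Vec<u8> {
    let (tx, handle) = start_flusher();
    {
        let mut f = DeferredFile {
            file: Some(File::create(path).unwrap()),
            cache: Vec::new(),
            flusher: tx.clone(),
        };
        f.write(b"foobarbaz");
    } // `f` is dropped here without an explicit flush
    drop(tx); // close the channel so the flusher thread exits
    handle.join().unwrap();
    fs::read(path).unwrap()
}

fn main() {
    let path = std::env::temp_dir().join("deferred_flush_demo.txt");
    let bytes = demo(&path);
    println!("{}", String::from_utf8_lossy(&bytes));
    let _ = fs::remove_file(&path);
}
```

The trade-off is visible in the sketch: the drop never blocks, but write errors can no longer reach the caller, and data is lost if the process exits before the flusher runs.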
link2xt commented on Oct 23, 2020
Maybe this can be reproduced more reliably and converted into a test? Is it possible to cause `File` to become unflushed in a deterministic way? This is probably caused by a slow qemu device built on top of an HDD and by an old Android SD card.
dignifiedquire commented on Oct 23, 2020
We likely need to stop flushing automatically in the drop impl, and add to the docs that flushing is required to be manual. This is what `tokio::fs::File` does, and likely the only way to be sure it works, as long as `AsyncDrop` is not yet there.
link2xt commented on Oct 23, 2020
It's possible to remove `Drop`, but isn't it a workaround? What is the reason for the panic in the first thread? What if I implement a `File` wrapper that does `flush` in `drop`, why shouldn't it work?
link2xt commented on Oct 23, 2020
Here is a way to create an unflushed file:
async-std/src/fs/file.rs
Line 411 in 11196c8
I'll try to use it and see what happens.
link2xt commented on Oct 23, 2020
This code does not fail on Linux:
But maybe that is because the write operation never actually blocks.
link2xt commented on Oct 26, 2020
Reproduced on Alpine Linux under qemu, running on top of a qcow2 disk backed by an HDD. Dumped the core, copied it to the host system and loaded it in gdb. On the host system the same binary successfully writes the file, by the way. So it is reproducible both with kqueue and epoll, on NetBSD and Alpine Linux running in KVM and on real Android.
Stacktraces with epoll, produced on Debian from the binary and core file downloaded from Alpine
`strace -f` on Alpine Linux (program deadlocks)
`strace -f` on host Debian GNU/Linux (program finishes)
ghost commented on Oct 26, 2020
Can you perhaps try replacing `async_std::fs::File` with `async_fs::File` and see if the deadlock goes away? https://docs.rs/async-fs/1.5.0/async_fs/struct.File.html
Let's see if there's a problem in the async file implementation or somewhere else.
link2xt commented on Oct 26, 2020
With `async_fs::OpenOptions` it works on Alpine (in qemu).
link2xt commented on Oct 26, 2020
`strace -f` on Alpine under qemu with `async_fs::OpenOptions`
link2xt commented on Oct 26, 2020
But `async_fs::OpenOptions` doesn't implement `Drop` at all, so it just doesn't trigger the bug. I'm not sure what this experiment shows.
Also, with `async_fs` the file is empty: there is no `write("foobarbaz")` syscall in the `strace` output, the file is simply never written.
link2xt commented on Oct 26, 2020
On Debian, the file is written even with `async_fs::OpenOptions`.
link2xt commented on Aug 1, 2021
Probably a related problem: `cargo` gets stuck compiling a package on a GitLab runner running under QEMU: https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/23790
harmic commented on Jun 27, 2022
I had a case of this today. It occurred in a VM running under Virtualbox, running Rocky Linux 8.
The application I was testing is quite complex, the relevant bit has async tasks which are reading from TCP sockets, processing what they read, and writing the results to a file.
The VM was configured with two CPUs, so there were 2 async runtime threads. Each time I observed it, both runtime threads were stuck trying to drop `async_std::fs::File` instances. That seemed a little strange, although I can't be sure there were no cases where only one of the threads got stuck - obviously it was more noticeable when they both got stuck because the whole application grinds to a halt.
Adding a `flush()` call just before dropping the file does seem to have fixed it. Interestingly, the test program above did not trigger a problem for me.
There definitely needs to be some documentation about this - I wasted a huge amount of time trying to debug this before stumbling across this ticket.
link2xt commented on Jun 27, 2022
I think the problem is in the `Drop` implementation calling async flush. `poll_flush` then tries to spawn another task and wait for its completion, calling `poll_drain` in turn and so on, and this is likely where we deadlock. Instead of calling the async flush linked here, `flush()` should be implemented using only blocking operations: https://github.com/async-rs/async-std/blob/11196c853dc42e86d608c4ed29af1a8f0c5c5084/src/fs/file.rs#L841-L861=
In other words, just `self.file.write_all(self.cache)` (add `&` and `*` where needed) directly and do nothing else.
harmic commented on Jun 29, 2022
This makes perfect sense.
You would need to have at least N+1 runtime threads in order to be able to drop N `File`s simultaneously - which explains why most cases of this have been seen in QEMU or VMs, since they would be likely to have few CPUs.
One thing I do not understand though: when I configure my VM with 1 CPU, or instruct async_std to create one runtime thread via `ASYNC_GLOBAL_EXECUTOR_THREADS`, the reproduction program at the top of this ticket does not trigger a hang for me.
link2xt commented on Jul 17, 2022
I made a fix #1033 but have not really tested it.
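For reference, the blocking-only `Drop` proposed earlier in the thread could look roughly like the following std-only sketch. It is not the patch from #1033, whose contents are not shown here; `CachedFile` and `write_and_drop` are hypothetical stand-ins for a file type with a write cache.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for a file with a write cache.
struct CachedFile {
    file: File,
    cache: Vec<u8>,
}

impl CachedFile {
    fn create(path: &Path) -> io::Result<Self> {
        Ok(CachedFile {
            file: File::create(path)?,
            cache: Vec::new(),
        })
    }

    /// Buffers the data; nothing reaches the OS until the value is dropped.
    fn write(&mut self, data: &[u8]) {
        self.cache.extend_from_slice(data);
    }
}

impl Drop for CachedFile {
    fn drop(&mut self) {
        // Plain blocking write on the current thread: no task is spawned and
        // nothing is awaited, so the drop never waits on another thread.
        // `drop` cannot return an error, so a failure is ignored here.
        let _ = self.file.write_all(&self.cache);
    }
}

/// Writes without an explicit flush, drops the file, and reads it back.
fn write_and_drop(path: &Path) -> io::Result<Vec<u8>> {
    {
        let mut f = CachedFile::create(path)?;
        f.write(b"foobarbaz");
    } // `f` is dropped here; the cache is written synchronously
    fs::read(path)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("blocking_drop_demo.txt");
    let bytes = write_and_drop(&path)?;
    println!("{}", String::from_utf8_lossy(&bytes));
    let _ = fs::remove_file(&path);
    Ok(())
}
```

The cost of this design is that the drop may block the calling thread for the duration of the write, and any write error is silently discarded.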