
Async write_all sometimes silently fails to write to file #4296

Closed
igorsol opened this issue Dec 3, 2021 · 13 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs

Comments

@igorsol

igorsol commented Dec 3, 2021

Version
tokio v1.14.0
tokio-macros v1.6.0 (proc-macro)

Platform
Linux 4.15.0-163-generic #171~16.04.1-Ubuntu SMP Wed Nov 10 11:19:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description
While using write_all to write data to a file, I noticed that in some rare cases it does not write anything to the file, despite awaiting the corresponding future.

To reproduce the issue I took the code from the write_all documentation:

use tokio::io::{self, AsyncWriteExt};
use tokio::fs::File;

#[tokio::main]
async fn main() -> io::Result<()> {
    let mut file = File::create("foo.txt").await?;

    file.write_all(b"some bytes").await?;
    //file.flush().await?;
    Ok(())
}

This code should create a foo.txt file in the current directory. To test the binary executable I created the following bash script, which calls the test binary in a loop and checks whether a non-zero-length foo.txt was created:

i=0

ctrl_c() {
    echo
    echo "Ctrl-c captured on iteration $i"
    exit 0
}

trap 'ctrl_c' SIGINT

while true; do
    ((++i))
    target/debug/walltest
    if ! [ -s foo.txt ]; then
        echo "failure occurred on iteration $i"
        exit 1
    fi
    rm foo.txt
done

I expected to see this happen:

the bash script runs indefinitely until Ctrl-C is pressed

Instead, this happened:

the script stops in less than a second, reporting that a zero-length foo.txt was created. The number of successful iterations varies from fewer than 10 to a few hundred.

Additional information:

  • If I uncomment the file.flush call, then the file is always written as expected.
  • Identical code executed under the async-std runtime works as expected without file.flush.
@igorsol igorsol added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Dec 3, 2021
@Darksonn Darksonn added the M-fs Module: tokio/fs label Dec 4, 2021
@Darksonn
Contributor

Darksonn commented Dec 4, 2021

This is because file operations are offloaded to the spawn_blocking pool, and you need the flush to wait for them to complete. Normally the runtime will always wait for spawn_blocking tasks when shutting down, but it actually only waits if the spawn_blocking task has started before the shutdown happens. So your operation fails in the cases where the runtime shutdown happens before the spawn_blocking task starts.

Note that the above implies that the data can only be lost if the write happens just before runtime shutdown. Otherwise it will complete normally even if the file is dropped before the write completes.

It seems like async-std's approach is the following:

impl Drop for File {
    fn drop(&mut self) {
        // We need to flush the file on drop. Unfortunately, that is not possible to do in a
        // non-blocking fashion, but our only other option here is losing data remaining in the
        // write cache. Good task schedulers should be resilient to occasional blocking hiccups in
        // file destructors so we don't expect this to be a common problem in practice.
        let _ = futures_lite::future::block_on(self.flush());
    }
}

However I would prefer to avoid this solution.

If we want to solve this, I would find it better to have the runtime somehow always wait for the file operation before shutting down.

@krtab

krtab commented Dec 4, 2021

Hi!

This is because file operations are offloaded to the spawn_blocking pool, and you need the flush to wait for it to complete.

I don't understand why this guarantee is not/cannot be upheld by awaiting poll_write. Before diving into this issue (and the associated thread on URLO), my mental model was:

  1. In an async context, I'm calling (directly or indirectly) poll_write. This returns a Future which will be Ready once the bytes have actually been passed on to the OS for writing¹.
  2. Upon the first polling of this Future, a "task" (I'm unsure about the proper terminology here) is started on the blocking thread pool. This task will use the blocking OS primitives to write the bytes and, once this is done, wake the future by calling its waker's method.
  3. When polled afterwards, the Future checks if some internal state has been changed to "n bytes have been successfully written" and if so returns Ready(Ok(n)).

But my understanding is that it is rather:

  1. Upon polling, the Future returns Ready(Ok(n)) if n bytes have successfully been written to some internal buffer shared with the blocking task.

Is my first mental model impractical to carry out, or is this a deliberate choice? In any case, I think @igorsol had a similar confusion, so there may be some way we can improve the documentation.


¹ What I mean by "passed on to the OS for writing" is akin to <std::fs::File as std::io::Write>::write having returned.

@Darksonn
Contributor

Darksonn commented Dec 4, 2021

Unfortunately, having poll_write wait until the write actually completes is incompatible with the contract for AsyncWrite. If you return Poll::Pending, then the user may supply a different buffer next time they call poll_write. This becomes visible in the cancellation safety section for AsyncWriteExt::write, which says the following:

This method is cancellation safe in the sense that if it is used as the event in a tokio::select! statement and some other branch completes first, then it is guaranteed that no data was written to this AsyncWrite.

If the write call returned Poll::Pending, then some other branch in a select! could complete first, but the data would still be written, in violation of the above.

The behavior is conceptually similar to writing to a buffered writer.

@igorsol
Author

igorsol commented Dec 6, 2021

Just some additional thoughts:

  1. So write_all() reports that its future is complete when the actual write has not completed yet. If write_all() is a write operation, then its future simply lies about the operation's completion. If write_all() is not a write operation, then what is it? Maybe it should be documented as some kind of scheduling operation?
  2. write_all() returns io::Result<()>, but if the write operation has not completed, then it cannot return the result of the actual writes. Does that mean that if there is a real IO error it will be lost?
  3. I don't understand how file operations in the spawn_blocking pool can execute after File's destructor. Is the file handle duplicated, or stored behind some reference-counting smart pointer? Sorry, I didn't look into the tokio sources. The fact that the file handle lives longer than the File instance is also very confusing.

@Darksonn
Contributor

Darksonn commented Dec 6, 2021

The current semantics of write (which write_all inherits for its last call to write) are that if a write has already been started, then it waits until that write completes, then starts a new write and immediately returns. IO errors are returned on the next call to write or flush. And yes, the file descriptor is reference counted between the File object and the background task.

The documentation already contains the following snippet:

A file will not be closed immediately when it goes out of scope if there are any IO operations that have not yet completed. To ensure that a file is closed immediately when it is dropped, you should call flush before dropping it.

Of course, documentation can always be improved.

@malaire

malaire commented Dec 6, 2021

... starts a new write and immediately returns. IO errors are returned on the next call to write or flush ...

This at the very least should be documented, as it's completely unexpected behavior, at least for me.

So errors returned by write are NEVER about the write which the user attempted, but about a PREVIOUS write. Totally unexpected.

@BraulioVM
Contributor

From @Darksonn's comment above

If we want to solve this, I would find it better to have the runtime somehow always wait for the file operation before shutting down.

Is this something you think should be solved? Would it make sense for me to look into writing a PR along those lines? Or is it likely that the change wouldn't be accepted?

@Darksonn
Contributor

If we can find a good solution, then I am OK with making the change, but it is probably best to discuss the details of the implementation idea with me before you spend a lot of time implementing it.

@BraulioVM
Contributor

Cool, I'll have a look, and if I find a way it could be solved, I'll draft a proposal to see what you think.

@iitalics

I ran into this before. I think the main point of confusion is how std::fs::File immediately flushes on drop, so { let f = File::create(); f.write_all(...)?; } just works; then you port this line by line to tokio and it doesn't anymore. Yet the write_all method behaves the same in both APIs, technically speaking. It's the invisible flush that is different.

I don't really think the API should be changed, but perhaps it should be documented loudly that this is something you may need to handle.

@malaire

malaire commented Dec 12, 2021

I ran into this before. I think the main point of confusion is how std::fs::File immediately flushes on drop, so { let f = File::create(); f.write_all(...)?; } just works; then you port this line by line to tokio and it doesn't anymore. Yet the write_all method behaves the same in both APIs, technically speaking. It's the invisible flush that is different.

The behaviour is definitely NOT the same. The standard write_all is documented to not return "until the entire buffer has been successfully written or such an error occurs", while the last write of tokio's write_all doesn't even bother checking for errors; it just starts a new write and immediately returns.

@Darksonn
Contributor

I agree that the behavior is not the same, but regarding the standard write_all: if you use a BufWriter<File>, then you get very similar behavior with the std write_all, and it also will not always catch errors until you flush, so this is not unique to Tokio's IO traits.

@BraulioVM
Contributor

This issue is fixed now!
