
Proposing new `AsyncRead` / `AsyncWrite` traits #1744

Open · wants to merge 1 commit into base: master

Conversation

@seanmonstar (Member) commented Nov 6, 2019

Introduce new AsyncRead / AsyncWrite

This PR introduces new versions of the AsyncRead / AsyncWrite traits.
The proposed changes aim to improve:

  • ergonomics
  • integration of vectored operations
  • working with uninitialized byte slices

Overview


The PR changes the AsyncRead and AsyncWrite traits to accept T: Buf and T: BufMut values instead of &[u8] and &mut [u8]. Because
&[u8] implements Buf and &mut [u8] implements BufMut, the same
calling patterns used today are still possible. Additionally, any type
that implements Buf and BufMut may be used. This includes
Cursor<&[u8]>, Bytes, ...
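
For illustration, here is a rough sketch of the shape this gives the
traits (signatures assumed from this description and the later
discussion of &mut dyn Buf; the actual diff may differ in details):

use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::{Buf, BufMut};

pub trait AsyncRead {
    // The buffer carries its own cursor: the implementation writes into
    // `buf`, advances it, and reports how many bytes were read.
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut dyn BufMut,
    ) -> Poll<io::Result<usize>>;
}

pub trait AsyncWrite {
    // Likewise, `buf` is advanced past whatever was written.
    fn poll_write(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut dyn Buf,
    ) -> Poll<io::Result<usize>>;
}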

Improvement in ergonomics


Calls to read and write accept buffers, but do not necessarily use up
the entirety of the buffer. Both functions return a usize representing
the number of bytes read / written. Because of this, it is common to
write loops such as:

let mut rem = &my_data[..];

while !rem.is_empty() {
    let n = my_socket.write(rem).await?;
    rem = &rem[n..];
}


The key point to notice is having to use the return value to update the
position in the cursor. This is both common and error-prone. The Buf /
BufMut traits aim to ease this by building the cursor concept directly
into the buffer. By using these traits with AsyncRead / AsyncWrite,
the above loop can be simplified to:

let mut buf = Cursor::new(&my_data[..]);

while buf.has_remaining() {
    my_socket.write(&mut buf).await?;
}


A small reduction in code, but it removes an error-prone bit of logic
that must often be repeated.

Integration of vectored operations


In the AsyncRead / AsyncWrite traits provided by futures-io,
vectored operations are covered using separate fns: poll_read_vectored
and poll_write_vectored. These two functions have default
implementations that call the non-vectored operations.

This has a drawback: when implementing AsyncRead / AsyncWrite,
usually as a layer on top of a type such as TcpStream, the implementor
must not forget to implement these two additional functions. Otherwise,
the implementation will not be able to use vectored operations even if
the underlying TcpStream supports them. Secondly, it requires duplication
of logic: one poll_read implementation and one poll_read_vectored
implementation. It is possible to implement one in terms of the other,
but this can result in sub-optimal implementations.

Imagine a situation where a rope
data structure is being written to a socket. This structure is comprised
of many smaller byte slices (perhaps thousands). To write it efficiently
to the socket, avoiding copying data is preferable. To do this, the byte
slices need to be loaded in an IoSlice. Since modern Linux systems
support a max of 1024 slices, we initialize an array of 1024 slices,
iterate the rope to populate this array, and call poll_write_vectored.
The problem is that, as the caller, we don't know whether the AsyncWrite
type supports vectored operations, so poll_write_vectored is called
optimistically. However, the implementation "forgot" to proxy its
function to TcpStream, so poll_write is called with only the first entry
in the IoSlice. The result is that, for each call to poll_write_vectored,
we must iterate 1024 nodes in our rope only to have one chunk written at
a time.

By using T: Buf as the argument, the decision of whether or not to use
vectored operations is left up to the leaf AsyncWrite type.
Intermediate layers only implement poll_write with T: Buf and pass it
along to the inner stream. The TcpStream implementation will know that
it supports vectored operations, know how many slices it can write at
a time, and do "the right thing".
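
As a sketch of that layering (using the assumed signatures from the
overview sketch above), an intermediate wrapper forwards the single
poll_write and never has to remember a separate vectored variant:

use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Buf;

// Hypothetical pass-through layer over any AsyncWrite. The leaf below
// it decides whether the write becomes a vectored syscall.
struct Logging<T> {
    inner: T,
}

impl<T: AsyncWrite + Unpin> AsyncWrite for Logging<T> {
    fn poll_write(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut dyn Buf,
    ) -> Poll<io::Result<usize>> {
        let this = self.get_mut(); // Logging<T>: Unpin because T: Unpin
        println!("writing up to {} bytes", buf.remaining());
        Pin::new(&mut this.inner).poll_write(cx, buf)
    }
}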

Working with uninitialized byte slices


When passing buffers to AsyncRead, it is desirable to pass in
uninitialized memory which the poll_read call will write to. This
avoids the expensive step of zeroing out the memory (doing so has a
measurable impact at the macro level). The problem is that uninitialized
memory is "unsafe"; as such, care must be taken.

Tokio initially attempted to handle this by adding a
prepare_uninitialized_buffer
function. std is investigating adding a
similar,
though improved, variant of this API. However, over the years, we have
learned that the prepare_uninitialized_buffer API is sub-optimal for
multiple reasons.

First, the same problem applies as with vectored operations. If an
implementation "forgets" to implement prepare_uninitialized_buffer,
then all slices must be zeroed out before being passed to poll_read,
even if the implementation does "the right thing" (does not read from
uninitialized memory). In practice, most implementors end up forgetting
to implement this function, resulting in memory being zeroed out.

Secondly, implementations of AsyncRead that should not require
unsafe to implement now must add unsafe simply to avoid having
memory zeroed out.

Switching the argument to T: BufMut solves this problem via the
BufMut trait. First, BufMut provides low-level functions that return
&mut [MaybeUninit<u8>]. Second, it provides utility functions with
safe APIs for writing to the buffer (put_slice, put_u8, ...). Again,
only the leaf AsyncRead implementations (TcpStream) must use the
unsafe APIs. All layers may take advantage of uninitialized memory
without the associated unsafety.
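
For example, a framing layer can stay entirely in safe code (a sketch;
the frame format and the encode_frame function are made up for
illustration):

use bytes::BufMut;

// Writes a hypothetical length-prefixed frame. put_u8 / put_slice copy
// into the buffer's (possibly uninitialized) tail and advance the
// cursor, so no unsafe code is needed at this layer.
fn encode_frame(dst: &mut dyn BufMut, payload: &[u8]) {
    assert!(payload.len() <= 255, "single-byte length prefix");
    dst.put_u8(payload.len() as u8);
    dst.put_slice(payload);
}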

Drawbacks


The primary drawback is genericizing the AsyncRead and AsyncWrite
traits. This adds complexity. We feel that the added benefits discussed
above outweigh the drawbacks, but only trying it out will validate it.

Relation to futures-io, std, and roadmap


The relationship between tokio's I/O traits and futures-io has come
up a few times in the past. Tokio has historically maintained its own
traits. futures-io has just been released with a simplified version of
the traits. There is also talk of standardizing AsyncRead and
AsyncWrite in std. Because of this, we believe that now is the
perfect time to experiment with these traits. This will allow us to gain
more experience before committing to traits in std.

The next version of Tokio will not be 1.0. This allows us to
experiment with these traits and remove them for 1.0 if they do not pan
out.

Once AsyncRead / AsyncWrite are added to std, Tokio will provide
implementations for its types. Until then, tokio-util will provide a
compatibility layer between Tokio types and futures-io.

This replaces the `[u8]` byte slice arguments to `AsyncRead` and
`AsyncWrite` with `dyn Buf` and `dyn BufMut` trait objects.
seanmonstar requested review from carllerche and tokio-rs/maintainers Nov 6, 2019
let res = ready!(self.as_mut().get_pin_mut().poll_read(cx, buf));
self.discard_buffer();
return Poll::Ready(res);
}
unimplemented!()

@hawkw (Member) Nov 6, 2019

Was the intention to finish this before merging?

@seanmonstar (Author, Member) Nov 6, 2019

Uh, ahem, yes. Nothing to see here.

(I commented some things out when I started the branch to chip away at the insane number of compiler errors, and then may have forgotten where I left some of these...)

@hawkw (Member) commented Nov 6, 2019

@seanmonstar from the PR description:

The primary drawback is genericizing the AsyncRead and AsyncWrite
traits. This adds complexity. We feel that the added benefits discussed
above outweigh the drawbacks, but only trying it out will validate it.

It looks like the AsyncRead/AsyncWrite traits in this PR take dyn Buf/dyn BufMut trait object references? Is this comment still accurate?

@Ralith (Contributor) commented Nov 6, 2019

Initial thoughts from a quick skim:

  • Since the cursor is now handled automatically, should the usize return be removed?
  • Can we take this opportunity to replace io::Error with an associated type, in the same spirit of trying things while we still can?

Edit: Oh, and I really like this idea overall.

@seanmonstar (Member, Author) commented Nov 6, 2019

Since the cursor is now handled automatically, should the usize return be removed?

Good point, I'd thought about that as well. I'm in favor of changing the return types of poll_read and poll_write to io::Result<()> (from io::Result<usize>) since it's part of {Buf, BufMut}.
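
A sketch of what that change might look like (my reading of the suggestion, not the PR diff):

use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::Buf;

pub trait AsyncWrite {
    // How much was written is observable as the change in
    // buf.remaining(), so the return value only carries success/failure.
    fn poll_write(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut dyn Buf,
    ) -> Poll<io::Result<()>>;
}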

Can we take this opportunity to replace io::Error with an associated type, in the same spirit of trying things while we still can?

I haven't been bothered by the return type myself, so it wasn't something that I personally wanted to experiment with, and I was scared away by the supposed dragons mentioned in the original std-io reform RFC. But I don't mean to scare away others from experimenting as well :)

@Ralith (Contributor) commented Nov 6, 2019

But I don't mean to scare away others from experimenting as well :)

Fair enough, I can draft a follow-up once this is in.

@sfackler (Contributor) commented Nov 7, 2019

WRT uninitialized buffers, I have a half-written thing discussing the path forward for std that I'm going to try to finish up this week: https://paper.dropbox.com/doc/IO-Buffer-Initialization--AoGf~cBSiAGi3mjGAJe9fW9pAQ-MvytTgjIOTNpJAS6Mvw38 - please add comments if you'd like!

It does seem to me like this is a nice and clean approach, but there's no way we can adjust Read in the same way. It may very well be the case that we shouldn't handicap AsyncRead to ensure it matches Read, but IMO it's something to think about (especially if we want to avoid huge amounts of churn when the AsyncRead/AsyncWrite traits land in std).

@carllerche carllerche changed the title Introduce new `AsyncRead` / `AsyncWrite` Proposing new `AsyncRead` / `AsyncWrite` traits Nov 7, 2019
@Ralith (Contributor) commented Nov 7, 2019

The downsides of generic methods seem to be mainly the additional complexity required to use a trait object (via secondary provided object-safe methods, or a secondary trait with a blanket impl). In exchange, they offer guaranteed static dispatch. I'm not sure either side is a clear win, particularly since virtual dispatch likely pales in comparison to the actual cost of doing I/O.

One case where the inlining afforded by monomorphization might be important is implementations of Buf that produce large numbers of small discontinuous slices. For example, imagine a serialization library that derives Buf implementations for PoD structs to be read as a series of slices, one field at a time. This feels a bit contrived, but I'm not sure it's beyond reason.

@seanmonstar (Member, Author) commented Nov 7, 2019

I can re-check the description later, but it might not be clear: this proposes passing &mut dyn Buf instead of generics, specifically to allow dyn AsyncRead and stuff to still work.

@Ralith (Contributor) commented Nov 7, 2019

Yeah, just giving some initial thoughts on that tradeoff in response to @carllerche's request.

specifically to allow dyn AsyncRead and stuff to still work.

That could still be accomplished with an extra trait.

@jonhoo (Contributor) commented Nov 7, 2019

@seanmonstar

I can re-check the description later, but it might not be clear: this proposes passing &mut dyn Buf instead of generics, specifically to allow dyn AsyncRead and stuff to still work.

Yeah, I think the original text

genericizing the AsyncRead and AsyncWrite traits

is what generated the confusion here, along with the various mentions of T: Buf and T: BufMut.

@Matthias247 commented Nov 7, 2019

I'm not sure either side is a clear win, particularly since virtual dispatch likely pales in comparison to the actual cost of doing I/O.
One case where the inlining afforded by monomorphization might be important is implementations of Buf that produce large numbers of small discontinuous slices.

I think there is something to this. I don't think virtual dispatches matter when reading from the real TCP socket. But often buffered readers are implemented on top of these - which are then used by deserialization libraries. When those read things in 1-8 byte chunks from the buffer the dispatch might have an impact.

But that should be benchmarkable. I guess there are libraries out there which are exactly doing this (h2?).

Is the ergonomics reason such a big thing in practice? I'm typically just doing write_all() on byte slices or slices of byte slices via extension traits - which also means I don't have to advance the cursor in the application code. If not using write_all, I typically care about the amount of written bytes.

By using T: Buf as the argument, the decision of whether or not to use
vectored operations is left up to the leaf AsyncWrite type.
Intermediate layers only implement poll_write w/ T: Buf and pass it
along to the inner stream.

Is nesting an AsyncWrite a big thing? From my point of view it is mostly a leaf type which is "hand-implemented". Even if not implemented by a TCP socket but by a TLS or HTTP library, those will be leaves in the sense that the writes go into their internal buffers and are not directly forwarded. They have to implement support for vectored operations anyway.

For application code I think interacting with those things via Future adapters will likely become the norm. Here people do not interact with the low-level AsyncRead/AsyncWrite methods anyway. E.g. we write things like:

async fn do_with_stream(stream: &mut (dyn AsyncWrite + Unpin)) -> io::Result<()> {
    let rem = &my_data[..];
    stream.write_all(rem).await
}
@carllerche (Member) commented Nov 7, 2019

@Matthias247

Is nesting an AsyncWrite a big thing?

Same as std... there are use cases (TLS, deflate / inflate, mocking, ...); there are protocol-specific impls, like HTTP bodies, websockets, etc...

In my personal experience, I definitely hit the issue where, in order to have the most efficient impl possible, I needed to implement both poll_read and poll_read_vectored separately to do different things, resulting in code dup. More often than not, I ended up just skipping the specialized poll_read_vectored option.

For me, the biggest factor is the "safety" aspect and uninitialized memory. Being required to zero out memory is suboptimal, and I don't think the current strategy for modeling safety & uninitialized memory is ideal. WDYT?

@carllerche (Member) commented Nov 7, 2019

Also, to note, taking &mut dyn Buf will prevent passing in &[u8] as an argument. If the argument type is impl Buf, then &[u8] can be passed in. Though, odds are users will not call poll_read directly, instead going via the async fn read(&mut self) -> io::Result<usize> path.

Object safety can also be achieved by doing something like (I forgot the exact incantation):

trait AsyncRead {
    // Ergonomic entry point; not callable on a trait object.
    fn poll_read(&mut self, mut dst: impl BufMut) -> Poll<io::Result<usize>>
    where
        Self: Sized,
    {
        self.poll_read_dyn(&mut dst)
    }

    #[doc(hidden)]
    fn poll_read_dyn(&mut self, dst: &mut dyn BufMut) -> Poll<io::Result<usize>>;
}

impl AsyncRead for Box<dyn AsyncRead> {
    fn poll_read_dyn(&mut self, dst: &mut dyn BufMut) -> Poll<io::Result<usize>> {
        (**self).poll_read_dyn(dst)
    }
}
@Matthias247 commented Nov 7, 2019

Same as std... there are use cases (TLS, deflate / inflate, mocking,...) there are protocol specific impls,
like HTTP bodies, websocket, etc...

Right. But I would think a lot of those just go back to either being a buffered reader around another stream, or they have an internal buffer which actually doesn't benefit from vectored IO. Maybe @Nemo157 has some insight from the implementation of async-compression?

In my personal experience, I definitely hit the issue where, in order to have the most efficient impl possible, I needed to implement both poll_read and poll_read_vectored separately to do different things, resulting in code dup. More often than not, I ended up just skipping the specialized poll_read_vectored option.

I'm guilty too :-) Although the only situation I really could recall where I could forward some vectored IO was in the case of some buffered reader/writer operation - when the buffer was exhausted and bypassed.

For me, the biggest factor is the "safety" aspect and uninitialized memory. Being required to zero out memory is sub optimal and I don't think the current strategy for modeling safety & uninitialized memory is ideal. WDYT?

I definitely agree that zeroing out memory is a bit annoying from a performance perspective, since it is typically not necessary for the correctness of the program. I don't yet understand what the impact of simply not doing it would be, since I don't have the historic context of APIs like prepare_uninitialized_buffer. An uninitialized byte array is at least still a byte array with valid u8 values that can be read. So for any implementation of an AsyncRead it should still be ok to read those values? Is this about tools like Miri not liking those interactions? I am wondering whether the differentiation between uninitialized memory and initialized memory must really be a property of the byte stream, or rather of the caller or memory allocator. Would you really ever do anything different in a Stream based on the knowledge that bytes have not been initialized?

Also, to note, taking &mut dyn Buf will prevent passing in &[u8] as an argument.

I think being able to pass &[u8] is important for discoverability and to allow newcomers to work with the API. It will likely be what most users are using, if they don't go for owned APIs like Bytes. I think a generic argument like Buf here can increase the barrier for newcomers. It reminds me of when I started with boost asio long ago: function signatures like this one mention MutableBufferSequence, and I first had no idea what to pass. I learned quickly that I can pass buffer(ptr, length), but it took quite a bit of time to understand and appreciate the full flexibility of the API.
I therefore think that being able to pass a normal byte slice - potentially with implicit conversion - is important. As far as I understand, that could also be in the async fn/Future wrapper around this. But maybe the situation is different for Rust, since users get in touch with generics a lot more often.

Object safety was one of the main reason why directly implementing IO as Future producing types was not found ideal in rust-lang-nursery/futures-rs#1365. I think the reasons are valid, and I have appreciated the ability to pass byte streams as dyn AsyncRead now in several occasions. Although I'm still a bit sad that I can't implement these types via composition and async fn.

@Nemo157 commented Nov 7, 2019

I'm in favor of changing the return types of poll_read and poll_write to io::Result<()> (from io::Result<usize>) since it's part of {Buf, BufMut}.

Being able to easily see how many bytes were written/read during the current call is very useful, at a minimum to be able to check for EOF, but also when performing other operations on the data in parallel to writing/reading it.

EDIT: It is possible to get this data without returning it, but it becomes a very painful pattern after the fifth time of writing it:

let prior_len = buf.remaining();
writer.poll_write(&mut buf)?;
let written = prior_len - buf.remaining();

and looking at BufMut I'm not certain whether remaining_mut is allowed to change dynamically as data is written; if it is, I don't think it is possible to externally determine how many bytes were written during poll_read without wrapping the buffer.

@Nemo157 commented Nov 7, 2019

But I would think a lot of those just go back into either being an buffered reader around another stream, or they have an internal buffer which actually doesn't benefit from vectored IO. Maybe @Nemo157 has some insight from the implementation of async-compression?

Nope, I have been ignoring the vectored IO methods so far.

One complication is that I'm using AsyncBufRead and AsyncBufWrite for the underlying IO, which don't expose any vectored access (although maybe it doesn't matter at this interface: is vectored IO just so that the application can write disparate slices in one kernel operation, or does splitting a single large read up into multiple smaller slices give a performance benefit?).

If a user is passing vectored IO buffers into the encoder/decoder then there would be a performance advantage to having it loop over all the buffers within the context of a single operation, to avoid extra calls into the underlying IO per-buffer. This seems easy to implement, and I could do the inverse of the default implementation and just have poll_read call poll_read_vectored with a single buffer vector.
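
That inverse default might look like this (a sketch in futures-io style):

use std::io::{self, IoSliceMut};
use std::pin::Pin;
use std::task::{Context, Poll};

trait AsyncRead {
    fn poll_read_vectored(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        bufs: &mut [IoSliceMut<'_>],
    ) -> Poll<io::Result<usize>>;

    // Inverse of the usual default: the scalar read goes through the
    // vectored path with a one-element buffer list.
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut [u8],
    ) -> Poll<io::Result<usize>> {
        let mut bufs = [IoSliceMut::new(buf)];
        self.poll_read_vectored(cx, &mut bufs)
    }
}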

@MOZGIII (Contributor) commented Nov 7, 2019

Are there any plans for generic support over various underlying types? I.e. Buf<T: Copy>. I’m working with audio samples, and there are a lot of operations where it’d be useful to have the same abstractions that we have for u8 buffers, but for other types (f32, u16, i16 to name a few).
If not, I think we should implement a separate (i.e. not std and not tokio) crate with such an abstraction.

To be concrete, I have a practical use case example. I'm using code that extends the AsyncRead/AsyncWrite trait concepts to generic item types, for instance this: https://github.com/MOZGIII/netsound/blob/1137f77966a6a3bd053acbc2e06dee68b31af0ba/src/io/async_write_items.rs

@Shnatsel commented Nov 7, 2019

The "read to uninitialized memory" is a really thorny problem, and I'm glad to see it's being addressed! However, some important aspects of it are not clear to me from the description:

Switching the argument to T: BufMut solves this problem via the BufMut trait. First, BufMut provides low-level functions that return &mut [MaybeUninitialized<u8>].

Returning a single slice does not seem to be possible with the current release of the bytes crate - the documentation on BufMut states that the underlying storage may or may not be in contiguous memory. I see this PR uses the bytes crate from git; is there a requirement for the storage to be contiguous in the git version of bytes?

Also, BufMut writes are infallible, i.e. it will allocate more memory if it runs out of capacity. Does this still hold for the revised version of BufMut?
If so, this is contrary to the current Read behavior that writes to a fixed-size slice. In many cases allocating an unbounded amount of memory is undesirable and will lead to a DoS attack; I fear providing an automatically reallocating interface alone will make people roll their own unsafe with uninitialized slices, just like it's already happening with Vec. Providing writes to a Vec-like buffer with a bounded capacity sounds like a simpler solution, since it could be used to safely implement writes to an unbounded buffer in turn.

@Shnatsel commented Nov 7, 2019

@sfackler I've been giving a lot of thought to encapsulating uninit buffers, incl. in Read trait. I'll leave comments on the doc in the next few days. You can also catch me in #wg-secure-code on Rust Zulip if you like.

@sfackler (Contributor) commented Nov 7, 2019

Also, BufMut writes are infallible, i.e. it will allocate more memory if it runs out of capacity.

That's the behavior of Vec's implementation, but not the behavior of the implementations for other types like slices or Cursors. For those types, the capacity never increases.

Definitely interested in your thoughts on Read + uninit buffers!

@Ralith (Contributor) commented Nov 7, 2019

Are there any plans on having generic support for various underlying types?

This is an interesting thought; AsyncRead/Write then become something like a batch-oriented version of Stream/Sink. Probably something to explore independent of this specific PR though, same as adding an associated error type.

@baloo commented Nov 8, 2019

Just wondering: given the number of times AsyncRead/AsyncWrite is referenced in tokio and its ecosystem, wouldn't all those dyn Buf or dyn BufMut (from a concrete type) affect compilation speed significantly? I feel like it's a lot more work for rustc to handle those.

Otherwise, the arguments sound reasonable.

@hawkw (Member) commented Nov 8, 2019

@baloo I believe a dyn Trait is going to be much easier on the compiler than a generic or impl Trait (no monomorphization!)

@DoumanAsh (Contributor) commented Nov 8, 2019

@hawkw virtual dispatch has runtime overhead, and it might not be desirable for high performance.

Using Buf* is certainly convenient, as the user can provide something other than a concrete type, but we should note the potential performance hit.

@Ralith (Contributor) commented Nov 13, 2019

I'm curious -- are there measurements about the cost of zeroing memory in full applications?

In Quinn, on my laptop, __memset_avx2_erms for zeroing our 64KiB stack-allocated receive buffer (which happens on every poll but less often than every recv) costs per benchmark:

  • large streams: 2.45%
  • small streams: 0.60%
  • small datagrams: 0.20%

So, not a huge issue for us, though if we have a lot of success optimizing elsewhere then it might start to be worth our attention.

@carllerche (Member) commented Nov 13, 2019

Thanks @seanmonstar for putting together this proposal and working through the
changes. It has been very helpful to see the changes in context. I think that it
is a big improvement over the status quo.

I would like to provide some additional thoughts for context as well as
an alternative proposal. I don't necessarily think the alternate proposal is
better. Both have pros and cons, but I do think that both proposals will be an
improvement.

At this point, the focus should be to figure out an AsyncRead / AsyncWrite
design that satisfies Tokio's needs. Secondly, std has a medium term goal of
providing their own version of AsyncRead / AsyncWrite. Ideally, once
that happens, Tokio will be able to deprecate its version of the traits in favor
of the one in std. For this to happen, the version stabilized in std needs
to satisfy Tokio's requirements, so the proposed design should consider whether
or not stabilization in std is possible.

Goals & Requirements

The traits should enable both the implementor and the caller of AsyncRead /
AsyncWrite to achieve (measurable) optimal performance without having to
reach for unsafe. Reaching optimal performance implies using uninitialized
memory as a buffer for reading and using vectored operations when possible. The
traits should also be easy to use and implement.

In short, primary goals:

  • Ergonomic usage / implementation.
  • Use uninitialized memory for reads.
  • Smooth support for vectored operations.

Secondary goal:

  • Buy-in for stabilization in std.

Uninitialized memory

In order to reach optimal performance, uninitialized memory must be used as an
argument to read calls. However, in Rust, it is undefined behavior to read
from uninitialized memory. This is true even if the memory is an array of bytes.
Because of this, the stated requirement implies that uninitialized memory must
be encapsulated in a safe abstraction.

The original proposal encapsulates uninitialized memory via a trait object:
&mut dyn BufMut. This requires all usage of the trait object to go via dynamic
dispatch and also prevents inlining. It is unclear how much of an impact this
has in practice. For Hyper, the impact was not measurable. If AsyncRead
implementations do small operations like:

for i in 0..100u8 {
  dyn_buf.put_u8(i);
}

then I expect the impact of a trait object to be more severe. The problem is we
have no way to evaluate the real impact without pushing the trait object
proposal into the wild.

An alternate strategy would be to use a concrete struct that includes a cursor
tracking how much memory has been initialized. Something like:

pub struct ReadBuf<'a> {
    initialized: usize,
    data: &'a mut [MaybeUninit<u8>],
}

impl ReadBuf<'_> {
    pub fn get_raw(&mut self) -> &mut [MaybeUninit<u8>] {
        &mut self.data[..]
    }

    // Caller asserts that the first `len` bytes have been initialized.
    pub unsafe fn set_initialized(&mut self, len: usize) {
        self.initialized = len;
    }
}

impl ops::Deref for ReadBuf<'_> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        unsafe { mem::transmute(&self.data[..self.initialized]) }
    }
}

impl ops::DerefMut for ReadBuf<'_> {
    fn deref_mut(&mut self) -> &mut [u8] {
        unsafe { mem::transmute(&mut self.data[..self.initialized]) }
    }
}

Vectored operations

Vectored operations allow the caller to avoid copying memory by supporting reads
/ writes with discontinuous memory. However, not all AsyncRead / AsyncWrite
implementations can support vectored operations. For example, a TlsStream
backed by native-tls is unable to efficiently support vectored ops due to a
lack of support in the underlying libraries (like openssl).

The existing solution to this problem is to include separate functions on the
traits: one for basic reads (poll_read / poll_write) and one for vectored
operations (poll_read_vectored / poll_write_vectored). The vectored versions
include a default implementation that calls the non-vectored versions with the
first slice.

This solution has a critical problem. In the case that the underlying I/O type
does not support vectored operations, the default implementation is
inadequate from a performance POV. When using vectored operations, the caller
can have a large set (~1,000) of tiny buffers (8 ~ 64 bytes). The expectation is
that the entire set of buffers is submitted to the kernel in one go. The default
implementation will result in the caller submitting their set of buffers, only
one being written, and the caller being forced to loop and issue many syscalls.
The optimal strategy here is for the caller to hold a staging buffer, write all
of its slices into the staging buffer, then submit the staging buffer to
AsyncWrite::poll_write in one go.
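
A sketch of that caller-side staging strategy (names hypothetical):

use std::io::IoSlice;

// Copy many small slices into one contiguous staging buffer so that a
// non-vectored writer can be handed a single large write.
fn stage(slices: &[IoSlice<'_>], staging: &mut Vec<u8>) {
    staging.clear();
    for slice in slices {
        // one memcpy per slice instead of one syscall per slice
        staging.extend_from_slice(slice);
    }
    // `staging` is then submitted in a single poll_write call.
}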

A secondary problem is that, because poll_write_vectored has a default
implementation, almost all implementations of AsyncRead / AsyncWrite
forget to provide a vectored variant even if they have the ability to do so.
Because of the problem above, forgetting to provide an implementation renders
the vectored fn calls useless.

My conclusion to the above problems is that, at a fundamental level, the
caller of AsyncRead / AsyncWrite needs to employ different patterns
depending on whether or not the AsyncRead / AsyncWrite type supports vectored
operations. This implies that the caller needs to be able to detect the
capability somehow.

The original proposal (&mut dyn Buf) directly solves the second problem by
only providing a single function to implement. The first problem can be
solved indirectly by providing a Buf / BufMut wrapper that detects whether
bytes_vectored and bytes_vectored_mut are called. If they are not
called, a buffer internal to the wrapper is initialized and used as a
staging buffer. An example of this can be found in
Hyper.

An alternate strategy would be to include a function on AsyncRead / AsyncWrite
that informs the caller about capability:

pub trait AsyncRead {
    fn poll_read(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        dst: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<usize>>;

    fn can_vector(&self) -> bool {
        false
    }

    fn poll_read_vectored(
        self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        dst: &mut ReadBufSlice<'_>,
    ) -> Poll<io::Result<usize>> {
        read_with_first_io_slice(self, cx, dst)
    }
}

The caller would then be able to detect capability up front and perform
operations as necessary. This solves the critical problem. The secondary problem
can be addressed by not providing a default implementation of the function.
This would force the implementor to consider a vectored implementation. The
biggest downside with this option is it requires more functions to be
implemented. However, deciding to not implement poll_read_vectored is less
critical as the caller is able to detect the lack of capability.
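
Caller-side, the capability probe might be used like this (a sketch against
the trait above, assuming a matching hypothetical can_vector on AsyncWrite):

enum WriteStrategy {
    SubmitSlicesDirectly,
    CopyIntoStaging,
}

// Decide once, up front, instead of discovering the capability by
// watching how the callee consumes the buffers.
fn plan_writes<W: AsyncWrite>(writer: &W) -> WriteStrategy {
    if writer.can_vector() {
        WriteStrategy::SubmitSlicesDirectly
    } else {
        WriteStrategy::CopyIntoStaging
    }
}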

Inclusion in std

I cannot comment on this with any authority. My impression is that using a
concrete struct has a higher chance of inclusion vs. a trait object. This is
based on the impression I got from talking with a few people directly (I will
let them comment directly).

Personal opinion

So far, I have just presented the problem and the options as best as I could. I
have not really stated my personal preference.

I see both proposals as having pros and cons. I believe that both proposals are
an improvement over the traits provided by futures-io 0.3 for the reasons
outlined in this comment. I see the original proposal (&mut dyn Buf) as
providing better ergonomics. Only having a single function to implement vs.
multiple is a clear win. It is unclear how much code duplication will result as
a factor of having poll_read vs. poll_read_vectored.

On the flip side, it is unknown how much trait objects will impact the goal of
"optimal performance" in practice. Initial measurements in hyper show no
noticeable performance impact, but that is a small sample.

Also, the &mut dyn Buf proposal does not directly solve the problem of
being able to detect vectored capabilities and instead relies on intercepting
calls to bytes_vectored_mut to detect capability. This is a con to me.

I see the concrete struct (ReadBuf) proposal as the "conservative" one. It
solves the stated problems, has some friction, but has an understandable
ecosystem impact. The &mut dyn Buf proposal has more unknowns in terms of the
extent of its ecosystem impact.

I hope that others will be able to dig into the two proposals and help guide the
final solution.

@Ralith (Contributor) commented Nov 13, 2019

Has the possibility of using static dispatch been definitively eliminated?

@carllerche (Member) commented Nov 13, 2019

@Ralith Nothing is definitive, but I consider it at a similar level to trait objects. Where trait objects have an unknown runtime hit, generic methods have an unknown compile-time hit plus hoops to jump through to keep the trait object safe.

@nikomatsakis commented Nov 13, 2019

@carllerche I'm curious about the method that lets callers detect whether vectorized capabilities are available. How often and how far up the stack do you anticipate that introspection being used? It seems like a major shift in strategy, to opt for one big buffer vs many small ones, and I'm wondering if that is something that would normally occur "close" to the calls to AsyncWrite, or if there would often be more layers in between that have to be adjusted.

@carllerche (Member) commented Nov 13, 2019

@nikomatsakis My gut is it stays close to the usage of the AsyncRead / AsyncWrite type. It would be encapsulated in Hyper, for example.

@seanmonstar (Member, Author) commented Nov 13, 2019

@RalfJung

For read and BufMut, where can I see the new BufMut API that the OP is mentioning?

The changes are in bytes master; the docs are here: http://tokio-rs.github.io/bytes/doc/bytes/trait.BufMut.html#tymethod.bytes_mut. It also includes a newtype over IoSliceMut, to stop exposing the Deref<Target=[u8]> on the std::io::IoSliceMut type.

We're not doing anything special with write.

@Matthias247 commented Nov 13, 2019

An alternate strategy would be to use a concrete struct that includes a cursor
tracking how much memory has been initialized. Something like:

pub struct ReadBuf<'a> {
    initialized: usize,
    data: &'a mut [MaybeUninit<u8>],
}

It's a bit unclear to me what the initialized size is really used for. Assuming we are talking about issues at code generation time - how would they be influenced by any value which is only set at runtime? It feels like the generated code must be able to know at compile time which bytes are written, or potentially assume that all bytes get written?

One likely stupid question is why we can't get something like the C version working by having another initializer that tells the compiler that the whole memory section is used (and may be assumed to be initialized), but where we don't care about the content.

E.g. along

let mut buffer: [u8; 1024] = [i_dont_care_at_all; 1024];
reader.read(&mut buffer[..]);

With that approach all read/write traits can still operate on slices. I think that initializer is different from MaybeUninit because

  • it would only be valid for types where all random bytes which could be somewhere in memory are valid Rust values. Which is most likely only the u8/u16/u32/i8/i16/i32/... types.
  • it prevents optimizations on that chunk of memory, and looks like a normal slice for everything else.

Likely that came up already somewhere else and there is a valid reason not to have it, so sorry for asking upfront.

@sfackler (Contributor) commented Nov 13, 2019

that tells the compiler that the whole memory section is used (and may be assumed as initialized), but where we don't care about the content.

That is freeze.

@sfackler (Contributor) commented Nov 13, 2019

It's a bit unclear for me what the initialized size is really used for.

It is used for a method on ReadBuf that returns a &mut [u8] by initializing the buffer (or some subset of it), but only if actually necessary. It's quite important to not be re-zeroing the buffer over and over when repeatedly interacting with a reader.
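
Building on the ReadBuf sketch above, that method might look something like this (name hypothetical):

use std::mem::MaybeUninit;

impl ReadBuf<'_> {
    // Expose the whole buffer as &mut [u8], zeroing only the bytes that
    // were never initialized. Repeated reads into the same ReadBuf
    // therefore do not re-zero memory.
    pub fn initialize_unfilled(&mut self) -> &mut [u8] {
        for byte in &mut self.data[self.initialized..] {
            *byte = MaybeUninit::new(0);
        }
        self.initialized = self.data.len();
        unsafe {
            std::slice::from_raw_parts_mut(self.data.as_mut_ptr() as *mut u8, self.data.len())
        }
    }
}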

Assuming we are talking about issues at code generation time - how would they be influenced by any value which is only set at runtime? It feels like the generated code must be able to know at compile time which bytes are written / or potentially assume that all bytes get written?

"Surely the compiler can't understand this code and take advantage of the UB involved" is not really a robust path to making sound interfaces.

@jonhoo (Contributor) commented Nov 13, 2019

I wonder if it'd be possible to make read something along the lines of:

fn poll_read<'a>(&mut self, buf: &'a mut [MaybeUninit<u8>]) -> io::Result<&'a mut [u8]>;

Where the returned slice is a subslice into the provided one that contains only the actually-read (and so no longer uninitialized) bytes. That way, the caller would not need to do anything unsafe, yet we'd still support operating on uninitialized memory.

@Matthias247 commented Nov 14, 2019

That is freeze.

Thanks for the hint. I now googled for that and came to this PR with it. It is basically a variation of my idea.

"Surely the compiler can't understand this code and take advantage of the UB involved" is not really a robust path to making sound interfaces.

I am more interested in what I (either as an AsyncRead implementor, or even as an OS or allocator implementor) would need to do - in terms of code that is generated and gets executed at runtime - to make sure things are safe. Touching all bytes without being aware of what side-effects it causes or must cause also doesn't sound like the most ideal solution.

@jonhoo

I wonder if it'd be possible to make read something along the lines of:

Wouldn't that require all implementors of the API to always initialize the array before doing anything else, since we never know how much had been initialized before? That would mean if such a read API were called in a loop, things would have to be initialized over and over again. @carllerche's tracking struct tries to avoid that. For that one I actually don't understand if the compiler can actually associate the tracking information and the MaybeUninit inside the read call. For the pure read call code that is generated, it cannot observe any writes to previous offsets. So is this safe or not? From a logical perspective it is, but we were talking about the machine model and the compiler model being different here.

It's also not clear to me whether we can pass a MaybeUninit::as_ptr() to a syscall without initializing it explicitly inside the read call. Maybe someone can help me on that :)

@Matthias247 commented Nov 14, 2019

FWIW, here are my experiences with vectorized IO over the last 10 years - where I found it helpful and where less so. Maybe others can add their findings, so that we get a better understanding of what things are required and where they are mostly used.

Vectorized Reads:

This one is easy: I never found a good use for them. I could not use them for length-delimited messages (length goes into one buffer, payload into another), because the length of the payload is not known upfront. Even if it were - just reading everything into one buffer and splitting the length off will be more efficient.

One use-case I could imagine is if one's application utilizes fixed-width buffers for IO from a pool (e.g. using Bytes) that are getting filled as much as possible through reads, and then those buffers are getting pushed to the application code (which needs to deal with arbitrary payload boundaries). Maybe that would work for things like node.js where buffers are getting pushed to the application, instead of the user passing buffers (pull model).

It might also help for protocols where all data has fixed width. But on stream-based protocols (like TCP) variable-sized payloads are actually typical. I guess the fact that we want our protocols and payloads to be extensible also contradicts fixed-width messages.

Vectorized Writes:

First, again, what didn't work: trying to write small things via vectorized IO. E.g. writing the first 4 bytes of a message which contained the length in the first slot of an IoVec, and the payload in another slot. This was a lot slower than merging things into one buffer. I think there is an undocumented minimum size that each buffer needs to have in order to be economic. And below that, applications are better off merging everything (e.g. via a buffered writer) and doing aggregated writes - even though it means a lot more copies.

Apart from "what didn't workt that well", here are the 2 uses of vectorized writes at actually showed some benefits:

1. Buffered Writer implementations

Those are super helpful for protocols like HTTP, where you don't want to perform OS-level IO on every write for headers. Every header gets written to the buffered writer. If the buffer is full, we try to write as much as possible to the OS. That is not yet a vectorized call.

However at some point we might switch to writing HTTP bodies, which are bigger. That means we now have a certain amount of buffered data as well as a bigger amount of user data (maybe even in vector form). That makes a reasonable vectorized write. However most of the IO vectors have had a realistic size of 2. For that case building the IoVec myself wasn't that bad. And even if the user passed a vector of data, the remaining TODO was mostly inserting the buffered chunk at the front of the list.

2. Messaging with queue systems

I worked on some systems where one component buffered all outgoing messages for another machine, which could be enqueued by various tasks. The component tries to write buffered messages as fast as possible onto a socket. Since multiple message buffers might be available, it could on every write attempt build an IoVec based on all buffered messages and send those out at once. Then, based on the amount of written bytes, queued messages could be released. A websocket implementation could do something like that, with each frame being a buffer. In these kinds of systems the IoVec is also built dynamically, since the number of messages in the queue changes - written messages get dropped and new messages get added.

@carllerche (Member) commented Nov 14, 2019

@Matthias247 Thanks for the feedback. Your insights are helpful.

Personally, I have used vectored reads with success. The most obvious case is when using a ring buffer. When the buffer wraps, you can use vectored reads to fill the entire buffer.
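
For example (a sketch; index conventions assumed): when the free region of a ring buffer wraps past the end of its storage, it shows up as two slices, and one vectored read can fill both.

// Free space of a ring buffer as two mutable slices for a single
// vectored read: [write_pos..] plus [..read_pos]. Assumes
// read_pos <= write_pos, i.e. the free region wraps around the end.
fn free_slices(storage: &mut [u8], read_pos: usize, write_pos: usize) -> [&mut [u8]; 2] {
    let (front, back) = storage.split_at_mut(write_pos);
    [back, &mut front[..read_pos]]
}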

For vectored writes, another thing to keep in mind is that AsyncWrite layers can take advantage of it as well even if it doesn't result in the vectored syscall. For example, a BufferWrites layer can accept vectored writes to fill its internal buffer. I would also assume that libs like rustls could take advantage of it as well. Mostly, the goal here is to avoid the caller having to buffer data... submit the buffer and it gets buffered again in an intermediate layer.

@Matthias247 commented Nov 14, 2019

For vectored writes, another thing to keep in mind is that AsyncWrite layers can take advantage of it as well even if it doesn't result in the vectored syscall.

Right. Such an API can even just help to forward data in an atomic fashion between components in the same program (e.g. if every read/write would require taking a lock). Not sure however how common it is for those calls to directly forward them again in a vectored fashion.

@RalfJung commented Nov 14, 2019

@carllerche

However, in Rust, it is undefined behavior to read
from uninitialized memory.

Probably you are aware, but things are slightly more subtle -- the following is completely fine, for example, even though it reads uninit memory:

let x: MaybeUninit<Vec<i32>> = mem::uninitialized();

But indeed, most types cause UB if you leave them uninitialized.
Also, uninitialized bytes can be read as part of struct padding without UB:

let x: (u8, u16);
x.0 = 1;
x.1 = 2;
let y = unsafe { (&x as *const _).read() }; // this (conceptually) also reads the padding byte, which is uninit. But that's fine.

If you read through https://doc.rust-lang.org/reference/behavior-considered-undefined.html, you'll notice that the actual rules hardly ever mention uninit memory. :) It is not needed, most of the time.

@seanmonstar

The changes are in bytes master; the docs are here: http://tokio-rs.github.io/bytes/doc/bytes/trait.BufMut.html#tymethod.bytes_mut. It also includes a newtype over IoSliceMut, to stop exposing the Deref<Target=[u8]> on the std::io::IoSliceMut type.

Thanks. That API looks good at first glance. We might be able to improve ergonomics for slices of MaybeUninit, but that should be orthogonal.

We're not doing anything special with write.

Okay. So for now this means only fully initialized buffers can be written.

@jonhoo

Where the returned slice is a subslice into the provided one that contains only the actually-read (and so no longer uninitialized) bytes. That way, the caller would not need to do anything unsafe, yet we'd still support operating on uninitialized memory.

Yes, I think that would work.

@Matthias247

Wouldn't that require all implementors of the API to always initialize the array before doing anything else, since we never know how much had been initialized before? That would mean if such a read API were called in a loop, things would have to be initialized over and over again.

Why that? The API @jonhoo proposed returns a subslice. That slice should be as long as what that read call wrote anyway. No extra writes should be needed.

@ztlpn commented Nov 14, 2019

Re: API proposal by @jonhoo - I suppose it suffers from the same "inefficient backward compatibility" problem mentioned in @sfackler's dropbox doc for the dyn BufMut proposal (BTW why doesn't the doc mention that?)

I.e. a hypothetical compatibility shim that implements tokio::AsyncRead::poll_read in terms of futures::io::AsyncRead::poll_read will have to initialize the buffer on every call before it passes the buffer to futures::io::AsyncRead::poll_read.

@RalfJung commented Nov 14, 2019

I.e. a hypothetical compatibility shim that implements tokio::AsyncRead::poll_read in terms of futures::io::AsyncRead::poll_read will have to initialize the buffer on every call before it passes the buffer to futures::io::AsyncRead::poll_read.

As far as I can tell that is inherent in futures::io::AsyncRead::poll_read; no API that correctly handles uninit bytes can avoid that?

@ztlpn commented Nov 14, 2019

@RalfJung Sure, every compatibility shim will have to initialize memory, but if reads are performed in a loop and the buffer is reused, it is desirable to initialize only once - and this is possible e.g. with the "concrete struct" proposal because it tracks the length of the currently initialized part. But if poll_read takes a simple &mut [MaybeUninit<u8>] slice as a parameter, a shim must reinitialize the whole buffer on every loop iteration, which is prohibitively expensive.

That's my understanding of the problem anyway.

@seanmonstar (Member, Author) commented Nov 14, 2019

it is desirable to initialize only once

This can also be possible with dyn BufMut. Currently the required method is bytes_mut() -> &mut [MaybeUninit<u8>], and we can add a bytes_mut_initialized() -> &mut [u8] method that defaults to zeroing, but a buffer that has some internal indices can override it.
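
A sketch of that default (extension-trait and method names hypothetical; assumes the bytes-master bytes_mut() -> &mut [MaybeUninit<u8>] shape linked above):

use std::mem::MaybeUninit;

use bytes::BufMut;

trait BufMutInitExt: BufMut {
    // Default: zero the next chunk before exposing it as &mut [u8]. A
    // buffer that tracks how much of itself is already initialized can
    // override this and skip the zeroing.
    fn bytes_mut_initialized(&mut self) -> &mut [u8] {
        let chunk = self.bytes_mut(); // &mut [MaybeUninit<u8>] on bytes master
        for byte in chunk.iter_mut() {
            *byte = MaybeUninit::new(0);
        }
        unsafe { std::slice::from_raw_parts_mut(chunk.as_mut_ptr() as *mut u8, chunk.len()) }
    }
}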

@nikomatsakis commented Nov 15, 2019

Hey all. So I started writing up a comment here but found it was getting too long. I decided to create a hackmd where I am taking notes on some of the considerations here -- obviously I'm not a core decision maker, but I am pretty interested in understanding the arguments. There are lots of XXX's where I got tired of typing stuff. =) Feel free to fill notes in if you'd like; I hope to complete it at some point, just to have a good record for future standardization discussions.

In this comment, I'd just like to highlight a few things:

  • I still think it would be really useful to try and get more representative performance numbers. It seems like we are mostly going on microbenchmarks. But I feel like there are some reasonably high performance servers in other languages that are using zeroing -- I could easily see that this is all lost in the noise in end-to-end applications (I could also easily see that the opposite is true). Does tokio have some "sample applications" that are used for benchmarking? Could we just comment out zeroing and see what effect that has on those applications (presuming that all the readers are well-behaved, and don't actually read from their input buffers)?

The one measurement which @Ralith gave here suggested zeroing had a small impact on smaller workloads, and an impact of 2% or so on larger loads -- @Ralith can you say a bit more on what you were testing there? I'm not familiar with quinn, but I guess you meant https://github.com/djc/quinn?

  • Similarly, it seems like &mut dyn Buf has a lot of advantages, but introduces some unknowns as to the cost of those virtual calls at scale (they don't show up in microbenchmarks, but is that representative?). Of course, this is arguing in sort of the opposite direction of the previous paragraph, in that you usually expect that if microbenchmarks don't show any effect, then larger systems won't either. I'm honestly not sure here, as I could imagine that (e.g.) virtual calls are getting optimized out in microbenchmarks (since LLVM can inline and see the source of the call) or other such things. I'm not sure if people have investigated that? It seems like it would be worth doing.

  • I think a key consideration for this decision should be intercompatibility. We don't know yet what the std traits will look like. I think it's ok if tokio/std diverge on the precise shape of the trait (the valuation of the tradeoffs might not be the same), but it'd be a shame if they don't permit performant interop.

@Ralith (Contributor) commented Nov 15, 2019

@Ralith can you say a bit more on what you were testing there? I'm not familiar with quinn, but I guess you meant https://github.com/djc/quinn?

Yeah, sorry. Quinn is me and @djc's futures/tokio friendly implementation of QUIC, a complex encrypted transport protocol that runs on top of UDP. All of the benchmarks I cited spin up a client/server pair on independent threads on localhost, then cram as much data through the connection as possible, limited only by CPU. Each benchmark sends different types of data, requiring the implementation to do different amounts/types of processing.

  • large streams: 128KiB reliable messages; each message must be fragmented into >100 UDP packets and reassembled on the other side.
  • small streams: 1-byte reliable messages; many messages are packed into each packet.
  • small datagrams: 1-byte unreliable messages; many are packed into each packet and less bookkeeping needs to be done.

Judging by the results, we should be looking for low-hanging fruit in our packet encode/decode procedures (for example) before worrying about zeroing.

@seanmonstar (Member, Author) commented Nov 15, 2019

@nikomatsakis

Does tokio have some "sample applications" that are used for benchmarking?

hyper has a suite of end-to-end (so testing server and client implementations together) benchmarks checking concurrent requests and throughput. I wouldn't quite call them microbenchmarks, since they test a full stack, but at the same time, most servers won't be trying to dedicate 100% of their CPU to the same thing, so... 🤷‍♂

I ran the benchmarks for a single request/response, with 100kb bodies (both request and response) and 10mb bodies (https://github.com/hyperium/hyper/blob/71d088d3d0c062aa05459697f53bf15018cfd651/benches/end_to_end.rs#L30-L48). Afterwards, I applied this simplistic patch that zeroes the memory before poll_read_buf. Granted, it's probably zeroing things too frequently, but it's also what the current futures-io/std::io APIs would be doing anyways...

diff --git a/src/proto/h1/io.rs b/src/proto/h1/io.rs
index b5352f79..51218cba 100644
--- a/src/proto/h1/io.rs
+++ b/src/proto/h1/io.rs
@@ -173,6 +173,11 @@ where
         if self.read_buf.remaining_mut() < next {
             self.read_buf.reserve(next);
         }
+        // zero the uninitialized memory
+        unsafe {
+            let uninit = self.read_buf.bytes_mut();
+            std::ptr::write_bytes(uninit.as_mut_ptr(), 0u8, uninit.len());
+        }
         match Pin::new(&mut self.io).poll_read_buf(cx, &mut self.read_buf) {
             Poll::Ready(Ok(n)) => {
                     debug!("read {} bytes", n);

Results

cargo bench --bench end_to_end http1_body_both:

  • Uninitialized memory
    • 100kb (in both directions, so x2): ~1600 mb/s (over 3 runs)
    • 10mb: ~2350 mb/s
  • Zeroed memory
    • 100kb: ~1075 mb/s
    • 10mb: ~1850 mb/s
@nikomatsakis commented Nov 15, 2019

@seanmonstar

I wouldn't quite call them microbenchmarks, since they test a full stack, but at the same time, most servers won't be trying to dedicate 100% of their CPU to the same thing, so... 🤷‍♂

Good point, the term "microbenchmarks" doesn't quite seem right. Thanks for elaborating. =)

EDIT: (I updated the measurement section to include the extra details, thanks to @Ralith too.)

@Matthias247 commented Nov 16, 2019

@seanmonstar Thanks for benchmarking. Those are actually some surprising results - I didn't expect it to show that much difference - especially since 2GB/s is actually still "slow". But you also mentioned it's probably doing more zeroing than it needs to.

There are people out there who are looking into exhausting 50-200GBit/s networking equipment - e.g. for CDN use cases. I recall reports like these where people claimed that the memory throughput (which is required for copying as well as zeroing) is an even more significant problem at those speeds. Therefore I am pretty sure there are applications that want to avoid all unnecessary work at all cost.

However, those might not make use of interfaces like AsyncRead anyway. I guess there you would rather get a buffer from a buffer pool (likely pre-initialized - which could mean containing garbage from the last use) and try to use zero-copy mechanisms as much as possible.

@Matthias247 commented Nov 16, 2019

@RalfJung

Why that? The API @jonhoo proposed returns a subslice. That slice should be as long as what that read call wrote anyway. No extra writes should be needed.

I think my general challenge with the meaning of "uninitialized" and "MaybeUninit" is still whether we are actually talking about a type-system property or a runtime/system property.

If it were purely a type-system thing which allows compiler optimizations when no writes or only partial writes are observed in the scope, then things like passing in information about previous initializations do not help - they are outside of the current scope. And it's unclear to me whether side-channel information about previous initializations now makes something initialized or not. From a runtime/system perspective it certainly does, from a type-system perspective rather not.

If it were purely a runtime behavior, then it seems like weird behavior could not happen if you pass a thing that looks uninitialized into a scope and just read from it - assuming the uninitialized-looking thing had been initialized before in a different iteration. But according to the links from sfackler it can happen (was that mainly because the compiler could there guarantee that nobody ever used that uninitialized data?).

If it were purely a compile-time property, then it's hard to understand why things like freeze() won't work.

With the mixed categorization I actually even have trouble understanding whether I can get a raw pointer from an allocator, do std::slice::from_raw_parts::<u8>(ptr, len) on it, and use that slice - given that I might not know whether the allocator zeroed the memory (in C language terms, malloc does not, calloc does).

@Ralith (Contributor) commented Nov 16, 2019

Initializedness is a dynamic property of memory, i.e. something that is determined at runtime, according to the (admittedly not yet well defined) abstract semantics of the language, not to be confused with the particular behavior of today's compiler.

@RalfJung commented Nov 16, 2019

I think my general challenge with the meaning of "uninitialized" and "MaybeUnit" is still whether we are actually talking about a type system property or a runtime/system property.

The one we are talking about here is not a type-system property. It is a runtime/"system" property, but the "system" here is the Rust Abstract Machine (R-AM)! As far as this discussion goes, Rust programs do not run on actual hardware, they run on the R-AM. It is the compiler's job to emulate the R-AM on actual hardware (and the R-AM is specifically designed to make that emulation efficient).

The R-AM has some things that differ from actual hardware, and one of them is that it tracks, dynamically, while the program runs, whether memory is initialized. This extra tracking is used when defining which programs have UB and which do not. However, at the same time things are carefully designed such that a well-behaved (UB-free) program cannot possibly observe whether memory is initialized or not. This is needed for the "efficient emulation" part; it means that the compiler backend can generate code that doesn't actually know, at run-time, whether some memory is initialized or not, while still correctly emulating the R-AM under the assumption that the program has no UB. For programs with UB, the emulation is incorrect, but that's okay as such programs are just invalid.

"freeze" is a funny operation. It is not a NOP on the R-AM! It actually has the effect of changing the value stored in memory from "uninitialized" to some non-deterministically picked value. This will have a severe influence on optimizations, as those work exclusively with the R-AM semantics. (Optimizations turn one R-AM program into another, equivalent program -- again assuming the program is UB-free; for programs with UB the result does not have to be equivalent.) However, the way the emulation of the R-AM in terms of real hardware is set up, "freeze" can compile to a NOP.

@Ralith (Contributor) commented Nov 19, 2019

Drafted a standalone benchmark without criterion for easier profiling; memsetting while sending data at a CPU-bound ~1Gbps is taking a little under 5% of runtime, which seems like enough to be deemed significant.
