[Edited to add more benchmarks]
[Edited to add reservations about iterator use]
## Proposal Details
This proposal adds two new functions and one new type to the `io` package:

```go
type Seq = iter.Seq2[[]byte, error]

func SeqFromReader(r Reader, bufSize int) Seq
func ReaderFromSeq(it Seq) ReadCloser
```

These enable more efficient and ergonomic I/O pipelines by leveraging Go's iterator functionality.
## Background
When `io.Reader` was introduced into Go, its API was designed following the time-honored API of the Unix `read` system call: the caller provides a buffer and the callee copies data into that buffer.

However, its API is notoriously hard to use; witness the long doc comment. I'm sure most of us have struggled to write `io.Reader` implementations and been caught out by gotchas when using it.
It's also somewhat inefficient. Although it amortizes the cost of byte-by-byte streaming, it's in general not possible to turn an `io.Writer` (convenient for producing data) into an `io.Reader` without copying data, because once the data has been copied there is no clear way of signalling back to the producer that the data is done with. So `io.Pipe` is as good as we can get, and it involves goroutines and copying.
However, now that iterators are part of the language, there is a potential alternative. I propose that we add a way to bridge from an iterator to `io.Reader` and vice versa.
## Proposal
I propose adding one type and two functions to the `io` package:

- `Seq` defines a standard type for these sequences.
- `SeqFromReader` returns a `Seq` of byte slices, yielding chunks of data read from the `Reader` using an internal buffer.
- `ReaderFromSeq` wraps a `Seq` as a `ReadCloser`, implementing `Reader` semantics.

The semantics of `Seq` slices are similar to those of `Reader.Read` buffers: callers must not retain or mutate slices outside the current iteration. The sequence terminates at the first non-nil error, including `EOF`.
```go
// Seq represents a sequence of byte slices. It's somewhat equivalent to
// [Reader], although simpler in some respects.
// See [SeqFromReader] and [ReaderFromSeq] for a way to convert
// between [Seq] and [Reader].
//
// Each element in the sequence must have either a non-nil byte slice or
// a non-nil error; a producer should never produce either (nil, nil) or
// a non-nil slice and a non-nil error.
//
// The sequence always ends at the first error: if there are temporary
// errors, it's up to the producer to deal with them.
//
// The code ranging over the sequence must not use the slice outside of
// the loop or across iterations; that is, the receiver owns a slice
// until that particular iteration ends.
//
// Callers must not mutate the slice. [TODO perhaps it might be OK to
// allow callers to mutate, but not append to, the slice].
type Seq = iter.Seq2[[]byte, error]

// SeqFromReader returns a [Seq] that reads from r, allocating a buffer
// of the given size to do so.
func SeqFromReader(r Reader, bufSize int) Seq

// ReaderFromSeq converts an iterator into a [ReadCloser].
// Close must be called on the reader if it hasn't returned an error.
func ReaderFromSeq(it Seq) ReadCloser
```
## Discussion
In general, no allocation or copying needs to take place when using this API. Byte slices can be passed by reference directly between producer and consumer, and the strong ownership conventions of an iterator make this a reasonable approach. The coroutine-based `iter.Pull` function makes `ReaderFromSeq` considerably more efficient than `io.Pipe` while providing many of the same advantages.
The API is arguably easier to deal with:

- no need to deal with partial reads or writes
- less confusion over the multiple combinations of the return values from `Read`

The fact that we can write a bridge between `Seq` and `Reader` means that this new abstraction could fit nicely into the existing Go ecosystem.
It might also be useful to add another abstraction to make it easier to use `Writer`-oriented APIs with a generator. Perhaps something like this:

```go
// SeqWriter returns a [Writer] that operates on the yield
// function passed into a [Seq] iterator. Writes will succeed until
// the iteration is terminated, upon which Write will return
// [ErrSequenceTerminated].
//
// The returned Writer should not be used outside the scope
// of the iterator function, following the same rules as any yield
// function.
func SeqWriter(yield func([]byte, error) bool) Writer
```
## Performance
I ran a few benchmarks to get an idea of the performance:
| Benchmark | old (sec/op) | new (sec/op) | Δ sec/op | old (B/s) | new (B/s) | Δ B/s |
|---|---|---|---|---|---|---|
| PipeBase64 | 11.187µ ± 1% | 6.008µ ± 2% | -46.30% | 698.4Mi ± 1% | 1300.5Mi ± 2% | +86.22% |
| PipeNoop | 537.0n ± 2% | 144.8n ± 2% | -73.05% | 14.21Gi ± 2% | 52.72Gi ± 2% | +271.11% |
| ReaderNoop | 2.263n ± 1% | 3.584n ± 6% | +58.41% | 3.299Ti ± 2% | 1.948Ti ± 3% | -40.96% |
| ReaderFillIndex | 1.110µ ± 2% | 1.109µ ± 1% | ~ | 6.874Gi ± 2% | 6.881Gi ± 1% | ~ |
The Pipe benchmarks measure the performance when using `SeqReader` as a substitute for `io.Pipe`, with various workloads between the source and the sink (a base64 encoder and a no-op pass-through). This demonstrates how `Seq` can be used to improve the performance of some existing tasks.
The Reader benchmarks measure performance when using a `Seq` vs an `io.Reader`: the no-op does nothing at all on either the producer or the consumer side; FillIndex just fills the buffer and runs `bytes.Index` on it (a fairly minimal workload). This demonstrates that `Seq` is a very low-overhead primitive for producing readable data streams. Writing this benchmark, it was immediately obvious that the `io.Reader` benchmark code was harder to write: I had to create an auxiliary struct type with fields to keep track of iterations, rather than just writing a simple for loop. This is the classic advantage of storing data in control flow. If we pay for this abstraction with a nanosecond of overhead, that seems well worth the cost.
However, it's not all roses. `Seq` is very convenient when we want to write push-oriented code, but if callers are usually going to convert it to a reader with `SeqReader` and it's not too hard to write the code in a pull-based style, we should do so:
| Benchmark | old (sec/op) | new (sec/op) | Δ sec/op | old (B/s) | new (B/s) | Δ B/s |
|---|---|---|---|---|---|---|
| ReaderVsSeqFromReaderNoop | 3.341n ± 3% | 128.550n ± 1% | +3747.65% | 2283.51Gi ± 3% | 59.35Gi ± 1% | -97.40% |
| ReaderVsSeqFromReaderFillIndex | 1.089µ ± 2% | 1.221µ ± 1% | +12.08% | 7.006Gi ± 2% | 6.253Gi ± 1% | -10.76% |
Note that the overhead here is per-iteration, not per-byte, so as the buffer size grows and the per-iteration work grows proportionately, the overhead will reduce.
PoC code (including the benchmark code) is available at https://github.com/rogpeppe/ioseq.
## Reservations about using iterators in this way
My biggest reservation about this proposal is the way that it uses iterators. In general, iterators work best when the values produced are independent of the loop. That property enables us to use functions such as `slices.Collect` and `maps.Insert`, which take values received from the iterator and store them elsewhere. This proposal uses iterators differently: the values produced are emphatically intended *not* to escape the loop in this way.

That said, iterators of the general form `iter.Seq2[T, error]` aren't that useful with collectors of this kind anyway, because of the final error value.
I'm very much in two minds about this. On the one hand, the above properties are definitely nice to have. On the other hand, there are many pre-Go-iterator-support iterator-like APIs which it seems to me would be nicely phrased as for-range loops. But those iterator-like APIs often return values which are only valid for the extent of one iteration. `bufio.Scanner.Bytes` is a good example of this. In fact it feels like a great example, because that's essentially exactly what `Seq` is doing. You can even use `Scanner` to mimic what `SeqFromReader` does.
Another example of this dilemma is #64341.

Modulo explicitly "escaping" functions like `slices.Collect`, iterators do seem like a good fit here. The scope of a given iterator variable is well-defined, much like the argument to a function (which it is, of course, under the hood). And in general we don't seem to have a problem with saying that it's not OK to store the argument to a function outside the span of that function's invocation.

So ISTM that we need to decide whether it's OK to "bless" this kind of pattern. If it's not OK, then this proposal should be declined.