
Memory footprint StreamReader.readexactly #394

frensjan opened this Issue Aug 4, 2016 · 23 comments

@frensjan commented Aug 4, 2016

StreamReader.readexactly is a very handy method for reading frames ('chunks' of data preceded by the size of the 'chunk') from a stream. E.g.:

frame_len_bytes = yield from reader.readexactly(4)  # 32-bit length prefix
frame_len = int.from_bytes(frame_len_bytes, 'big')  # decode the prefix to an int
frame = yield from reader.readexactly(frame_len)

However, its memory footprint isn't ideal. A list of 'blocks' is kept until all data is read, which is then joined with b''.join(blocks). This causes each 'frame' to reside in memory twice until the blocks list and the bytes objects it points to are freed (when leaving the readexactly method). So its peak memory usage is O(2n).
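For reference, the existing implementation is roughly shaped like this (a paraphrase, not the exact stdlib source); at the moment of the final join, the blocks and the joined copy coexist:

blocks = []
while n > 0:
    block = yield from self.read(n)
    if not block:
        partial = b''.join(blocks)
        raise asyncio.IncompleteReadError(partial, len(partial) + n)
    blocks.append(block)
    n -= len(block)
return b''.join(blocks)  # blocks and the joined result are both alive here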

For small 'frames' the only issue is probably some avoidable copying, but for larger 'frames' this hurts, e.g. when sending large files for processing to a machine, where the files are larger than memory / #receivers / 2.


A possible solution might be to use a bytearray to gather the data incrementally. E.g. something along the lines of:

@asyncio.coroutine
def readexactly(self, n):
    if self._exception is not None:
        raise self._exception

    if not n:
        return b''

    # Short path: everything we need is already buffered. (No EOF check
    # here: if enough data is buffered, EOF doesn't matter.)
    if len(self._buffer) >= n:
        data = self._buffer[:n]
        del self._buffer[:n]
        self._maybe_resume_transport()
        return data

    # Pre-allocate the result once; a MemoryError surfaces here, before
    # any data is consumed from the stream.
    data = bytearray(n)
    pos = 0

    while n:
        # Only wait while more data can still arrive; waiting after EOF
        # would block forever.
        if not self._buffer and not self._eof:
            yield from self._wait_for_data('readexactly')

        if not self._buffer:
            # EOF was reached before n bytes could be read.
            raise asyncio.IncompleteReadError(data[:pos], pos + n)

        available = len(self._buffer)
        if available <= n:
            data[pos:pos + available] = self._buffer
            self._buffer.clear()
            n -= available
            pos += available
        else:
            data[pos:pos + n] = self._buffer[:n]
            del self._buffer[:n]
            n = 0
        self._maybe_resume_transport()

    return data

This implementation would have a peak memory footprint of O(n); or rather, n plus whatever sits in the reader's internal _buffer at any moment. I'm not sure what the other constant factors might be, probably lots more lower in the stack.

Note that the implementation above also does away with some unnecessary copying, by duplicating some code from StreamReader.read.
Note as well that it has a short path for readexactly when the required data is already available in the _buffer of the StreamReader.

An obvious issue with the implementation above is that it returns a (mutable) bytearray instead of a bytes object. For my use case I considered this acceptable; I'm not sure whether it would be acceptable in asyncio / the standard library. I'm not aware of a copy-free mechanism to turn a bytearray into an immutable bytes-like object (like Java's ByteBuffer.asReadOnlyBuffer).
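To make the tradeoff concrete (a small sketch; neither option below gives a copy-free immutable result):

buf = bytearray(b'frame-data')

frozen = bytes(buf)     # immutable bytes, but the data is copied
view = memoryview(buf)  # zero-copy view, but writes to buf stay visible through it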

Apart from being more memory friendly in terms of 'volume', because there is less copying it's also a tad faster (around 20% on my machine; see the test script here: https://gist.github.com/frensjan/27061b47e1423dd73eeec9f5cad5f4ad).

An alternative direction could be to add a socket.recv_into-like method which accepts a mutable buffer.


I hope the issue is clear. I'm curious to get feedback on the relevance for the broader developer community and on the solution. If desirable I'll whip up a PR.

Best regards,
Frens Jan

@socketpair commented Aug 5, 2016

Yes, this needs to be fixed. Expansion of a bytearray is very fast.

@1st1 (Member) commented Aug 5, 2016

I actually noticed a speed regression in streams in 3.5.2. @socketpair would you be able to work on this?

@socketpair commented Aug 5, 2016

Yes, I can try. @1st1, can you provide the case where you noticed the performance regression? (Apart from this issue, since it already has a performance test.)

@1st1 (Member) commented Aug 5, 2016

I think I saw it when I was running this benchmark: https://github.com/MagicStack/uvloop/blob/master/examples/bench/rlserver.py -- it stresses StreamReader.readline -- it became slower in 3.5.2.

@frensjan (Author) commented Aug 5, 2016

@1st1 I just ran my own benchmark for readline again (https://gist.github.com/frensjan/27061b47e1423dd73eeec9f5cad5f4ad), but with uvloop: I saw a throughput improvement of 36% with my suggested implementation.

@socketpair note that the only bytearray that gets expanded is StreamReader._buffer. No incremental allocations are done within the suggested implementation of readexactly. I'm not familiar with the way bytearray objects expand, but I expect allocating bytearray(n) up front is better. Also, there is the added 'benefit' that a MemoryError will be raised before consuming too much from the reader.

@socketpair commented Aug 5, 2016

This test just hangs on my PC...
@frensjan can you compare the performance of your variant vs #395?

@frensjan (Author) commented Aug 9, 2016

@socketpair @1st1 I've compared master (at bbe53f3), #395, and the implementation suggested above (with https://gist.github.com/frensjan/27061b47e1423dd73eeec9f5cad5f4ad):

master:

server done,     sent 4096.0 megabytes
client done, received 4096.0 megabytes
test took 0:00:09.196442
at 0.43 gigabytes per second

#395:

server done,     sent 4096.0 megabytes
client done, received 4096.0 megabytes
test took 0:00:08.648002
at 0.46 gigabytes per second

implementation above:

server done,     sent 4096.0 megabytes
client done, received 4096.0 megabytes
test took 0:00:07.291480
at 0.55 gigabytes per second

Environment:

Python: 3.5.1
Kernel: 4.6.4-301.fc24.x86_64

Curious to hear other experiences / observations.

@methane (Member) commented Aug 9, 2016

@frensjan Your implementation seems to return bytearray, not bytes. Am I wrong?

#395 returns bytes, via bytes(memoryview(self._buffer)[:n]). (#395 (comment))

But it has a significant cost. I researched what causes the slowdown and sent a patch to Python:
http://bugs.python.org/issue27704
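(For illustration: bytes(self._buffer[:n]) would copy twice - the bytearray slice allocates, and bytes() copies again - while slicing a memoryview avoids the intermediate copy:)

buf = bytearray(b'0123456789')

two_copies = bytes(buf[:4])            # the slice copies, then bytes() copies again
one_copy = bytes(memoryview(buf)[:4])  # the view slices without copying
assert two_copies == one_copy == b'0123'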

@frensjan (Author) commented Aug 9, 2016

@methane yes, it indeed uses bytearray. It would be nice to have a means to turn a bytearray into a bytes object (or a read-only view on it), as in asReadOnlyBuffer.

Or a faster implementation of bytes(bytearray), as you suggest. However, I'm not fully confident about the interplay between StreamReader._limit and waiting until len(self._buffer) > n. If that's no issue ... (I've seen deadlock-related commits around this in asyncio, so we should probably be careful).

@frensjan (Author) commented Aug 9, 2016

@methane perhaps more importantly, the main reason I filed this issue is not performance in terms of throughput, but memory footprint. The #395 implementation uses O(2n) memory (where n is the number of bytes to be read) because of the copy with bytes.

@frensjan (Author) commented Aug 9, 2016

Finally, I'm also curious about the memory behaviour of del self._buffer[:n], which both the implementation above and #395 use. I've witnessed MemoryErrors on this, so I suspect it copies any bytes beyond n.

This may be 'catastrophic' when reading an 8-byte size field followed by gigabytes of message. Both implementations would suffer from this issue; however, #395 would suffer sooner, as its self._buffer grows until it contains n bytes, while in the implementation above self._buffer only contains the bytes received since the last time it was drained.

The above may be a reason to implement something like recv_into, where the user controls the buffer. Usage would look like e.g.:

length = decode_length((yield from reader.readexactly(4)))
frame = bytearray(length)
yield from reader.receiveinto(frame, length)
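A minimal sketch of what such a method could look like on StreamReader (hypothetical; receiveinto is not an existing asyncio API, and the loop mirrors the readexactly proposal above):

@asyncio.coroutine
def receiveinto(self, buf, n):
    # Fill the first n bytes of the caller-provided writable buffer.
    if self._exception is not None:
        raise self._exception

    pos = 0
    while pos < n:
        if not self._buffer and not self._eof:
            yield from self._wait_for_data('receiveinto')
        if not self._buffer:
            # EOF before the buffer could be filled.
            raise asyncio.IncompleteReadError(bytes(buf[:pos]), n)
        take = min(len(self._buffer), n - pos)
        buf[pos:pos + take] = self._buffer[:take]
        del self._buffer[:take]
        pos += take
        self._maybe_resume_transport()
    return n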
@methane (Member) commented Aug 9, 2016

I'm sorry, I had forgotten what you explained in the first post.

read_into and readexactly_into seem better to me.

@socketpair commented Aug 9, 2016

Yes, I think read_into and read_into_exactly is a good API.

@methane (Member) commented Aug 9, 2016

I wonder if there is a way to use socket.recv_into in _SelectorSocketTransport.

@frensjan (Author) commented Aug 14, 2016

@methane it would remove a copy, so that would be great! The code in https://github.com/python/asyncio/blob/master/asyncio/selector_events.py#L666 doesn't look that difficult, but it requires the protocol to expose a buffer, or a method that performs the actual receiving. It would actually be great if a readinto feature of streams could hook directly into _read_ready.
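For illustration, a rough sketch of how _read_ready could use recv_into if the protocol exposed a buffer (get_buffer and buffer_updated are hypothetical hooks here, not existing asyncio APIs):

def _read_ready(self):
    try:
        buf = self._protocol.get_buffer()   # protocol hands out a writable buffer
        nbytes = self._sock.recv_into(buf)  # kernel copies straight into it
    except (BlockingIOError, InterruptedError):
        return
    except Exception as exc:
        self._fatal_error(exc, 'Fatal read error on socket transport')
        return
    if nbytes:
        self._protocol.buffer_updated(nbytes)  # tell the protocol how much arrived
    else:
        # EOF; real handling would also close the transport as the
        # current implementation does.
        self._protocol.eof_received()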

@frensjan (Author) commented Aug 14, 2016

(afterburner: what a horrific idea that data is copied 3+ times before being returned from the current readexactly)

@frensjan (Author) commented Aug 14, 2016

@socketpair @methane @1st1

A first draft for readinto and readinto_exactly: TargetHolding@bc13eb7

Performance is a bit better than the existing b''.join(blocks)-based implementation, and a bit worse than the suggested implementation of readexactly above (in the original post), if buffers aren't reused.

Maybe just as important: this provides a workaround for the O(2n) memory issue of the current implementation and #395.

@methane (Member) commented Aug 15, 2016

Off topic: in Big-O notation, O(2n) == O(n), so Big-O notation is not suitable for this issue.

@methane (Member) commented Aug 15, 2016

I don't have a strong opinion on this.

Avoiding the double copy and double memory consumption is useful for relatively large data
(measurable for KBs of data, nice to have for MBs, and very important for 100s of MBs).

If asyncio adds a new API for such cases, I hope it solves both the copying and the memory consumption.
It may be possible to add interaction between readinto and the event loop later.
But I'm not sure, especially for ProactorEventLoop.

@socketpair commented Aug 15, 2016

The difference between .read() and .readinto() is only in which buffer is passed to the kernel: either an internal one (like stream._buffer) or a user-specified one (passed as an argument to readinto). So the event loop should not really see any difference. For example, libuv has an alloc callback where one may allocate a new buffer for each read operation, or pass a buffer that already exists.

@frensjan (Author) commented Aug 15, 2016

@methane yes, I shouldn't have used Big-O here; both implementations have a maximum memory footprint that is linear in the message size.

@methane indeed, this is an issue for bigger messages. I'm working on something that can ship files, possibly gigabytes in size. For smaller (byte- or KB-sized) messages the copy overhead matters less; throughput is probably dominated by other factors.

@socketpair ideally that would be the case. One could - in the readinto implementation - swap out self._buffer for the user-provided buffer, so that StreamReader.feed_data feeds into the user-provided buffer. However, the feed_data 'protocol' doesn't support reads of limited size. So one either has to deal with too much data being read into the user buffer (copying it away and keeping it around for other reads) or do something drastic ...

@methane (Member) commented Aug 15, 2016

By the time feed_data is called, one copy from kernel space to user space has already happened.
To reduce that, the EventLoop (especially the ProactorEventLoop) and Protocol APIs would have to change too.

To achieve maximum performance and efficiency, I think a Stream should be bound tightly to the EventLoop and not use a Protocol.
But I don't know whether asyncio streams should work that way or not.

If the EventLoop provided a recv_into API, third-party stream libraries could use the EventLoop directly, without a protocol.
So I feel the lower-level API is more important than adding APIs to asyncio streams and making them complicated.
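For example, a hypothetical sketch (sock_recv_into is an assumed event-loop method here, not an existing asyncio API; process stands in for application code):

import asyncio

@asyncio.coroutine
def pump(loop, sock, process):
    # Reuse one buffer across reads; the kernel writes into it directly.
    buf = bytearray(64 * 1024)
    while True:
        nbytes = yield from loop.sock_recv_into(sock, buf)  # assumed API
        if not nbytes:
            break  # connection closed
        process(memoryview(buf)[:nbytes])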
