
Use bytearray buffers in IOStream #1873

Merged
merged 8 commits into from Feb 21, 2017

Conversation

@pitrou (Contributor) commented Nov 1, 2016

Replace the deques of chunks in IOStream with simple bytearrays. This simplifies and optimizes buffer management, removing many extraneous copies.
Also allow a memoryview as write() argument, as that's useful to avoid copies in applications.

Seems to fix #1685. @mrocklin

@bdarnell (Member) left a comment

Nice, thanks for doing this.

Putting on my programmer-archaeologist hat for the record, we used to use cStringIO before we switched to the list-of-bytes approach (we were still supporting python 2.5 at the time, so we couldn't use bytearray). I had misremembered that we did that for performance (to reduce copies in and out of the cStringIO), but looking back at git logs I see that it was done to support the frozen write buffer hack. I'll be glad to see the merge_prefix code go away.

This looks like it's probably going to be a performance win, but it would be good to confirm with some benchmarks (now we always copy the data once, into the bytearray. Before we copied it a variable number of times, but I think that number was often zero). We don't have a good benchmarking suite but there's one using ab in demos/benchmark.

self._write_buffer_size = 0
self._write_buffer_frozen = False
self._pending_writes_while_frozen = []

@bdarnell (Member) commented Nov 4, 2016

Should this be a second bytearray which can be swapped in when we unfreeze?

@pitrou (Author, Contributor) commented Nov 4, 2016

I don't think so. When we unfreeze, not all the write buffer has necessarily been written (e.g. on Windows with the 128KB limit).

self._write_buffer_size += sum(map(len, self._pending_writes_while_frozen))
self._pending_writes_while_frozen[:] = []

def _got_empty_write(self, size):

@bdarnell (Member) commented Nov 4, 2016

This method needs a docstring/comment.

self._write_buffer_size -= num_bytes
# Amortized O(1) shrink

@bdarnell (Member) commented Nov 4, 2016

Comment here (and at the other occurrence of this pattern) that bytearray does this itself starting in python 3.4.
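The amortized-O(1) shrink pattern under discussion can be sketched in isolation (a minimal standalone illustration of the idea, not Tornado's actual buffer code): track a consumed-prefix index and only pay for `del buf[:pos]` once the dead prefix outweighs the live data.

```python
def consume(buf, pos, num_bytes):
    """Advance the consumed-prefix index; compact only when the dead
    prefix exceeds the remaining live data, so the total cost of
    deletions stays amortized O(1) per byte."""
    pos += num_bytes
    if pos > len(buf) - pos:  # more dead bytes than live bytes
        del buf[:pos]
        pos = 0
    return pos

buf = bytearray(b"abcdefgh")
pos = consume(buf, 0, 3)    # dead prefix (3) <= live (5): no compaction yet
assert (bytes(buf), pos) == (b"abcdefgh", 3)
pos = consume(buf, pos, 3)  # dead prefix (6) > live (2): compacted
assert (bytes(buf), pos) == (b"gh", 0)
```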

# A cleaner solution would be to set
# SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, but this is
# not yet accessible from python
# (http://bugs.python.org/issue8240)

@bdarnell (Member) commented Nov 4, 2016

That issue is closed and this comment can be updated: A cleaner solution would be to set SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER, which python does starting in version 3.4.

@pitrou (Author, Contributor) commented Nov 4, 2016

By the way, do you know how tolerant OpenSSL is with that option?
I still had "bad write retry" errors on 3.5 before fixing the freeze logic:
https://travis-ci.org/tornadoweb/tornado/jobs/172421214#L1284

@pitrou (Author, Contributor) commented Nov 4, 2016

demos/benchmark/benchmark.py shows basically no change here, at about 1700 req/s.

I've got another small benchmark which shows good results. Server:

import os

from tornado.tcpserver import TCPServer
from tornado.ioloop import IOLoop
from tornado import gen

N = 100000000  # 100M

data = os.urandom(N // 100) * 100

sentinel = b'--my-sentinel--'
packet = data + sentinel


class MyServer(TCPServer):
    @gen.coroutine
    def handle_stream(self, stream, address):
        for i in range(3):
            yield stream.write(packet)
            msg = yield stream.read_until(sentinel)
            assert len(msg) == len(packet), (len(msg), len(packet))

s = MyServer()
s.listen(8000)

IOLoop.current().start()

Client:

from tornado.tcpclient import TCPClient
from tornado.ioloop import IOLoop
from tornado import gen

sentinel = b'--my-sentinel--'

@gen.coroutine
def f():
    client = TCPClient()
    stream = yield client.connect('localhost', 8000,
                                  max_buffer_size=int(1e9))
    for i in range(3):
        msg = yield stream.read_until(sentinel)
        print(len(msg))
        yield stream.write(msg)

if __name__ == '__main__':
    IOLoop().run_sync(f)

On master (Python 3.5):

$ time client.py 
[...]
real    0m3.094s
user    0m1.360s
sys 0m0.304s

With this PR (Python 3.5):

$ time client.py 
[...]
real    0m1.301s
user    0m0.628s
sys 0m0.272s

On master (Python 2.7):

$ time client.py 
[...]
real    0m2.346s
user    0m1.060s
sys 0m0.280s

With this PR (Python 2.7):

$ time client.py 
[...]
real    0m1.069s
user    0m0.524s
sys 0m0.232s

@mrocklin (Contributor) commented Nov 4, 2016

What would need to happen for this to work with zero copy? It looks like appending a memoryview onto a bytearray still involves a copy. This isn't a significant priority for me at the moment, but I can imagine it becoming one, particularly when moving the buffers of large numpy arrays around on fast networks.
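For what it's worth, the copy is easy to confirm outside Tornado (a standalone check, nothing here is Tornado API): mutate the source after appending and observe that the bytearray is unaffected.

```python
import array

src = array.array('b', [1, 2, 3, 4])
mv = memoryview(src)

buf = bytearray()
buf += mv           # the bytes are copied into buf here

src[0] = 99         # mutate the source after appending
assert buf[0] == 1  # buf holds its own copy, unaffected by the mutation
assert mv[0] == 99  # the memoryview still tracks the source
```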

@bdarnell (Member) commented Nov 4, 2016

I think using a bytearray like this will always imply at least one copy. To get down to zero copies you'd need to use something like a list of memoryviews (so something more like #1691 than this one).
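One hypothetical shape for that (all names invented, not Tornado's API; uses `socket.sendmsg` scatter-gather, so Unix-only): keep a deque of memoryviews and hand them to the kernel without ever concatenating them.

```python
import socket
from collections import deque

class ZeroCopyWriteBuffer:
    """Illustrative sketch only: queued chunks are wrapped in memoryviews
    and written with sendmsg(), which gathers them in the kernel."""

    def __init__(self):
        self._views = deque()

    def write(self, chunk):
        self._views.append(memoryview(chunk))  # reference, no copy

    def flush_to(self, sock):
        while self._views:
            sent = sock.sendmsg(list(self._views))
            # Drop fully sent views; trim a partially sent head view.
            while sent:
                head = self._views[0]
                if sent >= len(head):
                    sent -= len(head)
                    self._views.popleft()
                else:
                    self._views[0] = head[sent:]
                    sent = 0

a, b = socket.socketpair()
wb = ZeroCopyWriteBuffer()
wb.write(b"hello ")
wb.write(b"world")
wb.flush_to(a)
assert b.recv(64) == b"hello world"
```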

@mrocklin (Contributor) commented Nov 4, 2016

OK, for what it's worth I'm not at the point where zero-copy is a big deal for any active use case. I may never get there.

@pitrou (Author, Contributor) commented Nov 8, 2016

I think I've addressed the review comments.

@bdarnell (Member) commented Nov 20, 2016

I've been hesitating about merging this because I'm concerned that in the future we might decide we want to get to zero copies and we have to bring back all the list-of-chunks stuff. But now I think that doesn't have to be true: We only really care about zero-copy in the fast path, where all writes are small enough to be written in one system call, and where reads can be passed directly to a waiting callback. If we're accumulating multiple chunks, it's probably fine to copy them into a bytearray instead of building up a list of chunks that reference the original input data.

Therefore I think we could merge this, throw away the chunking logic, and at some point in the future make some tweaks to keep the data out of the bytearray in the fast path. @pitrou and @mrocklin does that sound reasonable? (And thank you for your patience!)

@mrocklin (Contributor) commented Nov 20, 2016

If this cleans up internal logic then I agree that it should be merged. I would like to see true zero copy, but I don't yet think that it significantly impacts my performance.

We only really care about zero-copy in the fast path, where all writes are small enough to be written in one system call

Can you elaborate more on this? Why is zero-copy less relevant for larger writes?

As an example application, I might want to write a memoryview from a numpy array to a socket:

In [1]: import numpy as np
In [2]: mv = np.ones(1000000).data
In [3]: mv
Out[3]: <memory at 0x7f62a9341588>

I believe that I can pull off chunks of this memoryview without creating copies

In [4]: %time mv[1:]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 6.44 µs
Out[4]: <memory at 0x7f62a9341708>

This case of large numpy arrays is also, I think, the only case where I'm likely to be affected by performance here.

@bdarnell (Member) commented Nov 20, 2016

Zero-copy is less relevant for larger writes because larger writes can always be broken up by the application. Instead of stream.write(giant_memoryview), you could do

while giant_memoryview:
    await stream.write(giant_memoryview[:chunk_size])
    giant_memoryview = giant_memoryview[chunk_size:]

By waiting for the previous write to finish, you make it possible for the next chunk to take the fast path too (of course, it's not obvious that this three-line python loop will be faster than the copy would be...)

I'm picturing logic something like this (which would actually handle the numpy memoryview just fine, as long as you wait for one memoryview to finish before writing another):

def write(self, chunk):
    if not self.write_buffer:
        self.write_buffer = chunk
    else:
        if not isinstance(self.write_buffer, bytearray):
            self.write_buffer = bytearray(self.write_buffer)
        self.write_buffer += chunk
    bytes_sent = self.sock.send(self.write_buffer)
    # keep the unsent remainder, not the sent prefix
    self.write_buffer = memoryview(self.write_buffer)[bytes_sent:]

@pitrou (Author, Contributor) commented Nov 21, 2016

I've been hesitating about merging this because I'm concerned that in the future we might decide we want to get to zero copies and we have to bring back all the list-of-chunks stuff.

That's a reasonable concern. I would answer that copies are fast, but of course that depends: if you're sending a 1GB piece of data, then they're not that fast anymore.

One possible design would be a hybrid queue/bytearray, roughly:

BIG = 10 * 1024**2  # 10 MB

def write(self, chunk):
    if len(chunk) >= BIG or len(self.queue[-1]) >= BIG:
        self.queue.append(chunk)
    else:
        self.queue[-1].extend(chunk)

By the way, perhaps the SSL "frozen buffer" kludge can be pushed entirely into SSLIOStream...
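For concreteness, that hybrid sketch can be made self-contained (an illustrative version, not a proposed Tornado API; the empty-queue guard and the memoryview wrapping of big chunks are additions of mine):

```python
BIG = 10 * 1024**2  # 10 MB: chunks at least this large are kept by reference

class HybridWriteBuffer:
    """Small chunks coalesce into the trailing bytearray; big chunks
    are queued as memoryviews so they are never copied."""

    def __init__(self):
        self.queue = []

    def write(self, chunk):
        if len(chunk) >= BIG:
            self.queue.append(memoryview(chunk))  # zero-copy reference
        elif (not self.queue
              or not isinstance(self.queue[-1], bytearray)
              or len(self.queue[-1]) >= BIG):
            self.queue.append(bytearray(chunk))   # start a new small-chunk run
        else:
            self.queue[-1].extend(chunk)          # coalesce into the tail
```

Small writes then cost one copy into the tail bytearray, while a very large write is just an O(1) append of a view.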

@mrocklin (Contributor) commented Nov 21, 2016

For context, I generally assume something like 5GB/s memory copy/allocation speed and 100MB/s to 1GB/s network connection speed. Under these parameters memcopies are indeed fast and I agree that they can be ignored.

There are some people starting to use Dask and therefore Tornado on more serious network interconnects (large HPC clusters at DOE labs). If this occurs, and if we remove all other bottlenecks, then I can imagine that the memcopy bandwidth would become important. That's a lot of ifs though, too many for me to think that this is worth optimizing for. If we start moving data around a cluster at memcopy speed then we're probably doing pretty well.

@pitrou (Author, Contributor) commented Nov 21, 2016

Yes. Memory copies can have adverse effects, though, such as increasing memory consumption ;-)

@bdarnell (Member) commented Feb 21, 2017

Merging this for Tornado 4.5. Thanks @pitrou and @mrocklin for your contributions and patience!

@bdarnell bdarnell merged commit d32446d into tornadoweb:master Feb 21, 2017

1 of 2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build failed
continuous-integration/appveyor/pr: AppVeyor build succeeded