Optimize BytesIO to do less reallocations when written, similarly to StringIO #59586
Comments
From this python-dev thread (http://mail.python.org/pipermail/python-dev/2012-July/120981.html): "BytesIO is actually missing an optimisation that is already used in [StringIO]. It should be optimised to work the same way StringIO does (which is [...])"
This optimization for StringIO was done in issue bpo-13149.
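For illustration, here is a rough pure-Python sketch of the kind of write-accumulation strategy being referred to: keep written chunks in a list and join them only when the whole value is needed. This is my own toy model of the idea, not the real C implementation in Modules/_io, and the names (ChunkedBytesWriter) are hypothetical.

```python
# Hypothetical model of "delayed join" writing; not the actual C code.
class ChunkedBytesWriter:
    """Accumulate written chunks and join them lazily on getvalue()."""

    def __init__(self):
        self._chunks = []      # pending chunks, appended in O(1)
        self._value = b""      # materialized value, rebuilt only on demand

    def write(self, data: bytes) -> int:
        self._chunks.append(data)
        return len(data)

    def getvalue(self) -> bytes:
        if self._chunks:
            # One big allocation instead of repeated realloc()-style growth.
            self._value += b"".join(self._chunks)
            self._chunks.clear()
        return self._value

w = ChunkedBytesWriter()
for chunk in (b"a" * 10, b"bb" * 5, b"ccc" * 5):
    w.write(chunk)
print(len(w.getvalue()))   # 35
```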
I do not see that BytesIO is slower than StringIO (both are about 30% slower than append/join).

$ ./python -m timeit -s "import io; d=[b'a'*10,b'bb'*5,b'ccc'*5]*1000" "b=[]; w=b.append" "for x in d: w(x)" "b''.join(b)"
1000 loops, best of 3: 966 usec per loop
$ ./python -m timeit -s "import io; d=['a'*10,'bb'*5,'ccc'*5]*1000" "b=[]; w=b.append" "for x in d: w(x)" "''.join(b)"
1000 loops, best of 3: 918 usec per loop
$ ./python -m timeit -s "import io; d=[b'a'*10,b'bb'*5,b'ccc'*5]*1000" "b=io.BytesIO(); w=b.write" "for x in d: w(x)" "b.getvalue()"
1000 loops, best of 3: 1.22 msec per loop
$ ./python -m timeit -s "import io; d=['a'*10,'bb'*5,'ccc'*5]*1000" "b=io.StringIO(); w=b.write" "for x in d: w(x)" "b.getvalue()"
1000 loops, best of 3: 1.24 msec per loop
I wonder if this is a fair comparison, Serhiy. Strings are unicode underneath, so they have a large overhead per string (more data to copy around). Increasing the length of the strings changes the game, because due to PEP 393 the overhead for ASCII-only Unicode strings is constant:

>>> import sys
>>> sys.getsizeof('a')
50
>>> sys.getsizeof(b'a')
34
>>> sys.getsizeof('a' * 1000)
1049
>>> sys.getsizeof(b'a' * 1000)
1033

When re-running your tests with larger chunks, the results are quite interesting:

$ ./python -m timeit -s "import io; d=[b'a'*100,b'bb'*50,b'ccc'*50]*1000" "b=io.BytesIO(); w=b.write" "for x in d: w(x)" "b.getvalue()"
1000 loops, best of 3: 509 usec per loop
$ ./python -m timeit -s "import io; d=['a'*100,'bb'*50,'ccc'*50]*1000" "s=io.StringIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
1000 loops, best of 3: 282 usec per loop

So, it seems to me that BytesIO could use some optimization!
I'd like to take a shot at this, if Antoine doesn't mind. I'll prepare a patch for bytesio.c. Question: what set of benchmarks would it be good to run to make sure this doesn't degrade the performance of BytesIO in various cases?
I just used your examples.
Well, now I see that BytesIO is slower than StringIO.
Here is a preliminary version of the patch. I am not sure that it is fully correct. Microbenchmark results:

$ ./python -m timeit -s "import io; n=100; d=['a'*n,'bb'*n,'ccc'*n]*10000" "s=io.StringIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
10 loops, best of 3: 25.5 msec per loop
$ ./python -m timeit -s "import io; n=100; d=[b'a'*n,b'bb'*n,b'ccc'*n]*10000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
10 loops, best of 3: 39.9 msec per loop
$ ./python-patched -m timeit -s "import io; n=100; d=[b'a'*n,b'bb'*n,b'ccc'*n]*10000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
10 loops, best of 3: 26.1 msec per loop
$ ./python -m timeit -s "import io; n=1000; d=['a'*n,'bb'*n,'ccc'*n]*1000" "s=io.StringIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
100 loops, best of 3: 12.1 msec per loop
$ ./python -m timeit -s "import io; n=1000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
10 loops, best of 3: 26.5 msec per loop
$ ./python-patched -m timeit -s "import io; n=1000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
100 loops, best of 3: 13.6 msec per loop
I don't understand the purpose of your patch. It just replaces a direct realloc() call with an indirect one (since _PyBytes_Resize() will call realloc() internally).
Ok, I understand. You're trying to make the getvalue() call cheaper [...]
Yes, it saves up to half of the time on large data (on Linux). It would be [...]
There seems to be a problem with the patch: when you store the getvalue() result somewhere (instead of discarding it), things get much slower:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
1000 loops, best of 3: 913 usec per loop
$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "global y; y = s.getvalue()"
100 loops, best of 3: 4.67 msec per loop

This does not happen without the patch.
Under Windows (64-bit Windows 7 on a VirtualBox VM), the patch increases performance slightly but not as much as under Linux:

-> before patch:
C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'* [...]

-> after patch:
C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'* [...]

And the join() approach is 10x faster (!):
C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'* [...]

... which points to a much less optimized realloc() under Windows.
Updated the patch: fixed some errors, optimized initialization, added checks and comments. I think the patch is now ready for review.
Does this problem still exist with the new patch?
Thank you, Antoine. This is an expected result.
To be fair, the test body should be: "s=[]; w=s.append" "for x in d: w(x)" "b''.join(s)"
Maybe tuning the resize strategy (overallocating by 1/4, 1/2, or 100% instead of 1/8) can help.
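To make that trade-off concrete, here is a toy model (my own illustration with a hypothetical count_resizes() helper, not the bytesio.c logic) of how the overallocation factor affects the number of buffer reallocations when data is written in fixed-size chunks:

```python
# Hypothetical model: count how many times the buffer must grow when writing
# `total` bytes in `chunk`-byte pieces, overallocating by `factor` each time.
def count_resizes(total, chunk, factor):
    allocated = 0
    size = 0
    resizes = 0
    while size < total:
        size += chunk
        if size > allocated:
            # Reserve `factor` extra space beyond what is needed right now,
            # e.g. 1/8, 1/4, 1/2 or 100% of the required size.
            allocated = size + int(size * factor)
            resizes += 1
    return resizes

for factor in (0.0, 1/8, 1/4, 1/2, 1.0):
    print(f"factor={factor:<6} resizes={count_resizes(6_000_000, 2000, factor)}")
```

Larger factors mean fewer realloc() calls at the cost of more slack memory, which is essentially the tuning question raised above.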
I think you can run the benchmark yourself (I ran it under Linux).
I don't see any difference.
Well, with the latest patch I get:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
1000 loops, best of 3: 982 usec per loop
$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "global y; y = s.getvalue()"
100 loops, best of 3: 4.79 msec per loop
Hmm. Without the ability to reproduce this effect, it will be difficult for me to [...]
It's under 64-bit Linux, Intel Core i5 CPU. Are you sure you're testing [...]? That said, the numbers under Windows suggest to me that Eli's original idea [...]
By the way, here is Matt Mackall's take on realloc(): http://www.selenic.com/pipermail/mercurial-devel/2011-October/034988.html
StringIO:
$ ./python -m timeit -s "import io; n=2000; d=['a'*n,'bb'*n,'ccc'*n]*1000" "s=io.StringIO(); w=s.write" "for x in d: w(x)" "global y; y = s.getvalue()"
-> Linux: 1000 loops, best of 3: 985 usec per loop

Unpatched BytesIO:
$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "global y; y = s.getvalue()"
-> Linux: 100 loops, best of 3: 2.44 msec per loop

b''.join():
$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "l = list(d)" "global y; y = b''.join(l)"
-> Linux: 1000 loops, best of 3: 821 usec per loop

bytearray:
$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "b = bytearray()" "for x in d: b += x" "global y; y = b"
-> Linux: 1000 loops, best of 3: 834 usec per loop
I use 32-bit Linux.
I agree, it would be a more robust but much more complex solution. I will try to implement this approach too. Victor Stinner in his experiments with formatting preferred [...]
However, even by itself the patch deserves attention. It not only significantly speeds up writing under Linux, but also speeds up initialization, which leads to faster reading.

./python -m timeit -s "import io; d=(b'a'*99+b'\n')*10000" "s=io.BytesIO(d); r=s.readline" "while r(): pass"
Unpatched: 100 loops, best of 3: 5.92 msec per loop

I will try to preserve this advantage.
Note that a list also uses realloc() internally, and therefore you get the same behavior with a different constant factor. Actually, appending to a list less than sizeof(void*) bytes at a [...]
Thank you for the measurements; this shows the scale of the issue.
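The list growth pattern mentioned here can be observed directly from Python. This small illustration is my own, not part of the original discussion; it shows that a list's allocation also grows in occasional realloc()-backed jumps rather than on every append:

```python
import sys

# Print the list's reported size only when it changes: the jumps correspond
# to CPython overallocating the underlying pointer array.
lst = []
prev = None
for i in range(64):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != prev:
        print(f"len={len(lst):3d}  sizeof={size} bytes")
        prev = size
```

Each list slot also costs one pointer (sizeof(void*)), which is the per-chunk overhead alluded to above when appending many small byte strings.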
See also issue bpo-15612 for a possible optimization of StringIO (use _PyUnicodeWriter).
The updated patch uses a larger overallocation factor (1/2 instead of 1/8), which may increase speed on Windows. Fixed the implementation of __sizeof__() and some minor bugs.
Only one week is left before feature freeze. Please benchmark this patch on Windows.
This issue is an optimization, not a new feature.
Synchronized with tip. Unfortunately there are no common points with the recent bpo-22003 changes, so they should be reverted (except tests).
What should be done to move this issue forward?
Hey Serhiy,
The implementation of your readline optimization seems less contentious (and less risky) than the remainder of the patch; it could perhaps be easily split off into a separate patch, which may be far more easily committed.
I love the concept of this patch, although from my last reading (weeks ago), it's slightly scary that it relies on Py_REFCNT() to know whether to mutate a string or not. In principle this should never break; in practice, however, it is uncertain that there are no strange edge cases that aren't immediately obvious. The _PyBytes_Resize() doc is quite clear: "Only use this to build up a brand new bytes object; don't use this if the bytes may already be known in other parts of the code"
It wouldn't be the first time we've used Py_REFCNT that way. We do it with tuples and normal strings as well.
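For readers unfamiliar with the trick: an object referenced from exactly one place can be resized or mutated in place without anyone else observing the change, which is what the Py_REFCNT() check in the patch relies on. A rough CPython-only illustration of "is this object uniquely referenced?" from the Python level (my own example, not code from the patch):

```python
import sys

# A freshly created, un-shared bytes object (built at runtime so the compiler
# does not fold it into a cached constant).
buf = bytes(16)
# sys.getrefcount() includes the temporary reference held by its own argument,
# so a result of 2 means there is exactly one real reference: the name "buf".
print(sys.getrefcount(buf))   # typically 2: nobody else can see this buffer
alias = buf
print(sys.getrefcount(buf))   # 3: a second reference exists, so in-place resizing would be visible
```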
Here is a patch which optimizes the readline() and readlines() methods of BytesIO and the __next__() method of the BytesIO iterator.
I suppose this takes advantage of the libc's optimized memchr(). Any benchmarks?
Benchmark from bpo-22003:
$ ./python0 -m timeit -s 'import i' 'i.readlines()'
Before patch: 10 loops, best of 3: 36.1 msec per loop
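As a side note (my own sketch, not the patch itself): the speed-up comes from letting a C-level scan locate the newline instead of examining bytes one at a time in a loop. bytes.find() with a one-byte needle uses such an optimized (memchr-style) search, so readline-style splitting over an in-memory buffer can be modelled like this:

```python
# Minimal readline-style iteration over a bytes buffer.  bytes.find() locates
# the b"\n" delimiter with a fast C-level scan; the final chunk without a
# trailing newline is yielded as-is.
def iter_lines(buf: bytes):
    pos = 0
    while pos < len(buf):
        nl = buf.find(b"\n", pos)
        end = len(buf) if nl < 0 else nl + 1
        yield buf[pos:end]
        pos = end

data = (b"a" * 99 + b"\n") * 10000
print(sum(1 for _ in iter_lines(data)))   # 10000
```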
New changeset 601045ceff94 by Serhiy Storchaka in branch 'default':
Updated the patch to remove the just-committed line reading optimization and the _PySys_GetSizeOf() function, which was committed in bpo-22193.
Can you post any benchmarks?
I posted benchmarks two years ago, in msg165795. Here are updated results:

$ ./python -m timeit -s "import io; n=100; d=[b'a'*n,b'bb'*n,b'ccc'*n]*10000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
Before patch: 10 loops, best of 3: 42.3 msec per loop
$ ./python -m timeit -s "import io; n=1000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000" "s=io.BytesIO(); w=s.write" "for x in d: w(x)" "s.getvalue()"
Before patch: 10 loops, best of 3: 28.7 msec per loop

The results don't depend on the resizing factor on Linux. I increased it in the hope that it will help on Windows.
Ping.
The updated patch addresses Antoine's comment.
New changeset 2e29d54843a4 by Serhiy Storchaka in branch 'default':
Some buildbots crash: http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/11152/steps/test/logs/stdio

E.g.:
Fatal Python error: Objects/frameobject.c:429 object at 0x406dd878 has negative ref count -5
Current thread 0x40525ce0 (most recent call first):

Trying to reproduce locally and investigate.
New changeset 3fdbdf4d2ac7 by Serhiy Storchaka in branch 'default':
New changeset ea33b61cac74 by Serhiy Storchaka in branch 'default':
Can this be closed now?
No, I'm working on a more advanced patch.
Oh, Berker, please stop shaking up the bug tracker.
@serhiy-storchaka are you still planning to work on this? |