Performance for small reads and fix seek problem #44677
This patch improves read performance of the gzip module. We have seen improvements from 20% for reading big blocks up to a factor of 50 for reading 4-byte blocks. Additionally, this patch removes the need to seek within the file, which allows using streams like stdin as input.

Details: The whole read()/readline() stack got rewritten. The core part is a new input buffer. It consists of a list of strings (self.buffer), an offset of what has already been consumed from the first string (self.pos), and the length of the (still consumable) buffer (self.bufferlen). This makes adding to and removing from the buffer cheap. It turned out that removing from the old buffer was ruining performance, since reading one byte required copying the whole buffer; reading a 2k buffer in single bytes copied 2MB. readline() no longer uses read() but works directly on the buffer, which removes a whole layer of copying strings together.

To remove the need for seeking, a new read-only file-like class is used (PaddedFile). It just prepends a string to a file and uses the file's read method once the string is used up.

There is probably still some room for tweaking when it comes to buffer sizes, as we kept this simple. But the large performance improvements show that we can't be off by much. Performance test program and results are attached.
File Added: test_gzip2.py
File Added: results.txt
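The chunk-list buffer described above can be sketched as follows. This is a minimal illustration, not the patch itself; the class name ChunkBuffer and the add() method are assumed for the sketch, while self.buffer, self.pos and self.bufferlen mirror the attributes named in the description (shown here with bytes, as the modern module would use):

```python
class ChunkBuffer:
    """Sketch of a read buffer kept as a list of chunks, so that
    appending new data and consuming a few bytes are both cheap."""

    def __init__(self):
        self.buffer = []      # list of pending byte chunks
        self.pos = 0          # bytes already consumed from buffer[0]
        self.bufferlen = 0    # total unconsumed bytes across all chunks

    def add(self, data):
        # Appending a chunk never copies existing buffered data.
        if data:
            self.buffer.append(data)
            self.bufferlen += len(data)

    def read(self, size):
        # Consume up to `size` bytes; only the bytes returned are copied,
        # instead of rebuilding the whole buffer on every small read.
        size = min(size, self.bufferlen)
        parts = []
        while size > 0:
            chunk = self.buffer[0]
            take = min(size, len(chunk) - self.pos)
            parts.append(chunk[self.pos:self.pos + take])
            self.pos += take
            self.bufferlen -= take
            size -= take
            if self.pos == len(chunk):
                del self.buffer[0]
                self.pos = 0
        return b"".join(parts)
```

The key point is that a read of one byte touches only the first chunk, rather than copying the entire buffered tail as the old implementation did.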
Added a minor improvement to the unittest: check for newlines in the readline() test code.
The patch looks good, and the tests pass. Can you add a test that ensures a seek() method is no longer necessary for the underlying stream? (Closed bpo-914340, which provided a similar patch, as a duplicate.)
I added checks to test_readline() and test_append(). I left the other read test untouched to keep some filename=filename coverage.

BTW: I really forgot to give special thanks to Phil Knirsch, who wrote the initial read() implementation and the performance test cases, did a lot of tweaking and testing, and without whom this patch would never have existed.
Excuse me, but I can't apply the patch. I have Windows XP without any SP, and I tried the command:

    patch -u gzip.py gzip.py.diff
Hm, works on Linux.
I applied the patch by hand; I think the problem is simply that it's for Python 2.6 (I have the stable 2.5 version). Anyway, as I wrote for an old similar patch by another user, the patch starts reading the gzip header at the current position, not at the start of the file. You can see it by trying this piece of code:

    import urllib2
    import array
    import gzip

    urlfile = urllib2.urlopen(someurl)
    header = array.array("B")
    header.fromstring(urlfile.read(1))  # consume the first byte of the stream
    gzfile = gzip.GzipFile(fileobj=urlfile)
    print gzfile.read()

Error:

I don't know how you can solve this without seek(). Anyway, if you are interested, I created the diff for Python 2.5:
There is a simple solution for that problem: DON'T! If you pass a file object to the GzipFile class, it is your responsibility to ensure that the file is a gzip file and that the file pointer is at the right position.

After a short look at the original code, I don't think it can cope with the use case you brought up either. I'd even argue that seeking to the beginning of the file is broken behaviour. What if the gzip content doesn't start at the beginning of the file? In fact, prepending some header before compressed data is quite common. If the file you were reading really had such a header, your example would have just worked.
Indeed, the problem is that seek() is not implemented for urllib objects. It's not a bug in your patch; sorry if I explained it badly.
This patch is awesome. It makes it possible to do this with http:

    >>> conn = httplib.HTTPConnection('blah')
    >>> req = conn.request('GET', 'a.gz')
    >>> resp = conn.getresponse()
    >>> unzipper = gzip.GzipFile(fileobj=resp)
    >>> # read just the first 100 lines
    >>> for i in range(100):
    ...     print unzipper.readline()
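On a Python 3 gzip module that no longer requires seek() on the underlying stream (as this patch provides), the same point can be demonstrated without a network connection. The NonSeekable class below is a hypothetical stand-in for an HTTP response object, fabricated for the sketch; everything else is the stdlib:

```python
import gzip
import io

class NonSeekable(io.RawIOBase):
    """Wraps a byte string as a read-only stream with no seek support,
    imitating a socket-backed response object."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        return self._inner.readinto(b)

    def seekable(self):
        return False

# Compress some line-oriented data, then stream-decompress it line by line
# without ever calling seek() on the source.
payload = b"\n".join(b"line %d" % i for i in range(200))
stream = NonSeekable(gzip.compress(payload))
unzipper = gzip.GzipFile(fileobj=stream)
first = [unzipper.readline() for _ in range(3)]
```

If GzipFile still insisted on seeking (as the pre-patch implementation did), constructing or reading from `unzipper` would fail on this stream.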
Are compatibility concerns the main reason this patch has yet to be applied? If so, could we allay those fears by keeping the default seek-y behavior and only introducing the new seek-less style when a 'noseek' flag is passed?
There are no compatibility concerns I am aware of. The new implementation no longer needs a seek()able file; of course, an implemented seek() method won't hurt anyone. The additional tests are only there to point out the problems of the old implementation. So no flag is needed to maintain compatibility.

The patch just has to be reviewed and then applied. If there are any concerns or questions, I'll be glad to assist.

Florian
The patch could only be applied to 3.2 (2.7 is frozen now). But the gzip code has changed quite a bit, and I would advocate creating a new patch if you are interested.
As there has been a lot of support for the attached patch, what is needed to take this forward?
Read my 2010-06-17 message above. Someone needs to update the patch for 3.2, and preferably show that it still brings an improvement (small micro-benchmarks are enough).
I updated the performance script to Py3. You still need to change the 'import gzipnew' line to actually load the modified module; right now it just compares the stdlib gzip module to itself.
Attached is the result of a run with the stdlib gzip module only. The results indicate that performance is still as bad as on Python 2. The Python 3 gzip module also still makes use of tell() and seek(). So both arguments for including this patch are still valid. Porting the patch will involve some real work to get the bytes vs. str split right.
Performance is easily improved by wrapping the file object in a … (the attached results for the 1-, 4- and 16-byte text block tests show the improvement). Still, fixing the seek()/tell() issue would be nice.
Stupid me! I ran the tests against my system's gzip version (Py 3.1). The performance issue is basically fixed by rev 77289. Performance is even a bit better than my original patch, by maybe 10-20%. The only test case where it performs worse is the random 10485760-byte block test. I don't know if that is worth bothering about; maybe increasing the maximum chunk size improves it, but I didn't try that out yet.

WRT seeking: I now have two patches that eliminate the need for seek() during normal operation (rewind() obviously still needs seek()). Both are based on the PaddedFile class. The first patch creates a PaddedFile object only while switching from an old to a new member, while the second wraps the fileobj all the time. Performance tests show that wrapping is cheap. The first patch is a bit ugly, while the second requires an implementation of seek() and may create problems if new methods of the fileobj are used that could interfere with the PaddedFile's internals. So I leave the choice of which one is preferred to the module owner.

The patch creates another problem which is not yet fixed: the implementation of seekable() becomes wrong. As one can now use non-seekable files, the implementation should check whether the file object used for reading is really seekable. As this is my first Py3k work, I'd prefer that this be solved by someone else (it should be pretty easy).
Thank you very much! I have kept the second approach (use PaddedFile at all times), since it is more regular and minimizes the probability of borderline cases.