GZipFile.readline too slow #51720
It's not a big problem because we can just shell out to zcat, but compare:

$ time python ./b.py >/dev/null
real    0m10.977s
$ time python ./a.py >/dev/null
real    1m19.015s
$ # Notice that the gzip module (a.py) had the benefit of the files
$ # being in a disk cache by then too...

$ cat a.py
import gzip
import os

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    for line in f:
        print line

$ cat b.py
import os
import subprocess
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    p = subprocess.Popen(["zcat", os.path.join(apt_cache_dir, apt_cache_file)],
                         stdout=subprocess.PIPE)
    f = StringIO(p.communicate()[0])
    assert p.returncode == 0
    for line in f:
        print line

Also tried this one just for "completeness":

$ cat c.py
import gzip
import os
from cStringIO import StringIO

apt_cache_dir = "/var/cache/apt/apt-file"
for apt_cache_file in os.listdir(apt_cache_dir):
    if not apt_cache_file.endswith(".gz"):
        continue
    f = gzip.open(os.path.join(apt_cache_dir, apt_cache_file))
    f = StringIO(f.read())
    for line in f:
        print line

But after it had run (with some thrashing) for 3 and a half minutes I killed it.
How does the following compare?

f = gzip.open(...)
s = f.read()
for line in s.splitlines():
    print line
(GZipFile.readline() is implemented in pure Python, which explains why it is so slow.)
Hope this reply works right, the python bug interface is a bit confusing.

I tried the splitlines() version you suggested; it thrashed my machine as well.

It's not just a GzipFile.readline() issue either: c.py calls .read() and runs into the same problem.
This sounds very weird. How much memory do you have, and how large are the files?
The gz in question is 17mb compressed and 247mb uncompressed. Calling splitlines() on that much data at once is what causes the thrashing. The machine has 300mb free RAM from a total of 1024mb.

It's not a new issue; I didn't find it when searching the python bug tracker, but see for example:
http://www.google.com/search?q=python+gzip+slow+zcat

It's an issue which people have stumbled into before, but nobody seems to have followed up on it.

It's a minor point since it's so easy to work around; in fact I already have my workaround in place.
That would be the explanation. Reading the whole file at once and then splitting it at least doubles the memory consumption.

Doing repeated calls to splitlines() on chunks of limited size (say 1MB) would avoid that.
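For illustration, a minimal sketch of that chunked approach (the chunk size, the file name, and the carry-over handling for lines that span chunk boundaries are assumptions, not part of the suggestion above):

import gzip

CHUNK_SIZE = 1024 * 1024  # 1MB per read keeps memory use bounded

def iter_lines(path):
    f = gzip.open(path)
    leftover = ""
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        lines = (leftover + chunk).splitlines(True)  # True keeps the "\n"
        # The last element may be an incomplete line; carry it over.
        if lines and not lines[-1].endswith("\n"):
            leftover = lines.pop()
        else:
            leftover = ""
        for line in lines:
            yield line
    if leftover:  # file did not end with a newline
        yield leftover
    f.close()

for line in iter_lines("/var/cache/apt/apt-file/example.gz"):
    print line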
Yes, subprocess works fine and was the quickest workaround to implement.

How can I put this without being an ass? Hell, I'm no good at diplomacy, so I'll just say it: this is slow enough to count as a real wart. You can of course tell me to STFU, but look around the web; I won't be the last person to trip over this.
Well, let's say it is suboptimal. But it's certainly ok if you don't have huge amounts of data to read.
I tried passing a size to readline() to see if increasing the chunk helps, but it makes little difference.

I profiled it, and function call overhead seems to be the real killer, accounting for some 30% of the runtime.

There doesn't seem to be any way to speed this up without rewriting the readline() machinery.
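As a sketch of how one might reproduce such a profile (the test file name is made up):

import cProfile
import gzip

def consume(path):
    # Iterate lines the slow way, through GzipFile.readline()
    f = gzip.open(path)
    for line in f:
        pass
    f.close()

# Sorting by internal time shows where the per-call overhead goes.
cProfile.run('consume("test.gz")', sort="time")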
First patch, please forgive long comment :)

I submit a small patch which speeds up readline() on my data set. The speedup is 350%.

Source of slowness is that the (~20KB) extrabuf is allocated/deallocated on every call. In the patch, read() returns a slice from extrabuf and defers the expensive buffer manipulation.

In the following, the first timeit() corresponds to reading a slice from extrabuf; the second corresponds to the repeated slicing and concatenation done by the original code:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215
>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082

Another speedup is achieved by doing a small shortcut in readline() for the common case.

The patch only addresses the typical case of calling readline() with no size argument.
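A self-contained illustration of the idea (this is not the patch's code; the names merely mimic the gzip module's internal buffer bookkeeping):

class SlicingBuffer(object):
    # Keep one buffer plus an offset, and make read() a cheap slice
    # instead of rebuilding the buffer on every call (the slow timeit
    # case above).
    def __init__(self, data):
        self.extrabuf = data   # decompressed-but-unread data
        self.offset = 0        # how much of extrabuf was handed out
    def read(self, size):
        chunk = self.extrabuf[self.offset: self.offset + size]
        self.offset += size
        return chunk

buf = SlicingBuffer("line one\nline two\n")
print buf.read(5)   # prints "line "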
Ah, a patch. Now we're talking :)
The patch doesn't apply against the SVN trunk (some parts are rejected). (For info about our SVN repository, see the developer documentation.)
Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the noise.
I confirm that the patch gives good speedups. It would be nice if there were comments explaining the new buffer bookkeeping.

In 3.x, a better but more involved approach would be to rewrite the gzip module to take advantage of the io module's buffered classes.
Right, using the io module makes GzipFile as fast as zcat.

I submit a new patch, this time for Python 2.7; however, it is not a small change. GzipFile is now derived from io.BufferedRandom, and as a result the buffered readline() implementation comes from the io module.
Thanks for the new patch. The problem with inheriting from io.BufferedRandom is that it ties GzipFile's public behaviour to that class.

I think the solution would be to use delegation rather than inheritance:

def __init__(self, ...):
    if 'w' in mode:
        self.buf = BufferedWriter(...)
    ...
    for meth in ('read', 'write', etc.):
        setattr(self, meth, getattr(self.buf, meth))

It would also be nice to add some tests for the issues I mentioned above.

By the way, we can't apply this approach to 2.x.
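A more complete sketch of that delegation pattern (the class name and method lists here are illustrative, not from any patch):

import io

class BufferedDelegate(object):
    # Wrap a raw stream in an io buffered class and re-export its
    # methods, instead of inheriting from the buffered class.
    def __init__(self, raw, mode='r'):
        if 'w' in mode:
            self.buf = io.BufferedWriter(raw)
            methods = ('write', 'flush')
        else:
            self.buf = io.BufferedReader(raw)
            methods = ('read', 'read1', 'peek', 'readline')
        for meth in methods:
            setattr(self, meth, getattr(self.buf, meth))

f = BufferedDelegate(io.BytesIO(b"alpha\nbeta\n"))
print f.readline()   # prints "alpha\n"

Delegation keeps the wrapper's own class hierarchy and public API intact while still borrowing the fast buffered implementations.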
How about using the first patch with the slicing optimization, and documenting that anyone who needs more speed can wrap GzipFile in an io.BufferedReader?

This way we still get the 350% speed up and keep it fully backward compatible:

g = gzip.GzipFile(...)
r = io.BufferedReader(g)
for line in r:
    ...
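Spelled out as a runnable example (the path is made up, and this assumes GzipFile provides the interface BufferedReader expects, which is what the patches add):

import gzip
import io

g = gzip.GzipFile("/var/cache/apt/apt-file/example.gz")
r = io.BufferedReader(g)   # line splitting now happens in the io layer
for line in r:
    print line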
That's fine with me.
Submitted combined patch for Python 2.7.
In the test, should you verify that the correct data comes back from the reads?
Two things:

1. isatty() and __iter__() of io.BufferedIOBase raise on a closed file, while the original GzipFile methods do not.
2. Should we keep the original GzipFile methods, or prefer the implementation inherited from io.BufferedIOBase?
It's fine to use the BufferedIOBase implementation. There's no reason to keep the old behaviour.
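To illustrate the BufferedIOBase behaviour being adopted (a toy reproduction, not code from the patch):

import io

f = io.BufferedReader(io.BytesIO(b"some\ndata\n"))
f.close()
try:
    iter(f)        # __iter__ checks for a closed file
except ValueError as e:
    print e        # prints "I/O operation on closed file"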
Uploaded updated patch for Python 2.7.
Uploaded patch for Python 3.2.
The patches have been committed. Thank you!