-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bz2.BZ2File should accept other file-like objects. (issue4274045) #50113
Comments
bz2.BZ2File should, like gzip.GzipFile, accept a fileobj argument. If implemented, you could much more easily pipe BZ2-data from other |
Do you want to work on patch? |
Would if I could. But, No. |
That’s a perfectly fine reply. Someone will see this feature request and propose a patch eventually. Another way to help is to write tests, since those are in Python. |
For the record, this will need a comprehensive rewrite of bz2module, since it uses FILE pointers right now. |
Without a patch and compelling use cases, this has no chance. Recommend closing. |
I'll try working on a patch. |
Sorry, I'm giving up. The copyright notice for bz2module.c lists "Gustavo Niemeyer" as one of the holders, is he the maintainer? Maybe he should be notified of this bug. |
Indeed, I think only an extensive rewrite could fulfill the feature
He hasn't been active for years, so I don't think he can still be |
A use case
planet-*osm.bz2 files are 14GB at the moment. it would be great to read them from stdin while downloading from a server and uploading to a database at the same time. Of course, you can insert "bzip2 -d" into the pipe... but then why to bother with bz2 module in Python? ;) |
We’ve already agreed the feature is desirable; what’s missing is a patch, not user stories :) |
OK! :) |
I have been working on a patch for this issue. I've implemented everything except for readline(), readlines() and the iterator protocol. In the existing implementation, the reading methods seem to interact weirdly - iternext() uses a readahead buffer, while none of the other methods do. Does anyone know if there's a reason for this? I was planning on having all the reading methods use a common buffer, which should allow free mixing of read methods and iteration. Looking at bpo-8397, I'm guessing it would be fine, but I wanted to double-check in case there's a quirk of the iteration protocol that I've overlooked, or something like that. |
Here is a patch that rewrites BZ2File to implement the requested feature, and adds some tests using BytesIO objects. Some notes:
One thing I was unsure of is how to handle exceptions that occur in BZ2File_dealloc(). Does the error status need to be cleared before it returns? The documentation for the bz2 module appears to be quite out of date; I will upload a patch in the next day or so. On a related note, the 'buffering' argument to __init__() is ignored, and I was wondering whether this should be documented explicitly? The current documentation claims that it allows the caller to specify a buffer size, or request unbuffered I/O. |
Are there tests for this? |
Yes, see bz2module-v1.diff. |
Interesting! If you are motivated, a further approach would be to expose the compressor and decompressor objects from the C extension, and write the file object in Python (as in Lib/gzip.py).
Yes, it should. Actually, it would be better to write out the exception using PyErr_WriteUnraisable().
Yes, it should probably be deprecated if it's not useful anymore. By the way, the current patch produces reference leaks: $ ./python -m test -R 3:2 test_bz2
[1/1] test_bz2
beginning 5 repetitions
12345
.....
test_bz2 leaked [44, 44] references, sum=88 |
|
|
Yes, totally. |
Here is a quick-and-dirty reimplementation of BZ2File in Python, on top of the existing C implementation of BZ2Compressor and BZ2Decompressor. There are a couple of issues with this code that need to be fixed:
I hope to resolve both of these issues (and do a general code cleanup), by writing a C extension module that provides a thin wrapper around bzCompress()/bzDecompress(), and reimplementing the module's public interface in Python on top of it. This should reduce the size of the code by close to half, and make it easier to read and maintain. I'm not sure when I'll be able to get around to it, though, so I thought I should post what I've done so far. Other changes in the patch:
|
It probably depends on the buffer size. Trying to fix this /might/ be Also, as with GzipFile one goal should be for BZFile to be wrappable in
Ah, thank you. Will take a look. |
Actually, looking at the code again (and not being half-asleep this time), I
But you're right; this is probably premature optimization. I'll do some proper
Ah, OK. I suppose that is a sensible way of using it. peek() will be quite easy |
OK, I've rewritten the whole bz2 module (patch attached), and I think it is now ready for review. The BZ2File implementation is a cleaned-up version of the one from my previous patch, with some further additions. I've factored out the common compressor/decompressor stuff into classes Compressor and Decompressor in the _bz2 extension module; with these, BZ2Compressor, BZ2Decompressor, compress() and decompress() are trivial to implement in Python. My earlier efficiency concerns seem to have been unfounded; I ran some quick tests with a 4MB bz2 file, and there wasn't any measurable performance difference from the existing all-C implementation. I have added a peek() method to BZ2File, in accordance with Antoine's suggestion, but it's not clear how it should interpret its argument. I followed the lead of io.BufferedReader, and simply ignored the arg, returning whatever data as is already buffered. The patch also includes tests for peek() in test_bz2, based on test_io's BufferedRWPairTest. Also, while looking at io.BufferedReader's implementation, I noticed that it doesn't actually seem to use raw.peek() at all. If this is correct, then perhaps peek() is unnecessary, and shouldn't be added. The patch also adds a property 'eof' to BZ2Decompressor, so that the user can test whether EOF has been reached on the compressed stream. For the new files (Modules/_bz2module.c and Lib/bz2.py), I'm guessing there should be some license boilerplate stuff added at the top of each. I wasn't sure exactly what this should look like, though - some advice would be helpful here. |
Patch posted for review at http://codereview.appspot.com/4274045/. Still have to do a review though :) |
Review posted at http://codereview.appspot.com/4274045/ |
Reviewers: nadeem vawda <nadeem.vawda_gmail.com>, http://codereview.appspot.com/4274045/diff/1/Lib/bz2.py http://codereview.appspot.com/4274045/diff/1/Lib/bz2.py#newcode25 You should probably also implement fileno() (simply return http://codereview.appspot.com/4274045/diff/1/Lib/bz2.py#newcode386 http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#oldcode123 http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#newcode3
Modules/_bz2module.c:3: #include "Python.h"
Since this is a new start, perhaps we should add
#define PY_SSIZE_T_CLEAN
before including "Python.h"?
This will ensure no code in the module will rely on the old behaviour. http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#newcode48 http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#newcode78 http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#newcode122 http://codereview.appspot.com/4274045/diff/1/Modules/_bz2module.c#newcode209 http://codereview.appspot.com/4274045/diff/1/setup.py http://codereview.appspot.com/4274045/diff/1/setup.py#newcode1236 Please review this at http://codereview.appspot.com/4274045/ Affected files: |
Thanks for the review. I'll try and have an updated patch ready by next weekend. Regarding your comments:
|
Would it be possible to add an open() function to the bz2 module? Currently gzip has such a function, but bz2 does not: >>> import gzip
>>> gzip.open
<function open at 0x781f0>
>>> import bz2
>>> bz2.open
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'module' object has no attribute 'open'
>>> |
Well, it could be a topic for a separate issue. |
Yes, it would be quite trivial, though I don't think it would be worthwhile - Regarding the use of PY_SSIZE_T_CLEAN, I assume that Py_ssize_t is to be Also, I was wondering whether I need to add some sort of license boilerplate to /* _bz2 - Low-level Python interface to libbzip2. (Browsing through the source files in Lib/ and Modules/, there doesn't seem to |
Yes, ssize_t doesn't exist everywhere AFAIK.
Well, I would personally advocate not re-adding a license boilerplate, |
That sounds sensible to me. I'll see what the rest of python-dev thinks, though. |
Given the absence of response on python-dev, I'd say simply remove the obsolete copyright notice. |
Here is an updated patch, incorporating the feedback from your review. The new patch no longer checks for errors in bz2CompressEnd()/bz2DecompressEnd() The patch adds implementations of most of the io.BufferedIOBase methods |
Corresponding patch for the module docs. |
From the discussion on python-dev, it seems that I will need to submit a |
Ok, I was planning to do another review anyway. |
Nadeem,
Apparently the PSF has received your contributor agreement. Does it mean the situation is cleared? I plan to do a review of your latest patch. |
Great; I was just about to send them an email to check.
Yes, everything's sorted out. Go ahead :) |
Here is an updated patch that adds read1() to BZ2File. This should fix things |
Updated documentation. |
Thank you! A couple of comments:
|
Thanks for the review. I've made most of the changes you suggested, but there's
The tests for readline() and readlines() expect a TypeError if size is None.
|
Ah, you're right, TypeError should be raised. |
Here's the updated patch. |
... and the corresponding updated documentation patch. |
New changeset 2cb07a46f4b5 by Antoine Pitrou in branch 'default': |
Thank you very much, Nadeem. The patch is now in. |
Hi, thanks for the patch. Could you also publish a version for older python 2.x ? regards, |
As a new feature, this can’t go into older versions. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: