-
-
Notifications
You must be signed in to change notification settings - Fork 31.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fileinput.hook_compressed returning bytes from gz file #50008
Comments
The attached ZIP file contains "test.bat" which runs "test.py" with Python 2.6 behaves as expected (see "py26.out"), since it returns For reference, I tested this on Python versions: |
gzip.open() only implements the "rb" mode, and returns bytes. Then the encoding issue arises. |
Working on this |
I had a lot of help of R. David Murray, could someone review our patch, please? We are specially concerned about backward compatibility. |
Thanks for the patch. I commented on not-so-important things on the code review site (which should have sent an email to Tatiana) and wanted to raise an issue here. Is it best that the encoding used defaults to Python’s default encoding (UTF-8) or the I/O default encoding (depends on the OS and locale), like what open does? |
I think it should use the same default as open, but frankly I couldn't remember the right way to achieve that. Suggestions welcome :) |
We can use locale.getpreferredencoding, or be smarter and just pass down encoding=encoding to TextIOWrapper and let *it* call getpreferredencoding if it’s given a None :) |
Excellent! Thanks for all remarks, Éric! I'll follow your suggestions and will submit a new patch. |
Duh. I should have remembered that the io module used None to represent the default encoding. Thanks, Éric. |
Nobody can remember everything: I tried in a shell and looked at the docs to find it :) |
Improved patch, according to eric.araujo's suggestions and mnewman's guidance. |
Thanks. Unless another core dev wants to do a complementary review I will slightly tweak the patch and commit it. I need to finish waking up and eat some food before I do that :) Technically adding a new argument means that this is a new feature and cannot be applied to the stable 3.2 version, but something needs to be done for this bug in 3.2 too, like a recipe in the docs for a hook_compressed that returns strings (i.e. a function that calls fileinput.hook_compressed and wraps it in a TextIOWrapper), or at least a note to warn about this bug. |
For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding? That is, we use TextIOWrapper but always with the default None for encoding. It also occurs to me that this really exposes a weakness in the design. What if the user wants to specify other open parameters? I wonder if we should say that for better future-proofing openhooks should always take **kw. You could even envision fileinput accepting **kw and passing them along to the openhook. I think charset is the most important open paramenter in this context, though, so I don't think we have to solve the general problem in this fix. |
It also occurs to me that this fix makes the charset hook look rather odd. We could render it redundant by passing charset to open in the non-openhook case, and mark it deprecated. There is also a bug in the hook_encoding docs. It says the file is opened with codecs.open, but that is not the case, regular open is used. |
Agreed on deprecating the charset hook when it becomes redundant. Will fix the doc bug about codecs.open too. |
This issue is not newcomer friendly, I remove the easy keyword. |
The patch for this change has broken code that relied on the old behavior. For example:
That code passes on Python3.10a6 but on 3.10b1 fails with:
I encountered this issue in this function and its test. I'm struggling with how to adapt the code to provide a uniform interface on Python 3.6+. Perhaps a backport of hook_compressed is in order. |
A backport now exists (https://pypi.org/project/backports.hook_compressed) and addresses the issue (https://github.com/jaraco/cmdix/actions/runs/873404846). |
It's easy. Just passing
This is better than |
I did consider and confirm that mode="rb" does also provide a uniform solution, but it also regresses the behavior of uncompressed inputs, making them bytes where they were text. This approach feels like the "Python 1" compatibility approach. At least in my case, the preference was to migrate to the future behavior, where all content is decoded to Unicode text, without having to wait for Python 3.9 to sunset. |
Of course, I suggested to use "rb" when you want to read bytes.
|
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: