-
-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores. #81932
Comments
In codecs.c, when _PyCodec_Lookup() call normalizestring(), both spaces and hyphens should be convered to underscores. Not convert spaces to hyphens. see:https://github.com/python/peps/blob/master/pep-0100.txt, Codecs (Coder/Decoders) Lookup |
and I will try to fix it. |
Hm, there is a bit misleading between desc(https://github.com/python/cpython/blob/master/Python/codecs.c#L53) and the code (https://github.com/python/cpython/blob/master/Python/codecs.c#L74). |
The design and code of the following four places need to be consistent, No.1 https://github.com/python/peps/blob/master/pep-0100.txt#L292 |
Jordon is right. Conversion has to be to underscores, not hyphens. I guess this bug was introduced when the normalization function was converted to C. |
Thanks for the fix Jordon Xu. IMHO this change is not strictly a bugfix, but more like an enhancement. I close the issue. If you consider that a backport to Python 3.7 and 3.8 is needed, please say so. |
Thanks vstinner. I also don't think it's necessary to backport to the old version. Close this issue is fine. |
The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec: |
I reopen the issue. I proposed PR 17997 to *document* the incompatible change in What's New in Python 3.8. IMO it's a deliberate change and it's correct. I rely on Marc-Andre Lemburg who implemented codecs and encodings modules. He wrote: "Jordon is right. Conversion has to be to underscores, not hyphens.". |
It seems quite easy to update latexcodec project to support Python 3.9. I proposed a solution there: |
Just to clarify: the change in the C implementation was the breaking change. The patch just restores the previous behavior: https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43 Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function. They should really register their own search function: https://docs.python.org/3.9/library/codecs.html#codecs.register It's good practice to always only use ASCII lower case chars and the underscore for codec names. |
latexcodec does register a search function.
latexcodec uses encoding names like "latex+ascii" and their search function used "+" as a separator. Don't worry, I just fixed latexcodec, my fix is already merged upstream! I simply changed the search function to split on "_" if the name contains "_". |
I created bpo-39337: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them. |
If I understand the target of this issue, this is a breaking change in python 3.9 . E.g. see SAP-archive/PyHDB#149 Ideally if we did not intend to break libraries then can this be fixed? |
Hi, akdor1154, thanks for your notice. It was a bugfix in fact:) https://bugs.python.org/issue37751#msg349448
@victor ping :) |
I think it is too late. Python 3.9 has been released already. Reverting the change is also breaking change. PEP-100 says: But codecs.register() says: I don't know historical reason why two document are inconsistent. |
On 23.04.2021 03:37, akdor1154 wrote:
This patch only restored the behavior we had before (and for many many https://bugs.python.org/issue37751#msg349448 Please note that search functions determine how to map codec names The approach taken by the encodings search function is listed here: Other search functions can work in different ways. Now, unfortunately, parts of this kind of normalization have also made As I mentioned before, the safest way to go about this is to use The Python implementation should make sure that such names continue |
On 23.04.2021 07:47, Inada Naoki wrote:
I guess just an oversight on my part. PEP-100 is certainly what I meant and implemented. I should have also |
Thanks Inada-san for documenting the change in codecs.register() doc! |
Unfortunately this is not quite finished yet. First of all, the change is bigger than what is documented: “Changed in version 3.9: Hyphens and spaces are converted to underscore.“ In reality, now Secondly, this change breaks lots of iconv codecs with the python-iconv binding. E.g. The codecs api feels extremely well-fitting for integrating iconv in python and any alternative I can think of seems unsatisfactory. |
This issue is now closed, would you mind to open a new issue? |
https://bugs.python.org/issue46508 filed to track fixing the acceptance and use of garbage codec values regression that this caused. |
(note: this might not be the true cause of that issue; though it sounds potentially related - I haven't investigated far enough yet) |
note that Bodo's own followup issue about the breaking change for python-iconv was filed as https://bugs.python.org/issue44723 |
codecs.register()
doc. #25643codecs.register()
doc. (GH-25643) #25677Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: