Cleanup the unicodedata module #86323
Mohamed Koubaa and I are trying to convert the unicodedata module to the multi-phase initialization API (PEP 489) and to convert the UCD static type to a heap type in bpo-1635741. The unicodedata extension module has some special cases:
There is also a unicodedata.UCD type which cannot be instantiated in Python. It is only used to create the unicodedata.ucd_3_2_0 instance.

In commit 47e1afd, I moved the private _PyUnicode_Name_CAPI structure to the internal C API. In commit ddc0dd0, Mohamed added a ucd_type parameter to the UCD_Check() macro. I asked him to do that. In commit e6b8c52, I added a "global module state" and a "state" parameter to most functions. This change prepares the code base to pass a UCD type instance to functions, so that there can be more than one UCD type once it is converted to a heap type: one type per module instance.

The technical problem is that unicodedata_functions is used for both module functions and UCD methods. Duplicating unicodedata_functions would require duplicating a lot of code and comments. Sadly, it does not seem easy to retrieve the "module state" (the "state" variable) in these functions, precisely because the same function table serves both the module and the UCD type. Using "defining_class" in Argument Clinic would require duplicating every function in unicodedata_functions: one flavor for module functions, one flavor for the UCD type. It would also require duplicating all docstrings, which increases the maintenance burden and introduces a risk of inconsistencies.

Maybe we could introduce a new UCD instance mapped to the current Unicode Character Database version, with module functions that would be bound methods of this instance. But it sounds like overkill to me.

By the way, Unicode 3.2 was released in 2002: 18 years ago. I don't think that it's still relevant in 2020 to keep backward compatibility with Unicode 3.2. I propose to deprecate unicodedata.ucd_3_2_0 and the unicodedata.UCD type. In Python 3.12, we will be able to remove a lot of code and simplify what remains.
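The sharing described above is visible from Python: the module-level functions and the methods of unicodedata.ucd_3_2_0 have the same names, and only the database version behind them differs. A small sketch (the sample characters are arbitrary choices, and the category comparison assumes a Python build whose bundled Unicode database already contains U+1F600, i.e. Unicode 6.1 or later):

```python
import unicodedata

old = unicodedata.ucd_3_2_0

# Same function names, two database versions.
print(unicodedata.unidata_version)   # version bundled with this Python build
print(old.unidata_version)           # the frozen 3.2.0 database

# A character present in both versions gives the same answer.
assert unicodedata.name("A") == old.name("A") == "LATIN CAPITAL LETTER A"

# U+1F600 was assigned long after Unicode 3.2, so the old database
# reports it as unassigned (general category "Cn").
emoji = "\U0001F600"
current_category = unicodedata.category(emoji)
old_category = old.category(emoji)
print(current_category, old_category)

# The UCD type itself cannot be instantiated from Python code.
try:
    type(old)()
    can_instantiate = True
except TypeError:
    can_instantiate = False
print(can_instantiate)
```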
For now, we can convert unicodedata to the multi-phase initialization API (PEP 489) and convert the UCD static type to a heap type by avoiding references to the UCD type. Rather than checking whether self is an instance of UCD_Type, we can check that it is not a module (PyModule_Check). This is exactly what Mohamed proposed in the first place, but I misunderstood the whole issue and gave him bad advice.
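To make the "is it a module?" check concrete, here is a toy Python model of the idea. It is not the C implementation: ToyUCD, CURRENT_CATEGORY, OLD_CATEGORY_CHANGES and toy_module are hypothetical names invented for this illustration, and the two-entry tables only stand in for the real database.

```python
import types

# Hypothetical stand-ins for the real data: the current database and
# the per-character changes needed to get back to an old version.
CURRENT_CATEGORY = {"A": "Lu", "\U0001F600": "So"}
OLD_CATEGORY_CHANGES = {"\U0001F600": "Cn"}


class ToyUCD:
    """Toy stand-in for an old database version (like ucd_3_2_0)."""

    def __init__(self, changes):
        self.changes = changes


def category(self, ch):
    """One implementation shared by the module and by ToyUCD.

    `self` is either a module object or a ToyUCD instance. Testing for
    "not a module" avoids any reference to the ToyUCD type itself.
    """
    if not isinstance(self, types.ModuleType):
        old = self.changes.get(ch)
        if old is not None:
            return old
    return CURRENT_CATEGORY.get(ch, "Cn")


# Expose the same function as a module function and as a method.
toy_module = types.ModuleType("toy_unicodedata")
ToyUCD.category = category
toy_ucd_3_2_0 = ToyUCD(OLD_CATEGORY_CHANGES)

print(category(toy_module, "\U0001F600"))        # current database
print(toy_ucd_3_2_0.category("\U0001F600"))      # old database
```

The point of the design is that the shared function never needs to know the type object, which is what makes one type per module instance possible.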
On 26.10.2020 18:05, STINNER Victor wrote:
The version 3.2.0 is needed for IDNA compatibility:
IDNA 2003: https://tools.ietf.org/html/rfc3490
IDNA 2008: https://tools.ietf.org/html/rfc5890 et al.
Python only supports IDNA 2003 AFAIK and the ucd_3_2_0 tag was added
IDNA 2008 seems to have mechanisms to also work for Unicode versions
http://www.unicode.org/reports/tr46/
All that said, it may actually be better to deprecate IDNA 2003 support
https://pypi.org/project/idna/ or incorporate this into the stdlib instead of IDNA 2003. The special
Oh, it is used by the IDNA encoding (encodings.idna module) and the stringprep module (which is used by the encodings.idna module).
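A quick way to see this dependency from Python, using only the public "idna" codec and the stringprep module (the domain name and sample characters are arbitrary examples):

```python
import stringprep

# IDNA 2003 (RFC 3490) encoding of a label containing U+00FC.
encoded = "b\u00fccher.example".encode("idna")
decoded = encoded.decode("idna")
print(encoded, decoded)

# stringprep table A.1 lists code points that are unassigned in
# Unicode 3.2. U+1F600 was assigned later, so it still counts as
# unassigned here, whatever the current database says.
unassigned_in_3_2 = stringprep.in_table_a1("\U0001F600")
assigned_in_3_2 = not stringprep.in_table_a1("A")
print(unassigned_in_3_2, assigned_in_3_2)
```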
Oh, I missed your comment. I also discovered it by trying to remove it :-) So I think that the last thing to do for this issue is to remove unicodedata.ucnhash_CAPI: PR 22994.
I kept unicodedata.ucd_3_2_0 and added a comment to explain why it's still relevant in 2020. I'm done with tasks listed in this issue, so I close it.