
gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() #148218

Open
ambv wants to merge 6 commits into python:main from ambv:gh-130273-iter-graphemes


@ambv
Contributor

@ambv ambv commented Apr 7, 2026

New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, without relying on unicodedata.grapheme_cluster_break(), extended_pictographic(), or indic_conjunct_break(), which were also added in Python 3.15.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.

New module Lib/_py_grapheme.py implements the full Unicode TR29
Extended Grapheme Cluster algorithm in pure Python, using the
unicodedata.grapheme_cluster_break(), extended_pictographic(), and
indic_conjunct_break() property accessors.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so
that both C and pure Python implementations share the same test suite,
including the TR29 conformance test against GraphemeBreakTest.txt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
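
The shared-suite refactor described above follows the usual unittest mixin pattern. A minimal sketch, assuming hypothetical stand-in segmenters (`split_chars`, `split_chars_alt`) and hypothetical class names rather than the PR's actual test code:

```python
import unittest


def split_chars(text):
    """Hypothetical stand-in segmenter: yields one item per code point."""
    return iter(text)


def split_chars_alt(text):
    """Second hypothetical stand-in with the same contract."""
    for ch in text:
        yield ch


class BaseSegmenterTest:
    """Shared test cases; not a TestCase itself, so it is never collected alone."""

    # Each concrete subclass sets this to the implementation under test.
    segment = None

    def test_empty(self):
        self.assertEqual(list(self.segment("")), [])

    def test_roundtrip(self):
        text = "abc"
        self.assertEqual("".join(self.segment(text)), text)


class FirstSegmenterTest(BaseSegmenterTest, unittest.TestCase):
    # staticmethod() prevents the function from being bound as a method.
    segment = staticmethod(split_chars)


class SecondSegmenterTest(BaseSegmenterTest, unittest.TestCase):
    segment = staticmethod(split_chars_alt)
```

Because the base class does not inherit from `unittest.TestCase`, the test runner only collects the two concrete subclasses, and each runs the same assertions against its own implementation.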
@ambv ambv requested review from Copilot and serhiy-storchaka April 7, 2026 13:59
@ambv ambv changed the title from "Add pure Python implementation of unicodedata.iter_graphemes()" to "gh-130273: Add pure Python implementation of unicodedata.iter_graphemes()" Apr 7, 2026
@ambv ambv added the skip news label Apr 7, 2026

Copilot AI left a comment

Pull request overview

Adds a pure-Python implementation of Unicode TR29 extended grapheme cluster segmentation to mirror unicodedata.iter_graphemes(), and refactors the existing grapheme-break tests so both the C and Python implementations can share the same conformance suite.

Changes:

  • Introduces Lib/_py_grapheme.py implementing TR29 Extended Grapheme Cluster segmentation using unicodedata property accessors.
  • Refactors GraphemeBreakTest into a BaseGraphemeBreakTest mixin and adds PyGraphemeBreakTest to exercise the Python implementation.
  • Shares the TR29 conformance test (GraphemeBreakTest.txt) across both implementations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Lib/_py_grapheme.py | New pure-Python TR29 grapheme cluster iterator returning Segment objects. |
| Lib/test/test_unicodedata.py | Test refactor into a shared base mixin + new test class targeting _py_grapheme.iter_graphemes. |

@StanFromIreland
Member

StanFromIreland commented Apr 7, 2026

Is it possible to ask the SC/@hugovk to allow backporting bab1d7a, rather than increasing the maintenance burden with a duplicate, different implementation? We could make them private on 3.14?

ambv and others added 2 commits April 7, 2026 16:58
Add makegraphemedata() to Tools/unicode/makeunicodedata.py that
generates Lib/_py_grapheme_db.py from the Unicode data files
(GraphemeBreakProperty.txt, emoji-data.txt, DerivedCoreProperties.txt).

_py_grapheme.py now imports property tables from _py_grapheme_db
and uses bisect for lookups instead of calling unicodedata functions
added in 3.15. This makes the module usable on Python 3.13 and 3.14
by regenerating the tables for the appropriate Unicode version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
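
The bisect-based lookup mentioned in this commit message can be sketched as follows. This is an illustration only: the table layout, the names `_RANGES`, `_STARTS`, and `lookup`, and the four sample ranges are assumptions, not the contents of the generated Lib/_py_grapheme_db.py.

```python
from bisect import bisect_right

# Hypothetical table: sorted, non-overlapping (first, last, property) ranges.
# A tiny illustrative subset, not the generated data.
_RANGES = [
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0300, 0x036F, "Extend"),
    (0x200D, 0x200D, "ZWJ"),
]
# Parallel list of range starts, so bisect can search plain integers.
_STARTS = [first for first, _, _ in _RANGES]


def lookup(ch, default="Other"):
    """Return the property of the range containing ch, or the default."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (-1 if there is none).
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, prop = _RANGES[i]
        if cp <= last:
            return prop
    return default
```

Each lookup costs O(log n) in the number of ranges and needs nothing beyond the standard library, which is what would make such a table usable on releases that lack the 3.15 property accessors.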
@ambv
Contributor Author

ambv commented Apr 7, 2026

@StanFromIreland the exercise of writing a pure Python reimplementation interests me as part of a grander scheme: providing the entirety of unicodedata in a pure Python version. That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport; it could also have a wider use for perf optimization in the future.

In any case, I asked Serhiy on gh-142529 to decide.

@vstinner
Member

vstinner commented Apr 7, 2026

The traceback issue gh-130273 has been fixed in the main branch using unicodedata.iter_graphemes() which was added to Python 3.15 and implemented in C.

Adding the pure Python Lib/_py_grapheme.py (251 lines) and Lib/_py_grapheme_db.py (344 lines) to Python 3.14 sounds like a bad idea. That's a lot of new code for a bugfix (3.14.x) release. We don't usually do that; we only backport "tiny" changes fixing bugs. Also, I don't think we have ever added new stdlib modules in a bugfix release.

> That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have a wider use for perf optimization in the future.

Does the JIT already support calling a different implementation of a function? Here two modules have to be imported to call a single function (_py_grapheme.iter_graphemes). It sounds like a lot of work for a single unicodedata function and the imports can fail with ImportError. iter_graphemes() is new, so it's not widely used. unicodedata.category() is more commonly used, so we should target this function first if we care about performance, no?

I'm not convinced by the JIT argument. IMO the unicodedata functions are too rarely used and it would be a lot of work to switch to a different implementation in the JIT.

I suggest not backporting the traceback fix to the 3.13 and 3.14 stable branches. And I don't think it's worth adding a pure Python implementation of unicodedata.iter_graphemes() to the main branch.

@ambv
Contributor Author

ambv commented Apr 7, 2026

> I'm not convinced by the JIT argument.

The argument isn't about unicodedata.iter_graphemes() specifically, but about avoiding standard library boundaries in JIT traces in general. It's true the JIT isn't doing that at the moment, but I expect it will become relevant in the future. PyPy, similarly, only uses Python-based standard libraries.

My re-implementation of a subset of unicodedata for this goal was simply a test of how complicated it would be to reimplement the entirety of unicodedata in a separate change. Most C-based standard libraries do have Python equivalents per PEP 399 requirements. unicodedata is an example that predates this PEP. So I'm looking at how hard it will be to patch this gap. PyPy could switch to our implementation then.

Now the iter_graphemes() subset of unicodedata is testing the waters, motivated by the bugfix that otherwise can't feasibly be backported. It seems to me like it would make sense to integrate this subset in main and backport it to allow fixing the bug.

I agree that we usually keep backported changes small, but it's not unheard of to backport several hundred lines of code. Here, I'd argue that _py_grapheme_db.py is irrelevant as it's generated code. The actual size of the change is 252 lines for _py_grapheme.py and 129 lines added to makeunicodedata.py. So ~380 lines altogether. Not trivial, but not excessive either.

I hear your argument, let's see what Serhiy's got to say.

Member

@malemburg malemburg left a comment

The new test cases look fine, but I don't see much point in adding a pure Python version of the huge unicodedata database to Python, so -1 on those parts.

If people want to use such a pure Python implementation, they should download a package from PyPI which provides this.

@bedevere-app

bedevere-app bot commented Apr 7, 2026

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.


5 participants