gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() by ambv · Pull Request #148218 · python/cpython

ambv · 2026-04-07T13:59:38Z

New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, without relying on unicodedata.grapheme_cluster_break(), unicodedata.extended_pictographic(), or unicodedata.indic_conjunct_break(), which were also added in Python 3.15.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.
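The shared-mixin arrangement described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `clusters_loop`, `clusters_generator`, and the test class names are hypothetical stand-ins, and the toy rule (attach combining marks to the preceding character) is far simpler than TR29.

```python
import unicodedata
import unittest


def clusters_loop(text, /):
    """Hypothetical stand-in for one implementation (list-building loop)."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the combining mark to the previous cluster
        else:
            out.append(ch)
    return out


def clusters_generator(text, /):
    """Hypothetical stand-in for a second implementation of the same rule."""
    current = ""
    for ch in text:
        if current and not unicodedata.combining(ch):
            yield current
            current = ""
        current += ch
    if current:
        yield current


class BaseClusterTest:
    """Mixin holding the shared tests; it is not a TestCase itself."""

    iter_clusters = None  # each concrete subclass supplies the implementation

    def test_combining_mark_stays_attached(self):
        self.assertEqual(list(self.iter_clusters("e\u0301a")), ["e\u0301", "a"])

    def test_empty_string(self):
        self.assertEqual(list(self.iter_clusters("")), [])


class LoopClusterTest(BaseClusterTest, unittest.TestCase):
    iter_clusters = staticmethod(clusters_loop)


class GeneratorClusterTest(BaseClusterTest, unittest.TestCase):
    iter_clusters = staticmethod(clusters_generator)
```

Because the mixin does not inherit from `unittest.TestCase`, the test loader collects only the two concrete classes, so every shared test runs once per implementation.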

Issue: Traceback colors are shifted when the line contains wide unicode characters #130273

New module Lib/_py_grapheme.py implements the full Unicode TR29 Extended Grapheme Cluster algorithm in pure Python, using the unicodedata.grapheme_cluster_break(), extended_pictographic(), and indic_conjunct_break() property accessors.

Refactored GraphemeBreakTest into a BaseGraphemeBreakTest mixin so that both C and pure Python implementations share the same test suite, including the TR29 conformance test against GraphemeBreakTest.txt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a pure-Python implementation of Unicode TR29 extended grapheme cluster segmentation to mirror unicodedata.iter_graphemes(), and refactors the existing grapheme-break tests so both the C and Python implementations can share the same conformance suite.

Changes:

- Introduces Lib/_py_grapheme.py implementing TR29 Extended Grapheme Cluster segmentation using unicodedata property accessors.
- Refactors GraphemeBreakTest into a BaseGraphemeBreakTest mixin and adds PyGraphemeBreakTest to exercise the Python implementation.
- Shares the TR29 conformance test (GraphemeBreakTest.txt) across both implementations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| Lib/_py_grapheme.py | New pure-Python TR29 grapheme cluster iterator returning `Segment` objects. |
| Lib/test/test_unicodedata.py | Test refactor into a shared base mixin + new test class targeting `_py_grapheme.iter_graphemes`. |

StanFromIreland · 2026-04-07T14:55:12Z

Is it possible to ask the SC/@hugovk to allow for backporting of bab1d7a rather than increasing the maintenance burden with a duplicate and different implementation instead? We could make them private on 3.14?

Add makegraphemedata() to Tools/unicode/makeunicodedata.py that generates Lib/_py_grapheme_db.py from the Unicode data files (GraphemeBreakProperty.txt, emoji-data.txt, DerivedCoreProperties.txt).

_py_grapheme.py now imports property tables from _py_grapheme_db and uses bisect for lookups instead of calling unicodedata functions added in 3.15. This makes the module usable on Python 3.13 and 3.14 by regenerating the tables for the appropriate Unicode version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
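A bisect-based property lookup of the kind this commit message mentions might look roughly like the sketch below. The table layout, the names `_RANGES`, `_STARTS`, and `grapheme_break_property`, and the handful of ranges shown are assumptions made for illustration; the actual generated `_py_grapheme_db.py` may be organized differently.

```python
from bisect import bisect_right

# Hypothetical excerpt of a generated table: sorted, non-overlapping
# (first code point, last code point, property value) ranges.
_RANGES = [
    (0x000A, 0x000A, "LF"),
    (0x000D, 0x000D, "CR"),
    (0x0300, 0x036F, "Extend"),
    (0x200D, 0x200D, "ZWJ"),
    (0x1F1E6, 0x1F1FF, "Regional_Indicator"),
]
# Parallel list of range starts, so bisect can search plain integers.
_STARTS = [start for start, _, _ in _RANGES]


def grapheme_break_property(ch, /, default="Other"):
    """Return the property of the range containing ch, or the default."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp (or -1 if there is none).
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _, end, prop = _RANGES[i]
        if cp <= end:
            return prop
    return default
```

A lookup costs O(log n) in the number of ranges, and the table is plain Python data, which is what would let it be regenerated for a different Unicode version without touching the lookup code.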

ambv · 2026-04-07T15:13:33Z

@StanFromIreland the exercise of writing a pure Python reimplementation interests me as part of a grander scheme: providing the entirety of unicodedata in a pure Python version. That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have a wider use for perf optimization in the future.

In any case, I asked Serhiy on gh-142529 to decide.

vstinner · 2026-04-07T16:13:05Z

The traceback issue gh-130273 has been fixed in the main branch using unicodedata.iter_graphemes() which was added to Python 3.15 and implemented in C.

Adding the pure Python Lib/_py_grapheme.py (251 lines) and Lib/_py_grapheme_db.py (344 lines) to Python 3.14 sounds like a bad idea. That's a lot of new code for a bugfix (3.14.x) release. We usually don't do that; we only backport "tiny" changes fixing bugs. Also, I don't think that we have ever added new stdlib modules in a bugfix release.

> That way the JIT could use that instead of hitting the C extension boundary every time unicodedata is needed. So this version wouldn't only be used for the backport, but could have a wider use for perf optimization in the future.

Does the JIT already support calling a different implementation of a function? Here two modules have to be imported to call a single function (_py_grapheme.iter_graphemes). It sounds like a lot of work for a single unicodedata function and the imports can fail with ImportError. iter_graphemes() is new, so it's not widely used. unicodedata.category() is more commonly used, so we should target this function first if we care about performance, no?

I'm not convinced by the JIT argument. IMO the unicodedata functions are too rarely used and it would be a lot of work to switch to a different implementation in the JIT.

I suggest not backporting the traceback fix to the 3.13 and 3.14 stable branches. And I don't think that it's worth adding a pure Python implementation of unicodedata.iter_graphemes() in the main branch.

ambv · 2026-04-07T16:28:27Z

> I'm not convinced by the JIT argument.

The argument isn't about unicodedata.iter_graphemes() specifically, but about avoiding standard library boundaries in JIT traces in general. It's true the JIT isn't doing that at the moment, but I expect it will become relevant in the future, just like PyPy only uses Python-based standard libraries.

My re-implementing a subset of unicodedata for this goal was simply a test of how complicated it would be to reimplement the entirety of unicodedata in a separate change. Most C-based standard libraries do have Python equivalents per PEP 399 requirements. unicodedata is an example that predates this PEP. So I'm looking at how hard it will be to patch this gap. PyPy could switch to our implementation then.
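For readers unfamiliar with the convention referred to here, the usual shape of a pure Python module with an optional C accelerator is sketched below. The module name `_hypothetical_accel` and the function `code_point_count` are invented for illustration and do not correspond to anything in this PR or in the standard library.

```python
def code_point_count(text, /):
    """Pure Python reference implementation; always available."""
    return sum(1 for _ in text)


try:
    # Hypothetical accelerator module. When it is missing (as it is here),
    # the pure Python definition above stays in place.
    from _hypothetical_accel import code_point_count
except ImportError:
    pass
```

Defining the Python version first and then attempting the accelerated import means an ImportError never leaves the name undefined; callers get the same function name either way.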

Now the iter_graphemes() subset of unicodedata is testing the waters, motivated by the bugfix that otherwise can't feasibly be backported. It seems to me like it would make sense to integrate this subset in main and backport it to allow fixing the bug.

I agree that we usually keep backported changes small, but it's not unheard of to backport several hundred lines of code. Here, I'd argue that _py_grapheme_db.py is irrelevant as it's generated code. The actual size of the change is 252 lines for _py_grapheme.py and 129 lines added to makeunicodedata.py. So ~380 lines altogether. Not trivial, but not excessive either.

I hear your argument, let's see what Serhiy's got to say.

malemburg

The new test cases look fine, but I don't see much point in adding a pure Python version of the huge unicodedata database to Python, so -1 on those parts.

If people want to use such a pure Python implementation, they should download a package from PyPI which provides this.

bedevere-app · 2026-04-07T16:31:36Z

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

ambv requested review from Copilot and serhiy-storchaka April 7, 2026 13:59

bedevere-app bot added the awaiting core review label Apr 7, 2026

ambv changed the title ~~Add pure Python implementation of unicodedata.iter_graphemes()~~ gh-130273: Add pure Python implementation of unicodedata.iter_graphemes() Apr 7, 2026

ambv added the skip news label Apr 7, 2026

bedevere-app bot mentioned this pull request Apr 7, 2026

Traceback colors are shifted when the line contains wide unicode characters #130273

Closed

Copilot started reviewing on behalf of ambv April 7, 2026 14:00

ambv mentioned this pull request Apr 7, 2026

gh-130273: Fix traceback color output with unicode characters #142529

Merged

Copilot AI reviewed Apr 7, 2026

Lib/_py_grapheme.py (resolved)

Lib/_py_grapheme.py (outdated, resolved)

Lib/test/test_unicodedata.py (resolved)

Lib/test/test_unicodedata.py (resolved)

ambv added 2 commits April 7, 2026 16:15

Add _py_grapheme to stdlib_module_names.h

6262980

Make the first argument positional-only

38db422

StanFromIreland reviewed Apr 7, 2026

Lib/_py_grapheme.py (outdated, resolved)

ambv and others added 2 commits April 7, 2026 16:58

Fix newlines to make linter happy

5701c0b

StanFromIreland requested a review from malemburg April 7, 2026 15:12

Achieve 100% statement and branch test coverage

e073e06

malemburg requested changes Apr 7, 2026

bedevere-app bot added awaiting changes and removed awaiting core review labels Apr 7, 2026

ambv wants to merge 6 commits into python:main from ambv:gh-130273-iter-graphemes

5 participants
