Split a string by graphemes #73

rhdunn · 2021-05-07T07:33:59Z

The new fn:characters function is useful, but it doesn't solve the problem of manipulating strings in which multiple codepoints correspond to a single grapheme. For example:

characters with one or more combining characters;
emoji with skin tone variant selectors;
emoji with gender variant selectors;
multi-sequence emoji -- family, Wales flag, etc.;
region indicator pairs for flags.

Getting this right is complex, and a regular-expression implementation is easy to get wrong.
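As a rough illustration of the gap described above, the following Python sketch (not part of the proposal; it uses only the standard unicodedata module) shows that one user-perceived character can span several codepoints. The sample strings are assumptions chosen for the demonstration.

```python
import unicodedata

# "e with acute" encoded two ways: precomposed, and base letter + combining mark.
precomposed = "\u00e9"
decomposed = "e\u0301"

# Waving hand followed by a light skin tone modifier (two codepoints, one glyph).
waving = "\U0001F44B\U0001F3FB"

# Python's len() counts codepoints, not graphemes.
print(len(precomposed), len(decomposed), len(waving))

# Normalization unifies the two spellings of the accented letter, but it does
# nothing for the emoji sequence, which has no precomposed form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(len(unicodedata.normalize("NFC", waving)))
```

Normalization alone therefore cannot stand in for grapheme-aware splitting.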

fn:graphemes

Summary

Splits the supplied string into a sequence of single-grapheme (one or more character) strings.

Signature

fn:graphemes($value as xs:string?) as xs:string*

Properties

This function is ·deterministic·, ·context-independent·, and ·focus-independent·.

Rules

The function returns a sequence of strings, each containing one ·grapheme· of $value, in the order in which they appear. Graphemes are determined by the Unicode rules for what constitutes a ·grapheme·. The versions of the Unicode and Unicode Emoji standards used are ·implementation-dependent·.

If $value is a zero-length string or the empty sequence, the function returns the empty sequence.

Examples

The expression fn:graphemes("Thérèse") returns ("T", "h", "é", "r", "è", "s", "e"), irrespective of whether the accented e characters are precomposed or use combining characters.

The expression fn:graphemes("") returns ().

The expression fn:graphemes(()) returns ().

The expression fn:graphemes("👋🏻👋🏼👋🏽👋🏾👋🏿") returns ("👋🏻", "👋🏼", "👋🏽", "👋🏾", "👋🏿").

The expression fn:graphemes("👪") returns ("👪").

The expression fn:graphemes("👨‍🔬👩‍🔬") returns ("👨‍🔬", "👩‍🔬").

The expression fn:graphemes("🇪🇪🇩🇪🇫🇷🏴󠁧󠁢󠁷󠁬󠁳󠁿🇮🇸") returns ("🇪🇪", "🇩🇪", "🇫🇷", "🏴󠁧󠁢󠁷󠁬󠁳󠁿", "🇮🇸").
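To make the intended behaviour of the first three examples concrete, here is a deliberately simplified Python sketch. The name simple_graphemes is hypothetical, and the rule it applies (attach every codepoint whose general category starts with "M" to the preceding cluster) is an assumption that covers combining marks only. It does not implement the Unicode segmentation rules, so it would not handle the ZWJ or flag sequences shown above.

```python
import unicodedata


def simple_graphemes(value):
    """Group each base character with the combining marks that follow it.

    Simplified stand-in for the proposed function: only general categories
    Mn, Mc and Me are treated as cluster extenders.
    """
    clusters = []
    for ch in value or "":  # None plays the role of the empty sequence
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Same visible word, two encodings: precomposed vs. base + combining accent.
composed = simple_graphemes("Th\u00e9r\u00e8se")
decomposed = simple_graphemes("The\u0301re\u0300se")
print(len(composed), len(decomposed))
```

Both calls yield seven clusters, which is the property the "Thérèse" example asks for. A conforming implementation would need the full rule set and property data from Unicode.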


rhdunn · 2021-05-07T07:40:44Z

It might also be worth adding a note to fn:characters about this issue and referencing fn:graphemes for the use cases where preserving graphemes is required.

Conal-Tuohy · 2021-05-07T09:19:11Z

It would be good to spell out the Unicode blocks of the combining characters, variation selectors, etc. Some of the current XPath functions are spelled out as "equivalent of the following function: ... " and this could be doable for fn:graphemes, too, I think.

"Text" and "emoji" variation selectors would be another good example to include:

The expression fn:graphemes("♎♎︎") returns ("♎", "♎︎").
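For readers who cannot see the selectors, this small Python sketch lists the codepoints behind a symbol with a text or emoji variation selector. It assumes the selectors involved are U+FE0E and U+FE0F. The helper name describe is hypothetical.

```python
import unicodedata

libra_text = "\u264E\uFE0E"   # LIBRA + VARIATION SELECTOR-15 (text presentation)
libra_emoji = "\u264E\uFE0F"  # LIBRA + VARIATION SELECTOR-16 (emoji presentation)


def describe(s):
    """Return (codepoint, general category, name) for each codepoint in s."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c)) for c in s]


for sample in (libra_text, libra_emoji):
    print(describe(sample))
```

Each sample is two codepoints long, and a grapheme-aware split should keep each pair together.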

rhdunn · 2021-05-07T15:22:59Z

The http://unicode.org/reports/tr51/ document should be referenced, which details how to identify an emoji grapheme. Which version is used should be implementation dependent.

liamquin · 2021-05-07T20:22:26Z

Are you sure you mean grapheme and not grapheme cluster here?

rhdunn · 2021-05-07T21:05:34Z

Ah yes, you are right. Interestingly, Unicode supports two grapheme cluster modes (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) -- legacy grapheme cluster, and extended grapheme cluster -- in addition to the emoji rules in TR51.

As a result of this, it may be more useful to extend fn:characters with an options map. That could have two options:

record(grapheme-cluster as enum("codepoint", "legacy", "extended"),
       emoji as xs:boolean)

If grapheme-cluster is "codepoint", it works as $value => string-to-codepoints() -> codepoints-to-string(). If it is "legacy" or "extended", it follows the corresponding Unicode algorithm in TR29.

If emoji is true, then it follows the TR51 rules for segmenting emoji sequences.

If we want the default to be a simple conversion, the options could default to map { grapheme-cluster: "codepoint", emoji: false() }, otherwise, they could default to map { grapheme-cluster: "extended", emoji: true() }.

The exact behaviour would be implementation dependent and depend on the Unicode/Emoji version supported by that implementation.

rhdunn · 2021-05-07T21:11:48Z

I based the function name on http://www.unicode.org/glossary/#grapheme, specifically:

(2) What a user thinks of as a character.

michaelhkay · 2022-08-11T22:54:49Z

I love the phrase "what a user thinks of as a character". I don't imagine many users think a pile of poo is a character. They might think that Sherlock Holmes is.

duncdrum · 2022-08-12T08:29:15Z

This is quite interesting, but we will need to ensure that graphemes in the context of CJK and Unihan return what users would expect. I'll try to come up with a few examples.

ChristianGruen · 2023-10-31T11:11:53Z

I wonder if this feature isn’t too sophisticated to be added to the spec. What do others think?

michaelhkay · 2023-10-31T11:49:34Z

Personally, this is not something I have ever felt a need for. I'm open to persuasion on that, though I'm aware that when one person enthusiastically wants a feature, and everyone else doesn't see the need for it, there is a tendency to add it, causing feature creep. But I also need convincing that specifying it, implementing it, and testing it are reasonably feasible. A few features like parse-html and parse-uri are inevitably going to be difficult, but we only want to tackle difficult issues if there's a high benefit.

Arithmeticus · 2023-10-31T15:53:44Z

Let me check with the TEI linguistic community and gauge their interest.

bansp · 2023-10-31T17:58:11Z

Regarding "people who may see the need for it", the following is a fragment of an e-mail that I have found interesting and kept aside for when I get a moment to research more. It might be relevant to the issue at hand (and I'm hoping for your patience in case it's completely immaterial), despite mentioning Java and Python -- because others would want to use purely XML-based solutions here:

(TL;DR? skip to point 2.)

Indic script

I always thought of Indic script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as character, so when I try to get the length of an Indic-script string, the in-built string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions and I only accidentally discovered that this is the case. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways. I did not realize this earlier because in India we often use a Latin script-based notation called WX for Indic scripts in NLP due to the encoding and input method related problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both of them as the same character, which is how most of us, if not all, think of them in India to the best of my knowledge. On the other hand, the consonant symbol modifier 'halant' is not used in WX, but is used in Indic-scripts and its presence might also cause disagreements about what the string length is. In other words, character as a unit does not work in your terms. In fact, who knows how many errors for Indic script text have made their way into computational results due to this simple fact. And perhaps they still do because it took me a long time to realize this, which at first led to consternation, because in text processing if you can't rely on the string length function, what can you rely on?

(By Anil Singh, in a message to Corpora-l)
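A short Python sketch may help show what a codepoint-based length function actually counts for Indic text. The sample word (Devanagari "Hindi") is an assumption and does not come from the quoted message. The code only inspects Unicode general categories and makes no attempt at cluster segmentation.

```python
import unicodedata

# Devanagari "Hindi": HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

# len() counts every codepoint, including dependent vowels and the halant.
codepoints = len(word)

# Dependent vowel signs and the virama (halant) have a mark category (Mc or Mn).
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
letters = [ch for ch in word if unicodedata.category(ch).startswith("L")]

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
print(codepoints, len(letters), len(marks))
```

Here the codepoint count is six, split evenly between letters and marks, so neither number matches a reader's count of written units. That mismatch is what a grapheme-cluster function is meant to address.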

Arabic script

I would like to see if it is feasible to consistently get the same string length for the following variants of the same word (shukran, 'thanks'): شكرا and شُكْرًا. The latter example uses some diacritics (there can also be examples with more diacritics in a single grapheme; they can stack), not only for short vowels, but also for the final "n" (and also for the absence of a vowel, between "k" and "r"). And, naturally, they are bound to be produced by various methods. The result, in both cases, should be four, irrespective of how the second form is constructed.

If fn:characters were to differ from fn:graphemes in this case and/or for the Indic example (consistently), then that might indicate a benefit in keeping the latter function in.

If, like me, you're not exactly eagle-eyed, you might appreciate the screenshot enlarging the squiggles:

(despite appearances, there is no whitespace before the final ا -- the whole thing is a single word)
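The expectation of "four in both cases" can be checked approximately with the Python sketch below. The escaped strings are one assumed encoding of the two spellings, and counting non-mark codepoints is only a stand-in for grapheme clusters that happens to work for this word.

```python
import unicodedata

# shukran without diacritics: SHEEN, KAF, REH, ALEF
plain = "\u0634\u0643\u0631\u0627"
# shukran with diacritics: SHEEN+DAMMA, KAF+SUKUN, REH+FATHATAN, ALEF
vocalized = "\u0634\u064F\u0643\u0652\u0631\u064B\u0627"


def base_count(s):
    """Count codepoints that are not combining marks (category M*)."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


# Codepoint lengths differ, base-character counts agree.
print(len(plain), len(vocalized))
print(base_count(plain), base_count(vocalized))
```

A codepoint-based split sees four and seven items respectively, while both spellings have four base letters, which is the consistent result asked for above.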

Apologies if this is not on topic (I do hope it is, and will be curious to learn why it isn't, if it isn't -- even if by following a pointer, so thanks in advance).

Arithmeticus · 2023-10-31T19:52:35Z

Thinking about Piotr's examples, and my own in Greek and Syriac, I see in fn:grapheme clear benefits for those who work with non-Latin scripts. Easy for me to say, but the implementers should say whether it is manageable.

If the functionality is approved by the CG, I would prefer to see it as its own standalone function, and not packed as a map option into fn:characters: the function's name is misleading, and functions with parameters that expect a map can be a hassle to use. To support the two UAX 29 rule types, fn:grapheme could be extended to arity two with the parameter $extended as xs:boolean := true(). (Crossing my fingers that Unicode doesn't introduce a third type of grapheme cluster.)

In reading UAX#29 I can't help but also suggest that we consider introducing the functions fn:words and fn:sentences. The caveats in UAX 29 would have to be iterated, but the result would provide significant utility to a very broad range of users, including the majority who work only in Latin scripts.

And XPath would finally have a function that begins with 'w'.

duncdrum · 2023-10-31T20:41:32Z

For CJK string manipulation, a Unihan-compliant fn:grapheme() would be highly useful. I'll gladly come up with or review examples (not now, I'm on my phone).

@michaelhkay @ChristianGruen I don’t think of this as too sophisticated. The lack of perceived need seems to me accidentally based on the linguistic composition of the working group.

michaelhkay · 2023-10-31T22:17:52Z

I'm wondering if splitting text into graphemes could be presented as a use-case for invisible XML?

The thing that always worries me about features like this is that the WG doesn't have the expertise to get the specification right. It's bad enough with collations -- we do exactly what UCA says, and it turns out not to meet users' needs.

liamquin · 2023-10-31T23:38:35Z

On Tue, 2023-10-31 at 12:52 -0700, Joel Kalvesmaki wrote:

In reading UAX#29 I can't help but also suggest that we consider introducing the functions fn:words and fn:sentences. The caveats in UAX 29 would have to be iterated,

In particular, it doesn't work for huge numbers of people in the world unless your implementation has a dictionary (e.g. for China, Japan, Thailand). On the other hand it could use the same definition as regular expressions (\b \< \> \w \W in most systems).

If these various functions are added, there should be support in regular expressions too (do we have \X already? see e.g. [1]).

Sentences, Mr. Kalvesmaki, are harder :-)

[1] https://stackoverflow.com/questions/53198407/is-there-a-regular-expression-which-matches-a-single-grapheme-cluster

…

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org

duncdrum · 2023-11-01T13:00:25Z

If these various functions are added, there should be support in regular expressions too (do we have \X already?

Regex support would be fantastic. I don't think we do; another syntax suggestion is \g, see https://www.unicode.org/reports/tr18/#RL2.2

ChristianGruen · 2023-12-14T11:57:37Z

No further comments here for the last 6 weeks… Do we believe someone would be ready and willing to create a proposal?

Arithmeticus · 2023-12-14T17:26:40Z

I would be willing to do so, but only if (1) a standard function would be significantly more performant than one written by users and (2) there are but dim prospects for development of an ecosystem that allows independent QT libraries to flourish. (See thread "packaging".)

My personal preference is that a community of linguists develop grapheme and related functions. But (to restate the points in the previous paragraph) I would reconsider that recommendation if performance would be suboptimal, or if an independent library of linguistic functions would lie in oblivion.

rhdunn · 2023-12-16T20:52:28Z

The issue isn't really whether the user can do this efficiently, but whether they can do it accurately. Especially when writing this in pure XPath/XQuery/XSLT.

The information needed for this is in the Unicode Character Database (UCD) and the algorithms specified in the relevant Unicode TRs.

Doing this properly would likely involve including an external library such as https://icu.unicode.org/. This is difficult to do outside of the processor, and implementors will already be including this data for other functions such as upper/lower case conversion and regex script/general category selectors (\p{Latn}, \p{Lu}, etc.). As such, this would be easier to have processor support for than doing it in an external component.

Arithmeticus · 2024-02-27T05:43:51Z

It is worth looking at efforts parallel to fn:grapheme in Rust and Python.

Unicode provides some excellent resources (links to UCD 15.1):

Having looked more closely at the algorithm, I think that this cannot be easily implemented in iXML or regular expressions. I agree with @rhdunn that if fn:grapheme should enter the QT ecosystem, it has to be done on the level of implementation.

I think that the very extensive test suite provided by Unicode (no. 1 above) provides exactly what would be needed to ensure accurate implementation. I'd be willing to convert the Unicode test suite into QT4 tests.

I think the more significant question is whether implementers of the QT 4.0 specs believe that their effort is worthwhile. There are several possible strategies an implementer could use to apply the rules.

Personally, I believe fn:grapheme has the potential to greatly help underserved communities. The communities that would benefit include those who use:

Korean
Myanmar
Thai
Assorted Indic scripts
Arabic
Syriac
Control and line characters (classes Cc, Cn, Cf, Zl, Zp)
Emojis

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

gimsieke · 2024-02-27T07:05:22Z

Won’t the BreakIterators that ICU4J provides help implementers in the Java realm?

liamquin · 2024-02-27T07:38:32Z

On Mon, 2024-02-26 at 23:05 -0800, Gerrit Imsieke wrote: Won’t the BreakIterators that ICU4J provides help implementers in the Java realm?

Yes. And in C# and C++. There's also code for identifying grapheme clusters in harfbuzz, usable from C and C++ directly. There's some additional complexity in that e.g. SIL Graphite (e.g. in OpenOffice/LibreOffice) does shaping at the font level; the Unicode algorithm (I'm told) isn't adequate in all cases. If it becomes necessary I can find more details. But in practice the ICU BreakIterators or the harfbuzz hb_shape function seem to be what most people use. Either way it likely adds a dependency. I agree, however, that it'd be a useful addition.

ChristianGruen · 2024-02-27T10:16:05Z

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

We’d be willing to provide an implementation. The function can be compared with fn:parse-html: It’s too complex to provide a custom implementation, but if a library is available that does the actual work (ICU, in our case), it will be easy to embed it, and to enable the function if the library is found in the classpath.

michaelhkay · 2024-02-28T11:55:01Z

Similarly to Christian, if tests are available and an ICU library implementation is available, then it's not a major cost to add a function that wraps the ICU implementation.

rhdunn · 2024-02-28T18:52:10Z

Yes, I would expect this functionality to be implementable by wrapping the ICU functionality, or other Unicode library that implements the relevant TR logic. This is about exposing that capability to XPath, XSLT, and XQuery.

ChristianGruen · 2024-05-15T11:12:10Z

Accepted

rhdunn added the XQFO and Feature labels Sep 14, 2022

ChristianGruen added this to the QT 4.0 milestone Oct 14, 2022

ChristianGruen removed this from the QT 4.0 milestone Apr 27, 2023

ChristianGruen changed the title ~~[FO] Split a string by graphemes~~ Split a string by graphemes Apr 27, 2023

Arithmeticus mentioned this issue Mar 8, 2024

73 fn:graphemes #1068

Merged

ChristianGruen mentioned this issue Mar 8, 2024

fn:ucd #1069

Open

ChristianGruen added the PR Pending label Mar 26, 2024

ChristianGruen closed this as completed May 15, 2024

ChristianGruen removed the PR Pending label Oct 8, 2024
