Split a string by graphemes #73

Closed
rhdunn opened this issue May 7, 2021 · 27 comments
Labels
Feature A change that introduces a new feature PR Pending A PR has been raised to resolve this issue XQFO An issue related to Functions and Operators

Comments

@rhdunn
Contributor

rhdunn commented May 7, 2021

The new fn:characters function is useful, but doesn't solve a problem of manipulating strings where multiple codepoints correspond to a single grapheme. For example:

  1. characters with one or more combining characters;
  2. emoji with skin tone modifiers;
  3. emoji with gender variants;
  4. multi-codepoint emoji sequences -- family, Wales flag, etc.;
  5. regional indicator pairs for flags.

Getting this right is complex, and a regular-expression implementation is easy to get wrong.
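To make the mismatch concrete, here is a small Python sketch (Python is used only for illustration, since its `len` counts code points much as fn:characters would). The sample strings and the helper name `code_points` are assumptions chosen for this note, not part of the proposal.

```python
# Illustrative samples only; each string is a single user-perceived character.
SAMPLES = {
    "e + combining acute accent": "e\u0301",
    "waving hand + skin tone modifier": "\U0001F44B\U0001F3FD",
    "man + ZWJ + microscope": "\U0001F468\u200D\U0001F52C",
    "two regional indicators (flag of Estonia)": "\U0001F1EA\U0001F1EA",
}


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, text in SAMPLES.items():
    # len() counts code points, so every sample reports more than one.
    print(f"{label}: {len(text)} code points {code_points(text)}")
```

Each sample renders as one symbol, yet none has a length of one when counted by code point.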

fn:graphemes

Summary

Splits the supplied string into a sequence of single-grapheme (one or more character) strings.

Signature

fn:graphemes($value as xs:string?) as xs:string*

Properties

This function is ·deterministic·, ·context-independent·, and ·focus-independent·.

Rules

The function returns a sequence of strings, each containing one ·grapheme· of $value, in the order in which they occur. Boundaries are determined by the Unicode rules for what constitutes a ·grapheme·. The versions of the Unicode and Unicode Emoji standards used are ·implementation-dependent·.

If $value is a zero-length string or the empty sequence, the function returns the empty sequence.

Examples

The expression fn:graphemes("Thérèse") returns ("T", "h", "é", "r", "è", "s", "e"), irrespective of whether the accented e characters are precomposed or use combining characters.

The expression fn:graphemes("") returns ().

The expression fn:graphemes(()) returns ().

The expression fn:graphemes("👋🏻👋🏼👋🏽👋🏾👋🏿") returns ("👋🏻", "👋🏼", "👋🏽", "👋🏾", "👋🏿").

The expression fn:graphemes("👪") returns ("👪").

The expression fn:graphemes("👨‍🔬👩‍🔬") returns ("👨‍🔬", "👩‍🔬").

The expression fn:graphemes("🇪🇪🇩🇪🇫🇷🏴󠁧󠁢󠁷󠁬󠁳󠁿🇮🇸") returns ("🇪🇪", "🇩🇪", "🇫🇷", "🏴󠁧󠁢󠁷󠁬󠁳󠁿", "🇮🇸").
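The examples above can be approximated with a deliberately simplified Python sketch. It is not the Unicode algorithm: it only attaches combining marks, variation selectors, skin tone modifiers, tag characters and ZWJ-joined characters to the preceding character, and pairs regional indicators. Hangul syllables, CR LF, prepended marks and Indic conjuncts are ignored, and the helper names are hypothetical. A real implementation would rely on a Unicode library.

```python
import unicodedata

ZWJ = "\u200d"


def _is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def _attaches_to_previous(ch):
    """Roughly approximate the Extend, SpacingMark and ZWJ classes."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF  # emoji skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F  # tag characters (subdivision flags)
    )


def graphemes(value):
    """Split value into approximate grapheme clusters (simplified sketch)."""
    clusters = []
    for ch in value or "":
        if clusters:
            prev = clusters[-1]
            join = (
                _attaches_to_previous(ch)
                # Simplification: the real rule requires an emoji on both sides of ZWJ.
                or prev.endswith(ZWJ)
                # A regional indicator joins only a lone preceding indicator.
                or (_is_regional_indicator(ch)
                    and len(prev) == 1
                    and _is_regional_indicator(prev))
            )
            if join:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters


# "Thérèse" written with combining accents still yields seven clusters.
print(len(graphemes("The\u0301re\u0300se")))
```

Both the empty string and a missing value yield an empty list, mirroring the rule that the function returns the empty sequence in those cases.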

@rhdunn
Contributor Author

rhdunn commented May 7, 2021

It might also be worth adding a note to fn:characters about this issue and referencing fn:graphemes for the use cases where preserving graphemes is required.

@Conal-Tuohy

It would be good to spell out the Unicode blocks of the combining characters, variation selectors, etc. Some of the current XPath functions are spelled out as "equivalent of the following function: ... " and this could be doable for fn:graphemes, too, I think.

"Text" and "emoji" variation selectors would be another good example to include:

The expression fn:graphemes("♎♎︎") returns ("♎", "♎︎").

@rhdunn
Contributor Author

rhdunn commented May 7, 2021

The http://unicode.org/reports/tr51/ document should be referenced, which details how to identify an emoji grapheme. Which version is used should be implementation dependent.

@liamquin

liamquin commented May 7, 2021

Are you sure you mean grapheme and not grapheme cluster here?

@rhdunn
Contributor Author

rhdunn commented May 7, 2021

Ah yes, you are right. Interestingly, Unicode supports two grapheme cluster modes (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) -- legacy grapheme cluster, and extended grapheme cluster -- in addition to the emoji rules in TR51.

As a result of this, it may be more useful to extend fn:characters with an options map. That could have two options:

record(grapheme-cluster as enum("codepoint", "legacy", "extended"),
       emoji as xs:boolean)

If grapheme-cluster is "codepoint", it works as $value => string-to-codepoints() -> codepoints-to-string(). If it is "legacy" or "extended", it follows the corresponding Unicode algorithm in TR29.

If emoji is true, then it follows the TR51 rules for segmenting emoji sequences.

If we want the default to be a simple conversion, the options could default to map { grapheme-cluster: "codepoint", emoji: false() }, otherwise, they could default to map { grapheme-cluster: "extended", emoji: true() }.

The exact behaviour would be implementation dependent and depend on the Unicode/Emoji version supported by that implementation.

@rhdunn
Contributor Author

rhdunn commented May 7, 2021

I based the function name on http://www.unicode.org/glossary/#grapheme, specifically:

(2) What a user thinks of as a character.

@michaelhkay
Contributor

I love the phrase "what a user thinks of as a character". I don't imagine many users think a pile of poo is a character. They might think that Sherlock Holmes is.

@duncdrum

This is quite interesting, but we will need to ensure that graphemes in the context of CJK and Unihan return what users would expect. I'll try to come up with a few examples.

@rhdunn rhdunn added XQFO An issue related to Functions and Operators Feature A change that introduces a new feature labels Sep 14, 2022
@ChristianGruen ChristianGruen added this to the QT 4.0 milestone Oct 14, 2022
@ChristianGruen ChristianGruen removed this from the QT 4.0 milestone Apr 27, 2023
@ChristianGruen ChristianGruen changed the title [FO] Split a string by graphemes Split a string by graphemes Apr 27, 2023
@ChristianGruen
Contributor

I wonder if this feature isn’t too sophisticated to be added to the spec. What do others think?

@michaelhkay
Contributor

Personally, this is not something I have ever felt a need for. I'm open to persuasion on that, though I'm aware that when one person enthusiastically wants a feature, and everyone else doesn't see the need for it, there is a tendency to add it, causing feature creep. But I also need convincing that specifying it, implementing it, and testing it are reasonably feasible. A few features like parse-html and parse-uri are inevitably going to be difficult, but we only want to tackle difficult issues if there's a high benefit.

@Arithmeticus
Contributor

Let me check with the TEI linguistic community and gauge their interest.

@bansp

bansp commented Oct 31, 2023

Regarding "people who may see the need for it", the following is a fragment of an e-mail that I have found interesting and kept aside for when I get a moment to research more. It might be relevant to the issue at hand (and I'm hoping for your patience in case it's completely immaterial), despite mentioning Java and Python -- because others would want to use purely XML-based solutions here:

(TL;DR? skip to point 2.)

  1. Indic script

I always thought of Indic script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as character, so when I try to get the length of an Indic-script string, the in-built string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions and I only accidentally discovered that this is the case. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways. I did not realize this earlier because in India we often use a Latin script-based notation called WX for Indic scripts in NLP due to the encoding and input method related problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both of them as the same character, which is how most of us, if not all, think of them in India to the best of my knowledge. On the other hand, the consonant symbol modifier 'halant' is not used in WX, but is used in Indic-scripts and its presence might also cause disagreements about what the string length is. In other words, character as a unit does not work in your terms. In fact, who knows how many errors for Indic script text have made their way into computational results due to this simple fact. And perhaps they still do because it took me a long time to realize this, which at first led to consternation, because in text processing if you can't rely on the string length function, what can you rely on?

(By Anil Singh, in a message to Corpora-l)

  2. Arabic script

I would like to see if it is feasible to consistently get the same string length for the following variants of the same word (shukran, 'thanks'): شكرا and شُكْرًا. The latter example uses some diacritics (there can also be examples with more diacritics in a single grapheme; they can stack), not only for short vowels, but also for the final "n" (and also for the absence of a vowel, between "k" and "r"). And, naturally, they are bound to be produced by various methods. The result, in both cases, should be four, irrespective of how the second form is constructed.

If fn:characters were to differ from fn:graphemes in this case and/or for the Indic example (consistently), then that might indicate a benefit in keeping the latter function in.
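The difference can be illustrated with a short Python sketch. The escaped strings are my own transcription of the two spellings of shukran and are an assumption (the exact order of the final diacritic may differ from the text above, which does not affect the counts). Skipping every character whose general category starts with "M" is only a rough stand-in for grapheme clustering.

```python
import unicodedata

# Assumed transcriptions of the two spellings of "shukran".
PLAIN = "\u0634\u0643\u0631\u0627"
VOCALISED = "\u0634\u064F\u0643\u0652\u0631\u064B\u0627"


def base_character_count(text):
    """Count characters that are not combining marks (categories Mn, Mc, Me)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


for word in (PLAIN, VOCALISED):
    print(len(word), base_character_count(word))
```

Counting by code point gives 4 for the plain form and 7 for the vocalised one, while the mark-skipping count is 4 for both, which is the result asked for above.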

If, like me, you're not exactly eagle-eyed, you might appreciate the screenshot enlarging the squiggles:
[image: screenshot enlarging the two spellings]

(despite appearances, there is no whitespace before the final ا -- the whole thing is a single word)


Apologies if this is not on topic (I do hope it is, and will be curious to learn why it isn't, if it isn't -- even if by following a pointer, so thanks in advance).

@Arithmeticus
Contributor

Thinking about Piotr's examples, and my own in Greek and Syriac, I see in fn:grapheme clear benefits for those who work with non-Latin scripts. Easy for me to say, but the implementers should say whether it is manageable.

If the functionality is approved by the CG, I would prefer to see it as its own standalone function, and not packed as a map option into fn:characters: the function's name is misleading, and functions with parameters that expect a map can be a hassle to use. To support the two UAX 29 rule types, fn:grapheme could be extended to arity two with the parameter $extended as xs:boolean := true(). (Crossing my fingers that Unicode doesn't introduce a third type of grapheme cluster.)

In reading UAX#29 I can't help but also suggest that we consider introducing the functions fn:words and fn:sentences. The caveats in UAX 29 would have to be iterated, but the result would provide significant utility to a very broad range of users, including the majority who work only in Latin scripts.

And XPath would finally have a function that begins with 'w'.

@duncdrum

For CJK string manipulation, a Unihan-compliant fn:grapheme() would be highly useful. I'll gladly come up with or review examples (not now, I'm on my phone).

@michaelhkay @ChristianGruen I don’t think of this as too sophisticated. The lack of perceived need seems to me accidentally based on the linguistic composition of the working group.

@michaelhkay
Contributor

I'm wondering if splitting text into graphemes could be presented as a use-case for invisible XML?

The thing that always worries me about features like this is that the WG doesn't have the expertise to get the specification right. It's bad enough with collations -- we do exactly what UCA says, and it turns out not to meet users' needs.

@liamquin

liamquin commented Oct 31, 2023 via email

@duncdrum

duncdrum commented Nov 1, 2023

If these various functions are added, there should be support in regular expressions too (do we have \X already?)

Regex support would be fantastic. I don't think we do. Another syntax suggestion is \g; see https://www.unicode.org/reports/tr18/#RL2.2

@ChristianGruen
Contributor

No further comments here for the last 6 weeks… Do we believe someone would be ready and willing to create a proposal?

@Arithmeticus
Contributor

I would be willing to do so, but only if (1) a standard function would be significantly more performant than one written by users and (2) there are but dim prospects for development of an ecosystem that allows independent QT libraries to flourish. (See thread "packaging".)

My personal preference is that a community of linguists develop grapheme and related functions. But (to restate the points in the previous paragraph) I would reconsider that recommendation if performance would be suboptimal, or if an independent library of linguistic functions would lie in oblivion.

@rhdunn
Contributor Author

rhdunn commented Dec 16, 2023

The issue isn't really whether the user can do this efficiently, but whether they can do it accurately. Especially when writing this in pure XPath/XQuery/XSLT.

The information needed for this is in the Unicode Character Database (UCD) and the algorithms specified in the relevant Unicode TRs.

Doing this properly would likely involve including an external library such as https://icu.unicode.org/. This is difficult to do outside of the processor, and implementors will already be including this data for other functions such as upper/lower case conversion and regex script/general category selectors (\p{Latn}, \p{Lu}, etc.). As such, this would be easier to have processor support for than doing it in an external component.

@Arithmeticus
Contributor

It is worth looking at parallel efforts to fn:grapheme: in Rust and Python.

Unicode provides some excellent resources (links to UCD 15.1):

  1. Grapheme break tests
  2. Guide to grapheme breaks
  3. Grapheme break properties

Having looked more closely at the algorithm, I think that this cannot be easily implemented in iXML or regular expressions. I agree with @rhdunn that if fn:grapheme should enter the QT ecosystem, it has to be done on the level of implementation.

I think that the very extensive test suite provided by Unicode (no. 1 above) provides exactly what would be needed to ensure accurate implementation. I'd be willing to convert the Unicode test suite into QT4 tests.
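As a rough idea of what such a conversion involves, the Python sketch below turns one line into the input string and the expected clusters. It assumes the line format I understand the Unicode grapheme break test file to use: hexadecimal code points separated by ÷ (break allowed) and × (no break), followed by a # comment. The function name and the sample line are hypothetical.

```python
BREAK = "\u00F7"     # the division sign, marking an allowed break
NO_BREAK = "\u00D7"  # the multiplication sign, marking a forbidden break


def parse_break_test_line(line):
    """Return (input_string, expected_clusters), or None for comment-only lines."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    clusters = []
    current = ""
    for token in data.split():
        if token == BREAK:
            if current:
                clusters.append(current)
                current = ""
        elif token != NO_BREAK:
            current += chr(int(token, 16))  # hexadecimal code point
    if current:  # defensive: lines are expected to end with a break marker
        clusters.append(current)
    return "".join(clusters), clusters


# Hypothetical sample in the assumed format: SPACE, COMBINING DIAERESIS, SPACE.
sample = "\u00F7 0020 \u00D7 0308 \u00F7 0020 \u00F7\t# illustrative comment"
print(parse_break_test_line(sample))
```

Each parsed pair maps naturally onto one test case: the first item is the argument to the function, the second the expected sequence.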

I think the more significant question is whether implementers of the QT 4.0 specs believe that their effort is worthwhile. There are several possible strategies an implementer could use to apply the rules.

Personally, I believe fn:grapheme has the potential to greatly help underserved communities. The communities that would benefit include those who use:

  • Korean
  • Myanmar
  • Thai
  • Assorted Indic scripts
  • Arabic
  • Syriac
  • Control and line characters (classes Cc, Cn, Cf, Zl, Zp)
  • Emojis

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

@gimsieke
Contributor

Won’t the BreakIterators that ICU4J provides help implementers in the Java realm?

@liamquin

liamquin commented Feb 27, 2024 via email

@ChristianGruen
Contributor

ChristianGruen commented Feb 27, 2024

I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function.

We’d be willing to provide an implementation. The function can be compared with fn:parse-html: It’s too complex to provide a custom implementation, but if a library is available that does the actual work (ICU, in our case), it will be easy to embed it, and to enable the function if the library is found in the classpath.

@michaelhkay
Contributor

Similarly to Christian, if tests are available and an ICU library implementation is available, then it's not a major cost to add a function that wraps the ICU implementation.

@rhdunn
Contributor Author

rhdunn commented Feb 28, 2024

Yes, I would expect this functionality to be implementable by wrapping the ICU functionality, or other Unicode library that implements the relevant TR logic. This is about exposing that capability to XPath, XSLT, and XQuery.
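The wrapping approach discussed in the last few comments might look roughly like the following Python sketch. It is an analogy only: instead of ICU on a Java classpath, it assumes the third-party `regex` package, whose `\X` pattern matches one extended grapheme cluster, and enables the function only when that package is installed. The name `load_segmenter` is hypothetical.

```python
import importlib.util


def load_segmenter():
    """Return a grapheme-splitting function, or None if no library is available."""
    # Assumption: the optional third-party "regex" package stands in for ICU here.
    if importlib.util.find_spec("regex") is None:
        return None
    import regex

    pattern = regex.compile(r"\X")  # \X matches one extended grapheme cluster
    return lambda value: pattern.findall(value or "")


graphemes = load_segmenter()
if graphemes is None:
    print("grapheme support disabled: no segmentation library found")
else:
    print(graphemes("The\u0301re\u0300se"))
```

The design choice mirrors the comments above: the processor contains no segmentation logic of its own and simply exposes whatever the underlying Unicode library decides.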

@ChristianGruen ChristianGruen added the PR Pending A PR has been raised to resolve this issue label Mar 26, 2024
@ChristianGruen
Contributor

Accepted

9 participants