Split a string by graphemes #73
It might also be worth adding a note to |
It would be good to spell out the Unicode blocks of the combining characters, variation selectors, etc. Some of the current XPath functions are spelled out as "equivalent of the following function: ... " and this could be doable for "Text" and "emoji" variation selectors would be another good example to include:
|
The http://unicode.org/reports/tr51/ document should be referenced, which details how to identify an emoji grapheme. Which version is used should be implementation dependent. |
Are you sure you mean grapheme and not grapheme cluster here? |
Ah yes, you are right. Interestingly, Unicode supports two grapheme cluster modes (https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) -- legacy grapheme cluster, and extended grapheme cluster -- in addition to the emoji rules in TR51. As a result of this, it may be more useful to extend
If If If we want the default to be a simple conversion, the options could default to The exact behaviour would be implementation dependent and depend on the Unicode/Emoji version supported by that implementation. |
I based the function name on http://www.unicode.org/glossary/#grapheme, specifically:
|
I love the phrase "what a user thinks of as a character". I don't imagine many users think a pile of poo is a character. They might think that Sherlock Holmes is. |
This is quite interesting, but we will need to ensure that graphemes in the context of CJK and Unihan return what users would expect. I'll try to come up with a few examples. |
I wonder if this feature isn’t too sophisticated to be added to the spec. What do others think? |
Personally, this is not something I have ever felt a need for. I'm open to persuasion on that, though I'm aware that when one person enthusiastically wants a feature, and everyone else doesn't see the need for it, there is a tendency to add it, causing feature creep. But I also need convincing that specifying it, implementing it, and testing it are reasonably feasible. A few features like parse-html and parse-uri are inevitably going to be difficult, but we only want to tackle difficult issues if there's a high benefit. |
Let me check with the TEI linguistic community and gauge their interest. |
Regarding "people who may see the need for it", the following is a fragment of an e-mail that I have found interesting and kept aside for when I get a moment to research more. It might be relevant to the issue at hand (and I'm hoping for your patience in case it's completely immaterial), despite mentioning Java and Python -- because others would want to use purely XML-based solutions here: (TL;DR? skip to point 2.)
(By Anil Singh, in a message to Corpora-l)
I would like to see if it is feasible to consistently get the same string length for the following variants of the same word (shukran, 'thanks'): If If, like me, you're not exactly eagle-eyed, you might appreciate the screenshot enlarging the squiggles: (despite appearances, there is no whitespace before the final Apologies if this is not on topic (I do hope it is, and will be curious to learn why it isn't, if it isn't -- even if by following a pointer, so thanks in advance). |
Thinking about Piotr's examples, and my own in Greek and Syriac, I see in If the functionality is approved by the CG, I would prefer to see it as its own standalone function, and not packed as a map option into In reading UAX#29 I can't help but also suggest that we consider introducing the functions And XPath would finally have a function that begins with 'w'. |
For cjk string manipulation unihan compliant @michaelhkay @ChristianGruen I don’t think of this as too sophisticated. The lack of perceived need seems to me accidentally based on the linguistic composition of the working group. |
I'm wondering if splitting text into graphemes could be presented as a use case for Invisible XML? The thing that always worries me about features like this is that the WG doesn't have the expertise to get the specification right. It's bad enough with collations -- we do exactly what UCA says, and it turns out not to meet users' needs. |
On Tue, 2023-10-31 at 12:52 -0700, Joel Kalvesmaki wrote:
> In reading UAX#29 I can't help but also suggest that we consider
> introducing the functions fn:words and fn:sentences. The caveats in
> UAX 29 would have to be iterated,
In particular, it doesn't work for huge numbers of people in the world
unless your implementation has a dictionary (e.g. for China, Japan,
Thailand).
On the other hand it could use the same definition as regular
expressions (\b \< \> \w \W in most systems).
If these various functions are added, there should be support in
regular expressions too (do we have \X already? See e.g. [1]).
Sentences, Mr. Kalvesmaki, are harder :-).
[1] https://stackoverflow.com/questions/53198407/is-there-a-regular-expression-which-matches-a-single-grapheme-cluster
…--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org
|
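To make the regular-expression definition of a word mentioned above concrete, here is a minimal Python sketch. The helper name `regex_words` is hypothetical (it is not the proposed fn:words), and it relies only on `\w` from Python's standard `re` module. It also illustrates the caveat about scripts written without spaces: without a dictionary, the whole run comes back as one "word".

```python
import re
from typing import List


def regex_words(text: str) -> List[str]:
    """Hypothetical helper: return maximal runs of regex word characters."""
    # Assumption: "word" here means whatever Python's Unicode-aware \w matches,
    # not the UAX #29 word-boundary rules.
    return re.findall(r"\w+", text)


print(regex_words("Sentences, Mr. Kalvesmaki, are harder"))
# ['Sentences', 'Mr', 'Kalvesmaki', 'are', 'harder']

print(regex_words("我爱北京"))
# ['我爱北京']  (one token: no spaces, and no dictionary to segment it)
```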
Regex support would be fantastic. I don't think we do; another syntax suggestion |
No further comments here for the last 6 weeks… Do we believe someone would be ready and willing to create a proposal? |
I would be willing to do so, but only if (1) a standard function would be significantly more performant than one written by users and (2) there are but dim prospects for development of an ecosystem that allows independent QT libraries to flourish. (See thread "packaging".) My personal preference is that a community of linguists develop |
The issue isn't really whether the user can do this efficiently, but whether they can do it accurately. Especially when writing this in pure XPath/XQuery/XSLT. The information needed for this is in the Unicode Character Database (UCD) and the algorithms specified in the relevant Unicode TRs. Doing this properly would likely involve including an external library such as https://icu.unicode.org/. This is difficult to do outside of the processor, and implementors will already be including this data for other functions such as upper/lower case conversion and regex script/general category selectors ( |
It is worth looking at parallel efforts to Unicode provides some excellent resources (links to UCD 15.1): Having looked more closely at the algorithm, I think that this cannot be easily implemented in iXML or regular expressions. I agree with @rhdunn that if I think that the very extensive test suite provided by Unicode (no. 1 above) provides exactly what would be needed to ensure accurate implementation. I'd be willing to convert the Unicode test suite into QT4 tests. I think the more significant question is whether implementers of the QT 4.0 specs believe that their effort is worthwhile. There are several possible strategies an implementer could use to apply the rules. Personally, I believe
I'd also be willing to work on the specs to create an actionable PR that the CG can deliberate over (or @rhdunn can do so). But I wouldn't want to invest that time if no implementer expressed interest or willingness to implement the complex function. |
Won’t the |
On Mon, 2024-02-26 at 23:05 -0800, Gerrit Imsieke wrote:
> Won’t the BreakIterators that ICU4J provides help implementers in the
> Java realm?
Yes. And in C# and C++. There's also code for identifying grapheme
clusters in harfbuzz, usable from C and C++ directly.
There's some additional complexity in that e.g. SIL Graphite (e.g. in
OpenOffice/LibreOffice) does shaping at the font level; the Unicode
algorithm (I'm told) isn't adequate in all cases. If it becomes
necessary I can find more details. But in practice the ICU
BreakIterators or the harfbuzz hb_shape function seem to be what most
people use. But either way it likely adds a dependency.
I agree, however, it'd be a useful addition.
|
We’d be willing to provide an implementation. The function can be compared with |
Similarly to Christian, if tests are available and an ICU library implementation is available, then it's not a major cost to add a function that wraps the ICU implementation. |
Yes, I would expect this functionality to be implementable by wrapping the ICU functionality, or other Unicode library that implements the relevant TR logic. This is about exposing that capability to XPath, XSLT, and XQuery. |
Accepted |
The new fn:characters function is useful, but it doesn't solve the problem of manipulating strings in which multiple codepoints correspond to a single grapheme. For example:
Getting this right is complex, and implementing it as a regular expression is error-prone.
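As a rough illustration of the codepoint/grapheme mismatch (not the example originally posted here), the Python sketch below uses only the standard `unicodedata` module. Iterating over a Python string yields codepoints, which is assumed here to be comparable to what fn:characters returns; the helper name `codepoints` is made up for the illustration.

```python
import unicodedata
from typing import List


def codepoints(s: str) -> List[str]:
    """Hypothetical helper: list the codepoints of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single codepoint U+00E9
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl

print(codepoints(decomposed))  # ['U+0065', 'U+0301']
print(len(decomposed))         # 2 codepoints, but one user-perceived character
print(len(composed))           # 1
print(len(family))             # 5 codepoints, typically rendered as one emoji
```

Splitting any of these strings by codepoint tears apart what a reader sees as a single character, which is the gap a grapheme-aware function is meant to fill.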
fn:graphemes
Summary
Splits the supplied string into a sequence of single-grapheme (one or more character) strings.
Signature
fn:graphemes($value as xs:string?) as xs:string*
Properties
This function is ·deterministic·, ·context-independent·, and ·focus-independent·.
Rules
The function returns a sequence of strings, each containing the corresponding ·grapheme· in $value. These are determined by the corresponding Unicode rules for what constitutes a ·grapheme·. The versions of the Unicode and Unicode Emoji standards used are ·implementation-dependent·.
If $value is a zero-length string or the empty sequence, the function returns the empty sequence.
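The shape of these rules can be sketched in Python. This is a deliberately simplified approximation, not the UAX #29 algorithm: it only attaches combining marks (general categories Mn, Mc, Me, which include the variation selectors) and ZWJ sequences to the preceding codepoint, and it ignores CR LF, Hangul jamo, regional indicators, emoji modifiers and prepended characters. It also joins anything that follows a ZWJ, which is more liberal than the real rule. The name `graphemes` and the use of `None` for the empty sequence are assumptions made for the illustration; a real implementation would more likely wrap ICU or a similar library.

```python
import unicodedata
from typing import List, Optional

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def _extends(ch: str) -> bool:
    """Approximate the 'Extend' property: combining marks plus ZWJ."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch == ZWJ


def graphemes(value: Optional[str]) -> List[str]:
    """Simplified sketch: split value into approximate grapheme clusters."""
    # None stands in for the empty sequence; both cases return an empty list.
    if not value:
        return []
    clusters: List[str] = []
    for ch in value:
        # Attach ch to the current cluster if it extends it, or if the
        # cluster ends in ZWJ (simplification of the emoji ZWJ rule).
        if clusters and (_extends(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(graphemes("e\u0301a"))      # two clusters: 'e' + accent, then 'a'
print(graphemes("\u2764\ufe0f"))  # one cluster: heart + variation selector
print(graphemes(""))              # []
```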
Examples