ambiguity or insufficient specification for Latin-based Oromo-Qubee counter styles #47

Open
verdy-p opened this issue Jul 5, 2022 · 4 comments

Comments

@verdy-p

verdy-p commented Jul 5, 2022

The "alphabetic" algorithm is not supposed to generate the same sequence for different integer values (such repetition is just allowed and properly described only for "cyclic" algorithms).

But the alphabetic symbols that double a vowel (like 2=AA) cause problems: we also get 38=AA.
We have the same problem with the symbol "NY", because "N" and "Y" are also valid symbols in the defined set.

This means that the valid range for the current specification of Oromo-Qubee is 1-37 (but no valid range is specified that would let it fall back to decimal by default).
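To make the collision concrete, here is a minimal sketch of an "alphabetic"-style (bijective base-N) rendering. It assumes a 37-symbol set in which only the first two entries ("A", "AA") and the total count are taken from this thread; the remaining entries are placeholders, and the function name `alphabetic` is hypothetical.

```python
from typing import List, Sequence


def alphabetic(value: int, symbols: Sequence[str]) -> str:
    """Render value (>= 1) as a bijective base-N string, N = len(symbols)."""
    if value < 1:
        raise ValueError("alphabetic counters are only defined for value >= 1")
    base = len(symbols)
    digits: List[str] = []
    while value > 0:
        value -= 1                       # shift to 0-based: there is no zero digit
        digits.append(symbols[value % base])
        value //= base
    return "".join(reversed(digits))     # most significant digit first


# ASSUMPTION: only "A" (value 1), "AA" (value 2) and the count of 37 come from
# the thread; "<3>" .. "<37>" are placeholders, not the real Oromo-Qubee list.
SYMBOLS = ["A", "AA"] + [f"<{i}>" for i in range(3, 38)]
```

With this set, `alphabetic(2, SYMBOLS)` and `alphabetic(38, SYMBOLS)` both produce the string "AA": the first is the single digit AA, the second is the two digits (A)(A).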

The "alphabetic" algorithm and its specification offer no such guarantee, and do not specify any option to conditionally insert distinctive separators or modifiers when some alphabetic symbols are already a concatenation of several valid symbols in the defined set.

Or maybe the given symbol "AA" is wrong or insufficient for the value 2 (and similarly "EE", "II", "OO", "UU" and "NY"). It could instead be a ligature (using ZWJ in the middle?), or a more complex cluster with diacritics like "A͜A", "A͟A", "A͡A" or "A͠A", or joined by a distinctive separator like {"A-A", "A'A", "A·A"} for representing two separate symbols (alphabetic digits of the numeral system), or use a superscript as an appended or prepended modifier like "Aª" or "ªA", instead of polygrams like "AA" that must represent an unbreakable single symbol (or digit) in the alphabetic numeral system.

  • I think there should be some separator (e.g. a hyphen or apostrophe) inserted when appending sequences of symbols that could be misinterpreted (so that 38="A-A", unlike 2="AA"), or some polygrams should include enough modifiers/diacritics that all symbols in the defined set are separable and correctly readable without ambiguity. Such a check can just look at the shortest prefixes and see whether the remaining characters are a valid prefix for other symbols in the defined set.
  • If this condition is not satisfied, then the style definition should not be "alphabetic" but "fixed", and should include a range of validity (i.e. 1-37 for Oromo-Qubee styles) to avoid the ambiguities that can occur only for values outside that range. In that case the style will have to fall back to "decimal" or some other specified style (possibly alphabetic as well and similar, but using additional diacritics/modifiers in composite clusters).
  • The same validity check may also apply to the "cyclic" system, so that no symbol in the defined set can be the concatenation of several symbols in the defined set (because it would give non-distinctive values within a single cycle of the sequence of counters). That is less critical, though: you may want to use repeated symbols like {"*", "-", "*", "+"} and even sequences where two successive symbols in the defined set are identical, like {"*", "*", "-", "-"}.

Note that other digrams used in Oromo-Qubee like "CH", "DH", "KH", "PH" and "SH" do not cause problems: even if "C", "D", "K", "P" and "S" are valid numeric symbols, "H" alone is not valid in the defined set and is used only as an appended modifier.
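The check described above can be sketched as follows. This is a simplified test, assuming a small excerpt of the symbol set made up from the letters named in this thread: it only reports symbols that can be spelled as a concatenation of two or more symbols of the set. It is not a complete unique-decodability test (such as the Sardinas-Patterson algorithm), and the function name `composite_symbols` is hypothetical.

```python
from typing import List, Sequence


def composite_symbols(symbols: Sequence[str]) -> List[str]:
    """Return the symbols that are a concatenation of >= 2 symbols of the set."""
    pool = {s for s in symbols if s}      # ignore empty strings defensively

    def splits(text: str, parts: int) -> bool:
        # True if text can be cut entirely into symbols from pool,
        # with at least two pieces overall.
        if not text:
            return parts >= 2
        return any(
            text.startswith(s) and splits(text[len(s):], parts + 1)
            for s in pool
        )

    return [s for s in symbols if splits(s, 0)]


# ASSUMPTION: an illustrative excerpt, not the full Oromo-Qubee symbol list.
EXCERPT = ["A", "AA", "C", "CH", "D", "DH", "N", "NY", "Y"]
```

On this excerpt, `composite_symbols(EXCERPT)` returns `["AA", "NY"]`; "CH" and "DH" are not reported because "H" is not in the set on its own.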


More generally, "alphabetic" is in fact a misnomer for this algorithm, and so is "symbols": we should still speak about "digits" (of the numeral system). The "alphabetic" algorithm is just a generalization of the "decimal" system that can use numeral bases other than 10. That system is itself generalized from the "positional" system (where alternate digits/symbols may be used depending on the position, to abbreviate unwritten zeroes, as in the Roman system), while also allowing "digits" to be represented by arbitrary sequences of multiple characters, provided that they do not create ambiguities.

So to be complete, some "digits" in those systems may need to provide alternate sequences: for example, "i" and "I" in the Roman system have the same value when reading/decoding a numeral, and can be selected distinctly when generating symbols from a value, depending on a "style" (e.g. lowercase, uppercase). There should also be allowance for common separators, which may be distinctive (as in the examples given above for Oromo) or not (like the commas, apostrophes or non-breaking spaces used as grouping separators in decimal systems). When there are alternate symbols, they should be listed in order of priority: the algorithm should use the first one that validates, i.e. that does not generate a string that is ambiguous to decode back to its numerical positional value.

verdy-p changed the title from "ambiguity or insufficient specification for Latin-based Oromo-Qubee counters" to "ambiguity or insufficient specification for Latin-based Oromo-Qubee counter styles" on Jul 5, 2022
@r12a
Contributor

r12a commented Jul 5, 2022

Initially i was also surprised to see that AA could represent either 2 or 38, but then i thought that not many lists are long enough (>37 items) for this ambiguity to occur, and if it does occur it is probably strongly mitigated by the context, given that these counters are usually used in sequences where the adjacent items will inform the user what number is intended.

It's true that if i happened to be dealing with a long list and wanted to point to a specific item in that list, it may create some ambiguity without qualification in written form (if i were to speak the counter there would be a difference between A=ɐ and AA=ɑː). But then the same ambiguity appears with cyclical lists.

As for the suggestion to distinguish the diphthongs using marks, that is not afaik how the alphabet is used for Oromo. On the other hand, Oromo does use an apostrophe to clarify sequences of vowels and diphthongs using the same letter (eg. boba'aa means “fuel”). I'm not sure how to make that appear in listings using the alphabetic rules, though, since it would probably look odd to include an apostrophe unless you had an ambiguous sequence.

So i'm not sure whether your initial proposition is necessarily true:

The "alphabetic" algorithm is not supposed to generate the same sequence for different integer values (such repetition is just allowed and properly described only for "cyclic" systems).

That said, @dyacob do you have any thoughts on this?

Btw, also note that the patterns described in this document are only suggestions. A content author is perfectly at liberty to modify the code so that it eliminates troublesome diphthongs, rather than just adopting it without change.

@dyacob
Member

dyacob commented Jul 5, 2022

This is a really interesting question! I must admit that I haven't come across a list quite this long yet, but also haven't gone looking for one. I can also see that in a ridiculously long list "AAA" becomes ambiguous, as it may be either "(A)(AA)" or "(AA)(A)".

I can look for an Afaan Oromo literary group, or publishing house, and put the question to them. I suspect that there is no set convention. An apostrophe is not a bad choice; an alternative might be to apply the defined list item marker as a cycle separator. I'm not sure which would be less confusing for a mother tongue reader.

I will seek out a representative party to provide input, it may take a few weeks.

@verdy-p
Author

verdy-p commented Jul 5, 2022

"AAA" can also be (A)(A)(A) = 37^2 + 37 + 1 = 1407, while (A)(AA) = 1 * 37 + 2 = 39 and (AA)(A) = 2 * 37 + 1 = 75. We are not talking about very large numbers!

As well "AA" could be 2=(AA) or 38=(A)(A) when using the current rules for the "alphabetic" style as it is defined today.
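The readings above can be enumerated mechanically. The sketch below assumes base 37 with A=1 and AA=2 as stated in this thread, and only includes those two symbols because they are the only ones that occur in the strings being decoded; the helper names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

BASE = 37                                  # number of symbols, per the thread
VALUES: Dict[str, int] = {"A": 1, "AA": 2}  # only the digits needed here


def segmentations(text: str) -> Iterator[List[str]]:
    """Yield every way to cut text into known symbols."""
    if not text:
        yield []
        return
    for sym in VALUES:
        if text.startswith(sym):
            for rest in segmentations(text[len(sym):]):
                yield [sym] + rest


def decode(digits: List[str]) -> int:
    """Positional value of a digit sequence, most significant digit first."""
    total = 0
    for d in digits:
        total = total * BASE + VALUES[d]
    return total


def readings(text: str) -> List[Tuple[int, List[str]]]:
    """All (value, segmentation) pairs for text, sorted by value."""
    return sorted((decode(p), p) for p in segmentations(text))
```

Here `readings("AA")` gives the values 2 and 38, and `readings("AAA")` gives 39, 75 and 1407, matching the figures in the comment above.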

Use apostrophes as separators if you want (I already spoke about other possible separators, or joiners).

And it is still possible to redefine Oromo-Qubee as a "fixed" style (only for numbers 1-37) and define a fallback using "alphabetic" styles where the most significant leading symbols would include a trailing apostrophe. However, the last symbol (least significant) should not have it and should match the "fixed" style. I've not seen any way to insert separators conditionally between symbols.

But for now the definition of Oromo-Qubee cannot be correct as an "alphabetic" style, so under the existing rules it has to become "fixed" and be limited to the range 1-37, meaning that it will then have to fall back to the "decimal" style for out-of-range values.

If you use an apostrophe separator (required to disambiguate some sequences, but possibly allowed between all symbols), then you need an additional "separator" property in the style definition. And this would then really make Oromo-Qubee the only existing base-37 numeric system used in natural languages. (Base-36 also exists and is used on computers for some protocols, with common symbols using one character in [0-9A-Z]; we all know of the several "base-64" encodings that use an alphabet of 64 symbols in [0-9A-Za-z] completed with 2 other non-alphanumeric characters; and PostScript also uses a base-85 encoding with more non-alphanumeric symbols.)
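The two options discussed above can be compared in a small sketch. Both functions are hypothetical: `fixed_with_fallback` mimics a "fixed" style limited to 1-37 that falls back to decimal, and `with_separator` mimics an "alphabetic" style with an imagined "separator" option, which (as noted) does not exist in the current style definition. The symbol set is again a placeholder in which only "A", "AA" and the count of 37 come from this thread.

```python
from typing import List, Sequence

# ASSUMPTION: placeholder set; only "A", "AA" and the count (37) are from the thread.
SYMBOLS = ["A", "AA"] + [f"<{i}>" for i in range(3, 38)]


def fixed_with_fallback(value: int, symbols: Sequence[str] = SYMBOLS) -> str:
    """One symbol per value inside the range, plain decimal outside it."""
    if 1 <= value <= len(symbols):
        return symbols[value - 1]
    return str(value)


def with_separator(value: int, symbols: Sequence[str] = SYMBOLS,
                   sep: str = "'") -> str:
    """Bijective base-N digits joined by a (hypothetical) separator."""
    if value < 1:
        raise ValueError("only defined for value >= 1")
    base = len(symbols)
    digits: List[str] = []
    while value > 0:
        value -= 1                        # no zero digit in this system
        digits.append(symbols[value % base])
        value //= base
    return sep.join(reversed(digits))     # separator only between digits
```

With the separator, 2 stays "AA" while 38 becomes "A'A", 39 becomes "A'AA" and 75 becomes "AA'A", so the strings no longer collide. This version inserts the separator between all digits rather than only where a sequence would otherwise be ambiguous.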

@verdy-p
Author

verdy-p commented Jul 5, 2022

Note that lists containing a few hundred items are absolutely not exceptional: just look at the list of countries in ISO 3166, Olympic members, ranked lists of students in schools, results of sports competitions, lists of candidates in elections, day rank numbers in the year on calendars, many business reports, lists of articles in laws...

And for Oromo, it is likely that counters would be used for numbering verses in Quranic or Biblical texts (though old Oromo probably did not use the modern simplified Latin script, but more likely Semitic scripts like Arabic, South Arabian, Hebrew, or Ethiopic; and there may already have been Latin-based orthographies using diacritics as well, like a macron over long vowels instead of doubling them, or a variant of the letter N instead of NY).


3 participants