consider Mongolian text (NNBSP) in wordcount #3

Open
srl295 opened this Issue Jan 26, 2017 · 5 comments

@srl295
Member
srl295 commented Jan 26, 2017 edited

See: http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

ᠡᠷᠲᠡ ᠲᠣᠭ ᠠ ᠲᠣᠮᠰᠢ ᠦ ᠢ ᠨᠥᠭᠴᠢᠭᠰᠡᠨ ᠠᠯᠠᠪ ᠤᠨ ᠤᠷᠢᠳᠠ ᠠᠨᠤ
The word count is correct at 8 words

@lianghai

The string extracted from the PDF is not correct. We need help from the authors of L2/17-036 to get the original string. Alternatively, I can construct a similar test case.

The other aspect of the problem is that, as I mentioned in L2/17-052, section 1.3, the majority of Mongolian users would prefer suffixes to be considered separate words rather than part of the preceding word.

@srl295
Member
srl295 commented Jan 26, 2017
@lianghai
lianghai commented Jan 27, 2017 edited

Counted differently by:

  1. Native users' common understanding (they tend to consider a suffix simply another word) vs. certain scholars' grammatical analysis (which considers a suffix part of the word).

    • It's like, in a parallel universe, the preposition "as" has a special and required way of writing "s" which doesn't occur in normal words. Say it must be written "aſ something", while normal words like "gas" always just have "s". Many English users might still consider "aſ" a separate word, but many scholars think the space in "aſ something" must be a special character (say, NNBSP) which makes the string count as a single word and be non-breaking.
  2. Languages might have different preferences too.

    • According to L2/17-008 Proposal to encode one Manchu format character, among the 4 major languages that use the Mongolian script, Mongolian and Todo's mainstream grammar considers a suffix to be part of the word, but Manchu and Sibe's mainstream grammar considers a suffix to be a separate word.

Minimum test cases for word counting (string, escaped, transliteration and reference image, notes):

  • Mongolian language

    • ᠮᠣᠩᠭᠣᠯ ᠤᠨ
    • <182E 1823 1829 182D 1823 182F 202F 1824 1828>
    • monggol-un [reference image: screen shot 2017-01-27 at 20 02 48]
    • TUS 9.0 expects this to be counted as 1 word. Users might prefer 2.
    • The space here is NNBSP (U+202F) as required by TUS 9.0.
    • NNBSP's width (including stretchability), word boundary extending, and non-breaking behaviors are not always preferred by general users.
  • Manchu language

    • ᠮᠠᠨᠵᡠ ᡳ
    • <182E 1820 1828 1835 1860 202F 1873>
    • manju-i [reference image: screen shot 2017-01-27 at 20 02 55]
    • TUS 9.0 expects this to be counted as 1 word. Users prefer 2.
    • The space here is NNBSP (U+202F) as required by TUS 9.0.
    • NNBSP's width (including stretchability) and word boundary extending behaviors are not expected by Manchu and Sibe. See L2/17-008 Proposal to encode one Manchu format character.
    • NNBSP's non-breaking behavior is not always preferred by general users.
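
The two test cases above could be checked with something like the following. This is only a minimal sketch, not the wordcount project's actual implementation: the helper names `from_code_points` and `count_words` and the `nnbsp_joins_suffix` flag are hypothetical, and the counter is a plain whitespace split rather than a real word-boundary algorithm. It only illustrates the two competing expectations (TUS 9.0: 1 word; many users: 2 words).

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE, the suffix separator in both test cases


def from_code_points(hex_string):
    """Build a string from space-separated hex code points, e.g. '182E 1823'."""
    return "".join(chr(int(cp, 16)) for cp in hex_string.split())


def count_words(text, nnbsp_joins_suffix):
    """Naive whitespace-based word count (hypothetical helper).

    nnbsp_joins_suffix=True  -> NNBSP does not separate words (TUS 9.0 view).
    nnbsp_joins_suffix=False -> NNBSP separates words like an ordinary space.
    """
    if nnbsp_joins_suffix:
        # Drop NNBSP so the stem and suffix form one token.
        text = text.replace(NNBSP, "")
    else:
        # Treat NNBSP exactly like a regular space.
        text = text.replace(NNBSP, " ")
    return len(text.split())


# Escaped forms copied from the test cases in this comment.
MONGOLIAN = from_code_points("182E 1823 1829 182D 1823 182F 202F 1824 1828")  # monggol-un
MANCHU = from_code_points("182E 1820 1828 1835 1860 202F 1873")  # manju-i

if __name__ == "__main__":
    for name, text in (("Mongolian", MONGOLIAN), ("Manchu", MANCHU)):
        print(name,
              count_words(text, nnbsp_joins_suffix=True),
              count_words(text, nnbsp_joins_suffix=False))
```

Under these assumptions each string counts as 1 word when NNBSP joins the suffix and as 2 words when it separates. A per-language switch (joining for Mongolian and Todo, separating for Manchu and Sibe) would be one way to reflect the preferences described in L2/17-008.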
@lianghai

The reply above originally contained some mistakes and was missing some information. Please read it on the web page if GitHub didn't send notifications of the later changes to your email inbox.

@srl295
Member
srl295 commented Feb 14, 2017

Thank you. Mongolian will become the best supported language here!
