consider Mongolian text (NNBSP) in wordcount #3

Open
srl295 opened this Issue Jan 26, 2017 · 15 comments

srl295 commented Jan 26, 2017

See: http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf

ᠡᠷᠲᠡ ᠲᠣᠭ ᠠ ᠲᠣᠮᠰᠢ ᠦ ᠢ ᠨᠥᠭᠴᠢᠭᠰᠡᠨ ᠠᠯᠠᠪ ᠤᠨ ᠤᠷᠢᠳᠠ ᠠᠨᠤ
The word count is correct at 8 words

lianghai Jan 26, 2017

The string extracted from PDF is not correct. We need help from L2/17-036's authors to get the original string. Or I can construct a similar test case.

The other aspect of the problem is, as I mentioned in L2/17-052, section 1.3, it's not preferred by the majority of Mongolian users that suffixes are not considered separate words.

srl295 commented Jan 26, 2017

lianghai Jan 27, 2017

Counted differently by:

  1. Native users' common understanding (which tends to treat a suffix as simply another word) vs. certain scholars' grammatical analysis (which treats a suffix as part of the word).

    • It's like, in a parallel universe, the preposition "as" has a special and required way of writing "s" which doesn't occur in normal words: say, it must be written "aſ something", while normal words like "gas" always just have "s". Many English users might still consider "aſ" a separate word, but many scholars think the space in "aſ something" must be a special character (say, NNBSP) that makes the string count as a single word and be non-breaking.
  2. Languages might have different preferences too.

    • According to L2/17-008 Proposal to encode one Manchu format character, among the 4 major languages that use the Mongolian script, Mongolian and Todo's mainstream grammar considers a suffix to be part of the word, but Manchu and Sibe's mainstream grammar considers a suffix to be a separate word.

Minimum test cases for word counting (string, escaped, transliteration and reference image, notes):

  • Mongolian language

    • ᠮᠣᠩᠭᠣᠯ ᠤᠨ
    • <182E 1823 1829 182D 1823 182F 202F 1824 1828>
    • monggol-un (reference image omitted)
    • TUS 9.0 expects this to be counted as 1 word. Users might prefer 2.
    • The space here is NNBSP (U+202F) as required by TUS 9.0.
    • NNBSP's width (including stretchability), word boundary extending, and non-breaking behaviors are not always preferred by general users.
  • Manchu language

    • ᠮᠠᠨᠵᡠ ᡳ
    • <182E 1820 1828 1835 1860 202F 1873>
    • manju-i (reference image omitted)
    • TUS 9.0 expects this to be counted as 1 word. Users prefer 2.
    • The space here is NNBSP (U+202F) as required by TUS 9.0.
    • NNBSP's width (including stretchability) and word boundary extending behaviors are not expected by Manchu and Sibe. See L2/17-008 Proposal to encode one Manchu format character.
    • NNBSP's non-breaking behavior is not always preferred by general users.
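
The two test cases above can be checked with a small sketch. This is not the wordcount tool's actual implementation; the function name `count_words`, the `nnbsp_joins` flag, and the per-language table are hypothetical, introduced only to contrast the two counting policies discussed here. Note that Python's own `str.split()` and the regex class `\s` both treat U+202F as whitespace, so the "1 word" reading has to exclude NNBSP from the separators explicitly.

```python
import re

# Whitespace runs, with and without U+202F (NNBSP) acting as a word separator.
ANY_SPACE = re.compile(r"\s+")
SPACE_EXCEPT_NNBSP = re.compile(r"[^\S\u202F]+")

# Hypothetical defaults (an assumption, not part of any standard), following
# the grammar split attributed to L2/17-008 in the comment above.
NNBSP_JOINS_BY_LANGUAGE = {
    "mongolian": True,   # suffix is part of the word
    "todo": True,
    "manchu": False,     # suffix is a separate word
    "sibe": False,
}


def count_words(text, nnbsp_joins):
    """Count whitespace-separated words.

    nnbsp_joins=True  keeps text around U+202F together as one word.
    nnbsp_joins=False treats U+202F like any other space.
    """
    splitter = SPACE_EXCEPT_NNBSP if nnbsp_joins else ANY_SPACE
    return sum(1 for piece in splitter.split(text) if piece)


# The escaped code point sequences from the two test cases.
MONGGOL_UN = "\u182E\u1823\u1829\u182D\u1823\u182F\u202F\u1824\u1828"
MANJU_I = "\u182E\u1820\u1828\u1835\u1860\u202F\u1873"
```

With `nnbsp_joins=True` both strings count as 1 word; with `nnbsp_joins=False` both count as 2. Which value a given user group expects is exactly the open question in this issue.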

lianghai Jan 27, 2017

The reply above originally contained some mistakes and was missing some information. Please read it on the web page if GitHub didn't push notifications of the later changes to your email inbox.

srl295 Feb 14, 2017

Thank you. Mongolian will become the best supported language here!

badaa Feb 12, 2018

Hi lianghai and srl295,
All the suffixes are considered as part of the word, thus they should not be counted separately.
They are not counted differently. If you count Mongolian suffixes separately, the word count increases massively, because Mongolian is a very, very agglutinative language. A word such as "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" is common. It is just one word, meaning "with colleagues" in English.

lianghai Feb 13, 2018

@badaa

> All the suffixes are considered as part of the word, thus they should not be counted separately.

It's not a consensus among average users. There's a messy dislocation between linguistic scholars' definition of "word", encoding experts' requirement of NNBSP for correct shaping, common users' understanding of "word", and the intention of word counting (are we counting words for linguistic scholars?) for various user groups.

badaa Feb 13, 2018

@lianghai
It's a distinct term. Nobody could imagine in Mongolia that the suffixes are counted as words. What do you mean "average users"?
Every scholar teaches this term in school. Thus, I tend to think that by "average users" you mean anyone who is either "unqualified" or didn't attend a Mongolian school.

badaa Feb 13, 2018

@lianghai

> NNBSP's width (including stretchability), word boundary extending, and non-breaking behaviors are not always preferred by general users.

You have probably misunderstood here. Because there is one situation the non-breakable behaviour is not preferred to write inflected words like "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" vertically in horizontal narrow space. It is actually design or space issue. Even the suffixes are started from new column they are part of the word thus they start with medial form. I would say the suffixes can be breakable until they start as medial form, but not separable from the word even they start from new line.

lianghai Feb 25, 2018

@badaa

> Nobody could imagine in Mongolia that the suffixes are counted as words.

Then I need to consider the possibility that users in Mongolia tend to consider enclitics as parts of the preceding word because the Cyrillic writing system tends to write enclitics actually as suffixes. My impression from Inner Mongolian users doesn't suggest so.

Note "suffix" refers to the structures that are actually attached, thus I try to call the detached structures "enclitics" to distinguish them.

> What do you mean "average users"?

Every now and then I ask non-scholar users of the Mongolian script how they count words and why. My impression is that they don't have a strong tendency to omit enclitics. I don't know your qualifications, but most of them did attend Mongolian schools.

Also, I might need to clarify that I consider scholars' opinion to be especially irrelevant to this topic, because I don't follow prescriptivism. No matter what scholars/teachers teach, what average users have actually learned is what actually matters; note there's a difference.

> Because there is one situation the non-breakable behaviour is not preferred to write inflected words like "ᠠᠵᠢᠯ ᠤᠨ ᠬᠢᠨ ᠲᠠᠢ ᠪᠠᠨ" vertically in horizontal narrow space. It is actually design or space issue.

I agree such cases involve more typographical considerations. However, I don't observe a strong preference for keeping the space preceding enclitics unbroken, even in the running text of newspapers and books.

> Even the suffixes are started from new column they are part of the word thus they start with medial form.

I understand that break-ability and word-counting are not necessarily related, so you can be assured that I won't analyze word-counting according to break-ability.

> I would say the suffixes can be breakable until they start as medial form, but not separable from the word even they start from new line.

Noted. Do you have a list of detached structures that you consider as "suffixes" (which I call "enclitics" and I assume you consider they should all be counted as part of preceding word), and marked with their break-ability?

badaa Feb 25, 2018

@lianghai

> No matter what scholars/teachers teach, what average users have actually learned is what actually matters; note there's a difference.

Actually, that is the direct opposite. Teachers teach how words should be written in this script and what is correct or incorrect, no matter what users have learned. We cannot order everyone to learn it correctly, but we always wish they would.

> Noted. Do you have a list of detached structures that you consider as "suffixes" (which I call "enclitics" and I assume you consider they should all be counted as part of preceding word), and marked with their break-ability?

I will bring the list of all detached structures at Mongolian Meeting in April.

lianghai Feb 26, 2018

@badaa

> Actually, that is the direct opposite. Teachers teach how words should be written in this script and what is correct or incorrect, no matter what users have learned. We cannot order everyone to learn it correctly, but we always wish they would.

Then it's clear you tend to be more prescriptive, while I work in a more descriptive way. I hope this difference between our methodologies helps you understand our disagreement on certain issues.

> I will bring the list of all detached structures at Mongolian Meeting in April.

Thank you.

mongoltolbo Mar 2, 2018

Of course, suffixes are part of their stem word, and the concept is the same as in the Cyrillic Mongolian script. In academia, in essay writing classes, students count each word together with its suffixes, just as word processing applications do. It is obvious. That is why we need this proposal (http://www.unicode.org/L2/L2017/17036-mongolian-suffix.pdf) in real life as soon as possible. I hope 180F's behavior will include a breakable attribute (like a non-breaking hyphen) while still being counted within its stem. When is this proposal going to be implemented in the Unicode Standard? I have many font projects that are held up, waiting on this unsolved problem.
http://mongoltolbo.com

lianghai Mar 2, 2018

@mongoltolbo Can you elaborate on why your font projects are waiting for the proposed MSC model? And may I ask, do you consider it obvious whether the structures uu/üü and ügei should be counted or ignored (counted as part of the stem word)? (cc @badaa for the second question.)

fromUB Mar 14, 2018

@lianghai
Inflections for grammatical cases are parts of the nouns which they are modifying. If they were separate words, they would be written with initial letters, but they start with mid characters indicating that they are part of the noun.
