Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make curved labels in more languages #555

Merged
merged 7 commits into from
May 23, 2017
Merged

Make curved labels in more languages #555

merged 7 commits into from
May 23, 2017

Conversation

mapmeld
Copy link
Contributor

@mapmeld mapmeld commented May 4, 2017

I saw Mapzen's blog post about curved labels awhile ago and was interested in how I could expand it to more languages

screen shot 2017-05-04 at 11 17 12 am

Burmese (pictured above) has some text-shaping - for example the word "ဟို" is three chars long. But it's similar to how "ý" can be two chars. Burmese doesn't connect all letters like Arabic or Devanagari. By counting grapheme clusters instead of chars, we can prevent both writing systems from being split up irregularly. That's what this pull request fixes.

The RegEx is long because it includes other writing systems (at least Latin alphabet, Devanagari and Tamil), but I have not enabled or tested them in curved labels yet.

In the future it would be possible to use Intl.Segmenter or a polyfill.

Arabic is another level of complexity (choosing the right initial/medial/final/isolated form) but it might be possible based on previous work in OSM iD and circular-arabic.

Edit: fixed screenshot, always preview

@mapmeld
Copy link
Contributor Author

mapmeld commented May 4, 2017

New commit to add more languages and better explain where the language RegEx comes from.

Test coordinates:

  • Burmese: 17.3/16.793161/96.169961
  • Khmer: 16.3/11.544808/104.937004
  • Lao: 19/19.88344/102.12752

India and Sri Lanka road data is overwhelmingly English in the name tag. By adding name:ta and other language options, I found some of the others:

  • Tamil: 17/9.0130559/80.4731369
  • Kannada: 16.96/12.2836016/76.652995

These have a few roads, but are currently sticking to regular, not-curved labels [Update: I hard-coded check.type = 'curved'; in label_line.js and feel good about the ability to write curved labels after this test]

  • Oriya: 15.56/21.446689/86.8124176
  • Telugu: 19/17.242273/80.143238
  • Sinhala: 17/7.3173068/80.3408611

I am having trouble finding roads on OSM named in the other languages, even when using Overpass: Gujarati, Gurmukhi/Punjabi. There are some rivers or footpaths, so if these were labeled we could probably test.

As mentioned earlier, Devanagari and Arabic scripts are connected so they're beyond the scope of this fix

@bcamper
Copy link
Member

bcamper commented May 5, 2017

This is an awesome contribution!! I've been playing around with it and noticed a couple things:

  1. Creating a new RegExp object for every character iteration is kind of intense and caused a noticeable slowdown (all this Canvas text measuring and rendering is already pushing it... :)

    I tried moving the grapheme characters check to a single compiled RegExp, with the first text character explicitly appended to the current segment text before the results of the regex match:

    // compiled regex to find grapheme cluster chars
    const grapheme_regex = new RegExp("(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + "\\w(" + accents_and_vowels + ")?)?");
    ...
    
    // later, in function to find next grapheme cluster
    segment += text[0];
    let grapheme_cluster = text.substring(1).match(grapheme_regex);
    if (grapheme_cluster && grapheme_cluster[0] != null) {
       segment += grapheme_cluster[0];
    }
    

    I'm not yet familiar enough with these scripts and combining characters, but it seems like this should be OK since the original regex was just doing an explicit check for the first character, rather than any dependent match or lookahead/behind logic (new RegExp(testText[0] + "(?:" + ...)? After some fiddling and spot-checking, it seems to produce the same results as your original. Anything I'm missing here?

  2. Had problems with labels that had special regex characters (like parentheses) inside them breaking the regex. This can be fixed by escaping them, but is eliminated by the changes above in any case.

@mapmeld
Copy link
Contributor Author

mapmeld commented May 6, 2017

@bcamper - agreed, this looks better and doesn't cause any language issues. The pattern of a grapheme cluster can be: [character], [character][additional accents/signs], or the more complex one being [character][additional accents/signs][character-combining mark / virama][2nd character][2nd character's accents/signs]. If there were complex Emoji on OSM then we'd need more.

I was thinking more on this and would ask your input on two additional changes:

  • capturing the first character of the segment by starting the RegEx with . instead of adding it on a separate line. In this case we can always expect to add at least one character to the segment string.
  • changing from str.match(grapheme_regex) to grapheme_regex.exec(str) to only returns the first result, which is the only one we need

@bcamper
Copy link
Member

bcamper commented May 8, 2017

Thanks - the single regex looks good. I'm still debugging a case where it goes into an infinite loop, even though I'd expect it to always take at least one character from the . - fun :)

By setting scene.setIntrospection(true), you can get HTML tooltips (and thus theoretically corrected rendered by the browser) for everything with a name. This is handy for comparing to the Tangram rendering. This at least allows a layperson who doesn't read the script to do a visual confirmation.

I was spot-checking Khmer and came across a case that looked wrong (see last two characters):

Bad (curved label):
khmer bad

Good (straight label):
khmer good

I know nothing about Khmer but I'm definitely interested to understand what's going on here. I see that you had it defined as:
"\u17B4-\u17D1\u17D3" + // Khmer

There's a conspicuous single missing character in the sequence there...
screen shot 2017-05-08 at 5 40 29 pm
via http://www.unicode.org/charts/PDF/U1780.pdf

I tried adding that character to the range ("\u17B4-\u17D3" + // Khmer), and voila, it seems to work!

screen shot 2017-05-08 at 5 49 58 pm

Do you know why it wasn't present? Intentional/mistake?

I'm going to keep reviewing these but wanted to run that one by you in the meantime.

@mapmeld
Copy link
Contributor Author

mapmeld commented May 8, 2017

\u17D2 is not a visible accent by itself but a "combo character" that I put in that other regex. The regex is expecting a combo character and then another letter which goes subscript, so it shouldn't have split on it unless something weird is going on (like combo character + more vowel signs)

Do you have coordinates where this label is? I would like to look at the text char by char.

@bcamper
Copy link
Member

bcamper commented May 8, 2017

Aha I should have checked that other regex. It should be visible right in the middle of your test coords: 16.30000/11.54480/104.93701

The string is "វិថី កោះពេជ្រ". The OSM feature is http://www.openstreetmap.org/way/375404697

@bcamper
Copy link
Member

bcamper commented May 8, 2017

Similar deal with these two, near middle of 17.63334/11.54948/104.92715:

screen shot 2017-05-08 at 6 17 52 pm

screen shot 2017-05-08 at 6 17 32 pm

@mapmeld
Copy link
Contributor Author

mapmeld commented May 8, 2017

I think I see the problem and (fortunately? I think?) a regex issue

This sequence [letter][combo character][letter] is somehow failing because the regex includes an optional accents/vowel signs after each letter, which for some reason is not being treated as optional.

@mapmeld
Copy link
Contributor Author

mapmeld commented May 8, 2017

I'm thinking this is the right regex (using a . for both 1st letter and optional 2nd combo letter, also allowing multiple accents on the 2nd letter)

new RegExp(".(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + ".(" + accents_and_vowels + "+)?)?")

@mapmeld
Copy link
Contributor Author

mapmeld commented May 9, 2017

ok here's my latest idea

const graphemeRegex = new RegExp("^.(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + "\\W(?:" + accents_and_vowels + "+)?)?");
...
for (graphemeCount; graphemeCount < codon_length && testText.length > 0; graphemeCount++) {
    let graphemeCluster = (graphemeRegex.exec(testText) || testText)[0];
    segment += graphemeCluster;
    testText = testText.substring(graphemeCluster.length);
}

@bcamper
Copy link
Member

bcamper commented May 10, 2017

Thanks @mapmeld - this is looking good, plus I've found a few unrelated issues in the process of testing it :) Is this ready to go as far as you're concerned?

Question as I test and learn more about this. Here's an example Sinhala feature:
http://www.openstreetmap.org/way/368378417

Full text (name:si) is:
"ඉඹුල්ගස්දෙනිය - රඹුක්කන පාර"

The current grapheme cluster regex will split this up into these segments (I changed the codon_length to 1 for this, which will give us only one grapheme cluster per segment, instead of up to 2 as is the usual Tangram default):
["ඉ", "ඹු", "ල්ග", "ස්දෙ", "නි", "ය", " ", "-", " ", "ර", "ඹු", "ක්ක", "න", " ", "පා", "ර"]

@meetar and I were looking at some examples of groups like "ල්ග" and "ස්දෙ", where it seems visually like the characters could theoretically be broken down further. Is this just an effect of the current regex being conservative enough to catch cases like the Khmer issue we saw earlier? e.g. maybe it would be possible to create even more detailed / script-specific regexes to get more granularity, but this is a good middle-ground? Anyway I'm happy to stop where we are for now since this is a great improvement, just curious.

It's looking nice here:
17.14000/7.31549/80.32824
screen shot 2017-05-10 at 3 54 43 pm

I believe I've spot-checked all of the scripts you included except for Gujarati and Gurmukhi (will try those too if I can). BTW, if you want to just hard-code some label text for Tangram to render, to verify the output (if we can't find suitable OSM features), you can use a JS function in the text_source parameter to return an explicit string to render, e.g. text_source: function() { return 'ඉඹුල්ගස්දෙනිය' }

@bcamper
Copy link
Member

bcamper commented May 10, 2017

I was testing some Gurmukhi/Punjabi (http://www.openstreetmap.org/way/248011769, at 12.41000/31.13213/75.06559), and I noticed this script does have the bar running across the top -- do you know if it's OK for the continuity of that to be broken across segments like this? In any case, Google also appears to do that with Punjabi labels...

screen shot 2017-05-10 at 5 17 16 pm

@mapmeld
Copy link
Contributor Author

mapmeld commented May 10, 2017

You're right - that's wrong how the Sinhala text was blocking there. I looked at the Unicode chart and fixed two parts of my Sinhala text range, which appears to fix the problem. I've created a set of tests on my side to make sure that a change to one piece doesn't undo our progress somewhere else https://github.com/mapmeld/grapheme-regex/blob/master/test/index.js

I don't know enough about Gurmukhi/Punjabi to say whether that example is good or bad. I think I might remove it out of abundance of caution

(did a merge to update with master branch)

@@ -580,7 +580,7 @@ const accents_and_vowels = "[:\u0300-\u036F" + // Combining Diacritical Marks
"\u0B01-\u0B03\u0B3C\u0B3E-\u0B4C\u0B56\u0B57\u0B62\u0B63" + // Oriya
"\u0B82\u0BBE-\u0BCD\u0BD7" + // Tamil
"\u0C00-\u0C03\u0C3E-\u0C4C\u0C55\u0C56\u0C62\u0C63" + // Telugu
"\u0D82-\u0D83\u0DCF-\u0DDF\u0DF2\u0DF3" + // Sinhala
"\u0D82\u0D83\u0DCA-\u0DDF\u0DF2\u0DF3" + // Sinhala
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should \u0DCA be removed from combo_characters?

@bcamper
Copy link
Member

bcamper commented May 11, 2017

@tallytalwar says that the Gurmukhi/Punjabi example is OK and that it's fine for the "bar" at top to have some breaks when segments bend on curved labels.

I've been doing some reading up on grapheme clusters and Unicode, but I'm definitely still unclear on some things :) (which is OK to merge this, I would just like to have a general idea of what's going on). Is there a simple explanation for the criteria of when a character goes in accents_and_vowels vs. combo_characters?

Also, I understand why Arabic is beyond the scope here, but less so Devanagari (given other similar scripts in your set) -- does it have shaping rules that make it unsuitable for this kind of regex approach? @tallytalwar is a Devanagari reader and wasn't aware of why it wouldn't apply, but also isn't familiar with the encoding/grapheme rules.

Similarly, the only other remaining scripts in our "curved label blacklist" are Bengali, Mongolian, Tibetan. Have you left them out because you're not familiar with them, and/or do you know if this could be extended to cover them?

Otherwise, I think we are ready to merge on our side, if you are? Thanks again!

@mapmeld
Copy link
Contributor Author

mapmeld commented May 11, 2017

ok - thanks @tallytalwar for checking the Gurmukhi/Punjabi example. By my original thought process I could have left it out. If breaking the bar is OK on these labels, yes we can include Devanagari and Bengali.

Accents and vowel signs modify one base character (y -> ý, က -> ကြ). The character does not belong on its own ( ́ or ြ).

In this family of scripts that we're talking about, each base character is a syllable (Devanagari न = "na", नन = "nana"). People still need double-consonant sounds ("Chennai" as example). To make it work you "break the leg" of the 1st one to form न्न. In Unicode it looks like न + ् (virama) + न. The combinations don't need to be the same letter (न + ् + त = न्त). In Burmese it works a little differently; letters are stacked on top of each other (က + ္ + က = က္က). In Tamil there is just a dot over the first letter, and no change to the second letter. In any case where we combine letters, it's important not to break up letter1, letter2, and their accent marks.

This is my first time looking closely at Tibetan... it looks like we could apply the same rule similar to Burmese.

Mongolian is typically written vertically and its Wikipedia article contains no examples of it being written horizontally.

Also, Sinhala point is right. I'll remove that from combo_characters and add the other scripts later today, I think.

@nvkelso
Copy link
Member

nvkelso commented May 11, 2017 via email

@pnorman
Copy link
Contributor

pnorman commented May 12, 2017

Mongolian is typically written vertically and its Wikipedia article contains no examples of it being written horizontally.

We looked at Mongolian in OpenStreetMap Carto, and although the issue doesn't contain much useful my recollection is that @russdeffner, who has spent time there, said horizontal was okay. I would say that if you can correctly render horizontal on straight lines you're ahead of many map renderers, and along the line for a curved line is fine.

@mapmeld
Copy link
Contributor Author

mapmeld commented May 12, 2017

OK, latest commit covers everything mentioned here, except Mongolian, which I would like to leave for now. In addition to the vertical/horizontal issue, Mongolian letters have text-shaping more similar to Arabic, actually, so the word-chunking process would have similar issues.

For future reference on Tibetan: http://rishida.net/scripts/tibetan/

@bcamper
Copy link
Member

bcamper commented May 17, 2017

@mapmeld just wanted to mention that @tallytalwar and I saw what we think is a Devanagari issue when testing - will get you some more details tomorrow.

@mapmeld
Copy link
Contributor Author

mapmeld commented May 21, 2017

Might be good to add Malayalam for SW India: http://thottingal.in/blog/2017/05/21/a-formal-grammar-for-malayalam-conjunct/

@mapmeld
Copy link
Contributor Author

mapmeld commented May 21, 2017

Added Malayalam and included a fix which may resolve your Devanagari issue (if it is related to the concept in the Malayalam link?)

@bcamper
Copy link
Member

bcamper commented May 21, 2017

I believe that does fix the Devanagari issue! Hadn't gotten around to writing it up yet, and it looks like you probably already figured this out, but here is what I was seeing.

First of all there was a problem with the regex. For example, for the word "स्ट्रिंग" ("string"), our previous regex was returning these 3 characters as the first grapheme cluster:
["स", "्", "ट"].

I tried checking this against the results for Chrome's Intl.v8BreakIterator (as I understand it, close to a passthrough API to ICU), and it returns just these 2 characters for the first cluster:
["स", "्"]

I was wondering if our regex was too greedy (to account for some of the other script features you adjusted it to support?), and it looked like we were actually taking a character from what V8 considered the next cluster:

Tangram: [["स", "्", "ट"], ["्"]]
V8: [["स", "्"], ["ट", "्"]

But even using Intl.v8BreakIterator for grapheme cluster detection did not render properly. Here, all roads have the same text ("string string string"), but note the curved label character order is different from the (correct) straight order:

screen shot 2017-05-16 at 10 43 56 pm

It looks like Is what we are seeing here an issue with what the link you provided calls conjuncts, and what this link calls consonant clusters? Does this mean that in cases like this, the grapheme cluster itself is not sufficient as an independent visual unit, but requires additional script knowledge?

In any case, your changes do fix the example above, which now renders correctly:

screen shot 2017-05-21 at 4 32 29 pm

Your updated regex takes these sets of characters as the first two clusters:

[["स", "्", "ट", "्", "र", "ि", "ं"], ["ग"]]

@tallytalwar doe this make sense to you?

@mapmeld do you know if there are any downsides to the more expansive regex for other scripts?

@mapmeld
Copy link
Contributor Author

mapmeld commented May 21, 2017

In the past I had been led to believe that joining more than two consonants was extremely rare, and bringing it up I had been asked to provide examples. This article got me thinking that maybe you could be seeing the same thing on the map. In any case the RegEx now supports unlimited consonants joined there.

@tallytalwar
Copy link
Member

@bcamper yes that does make sense and indic language overview here does help explain the scenario.

The question we raised the other day was what happens for a character set with 2 or more consecutive placed (half sounding) consonants, as this introduces the virama/halant characters. And the logic mentioned in @mapmeld's link above does take care of these situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants