Make curved labels in more languages #555

mapmeld · 2017-05-04T18:40:41Z

I saw Mapzen's blog post about curved labels awhile ago and was interested in how I could expand it to more languages

Burmese (pictured above) has some text-shaping - for example the word "ဟို" is three chars long. But it's similar to how "ý" can be two chars. Burmese doesn't connect all letters like Arabic or Devanagari. By counting grapheme clusters instead of chars, we can prevent both writing systems from being split up irregularly. That's what this pull request fixes.

The RegEx is long because it includes other writing systems (at least Latin alphabet, Devanagari and Tamil), but I have not enabled or tested them in curved labels yet.

In the future it would be possible to use Intl.Segmenter or a polyfill.

Arabic is another level of complexity (choosing the right initial/medial/final/isolated form) but it might be possible based on previous work in OSM iD and circular-arabic.

Edit: fixed screenshot, always preview

mapmeld · 2017-05-04T22:46:40Z

New commit to add more languages and better explain where the language RegEx comes from.

Test coordinates:

Burmese: 17.3/16.793161/96.169961
Khmer: 16.3/11.544808/104.937004
Lao: 19/19.88344/102.12752

India and Sri Lanka road data is overwhelmingly English in the name tag. By adding name:ta and other language options, I found some of the others:

Tamil: 17/9.0130559/80.4731369
Kannada: 16.96/12.2836016/76.652995

These have a few roads, but are currently sticking to regular, not-curved labels [Update: I hard-coded check.type = 'curved'; in label_line.js and feel good about the ability to write curved labels after this test]

Oriya: 15.56/21.446689/86.8124176
Telugu: 19/17.242273/80.143238
Sinhala: 17/7.3173068/80.3408611

I am having trouble finding roads on OSM named in the other languages, even when using Overpass: Gujarati, Gurmukhi/Punjabi. There are some rivers or footpaths, so if these were labeled we could probably test.

As mentioned earlier, Devanagari and Arabic scripts are connected so they're beyond the scope of this fix

bcamper · 2017-05-05T21:08:58Z

This is an awesome contribution!! I've been playing around with it and noticed a couple things:

Creating a new RegExp object for every character iteration is kind of intense and caused a noticeable slowdown (all this Canvas text measuring and rendering is already pushing it... :)

I tried moving the grapheme characters check to a single compiled RegExp, with the first text character explicitly appended to the current segment text before the results of the regex match:
```
// compiled regex to find grapheme cluster chars
const grapheme_regex = new RegExp("(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + "\\w(" + accents_and_vowels + ")?)?");
...

// later, in function to find next grapheme cluster
segment += text[0];
let grapheme_cluster = text.substring(1).match(grapheme_regex);
if (grapheme_cluster && grapheme_cluster[0] != null) {
   segment += grapheme_cluster[0];
}
```
I'm not yet familiar enough with these scripts and combining characters, but it seems like this should be OK since the original regex was just doing an explicit check for the first character, rather than any dependent match or lookahead/behind logic (new RegExp(testText[0] + "(?:" + ...)? After some fiddling and spot-checking, it seems to produce the same results as your original. Anything I'm missing here?
Had problems with labels that had special regex characters (like parentheses) inside them breaking the regex. This can be fixed by escaping them, but is eliminated by the changes above in any case.

mapmeld · 2017-05-06T05:47:28Z

@bcamper - agreed, this looks better and doesn't cause any language issues. The pattern of a grapheme cluster can be: [character], [character][additional accents/signs], or the more complex one being [character][additional accents/signs][character-combining mark / virama][2nd character][2nd character's accents/signs]. If there were complex Emoji on OSM then we'd need more.

I was thinking more on this and would ask your input on two additional changes:

capturing the first character of the segment by starting the RegEx with . instead of adding it on a separate line. In this case we can always expect to add at least one character to the segment string.
changing from str.match(grapheme_regex) to grapheme_regex.exec(str) to only returns the first result, which is the only one we need

bcamper · 2017-05-08T21:51:40Z

Thanks - the single regex looks good. I'm still debugging a case where it goes into an infinite loop, even though I'd expect it to always take at least one character from the . - fun :)

By setting scene.setIntrospection(true), you can get HTML tooltips (and thus theoretically corrected rendered by the browser) for everything with a name. This is handy for comparing to the Tangram rendering. This at least allows a layperson who doesn't read the script to do a visual confirmation.

I was spot-checking Khmer and came across a case that looked wrong (see last two characters):

Bad (curved label):

Good (straight label):

I know nothing about Khmer but I'm definitely interested to understand what's going on here. I see that you had it defined as:
"\u17B4-\u17D1\u17D3" + // Khmer

There's a conspicuous single missing character in the sequence there...

via http://www.unicode.org/charts/PDF/U1780.pdf

I tried adding that character to the range ("\u17B4-\u17D3" + // Khmer), and voila, it seems to work!

Do you know why it wasn't present? Intentional/mistake?

I'm going to keep reviewing these but wanted to run that one by you in the meantime.

mapmeld · 2017-05-08T22:01:59Z

\u17D2 is not a visible accent by itself but a "combo character" that I put in that other regex. The regex is expecting a combo character and then another letter which goes subscript, so it shouldn't have split on it unless something weird is going on (like combo character + more vowel signs)

Do you have coordinates where this label is? I would like to look at the text char by char.

bcamper · 2017-05-08T22:09:07Z

Aha I should have checked that other regex. It should be visible right in the middle of your test coords: 16.30000/11.54480/104.93701

The string is "វិថី កោះពេជ្រ". The OSM feature is http://www.openstreetmap.org/way/375404697

bcamper · 2017-05-08T22:19:32Z

Similar deal with these two, near middle of 17.63334/11.54948/104.92715:

mapmeld · 2017-05-08T22:21:21Z

I think I see the problem and (fortunately? I think?) a regex issue

This sequence [letter][combo character][letter] is somehow failing because the regex includes an optional accents/vowel signs after each letter, which for some reason is not being treated as optional.

mapmeld · 2017-05-08T22:26:58Z

I'm thinking this is the right regex (using a . for both 1st letter and optional 2nd combo letter, also allowing multiple accents on the 2nd letter)

new RegExp(".(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + ".(" + accents_and_vowels + "+)?)?")

mapmeld · 2017-05-09T17:16:34Z

ok here's my latest idea

const graphemeRegex = new RegExp("^.(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + "\\W(?:" + accents_and_vowels + "+)?)?");
...
for (graphemeCount; graphemeCount < codon_length && testText.length > 0; graphemeCount++) {
    let graphemeCluster = (graphemeRegex.exec(testText) || testText)[0];
    segment += graphemeCluster;
    testText = testText.substring(graphemeCluster.length);
}

bcamper · 2017-05-10T20:03:55Z

Thanks @mapmeld - this is looking good, plus I've found a few unrelated issues in the process of testing it :) Is this ready to go as far as you're concerned?

Question as I test and learn more about this. Here's an example Sinhala feature:
http://www.openstreetmap.org/way/368378417

Full text (name:si) is:
"ඉඹුල්ගස්දෙනිය - රඹුක්කන පාර"

The current grapheme cluster regex will split this up into these segments (I changed the codon_length to 1 for this, which will give us only one grapheme cluster per segment, instead of up to 2 as is the usual Tangram default):
["ඉ", "ඹු", "ල්ග", "ස්දෙ", "නි", "ය", " ", "-", " ", "ර", "ඹු", "ක්ක", "න", " ", "පා", "ර"]

@meetar and I were looking at some examples of groups like "ල්ග" and "ස්දෙ", where it seems visually like the characters could theoretically be broken down further. Is this just an effect of the current regex being conservative enough to catch cases like the Khmer issue we saw earlier? e.g. maybe it would be possible to create even more detailed / script-specific regexes to get more granularity, but this is a good middle-ground? Anyway I'm happy to stop where we are for now since this is a great improvement, just curious.

It's looking nice here:
17.14000/7.31549/80.32824

I believe I've spot-checked all of the scripts you included except for Gujarati and Gurmukhi (will try those too if I can). BTW, if you want to just hard-code some label text for Tangram to render, to verify the output (if we can't find suitable OSM features), you can use a JS function in the text_source parameter to return an explicit string to render, e.g. text_source: function() { return 'ඉඹුල්ගස්දෙනිය' }

bcamper · 2017-05-10T21:22:17Z

I was testing some Gurmukhi/Punjabi (http://www.openstreetmap.org/way/248011769, at 12.41000/31.13213/75.06559), and I noticed this script does have the bar running across the top -- do you know if it's OK for the continuity of that to be broken across segments like this? In any case, Google also appears to do that with Punjabi labels...

…rmese curved labels

mapmeld · 2017-05-10T22:12:35Z

You're right - that's wrong how the Sinhala text was blocking there. I looked at the Unicode chart and fixed two parts of my Sinhala text range, which appears to fix the problem. I've created a set of tests on my side to make sure that a change to one piece doesn't undo our progress somewhere else https://github.com/mapmeld/grapheme-regex/blob/master/test/index.js

I don't know enough about Gurmukhi/Punjabi to say whether that example is good or bad. I think I might remove it out of abundance of caution

(did a merge to update with master branch)

bcamper · 2017-05-11T19:04:14Z

src/styles/text/canvas_text.js

@@ -580,7 +580,7 @@ const accents_and_vowels = "[:\u0300-\u036F" + // Combining Diacritical Marks
 "\u0B01-\u0B03\u0B3C\u0B3E-\u0B4C\u0B56\u0B57\u0B62\u0B63" + // Oriya
 "\u0B82\u0BBE-\u0BCD\u0BD7" + // Tamil
 "\u0C00-\u0C03\u0C3E-\u0C4C\u0C55\u0C56\u0C62\u0C63" + // Telugu
-"\u0D82-\u0D83\u0DCF-\u0DDF\u0DF2\u0DF3" + // Sinhala
+"\u0D82\u0D83\u0DCA-\u0DDF\u0DF2\u0DF3" + // Sinhala


Should \u0DCA be removed from combo_characters?

bcamper · 2017-05-11T20:12:24Z

@tallytalwar says that the Gurmukhi/Punjabi example is OK and that it's fine for the "bar" at top to have some breaks when segments bend on curved labels.

I've been doing some reading up on grapheme clusters and Unicode, but I'm definitely still unclear on some things :) (which is OK to merge this, I would just like to have a general idea of what's going on). Is there a simple explanation for the criteria of when a character goes in accents_and_vowels vs. combo_characters?

Also, I understand why Arabic is beyond the scope here, but less so Devanagari (given other similar scripts in your set) -- does it have shaping rules that make it unsuitable for this kind of regex approach? @tallytalwar is a Devanagari reader and wasn't aware of why it wouldn't apply, but also isn't familiar with the encoding/grapheme rules.

Similarly, the only other remaining scripts in our "curved label blacklist" are Bengali, Mongolian, Tibetan. Have you left them out because you're not familiar with them, and/or do you know if this could be extended to cover them?

Otherwise, I think we are ready to merge on our side, if you are? Thanks again!

mapmeld · 2017-05-11T21:10:52Z

ok - thanks @tallytalwar for checking the Gurmukhi/Punjabi example. By my original thought process I could have left it out. If breaking the bar is OK on these labels, yes we can include Devanagari and Bengali.

Accents and vowel signs modify one base character (y -> ý, က -> ကြ). The character does not belong on its own ( ́ or ြ).

In this family of scripts that we're talking about, each base character is a syllable (Devanagari न = "na", नन = "nana"). People still need double-consonant sounds ("Chennai" as example). To make it work you "break the leg" of the 1st one to form न्न. In Unicode it looks like न + ् (virama) + न. The combinations don't need to be the same letter (न + ् + त = न्त). In Burmese it works a little differently; letters are stacked on top of each other (က + ္ + က = က္က). In Tamil there is just a dot over the first letter, and no change to the second letter. In any case where we combine letters, it's important not to break up letter1, letter2, and their accent marks.

This is my first time looking closely at Tibetan... it looks like we could apply the same rule similar to Burmese.

Mongolian is typically written vertically and its Wikipedia article contains no examples of it being written horizontally.

Also, Sinhala point is right. I'll remove that from combo_characters and add the other scripts later today, I think.

nvkelso · 2017-05-11T21:33:57Z

Will be nice to see Tibetan in there :)

…

On Thu, May 11, 2017 at 2:10 PM, Nick Doiron ***@***.***> wrote: ok - thanks @tallytalwar <https://github.com/tallytalwar> for checking the Gurmukhi/Punjabi example. By my original thought process I could have left it out. If breaking the bar is OK on these labels, yes we can include Devanagari and Bengali. *Accents and vowel signs* modify one base character (y -> ý, က -> ကြ). The character does not belong on its own ( ́ or ြ). In this family of scripts that we're talking about, each base character is a syllable (Devanagari न = "na", नन = "nana"). People still need double-consonant sounds ("Chennai" as example). To make it work you "break the leg" of the 1st one to form न्न. In Unicode it looks like न + ् (virama) + न. The combinations don't need to be the same letter (न + ् + त = न्त). In Burmese it works a little differently; letters are stacked on top of each other. In Tamil there is just a dot and no merging. This is my first time looking closely at Tibetan... it looks like we could apply the same rule similar to Burmese. Mongolian is typically written vertically and its Wikipedia article <https://en.wikipedia.org/wiki/Mongolian_script> contains no examples of it being written horizontally. Also, Sinhala point is right. I'll remove that from combo_characters and add the other scripts later today, I think. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#555 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA0EOxKC_YC7_zTUJ3k6DtFEuVnr-OB2ks5r43ldgaJpZM4NREBN> .

pnorman · 2017-05-12T02:27:55Z

Mongolian is typically written vertically and its Wikipedia article contains no examples of it being written horizontally.

We looked at Mongolian in OpenStreetMap Carto, and although the issue doesn't contain much useful my recollection is that @russdeffner, who has spent time there, said horizontal was okay. I would say that if you can correctly render horizontal on straight lines you're ahead of many map renderers, and along the line for a curved line is fine.

mapmeld · 2017-05-12T04:08:32Z

OK, latest commit covers everything mentioned here, except Mongolian, which I would like to leave for now. In addition to the vertical/horizontal issue, Mongolian letters have text-shaping more similar to Arabic, actually, so the word-chunking process would have similar issues.

For future reference on Tibetan: http://rishida.net/scripts/tibetan/

bcamper · 2017-05-17T01:42:11Z

@mapmeld just wanted to mention that @tallytalwar and I saw what we think is a Devanagari issue when testing - will get you some more details tomorrow.

mapmeld · 2017-05-21T06:25:37Z

Might be good to add Malayalam for SW India: http://thottingal.in/blog/2017/05/21/a-formal-grammar-for-malayalam-conjunct/

mapmeld · 2017-05-21T18:17:21Z

Added Malayalam and included a fix which may resolve your Devanagari issue (if it is related to the concept in the Malayalam link?)

bcamper · 2017-05-21T20:44:46Z

I believe that does fix the Devanagari issue! Hadn't gotten around to writing it up yet, and it looks like you probably already figured this out, but here is what I was seeing.

First of all there was a problem with the regex. For example, for the word "स्ट्रिंग" ("string"), our previous regex was returning these 3 characters as the first grapheme cluster:
["स", "्", "ट"].

I tried checking this against the results for Chrome's Intl.v8BreakIterator (as I understand it, close to a passthrough API to ICU), and it returns just these 2 characters for the first cluster:
["स", "्"]

I was wondering if our regex was too greedy (to account for some of the other script features you adjusted it to support?), and it looked like we were actually taking a character from what V8 considered the next cluster:

Tangram: [["स", "्", "ट"], ["्"]]
V8: [["स", "्"], ["ट", "्"]

But even using Intl.v8BreakIterator for grapheme cluster detection did not render properly. Here, all roads have the same text ("string string string"), but note the curved label character order is different from the (correct) straight order:

It looks like Is what we are seeing here an issue with what the link you provided calls conjuncts, and what this link calls consonant clusters? Does this mean that in cases like this, the grapheme cluster itself is not sufficient as an independent visual unit, but requires additional script knowledge?

In any case, your changes do fix the example above, which now renders correctly:

Your updated regex takes these sets of characters as the first two clusters:

[["स", "्", "ट", "्", "र", "ि", "ं"], ["ग"]]

@tallytalwar doe this make sense to you?

@mapmeld do you know if there are any downsides to the more expansive regex for other scripts?

mapmeld · 2017-05-21T20:54:58Z

In the past I had been led to believe that joining more than two consonants was extremely rare, and bringing it up I had been asked to provide examples. This article got me thinking that maybe you could be seeing the same thing on the map. In any case the RegEx now supports unlimited consonants joined there.

tallytalwar · 2017-05-21T20:55:20Z

@bcamper yes that does make sense and indic language overview here does help explain the scenario.

The question we raised the other day was what happens for a character set with 2 or more consecutive placed (half sounding) consonants, as this introduces the virama/halant characters. And the logic mentioned in @mapmeld's link above does take care of these situations.

mapmeld added 5 commits May 10, 2017 14:35

make codons using grapheme cluster length, not char length - makes Bu…

0eceeaa

…rmese curved labels

add language support

d422451

onetime regex

b116be9

patch up regex and for loop making the codons

d1bd750

Sinhala unicode ranges fix

1f25a1c

bcamper reviewed May 11, 2017

View reviewed changes

complete language list

e2aa923

add Malayalam and support multiple virama joins

7724941

pnorman mentioned this pull request May 23, 2017

Client-side rendering engine pnorman/bolder#4

Closed

bcamper merged commit 39f2fc8 into tangrams:master May 23, 2017

bcamper added this to the v0.12.5 milestone May 23, 2017

bcamper mentioned this pull request Jul 19, 2017

Myanmar/Burmese script rendering incorrectly tangrams/tangram-es#1513

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make curved labels in more languages #555

Make curved labels in more languages #555

mapmeld commented May 4, 2017 •

edited

Loading

mapmeld commented May 4, 2017 •

edited

Loading

bcamper commented May 5, 2017

mapmeld commented May 6, 2017

bcamper commented May 8, 2017

mapmeld commented May 8, 2017 •

edited

Loading

bcamper commented May 8, 2017 •

edited

Loading

bcamper commented May 8, 2017

mapmeld commented May 8, 2017

mapmeld commented May 8, 2017

mapmeld commented May 9, 2017 •

edited

Loading

bcamper commented May 10, 2017

bcamper commented May 10, 2017

mapmeld commented May 10, 2017 •

edited

Loading

bcamper May 11, 2017

bcamper commented May 11, 2017

mapmeld commented May 11, 2017 •

edited

Loading

nvkelso commented May 11, 2017 via email

pnorman commented May 12, 2017

mapmeld commented May 12, 2017

bcamper commented May 17, 2017

mapmeld commented May 21, 2017 •

edited

Loading

mapmeld commented May 21, 2017

bcamper commented May 21, 2017

mapmeld commented May 21, 2017

tallytalwar commented May 21, 2017

Make curved labels in more languages #555

Make curved labels in more languages #555

Conversation

mapmeld commented May 4, 2017 • edited Loading

mapmeld commented May 4, 2017 • edited Loading

bcamper commented May 5, 2017

mapmeld commented May 6, 2017

bcamper commented May 8, 2017

mapmeld commented May 8, 2017 • edited Loading

bcamper commented May 8, 2017 • edited Loading

bcamper commented May 8, 2017

mapmeld commented May 8, 2017

mapmeld commented May 8, 2017

mapmeld commented May 9, 2017 • edited Loading

bcamper commented May 10, 2017

bcamper commented May 10, 2017

mapmeld commented May 10, 2017 • edited Loading

bcamper May 11, 2017

Choose a reason for hiding this comment

bcamper commented May 11, 2017

mapmeld commented May 11, 2017 • edited Loading

nvkelso commented May 11, 2017 via email

pnorman commented May 12, 2017

mapmeld commented May 12, 2017

bcamper commented May 17, 2017

mapmeld commented May 21, 2017 • edited Loading

mapmeld commented May 21, 2017

bcamper commented May 21, 2017

mapmeld commented May 21, 2017

tallytalwar commented May 21, 2017

mapmeld commented May 4, 2017 •

edited

Loading

mapmeld commented May 4, 2017 •

edited

Loading

mapmeld commented May 8, 2017 •

edited

Loading

bcamper commented May 8, 2017 •

edited

Loading

mapmeld commented May 9, 2017 •

edited

Loading

mapmeld commented May 10, 2017 •

edited

Loading

mapmeld commented May 11, 2017 •

edited

Loading

mapmeld commented May 21, 2017 •

edited

Loading