-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make curved labels in more languages #555
Conversation
New commit to add more languages and better explain where the language RegEx comes from. Test coordinates:
India and Sri Lanka road data is overwhelmingly English in the name tag. By adding name:ta and other language options, I found some of the others:
These have a few roads, but are currently sticking to regular, not-curved labels [Update: I hard-coded
I am having trouble finding roads on OSM named in the other languages, even when using Overpass: Gujarati, Gurmukhi/Punjabi. There are some rivers or footpaths, so if these were labeled we could probably test. As mentioned earlier, Devanagari and Arabic scripts are connected so they're beyond the scope of this fix |
This is an awesome contribution!! I've been playing around with it and noticed a couple things:
|
@bcamper - agreed, this looks better and doesn't cause any language issues. The pattern of a grapheme cluster can be: [character], [character][additional accents/signs], or the more complex one being [character][additional accents/signs][character-combining mark / virama][2nd character][2nd character's accents/signs]. If there were complex Emoji on OSM then we'd need more. I was thinking more on this and would ask your input on two additional changes:
|
Thanks - the single regex looks good. I'm still debugging a case where it goes into an infinite loop, even though I'd expect it to always take at least one character from the By setting I was spot-checking Khmer and came across a case that looked wrong (see last two characters): I know nothing about Khmer but I'm definitely interested to understand what's going on here. I see that you had it defined as: There's a conspicuous single missing character in the sequence there... I tried adding that character to the range ( Do you know why it wasn't present? Intentional/mistake? I'm going to keep reviewing these but wanted to run that one by you in the meantime. |
\u17D2 is not a visible accent by itself but a "combo character" that I put in that other regex. The regex is expecting a combo character and then another letter which goes subscript, so it shouldn't have split on it unless something weird is going on (like combo character + more vowel signs) Do you have coordinates where this label is? I would like to look at the text char by char. |
Aha I should have checked that other regex. It should be visible right in the middle of your test coords: 16.30000/11.54480/104.93701 The string is "វិថី កោះពេជ្រ". The OSM feature is http://www.openstreetmap.org/way/375404697 |
I think I see the problem and (fortunately? I think?) a regex issue This sequence [letter][combo character][letter] is somehow failing because the regex includes an optional accents/vowel signs after each letter, which for some reason is not being treated as optional. |
I'm thinking this is the right regex (using a . for both 1st letter and optional 2nd combo letter, also allowing multiple accents on the 2nd letter) new RegExp(".(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + ".(" + accents_and_vowels + "+)?)?") |
ok here's my latest idea const graphemeRegex = new RegExp("^.(?:" + accents_and_vowels + "+)?" + "(" + combo_characters + "\\W(?:" + accents_and_vowels + "+)?)?");
...
for (graphemeCount; graphemeCount < codon_length && testText.length > 0; graphemeCount++) {
let graphemeCluster = (graphemeRegex.exec(testText) || testText)[0];
segment += graphemeCluster;
testText = testText.substring(graphemeCluster.length);
} |
Thanks @mapmeld - this is looking good, plus I've found a few unrelated issues in the process of testing it :) Is this ready to go as far as you're concerned? Question as I test and learn more about this. Here's an example Sinhala feature: Full text ( The current grapheme cluster regex will split this up into these segments (I changed the @meetar and I were looking at some examples of groups like "ල්ග" and "ස්දෙ", where it seems visually like the characters could theoretically be broken down further. Is this just an effect of the current regex being conservative enough to catch cases like the Khmer issue we saw earlier? e.g. maybe it would be possible to create even more detailed / script-specific regexes to get more granularity, but this is a good middle-ground? Anyway I'm happy to stop where we are for now since this is a great improvement, just curious. It's looking nice here: I believe I've spot-checked all of the scripts you included except for Gujarati and Gurmukhi (will try those too if I can). BTW, if you want to just hard-code some label text for Tangram to render, to verify the output (if we can't find suitable OSM features), you can use a JS function in the |
I was testing some Gurmukhi/Punjabi (http://www.openstreetmap.org/way/248011769, at 12.41000/31.13213/75.06559), and I noticed this script does have the bar running across the top -- do you know if it's OK for the continuity of that to be broken across segments like this? In any case, Google also appears to do that with Punjabi labels... |
You're right - that's wrong how the Sinhala text was blocking there. I looked at the Unicode chart and fixed two parts of my Sinhala text range, which appears to fix the problem. I've created a set of tests on my side to make sure that a change to one piece doesn't undo our progress somewhere else https://github.com/mapmeld/grapheme-regex/blob/master/test/index.js I don't know enough about Gurmukhi/Punjabi to say whether that example is good or bad. I think I might remove it out of abundance of caution (did a merge to update with master branch) |
src/styles/text/canvas_text.js
Outdated
@@ -580,7 +580,7 @@ const accents_and_vowels = "[:\u0300-\u036F" + // Combining Diacritical Marks | |||
"\u0B01-\u0B03\u0B3C\u0B3E-\u0B4C\u0B56\u0B57\u0B62\u0B63" + // Oriya | |||
"\u0B82\u0BBE-\u0BCD\u0BD7" + // Tamil | |||
"\u0C00-\u0C03\u0C3E-\u0C4C\u0C55\u0C56\u0C62\u0C63" + // Telugu | |||
"\u0D82-\u0D83\u0DCF-\u0DDF\u0DF2\u0DF3" + // Sinhala | |||
"\u0D82\u0D83\u0DCA-\u0DDF\u0DF2\u0DF3" + // Sinhala |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should \u0DCA
be removed from combo_characters
?
@tallytalwar says that the Gurmukhi/Punjabi example is OK and that it's fine for the "bar" at top to have some breaks when segments bend on curved labels. I've been doing some reading up on grapheme clusters and Unicode, but I'm definitely still unclear on some things :) (which is OK to merge this, I would just like to have a general idea of what's going on). Is there a simple explanation for the criteria of when a character goes in Also, I understand why Arabic is beyond the scope here, but less so Devanagari (given other similar scripts in your set) -- does it have shaping rules that make it unsuitable for this kind of regex approach? @tallytalwar is a Devanagari reader and wasn't aware of why it wouldn't apply, but also isn't familiar with the encoding/grapheme rules. Similarly, the only other remaining scripts in our "curved label blacklist" are Bengali, Mongolian, Tibetan. Have you left them out because you're not familiar with them, and/or do you know if this could be extended to cover them? Otherwise, I think we are ready to merge on our side, if you are? Thanks again! |
ok - thanks @tallytalwar for checking the Gurmukhi/Punjabi example. By my original thought process I could have left it out. If breaking the bar is OK on these labels, yes we can include Devanagari and Bengali. Accents and vowel signs modify one base character (y -> ý, က -> ကြ). The character does not belong on its own ( ́ or ြ). In this family of scripts that we're talking about, each base character is a syllable (Devanagari न = "na", नन = "nana"). People still need double-consonant sounds ("Chennai" as example). To make it work you "break the leg" of the 1st one to form न्न. In Unicode it looks like न + ् (virama) + न. The combinations don't need to be the same letter (न + ् + त = न्त). In Burmese it works a little differently; letters are stacked on top of each other (က + ္ + က = က္က). In Tamil there is just a dot over the first letter, and no change to the second letter. In any case where we combine letters, it's important not to break up letter1, letter2, and their accent marks. This is my first time looking closely at Tibetan... it looks like we could apply the same rule similar to Burmese. Mongolian is typically written vertically and its Wikipedia article contains no examples of it being written horizontally. Also, Sinhala point is right. I'll remove that from |
Will be nice to see Tibetan in there :)
…On Thu, May 11, 2017 at 2:10 PM, Nick Doiron ***@***.***> wrote:
ok - thanks @tallytalwar <https://github.com/tallytalwar> for checking
the Gurmukhi/Punjabi example. By my original thought process I could have
left it out. If breaking the bar is OK on these labels, yes we can include
Devanagari and Bengali.
*Accents and vowel signs* modify one base character (y -> ý, က -> ကြ).
The character does not belong on its own ( ́ or ြ). In this family of
scripts that we're talking about, each base character is a syllable
(Devanagari न = "na", नन = "nana"). People still need double-consonant
sounds ("Chennai" as example). To make it work you "break the leg" of the
1st one to form न्न. In Unicode it looks like न + ् (virama) + न. The
combinations don't need to be the same letter (न + ् + त = न्त). In Burmese
it works a little differently; letters are stacked on top of each other. In
Tamil there is just a dot and no merging.
This is my first time looking closely at Tibetan... it looks like we could
apply the same rule similar to Burmese.
Mongolian is typically written vertically and its Wikipedia article
<https://en.wikipedia.org/wiki/Mongolian_script> contains no examples of
it being written horizontally.
Also, Sinhala point is right. I'll remove that from combo_characters and
add the other scripts later today, I think.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#555 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA0EOxKC_YC7_zTUJ3k6DtFEuVnr-OB2ks5r43ldgaJpZM4NREBN>
.
|
We looked at Mongolian in OpenStreetMap Carto, and although the issue doesn't contain much useful my recollection is that @russdeffner, who has spent time there, said horizontal was okay. I would say that if you can correctly render horizontal on straight lines you're ahead of many map renderers, and along the line for a curved line is fine. |
OK, latest commit covers everything mentioned here, except Mongolian, which I would like to leave for now. In addition to the vertical/horizontal issue, Mongolian letters have text-shaping more similar to Arabic, actually, so the word-chunking process would have similar issues. For future reference on Tibetan: http://rishida.net/scripts/tibetan/ |
@mapmeld just wanted to mention that @tallytalwar and I saw what we think is a Devanagari issue when testing - will get you some more details tomorrow. |
Might be good to add Malayalam for SW India: http://thottingal.in/blog/2017/05/21/a-formal-grammar-for-malayalam-conjunct/ |
Added Malayalam and included a fix which may resolve your Devanagari issue (if it is related to the concept in the Malayalam link?) |
I believe that does fix the Devanagari issue! Hadn't gotten around to writing it up yet, and it looks like you probably already figured this out, but here is what I was seeing. First of all there was a problem with the regex. For example, for the word "स्ट्रिंग" ("string"), our previous regex was returning these 3 characters as the first grapheme cluster: I tried checking this against the results for Chrome's I was wondering if our regex was too greedy (to account for some of the other script features you adjusted it to support?), and it looked like we were actually taking a character from what V8 considered the next cluster: Tangram: But even using It looks like Is what we are seeing here an issue with what the link you provided calls conjuncts, and what this link calls consonant clusters? Does this mean that in cases like this, the grapheme cluster itself is not sufficient as an independent visual unit, but requires additional script knowledge? In any case, your changes do fix the example above, which now renders correctly: Your updated regex takes these sets of characters as the first two clusters:
@tallytalwar doe this make sense to you? @mapmeld do you know if there are any downsides to the more expansive regex for other scripts? |
In the past I had been led to believe that joining more than two consonants was extremely rare, and bringing it up I had been asked to provide examples. This article got me thinking that maybe you could be seeing the same thing on the map. In any case the RegEx now supports unlimited consonants joined there. |
@bcamper yes that does make sense and indic language overview here does help explain the scenario. The question we raised the other day was what happens for a character set with 2 or more consecutive placed (half sounding) consonants, as this introduces the |
I saw Mapzen's blog post about curved labels awhile ago and was interested in how I could expand it to more languages
Burmese (pictured above) has some text-shaping - for example the word "ဟို" is three chars long. But it's similar to how "ý" can be two chars. Burmese doesn't connect all letters like Arabic or Devanagari. By counting grapheme clusters instead of chars, we can prevent both writing systems from being split up irregularly. That's what this pull request fixes.
The RegEx is long because it includes other writing systems (at least Latin alphabet, Devanagari and Tamil), but I have not enabled or tested them in curved labels yet.
In the future it would be possible to use Intl.Segmenter or a polyfill.
Arabic is another level of complexity (choosing the right initial/medial/final/isolated form) but it might be possible based on previous work in OSM iD and circular-arabic.
Edit: fixed screenshot, always preview