Cannot visually differentiate between Latin & Cyrillic characters #343

amandasaurus · 2021-12-03T12:45:53Z

Currently there are 970 addr:city=Русе and 125 addr:city=Pyce. The first is in Cyrillic; the second is a mistake that uses Latin characters which look like the Cyrillic ones.

Currently, a user cannot tell which is which because they look identical in taginfo. I think this is a missing feature, but I'm unsure what the solution is.

- Use a different font which will display Latin & Cyrillic letters differently?
- Display some Unicode details of a tag/value?
- Use HTML <span title=…>/<acronym> tags on all the letters in a key/value? (seems a bit heavy)

$ uniwhat "Русе"
character   byte  UTF-32  encoded as     glyph    name
        0      0  000420  D0 A0            Р      CYRILLIC CAPITAL LETTER ER
        1      2  000443  D1 83            у      CYRILLIC SMALL LETTER U
        2      4  000441  D1 81            с      CYRILLIC SMALL LETTER ES
        3      6  000435  D0 B5            е      CYRILLIC SMALL LETTER IE

and

$ uniwhat "Pyce"
character   byte  UTF-32  encoded as     glyph    name
        0      0  000050  50               P      LATIN CAPITAL LETTER P
        1      1  000079  79               y      LATIN SMALL LETTER Y
        2      2  000063  63               c      LATIN SMALL LETTER C
        3      3  000065  65               e      LATIN SMALL LETTER E
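
For readers without `uniwhat` installed, a similar per-character breakdown can be sketched with Python's standard `unicodedata` module. This is only an illustration: the function name `describe` and the column layout are assumptions made here, not part of `uniwhat` or taginfo.

```python
import unicodedata


def describe(text):
    """Return one row per character: index, UTF-8 byte offset, code point,
    UTF-8 bytes in hex, the character itself, and its Unicode name.
    (Hypothetical helper, loosely modelled on the uniwhat columns above.)"""
    rows = []
    byte_offset = 0
    for index, char in enumerate(text):
        encoded = char.encode("utf-8")
        rows.append((
            index,
            byte_offset,
            f"{ord(char):06X}",
            " ".join(f"{b:02X}" for b in encoded),
            char,
            unicodedata.name(char, "<unnamed>"),
        ))
        byte_offset += len(encoded)
    return rows


if __name__ == "__main__":
    # First value is Cyrillic, second is the Latin look-alike.
    for value in ("Русе", "Pyce"):
        print(value)
        for row in describe(value):
            print("  {:>2} {:>3}  {}  {:<11} {}  {}".format(*row))
```

The name column is what separates the two values: every row for the first starts with CYRILLIC, every row for the second with LATIN.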

joto · 2021-12-03T13:07:17Z

When taginfo updates its database, it also gets the Unicode character database. My plan was to "somehow" do something like what your uniwhat example shows, but I never could figure out exactly what to do and how to show it. One problem is that I can't know what the "correct" characters should be in the general case, if there is a "correct" at all. And it's easy to get into cultural biases there. Just showing everything but Latin characters as somehow special could already be interpreted as biased ("but taginfo shows Cyrillic letters in red, so I interpreted that as Cyrillic letters being bad and removed them").

I think the question here is: What would actually be useful to show (as compared to just interesting)? What problem are we trying to solve? Then we can think about how to do that.

Dimitar5555 · 2021-12-11T11:06:22Z

As a person who speaks Bulgarian (which uses Cyrillic letters) I can offer some insight.

> One problem is, that I can't know what the "correct" characters should be in the general case, if there is a "correct" at all.

That's true, but it would be nice to know when Cyrillic and Latin letters are mixed in the same key value. Usually that's something that should be avoided (there may be edge cases but it's definitely an error when it happens on addr:streetname, addr:city and other addr:*). Another edge case could be Serbian since the Serbian language uses both Cyrillic and Latin letters.

> What would actually be useful to show (as compared to just interesting)?

It would be useful to show the glyph and character name like uniwhat shows them.

> What problem are we trying to solve?

Mixing Latin and Cyrillic letters in the same key (mostly name:*, name and addr:*).

Zverik · 2021-12-11T11:17:16Z

Using a different font or colour for entire values would be not just useless but harmful, since reading mixed-language values would become harder and slower (think numbers).

What could be useful is flagging values with words that contain letters from different alphabets. For example, c and с (Latin "c" and Cyrillic "s") look the same and sit on the same keyboard key, so mixing up the two is a common error. And that definitely impedes searching, for example.
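
A rough sketch of that per-word check, in Python. Python's standard library does not expose the Unicode Script property, so this sketch guesses the script from the first word of each character's Unicode name; that shortcut is an assumption made for illustration and will misjudge some characters. The helper names `script_of` and `mixed_script_words` are hypothetical.

```python
import re
import unicodedata


def script_of(char):
    """Rough script guess: the first word of the Unicode character name.
    This is an approximation (assumption), not the real Script property.
    Non-letters such as digits return None so they never trigger a flag."""
    if not char.isalpha():
        return None
    name = unicodedata.name(char, "")
    return name.split(" ", 1)[0] if name else None


def mixed_script_words(value):
    """Return (word, scripts) pairs for words whose letters come from
    more than one script, judged word by word rather than per value."""
    flagged = []
    for word in re.findall(r"\w+", value):
        scripts = {script_of(c) for c in word} - {None}
        if len(scripts) > 1:
            flagged.append((word, sorted(scripts)))
    return flagged


if __name__ == "__main__":
    # Two Cyrillic letters followed by Latin "ce": looks like the city name.
    print(mixed_script_words("\u0420\u0443ce"))
    # Each word uses a single script, so nothing is flagged.
    print(mixed_script_words("Русе Center 12"))
```

Because the check works per word, a value that legitimately combines a Cyrillic word, a Latin word and a number is left alone; only a word that mixes scripts internally is reported.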

amandasaurus · 2021-12-11T11:49:25Z

On Fri, 03 Dec 2021 14:07 +01:00, Jochen Topf wrote:

> What problem are we trying to solve? Then we can think about *how* to do that.

Someone in Bulgaria noticed the two different `addr:city` values and wanted to fix the OSM data. They could have pressed the “Overpass Turbo”/“JOSM” link to open and fix it, but they didn't know which value was right and which was wrong. I copied and pasted the values from the taginfo website and ran them through `uniwhat` to figure out which was right. That's a problem. I'm not sure how to solve it...

amandasaurus · 2021-12-11T11:55:47Z

On Fri, 03 Dec 2021 14:07 +01:00, Jochen Topf wrote:

> My plan was to "somehow" do something like what your `uniwhat` examples shows. But I never could figure out what to do exactly and how to show it.

What about another tab that shows a detailed breakdown of the Unicode characters in the key and value, with output similar to `uniwhat`? Then you are not deciding “Latin alphabet is right”; you are merely providing a deep dive into the “binary” representation of the tag. In the case of homoglyphs, a mapper can deduce which is Latin and which is Cyrillic.

joto · 2021-12-15T10:22:09Z

This was one of those "how hard can it be?" things I thought I could do quickly... It took me two days of fiddling around with strange tables of Unicode scripts, properties, etc. But now there is a new "Characters" tab on key/tag/relation pages which shows a table of all characters used, along with the script, Unicode general category and Unicode name of each code point.
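
To get a feel for what such a table contains, here is a minimal Python sketch that lists each distinct character in a set of strings with its code point, general category and name. It is not taginfo's implementation; the function name `character_table` is hypothetical, and the script column is left out because Python's standard `unicodedata` module does not provide the Script property.

```python
import unicodedata
from collections import Counter


def character_table(values):
    """Count each distinct character across the given strings and attach
    its code point, Unicode general category, and Unicode name.
    Rows are sorted by code point. (Hypothetical helper for illustration.)"""
    counts = Counter(char for value in values for char in value)
    return [
        (
            f"U+{ord(char):04X}",
            char,
            count,
            unicodedata.category(char),
            unicodedata.name(char, "<unnamed>"),
        )
        for char, count in sorted(counts.items())
    ]


if __name__ == "__main__":
    # The Cyrillic value and its Latin look-alike from this issue.
    for row in character_table(["Русе", "Pyce"]):
        print(*row, sep="\t")
```

For these two values the general category is the same for each look-alike pair (uppercase or lowercase letter), so the name and code point are the columns that separate them.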

amandasaurus · 2021-12-15T11:26:45Z

The Programmers’ Credo: we do these things not because they are easy, but because we thought they were going to be easy

https://twitter.com/Pinboard/status/761656824202276864

Thanks, I think this solves the root cause of this issue. 🙂

joto added a commit that referenced this issue Dec 15, 2021

Add "characters" tab to key/tag/relation pages

f0cfd5f

This tab shows a table with all unicode characters used in the key/tag/relation. See #343

amandasaurus closed this as completed Dec 15, 2021
