Cannot visually differentiate between Latin & Cyrillic characters #343

Closed
amandasaurus opened this issue Dec 3, 2021 · 7 comments

@amandasaurus
Contributor

Currently there are 970 addr:city=Русе, and 125 addr:city=Pyce. The first is in Cyrillic; the second is a mistake using Latin characters that look like the Cyrillic ones.

Currently, a user cannot tell which is which because they all look the same in taginfo. I think this is a missing feature, but I'm unsure what the solution is.

  • Use a different font which will display Latin & Cyrillic letters differently?
  • Display some Unicode details of a tag/value?
  • Use HTML <span title=…/<acronym> tags on all the letters in a key/value (seems a bit heavy)

$ uniwhat "Русе"
character   byte  UTF-32  encoded as     glyph    name
        0      0  000420  D0 A0            Р      CYRILLIC CAPITAL LETTER ER
        1      2  000443  D1 83            у      CYRILLIC SMALL LETTER U
        2      4  000441  D1 81            с      CYRILLIC SMALL LETTER ES
        3      6  000435  D0 B5            е      CYRILLIC SMALL LETTER IE

and

$ uniwhat "Pyce"
character   byte  UTF-32  encoded as     glyph    name
        0      0  000050  50               P      LATIN CAPITAL LETTER P
        1      1  000079  79               y      LATIN SMALL LETTER Y
        2      2  000063  63               c      LATIN SMALL LETTER C
        3      3  000065  65               e      LATIN SMALL LETTER E

@joto
Member

joto commented Dec 3, 2021

When taginfo updates its database it also gets the Unicode database of characters. My plan was to "somehow" do something like what your uniwhat examples show. But I never could figure out what exactly to do and how to show it. One problem is that I can't know what the "correct" characters should be in the general case, if there is a "correct" at all. And it's easy to get into cultural biases there. Just showing everything but Latin characters as somehow special could already be interpreted as biased ("but taginfo shows Cyrillic letters in red so I interpreted that as Cyrillic letters being bad and removed them").

I think the question here is: What would actually be useful to show (as compared to just interesting)? What problem are we trying to solve? Then we can think about how to do that.

@Dimitar5555

As a person who speaks Bulgarian (which uses Cyrillic letters) I can offer some insight.

One problem is, that I can't know what the "correct" characters should be in the general case, if there is a "correct" at all.

That's true, but it would be nice to know when Cyrillic and Latin letters are mixed in the same key value. Usually that's something that should be avoided (there may be edge cases but it's definitely an error when it happens on addr:streetname, addr:city and other addr:*). Another edge case could be Serbian since the Serbian language uses both Cyrillic and Latin letters.

What would actually be useful to show (as compared to just interesting)?

It would be useful to show the glyph and character name like uniwhat shows them.
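As a rough illustration of what such a per-character listing involves, here is a minimal Python sketch that reproduces the columns of the uniwhat output above using only the standard library. The function name `describe` is hypothetical, and this is not how uniwhat or taginfo is implemented.

```python
import unicodedata


def describe(text):
    """Return one row per character: (index, UTF-8 byte offset,
    code point as 6 hex digits, UTF-8 bytes as hex, glyph, Unicode name)."""
    rows = []
    offset = 0
    for index, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        rows.append((
            index,
            offset,
            f"{ord(ch):06X}",
            " ".join(f"{b:02X}" for b in encoded),
            ch,
            # Fall back to a placeholder for code points without a name.
            unicodedata.name(ch, "<unnamed>"),
        ))
        offset += len(encoded)
    return rows


# The two look-alike values from this issue: Cyrillic first, Latin second.
for value in ("Русе", "Pyce"):
    print(value)
    for row in describe(value):
        print("  {:>2} {:>3}  {}  {:<8} {}  {}".format(*row))
```

The character names make the difference visible even when the glyphs do not.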

What problem are we trying to solve?

Mixing Latin and Cyrillic letters in the same key (mostly name:*, name and addr:*).

@Zverik
Contributor

Zverik commented Dec 11, 2021

Using a different font or colours for entire values would be not only useless but also harmful, because reading mixed-language values would become harder and slower (think numbers).

What could be useful is flagging values with words that contain letters from different alphabets. For example, c and с (Latin "c" and Cyrillic "s") look the same and reside on the same keyboard key, so it's a common error to mix up these two. And that definitely impedes searching, for example.
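A minimal Python sketch of that kind of flagging, under one loud assumption: the standard library does not expose the Unicode Script property, so the script is guessed from the first word of each character's Unicode name (for example "LATIN" or "CYRILLIC"). The helper names `script_of` and `mixed_script_words` are hypothetical, and the heuristic will misjudge some characters.

```python
import re
import unicodedata


def script_of(ch):
    """Guess a script label from the first word of the Unicode name.
    This approximates the real Script property; digits and punctuation
    return None so they never count as a script of their own."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def mixed_script_words(value):
    """Return (word, scripts) for each word whose letters span several scripts."""
    flagged = []
    for word in re.findall(r"\w+", value):
        scripts = {script_of(ch) for ch in word} - {None}
        if len(scripts) > 1:
            flagged.append((word, sorted(scripts)))
    return flagged


# "Ру" + Latin "c" + Cyrillic "е": looks like the all-Cyrillic spelling, but is mixed.
print(mixed_script_words("\u0420\u0443c\u0435"))
# A value written entirely in one script is not flagged.
print(mixed_script_words("Pyce"))
```

Checking per word rather than per value leaves legitimately mixed-language values and numbers alone, which matches the concern above.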

@amandasaurus
Contributor Author

amandasaurus commented Dec 11, 2021 via email

joto added a commit that referenced this issue Dec 15, 2021
This tab shows a table with all unicode characters used in the
key/tag/relation.

See #343

@joto
Member

joto commented Dec 15, 2021

This was one of those "how hard can it be?" things I thought I could quickly do... Took me two days of fiddling around with strange tables of Unicode scripts, properties, etc. But now there is a new "Characters" tab on key/tag/relation pages which shows a table of all characters used, along with the script, Unicode general category, and Unicode name of each code point.
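For readers curious what such a table contains, here is a hedged Python sketch that collects the distinct characters from a list of values and reports the general category and name for each. It is not taginfo's implementation: the function name `character_table` is hypothetical, and the "script" column is only approximated from the first word of the character name, since Python's standard library has no Script lookup.

```python
import unicodedata
from collections import Counter


def character_table(values):
    """Build one row per distinct character found in `values`, sorted by code point."""
    counts = Counter(ch for value in values for ch in value)
    table = []
    for ch, count in sorted(counts.items()):
        name = unicodedata.name(ch, "<unnamed>")
        table.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            "script": name.split(" ")[0],          # approximation, see lead-in
            "category": unicodedata.category(ch),  # e.g. "Lu", "Ll"
            "name": name,
            "count": count,
        })
    return table


for row in character_table(["Pyce", "Русе"]):
    print(row["char"], row["codepoint"], row["script"], row["category"], row["name"])
```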

@amandasaurus
Contributor Author

The Programmers’ Credo: we do these things not because they are easy, but because we thought they were going to be easy

https://twitter.com/Pinboard/status/761656824202276864

Thanks, I think this solves the root cause of this issue. 🙂
