Confusable definition #127

Closed
r12a opened this issue Oct 18, 2017 · 14 comments

@r12a
Contributor

r12a commented Oct 18, 2017

2.3 Identical-Appearing Characters and the Limitations of Normalization
https://w3c.github.io/charmod-norm/#normalizationLimitations

When two logical characters look similar or can look similar under certain presentations (and which include homographs), they are said to be "confusable".

Shouldn't that say 'two logically different characters'?

Or perhaps better:
When graphemes look similar but actually represent things that are logically different, they are said to be confusable.

@asmusf

asmusf commented Oct 19, 2017

Sounds good.

I would introduce the term "exact homoglyph/homograph" for code points that look truly identical, and point to the UCD file "Intentional.txt" that lists all of them.

I would add to the examples: U+01DD and U+0259, and/or the three cases of capital D with stroke. These look very nicely identical and are more easily understood by readers used to the Latin script (which, by definition, includes anyone reading this text in the original English).
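For readers who want to look at the suggested pairs, here is a minimal Python sketch that prints the formal names of the code points involved. The comment names only U+01DD and U+0259 explicitly; the three "capital D with stroke" code points used here (U+00D0, U+0110, U+0189) are an assumption about which characters were meant.

```python
import unicodedata

# U+01DD and U+0259 come from the comment above; the three D-with-stroke
# look-alikes are an assumed reading of "the three cases".
candidates = ["\u01DD", "\u0259", "\u00D0", "\u0110", "\u0189"]

for ch in candidates:
    # Show the code point, the glyph, and the formal Unicode name.
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")

# None of these characters has a decomposition mapping, so even the most
# aggressive normalization form (NFKC) leaves each of them unchanged.
unchanged = all(unicodedata.normalize("NFKC", ch) == ch for ch in candidates)
```

Because normalization leaves every one of them unchanged, it cannot be used to fold these look-alikes together.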

@aphillips
Contributor

@r12a: adopted your wording with revisions to make it consistent with the preceding introduction of the term homograph.

@asmusf: what UCD file is that? 10.0.0 doesn't have it and I don't see it in UTR36. Where am I forgetting to look?

I could add more examples, but hunting about for the characters and such seems overkill. The 01DD and 0259 example is not nearly as familiar as the "P" example given and isn't particularly different from the P example (which security folks remember as part of the "paypal bug" in IDNA or which causes consternation because one is an ASCII letter).

@aphillips aphillips self-assigned this Oct 26, 2017
@asmusf

asmusf commented Oct 26, 2017 via email

@r12a
Contributor Author

r12a commented Nov 1, 2017

But two logically distinct characters or grapheme clusters can still look the same or very similar.

suggest 'still' -> 'also'

When a pair of graphemes look identical (or very similar), they are called homographs. When a pair of graphemes look similar or are homographs but actually represent logically different characters or character sequences, they are said to be "confusable".

This seems to be suggesting that homographs and confusables are different things, and that the logical difference only applies for confusables, which i find confusing.

I think this needs more work.

btw, for the P example, you may want to say that they actually represent different letters of the alphabet, i.e. the pronunciation is different from the Latin. It's not just that there are copies of the same letter for each alphabet.

@aphillips
Contributor

Homographs and confusables are (slightly) separate concepts. There are confusables that are not exact homographs (1 vs. lowercase-L). There are homographs that are not confusable (À vs À, where one is U+00C0 and one is U+0041 U+0300) because they are logically the same thing.
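The distinction can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `canonically_equivalent` is made up for this illustration; it is not from the specification under discussion.

```python
import unicodedata

precomposed = "\u00C0"       # À as a single code point
decomposed = "\u0041\u0300"  # A followed by COMBINING GRAVE ACCENT

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two spellings of À differ as code point sequences...
differ_raw = precomposed != decomposed
# ...but normalization treats them as the same logical thing.
same_a_grave = canonically_equivalent(precomposed, decomposed)

# Digit one and lowercase L may look alike in some fonts, but
# normalization keeps them distinct.
same_one_ell = canonically_equivalent("1", "l")
```

Here `same_a_grave` is true while `same_one_ell` is false: normalization resolves the homograph pair that is logically the same, and does nothing for the confusable pair.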

The "P" example was difficult to convey. I was looking for a way to say that more elegantly than I ended up with. Perhaps go from:

These letters from the Greek, Cyrillic, and Latin scripts look identical in most fonts (that is, they are homographs), but they are encoded separately, as they are logically distinct parts of their respective Greek, Cyrillic, or Latin alphabet.

To:

These letters from the Greek, Cyrillic, and Latin scripts look identical in most fonts (that is, they are homographs), but they are encoded separately, as they are logically distinct letters (indeed, not even pronounced the same way) in their respective Greek, Cyrillic, or Latin alphabet.

I wanted to mention that they were separate alphabets to draw attention to the fact that each alphabet is complete unto itself. Some letters, after all, are more closely related between the separate scripts.
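As a rough illustration of "encoded separately", the sketch below looks up the three letters in Python. The code points (U+0050, U+03A1, U+0420) are assumed to be the ones in the "P" example, and `script_hint` is a hypothetical helper: the standard library does not expose the Unicode Script property, so taking the first word of the character name is only a crude stand-in.

```python
import unicodedata

# Assumed code points for the three look-alike capital letters.
LETTERS = {"Latin": "\u0050", "Greek": "\u03A1", "Cyrillic": "\u0420"}

def script_hint(ch: str) -> str:
    """Crude heuristic: the first word of the formal Unicode name."""
    return unicodedata.name(ch).split()[0]

hints = {label: script_hint(ch) for label, ch in LETTERS.items()}

# Each letter lowercases within its own alphabet, which is one sign that
# these are distinct letters rather than three copies of the same one.
lowered = {label: ch.lower() for label, ch in LETTERS.items()}

for label, ch in LETTERS.items():
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)} -> {lowered[label]}")
```

The Greek letter, for instance, is named RHO and lowercases to ρ (U+03C1), which does not look like a Latin p at all.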

@r12a
Contributor Author

r12a commented Nov 1, 2017

Homographs and confusables are (slightly) separate concepts. There are confusables that are not exact homographs (1 vs. lowercase-L). There are homographs that are not confusable (À vs À, where one is U+00C0 and one is U+0041 U+0300) because they are logically the same thing.

But that's not what the text says. How about this:

One or more graphemes that look identical (or very similar) are called homographs. The character sequences underlying homographs may be alternative ways of expressing the same logical grapheme, or may represent different graphemes that just happen to look alike. In the latter case, the character sequences are said to be "confusable".
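One way to read the proposed wording is as a two-step test, sketched below in Python. Everything here is illustrative: `LOOKALIKE_GROUPS` is a tiny hand-picked table standing in for real visual-similarity data, and `classify` is a hypothetical helper, not anything defined by the specification.

```python
import unicodedata

# Hypothetical look-alike table for illustration only; real data would
# have to come from a curated source.
LOOKALIKE_GROUPS = [
    {"\u0050", "\u03A1", "\u0420"},  # Latin P, Greek Rho, Cyrillic Er
    {"\u00C0", "\u0041\u0300"},      # precomposed and decomposed À
]

def classify(a: str, b: str) -> str:
    """Classify a pair of character sequences per the proposed wording."""
    # Alternative encodings of the same logical grapheme normalize alike.
    if unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b):
        return "same logical grapheme"
    # Otherwise, look-alikes are different graphemes: confusable.
    if any(a in group and b in group for group in LOOKALIKE_GROUPS):
        return "confusable"
    return "not homographs"

print(classify("\u00C0", "\u0041\u0300"))
print(classify("\u0050", "\u0420"))
print(classify("P", "Q"))
```

Under these assumptions the two spellings of À come out as the same logical grapheme, while Latin P and Cyrillic Er come out as confusable.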

@r12a
Contributor Author

r12a commented Nov 1, 2017

Btw, we need to think about making images for the examples because (a) the different Ps may actually look less identical if your system substitutes different fonts, and (b) because even my system doesn't display the ARABIC LETTER BEH WITH HAMZA ABOVE, so good luck to anyone else in understanding the point there ;-)

@aphillips
Contributor

Really? I'll admit I was a little surprised that the examples "just worked", but they did, for me:

[image]

Note that we do build a subset font. That might help. But I agree that we should set up some images.

@aphillips
Contributor

Regarding your previous comment, I think that you make a good point and I'll make the change.

@r12a
Contributor Author

r12a commented Nov 1, 2017

fwiw, here's what i see

[screenshot, 2017-11-01 at 18:07:13]

can you tell from the Inspector what font gives you the precomposed character?

@aphillips
Contributor

Would you believe Times New Roman?

@r12a
Contributor Author

r12a commented Feb 1, 2018

It appears to be working in my browser now, even though my browser's default font is set to something else. Did you add the letter to the webfont?

@aphillips
Contributor

I have not yet regenerated the webfont. Could be either (a) a fix to iOS or (b) I got the fallback order in the font names correct ;-). I will regenerate the webfont as part of the clearing up for publication process.

@r12a r12a removed the close? label Apr 12, 2018
@r12a
Contributor Author

r12a commented Jan 17, 2019

The latter comments were veering off-track, so i raised a new issue at #188

I'm happy to close the current issue (discussion about confusables).

@r12a r12a added the close? label Jan 17, 2019