RTL text examples lose their RTL marker #244

Closed
chaals opened this issue Oct 31, 2020 · 17 comments · Fixed by #245
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response.

Comments

@chaals

chaals commented Oct 31, 2020

The RTL text in examples, despite markup like

someText {
  value : <bdo dir="rtl">"HTML היא שפת סימון."</bdo>

isn't marked as RTL in the final rendered docs. My first guess is that respec (perhaps through the code prettifier) is removing the markup.

@iherman
Member

iherman commented Oct 31, 2020

@chaals can you explain what the problem is? In my understanding (but, as always, @r12a is the final arbiter) the examples are o.k.: the JSON content in, e.g., Example 2 looks correct to me.

@iherman
Member

iherman commented Nov 1, 2020

Also pinging @xfq and @TzviyaSiegman besides @r12a

@iherman
Member

iherman commented Nov 1, 2020

I believe the point is that pure JSON does not know the notion of bidi setting. If I copy the string (either the pure Hebrew or a mixed Latin & Hebrew) from the correct HTML rendering in the example, the string is stored in memory in "logical" order. Taking the examples in the text, if I copy the Hebrew text only, the string in memory begins with the character "ה"; if I take the full example then with the character "H". If I then copy these strings into JSON, the editor displaying my JSON displays the text using the directionality of the first character in the string, although it correctly displays the Hebrew part from right to left. Here is what I get in my text editor:

{
    "pure hebrew" : "היא שפת סימון",
    "mixed hebrew": "HTML היא שפת סימון"
}

As far as the mixed Hebrew/Latin text is concerned, this display is wrong, because the HTML text should appear on the right, but JSON does not know better. It just displays the characters as they come, using the basic Unicode directionality. This is exactly what the direction setting in our manifest compensates for, and it is what results in the correct HTML display.

My own conclusion is that Example 2 is fine, and the note after the example explains things correctly...
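
The point about logical order can be checked with a short Python sketch. It uses only the standard `unicodedata` module. The helper name `first_strong_direction` is hypothetical, and it only approximates the first-strong heuristic that a plain-text editor might apply; it is not the full Unicode bidirectional algorithm.

```python
import unicodedata

# The Hebrew phrase from the example, written with escapes so that the
# source shows the characters in memory (logical) order.
HEBREW = "\u05d4\u05d9\u05d0 \u05e9\u05e4\u05ea \u05e1\u05d9\u05de\u05d5\u05df"
MIXED = "HTML " + HEBREW


def first_strong_direction(text: str) -> str:
    """Guess a base direction from the first strongly directional character.

    Hypothetical helper: falls back to 'ltr' if no strong character is found.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew, Arabic and similar letters
            return "rtl"
        if bidi_class == "L":          # Latin and other left-to-right letters
            return "ltr"
    return "ltr"


if __name__ == "__main__":
    print(first_strong_direction(HEBREW))  # decided by the first Hebrew letter
    print(first_strong_direction(MIXED))   # decided by the leading 'H'
```

Under this heuristic the mixed string is treated as left-to-right solely because it begins with "H", which is why a separate direction setting is needed to present it correctly.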

@mattgarrish
Member

mattgarrish commented Nov 2, 2020

Not normally dealing with i18n issues, I'm a bit confused about what is going on here.

The direction in the bdo is actually ltr not rtl, so if I manually put it back into the rendered document the Hebrew text is reversed:

[Image: bdo with ltr]

Oddly, this no longer matches the stated incorrect rendering below where the dir attribute is on a p tag:

[Image: p with ltr]

So if the issue is the bdo tag being stripped, which rendering is correct? The only way I've made them match is to put a bdo tag inside the incorrect example with the direction. (Do browsers not handle dir on p?)

If the issue is that the text in the code example is not rtl, that should be by design since the bdo tag wasn't setting that anyway.

But if the issue is that the code example should show the text as rtl, then I can always manually re-add the bdo tag before we publish with a correct direction.

Anyone more knowledgeable about these things want to weigh in on what is intended here?

himorin added the i18n-tracker label Nov 2, 2020
@iherman
Member

iherman commented Nov 2, 2020

To add my non-expert view...

In memory the character string is stored as: 'H', 'T', 'M', 'L', 'ה', 'י', 'א', etc. However, if I take a screen editor for JSON, and I copy the string, it will display it as: "HTML היא שפת סימון", just like in #244 (comment) and like in the TR document itself:

[Image: Screen Shot 2020-11-02 at 12 22 15]

(This is a screen dump from Firefox on the /TR document.) The JSON editor tries to do the correct thing by applying the Unicode bidirectional algorithm to the Hebrew portion but, because JSON has no way to set the base direction, that is actually an incorrect rendering of the string, which must be (as we all know by now…):

HTML היא שפת סימון.

In other words, what the first image in #244 (comment) shows is the way the strings are laid out in memory and in the JSON source, but not the way anybody would see the very same JSON in, say, a screen editor. This also means that there is no real "good" or "bad" here, because showing only the first version (i.e., the dump of the memory layout) in the spec would actually be misleading: nobody would ever see the content this way in a JSON editor.

If I am right (which is a big 'if'), I would suggest leaving the document as is and adding a few words to the explanatory note about just that, i.e., that what is shown in the example is not what is stored in memory, but how the JSON would be shown by any user-facing tool. And we can all rest.


(If you look at the markdown source of this comment you will see that I had to use a bunch of html span elements to get things right…)

@mattgarrish
Member

I guess what confuses me is why the bdo tag is needed at all in the source. As I understand it now, the difference between dir on p or bdo is that with bdo it changes the text character direction. But if that's not what we want to begin with, and is stripped by respec, can't we just remove the bdo tag from the source and move on from this issue? Do we even need a note?

@iherman
Member

iherman commented Nov 2, 2020

@mattgarrish I think "that's not what we want to begin with" is the unclear part. Do we want, in the example:

  • show what the JSON file contains (i.e., 'H', 'T', 'M', 'L', 'ה', 'י', 'א', etc) or
  • show what the user sees in a JSON file when it is displayed on the screen (i.e., "HTML היא שפת סימון")

If my analysis above is correct, the two are not the same. (And neither is the desired display of the text.) (As a reader, I do not know and do not really care which combination of bdo and dir is used to achieve that...)

My proposal is to pick one of the two above and make it clear to the reader that these two exist, and we picked one.

All this is dependent on whether I was right or whether I just made a fool of myself...

@mattgarrish
Member

show what the user sees in a JSON file when it is displayed on the screen

This is the thing. It seems like we're going beyond what is relevant to publication manifests to explain character encoding in files vis-a-vis display.

In any case, the example isn't "broken", so let's not get hung up on this.

@iherman
Member

iherman commented Nov 3, 2020

In any case, the example isn't "broken", so let's not get hung up on this.

I agree.

I would actually propose not to change the spec now and, once it is published, to turn this issue into an official erratum asking for a possible clarification note. Changing the spec at this stage (i.e., between PR and REC) is touchy. The maintenance WG can then decide later what to do about it...

@matial

matial commented Nov 4, 2020

Note: I have some experience with Bidi.
I have another problem with the example.
Background: as far as I understand, "bdo" stands for "bidi override", meaning that within the span of the bdo the natural directionality of the characters is ignored in favor of the specified dir, "rtl" in this case.
This means that if the word "HTML" is stored in memory as: 'H', 'T', 'M', 'L', it will be displayed as "LMTH", which is probably not the author's intent.
Thus this example confuses me and I ask the following questions:

  1. Why use <bdo> and not a simple <span dir="rtl"> or (better) <bdi dir="rtl"> ?
  2. Does the example mean to show what is stored in memory? In that case, the Hebrew part should be displayed as ה‎י‎א‎ ש‎פ‎ת‎ ס‎י‎מ‎ו‎ן, understanding that the rendering for the reader of the processed text will be היא שפת סימון, through the magic of the Unicode bidirectional algorithm.
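
The "LMTH" effect described above can be reproduced with a deliberately simplified Python sketch. The helper `visual_order_under_override` is hypothetical: it assumes that an override forces every character to the given direction, so `rtl` lays the characters out in reverse memory order and `ltr` leaves them in memory order. It ignores character mirroring and combining marks.

```python
def visual_order_under_override(logical: str, direction: str) -> str:
    """Approximate the left-to-right visual order inside <bdo dir=...>.

    Hypothetical helper for illustration only: 'rtl' reverses the string,
    'ltr' returns it unchanged (that is, in memory order).
    """
    if direction == "rtl":
        return logical[::-1]
    if direction == "ltr":
        return logical
    raise ValueError("direction must be 'ltr' or 'rtl'")


if __name__ == "__main__":
    print(visual_order_under_override("HTML", "rtl"))
    print(visual_order_under_override("HTML", "ltr"))
```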

@mattgarrish
Member

"bdo" stands for "bidi override", meaning that within the span of the bdo the natural directionality of the characters is ignored in favor of the specified dir, "rtl" in this case.

Part of the confusion here is that the issue misquotes the source. The actual source is:

<pre>{
    "value"     : "<bdo dir="ltr">HTML היא שפת סימון.</bdo>",
    "language"  : "he",
    "direction" : "rtl"
}</pre>
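
For context, the `language` and `direction` properties are what a consumer would use when presenting `value`. A minimal Python sketch of that step follows. The helper name `to_html` and the choice of a `span` wrapper are illustrative assumptions, not something taken from the spec.

```python
import html

# The localizable string from Example 2, with the Hebrew written as escapes
# so that the source shows the characters in memory (logical) order.
ITEM = {
    "value": "HTML \u05d4\u05d9\u05d0 \u05e9\u05e4\u05ea \u05e1\u05d9\u05de\u05d5\u05df.",
    "language": "he",
    "direction": "rtl",
}


def to_html(item: dict) -> str:
    """Wrap the value in a span carrying its language and base direction.

    Hypothetical helper: only 'ltr' and 'rtl' are passed through as dir.
    """
    attrs = ""
    if item.get("language"):
        attrs += f' lang="{html.escape(item["language"])}"'
    if item.get("direction") in ("ltr", "rtl"):
        attrs += f' dir="{item["direction"]}"'
    return f"<span{attrs}>{html.escape(item['value'])}</span>"


if __name__ == "__main__":
    print(to_html(ITEM))
```

The string itself is never reordered; only the base direction supplied alongside it changes how a browser lays it out.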

@r12a

r12a commented Nov 4, 2020

@iherman @mattgarrish welcome to the topsy-turvy world of bidi code examples, where it's difficult to straighten things out meaningfully.

First, i don't see any bdo element in the source text, so i'm not sure what that part of the discussion is about.

What people expect to see when the value of example 2 is presented to them for reading in a page or application is indeed the 2nd string in the note just below examples 1 & 2, ie. the overall reading direction is RTL, but the characters are read in the direction of the arrows below:

[Image: Screenshot 2020-11-04 at 12 59 56]

Your value in example 2 is currently showing what the text would look like if it were viewed in a LTR context (which it is), and this breaks the sense of the text. You'd only expect to see this if you looked at the source code, and your editor was set up for LTR. If you change nothing, the example would look ok in a Hebrew translation of this page (it would be to the left of the colon). But in your spec, as is, it's not very informative visually. It doesn't show the order of the characters in memory, it just shows some messed up ordering of directional runs. Note that this quickly becomes even more broken-looking if you have more than 2 directional runs in the example.

This exposes a typical quandary for examples of code containing bidi text: should you display it as the end user would expect to see it, or broken as someone with a LTR editor would see it? I usually do neither. I display the characters in the order they are stored in memory, from left to right, and tell the reader that i'm doing that. To do this, you need to, behind the scenes of your spec example, add a bdo element with dir set to ltr. That produces something that no-one will normally ever see, but at least it's less confusing, imo. It would look like this:

[Image: Screenshot 2020-11-04 at 13 14 16]

The choice is yours, but hopefully this clarifies the situation a little.
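
In plain text, where no `bdo` element is available, a comparable effect is usually obtained with the Unicode override controls: U+202D LEFT-TO-RIGHT OVERRIDE opens the override and U+202C POP DIRECTIONAL FORMATTING closes it. A minimal Python sketch, with hypothetical helper names, that wraps a string this way and also lists its code points in memory order:

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def force_memory_order(text: str) -> str:
    """Wrap text so a bidi-aware renderer shows it left to right in memory order.

    Hypothetical helper; the wrapped string contains two extra control
    characters and should be used for display only, not stored.
    """
    return f"{LRO}{text}{PDF}"


def code_points(text: str) -> list:
    """List the code points of text in memory order, e.g. 'U+0048'."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    sample = "HTML \u05d4\u05d9\u05d0"
    print(code_points(sample))
    print(unicodedata.name(force_memory_order(sample)[0]))
```

Listing code points sidesteps rendering altogether, which can be useful when even an override is stripped or ignored by the tooling.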

@xfq
Member

xfq commented Nov 4, 2020

First, i don't see any bdo element in the source text, so i'm not sure what that part of the discussion is about.

@r12a See

pub-manifest/index.html

Lines 434 to 440 in b5c50f7

<aside class="example" title="Set the language and the base direction of a string">
<pre>{
"value" : "<bdo dir="ltr">HTML היא שפת סימון.</bdo>",
"language" : "he",
"direction" : "rtl"
}</pre>
</aside>

and https://w3c.github.io/pub-manifest/#example-2-set-the-language-and-the-base-direction-of-a-string

@mattgarrish
Member

mattgarrish commented Nov 4, 2020

To do this, you need to, behind the scenes of your spec example, add a bdo element with dir set to ltr.

Right, this is what we have (and maybe was contributed by you?). The problem is respec strips it while pretty printing the pre text.

We should probably open an issue against respec as this seems likely to reoccur. We only have a REC left to publish, but I wouldn't want to be manually fixing bdo tags after every export.
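
How a pretty-printer could end up dropping the element can be illustrated with a small Python sketch. This is not ReSpec's code and makes no claim about its internals. It only assumes a highlighter that reads the text content of the `pre` element, in which case inline markup such as `bdo` disappears. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# A shortened version of the example source, Hebrew written as escapes.
SOURCE = (
    '<pre>"value" : "<bdo dir="ltr">HTML '
    "\u05d4\u05d9\u05d0 \u05e9\u05e4\u05ea \u05e1\u05d9\u05de\u05d5\u05df."
    '</bdo>"</pre>'
)


class TextContent(HTMLParser):
    """Collect character data only, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(markup: str) -> str:
    """Return the concatenated text of markup, similar to DOM textContent."""
    parser = TextContent()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(text_content(SOURCE))
```

Any tool that re-renders from such extracted text would need to reapply the override itself for the memory-order display to survive.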

But, @iherman, should we try and fix the source and add a note now, or should we leave this for an erratum to explain?

@chaals
Author

chaals commented Nov 4, 2020

As a reader my preference would be that in the Rec you manually fix up the generated source, showing each of the variants noted for the code with a note about what it is: common rendering, memory order, "expected" correct rendering - and maybe adding a link to a more thorough explanation of the whole issue of when bidi override is necessary...

@iherman
Member

iherman commented Nov 4, 2020

@mattgarrish

I share your unease about changing the final document (ie, after respec processing) but if we also submit an issue on respec, that can be updated the next time we want to publish.

As Richard said in #244 (comment) if we display, in the example, a text that reflects the memory content (his second image), then we also create a source of confusion because that cannot be reproduced in one's own JSON text editor. Which is another source of confusion: we actually show a JSON portion that the user never sees and cannot really produce!

We cannot really get it right :-(. To be honest, at this moment, I am undecided.

If we decide to do the change, I think that it would be necessary to add a minor note in the example about the ins and outs of all this, i.e., to explain what you really see in the example.

But, @iherman, should we try and fix the source and add a note now, or should we leave this for an erratum to explain?

Whether we can do it: I have asked the "Director" whether this type of change can be done at this stage in the first place.

@iherman
Member

iherman commented Nov 4, 2020

But, @iherman, should we try and fix the source and add a note now, or should we leave this for an erratum to explain?

Coming from the "Director": yes, it is o.k. for us to do this change, if we feel comfortable, and we can leave it as an Erratum.

@mattgarrish, do you think you can come up with a PR that includes the changes together with an explanation note? (The problem is of course that the file has to be edited post-respec :-(

7 participants