Add text language / direction attributes #6

Closed
mgiuca opened this issue Jul 20, 2016 · 30 comments
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on.

Comments

@mgiuca
Collaborator

mgiuca commented Jul 20, 2016

@marcoscaceres indicated that we would need text language and direction attributes for shared text.

This seems like overkill to me, as we're essentially just sharing plain text. I'm not sure why language is required, and directionality should be inferrable from the text (in the standard way, by looking at the first strong character, and if the client really wants to override it, they can add a U+200F RLM). Do you have an example of an existing W3C standard that has out-of-band language and direction attributes for plain text?
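As a rough illustration of the "first strong character" heuristic mentioned above, here is a minimal sketch. It is not the full Unicode Bidirectional Algorithm: the character ranges are coarse, BMP-only approximations chosen for illustration, and the names `firstStrongDirection`, `STRONG_RTL` and `STRONG_LTR` are hypothetical.

```typescript
type BaseDirection = "ltr" | "rtl";

// Assumption: coarse approximations of "strong" characters, BMP only.
// Hebrew/Arabic-family blocks and presentation forms, plus U+200F RLM.
const STRONG_RTL = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFC\u200F]/;
// Latin, Greek and Cyrillic letters, plus U+200E LRM.
const STRONG_LTR = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F\u0370-\u03FF\u0400-\u052F\u200E]/;

// Returns the direction of the first strong character, or null if the
// string contains none (digits, spaces and punctuation are skipped).
function firstStrongDirection(text: string): BaseDirection | null {
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (STRONG_RTL.test(ch)) return "rtl";
    if (STRONG_LTR.test(ch)) return "ltr";
  }
  return null;
}

// A leading RLM makes an otherwise neutral string resolve to RTL.
const neutral = firstStrongDirection("12345");           // null
const overridden = firstStrongDirection("\u200F12345");  // "rtl"
```

A real implementation would rely on the platform's bidi support rather than hand-written ranges; the sketch only shows why a leading mark can act as an override.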

I found the Notifications spec which has an explicit direction attribute. It seems there is a single direction that controls the entire notification UI ("auto", "ltr" or "rtl"), affecting all text fields (title, text, etc) as well as the layout of the buttons. It seems like that's useful for giving the site control over the precise layout of the notification (since the notification forms part of the site's UI). That doesn't make a lot of sense for Web Share because the sending site is not presenting any form of UI; rather it is sending text and other data to a different app. It's not clear what an explicit "ltr" or "rtl" should mean in the context of the receiving app. (For example, what is Twitter supposed to do if it receives an all-English text string with dir="rtl"?)

@marcoscaceres
Member

marcoscaceres commented Jul 20, 2016

It's not clear what an explicit "ltr" or "rtl" should mean in the context of the receiving app. (For example, what is Twitter supposed to do if it receives an all-English text string with dir="rtl"?)

It would set those when it tries to display the data within an HTML tag.

<p dir="rtl" lang=en>Oh, this is actually LTR</p>

But that's developer error, not an error in the API.

@marcoscaceres
Member

CC'ing @r12a from the i18n WG to put this on their radar.

@mgiuca
Collaborator Author

mgiuca commented Jul 20, 2016

It would set those when it tries to display the data within an HTML tag.

But it's going to put the data into its main text field, as a user-editable string (that just happens to be received from another app). I don't think it's reasonable to allow the sending app to change the paragraph direction on Twitter's Compose field. (However, if the sending app wants to embed some directional formatting characters, those would go into the text field, but the user could edit them out if they wanted to.)

@marcoscaceres
Member

But it's going to put the data into its main text field, as a user-editable string (that just happens to be received from another app).

It's not 100% certain that it's going to end up in a user-editable string. Where it ends up is up to the registered provider, which may or may not be a web app. The same goes for how it is used.

I don't think it's reasonable to allow the sending app to change the paragraph direction on Twitter's Compose field.

Sure, but Twitter could choose to ignore the information. It's there for guidance to the receiving app.

(However, if the sending app wants to embed some directional formatting characters, those would go into the text field, but the user could edit them out if they wanted to.)

Agree.

@delapuente

delapuente commented Jul 26, 2016

IMHO, @marcoscaceres is right. If you are sharing text, you should be giving the developer the chance of sharing metadata about that text

and directionality should be inferrable from the text (in the standard way, by looking at the first strong character, and if the client really wants to override it, they can add a U+200F RLM).

Then consuming applications must know about your specific encoding format, and plain text is no longer plain.

Another point is accessibility. If sharing content in another language, it would be very convenient to know which language you are using so a screen reader could change the accent accurately.

The API does not need to change much. It suffices for title and text to accept an alternative object value with { content: string, rtl: bool, lang: string }.
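To make the suggested shape concrete, here is a minimal sketch of what such a value could look like as a type. Nothing here is part of the Web Share spec: `AnnotatedText`, `ShareText`, `ProposedShareData` and `plainContent` are hypothetical names used only for illustration.

```typescript
// Hypothetical: the { content, rtl, lang } object suggested above.
interface AnnotatedText {
  content: string;
  rtl: boolean;
  lang: string; // assumed to be a BCP 47 language tag such as "he" or "en"
}

// Each text field accepts either a bare string or the annotated form.
type ShareText = string | AnnotatedText;

// Hypothetical share payload using the union type for title and text.
interface ProposedShareData {
  title?: ShareText;
  text?: ShareText;
  url?: string;
}

// A receiver that does not understand the metadata can still recover
// the plain string and ignore the rest.
function plainContent(value: ShareText | undefined): string | undefined {
  if (value === undefined) return undefined;
  return typeof value === "string" ? value : value.content;
}

const example: ProposedShareData = {
  title: "Plain title",
  text: { content: "\u05E9\u05DC\u05D5\u05DD", rtl: true, lang: "he" },
};
```

The union keeps existing callers working, at the cost of every receiver having to handle two shapes per field.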

@mgiuca
Collaborator Author

mgiuca commented Jul 27, 2016

and directionality should be inferrable from the text (in the standard way, by looking at the first strong character, and if the client really wants to override it, they can add a U+200F RLM).

Then consuming applications must know about your specific encoding format, and plain text is no longer plain.

What do you mean by "plain text"? I consider "plain text" to be a raw string of Unicode characters, without any markup like HTML or styling details. The directionality overrides I'm talking about (U+200F RLM, for example) are part of the Unicode standard and can be considered plain text. They can be used in any text field, and any Unicode compliant text layout engine will render them correctly. This isn't "my specific encoding format", it's just plain Unicode text.

If by "encoding format" you mean character encoding (like Latin-1, UTF-8, UTF-16, etc.), well, that is already a concern for any text transmission, because we will certainly be dealing with Unicode strings. That's all taken care of by the underlying layer (either UTF-16 for JavaScript strings, or UTF-8 for URL parameters), so we don't have to worry about that detail here.
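A small sketch of that point: a directional mark is an ordinary character in a JavaScript string and survives being percent-encoded as UTF-8 for a URL parameter. The function name `roundTripThroughUrl` is hypothetical; `encodeURIComponent` and `decodeURIComponent` are the standard globals.

```typescript
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK, one UTF-16 code unit

// Encodes a string as a URL parameter value (UTF-8, percent-encoded)
// and decodes it again, returning both forms.
function roundTripThroughUrl(text: string): { encoded: string; decoded: string } {
  const encoded = encodeURIComponent(text);
  const decoded = decodeURIComponent(encoded);
  return { encoded, decoded };
}

const marked = RLM + "abc";
const result = roundTripThroughUrl(marked);
// result.encoded is "%E2%80%8Fabc": the mark becomes three UTF-8 bytes.
// result.decoded is identical to the original string.
```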

Also, in the vast majority of cases, you don't need these, because the Unicode bidirectional algorithm infers directionality from the text. To be clear, I don't expect these overrides to be used much here. But they are available for senders that need them.

Another point is accessibility. If sharing content in another language, it would be very convenient to know which language you are using so a screen reader could change the accent accurately.

That's valid but I still don't see much utility in specifying the language. It seems like something that would be so rarely used (on both the sending and receiving side), since most senders will not set the language and most receivers will ignore it. In the rare case where a sender does set it and the receiver does use it, you just have potential for bugs due to incorrectly set language (imagine an English site sets lang: "en" on all content, but they share user-supplied content that may not be in English; it mostly works because most sites ignore lang, but breaks occasional receivers that don't ignore it).

I'd prefer to have just lang and no direction, as I see some utility in having a lang attribute.

The API does not need to change much. It suffices for title and text to accept an alternative object value with { content: string, rtl: bool, lang: string }.

That's actually quite a lot of complexity given how simple my proposal is. I'd prefer to follow the Notifications model which has a single language and direction attribute for the entire notification, not a separate one for each piece of text.

@delapuente

I consider "plain text" to be a raw string of Unicode characters, without any markup like HTML or styling details. The directionality overrides I'm talking about (U+200F RLM, for example) are part of the Unicode standard and can be considered plain text. They can be used in any text field, and any Unicode compliant text layout engine will render them correctly. This isn't "my specific encoding format", it's just plain Unicode text.

It works for me, but this must be included in the spec to make it clear this is the interpretation you want for the information being shared.

@delapuente

delapuente commented Jul 27, 2016

I'd prefer to follow the Notifications model which has a single language and direction attribute for the entire notification.

Better than nothing.

@mgiuca
Collaborator Author

mgiuca commented Jul 27, 2016

It works for me, but this must be included in the spec to make it clear this is the interpretation you want for the information being shared.

It shouldn't need to be explicit in the spec because it is implied for all Unicode strings (it is explicit in the Unicode spec). This is a bit like saying we need to add a line saying "If a U+263A is found in the string, it should be interpreted as a smiling face." ... not necessary because the meanings of both U+263A and U+200F are standard. But it would be fine to add a non-normative remark to this effect.

@delapuente

I was not referring to that specific character, but to specifying that those strings are Unicode strings, although this already follows from these fields being of type DOMString. Sorry for missing that part.

@mgiuca
Collaborator Author

mgiuca commented Jul 28, 2016

Right, well they would definitely be Unicode strings (if we didn't specify that somewhere there would be many more problems than just not being able to specify RTL overrides).

@r12a

r12a commented Aug 2, 2016

So you're aware, i haven't been ignoring this thread, but we're currently working through this same issue with some other WGs, and i'd like to get through that before recommending something.

For now, some quick notes:

  1. i think you should have separate issues for language and for direction - there are differences between them
  2. language information is useful for the target environment to take decisions about how to render the text shared (which may include voice browsers). Eg, if you send Han ideographs out, it's important to choose Chinese fonts for Chinese text, and Japanese fonts for Japanese text, even though they may be the same code points, otherwise you'll have unhappy users.
  3. Expecting users to type control codes at the start and end of every paragraph (or add markup) when they write in Arabic, Hebrew, Urdu, Thaana, etc. is problematic (a) because it's too much work (b) because they may not have access to the control characters on the keyboard (c) because it's really difficult to edit those characters in bidi text, even if you can detect where they are (d) because there are various errors that can easily creep in. Best is to be able to set a default direction (which may for example sometimes be obtained from a manual setting in an HTML form or by detecting the direction in scope where the user types) and rely on control codes/markup to just handle the cases that need special treatment.

Anyway, more on that when i've finished developing the guidelines for the other specs.

@r12a

r12a commented Aug 2, 2016

Oh, and btw, people considering this topic may find this useful background info: Unicode Bidirectional Algorithm basics [that's a temporary URL for a recent update to https://www.w3.org/International/articles/inline-bidi-markup/uba-basics]

Note in particular the cases where the Unicode Bidi Algorithm is insufficient to produce the necessary rendering.

@mgiuca
Collaborator Author

mgiuca commented Aug 3, 2016

Thanks for your comments.

i think you should have separate issues for language and for direction - there are differences between them

Sure, but I think it's helpful to have the conversation about both together since they are related and the same people will be involved. If you think it's helpful to split it to a separate issue, I will.

language information is useful for the target environment to take decisions about how to render the text shared (which may include voice browsers).

Yeah, that's a good point. It just seems like the wrong place to have it, since one string can have multiple languages, but there isn't really a better way to communicate language information.

Expecting users to type control codes at the start and end of every paragraph (or add markup) when they write in Arabic, Hebrew, Urdu, Thaana, etc. is problematic

Who said anything about users? We're talking about an API here. If the requesting site really wants to set a direction, it can insert those control characters. Also, the requesting site is not normally going to be an editable text field; it's more likely to be some fixed content (like the title and URL of the article you are sharing).

My main concern with this whole API is that it be so simple it can be almost a lowest common denominator between all the existing native sharing systems (see Native Integration Survey). I want as little complexity as possible in the data format so that the maximum amount of data and metadata will survive being passed through the browser, to a third-party sharing system, to the receiving app, and potentially, a user-editable textfield in the receiving app.

I don't see any way to set language or direction on shared text in the Android, iOS or Windows sharing APIs linked in the above doc (though I could have missed something). That means even if the requesting site sets this metadata, it'll be lost upon entering the native share system.

Even if the data can survive that process, the receiving app needs to fit it into its data model. As discussed above, it doesn't make a lot of sense in the common case of a receiving app inserting the text string into a text field for sharing to social media, email, or text messaging (it would be incorrect to set the language / direction of the entire textfield, because then it will affect all the text that the user subsequently types into the text field). I'd wager that most of these receivers don't have language/direction in their data models so it would be lost anyway (for example, the Twitter API doesn't have language or direction metadata in a status update; SMS messages certainly don't; Android clipboard doesn't).

In summary, I'm just not seeing any of the services I'm trying to interop with supporting these attributes, so I expect they'll just be dropped on the floor. But, every single one of those above scenarios (even SMS, I think, but not 100% sure) will transparently carry through those directionality marks, because they are just Unicode characters. So that is the most reliable way for a sending app to set the direction of the text being shared. I'm happy to acquiesce on the language attribute, since there is no better way to represent language metadata. But I think it would be a mistake to have a direction attribute, since there is a much better way to encode direction in the text itself.

Note in particular the cases where the Unicode Bidi Algorithm is insufficient to produce the necessary rendering.

Right, but those are the cases where directionality marks should be used to fix the rendering. That article does a good job explaining the issues, but fundamentally the problem is with mixed languages, and that can't be solved in general with a global language/direction setting.

Take the example shown in that article:
"פיצה סגלה - 5 reviews"
[Edit: I realised I made a mistake here; that example actually works fine by setting the global direction to LTR.]

A better example is a multi-paragraph text, with the bulk of the text in English and one paragraph in Hebrew that happens to have some English text at its start. That needs detailed markup, not just a global direction setting.

@r12a

r12a commented Sep 9, 2016

Ok, so i've been trying to summarise the alternatives, with pros and cons, at http://w3c.github.io/i18n-discuss/notes/json-bidi.html Please take a look.

All approaches have problems. Note in particular that:

  1. the page i'm pointing to is concerned only with establishing the paragraph level base direction, not the inline direction
  2. whichever approach is used, it's typically necessary to explore around the string in the originating location to determine the original paragraph base direction (eg. is the direction inherited from the html tag, or from user keystrokes in a form)
  3. with all approaches it may be necessary to construct a context around the string that sets the right paragraph base direction when the string is inserted into a target web page or plain text location.
  4. things are more complicated when dealing with markup, and only a bit easier if you can assume that the markup is HTML.

@mgiuca
Collaborator Author

mgiuca commented Sep 12, 2016

Hi r12a,

Thanks for putting that doc together. I wanted to comment on it inline but it's just a static page, so I pasted it into this Google Doc and added comments:
Notes on JSON strings and text direction

Overall, the main point of that post appears to be that you need to have an out-of-line directionality specifier with your strings, because there is no good or standard way to tell from the strings what the paragraph direction should be. But I did not find the arguments there compelling. There is a standard way to tell what the paragraph direction should be (the Unicode Bidirectional Algorithm). Most of the examples in that article were either a) ones that could already be properly detected by the UBA, or b) ones that require more fine-grained inline markup (a paragraph direction of either LTR or RTL is insufficient).

I think where we run into an actual problem is that HTML itself has a default direction of LTR, and so it is likely that when users on an English website enter mostly-Hebrew text into a text field, it is rendered with a base direction of LTR. Perhaps you want a way that the site can send that data as JSON with an annotation: "this text was rendered as LTR when the user typed it, so be sure to present it as LTR on the other end." I have two problems with this: a) as I said in previous comments on this issue, many receiver services will just ignore that field anyway, and b) it's probably not what the user wants anyway. If the sending site messes up mostly-Hebrew text display in the text field (because it's an English site), you are essentially advocating that the receiving site also mess it up. I would prefer that the receiving site render text according to the UBA, and not according to the default language of the sending site. I think the correct behaviour is for both the sending and receiving site to set dir="auto" on any text fields and paragraphs that contain user-supplied text.

The other problem identified is that some services don't use the standard paragraph detection algorithm, and/or ignore or otherwise butcher the existing inline markup (e.g., Twitter). But I don't think adding more standards is a way to get these services to behave. For example, Twitter does not have a paragraph direction field in its API, and if we added a "dir" attribute to Web Share, Twitter would just drop it on the floor anyway.

@r12a

r12a commented Sep 12, 2016

Matt, thanks for your comments. I will rework the document, because i think there are a number of places where i'm not getting the message across clearly. I will also take your above comments into consideration – some of your points may require me to make more significant changes to some bits of the doc, so thanks for those.

Note, though, that the doc is really only about handling those situations where the UBA alone will not produce the right result, and these are not just rare corner cases. The question is how to capture any information that a user of RTL scripts had at their disposal to overcome the limitations of the UBA when they originated the text and associate that information in some way with the string. These problems don't just occur in LTR pages containing RTL text, they also cause headaches for users working on pages that have the direction set to RTL (my ears are still ringing with the complaints from Persians, Moroccans, Israelis, Egyptians, Omanis, etc who i have spoken with recently).

@r12a

r12a commented Oct 20, 2017

Dear repo administrator, i'd like to set up notifications for the i18n WG so that we can track this and any other issues. Could you add two labels for me: i18n-comment and i18n-tracking, and assign the latter to this issue? I can take care of the rest of the mechanism that will notify the i18n folks of changes. Thanks.

@marcoscaceres marcoscaceres added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Oct 25, 2017
@marcoscaceres
Member

Could you add two labels for me: i18n-comment and i18n-tracking,

Done.

and assign the latter to this issue?

Done.

Thanks @r12a! Looking forward to seeing you at TPAC!

@js-choi

js-choi commented Nov 11, 2017

Relevant to this issue is the work of the W3C Web Annotation Working Group, which attempted to standardize annotations of web resources and their parts, and which published three approved recommendations earlier this year. Its members encountered a similar if not identical problem in the course of creating its Web Annotation Data Model.

The problem was about annotation contents’ writing direction and language: would embedded bidirectional markers in text be sufficient, or should annotations also support specifying base writing direction and language in out-of-band metadata? The Web Annotation Working Group and the Internationalization Working Group discussed this together over the course of months. @azaroth42, @iherman, and @BigBlueHat were particularly involved in these debates, in addition to @r12a above.

The two Working Groups eventually resolved that it is not possible to determine intended writing order from language or text content alone. They decided to make the Annotation standards support putting text direction and language in out-of-band metadata, using its data model’s textDirection and processingLanguage JSON properties, for both embedded text content and external text content.
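For readers unfamiliar with that model, here is a minimal sketch of an annotation body carrying out-of-band direction and language. The property names follow the Web Annotation Data Model's textual body as I understand it; the `TextualBody` interface is a simplified, hypothetical typing and the values are made up for illustration.

```typescript
// Simplified, hypothetical typing of a Web Annotation textual body.
interface TextualBody {
  type: "TextualBody";
  value: string;
  language?: string;            // language of the content
  processingLanguage?: string;  // language to use for processing, e.g. hyphenation
  textDirection?: "ltr" | "rtl" | "auto";
}

// Illustrative values only: Hebrew text with an explicit base direction.
const body: TextualBody = {
  type: "TextualBody",
  value: "\u05E9\u05DC\u05D5\u05DD",
  language: "he",
  processingLanguage: "he",
  textDirection: "rtl",
};
```

The string in `value` is left untouched; direction and language travel alongside it rather than inside it.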

You all may find the records of these prior debates useful, since there is a clear parallel between Web Annotation’s situation and Web Share’s situation. Hope they help.

@aphillips

Following up on an action item, the I18N WG is about to publish (i.e. this coming week) a FPWD of our document String-Meta, which is about this issue.

@aphillips

Following up on a different action item, and noting that JSON-LD has done some work to provide for direction metadata, I've been tasked with mentioning String-Meta again. The document has not progressed significantly since my last comment, but we intend to progress it towards Note status this year.

In reviewing the spec, which was updated recently, I don't see any progress in providing language or direction metadata. What's the status of addressing this issue? Do you need suggestions from I18N?

@mgiuca
Collaborator Author

mgiuca commented May 8, 2020

I would be happy to receive specific advice from I18N experts on this.

My position, which is as the original designer of web share but not an I18N expert, remains consistent with my comment above from September 2016 (wow this has been a while coming hasn't it?): putting direction metadata in a share is unlikely to help, because unlike most strings which get displayed by the user agent (which can therefore be compelled by the spec to display the string correctly), web-shared strings are transmitted unmodified to an arbitrary recipient, usually in a format we don't control (such as an Android system intent). The metadata would likely be discarded, or if not, it would be up to each individual recipient application to respect it in some way.

I don't think it's helpful, as it would likely be ignored, and if applications want to ensure that the shared string has a particular top-level direction, they should enclose the string in RLE / PDF marks or similar.
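A minimal sketch of that wrapping, assuming the sender already knows the intended base direction. The helper name `wrapWithEmbedding` is hypothetical; the code points are the standard Unicode embedding controls.

```typescript
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Encloses the text in an embedding of the given direction. The text
// itself is not inspected or otherwise altered.
function wrapWithEmbedding(text: string, dir: "ltr" | "rtl"): string {
  const open = dir === "rtl" ? RLE : LRE;
  return open + text + PDF;
}

// The result is still a plain string and can be passed wherever the
// unwrapped text would go, e.g. as the text member of the share data.
const wrapped = wrapWithEmbedding("abc", "rtl");
```

Unicode also defines isolate controls (RLI U+2067, LRI U+2066, PDI U+2069), which could be swapped in for the "or similar" case.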

@marcoscaceres
Member

What @mgiuca describes also matches our implementation experience in Firefox... we just pass the strings to the OS (Windows and Android in our case also).

CC @saschanaz for the Windows implementation, as they may be interested to follow along.

@aphillips

@mgiuca @marcoscaceres This has been a long time in coming, but it has also been a long-standing gap in I18N on the Web. While it's almost certainly true that direction (and language) metadata will be ignored by existing implementations, that's because it doesn't exist: you can't consume what isn't there. It's true that consumers would need to decide when (hopefully!) or whether (less hopefully 😛) to consume the metadata. I realize that different operating environments may provide different levels of API access to the user agent for realizing the benefits of language/direction metadata. But we've been pushing this because of the display and experience gaps for users (both applications and end-users/customers).

if applications want to ensure that the shared string has a particular top-level direction, they should enclose the string in RLE / PDF marks or similar

This would put applications in the business of evaluating strings and decorating them with additional control characters. It means modifying the content, which most APIs go out of their way to avoid. This is the sort of thing that frameworks and specifications should build in, rather than every application assembling its own.

See our document String-Meta for a general description of the problems and potential solutions. (sections 4.3/4.4 discuss control characters)

@mgiuca
Collaborator Author

mgiuca commented May 14, 2020

While it's almost certainly true that direction (and language) metadata will be ignored by existing implementations, that's because it doesn't exist: you can't consume what isn't there.

The problem with this spec in particular is that the receiver is not necessarily a web application. We have no control over the system that this text is put into. It's not correct to say "apps aren't honoring this field because it doesn't exist; once we add it, the apps can handle it correctly", because the target app may never be able to see this field.

For example, on Android, browsers will typically implement Web Share by converting the share data into an Intent object. That object has no concept of a "text direction" at a meta level. So the browser would simply have to either a) throw it away, or b) apply it by wrapping the text in directional marks, which is specifically something you say we should be avoiding. There's no way for the receiving app, an Android app (or a Web Share Target that's receiving a share round-tripped through the Android intent system) to get that direction field and honor it.
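The two options can be sketched as follows. This is not how any browser is implemented: `ShareRequest` (with its `dir` and `lang` fields), `PlatformShare` and `toPlatformShare` are hypothetical stand-ins, with `PlatformShare` modelling a native share payload that has only a subject and a text slot.

```typescript
// Hypothetical share request; dir and lang are NOT part of Web Share.
interface ShareRequest {
  title?: string;
  text?: string;
  url?: string;
  dir?: "ltr" | "rtl";
  lang?: string;
}

// Hypothetical stand-in for a native share payload with no metadata slots.
interface PlatformShare {
  subject?: string;
  text?: string;
}

const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// "drop": option (a), the metadata is discarded.
// "wrap": option (b), direction is baked into the text; lang is still lost.
function toPlatformShare(req: ShareRequest, strategy: "drop" | "wrap"): PlatformShare {
  const parts = [req.text, req.url].filter((s): s is string => s !== undefined);
  let text = parts.join(" ");
  if (strategy === "wrap" && req.dir !== undefined && text !== "") {
    text = (req.dir === "rtl" ? RLE : LRE) + text + PDF;
  }
  const out: PlatformShare = {};
  if (req.title !== undefined) out.subject = req.title;
  if (text !== "") out.text = text;
  return out;
}

const dropped = toPlatformShare({ text: "abc", dir: "rtl", lang: "he" }, "drop");
const baked = toPlatformShare({ text: "abc", dir: "rtl", lang: "he" }, "wrap");
```

In both cases the receiving app sees only `subject` and `text`, which is the crux of the objection above.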

Given that we have no way of reliably preserving this metadata on the way to the share target, I would argue that the safest thing to do is not accept it in the first place.

@saschanaz
Member

I think this needs OS implementer interest, as Windows doesn't have this feature either.

@marcoscaceres
Member

marcoscaceres commented Oct 25, 2021

This is unfortunately the same issue that was encountered by Payment Request.

Summarizing: the OS-level share targets don't provide a means to do anything meaningful with dir/lang, so the dir/lang get dropped on the floor.

@marcoscaceres
Member

Closing, as there is nothing actionable here.

@marcoscaceres
Member

@aphillips (or @xfq) given the lack of OS support, we did not add this feature.

We are open to revising this in the future, however, if operating systems actually start adding support (same issue we had with Payment Request API).

If you are ok with this, could you please remove the i18n-tracker label?

@w3cbot w3cbot added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. and removed i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Sep 3, 2022
8 participants