[Discussion] bidi control characters when formatting dates #28

Closed
caridy opened this issue Sep 12, 2015 · 23 comments
@caridy
Contributor

caridy commented Sep 12, 2015

Notes:

  • Edge currently includes "a bunch" of bidi control characters when formatting dates.
  • CLDR has the bidi data for structured text (STT is still in the proposal status, need more exploration here).
  • CLDR has direction marks in the date patterns for locales that need it.
  • Spec says nothing about bidi (probably assumes the previous bullet).
  • Other browsers are not using bidi structured text for dates explicitly.

Problems:

  • Formatting dates as regular text does not preserve the structure or direction and as a result the text on the screen becomes incomprehensible.
  • Users in different geographies are accustomed to different rules for structured text display. (e.g.: Arabic vs. Hebrew).
  • Isomorphic/Universal apps that format dates on the server using node/V8 will produce different output than Chakra/Edge on the client. (e.g.: the React checksum will fail, which implies a full re-rendering of the initial payload).
  • Chakra/Edge doing something different causes interop problems when users try to parse the localized output.

Proposals:

  1. Spec the details about STT and direction marks and the use of bidi for dates, and align all implementations with Chakra/Edge.
  2. Spec the optional use of STT and direction marks for dates, get Chakra/Edge to align, get others to add the new option.
  3. Make localized output opaque by ignoring STT rules on dates, and get Chakra/Edge to drop the feature (add a strong statement about how the localized output should be considered opaque).
  4. Spec the use of direction marks for dates, get Chakra/Edge to align (others are already using CLDR which includes the direction marks when needed).

Links:

@caridy
Contributor Author

caridy commented Sep 12, 2015

/cc @bterlson

@srl295
Member

srl295 commented Sep 12, 2015

I don't think CLDR currently uses LRMs/RLMs in dates. Edit: CLDR definitely uses RLM in dates.

STT is still in the proposal status.
Forwarding to CLDR TC..

@srl295
Member

srl295 commented Sep 12, 2015

CLDR TR35 5.3.2 says:

The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.

For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.

Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.

@srl295
Member

srl295 commented Sep 12, 2015

other browsers..

new Date(0).toLocaleString("ar",{month:"numeric",day:"numeric",year:"numeric"}).indexOf('\u200f')
returns !== -1 (RLM present) for:

  • Chrome 45.0.2454.85
  • Firefox developer 42.0a2
  • node 3.3 (with full ICU data)
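A slightly more general form of the check above, as a sketch only: the helper name `findDirectionMarks` is hypothetical, and it looks only for U+200E (LRM) and U+200F (RLM). What `toLocaleString` actually returns depends on the engine and its locale data, so the last call simply logs whatever the current runtime produces.

```javascript
// Hypothetical helper: report where LRM (U+200E) and RLM (U+200F) occur in a string.
function findDirectionMarks(text) {
  const marks = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code === 0x200e) marks.push({ index: i, mark: "LRM" });
    if (code === 0x200f) marks.push({ index: i, mark: "RLM" });
  }
  return marks;
}

// Live check: the result depends on the engine and its locale data.
const formatted = new Date(0).toLocaleString("ar", {
  month: "numeric",
  day: "numeric",
  year: "numeric",
});
console.log(findDirectionMarks(formatted));
```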

@caridy
Contributor Author

caridy commented Sep 12, 2015

@srl295 yes, that's part of the date patterns from CLDR. I can confirm that we are getting the same results when using the Intl.js polyfill; here is the data, with a bunch of \u200f in the patterns: https://raw.githubusercontent.com/andyearnshaw/Intl.js/master/locale-data/json/ar-001.json

The question is:

  • is this enough?

I wonder what Chakra/Edge is doing differently, since it doesn't use CLDR. @bterlson can you clarify?

@srl295
Member

srl295 commented Sep 12, 2015 via email

@bterlson
Member

We don't use CLDR for Intl in Edge, fwiw.

Here is one difference, though I'm not sure how to characterize it as I'm no expert :)

new Date(0).toLocaleString("ar",{month:"numeric",day:"numeric",year:"numeric"})
Edge: '\u200F\u0662\u0662\u200F\x2F\u200F\u0661\u0660\u200F\x2F\u200F\u0661\u0663\u0668\u0669'
Chrome: '\u0663\u0661\u200F\x2F\u0661\u0662\u200F\x2F\u0661\u0669\u0666\u0669'

So you're right that Chrome does add bidi control characters, which I didn't notice before, but they are in different locations. Also Chrome does not include them when formatting en dates as Edge does. Example:

new Date(0).toLocaleString("en",{month:"numeric",day:"numeric",year:"numeric"})
Edge: '\u200E12\u200E/\u200E31\u200E/\u200E1969'
Chrome: '12/31/1969'
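For anyone who wants to reproduce this kind of comparison, a minimal sketch of how invisible characters can be made visible follows. The helper name `escapeNonAscii` is hypothetical; it escapes every UTF-16 code unit outside printable ASCII as `\uXXXX`, which is one way (not necessarily the way used above) to obtain dumps like these.

```javascript
// Hypothetical helper: escape every code unit outside printable ASCII as \uXXXX
// so that invisible characters such as LRM/RLM show up in logs and bug reports.
function escapeNonAscii(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    out +=
      code < 0x20 || code > 0x7e
        ? "\\u" + code.toString(16).toUpperCase().padStart(4, "0")
        : text[i];
  }
  return out;
}

// Output depends on the engine and its locale data.
console.log(
  escapeNonAscii(
    new Date(0).toLocaleString("en", { month: "numeric", day: "numeric", year: "numeric" })
  )
);
```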

@srl295
Member

srl295 commented Sep 14, 2015

I don't think the codes are "added"; they are just part of the CLDR data.

Do you happen to be in touch with the Microsoft cldr people?

Thanks for putting the code points here, I will take a look a little bit later.

@bterlson
Member

Yeah I guess "added" wasn't the verb I wanted. "Included" more like. I believe you that they're part of the CLDR data.

I am in touch with the CLDR folks here. I can ask any questions we might have if they don't chime in themselves. Let me know!

@caridy
Contributor Author

caridy commented Sep 14, 2015

Ok @bterlson let's try to gather all the info for next week, so we can discuss it in person, and try to get to a resolution. I will update the description of this issue now that we have more information.

@srl295
Member

srl295 commented Sep 22, 2015

Edge: '‏٢٢‏/‏١٠‏/‏١٣٨٩'
Chrome: '٣١‏/١٢‏/١٩٦٩'

So they are different dates in your example. As to formatting codes, it may be excessive but not harmful.

I'm not sure why this is an ecma402 discussion actually. I'd rather leave LRM/RLM out of the ecma402 discussion. If it's just a matter of content consistency, as I mentioned that's the whole point of CLDR, it seems akin to discussing whether "modifier letter turned comma" or "apostrophe" or curly quote should be used in certain languages.

@zbraniecki
Member

I agree that it feels slightly out of scope for ecma402.

In our code, we wrap all variables in strings in FSI/PDI, but that's more of a mixed-content problem.
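As a rough illustration of the FSI/PDI wrapping described above (a sketch, not the actual code referred to): the names `isolate` and `greet` are hypothetical, and the code points are U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE.

```javascript
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical helper: isolate an interpolated value so that its direction
// neither affects nor is affected by the surrounding text.
function isolate(value) {
  return FSI + String(value) + PDI;
}

// Hypothetical usage: a message with one interpolated variable.
function greet(name) {
  return "Hello, " + isolate(name) + "!";
}

console.log(JSON.stringify(greet("\u05D3\u05E0\u05D9")));
```

This addresses mixed-direction content in general; it is separate from the direction marks that come with the date patterns themselves.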

@bterlson
Member

As to formatting codes, it may be excessive but not harmful.

In theory, but in practice I have gotten numerous bug reports on Edge's behavior, as people expect to be able to parse some localized date in Chrome and have that same code work in Edge. This isn't too much of a stretch for people to make, because Intl lets them specify exactly what components they want in the date. Why wouldn't it be safe to parse?

I'm not saying this has to be fixed/unified. If it isn't then there should be a statement in the spec I can point to that explicitly says that not treating formatted dates as opaque is a very bad idea and not guaranteed to work.
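To make the interop problem concrete, here is a sketch of the kind of workaround people reach for: stripping bidi control characters before comparing or parsing. The helper name `stripBidiControls` is hypothetical, and the character list (ALM, LRM, RLM, the embedding/override controls and the isolates) is an assumption about what one might want to remove. It does not make localized output safe to parse: digits, separators and even the calendar can still differ between engines, as the ar example above shows.

```javascript
// Bidi controls: U+061C ALM, U+200E LRM, U+200F RLM,
// U+202A..U+202E (embeddings/overrides), U+2066..U+2069 (isolates).
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: remove bidi control characters from a formatted string.
function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, "");
}

// The two "en" outputs quoted earlier in this thread.
const edgeOutput = "\u200E12\u200E/\u200E31\u200E/\u200E1969";
const chromeOutput = "12/31/1969";

console.log(edgeOutput === chromeOutput); // the raw strings differ
console.log(stripBidiControls(edgeOutput) === chromeOutput); // equal once the marks are removed
```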

@shervinafshar

Also Chrome does not include them when formatting en dates as Edge does.

Because these bidi marks are not needed for a date string requested for en. The needed RLMs are there when a date string for any RTL locale is requested. Since, according to UAX#9, a date string requested for en is not even considered bidirectional text, what is the rationale behind what Edge does here? Are these marks in the locale-specific data Edge consumes, or are they just added on the fly?

@srl295
Member

srl295 commented Sep 23, 2015 via email

@caridy
Contributor Author

caridy commented Sep 23, 2015

@bterlson has a theory that we should validate. Here is what we discussed: what happens when the system preferences are in ar, the page is in ar, and you render a date in en? Should the annotations be in place? What do FF and Chrome do today?

@shervinafshar

Both Chrome and FF are implementing UBA. You can check the behavior in the bidi demo tool. Note that the date string remains as one single run of L2.

One might argue that European numbers are directionally weak and might end up being resolved according to directional context (W2 to AN) and therefore some bidi marks are required for their correct display. Trying that, it's observed that the date string still remains as a single run of L2 and displays just fine without any need for additional bidi marks.

I would be more than happy to discuss any edge cases folks have encountered, but even if the aforementioned theory is validated for some edge case, adding invisible control marks to strings that were not requested for a bidi language is not a solution, as it introduces control characters where they are not supposed to appear. Libraries with more peculiar requirements for tailoring the directional behaviour of strings in diverse directional contexts can implement a means of passing the context if need be and adapt the generated strings accordingly, but I strongly agree with others who voiced their concern about whether this topic actually falls within the scope of the spec.

@tomerm

tomerm commented Oct 22, 2015

Joining the party late. Let us be clear on the reason why UCC (Unicode control characters such as LRE, RLE, PDF, LRM etc.) are injected into date / time patterns (e.g. 05 August 1934) in CLDR. They are injected to ensure a certain display (e.g. we don't want to see 05 1934 August). This means the assumption is that the rendering engine using those date / time patterns is UBA (Unicode Bidi Algorithm) compliant and fully supports UCC. This is unfortunately not true in all cases.
Thus I believe UCC should be injected not by the data provider (CLDR is a data provider having very little to do with the display of the data it provides), but rather by the code responsible for rendering the data (in many cases various Java or JS based toolkits).

The proposal to CLDR mentioned at the top of this thread was not meant to resolve the display problem. It was not about injecting UCC during formatting. It was about defining the rules for displaying text with inherent structure (a date / time stamp is just one of many cases). For example, we want breadcrumbs to flow from left to right (1 >> 2 >> 3) for an English / French / Russian ... UI, while for an Arabic / Hebrew / Urdu ... UI we want them to flow from right to left (3 <<< 2 <<< 1). Before we approach the solution to the problem, we would like to have a clear understanding of the expected display. The expected display for the same pattern may be different for different cultures (think of mathematical formulas).

It is only because the standard UBA is not capable of automatically identifying structure and enforcing it for display purposes that such a proposal was created. Maybe in the distant future (or maybe not so distant) Siri, Cortana, Watson and similar technology will be able to cope with it. But at the moment it needs to be done manually. What I hope to achieve is some level of automation by:

  1. Defining types of structured text patterns (which can be specified during authoring time - via markup for example). This can include such types as file path, bread crumbs, date / time stamp etc.
  2. Defining a standard way of handling the display problem for structured text pattern and provide tools / API for concrete implementation.
    More elaborate description / specification of this proposal is available from:
    https://docs.google.com/document/d/1y9LhT7rbGGVHjh2uqTAYHzN5PfbAkPxO5sMJygOPc3I/edit#heading=h.b43z973xff51
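To give a feel for what rendering-side handling of a structured text pattern could look like, here is a rough sketch. It is not taken from the linked proposal: the function name `formatStructured` and its parameters are hypothetical, and the technique shown (wrapping the whole string in a directional embedding and placing a direction mark before each separator so that segment order follows the chosen flow) is only one possible approach, which assumes a UBA-compliant renderer.

```javascript
// Direction controls used by this sketch.
const CONTROLS = {
  ltr: { embed: "\u202A", mark: "\u200E" }, // LRE, LRM
  rtl: { embed: "\u202B", mark: "\u200F" }, // RLE, RLM
};
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper: join segments (e.g. breadcrumbs or path components)
// so that they flow in the requested direction regardless of segment content.
function formatStructured(segments, separator, flow) {
  const { embed, mark } = CONTROLS[flow];
  // A direction mark before each separator keeps the neutral separator
  // from being absorbed into the direction of the neighbouring segments.
  return embed + segments.join(mark + separator) + PDF;
}

// Breadcrumbs flowing left to right, and the same segments flowing right to left.
console.log(JSON.stringify(formatStructured(["1", "2", "3"], " >> ", "ltr")));
console.log(JSON.stringify(formatStructured(["1", "2", "3"], " >> ", "rtl")));
```

The point of a typed pattern (file path, breadcrumbs, date / time stamp) would be that the separator set and the expected flow come from the pattern definition rather than being passed in by hand as they are here.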

@caridy
Contributor Author

caridy commented Nov 6, 2015

Update: react-intl is reporting issues with the checksum due to the invisible characters used by IE11; @ericf has more information about this. We should try to reach an agreement at the next meeting (in two weeks).
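A small sketch of why invisible characters break a markup checksum. React's actual checksum code is not reproduced here; a simplified Adler-32 style checksum over UTF-16 code units is used as a stand-in, and the two strings are the en outputs quoted earlier in this thread, playing the roles of server and client markup.

```javascript
// Simplified Adler-32 style checksum over UTF-16 code units
// (illustrative stand-in, not React's implementation).
function adler32(text) {
  const MOD = 65521;
  let a = 1;
  let b = 0;
  for (let i = 0; i < text.length; i++) {
    a = (a + text.charCodeAt(i)) % MOD;
    b = (b + a) % MOD;
  }
  return ((b << 16) | a) >>> 0;
}

const serverMarkup = "12/31/1969"; // e.g. formatted under node/V8
const clientMarkup = "\u200E12\u200E/\u200E31\u200E/\u200E1969"; // same date with LRMs

// The strings look identical on screen, but the checksums do not match.
console.log(adler32(serverMarkup) === adler32(clientMarkup));
```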

@bterlson
Member

bterlson commented Nov 6, 2015

@caridy / @ericf, wouldn't they have similar problems with Chrome when the server and client locales are different and one is RTL and one is LTR?

@ericf
Member

ericf commented Nov 6, 2015

@bterlson we resolve the user's locale on the server via a combination of HTTP content negotiation and the user's settings. We then render the React app on both the server and the client using this resolved locale value.

@bterlson
Member

bterlson commented Nov 6, 2015

@ericf alright makes sense, thanks for the clarification.

@sffc
Contributor

sffc commented Mar 19, 2019

It looks like this discussion is resolved. Please reopen if necessary.

@sffc sffc closed this as completed Mar 19, 2019