json-bidi, first draft

w3c · Aug 12, 2016 · 8d7bf18
1 parent d903971
commit 8d7bf18

Showing 6 changed files with 185 additions and 0 deletions.
diff --git a/notes/json-bidi-data/ltr-base-direction.png b/notes/json-bidi-data/ltr-base-direction.png
diff --git a/notes/json-bidi-data/ltr_hash_he.png b/notes/json-bidi-data/ltr_hash_he.png
diff --git a/notes/json-bidi-data/multiple_base_directions.png b/notes/json-bidi-data/multiple_base_directions.png
diff --git a/notes/json-bidi-data/rtl-base-direction.png b/notes/json-bidi-data/rtl-base-direction.png
diff --git a/notes/json-bidi-data/rtl_hash_he.png b/notes/json-bidi-data/rtl_hash_he.png
diff --git a/notes/json-bidi.html b/notes/json-bidi.html
@@ -0,0 +1,185 @@
+<!doctype html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>Notes on JSON strings and text direction</title>
+<link rel="stylesheet" href="../local.css"/>
+<style>
+#use_cases li {
+	border: 1px solid #ccc;
+	margin-bottom: 2em;
+	padding: 1em;
+}
+code {
+	color: brown;
+}
+#toc {
+	float: right;
+}
+footer {
+	margin-top: 5em;
+	font-size: 80%;
+}
+.ed {
+	color: red;
+	font-size: 140%;
+}
+</style>
+</head>
+
+<body>
+<h1>Notes on JSON strings and text direction</h1>
+<p style="color:red; font-size: 120%;">This is a draft to promote discussion and to work through ideas, and may be updated at any time.</p>
+<p>This page gathers observations about handling of text direction and language in JSON data. <b>It doesn't make recommendations; it just aims to draw together useful background information, questions and lines of thought to help determine the best course of action to support text direction and language metadata associated with strings in JSON objects.</b> It was written by r12a, and these are personal thoughts, not endorsed by the i18n WG. </p>
+<p><b>Note: it is difficult to represent bidi examples satisfactorily. For example, "פעילות הבינאום, W3C" doesn't actually show the expected position of 'W3C' when displayed, but also doesn't represent the order of characters in memory, since the first such character is פ. Code examples below show characters from left-to-right in the order in which they are stored in memory, and we'll use Hebrew to avoid the confusing effects of Arabic letters joining backwards.</b> </p>
+<p>When talking about markup in JSON strings we will make the assumption for the purposes of this discussion that the markup is HTML5.</p>
+
+
+
+
+<section id="why">
+  <h2>Why information is needed about the base direction for a string</h2>
+  <p>To support correct display of text in right-to-left scripts when it is eventually shown to a human, it is necessary to be able to: </p>
+  <ol>
+    <li> establish the overall base direction for a paragraph </li>
+    <li> change the base direction used for a range of inline text where needed </li>
+  </ol>
+  <p><b>These notes are only about the former</b> as it relates to the JSON-based format.  Indications of direction change inline would just be carried over from whatever the original author used. </p>
+  <p>For a simple introduction to how the Unicode bidirectional algorithm works, and where it needs additional help, see <a href="https://www.w3.org/International/articles/inline-bidi-markup/uba-basics">Unicode Bidirectional Algorithm basics</a>. You will need a grasp of these basics to understand what follows. </p>
+  <p>By default, strings are generally handled as if their base direction (directional context) is LTR.  If the base direction for the string needs to be RTL, this information needs to be associated with the string in some way, since it affects the order in which elements in the string will be rendered to a user.  Without this information, users may be unable to understand a message.  For example, the following shows a string presented with a RTL base direction. </p>
+  <p><img alt="Rtl-base-direction.png" src="json-bidi-data/rtl-base-direction.png" height="43" width="666"> </p>
+  <p>Here is the same string presented with a LTR base direction. </p>
+  <p><img alt="Ltr-base-direction.png" src="json-bidi-data/ltr-base-direction.png" height="42" width="670"> </p>
+</section>
+
+
+<section id="ascertain">
+<h2>Ascertaining the base direction of strings outside of JSON</h2>
+<p>Strings that will be stored in JSON need to be stored in some way that indicates what base direction should be used to display them correctly later. In order to be able to store that information, it's necessary to correctly detect the base direction for the string in its original source.</p>
+
+<section id="firststrong">
+<h3>Determining base direction from the string itself</h3>
+<p>It is often possible to look at the beginning of a string to determine whether it needs to be given a RTL or LTR base direction. That would work for the example above. The first strong character is RTL, so applying a RTL base direction to the string for display will produce the desired result. </p>
+<p>Note, by the way, that such a first-strong detection heuristic must sometimes move past the first character in the text stream. It must pass over punctuation, numbers, and other characters that don't have strong directional properties in Unicode. </p>
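+<p>As a rough illustration of the heuristic just described, here is a minimal Python sketch. The function name <code>first_strong_direction</code> is invented for this note. It treats Unicode bidi classes L, R and AL as strong and passes over everything else. It is a simplification: the full rules in the Unicode bidirectional algorithm also skip text inside isolates, which this sketch does not attempt.</p>

```python
import unicodedata

# Unicode bidi classes that count as "strong": L is left-to-right,
# R and AL (Arabic letter) are right-to-left.
_STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text):
    """Return "ltr" or "rtl" for the first strong character, else None."""
    for ch in text:
        direction = _STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    # No strong character at all, e.g. a telephone number.
    return None


if __name__ == "__main__":
    print(first_strong_direction("פעילות הבינאום, W3C"))  # rtl
    print(first_strong_direction("123 שלום"))             # rtl (digits are skipped)
    print(first_strong_direction("(+1) 555-0100"))        # None
```

+<p>Returning <code>None</code> rather than a default leaves it to the caller to decide what to do when there is no strong character.</p>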
+<p>But sometimes finding the first strong character is actually misleading. Take a string typed into an HTML form input field such as the following: </p>
+<p><img alt="Rtl hash he.png" src="json-bidi-data/rtl_hash_he.png" height="63" width="418"> </p>
+<p>The sequence of characters, shown as stored in memory, is as follows. The point to note is that the sequence starts with LTR characters. </p>
+<p><code><bdo dir="ltr">‭#bidi פעילות הבינאום, W3C‬</bdo></code></p>
+<p>If the consumer of this string were to assume that the text needs a LTR base direction, based on the first strong character, the result would be incorrect when displayed to a human.</p>
+<p><img alt="Ltr hash he.png" src="json-bidi-data/ltr_hash_he.png" height="64" width="420"> </p>
+<p>A similar problem arises if the JSON string starts with markup. HTML tag names are always in Latin text, so to identify the first strong character in HTML markup you need to skip the markup, including attributes and their values. Here is an example (again the characters are shown left-to-right as stored in memory).</p>
+<p><code>
+  <bdo dir="ltr">&lt;p class='post'>פעילות הבינאום, W3C‬&lt;/p></bdo></code></p>
+<p>In some cases there are additional rules involved. For example, HTML5 skips the content of certain types of markup (such as <code class="kw" translate="no">bdi</code>) before identifying the first strong character. </p>
+<p>If, however, that markup comes with the base direction already specified, it would be important to try to understand that markup. For example:</p>
+<p><code>
+  <bdo dir="ltr">&lt;p dir='ltr'&gt;‭#bidi פעילות הבינאום, W3C&lt;/p&gt;</bdo>
+</code></p>
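+<p>The two markup cases above could be handled along the following lines. This is only a sketch, using Python's standard <code>html.parser</code>; the names <code>markup_direction</code> and <code>_DirectionProbe</code> are made up for illustration. It assumes that a <code>dir</code> value of <code>ltr</code> or <code>rtl</code> on the tag that opens the string wins, and otherwise applies first-strong detection to the text with the tags removed. It does not implement the additional HTML5 rules, such as skipping the content of <code>bdi</code>.</p>

```python
import unicodedata
from html.parser import HTMLParser

_STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text):
    """Return "ltr"/"rtl" for the first strong character, else None."""
    for ch in text:
        direction = _STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return None


class _DirectionProbe(HTMLParser):
    """Collects text content and any dir value on the string's opening tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.declared = None   # dir="ltr"/"rtl" found on the opening tag
        self.started = False   # True once any tag or visible text is seen
        self.text = []

    def handle_starttag(self, tag, attrs):
        if not self.started:
            value = (dict(attrs).get("dir") or "").lower()
            if value in ("ltr", "rtl"):
                self.declared = value
        self.started = True

    def handle_data(self, data):
        if data.strip():
            self.started = True
        self.text.append(data)


def markup_direction(html_text):
    """Guess the base direction of a string that may start with markup."""
    probe = _DirectionProbe()
    probe.feed(html_text)
    probe.close()  # flush any buffered text
    if probe.declared is not None:
        return probe.declared
    # Tag names and attributes are never examined, only the text content.
    return first_strong_direction("".join(probe.text))
```

+<p>With this sketch, the <code>class='post'</code> example yields <code>rtl</code> because the tag is skipped, and the <code>dir='ltr'</code> example yields <code>ltr</code> because the attribute is honoured.</p>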
+<p>All of the above heuristic approaches, however, assume that first-strong detection is established as the default way to assess the base direction for a string. There are other algorithms in use. </p>
+<p>For example, although Facebook relies on first-strong detection to build markup with the right base direction around posts, Twitter instead counts the relative number of LTR vs RTL characters in a tweet to determine the base direction.  It can be argued that Twitter's approach produces less predictable output than first-strong, but that's beside the point. The point is that there needs to be a generally agreed way to detect the base direction to promote interoperability. </p>
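+<p>To make the difference concrete, here is a minimal sketch of a counting heuristic. It is not Twitter's actual implementation, whose details are not given here; it simply assumes a plain majority of strong characters, and the name <code>majority_direction</code> is invented for this note.</p>

```python
import unicodedata


def majority_direction(text):
    """Return the direction with more strong characters, or None on a tie."""
    ltr = rtl = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            ltr += 1
        elif bidi_class in ("R", "AL"):
            rtl += 1
    if rtl > ltr:
        return "rtl"
    if ltr > rtl:
        return "ltr"
    return None  # equal counts, including no strong characters at all


if __name__ == "__main__":
    # 6 LTR letters (b, i, d, i, W, C) against 13 Hebrew letters.
    print(majority_direction("#bidi פעילות הבינאום, W3C"))  # rtl
```

+<p>For the <code>#bidi</code> example, this counting sketch gives RTL where first-strong detection gives LTR, which is exactly the kind of disagreement that harms interoperability.</p>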
+</section>
+
+<section id="othercases">
+<h3>Determining base direction from context</h3>
+<p>If this is text created using an HTML form, in order to establish the base direction you need to know the computed direction of the form field into which the text was typed.  That direction may be established by inheritance from a parent element, such as the <code class="kw" translate="no">html</code> tag, or may be set by the user with keyboard shortcuts. </p>
+<p>This information about the base direction of the string needs to be captured and stored in some way by the JSON, so that the string can be displayed correctly by a consumer of the JSON data. </p>
+</section>
+
+
+<section id="multipleparagraphs">
+<h3>Multiple paragraphs</h3>
+<p>In the case where the string input by the user contains multiple paragraphs (ie. multiple lines separated by line breaks in a form input or other plain text, or marked up text containing block level constructs), we only need to know the base direction of the string as a whole. Any differences in base direction introduced between such paragraphs in a single input field, such as in </p>
+<p><img alt="Multiple base directions.png" src="json-bidi-data/multiple_base_directions.png"/></p>
+<p>would need to be introduced by the user anyway, and the mechanism used for that would be part of the captured string.  In other words, we only need to capture and store the base direction set for the input as a whole. </p>
+<p>Note, however, that the base direction is not only specified as RTL or LTR. If the input field is, say, a <code class="kw" translate="no">textarea</code> with direction set to <code class="kw" translate="no">auto</code>, it is expected that the base direction will be determined for each line (paragraph) on the basis of the first strong character in that line. </p>
+</section>
+</section>
+
+
+<section id="storing">
+<h2>Storing the base direction in JSON</h2>
+<p>There needs to be a way of storing information about the base direction for a string the first time it is encoded as JSON that can be recognised and used by consumers of the JSON data.</p>
+
+<section id="property">
+<h3>Using a direction property</h3>
+<p>If direction information is stored as a property, there needs to be a property for each string. In the following example, a <code class="kw" translate="no">direction</code> property at the same level as <code class="kw" translate="no">name</code> and <code class="kw" translate="no">content</code> would not work, since it can't serve both of those strings.</p>
+<pre>{
+  "@context": {
+    "@value": "http://www.w3.org/ns/activitystreams"
+     },
+  "name": "r12a posted a note",
+  "type": "Note",
+  "content": "פעילות הבינאום, W3C"
+  }</pre>
+<p>What may work better, however, is a string type that allows for direction to be optionally stored with each string. For example,</p>
+<pre>{
+  "@context": {
+    "@value": "http://www.w3.org/ns/activitystreams"
+     },
+  "name": { &quot;content&quot; : "r12a posted a note" },
+  "type": "Note",
+  "content": { &quot;content&quot; : "פעילות הבינאום, W3C", &quot;dir&quot; : &quot;rtl&quot; }
+  }</pre>
+<p>For this to work, all JSON applications would need to recognise the structure of these strings, and be able to extract at least the content.</p>
+<p>If the direction information is omitted, the convention must be that the direction is LTR. This is important where LTR strings are injected into a RTL environment.</p>
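+<p>A consumer of such a format might read strings roughly as follows. This is a sketch under assumptions: the <code>content</code> and <code>dir</code> keys are the hypothetical shape used in the example above, not an agreed standard, and <code>read_string</code> is a made-up helper. It also accepts bare strings, and falls back to LTR when no direction is given.</p>

```python
import json


def read_string(value, default_dir="ltr"):
    """Return (content, direction) for a bare or direction-tagged string."""
    if isinstance(value, str):
        # A bare JSON string carries no direction metadata.
        return value, default_dir
    # Hypothetical object shape: {"content": ..., "dir": ...}
    return value["content"], value.get("dir", default_dir)


doc = json.loads("""
{
  "name": { "content": "r12a posted a note" },
  "type": "Note",
  "content": { "content": "פעילות הבינאום, W3C", "dir": "rtl" }
}
""")

if __name__ == "__main__":
    print(read_string(doc["name"]))     # direction defaults to ltr
    print(read_string(doc["content"]))  # direction is rtl
```

+<p>The inspection is a simple key lookup, with no scanning of the string itself.</p>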
+<p>There may be some appeal to this approach, if it were possible to adopt it widely, since the base direction really is metadata about the string, viz. it is often expressed separately from the string in the original source and the final destination of the string when visible to humans. It also avoids cluttering strings with additional characters (see below). It also simplifies the process of determining the base direction of the JSON string, since the inspection procedure is straightforward. The crux, of course, is getting it recognised as a standard approach.</p>
+<p>Needs more thought: [One of the problems of using the first-strong approach may be that if you are storing a string in the JSON format, you must know whether the string is coming from a non-JSON context (in which case you need to examine the context and decide whether or not to add RLM at the start of the string, etc.), or coming from a JSON context where those decisions have already been taken. If a string arrives with direction metadata, it's pretty clear that that initial process has been done.]</p>
+</section>
+<section id="storingfs">
+<h3>Relying on first-strong characters</h3>
+<p>One way to store the base direction for a string is to follow the convention that the first strong directional character in a string indicates the base direction for the whole string. </p>
+
+<section id="plaintext">
+<h4>Plain text</h4>
+<p>In the case of a string such as the following,</p>
+<p><code>
+  <bdo dir="ltr">"summary" : "‭פעילות הבינאום, W3C"</bdo></code></p>
+<p>which starts with a RTL character, this is straightforward as long as the intended base direction is indeed RTL. </p>
+<p>Consumers of the JSON string would still have to scan the string far enough to detect the first strong character.</p>
+<p>In the case where there is no strong character (for example, a telephone number), there would need to be a convention that the default is LTR. (Note that this does not necessarily mean that consumers of the JSON don't need to do anything. If a consumer is inserting the string into a RTL context, it would need to ensure that the LTR base direction was preserved for that string.)</p>
+<p>In the case of this example (where the characters are shown in memory order), </p>
+<p><code>
+  <bdo dir="ltr">"content" : "‭#bidi פעילות הבינאום, W3C‬"</bdo>
+</code></p>
+<p>which is a RTL phrase that starts with a LTR strong character, the JSON string could have an invisible strong RTL character, ie. RLM, added to the start of the string to indicate the expected base direction of the string when consumed later.</p>
+<p>An application that constructs a JSON string should only add RLM if there isn't already a strong RTL character at the start. Indiscriminate addition of RLM to RTL strings can cause a build up of redundant RLMs at the start of a string.</p>
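+<p>A producer could guard against that build-up along these lines. This is a minimal sketch: <code>mark_rtl</code> is an invented name, and it assumes the caller already knows, from the original context, that the intended base direction is RTL.</p>

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, an invisible strong RTL character

_STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text):
    """Return "ltr"/"rtl" for the first strong character, else None."""
    for ch in text:
        direction = _STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return None


def mark_rtl(text):
    """Prefix RLM only if first-strong detection would not already say RTL."""
    if first_strong_direction(text) == "rtl":
        return text  # already starts with a strong RTL character (or an RLM)
    return RLM + text


if __name__ == "__main__":
    once = mark_rtl("#bidi פעילות הבינאום, W3C")
    twice = mark_rtl(once)
    print(once == twice)  # True: a second pass adds nothing
```

+<p>Because RLM is itself a strong RTL character, running the function again on its own output leaves the string unchanged.</p>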
+<p>(Note that you cannot expect humans creating the original string to use RLM in this situation, since the string would look perfectly fine to them if the surrounding content was RTL. Furthermore, RLM characters are not commonly available on user keyboards, especially for mobile devices.)</p>
+<p>This way of indicating the intended base direction only works, however, if all consumers of the JSON strings know that they should look for the first strong character to determine the base direction that should be applied.</p>
+</section>
+
+
+<section id="markup">
+<h4>Strings that are enclosed in markup</h4>
+<p>Some strings may start with markup. Here we are looking at a string that begins and ends with HTML markup. For example, </p>
+<p><code>
+  <bdo dir="ltr">"summary" : "&lt;p class=&quot;summary&quot;&gt;‭פעילות הבינאום, W3C&lt;/p&gt;"</bdo>
+</code></p>
+<p>A consumer application that looks for the first strong character in such a string will always encounter a LTR character first, so the producer of the JSON string could either: </p>
+<ol>
+  <li> assume that the consumer will skip the markup to detect the first strong character, or </li>
+  <li> put a directional control character  at the start of the string in order to represent the base direction. </li>
+</ol>
+<p>If the incoming HTML happens to already contain a <code class="kw" translate="no">dir</code> attribute on the surrounding tags, eg. </p>
+<p><code>
+  <bdo dir="ltr">"summary" : "&lt;p dir=&quot;rtl&quot; class=&quot;summary&quot;&gt;#bidi ‭פעילות הבינאום, W3C&lt;/p&gt;"</bdo>
+</code></p>
+<p>then the question is whether to continue to put a directional character at the start of the string to indicate the intended base direction, or to assume that the consumer of the JSON string will know enough about the markup to recognise that <code>dir=&quot;rtl&quot;</code> already specifies the direction, and use that.</p>
+<p>It may be worth noting that putting a control character such as RLM before a <code class="kw" translate="no">p</code> tag would have no effect on the rendering of a string where its destination is HTML, since RLM doesn't create a base direction in HTML. For that you would need to use markup. (Embedding/isolating control code pairs would not work either, since Unicode controls are only effective within the current paragraph (ie. they are inline constructs), and the <code class="kw" translate="no">p</code> tag immediately initiates a new paragraph.) </p>
+</section>
+</section>
+</section>
+
+
+<section id="consuming">
+<h2>Consuming JSON strings</h2>
+
+<section id="consumingfs">
+<h3>Consuming JSON strings that use first-strong characters to indicate base direction</h3>
+<p>Let's suppose that some JSON strings will eventually be displayed to an end user as part of a Web page. When the text is inserted into a target document, the application that reads the JSON needs to also use first-strong heuristics to detect the direction, and then build markup around the string to ensure that it is presented appropriately to the user. (This is probably easiest to achieve by using <code>dir=auto</code> on a block element around the string, or using <code>span dir=auto</code> or <code>bdi</code> around inline presentations.) </p>
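+<p>For a plain-text string, the wrapping step might look like the sketch below. The helper names <code>wrap_block</code> and <code>wrap_inline</code> are invented here, and the sketch assumes the JSON string contains no markup of its own, so it is escaped before insertion. The browser then applies first-strong detection itself, and a leading RLM counts as a strong RTL character for that purpose.</p>

```python
from html import escape


def wrap_block(text):
    """Wrap plain text in a paragraph whose direction the browser detects."""
    return '<p dir="auto">' + escape(text) + "</p>"


def wrap_inline(text):
    """Wrap plain text for inline use; bdi isolates and auto-detects."""
    return "<bdi>" + escape(text) + "</bdi>"


if __name__ == "__main__":
    print(wrap_block("פעילות הבינאום, W3C"))
    print(wrap_inline("Q&A"))  # <bdi>Q&amp;A</bdi>
```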
+<p>The point here is that  the model needs to establish a convention that consumers should use the first-strong heuristic approach to represent the base direction, so that the consuming application knows where to look for the information about base direction that it will need while rendering the string in the destination. This is important because there are other heuristic approaches that ignore the first-strong character.  (Twitter, for example, does NOT currently use first-strong to determine the direction of text in a tweet.  It looks instead at the overall content of the tweet. Also, currently both Twitter and Facebook appear to actively strip out directional control characters before creating the HTML that supports their display. See <a rel="nofollow" class="external free" href="https://www.w3.org/International/wiki/Bidi_in_social_media">https://www.w3.org/International/wiki/Bidi_in_social_media</a>.) </p>
+<p>Btw, there are likely to be some special rules applied to the presentation of strings in some contexts. For example, Twitter locates, and treats specially, certain sequences of characters that begin with @ or #, isolating them and giving them an inline direction of LTR if they contain LTR characters.</p>
+<p>Note also that if the consumed JSON string is to be embedded into plain text, it will also be necessary to establish the base direction for the injected text if different from its surrounding context, and isolate it to avoid spillover effects with adjacent text. RLM is not adequate to establish the base direction for a run of text – the consumer will need to use the paired control characters. Ideally that would be RLI/LRI/FSI...PDI, since they isolate; however, these are new Unicode characters and not yet widely supported. The alternative is RLE/LRE...PDF, but that is discouraged by the Unicode Standard, because it doesn't eliminate spillover effects.</p>
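+<p>The isolating option can be sketched as follows. The function name <code>isolate</code> is made up; the code points are the Unicode isolate controls. When the direction is unknown, the sketch falls back to FSI, which asks the renderer to apply first-strong detection to the isolated run.</p>

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in paired isolate controls for insertion into plain text."""
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


if __name__ == "__main__":
    message = "Posted: " + isolate("פעילות הבינאום, W3C", "rtl") + " (1 reply)"
    print(len(message))
```

+<p>The surrounding text stays outside the isolate, so its ordering is not affected by the injected string.</p>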
+</section>
+</section>
+</body>
+</html>