Merge pull request #522 from clavoline/fr-t9n-bom
French t9n Language Tags in HTML and XML
r12a committed Oct 13, 2023
2 parents 55cc8ac + 0e05f68 commit 72113b1
Showing 3 changed files with 867 additions and 21 deletions.
43 changes: 24 additions & 19 deletions articles/language-tags/index.en.html
@@ -145,7 +145,7 @@ <h2>Overview</h2>
<p class="info">RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, it is not possible to tell, when reading RFC 1766 or RFC 3066 that these specifications have been obsoleted and replaced by other specifications.</p>
</div>

<p>Language tag syntax is defined by the <abbr title="Internet Engineering Task Force">IETF</abbr>'s <a class="print" href="http://www.rfc-editor.org/rfc/bcp/bcp47.txt">BCP 47</a>. BCP stands for 'Best Current Practice', and is a persistent name for a series of <abbr title="Request For Comment">RFC</abbr>s whose numbers change as they are updated. The latest RFC describing language tag syntax is <a class="print" href="http://www.rfc-editor.org/rfc/rfc5646.txt"><cite>RFC 5646, Tags for the Identification of Languages</cite></a>, and it obsoletes the older RFCs <a class="print" href="http://www.rfc-editor.org/rfc/rfc4646.txt"> 4646</a>, <a class="print" href="http://www.ietf.org/rfc/rfc3066.txt">3066</a> and <a class="print" href="http://www.nordu.net/ftp/rfc/rfc1766.txt">1766</a>. </p>
<p>Language tag syntax is defined by the <abbr title="Internet Engineering Task Force">IETF</abbr>'s <a class="print" href="http://www.rfc-editor.org/rfc/bcp/bcp47.txt">BCP 47</a>. BCP stands for 'Best Current Practice', and is a persistent name for a series of <abbr title="Request For Comment">RFC</abbr>s whose numbers change as they are updated. The latest RFC describing language tag syntax is <a class="print" href="http://www.rfc-editor.org/rfc/rfc5646.txt"><cite>RFC 5646, Tags for the Identification of Languages</cite></a>, and it obsoletes the older RFCs <a class="print" href="http://www.rfc-editor.org/rfc/rfc4646.txt"> 4646</a>, <a class="print" href="http://www.ietf.org/rfc/rfc3066.txt">3066</a> and <a class="print" href="https://www.ietf.org/rfc/rfc1766.txt">1766</a>. </p>

<p>You used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in the <cite><a class="print" href="https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry"><abbr title="Internet Assigned Numbers Authority">IANA</abbr> Language Subtag Registry</a></cite>. We will describe the new registry below.</p>

@@ -233,7 +233,7 @@ <h2>Constructing language tags</h2>
class="puExample">privateuse</span></p>

<p>The entries in the registry follow certain conventions with regard to upper and lower letter-casing. For example, language tags are lower case,
alphabetic region subtags are upper case, and script tags begin with an initial capital. This is only a convention! When you use these subtags you
alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention! When you use these subtags you
are free to do as you like, unless you are constrained by the rules of the system you are working with. For HTML and XML language markup, the case should not matter.</p>
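
The casing conventions described in this hunk can be sketched in a few lines of Python. This is only an illustration, not part of the commit: the function name `conventional_case` is hypothetical, and the sketch decides what each subtag is purely from its position and shape (a 4-letter subtag is treated as a script, a 2-letter subtag as a region) without consulting the registry.

```python
def conventional_case(tag: str) -> str:
    """Apply the registry's letter-casing conventions to a language tag.

    Hypothetical helper for illustration only: subtags are classified by
    shape, not looked up in the IANA registry.
    """
    subtags = tag.split("-")
    result = []
    for index, subtag in enumerate(subtags):
        if index == 0:
            # Language subtags are lower case by convention.
            result.append(subtag.lower())
        elif len(subtag) == 1:
            # A singleton starts an extension or private-use sequence;
            # lower-case it and everything that follows, then stop.
            result.extend(s.lower() for s in subtags[index:])
            break
        elif len(subtag) == 4 and subtag.isalpha():
            # Script subtags begin with an initial capital.
            result.append(subtag.title())
        elif len(subtag) == 2 and subtag.isalpha():
            # Alphabetic region subtags are upper case.
            result.append(subtag.upper())
        else:
            result.append(subtag.lower())
    return "-".join(result)


print(conventional_case("ZH-hant-hk"))     # zh-Hant-HK
print(conventional_case("EN-us-X-TWAIN"))  # en-US-x-twain
```

Because case is only a convention, code that compares two tags would normally lower-case both sides first rather than rely on this normalization.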


@@ -374,7 +374,8 @@ <h3>The extended language subtag</h3>
Description: Gulf Arabic
Added: 2009-07-29
Preferred-Value: afb
Prefix: ar Macrolanguage: ar
Prefix: ar
Macrolanguage: ar
%%
</pre>
</figure>
@@ -398,8 +399,8 @@ <h3>The script subtag</h3>

<div class="insidenote">
<p class="scriptExample tagExampleHead">Script subtags</p>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">zh</span>-<span class="scriptExample currentExample" title="Script tag">Hans</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">az</span>-<span class="scriptExample currentExample" title="Script tag">Latn</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">zh</span>-<span class="scriptExample currentExample" title="Script subtag">Hans</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">az</span>-<span class="scriptExample currentExample" title="Script subtag">Latn</span></pre>
<p class="readmore">Read more in the BCP 47 spec: </p>

<p><a class="print" href="https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.3">2.2.3 Script Subtag</a></p>
@@ -410,19 +411,19 @@ <h3>The script subtag</h3>
<p>Examples of language tags including script subtags are:</p>
<ul>
<li><code class="kw" translate="no">zh-Hans</code> (Simplified Chinese)</li>
<li><code class="kw" translate="no">az-Latn</code> (Azerbaijani, written in Latin script - since Azerbaijani can also be written using the Arabic script)</li>
<li><code class="kw" translate="no">az-Latn</code> (Azerbaijani, written in Latin script - since Azerbaijani can also be written using the Arabic or Cyrillic script)</li>
</ul>

<p>The script subtag was first introduced in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.</p>

<p>Only one script subtag can appear in a language tag, and it must immediately follow the language or any extlang subtag. It is always four letters long.</p>

<p><strong>You should only use script tags if they are necessary to make a distinction you need.</strong> As RFC 4646 co-author, Addison
Phillips, writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".</p>
<p><strong>You should only use script subtags if they are necessary to make a distinction you need.</strong> As RFC 4646 co-author, Addison
Phillips, writes, "For virtually any content that does not use a script subtag today, it remains the best practice not to use one in the future".</p>

<p>If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use <code class="kw" translate="no">en-Zxxx</code> to make it clear that an audio recording in English is not written content.</p>

<p>Actually, many language subtag entries in the registry strongly discourage the use of script tags by including a <code class="kw" translate="no">Suppress script</code> field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the <code class="kw" translate="no">Latn</code> subtag should normally not be used with <code class="kw" translate="no">es</code>.</p>
<p>Actually, many language subtag entries in the registry strongly discourage the use of script subtags by including a <code class="kw" translate="no">Suppress-script</code> field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the <code class="kw" translate="no">Latn</code> subtag should normally not be used with <code class="kw" translate="no">es</code>.</p>

<p>This example shows the registry entry for Cyrillic script, <code class="kw" translate="no">Cyrl</code>, used for languages such as Russian:</p>

@@ -455,7 +456,7 @@ <h3>The region subtag</h3>
<p class="regionExample tagExampleHead">Region subtags</p>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">en</span>-<span class="regionExample currentExample" title="Region subtag">GB</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">es</span>-<span class="regionExample currentExample" title="Region subtag">005</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">zh</span>-<span class="scriptExample" title="Script tag">Hant</span>-<span class="regionExample currentExample" title="Region subtag">HK</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">zh</span>-<span class="scriptExample" title="Script subtag">Hant</span>-<span class="regionExample currentExample" title="Region subtag">HK</span></pre>
<p class="readmore">Read more in the BCP 47 spec: </p>

<p><a class="print" href="https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.4">2.2.4 Region Subtag</a></p>
@@ -473,7 +474,7 @@ <h3>The region subtag</h3>

<p>The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries. For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the Spanish of any particular country. With RFC 5646 this is possible; the appropriate language tag is <code class="kw" translate="no">es-419</code>.</p>

<p>Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script tags. It is a two-letter alpha or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are used to for language tags such as <code class="kw" translate="no">en-US</code>.</p>
<p>Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script subtags. It is a two-letter alpha or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are used to for language tags such as <code class="kw" translate="no">en-US</code>.</p>

<p>Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian <em>as spoken in Italy</em> you should use <code class="kw" translate="no">it</code> for Italian, and not <code class="kw" translate="no">it-IT</code>. The same goes for any other possible combination.</p>
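
The ordering and shape rules quoted in the script and region hunks (a four-letter script subtag after the language or extlang, then a two-letter or three-digit region) can be illustrated with a small Python sketch. The name `split_tag` and the regular expression are assumptions made for this illustration. The sketch deliberately simplifies: it only accepts two- or three-letter language subtags, lumps variants, extensions and private-use subtags together as `rest`, and checks shape only, not whether a subtag exists in the registry.

```python
import re

# Shape-only pattern: language, optional extlang, optional script,
# optional region, then any remaining subtags of 1 to 8 characters.
_SHAPE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<extlang>[a-z]{3}))?"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<rest>(?:-[a-z0-9]{1,8})*)$",
    re.IGNORECASE,
)


def split_tag(tag):
    """Split a tag into its leading subtags, or return None if the
    shape does not fit this simplified pattern."""
    match = _SHAPE.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn the trailing "-a-b-c" string into a list of subtags.
    parts["rest"] = [s for s in parts["rest"].split("-") if s]
    return parts


print(split_tag("zh-Hant-HK")["script"])  # Hant
print(split_tag("es-419")["region"])      # 419
print(split_tag("sl-nedis")["rest"])      # ['nedis']
```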

@@ -544,7 +545,7 @@ <h3>Variant subtags</h3>
</pre>
</figure>

<p>In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the primary language subtag) by the 'Prefix' field. The <code class="kw" translate="no">nedis</code> example shown above should only be used with Slovenian. </p>
<p>In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the primary language subtag) by the <code class='kw' translate='no'>Prefix</code> field. The <code class="kw" translate="no">nedis</code> example shown above should only be used with Slovenian. </p>

<p>If you need to express a particular dialectal or script nuance that is not currently available, you should propose a variant subtag or subtags for inclusion in the
registry using the registration procedure outlined in RFC 5646.</p>
@@ -562,7 +563,7 @@ <h3>Extension and private-use subtags</h3>

<div class="insidenote" style="width: 44%; margin-left: 2em;">
<p class="extensionExample tagExampleHead">Extension subtags</p>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">de</span>-<span class="regionExample" title="Region subtag">DE</span>-<span class="extensionExample currentExample" title="Private use subtag">u-co-phonebk</span></pre>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">de</span>-<span class="regionExample" title="Region subtag">DE</span>-<span class="extensionExample currentExample" title="Extension subtag">u-co-phonebk</span></pre>
<p class="puExample tagExampleHead">Private use subtags</p>
<pre class="tagExample" translate="no"><span class="languageExample" title="Language subtag">en</span>-<span class="regionExample" title="Region subtag">US</span>-<span class="puExample currentExample" title="Private use subtag">x-twain</span></pre>
<p class="readmore">Read more in the BCP 47 spec: </p>
@@ -578,9 +579,12 @@ <h3>Extension and private-use subtags</h3>

<p>Extension and private use subtags are introduced by a single letter tag, or 'singleton'. An organization can propose a singleton for an extension. Its intended use must be described by an RFC (IETF specification). The singleton will be added to the registry if it successfully passes a review. The singleton <code class="kw" translate="no">x</code> is reserved for private use. Multiple subtags are allowed after the singleton; however, as for all subtags, they must each be 8 or less characters in length.</p>
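
As an illustration of the singleton rule in the context line above, the following Python sketch separates a tag into its core and its singleton-introduced sequences. The function name `split_extensions` is hypothetical. The sketch assumes that everything after `x` is private use, that a repeated extension singleton is an error, and that a single-letter first subtag (as in a wholly private-use tag) is simply left in the core; it enforces the 8-character limit mentioned above but performs no registry lookups.

```python
def split_extensions(tag):
    """Return (core, extensions) for a language tag.

    `extensions` maps each lower-cased singleton to the list of subtags
    that follow it. Hypothetical helper; shape checks only.
    """
    subtags = tag.split("-")
    if any(not 1 <= len(s) <= 8 for s in subtags):
        raise ValueError("each subtag must be 1 to 8 characters long")

    core = []
    extensions = {}
    current = None  # singleton whose sequence is being collected
    for subtag in subtags:
        if current == "x":
            # After the private-use singleton, everything is private use.
            extensions["x"].append(subtag)
        elif len(subtag) == 1 and core:
            key = subtag.lower()
            if key in extensions:
                raise ValueError("singleton %r appears more than once" % key)
            extensions[key] = []
            current = key
        elif current is None:
            core.append(subtag)
        else:
            extensions[current].append(subtag)
    return "-".join(core), extensions


print(split_extensions("de-DE-u-co-phonebk"))  # ('de-DE', {'u': ['co', 'phonebk']})
print(split_extensions("en-US-x-twain"))       # ('en-US', {'x': ['twain']})
```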

<div class="insideinfonote">
<p>A locale is an identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.</p>
</div>
<p>Extension subtags allow for extensions to the language tag. For example, the extension subtag <code class="kw" translate="no">u</code> has been registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.</p>

<p>For example, the following indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.</p>
<p>For example, in the following tag, the <code class='kw' translate='no'>u-co-phonebk</code> extension indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.</p>

<ul>
<li><code class="kw" translate="no">de-DE-u-co-phonebk</code></li>
@@ -616,11 +620,11 @@ <h3>Grandfathered and redundant subtags</h3>
<p><a class="print" href="https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.8">2.2.8 Grandfathered and Redundant Registrations</a></p>
</div>

<p>Grandfathered tags are special cases, provided for backwards compatibility. They are subtags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags. </p>
<p>Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags. </p>

<p>Redundant tags are language tags composed of a sequence of subtags and registered before RFC 4646 that can now be formed by combining separate subtags from the current registry. The original registrations remain in the registry mostly 'as a matter of historical curiosity'.</p>

<p>Many grandfathered tags have been superceded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a <code class="kw" translate="no">Preferred-Value</code> field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the <code class="kw" translate="no">jbo</code> language subtag instead of <code class="kw" translate="no">art-lojban</code>.</p>
<p>Many grandfathered tags have been replaced by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a <code class="kw" translate="no">Preferred-Value</code> field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the <code class="kw" translate="no">jbo</code> language subtag instead of <code class="kw" translate="no">art-lojban</code>.</p>


<figure class="example">
@@ -666,9 +670,10 @@ <h2 id="matching" style="margin-bottom: 0;">Matching language tags</h2>
</blockquote>

<p>the word 'SALE' should <em>not</em> be red in the following code.</p>
<blockquote>

<pre lang="fr" translate="no"><code class="language-html">&lt;p&gt;En janvier, toutes les boutiques de Londres affichent des panneaux &lt;span lang="en"&gt;SALE&lt;/span&gt;, mais en fait ces magasins sont bien propres!&lt;/p&gt;</code></pre>
<blockquote>
<pre lang="fr" translate="no"><code class="language-html">&lt;p&gt;En janvier, toutes les boutiques de Londres affichent des panneaux
&lt;span lang="en"&gt;SALE&lt;/span&gt;, mais en fait ces magasins sont bien propres!&lt;/p&gt;</code></pre>
</blockquote>
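
The expectation in this hunk, that the English span is not matched by a rule targeting French, corresponds to simple prefix matching. A minimal Python sketch of that idea follows, modelled on what RFC 4647 calls basic filtering: a range matches a tag if the two are equal, or if the range is a prefix of the tag and is followed by a hyphen, ignoring case. The function name `basic_filter_match` is hypothetical, and this is only one of the matching schemes RFC 4647 describes; it does not claim to reproduce how any particular browser implements matching.

```python
def basic_filter_match(language_range, tag):
    """Return True if `tag` matches `language_range` by prefix matching.

    Hypothetical helper sketching RFC 4647 basic filtering.
    """
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        # The wildcard range matches any tag.
        return True
    return tag == language_range or tag.startswith(language_range + "-")


print(basic_filter_match("fr", "fr-CA"))  # True
print(basic_filter_match("fr", "en"))     # False: the SALE span is not French
print(basic_filter_match("fr", "frm"))    # False: "frm" is a different language subtag
```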

<p>With the availability of additional tags in RFC 5646, matching is a little more complicated. In addition, its companion, <a class="print" href="http://www.rfc-editor.org/rfc/rfc4647.txt"><cite>RFC 4647 Matching of Language Tags</cite></a>, describes more than one possible approach to matching.
@@ -685,7 +690,7 @@ <h2 id="matching" style="margin-bottom: 0;">Matching language tags</h2>
<section id="bytheway">
<h2>By the way</h2>

<p>Language tags for HTML were first formally defined in RFC 2070, F. Yergeau, et.al. <cite><a class="print" href="http://www.ietf.org/rfc/rfc2070.txt">Internationalization of the Hypertext Markup Language</a></cite>. RFC 2070 was incorporated into <a class="print" href="/TR/html4">HTML 4</a>, and has been reclassified as historic.</p>
<p>Language tags for HTML were first formally defined in RFC 2070, F. Yergeau, et al. <cite><a class="print" href="http://www.ietf.org/rfc/rfc2070.txt">Internationalization of the Hypertext Markup Language</a></cite>. RFC 2070 was incorporated into <a class="print" href="/TR/html4">HTML 4</a>, and has been reclassified as historic.</p>

<p>Note there have been <a class="print" href="http://www.loc.gov/standards/iso639-2/codechanges.html"><strong>changes</strong></a> to ISO language codes. In 1989 <code class="kw" translate="no">iw</code>, <code class="kw" translate="no">in</code>, and <code class="kw" translate="no">ji</code> were withdrawn and replaced by <code class="kw" translate="no">he</code>, <code class="kw" translate="no">id</code>, and <code class="kw" translate="no">yi</code>. More recently, the ISO country code <code class="kw" translate="no">cs</code>, that used to represent Czechoslovakia, was changed to represent Serbia and Montenegro. Such changes can lead to confusion when comparing codes that were assigned to text over a long period. The new IANA subtag registry allows for tags to be deprecated and superseded by new tags, but will never remove or change the meaning of a subtag. It is expected that ISO will also follow a similar policy for the future.</p>
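
The replacements named in the article text can be applied mechanically. The sketch below is a hypothetical Python helper, `modernize`, built on a hand-written table containing only the pairs mentioned in this diff (`iw`, `in`, `ji`, and the grandfathered `art-lojban`). A real implementation would presumably read the `Preferred-Value` fields from the IANA registry instead of hard-coding them.

```python
# Only the replacements named in the article; not a complete list.
PREFERRED = {
    "iw": "he",
    "in": "id",
    "ji": "yi",
    "art-lojban": "jbo",
}


def modernize(tag):
    """Replace a deprecated whole tag or primary language subtag."""
    whole = PREFERRED.get(tag.lower())
    if whole is not None:
        # The entire tag is deprecated (for example, a grandfathered tag).
        return whole
    first, separator, rest = tag.partition("-")
    replacement = PREFERRED.get(first.lower())
    if replacement is None:
        return tag
    # Keep the remaining subtags exactly as they were written.
    return replacement + separator + rest


print(modernize("iw-IL"))       # he-IL
print(modernize("art-lojban"))  # jbo
print(modernize("en-GB"))       # en-GB
```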
