Update the multilingual web addresses article
Partial fix for #564.
xfq committed Dec 8, 2023
1 parent b69637e commit 1051670
Showing 1 changed file with 22 additions and 45 deletions.
67 changes: 22 additions & 45 deletions articles/idn-and-iri/index.en.html
Expand Up @@ -15,13 +15,13 @@
f.modifiers = '' // people making substantive changes, and their affiliation
f.searchString = 'article-idn-and-iri' // blog search string - usually the filename without extensions
f.firstPubDate = '2005-01-14' // date of the first publication of the document (after review)
f.lastSubstUpdate = { date:'2008-05-09', time:'15:32'} // date and time of latest substantive changes to this document
f.lastSubstUpdate = { date:'2023-12-08', time:'15:32'} // date and time of latest substantive changes to this document
f.status = 'published' // should be one of draft, review, published, or notreviewed
f.path = '../../' // what you need to prepend to a URL to get to the /International directory

// AUTHORS AND TRANSLATORS should fill in these assignments:
f.thisVersion = { date:'2022-05-25', time:'11:00'} // date and time of latest edits to this document/translation
f.contributors = '' // people providing useful contributions or feedback during review or at other times
f.thisVersion = { date:'2023-12-08', time:'11:00'} // date and time of latest edits to this document/translation
f.contributors = 'Michael Monaghan, Greg Aaron' // people providing useful contributions or feedback during review or at other times

// TRANSLATORS should fill in these assignments:
f.translators = '' // translator(s) and their affiliation - a elements allowed, but use double quotes for attributes
Expand Down Expand Up @@ -65,7 +65,7 @@ <h1>An Introduction to Multilingual Web Addresses</h1>
<section id="why">
<h2>Why multilingual Web addresses?</h2>

<p>Web addresses are typically expressed using <dfn>Uniform Resource Identifiers</dfn> or <dfn>URIs</dfn>. The URI syntax defined in <a href="http://www.ietf.org/rfc/rfc3986">RFC 3986 STD 66</a> (<cite>Uniform Resource Identifier
<p>Web addresses are typically expressed using <dfn>Uniform Resource Identifiers</dfn> or <dfn>URIs</dfn>. The URI syntax defined in <a href="https://www.rfc-editor.org/info/rfc3986">RFC 3986 STD 66</a> (<cite>Uniform Resource Identifier
(URI): Generic Syntax</cite>) essentially restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the
English alphabet, European numerals and a small number of symbols.</p>
<p>The original reason for this was to aid in transcription and usability, both in computer systems and in non-computer communications, to
Expand Down Expand Up @@ -103,7 +103,7 @@ <h2>Basic concepts</h2>
<li>it must be possible to successfully match the string of characters in your Web address against the name of the target resource on the
file system or registry where it is stored.</li>
</ol>
<p>Various document formats and specifications already support IRIs. Examples include HTML 4.0, XML 1.0 system identifiers, the XLink <code class="kw" translate="no">href</code> attribute, XMLSchema's <code class="kw" translate="no">anyURI</code> datatype, etc. We will also see later that major browsers support
<p>Various document formats and specifications already support IRIs. Examples include HTML, XML 1.0 system identifiers, the XLink <code class="kw" translate="no">href</code> attribute, XMLSchema's <code class="kw" translate="no">anyURI</code> datatype, etc. We will also see later that major browsers support
the use of IRIs already.</p>
<p>Unfortunately, not so many protocols allow IRIs to pass through unchanged. Typically they require that the address be specified using the
ASCII characters defined for URIs. There are, however, well specified ways around this, and we will describe them briefly in this article.</p>
Expand Down Expand Up @@ -143,15 +143,15 @@ <h2>Basic concepts</h2>
<h2>Handling the domain name</h2>

<p>Domain names are allocated and managed by domain name registration organizations spread around the world.</p>
<p>A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs <a href="http://www.faqs.org/rfcs/rfc3490.html" title="Internationalizing Domain Names in Applications (IDNA)">3490</a>, <a href="http://www.faqs.org/rfcs/rfc3491.html" title="Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)">3491</a>, <a href="http://www.faqs.org/rfcs/rfc3492.html" title="Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)">3492</a> and <a href="http://www.faqs.org/rfcs/rfc3454.html" title="Preparation of Internationalized Strings ('stringprep')">3454</a>, and is based on <a href="http://www.unicode.org/">Unicode 3.2</a>. One refers to this using the term <dfn>Internationalized Domain Name</dfn> or <dfn><abbr title="Internationalized Domain Names">IDN</abbr></dfn>.</p>
<p>A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs <a href="https://www.rfc-editor.org/info/rfc3490" title="Internationalizing Domain Names in Applications (IDNA)">3490</a>, <a href="https://www.rfc-editor.org/info/rfc3491" title="Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)">3491</a>, <a href="https://www.rfc-editor.org/info/rfc3492" title="Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)">3492</a> and <a href="https://www.rfc-editor.org/info/rfc3454" title="Preparation of Internationalized Strings ('stringprep')">3454</a>, and is based on <a href="https://home.unicode.org/">Unicode 3.2</a>. One refers to this using the term <dfn>Internationalized Domain Name</dfn> or <dfn><abbr title="Internationalized Domain Names">IDN</abbr></dfn>.</p>




<section id="socially">
<h3>Domain registration</h3>

<p>The domain name registrar fixes the list of characters that people can request to be used in their country or top level domains.
<p>The domain name registrar fixes the list of characters that people can request to be used in their country or top-level domains.
However, when a person requests a domain name using these characters they are actually allocated the equivalent of the domain name using a
representation called punycode. </p>
<p><dfn>Punycode</dfn> is a way of representing Unicode codepoints using only ASCII characters.</p>
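Reviewer's illustration (not part of the commit): the Punycode paragraph above can be tried out with Python's built-in `idna` codec, which follows the IDNA 2003 rules (RFC 3490, with nameprep and Punycode) that the article describes. The helper names `to_ascii_form` and `to_unicode_form` are invented for this sketch; the sample domain is the one the article itself uses.

```python
def to_ascii_form(domain: str) -> str:
    """Return the ASCII-only form of a domain name.

    Each label containing non-ASCII characters is normalized (nameprep)
    and then encoded as Punycode with the 'xn--' prefix.
    """
    return domain.encode("idna").decode("ascii")


def to_unicode_form(ascii_domain: str) -> str:
    """Return the Unicode form of a domain name given in 'xn--' notation."""
    return ascii_domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_name = to_ascii_form("JP納豆.例.jp")
    print(ascii_name)                   # ASCII-only, safe to send to a DNS server
    print(to_unicode_form(ascii_name))  # back to Unicode (lower-cased by nameprep)
```

Note that the round trip is not byte-for-byte: nameprep case-folds the label, so `JP` comes back as `jp`.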
Expand Down Expand Up @@ -198,7 +198,7 @@ <h3>Resolving a domain name</h3>
<p>Next, the punycode is resolved by the domain name server into a numeric IP address (just like any other domain name is resolved).</p>
<p>Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for
protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name.</p>
<p>Note that most top-level country codes, for example, the <code translate="no">.jp</code> at the end of <code lang="ja">JP納豆.例.jp</code>, still has to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top level domains, such as مصر. for Egypt, and .рф for Russia.</p>
<p>Note that most top-level country codes, for example, the <code translate="no">.jp</code> at the end of <code lang="ja">JP納豆.例.jp</code>, still has to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top-level domains, such as مصر. for Egypt, and .рф for Russia.</p>
<p>In practice, it makes sense to register two names for your domain. One in your native script, and one using just regular ASCII
characters. The latter will be more memorable and easier to type for people who do not read and write your language. For example, you could
additionally register a transcription of the Japanese in Latin script, such as the following:</p>
Expand All @@ -215,7 +215,7 @@ <h2>Handling the path</h2>
<p>Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based
punycode), multi-script <dfn>path names</dfn> identify resources located on many kinds of platforms, whose file systems do and will continue to
use many different encodings. This makes the path much more difficult to handle than the domain name.</p>
<p>Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard <a href="http://www.ietf.org/rfc/rfc3987">RFC 3987</a> (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.</p>
<p>Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard <a href="https://www.rfc-editor.org/info/rfc3987">RFC 3987</a> (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.</p>



Expand All @@ -232,7 +232,7 @@ <h3>The string matching challenge</h3>
</div>
<p>Apart from the fact that this is not terribly user friendly, there is a bigger issue here. Another person may want to follow the same
link from a page that uses a Shift-JIS character encoding, rather than UTF-8. In this case, if we were to use percent-escaping to transform the (same)
characters in the address so that they to conform to the URI requirements, we would base the escapes on the bytes that represent <span lang="ja">引き割り.html</span> in
characters in the address so that they conform to the URI requirements, we would base the escapes on the bytes that represent <span lang="ja">引き割り.html</span> in
Shift-JIS. There are only two bytes per Japanese character in Shift-JIS, and they are different bytes from those used in UTF-8. So this would yield
the totally different sequence of byte escapes shown below.</p>
<div class="example">
Expand Down Expand Up @@ -293,15 +293,10 @@ <h3>Resolving a path</h3>
just passed through without change, since these characters are encoded in the same way in both ASCII and UTF-8.</p>
<p>The user agent sends the request for the page.</p>
<p>When this request hits the server, one of two things need to happen:</p>
<div class="sidenoteGroup">
<ul>
<li>if the server exposes the file names in UTF-8, the server simply accesses the resource</li>
<li>if the server uses another encoding, the server needs to convert from UTF-8.</li>
</ul>
<div class="sideinfonote">
<p class="info">Martin Dürst has written an Apache module called <a
href="http://www.w3.org/2003/06/mod_fileiri/">mod_fileiri</a> to convert requests from UTF-8 to the encoding of the server.</p>
</div>
</div>
<p>This covers the basics. There are some additional parts of the specification that deal with finer points, such as how to handle
bidirectional text in IRIs, and so on.</p>
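Reviewer's illustration (not part of the commit): a minimal sketch of the path handling discussed in the hunks above, using Python's standard `urllib.parse`. It shows that percent-escaping the same file name from UTF-8 and from Shift-JIS bytes gives different strings, and how a server whose file system is not UTF-8 could convert an incoming UTF-8 based request. The function names and the choice of Shift-JIS as the server encoding are assumptions made for the example.

```python
from urllib.parse import quote, unquote

FILE_NAME = "引き割り.html"  # file name used in the article's example


def escape_path(name: str, encoding: str) -> str:
    """Percent-escape the bytes that represent `name` in the given encoding."""
    return quote(name, encoding=encoding)


def resolve_on_server(escaped: str, filesystem_encoding: str) -> bytes:
    """Decode a UTF-8 based escaped path and re-encode it for the file system."""
    return unquote(escaped, encoding="utf-8").encode(filesystem_encoding)


utf8_escaped = escape_path(FILE_NAME, "utf-8")      # three bytes per character
sjis_escaped = escape_path(FILE_NAME, "shift_jis")  # two bytes per character

if __name__ == "__main__":
    print(utf8_escaped)
    print(sjis_escaped)
    # What a Shift-JIS file system would need in order to find the file:
    print(resolve_on_server(utf8_escaped, "shift_jis"))
```

Because the two escaped strings differ, a request built from the Shift-JIS bytes would not match a resource stored under the UTF-8 name, which is why the IRI approach always goes through UTF-8.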
Expand Down Expand Up @@ -344,16 +339,12 @@ <h2>Does it work?</h2>
<h3>Domain Name lookup</h3>

<p><a href="http://en.wikipedia.org/wiki/Internationalized_domain_name#DNS_registries_known_to_have_adopted_IDNA">Numerous domain name
authorities</a> already offer registration of internationalized domain names. These include providers for top level country domains as .cn, .jp, .kr,
etc., and global top level domains such as .info, .org and .museum.</p>
<p>Client-side support for IDN is appears in the recent versions of major browsers, including Internet Explorer 7, Firefox, Mozilla,
Netscape, Opera, and Safari. It only works in Internet Explorer 6 if you download a plug-in (Microsoft support pages provide some <a href="http://support.microsoft.com/?kbid=842848">suggestions</a>). This means that you can use IDNs in href values or the address bar, and the
authorities</a> already offer registration of internationalized domain names. These include providers for top-level country domains as .cn, .jp, .kr,
etc., and global top-level domains such as .info, .org and .museum.</p>
<p>Client-side support for IDN is appears in the recent versions of major browsers, including Chrome, Safari, Edge, and Firefox. This means that you can use IDNs in href values or the address bar, and the
browser will convert the IDN to punycode and look up the host.</p>
<p>You can run a basic check to see whether IDNs work on your system using this <a href="/International/tests/test-incubator/oldtests/sec-idn-1">simple
test</a>.</p>
<p>It has been an issue, until now, that IDN is not natively supported by Internet Explorer, with its huge market share. Although
plug-ins are available, not all people will know how to, will want to, or will be able to install them. However, IE7 or its successors, which do support IDN, will,
over time, replace most IE6 installs.</p>
<p>Note that, as a simple fallback solution until IDN is widely supported, content authors who want to point to a resource using an IDN
could write the link text in native characters, and put a punycode representation in the href attribute. This guarantees that the user would be able
to link to the resource, whatever platform they used.</p>
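Reviewer's illustration (not part of the commit): the fallback described in the paragraph above (native characters in the link text, punycode in the `href`) could be generated along these lines. `fallback_link` is a hypothetical helper, and the sketch assumes Python's built-in `idna` codec is an acceptable stand-in for whatever conversion tool an author actually uses.

```python
from html import escape


def fallback_link(scheme: str, domain: str, path: str = "/") -> str:
    """Build an anchor whose href is ASCII-only but whose text is native script."""
    ascii_host = domain.encode("idna").decode("ascii")  # 'xn--' form of each label
    href = f"{scheme}://{ascii_host}{path}"
    # escape() guards against markup-significant characters in either part.
    return f'<a href="{escape(href)}">{escape(domain)}</a>'


if __name__ == "__main__":
    print(fallback_link("http", "納豆.例.jp"))
```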
Expand All @@ -365,13 +356,8 @@ <h3>Domain Name lookup</h3>

<section id="phishing">
<h3>Domain names and phishing</h3>
<div class="sidenoteGroup">
<p>One of the problems associated with IDN support in browsers is that it can facilitate phishing through what are called 'homograph
attacks'. Consequently, most browsers that support IDN also put in place some safeguards to protect users from such fraud.</p>
<div class="sideinfonote">
<p class="info">Special thanks to Michael Monaghan and Greg Aaron for their contributions to this section.</p>
</div>
</div>
<div class="sidenoteGroup">
<p>The way browsers typically alert the user to a possible homograph attack is to display the URI in the address bar and the status
bar using punycode, rather than in the original Unicode characters. Users should therefore always check the address bar after the page has loaded, or
Expand All @@ -385,7 +371,7 @@ <h3>Domain names and phishing</h3>
</div>
<ul>
<li>
<p>Different browsers use different strategies to determine whether the URI should be shown in Unicode or punycode.</p>
<p>Different browsers use different strategies to determine whether the IRI should be shown in Unicode or punycode.</p>
</li>
<li>
<p>If an address appears as punycode, it doesn't necessarily mean that this is a bogus site – simply 'user beware'. It's up to the
Expand Down Expand Up @@ -423,7 +409,7 @@ <h3>Domain names and phishing</h3>
the user. It also uses a clickable icon at the end of the address bar to notify you when an URL contains a non-ASCII character. It also displays the
address bar in all windows.</p>
<p><b class="leadin">Firefox 2.x</b> uses a different approach. It only displays domain names in Unicode for certain
whitelisted top level domains. Firefox selects Top Level Domains (TLDs) that have established policies on the domain names they <em>allow to be
whitelisted top-level domains. Firefox selects Top-Level Domains (TLDs) that have established policies on the domain names they <em>allow to be
registered</em> and then relies on the registration process to create safe IDNs. You can find a <a href="http://www.mozilla.org/projects/security/tld-idn-policy-list.html">list of supported TLDs</a> on the Mozilla site. If an IDN is from a TLD
that is not on the list, the web address will appear in punycode form in the status and address bars. In some cases the TLD policy statements should
include rules about managing visually similar characters within the set of characters allowed.</p>
Expand Down Expand Up @@ -498,9 +484,7 @@ <h3>Paths</h3>
resource name are in the same encoding), but technically-aware users can turn on an option to support this (set network.standard-url.encode-utf8 to true in about:config).</p>
<p>Whether or not the resource is found on the server, however, is a different question. If the file system is in UTF-8, there should be no
problem. If not, and no mechanism is available to convert addresses from UTF-8 to the appropriate encoding, the request will fail.</p>
<p>Files are normally exposed as UTF-8 by servers such as IIS and Apache 2 on Windows and Mac OS X. Unix and Linux users can store file
names in UTF-8, or use the <a href="http://www.w3.org/2003/06/mod_fileiri/">mod_fileiri module</a> mentioned earlier. Version 1 of the Apache server
doesn't yet expose filenames as UTF-8.</p>
<p>Files are normally exposed as UTF-8 by servers such as Nginx, Apache 2, and IIS.</p>
<p>You can run a basic check whether it works for your client and resource using this <a href="/International/tests/test-incubator/oldtests/sec-iri-3">simple
test</a>.</p>
<p class="ednote">Note that, while the basics may work, there are other somewhat more complicated aspects of IRI support, such as
Expand Down Expand Up @@ -530,18 +514,11 @@ <h2>Further reading</h2>

<ul id="full-links">
<li>
<p><a href="http://idnsearch.net/domains/index/1">Examples of registered IDNs</a></p>
</li>
<li>
<p><a href="http://download.microsoft.com/download/a/6/0/a60decbd-9044-42f1-b9c5-1c90c7a5a8ce/a6.pdf"><cite>IDN and URI</cite> [PDF]</a>, Michel
Suignard</p>
</li>
<li>
<p><a href="http://www.ietf.org/rfc/rfc3987"><cite>RFC 3987 Internationalized Resource Identifiers (IRIs)</cite></a>, IETF Proposed Standard,
<p><a href="https://www.rfc-editor.org/info/rfc3987"><cite>RFC 3987 Internationalized Resource Identifiers (IRIs)</cite></a>, IETF Proposed Standard,
Martin Dürst, Michel Suignard</p>
</li>
<li>
<p><a href="http://www.ietf.org/rfc/rfc3986"><cite>RFC 3986 STD 66 Uniform Resource Identifier (URI): Generic Syntax</cite></a>, IETF Standard, T.
<p><a href="https://www.rfc-editor.org/info/rfc3986"><cite>RFC 3986 STD 66 Uniform Resource Identifier (URI): Generic Syntax</cite></a>, IETF Standard, T.
Berners-Lee, R. Fielding, L. Masinter</p>
</li>
<li>
Expand All @@ -551,17 +528,17 @@ <h2>Further reading</h2>
<p><a href="http://www.icann.org/announcements/announcement-31oct06.htm">IDNA Protocol Review and Proposals for Changes</a></p>
</li>
<li>
<p><a href="http://www.ietf.org/rfc/rfc4690.txt"><cite>RFC 4690: Review and Recommendations for Internationalized Domain Names</cite></a> Issues
<p><a href="https://www.rfc-editor.org/info/rfc4690"><cite>RFC 4690: Review and Recommendations for Internationalized Domain Names</cite></a> Issues
related to language specific character issues where the same script is used across different language, issues related to cases where languages can be
expressed by using more than one script, bi-directional cases, and the topic of visually confusing characters.</p>
</li>
<li>
<p><a href="http://www.icann.org/general/idn-guidelines-22feb06.htm"><cite>ICANN Guidelines for the Implementation of Internationalized
Domain Names Version 2.1</cite></a> The Guidelines apply directly to the gTLD registries, and are intended to be suitable for implementation in other
<p><a href="https://www.icann.org/resources/pages/idn-guidelines-2011-09-02-en"><cite>ICANN Guidelines for the Implementation of Internationalized
Domain Names Version 3.0</cite></a> The Guidelines apply directly to the gTLD and ccTLD registries, and are intended to be suitable for implementation in other
registries on the second and lower levels.</p>
</li>
<li>
<p><a href="/International/tests/#other">IDN and IRI test pages</a></p>
<p><a href="/International/i18n-tests/#other">IDN and IRI test pages</a></p>
</li>
<li>
<p><a href="http://www.w3.org/2003/06/mod_fileiri/">Martin Dürst's fileiri Apache module</a></p>
Expand Down
