Update the multilingual web addresses article

Partial fix for #564.
w3c · Dec 8, 2023
1 parent b69637e
commit 1051670
Showing 1 changed file with 22 additions and 45 deletions.
diff --git a/articles/idn-and-iri/index.en.html b/articles/idn-and-iri/index.en.html
@@ -15,13 +15,13 @@
 f.modifiers = '' // people making substantive changes, and their affiliation
 f.searchString = 'article-idn-and-iri' // blog search string - usually the filename without extensions
 f.firstPubDate = '2005-01-14' // date of the first publication of the document (after review)
-f.lastSubstUpdate = { date:'2008-05-09', time:'15:32'}  // date and time of latest substantive changes to this document
+f.lastSubstUpdate = { date:'2023-12-08', time:'15:32'}  // date and time of latest substantive changes to this document
 f.status = 'published'  // should be one of draft, review, published, or notreviewed
 f.path = '../../' // what you need to prepend to a URL to get to the /International directory 
 
 // AUTHORS AND TRANSLATORS should fill in these assignments:
-f.thisVersion = { date:'2022-05-25', time:'11:00'} // date and time of latest edits to this document/translation
-f.contributors = '' // people providing useful contributions or feedback during review or at other times
+f.thisVersion = { date:'2023-12-08', time:'11:00'} // date and time of latest edits to this document/translation
+f.contributors = 'Michael Monaghan, Greg Aaron' // people providing useful contributions or feedback during review or at other times
 
 // TRANSLATORS should fill in these assignments:
 f.translators = '' // translator(s) and their affiliation - a elements allowed, but use double quotes for attributes
@@ -65,7 +65,7 @@ <h1>An Introduction to Multilingual Web Addresses</h1>
 <section id="why">
 <h2>Why multilingual Web addresses?</h2>
 
-<p>Web addresses are typically expressed using <dfn>Uniform Resource Identifiers</dfn> or <dfn>URIs</dfn>. The URI syntax defined in <a href="http://www.ietf.org/rfc/rfc3986">RFC 3986 STD 66</a> (<cite>Uniform Resource Identifier
+<p>Web addresses are typically expressed using <dfn>Uniform Resource Identifiers</dfn> or <dfn>URIs</dfn>. The URI syntax defined in <a href="https://www.rfc-editor.org/info/rfc3986">RFC 3986 STD 66</a> (<cite>Uniform Resource Identifier
 (URI): Generic Syntax</cite>) essentially restricts Web addresses to a small number of characters: basically, just upper and lower case letters of the
 English alphabet, European numerals and a small number of symbols.</p>
 <p>The original reason for this was to aid in transcription and usability, both in computer systems and in non-computer communications, to
@@ -103,7 +103,7 @@ <h2>Basic concepts</h2>
 <li>it must be possible to successfully match the string of characters in your Web address against the name of the target resource on the
 file system or registry where it is stored.</li>
 </ol>
-<p>Various document formats and specifications already support IRIs. Examples include HTML 4.0, XML 1.0 system identifiers, the XLink <code class="kw" translate="no">href</code> attribute, XMLSchema's <code class="kw" translate="no">anyURI</code> datatype, etc. We will also see later that major browsers support
+<p>Various document formats and specifications already support IRIs. Examples include HTML, XML 1.0 system identifiers, the XLink <code class="kw" translate="no">href</code> attribute, XMLSchema's <code class="kw" translate="no">anyURI</code> datatype, etc. We will also see later that major browsers support
 the use of IRIs already.</p>
 <p>Unfortunately, not so many protocols allow IRIs to pass through unchanged. Typically they require that the address be specified using the
 ASCII characters defined for URIs. There are, however, well specified ways around this, and we will describe them briefly in this article.</p>
@@ -143,15 +143,15 @@ <h2>Basic concepts</h2>
 <h2>Handling the domain name</h2>
 
 <p>Domain names are allocated and managed by domain name registration organizations spread around the world.</p>
-<p>A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs <a href="http://www.faqs.org/rfcs/rfc3490.html" title="Internationalizing Domain Names in Applications (IDNA)">3490</a>, <a href="http://www.faqs.org/rfcs/rfc3491.html" title="Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)">3491</a>, <a href="http://www.faqs.org/rfcs/rfc3492.html" title="Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)">3492</a> and <a href="http://www.faqs.org/rfcs/rfc3454.html" title="Preparation of Internationalized Strings ('stringprep')">3454</a>, and is based on <a href="http://www.unicode.org/">Unicode 3.2</a>. One refers to this using the term <dfn>Internationalized Domain Name</dfn> or <dfn><abbr title="Internationalized Domain Names">IDN</abbr></dfn>.</p>
+<p>A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs <a href="https://www.rfc-editor.org/info/rfc3490" title="Internationalizing Domain Names in Applications (IDNA)">3490</a>, <a href="https://www.rfc-editor.org/info/rfc3491" title="Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)">3491</a>, <a href="https://www.rfc-editor.org/info/rfc3492" title="Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)">3492</a> and <a href="https://www.rfc-editor.org/info/rfc3454" title="Preparation of Internationalized Strings ('stringprep')">3454</a>, and is based on <a href="https://home.unicode.org/">Unicode 3.2</a>. One refers to this using the term <dfn>Internationalized Domain Name</dfn> or <dfn><abbr title="Internationalized Domain Names">IDN</abbr></dfn>.</p>
 
 
 
 
 <section id="socially">
 <h3>Domain registration</h3>
 
-<p>The domain name registrar fixes the list of characters that people can request to be used in their country or top level domains.
+<p>The domain name registrar fixes the list of characters that people can request to be used in their country or top-level domains.
 However, when a person requests a domain name using these characters they are actually allocated the equivalent of the domain name using a
 representation called punycode. </p>
 <p><dfn>Punycode</dfn> is a way of representing Unicode codepoints using only ASCII characters.</p>
@@ -198,7 +198,7 @@ <h3>Resolving a domain name</h3>
 <p>Next, the punycode is resolved by the domain name server into a numeric IP address (just like any other domain name is resolved).</p>
 <p>Finally the user agent sends the request for the page. Since punycode contains no characters outside those normally allowed for
 protocols such as HTTP, there is no issue with the transmission of the address. This should simply match against a registered domain name.</p>
-<p>Note that most top-level country codes, for example, the <code translate="no">.jp</code> at the end of <code lang="ja">JP納豆.例.jp</code>, still has to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top level domains, such as مصر. for Egypt, and .рф for Russia.</p>
+<p>Note that most top-level country codes, for example, the <code translate="no">.jp</code> at the end of <code lang="ja">JP納豆.例.jp</code>, still have to be in Latin characters at the moment. Since 2010, however, IANA has been progressively introducing internationalized country code top-level domains, such as مصر. for Egypt, and .рф for Russia.</p>
 <p>In practice, it makes sense to register two names for your domain. One in your native script, and one using just regular ASCII
 characters. The latter will be more memorable and easier to type for people who do not read and write your language. For example, you could
 additionally register a transcription of the Japanese in Latin script, such as the following:</p>
@@ -215,7 +215,7 @@ <h2>Handling the path</h2>
 <p>Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based
 punycode), multi-script <dfn>path names</dfn> identify resources located on many kinds of platforms, whose file systems do and will continue to
 use many different encodings. This makes the path much more difficult to handle than the domain name.</p>
-<p>Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard <a href="http://www.ietf.org/rfc/rfc3987">RFC 3987</a> (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.</p>
+<p>Having dealt with the domain name using punycode, we now need to deal with the path part of an IRI. The IETF Proposed Standard <a href="https://www.rfc-editor.org/info/rfc3987">RFC 3987</a> (Internationalized Resource Identifiers (IRIs)) defines how to deal with this.</p>
 
 
 
@@ -232,7 +232,7 @@ <h3>The string matching challenge</h3>
 </div>
 <p>Apart from the fact that this is not terribly user friendly, there is a bigger issue here. Another person may want to follow the same
 link from a page that uses a Shift-JIS character encoding, rather than UTF-8. In this case, if we were to use percent-escaping to transform the (same)
-characters in the address so that they to conform to the URI requirements, we would base the escapes on the bytes that represent <span lang="ja">引き割り.html</span> in
+characters in the address so that they conform to the URI requirements, we would base the escapes on the bytes that represent <span lang="ja">引き割り.html</span> in
 Shift-JIS. There are only two bytes per Japanese character in Shift-JIS, and they are different bytes from those used in UTF-8. So this would yield
 the totally different sequence of byte escapes shown below.</p>
 <div class="example">
@@ -293,15 +293,10 @@ <h3>Resolving a path</h3>
 just passed through without change, since these characters are encoded in the same way in both ASCII and UTF-8.</p>
 <p>The user agent sends the request for the page.</p>
 <p>When this request hits the server, one of two things need to happen:</p>
-<div class="sidenoteGroup">
 <ul>
 <li>if the server exposes the file names in UTF-8, the server simply accesses the resource</li>
 <li>if the server uses another encoding, the server needs to convert from UTF-8.</li>
 </ul>
-<div class="sideinfonote">
-<p class="info">Martin Dürst has written an Apache module called <a
-href="http://www.w3.org/2003/06/mod_fileiri/">mod_fileiri</a> to convert requests from UTF-8 to the encoding of the server.</p>
-</div>
 </div>
 <p>This covers the basics. There are some additional parts of the specification that deal with finer points, such as how to handle
 bidirectional text in IRIs, and so on.</p>
@@ -344,16 +339,12 @@ <h2>Does it work?</h2>
 <h3>Domain Name lookup</h3>
 
 <p><a href="http://en.wikipedia.org/wiki/Internationalized_domain_name#DNS_registries_known_to_have_adopted_IDNA">Numerous domain name
-authorities</a> already offer registration of internationalized domain names. These include providers for top level country domains as .cn, .jp, .kr,
-etc., and global top level domains such as .info, .org and .museum.</p>
-<p>Client-side support for IDN is appears in the recent versions of major browsers, including Internet Explorer 7, Firefox, Mozilla,
-Netscape, Opera, and Safari. It only works in Internet Explorer 6 if you download a plug-in (Microsoft support pages provide some <a href="http://support.microsoft.com/?kbid=842848">suggestions</a>). This means that you can use IDNs in href values or the address bar, and the
+authorities</a> already offer registration of internationalized domain names. These include providers for top-level country domains such as .cn, .jp, .kr,
+etc., and global top-level domains such as .info, .org and .museum.</p>
+<p>Client-side support for IDN appears in recent versions of major browsers, including Chrome, Safari, Edge, and Firefox. This means that you can use IDNs in href values or the address bar, and the
 browser will convert the IDN to punycode and look up the host.</p>
 <p>You can run a basic check to see whether IDNs work on your system using this <a href="/International/tests/test-incubator/oldtests/sec-idn-1">simple
 test</a>.</p>
-<p>It has been an issue, until now, that IDN is not natively supported by Internet Explorer, with its huge market share. Although
-plug-ins are available, not all people will know how to, will want to, or will be able to install them. However, IE7 or its successors, which do support IDN, will,
-over time, replace most IE6 installs.</p>
 <p>Note that, as a simple fallback solution until IDN is widely supported, content authors who want to point to a resource using an IDN
 could write the link text in native characters, and put a punycode representation in the href attribute. This guarantees that the user would be able
 to link to the resource, whatever platform they used.</p>
@@ -365,13 +356,8 @@ <h3>Domain Name lookup</h3>
 
 <section id="phishing">
 <h3>Domain names and phishing</h3>
-<div class="sidenoteGroup">
 <p>One of the problems associated with IDN support in browsers is that it can facilitate phishing through what are called 'homograph
 attacks'. Consequently, most browsers that support IDN also put in place some safeguards to protect users from such fraud.</p>
-<div class="sideinfonote">
-<p class="info">Special thanks to Michael Monaghan and Greg Aaron for their contributions to this section.</p>
-</div>
-</div>
 <div class="sidenoteGroup">
 <p>The way browsers typically alert the user to a possible homograph attack is to display the URI in the address bar and the status
 bar using punycode, rather than in the original Unicode characters. Users should therefore always check the address bar after the page has loaded, or
@@ -385,7 +371,7 @@ <h3>Domain names and phishing</h3>
 </div>
 <ul>
 <li>
-<p>Different browsers use different strategies to determine whether the URI should be shown in Unicode or punycode.</p>
+<p>Different browsers use different strategies to determine whether the IRI should be shown in Unicode or punycode.</p>
 </li>
 <li>
 <p>If an address appears as punycode, it doesn't necessarily mean that this is a bogus site – simply 'user beware'. It's up to the
@@ -423,7 +409,7 @@ <h3>Domain names and phishing</h3>
 the user. It also uses a clickable icon at the end of the address bar to notify you when an URL contains a non-ASCII character. It also displays the
 address bar in all windows.</p>
 <p><b class="leadin">Firefox 2.x</b> uses a different approach. It only displays domain names in Unicode for certain
-whitelisted top level domains. Firefox selects Top Level Domains (TLDs) that have established policies on the domain names they <em>allow to be
+whitelisted top-level domains. Firefox selects Top-Level Domains (TLDs) that have established policies on the domain names they <em>allow to be
 registered</em> and then relies on the registration process to create safe IDNs. You can find a <a href="http://www.mozilla.org/projects/security/tld-idn-policy-list.html">list of supported TLDs</a> on the Mozilla site. If an IDN is from a TLD
 that is not on the list, the web address will appear in punycode form in the status and address bars. In some cases the TLD policy statements should
 include rules about managing visually similar characters within the set of characters allowed.</p>
@@ -498,9 +484,7 @@ <h3>Paths</h3>
 resource name are in the same encoding), but technically-aware users can turn on an option  to support this (set network.standard-url.encode-utf8 to true in about:config).</p>
 <p>Whether or not the resource is found on the server, however, is a different question. If the file system is in UTF-8, there should be no
 problem. If not, and no mechanism is available to convert addresses from UTF-8 to the appropriate encoding, the request will fail.</p>
-<p>Files are normally exposed as UTF-8 by servers such as IIS and Apache 2 on Windows and Mac OS X. Unix and Linux users can store file
-names in UTF-8, or use the <a href="http://www.w3.org/2003/06/mod_fileiri/">mod_fileiri module</a> mentioned earlier. Version 1 of the Apache server
-doesn't yet expose filenames as UTF-8.</p>
+<p>Files are normally exposed as UTF-8 by servers such as Nginx, Apache 2, and IIS.</p>
 <p>You can run a basic check whether it works for your client and resource using this <a href="/International/tests/test-incubator/oldtests/sec-iri-3">simple
 test</a>.</p>
 <p class="ednote">Note that, while the basics may work, there are other somewhat more complicated aspects of IRI support, such as
@@ -530,18 +514,11 @@ <h2>Further reading</h2>
 
 <ul id="full-links">
 <li>
-<p><a href="http://idnsearch.net/domains/index/1">Examples of registered IDNs</a></p>
-</li>
-<li>
-<p><a href="http://download.microsoft.com/download/a/6/0/a60decbd-9044-42f1-b9c5-1c90c7a5a8ce/a6.pdf"><cite>IDN and URI</cite> [PDF]</a>, Michel
-Suignard</p>
-</li>
-<li>
-<p><a href="http://www.ietf.org/rfc/rfc3987"><cite>RFC 3987 Internationalized Resource Identifiers (IRIs)</cite></a>, IETF Proposed Standard,
+<p><a href="https://www.rfc-editor.org/info/rfc3987"><cite>RFC 3987 Internationalized Resource Identifiers (IRIs)</cite></a>, IETF Proposed Standard,
 Martin Dürst, Michel Suignard</p>
 </li>
 <li>
-<p><a href="http://www.ietf.org/rfc/rfc3986"><cite>RFC 3986 STD 66 Uniform Resource Identifier (URI): Generic Syntax</cite></a>, IETF Standard, T.
+<p><a href="https://www.rfc-editor.org/info/rfc3986"><cite>RFC 3986 STD 66 Uniform Resource Identifier (URI): Generic Syntax</cite></a>, IETF Standard, T.
 Berners-Lee, R. Fielding, L. Masinter</p>
 </li>
 <li>
@@ -551,17 +528,17 @@ <h2>Further reading</h2>
 <p><a href="http://www.icann.org/announcements/announcement-31oct06.htm">IDNA Protocol Review and Proposals for Changes</a></p>
 </li>
 <li>
-<p><a href="http://www.ietf.org/rfc/rfc4690.txt"><cite>RFC 4690: Review and Recommendations for Internationalized Domain Names</cite></a> Issues
+<p><a href="https://www.rfc-editor.org/info/rfc4690"><cite>RFC 4690: Review and Recommendations for Internationalized Domain Names</cite></a> Issues
 related to language specific character issues where the same script is used across different language, issues related to cases where languages can be
 expressed by using more than one script, bi-directional cases, and the topic of visually confusing characters.</p>
 </li>
 <li>
-<p><a href="http://www.icann.org/general/idn-guidelines-22feb06.htm"><cite>ICANN Guidelines for the Implementation of Internationalized
-Domain Names Version 2.1</cite></a> The Guidelines apply directly to the gTLD registries, and are intended to be suitable for implementation in other
+<p><a href="https://www.icann.org/resources/pages/idn-guidelines-2011-09-02-en"><cite>ICANN Guidelines for the Implementation of Internationalized
+Domain Names Version 3.0</cite></a> The Guidelines apply directly to the gTLD and ccTLD registries, and are intended to be suitable for implementation in other
 registries on the second and lower levels.</p>
 </li>
 <li>
-<p><a href="/International/tests/#other">IDN and IRI test pages</a></p>
+<p><a href="/International/i18n-tests/#other">IDN and IRI test pages</a></p>
 </li>
 <li>
 <p><a href="http://www.w3.org/2003/06/mod_fileiri/">Martin Dürst's fileiri Apache module</a></p>