IDNA #53

Open
annevk opened this Issue Jul 30, 2015 · 14 comments

@annevk
Member
annevk commented Jul 30, 2015

This issue tracks faults in http://www.unicode.org/reports/tr46/ since Unicode doesn't really do that well. If you find an issue, use http://www.unicode.org/reporting.html to report it and then report back here.

@SimonSapin
Contributor

I’ve just submitted the following to http://www.unicode.org/reporting.html. I’ll update this when I get a response.

Subject: xn-- prefix never added in UTS # 46

In http://www.unicode.org/reports/tr46/tr46-15.html#ProcessingStepConvertValidate , the algorithm looks for an xn-- prefix and, when it is present, decodes the rest of the label per Punycode.

In http://www.unicode.org/reports/tr46/tr46-15.html#ToASCII however, the xn-- prefix is never added:

Convert each label with non-ASCII characters into Punycode [RFC3492]. This may record an error.

This should probably be replaced with something like:

For each label with non-ASCII characters, replace the label with “xn--” followed by the encoding of the label according to Punycode [RFC3492]. This may record an error.
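To illustrate the asymmetry described in this report, here is a minimal Python sketch. It assumes each label has already been mapped and validated per UTS #46, and it uses Python's built-in `punycode` codec for the raw RFC 3492 encoding. The function names `to_ascii_label` and `to_unicode_label` are hypothetical and only cover the prefix handling, not the full ToASCII or ToUnicode operations.

```python
ACE_PREFIX = "xn--"  # the prefix the report says ToASCII never adds


def to_ascii_label(label: str) -> str:
    """Encode one label; assumes mapping and validation already happened."""
    if label.isascii():
        # Labels without non-ASCII characters are left untouched.
        return label
    # The raw Punycode output has no prefix, so it must be added explicitly.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode one label; mirrors the processing step that looks for xn--."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


print(to_ascii_label("bücher"))           # xn--bcher-kva
print(to_unicode_label("xn--bcher-kva"))  # bücher
```

Without the explicit prefix, the encoded label would be `bcher-kva`, which the decoding step would treat as an ordinary ASCII label, so the two operations would not round-trip.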

@SimonSapin SimonSapin referenced this issue in servo/rust-url Jul 30, 2015
Merged

IDNA support #119

@Sebmaster

My report to Unicode from some time ago, which does not seem to have been fixed yet:

The Format section (8.1) under Conformance Testing in UTS46 is confusing.

The explanation of toASCII and toUnicode says to use the provided processing_option for toUnicode and to always use nontransitional for toASCII.
However, the implementation section for toUnicode (4.3) says to always call the processing step with nontransitional, while the toASCII parameter list does provide a processing_option.

It looks to me as if the descriptions for toASCII and toUnicode in the conformance testing section got mixed up. This also applies to the descriptions in the header of IdnaTest.txt.

The other thing is that there's only a single IdnaTest file, with no explanation of which algorithm it applies to. Is it for IDNA2008, IDNA2003 or UTS46? It seems to be categorized according to the Unicode standard instead of the IDNA reference, which makes this really confusing. I haven't reported that one yet, though.

@SimonSapin
Contributor

@Sebmaster regarding the other thing, http://www.unicode.org/reports/tr46/#Conformance_Testing explains how to "test for conformance to UTS46" using IdnaTest.txt.

@Sebmaster

@SimonSapin I'm not sure that's totally correct either since:

Bn for Bidi Rule #n from Section 2. The Bidi Rule, in Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) [IDNA2008]
Cn for ContextJ tests in, Appendix A.n in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008]. Thus C1 = Appendix A.1. ZERO WIDTH NON-JOINER, and C2 = Appendix A.2. ZERO WIDTH JOINER. The CONTEXTO tests are optional for client software, and not tested here.

are not described in TR46 at all. They're imported from the IDNA2008 standard, which has no relevance in the TR46 spec... I think 😕

@Sebmaster

Got a mail today from Unicode (regarding conformance test description):

This was discussed at the UTC meeting in July, and has been forwarded to the author of the UTS for consideration in a subsequent version.

So that's pretty sweet.

@jcranmer

Oh yeah, I came back to this and recall that IdnaTest.txt is really bad at telling you how to process it.

@Sebmaster:
The ToASCII column uses nontransitional processing (read IdnaTest.txt's commented header) and UseSTD3ASCIIRules=true (see §8 of the input). However, they definitely appear to have some extra rules not described in their algorithm (for example, ToUnicode should never produce an [A4_1] or [A4_2] error, since those are specific to the ToASCII regime and ToUnicode never calls ToASCII, yet you can clearly see for yourself that they do).

@SimonSapin
Contributor

I got a response to #53 (comment):

This has been added to the feedback document for next week's meeting.

@SimonSapin
Contributor

… and today:

I was directed by the UTC to let you know that this has been sent to the editor for review during the next update cycle.

@valenting

As per servo/rust-url#160 I submitted feedback regarding Validation rule no. 2 - "2. The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions."
This isn't enforced by all UAs: YouTube uses domains such as https://r3---sn-2gb7ln7k.googlevideo.com/videoplayback?... and this domain breaks that rule.
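For illustration, the quoted validation rule can be sketched in Python as a check on the third and fourth characters of a label. The helper name `has_hyphens_in_positions_3_and_4` is hypothetical, and splitting the host on `.` is a simplification of how UTS #46 breaks a domain into labels.

```python
def has_hyphens_in_positions_3_and_4(label: str) -> bool:
    """True if the label has U+002D in both the third and fourth positions."""
    # Positions are 1-based in the rule, so they map to indices 2 and 3.
    return label[2:4] == "--"


host = "r3---sn-2gb7ln7k.googlevideo.com"
flagged = [label for label in host.split(".")
           if has_hyphens_in_positions_3_and_4(label)]
print(flagged)  # ['r3---sn-2gb7ln7k']
```

A strict implementation of the rule would reject this host, which is why the feedback asks for it to be relaxed.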

@domenic domenic referenced this issue in jsdom/whatwg-url Apr 15, 2016
Open

Bug in parsing URLs #50

@srl295
srl295 commented May 12, 2016 edited

@valenting Your feedback is tracked as part of PRI317 http://www.unicode.org/review/pri317/ (being discussed now).

By the way @SimonSapin I'd think the right way to track is via UTC agenda items http://www.unicode.org/L2/L-curdoc.htm

@Sebmaster

It seems like Unicode has closed that ticket without removing the -- validity requirement 😞

Does anybody have the ability to look into the Unicode ... process to see what's going on there?

@annevk annevk added the parser label Dec 20, 2016
@domenic
Member
domenic commented Jan 6, 2017 edited

The validation rule problem mentioned at #53 (comment) doesn't seem to have made it into https://docs.google.com/document/d/11PEww2N0PbXyPhbsCdW_PjD3BNgZMy5XHUv02SSXNqY/edit#heading=h.p7mmdt3ofe3 by my reading. What's the latest? Edit: never mind, it's item A.

@bagder bagder referenced this issue Jan 30, 2017
Closed

IDNA2008 #223

@annevk annevk added the idna label Feb 10, 2017
@annevk
Member
annevk commented Feb 13, 2017

Going forward, rather than tracking all UTS 46 feedback here, I suggest we just create new issues against this repository, so we can discuss each problem in isolation. I created an idna label that we can use to group them all.
