New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDNA #53

Closed
annevk opened this Issue Jul 30, 2015 · 26 comments

Comments

8 participants
@annevk
Member

annevk commented Jul 30, 2015

This issue tracks faults in http://www.unicode.org/reports/tr46/ since Unicode doesn't really do that well. If you find an issue, use http://www.unicode.org/reporting.html to report it and then report back here.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Jul 30, 2015

Contributor

I’ve just submitted the following to http://www.unicode.org/reporting.html. I’ll update this when I get a response.

Subject: xn-- prefix never added in UTS # 46

In http://www.unicode.org/reports/tr46/tr46-15.html#ProcessingStepConvertValidate , the algorithm looks for a xn-- prefix and decodes the rest of the label per Punycode when it is present.

In http://www.unicode.org/reports/tr46/tr46-15.html#ToASCII however, the xn-- prefix is never added:

Convert each label with non-ASCII characters into Punycode [RFC3492]. This may record an error.

This should probably be replaced with something like:

For each label with non-ASCII characters, replace the label with “xn--” followed by the encoding of the label according to Punycode [RFC3492]. This may record an error.

Contributor

SimonSapin commented Jul 30, 2015

I’ve just submitted the following to http://www.unicode.org/reporting.html. I’ll update this when I get a response.

Subject: xn-- prefix never added in UTS # 46

In http://www.unicode.org/reports/tr46/tr46-15.html#ProcessingStepConvertValidate , the algorithm looks for a xn-- prefix and decodes the rest of the label per Punycode when it is present.

In http://www.unicode.org/reports/tr46/tr46-15.html#ToASCII however, the xn-- prefix is never added:

Convert each label with non-ASCII characters into Punycode [RFC3492]. This may record an error.

This should probably be replaced with something like:

For each label with non-ASCII characters, replace the label with “xn--” followed by the encoding of the label according to Punycode [RFC3492]. This may record an error.

@SimonSapin SimonSapin referenced this issue Jul 30, 2015

Merged

IDNA support #119

@Sebmaster

This comment has been minimized.

Show comment
Hide comment
@Sebmaster

Sebmaster Aug 16, 2015

My report to Unicode from some time ago which seems to not be fixed yet:

The Format section (8.1) under Conformance Testing in UTS46 is confusing.

The explanation for the toASCII and toUnicode explains to use the provided processing_option for toUnicode, and always use nontransitional for toASCII.
However, in the implementation section of toUnicode (4.3), it explains to always call the processing step with nontransitional. The toASCII parameter list provides a processing_option, though.

It looks to me, as if the descriptions for toASCII and toUnicode in the conformance testing section got mixed up. This also applies to the descriptions in the header of IdnaTest.txt.

The other thing is that there's only a single IdnaTest file, but there's no explanation to which algorithm it applies. Is it for IDNA2008, IDNA2003 or UTS46? It seems to be categorized according to Unicode standard instead of IDNA reference, which makes this really confusing. Haven't reported that one yet though.

Sebmaster commented Aug 16, 2015

My report to Unicode from some time ago which seems to not be fixed yet:

The Format section (8.1) under Conformance Testing in UTS46 is confusing.

The explanation for the toASCII and toUnicode explains to use the provided processing_option for toUnicode, and always use nontransitional for toASCII.
However, in the implementation section of toUnicode (4.3), it explains to always call the processing step with nontransitional. The toASCII parameter list provides a processing_option, though.

It looks to me, as if the descriptions for toASCII and toUnicode in the conformance testing section got mixed up. This also applies to the descriptions in the header of IdnaTest.txt.

The other thing is that there's only a single IdnaTest file, but there's no explanation to which algorithm it applies. Is it for IDNA2008, IDNA2003 or UTS46? It seems to be categorized according to Unicode standard instead of IDNA reference, which makes this really confusing. Haven't reported that one yet though.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Aug 17, 2015

Contributor

@Sebmaster regarding the other thing, http://www.unicode.org/reports/tr46/#Conformance_Testing explains how "To test for conformance to UTS46" using IdnaTest.txt.

Contributor

SimonSapin commented Aug 17, 2015

@Sebmaster regarding the other thing, http://www.unicode.org/reports/tr46/#Conformance_Testing explains how "To test for conformance to UTS46" using IdnaTest.txt.

@Sebmaster

This comment has been minimized.

Show comment
Hide comment
@Sebmaster

Sebmaster Aug 17, 2015

@SimonSapin I'm not sure that's totally correct either since:

Bn for Bidi Rule #n from Section 2. The Bidi Rule, in Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) [IDNA2008]
Cn for ContextJ tests in, Appendix A.n in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008]. Thus C1 = Appendix A.1. ZERO WIDTH NON-JOINER, and C2 = Appendix A.2. ZERO WIDTH JOINER. The CONTEXTO tests are optional for client software, and not tested here.

is not described in TR46 at all. It's imported from the IDNA2008 standard, which has no relevance in the TR46 spec... I think 😕

Sebmaster commented Aug 17, 2015

@SimonSapin I'm not sure that's totally correct either since:

Bn for Bidi Rule #n from Section 2. The Bidi Rule, in Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA) [IDNA2008]
Cn for ContextJ tests in, Appendix A.n in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008]. Thus C1 = Appendix A.1. ZERO WIDTH NON-JOINER, and C2 = Appendix A.2. ZERO WIDTH JOINER. The CONTEXTO tests are optional for client software, and not tested here.

is not described in TR46 at all. It's imported from the IDNA2008 standard, which has no relevance in the TR46 spec... I think 😕

@Sebmaster

This comment has been minimized.

Show comment
Hide comment
@Sebmaster

Sebmaster Aug 18, 2015

Got a mail today from Unicode (regarding conformance test description):

This was discussed at the UTC meeting in July, and has been forwarded to the author of the UTS for consideration in a subsequent version.

So that's pretty sweet.

Sebmaster commented Aug 18, 2015

Got a mail today from Unicode (regarding conformance test description):

This was discussed at the UTC meeting in July, and has been forwarded to the author of the UTS for consideration in a subsequent version.

So that's pretty sweet.

@jcranmer

This comment has been minimized.

Show comment
Hide comment
@jcranmer

jcranmer Sep 15, 2015

Oh yeah, I came back into this and recall that the IdnaTest.txt is really bad at telling you how to process it.

@Sebmaster:
The ToASCII column uses nontransitional processing (read IdnaTest.txt's commented header) and UseSTD3ASCIIRules=true (see §8 of the input). However, they definitely appear to have some extra rules not described in their algorithm (for example, ToUnicode should never produce an [A4_1] or [A4_2] error, since those are specific to the ToASCII regime and ToUnicode never calls ToASCII, yet you can clearly see for yourself that they do).

jcranmer commented Sep 15, 2015

Oh yeah, I came back into this and recall that the IdnaTest.txt is really bad at telling you how to process it.

@Sebmaster:
The ToASCII column uses nontransitional processing (read IdnaTest.txt's commented header) and UseSTD3ASCIIRules=true (see §8 of the input). However, they definitely appear to have some extra rules not described in their algorithm (for example, ToUnicode should never produce an [A4_1] or [A4_2] error, since those are specific to the ToASCII regime and ToUnicode never calls ToASCII, yet you can clearly see for yourself that they do).

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Nov 1, 2015

Contributor

I got a response to #53 (comment):

This has been added to the feedback document for next week's meeting.

Contributor

SimonSapin commented Nov 1, 2015

I got a response to #53 (comment):

This has been added to the feedback document for next week's meeting.

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Nov 26, 2015

Contributor

… and today:

I was directed by the UTC to let you know that this has been sent to the editor for review during the next update cycle.

Contributor

SimonSapin commented Nov 26, 2015

… and today:

I was directed by the UTC to let you know that this has been sent to the editor for review during the next update cycle.

@valenting

This comment has been minimized.

Show comment
Hide comment
@valenting

valenting Feb 8, 2016

As per servo/rust-url#160 I submitted feedback regarding Validation rule no. 2 - "2. The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions."
This isn't being enforced by all UAs, as it's being used on youtube which uses domains such as https://r3---sn-2gb7ln7k.googlevideo.com/videoplayback?... This domain breaks that rule.

valenting commented Feb 8, 2016

As per servo/rust-url#160 I submitted feedback regarding Validation rule no. 2 - "2. The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions."
This isn't being enforced by all UAs, as it's being used on youtube which uses domains such as https://r3---sn-2gb7ln7k.googlevideo.com/videoplayback?... This domain breaks that rule.

@srl295

This comment has been minimized.

Show comment
Hide comment
@srl295

srl295 May 12, 2016

@valenting Your feedback is tracked as part of PRI317 http://www.unicode.org/review/pri317/ (being discussed now).

By the way @SimonSapin I'd think the right way to track is via UTC agenda items http://www.unicode.org/L2/L-curdoc.htm

srl295 commented May 12, 2016

@valenting Your feedback is tracked as part of PRI317 http://www.unicode.org/review/pri317/ (being discussed now).

By the way @SimonSapin I'd think the right way to track is via UTC agenda items http://www.unicode.org/L2/L-curdoc.htm

@Sebmaster

This comment has been minimized.

Show comment
Hide comment
@Sebmaster

Sebmaster Nov 20, 2016

It seems like Unicode has closed that ticket without removing the -- validity requirement 😞

Does anybody have the ability to look into the Unicode ... process to see what's going on there?

Sebmaster commented Nov 20, 2016

It seems like Unicode has closed that ticket without removing the -- validity requirement 😞

Does anybody have the ability to look into the Unicode ... process to see what's going on there?

@annevk

This comment has been minimized.

Show comment
Hide comment

@annevk annevk added the parser label Dec 20, 2016

@domenic

This comment has been minimized.

Show comment
Hide comment
@domenic

domenic Jan 6, 2017

Member

The validation rule problem mentioned at #53 (comment) doesn't seem to have made it into https://docs.google.com/document/d/11PEww2N0PbXyPhbsCdW_PjD3BNgZMy5XHUv02SSXNqY/edit#heading=h.p7mmdt3ofe3 by my reading. What's the latest? It's item A, nevermind

Member

domenic commented Jan 6, 2017

The validation rule problem mentioned at #53 (comment) doesn't seem to have made it into https://docs.google.com/document/d/11PEww2N0PbXyPhbsCdW_PjD3BNgZMy5XHUv02SSXNqY/edit#heading=h.p7mmdt3ofe3 by my reading. What's the latest? It's item A, nevermind

@bagder bagder referenced this issue Jan 30, 2017

Closed

IDNA2008 #223

@annevk annevk added the idna label Feb 10, 2017

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Feb 13, 2017

Member

Going forward, rather than tracking all UTS 46 feedback here, I suggest we just create new issues against this repository, so we can discuss each problem in isolation. I created an idna label that we can use to group them all.

Member

annevk commented Feb 13, 2017

Going forward, rather than tracking all UTS 46 feedback here, I suggest we just create new issues against this repository, so we can discuss each problem in isolation. I created an idna label that we can use to group them all.

@TimothyGu

This comment has been minimized.

Show comment
Hide comment
@TimothyGu

TimothyGu May 11, 2017

Member

As an update to the original issue, it seems the proposed changes to UTS#46 have been incorporated into its latest draft: http://www.unicode.org/reports/tr46/proposed.html.

Since traditionally UTS#46 updates are synced with Unicode Standard updates, a new version of UTS#46 with the CheckHyphens hook should be published next month, when Unicode 10.0.0 is scheduled to be released as well.

Member

TimothyGu commented May 11, 2017

As an update to the original issue, it seems the proposed changes to UTS#46 have been incorporated into its latest draft: http://www.unicode.org/reports/tr46/proposed.html.

Since traditionally UTS#46 updates are synced with Unicode Standard updates, a new version of UTS#46 with the CheckHyphens hook should be published next month, when Unicode 10.0.0 is scheduled to be released as well.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk May 11, 2017

Member

Yeah, but it's unclear if that's what we want as I commented in the document @domenic pointed to and discussed in email (primarily www-archive) and also at #267. However, it's been very hard to get implementers to give feedback on exactly what they're willing to try out here and what makes sense.

Member

annevk commented May 11, 2017

Yeah, but it's unclear if that's what we want as I commented in the document @domenic pointed to and discussed in email (primarily www-archive) and also at #267. However, it's been very hard to get implementers to give feedback on exactly what they're willing to try out here and what makes sense.

@domenic

This comment has been minimized.

Show comment
Hide comment
@domenic

domenic May 11, 2017

Member

I think we should go for a quick-fix first so that people who are trying to use spec-complaint libraries like Node.js's URL and jsdom/whatwg-url don't continue to suffer. We can use #267 to figure out a longer-term browser-compatible plan.

Member

domenic commented May 11, 2017

I think we should go for a quick-fix first so that people who are trying to use spec-complaint libraries like Node.js's URL and jsdom/whatwg-url don't continue to suffer. We can use #267 to figure out a longer-term browser-compatible plan.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk May 11, 2017

Member
  1. This is the generic IDNA issue. It's not for the hyphen case specifically.
  2. Is there actual suffering? Demonstrated compatibility issues would be really valuable.
Member

annevk commented May 11, 2017

  1. This is the generic IDNA issue. It's not for the hyphen case specifically.
  2. Is there actual suffering? Demonstrated compatibility issues would be really valuable.
@domenic

This comment has been minimized.

Show comment
Hide comment
@domenic

domenic May 11, 2017

Member

Sure, I meant we should go for a quick-fix for the hyphens.

Examples of suffering:

Member

domenic commented May 11, 2017

Sure, I meant we should go for a quick-fix for the hyphens.

Examples of suffering:

annevk added a commit that referenced this issue May 12, 2017

Clearly indicate a known issue with ToASCII
As reported at
#53 (comment) this is
causing issues in non-browser implementations.

annevk added a commit to web-platform-tests/wpt that referenced this issue May 18, 2017

URL: ToASCII
Tests for whatwg/url#53 and friends.

@annevk annevk referenced this issue May 18, 2017

Merged

URL: ToASCII #5976

annevk added a commit to web-platform-tests/wpt that referenced this issue May 19, 2017

URL: ToASCII
Tests for whatwg/url#53 and friends.
@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk May 24, 2017

Member

FWIW, I changed my mind after seeing #309 (comment). I think what UTS46 revision 18 defines is reasonable and that's what we should go with.

Member

annevk commented May 24, 2017

FWIW, I changed my mind after seeing #309 (comment). I think what UTS46 revision 18 defines is reasonable and that's what we should go with.

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk May 24, 2017

Member

@Sebmaster did your testing issue ever got addressed? If not, could you file a new issue on that? I'm happy to help investigate that as I've made some attempts myself as well now.

Member

annevk commented May 24, 2017

@Sebmaster did your testing issue ever got addressed? If not, could you file a new issue on that? I'm happy to help investigate that as I've made some attempts myself as well now.

annevk added a commit that referenced this issue May 24, 2017

IDNA: use proposed UTS46 flags to avoid breaking YouTube
Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.

annevk added a commit that referenced this issue May 24, 2017

IDNA: use proposed UTS46 flags to avoid breaking YouTube
Tests: web-platform-tests/wpt#5976.

Fixes #53 and fixes #267 by no longer breaking on on hyphens in the
3rd and 4th position of a domain label. This is known to break
YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting
the proposed CheckHyphens flag to false.

Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done
by setting the proposed CheckBidi and CheckJoiners flags to true.

Follow-up #313 is filed to remove the proposed bits once Unicode is
updated.
@Sebmaster

This comment has been minimized.

Show comment
Hide comment
@Sebmaster

Sebmaster May 26, 2017

It's not addressed yet, but the latest draft contains a TODO for it.

Sebmaster commented May 26, 2017

It's not addressed yet, but the latest draft contains a TODO for it.

@domenic domenic closed this in dc9d831 Jun 1, 2017

domenic added a commit to web-platform-tests/wpt that referenced this issue Jun 1, 2017

URL: ToASCII
Tests for whatwg/url#53 and friends, as fixed by whatwg/html#2627.
@domenic

This comment has been minimized.

Show comment
Hide comment
@domenic

domenic Jun 17, 2017

Member

I reported the following editorial issue:

I help maintain a JavaScript library for implementing UTS 46. In the process of revising our public API for the upcoming proposed revision (http://www.unicode.org/reports/tr46/proposed.html#ToASCII), we noticed how strange it is that all other inputs to ToASCII, besides the input string, are booleans. Whereas processing_option is an enumeration with two values.

For editorial consistency, would it make sense to switch the processing option to a boolean flag, e.g. UseTransitionalProcessing?

Member

domenic commented Jun 17, 2017

I reported the following editorial issue:

I help maintain a JavaScript library for implementing UTS 46. In the process of revising our public API for the upcoming proposed revision (http://www.unicode.org/reports/tr46/proposed.html#ToASCII), we noticed how strange it is that all other inputs to ToASCII, besides the input string, are booleans. Whereas processing_option is an enumeration with two values.

For editorial consistency, would it make sense to switch the processing option to a boolean flag, e.g. UseTransitionalProcessing?

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Jun 22, 2017

Member

@Sebmaster it seems http://www.unicode.org/reports/tr46/#Conformance_Testing was updated.

@domenic I asked for that change too. Note that if we want to track it here it should become its own issue. This is no longer a meta issue for all things IDNA as it got too unwieldy.

Member

annevk commented Jun 22, 2017

@Sebmaster it seems http://www.unicode.org/reports/tr46/#Conformance_Testing was updated.

@domenic I asked for that change too. Note that if we want to track it here it should become its own issue. This is no longer a meta issue for all things IDNA as it got too unwieldy.

@domenic

This comment has been minimized.

Show comment
Hide comment
@domenic

domenic Jun 22, 2017

Member

No need to really track it. I do think though having a public archive of feedback we've submitted is good, and my bad for my part in derailing the thread away from that. Maybe www-archive is OK?

Member

domenic commented Jun 22, 2017

No need to really track it. I do think though having a public archive of feedback we've submitted is good, and my bad for my part in derailing the thread away from that. Maybe www-archive is OK?

@annevk

This comment has been minimized.

Show comment
Hide comment
@annevk

annevk Jun 22, 2017

Member

Yeah or just a new issue for each piece of feedback. I don't think that would get crowded and if it does we can figure out a better approach.

Member

annevk commented Jun 22, 2017

Yeah or just a new issue for each piece of feedback. I don't think that would get crowded and if it does we can figure out a better approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment