Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for IDNA 2008 (RFC 5891) #6910

Closed
j-bernard opened this issue Nov 12, 2021 · 10 comments
Closed

Add support for IDNA 2008 (RFC 5891) #6910

j-bernard opened this issue Nov 12, 2021 · 10 comments
Labels
bug Bug in existing code

Comments

@j-bernard
Copy link

I'm aware of issue #1615 but I think there is an easier way to fix this.

There is an trustworthy IDN implementation that actually offers IDNA 2008 compatibility, ICU4J, see IDNA class.

The Android platform already replaced the java.net.IDN methods with ICU4J however they kept using the static methods that are deprecated as only compliant with IDNA 2003.

See my bug report on Android issue tracker for more details and snippet to correctly use ICU4J with the right flags.

@j-bernard j-bernard added the bug Bug in existing code label Nov 12, 2021
@yschimke
Copy link
Collaborator

If this is part of the public Android APIs, we could potentially use the better implementation on Android, while leaving JDK as is.

How common are these domains? What are some examples that are currently broken?

image

@j-bernard
Copy link
Author

See https://unicode.org/reports/tr46/#IDNAComparison for differences between the new and old standards.

I don't have statistics about these domains but at least one pretty common German character (Eszet: ß) is impacted.
I found a domain that seems to have been created especially to illustrate the problems of non-compliance with IDNA 2008, see this post: http://straße.de/. You won't get the same page with Google Chrome (IDNA 2003) and Firefox (IDNA 2008).

@yschimke
Copy link
Collaborator

yschimke commented Nov 14, 2021

@j-bernard Care to take a look at #6912

What flags should I be passing to getUTS46Instance?

It seems this is safe down to Android 24, is that your reading?

@j-bernard
Copy link
Author

@yschimke thanks, I'll have a look at the PR.

You should pass the following flags
IDNA.CHECK_BIDI, IDNA.CHECK_CONTEXTJ, IDNA.CHECK_CONTEXTO, IDNA.USE_STD3_RULES and depending on the way you convert IDNA.NONTRANSITIONAL_TO_ASCII or IDNA.NONTRANSITIONAL_TO_UNICODE.

Android doc seems to say API level 24 but I did not conduct any test on old versions.

@yschimke
Copy link
Collaborator

Going to close this out as the strictness of IDNA 2008 is likely to cause more visible issues than this solves, particularly as this isn't uniformly supported or implemented by clients and servers.

@arnt
Copy link

arnt commented Jan 2, 2023

JFYI: IDNA2008 solves more problems than it creates. When I saw this PR I became curious and checked.

I checked about 1.5m IDN domains in more than a thousand TLDs. When you have an IDNA2003 implementation, IDNA2008 maintains compatibility with >99,999% of IDN domains, gains compatibility with the ones that din't work yet in IDNA2003, and breaks almost nothing. It's difficult to judge precisely due to the existence of test domains, but saying that it breaks one domain in a million isn't unreasonable.

The main domain broken is ≠.net, which is broken because IDNA says that slashes aren't allowed and IDNA2008 extends that to ban composed codepoints that decompose to slashes. I'm not sure whether that domain's a test domain. If not, then moving to IDNA2008 would break use of it.

There's also ·.shop, a dozen or more than combine latin letters and a CJK hyphen, and some that couldn't be entered here because they're not legal unicode. Those can't be shown on-screen, are forbidden by IDNA2008, but are sort of legal according to IDNA2003.

So: don't worry about breaking anything with a move to IDNA2008. Wanting to stay with the base platform's IDN implementation is IMO a very good argument against a move. Worry about breakage is not.

I can rerun the analysis later, if anyone wants up-to-date results.

@yschimke yschimke reopened this Jan 2, 2023
@swankjesse
Copy link
Member

@arnt great timing, thanks for the explanation. I’ve been doing HTTP stuff for a long time but I’m new to IDN so advice here is very welcome.

Any advice on how I could source my own list of 1.5M IDNs for testing? Good tests make it much easier to write good code.

@arnt
Copy link

arnt commented Jan 2, 2023

My advice is to get a CZDS account and not bother with anything else. You can get more zone files if you approach each registry separately, but using just the CZDS provides a lot of domains for little effort. ICANN also provides code to download the files in bulk.

FWIW I'll need more zone files for various work chores, and will rerun my analysis as I get them. But I don't expect to uncover many new problems. A problem has to be very rare indeed to not show up in the first million cases.

Your next challenge will be to use IDNA2003 and -2008 in the same program at the same time, with no cross-contamination. I decided on ICU4J.

@swankjesse
Copy link
Member

Done! See #7008

@arnt
Copy link

arnt commented May 12, 2023

I looked at the code recently, great work and thanks. And while it was perhaps not your top priority, you made a frustrating day end well for me ;)

I had to think a little about whether your choice to use nontransitional is right, but overall I think it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug in existing code
Projects
None yet
Development

No branches or pull requests

4 participants