normalize method can't handle URLs with punycoded TLD #28

Closed
walro opened this issue Apr 28, 2015 · 4 comments

@walro
Contributor

walro commented Apr 28, 2015

[12] pry(main)> Twingly::URL::Normalizer.normalize("http://xn--80aesdcplhhhb0k.xn--p1ai/")
=> []

Page loads fine in Chrome though, whois works fine too.

@walro walro added the bug label Apr 28, 2015
@walro
Contributor Author

walro commented Apr 28, 2015

It was pointed out that the TLD is really funky here.

@dentarg
Collaborator

dentarg commented Apr 28, 2015

Strange, xn--p1ai (or рф) is in the public suffix list

I tested too, with public_suffix 1.5.1, got the same as above.

I don't really know how public_suffix works, so I can't tell whether the entry above should be enough to support xn--p1ai or whether they are missing something.

Could it be an encoding issue?

@walro
Contributor Author

walro commented Apr 28, 2015

> Could it be an encoding issue?

Yeah, to get it working (with public_suffix) we need to convert the host from Punycode back to Unicode (UTF-8):

irb(main):005:0> PublicSuffix.valid?("xn--80aesdcplhhhb0k.xn--p1ai")
=> false
irb(main):006:0> PublicSuffix.valid?("domain.рф")
=> true

Public suffix won't add support for it: weppos/publicsuffix-ruby#24

We could use https://github.com/mmriis/simpleidn (I found other, even less maintained, alternatives too) to do this ourselves.
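To make the "Punycode back to UTF-8" step concrete, here is a minimal, standard-library-only sketch of the decoding procedure from RFC 3492, applied label by label to a host name. The names `decode_label` and `to_unicode` are made up for this illustration; they are not part of this project, public_suffix, or simpleidn, and in practice a maintained IDN library would probably be the better choice. The sketch skips the overflow and validity checks a production decoder needs.

```ruby
# Punycode parameters from RFC 3492, section 5.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700
INITIAL_BIAS = 72
INITIAL_N = 128
ACE_PREFIX = "xn--"

# Bias adaptation function (RFC 3492, section 6.1).
def adapt_bias(delta, num_points, first_time)
  delta = first_time ? delta / DAMP : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end
  k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
end

# Maps a Punycode digit to its value: a-z => 0..25, 0-9 => 26..35.
def digit_value(char)
  case char
  when "a".."z" then char.ord - "a".ord
  when "A".."Z" then char.ord - "A".ord
  when "0".."9" then char.ord - "0".ord + 26
  else raise ArgumentError, "invalid punycode digit: #{char.inspect}"
  end
end

# Hypothetical helper: decodes one "xn--" label, returns other labels unchanged.
def decode_label(label)
  return label unless label.downcase.start_with?(ACE_PREFIX)

  encoded = label[ACE_PREFIX.length..-1]
  # Basic (ASCII) code points come before the last "-", if there is one.
  basic, _delimiter, extended = encoded.rpartition("-")
  output = basic.codepoints
  chars = extended.chars
  n = INITIAL_N
  i = 0
  bias = INITIAL_BIAS
  pos = 0

  while pos < chars.length
    old_i = i
    w = 1
    k = BASE
    # Read one generalized variable-length integer into i.
    loop do
      raise ArgumentError, "truncated punycode input" if pos >= chars.length
      digit = digit_value(chars[pos])
      pos += 1
      i += digit * w
      t = if k <= bias then TMIN
          elsif k >= bias + TMAX then TMAX
          else k - bias
          end
      break if digit < t
      w *= BASE - t
      k += BASE
    end
    bias = adapt_bias(i - old_i, output.length + 1, old_i.zero?)
    n += i / (output.length + 1)
    i %= output.length + 1
    output.insert(i, n) # insert code point n at position i
    i += 1
  end

  output.pack("U*") # UTF-8 string
end

# Hypothetical helper: decodes every label of a host name.
def to_unicode(host)
  host.split(".").map { |label| decode_label(label) }.join(".")
end

puts decode_label("xn--p1ai") # prints "рф"
puts to_unicode("xn--80aesdcplhhhb0k.xn--p1ai")
```

The decoded host is what would then be handed to `PublicSuffix.valid?`, which, per the irb session above, accepts the `.рф` form but rejects the `xn--p1ai` form.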

@walro walro changed the title from "normalize method can't handle certain IDNs" to "normalize method can't handle Punycoded urls" May 6, 2015
@walro walro added the critical label May 6, 2015
@jage jage removed the critical label May 11, 2015
@jage
Contributor

jage commented May 12, 2015

We should analyze our data to see how many punycoded TLDs we actually have.
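One rough way to do that count, as a sketch only: the helper name `punycoded_tld?` and the sample URLs are made up for illustration, and real data would come from wherever the URLs are stored.

```ruby
require "uri"

# Hypothetical helper: true when the last host label (the TLD) is punycoded.
def punycoded_tld?(url)
  host = URI.parse(url).host
  return false if host.nil?

  host.split(".").last.to_s.downcase.start_with?("xn--")
rescue URI::InvalidURIError
  false
end

# Illustrative sample, not real data.
sample_urls = [
  "http://xn--80aesdcplhhhb0k.xn--p1ai/", # punycoded TLD
  "http://example.com/",                  # plain ASCII host
  "http://xn--bcher-kva.example/"         # punycoded label, ASCII TLD
]

count = sample_urls.count { |url| punycoded_tld?(url) }
puts "#{count} of #{sample_urls.length} URLs have a punycoded TLD"
```

Only the last label is checked here, so hosts with a punycoded second-level label but an ASCII TLD are not counted.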

@dentarg dentarg changed the title from "normalize method can't handle Punycoded urls" to "normalize method can't handle URLs with punycoded TLD" May 12, 2015
@roback roback closed this as completed in d39a959 Sep 9, 2015