
Ensure normalized IDNA domains return ASCII strings #90

Merged: 23 commits into master on Sep 15, 2016

Conversation

@dentarg (Collaborator) commented Sep 9, 2016

This PR ensures correct behaviour for normalized URLs (fixes #89), but it also changes the behaviour for non-normalized URL objects, see these snippets (original comment):

Previous version (4.2.0):

[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fd876a72dd8 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fd8767d277c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.ایران.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "ایران.ir"

This version:

[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fc621fd5454 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fc621fb855c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.xn--mgba3a4f16a.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "xn--mgba3a4f16a.ir"

Original description

So, yes, this increases the load time quite a bit... :) (Not true anymore, see 2bd6447.) Haven't run the profiler yet, but we could do that.

I'm open to suggestions about how we should deal with the extending of the default list, I just did something to get the discussion started.


Related to #71. Close #85 and #89.

@roback (Member) commented Sep 9, 2016

I'm open to suggestions about how we should deal with the extending of the default list, I just did something to get the discussion started.

We could also create a new list instead of changing the default one:

    CUSTOM_SUFFIX_LIST = PublicSuffix::List.
      parse(File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false).
      tap do |list|
        list.indexes.keys.
        map    { |name| Addressable::IDNA.to_ascii(name) }.
        select { |name| name =~ /xn\-\-/ }.
        each   { |name| list << PublicSuffix::Rule.factory(name) }
      end

# ...

public_suffix_domain = PublicSuffix.parse(host, default_rule: nil, list: CUSTOM_SUFFIX_LIST)

Review comment on:

  File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false)
PublicSuffix::List.default.indexes.keys.
  map { |name| Addressable::IDNA.to_ascii(name) }.
  select { |name| name =~ /xn\-\-/ }.
Contributor:

Maybe we should add ^? I think it's always at the start of the host.

Collaborator Author:

I used the same expressions as https://github.com/weppos/publicsuffix-ruby/blob/v2.0.2/test/psl_test.rb#L37-L38, but that expression might be constructed to handle more variants (https://github.com/weppos/publicsuffix-ruby/blob/v2.0.2/test/tests.txt)

we can check if we get the same count using ^

Contributor:

I think /\Axn\-\-/i is the correct regex.

The ToASCII and ToUnicode operations MUST recognize the ACE prefix in a case-insensitive manner.

The ACE prefix for IDNA is "xn--" or any capitalization thereof.

https://tools.ietf.org/html/rfc3490#section-5
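
A quick sketch (not from the PR) of what that case-insensitive, anchored check looks like when applied per label; the helper name is made up for illustration:

    # True if any label of the name carries the ACE prefix ("xn--", any capitalization)
    def ace_encoded?(name)
      name.split(".").any? { |label| label =~ /\Axn--/i }
    end

    ace_encoded?("XN--mgba3a4f16a.ir")              # => true
    ace_encoded?("sande.xn--mre-og-romsdal-qqb.no") # => true (prefix on a middle label)
    ace_encoded?("example.com")                     # => false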

@roback (Member) commented Sep 9, 2016

This doesn't look right:

[1] pry(main)> Twingly::URL.parse("https://www.foo.ایران.ir/bar").tld
=> "ایران.ir"
[2] pry(main)> Twingly::URL.parse("https://www.foo.ایران.ir/bar").normalized.tld
=> "ir"

@roback (Member) commented Sep 9, 2016

This doesn't look right:

Made a few changes that fix things:

diff --git a/lib/twingly/url.rb b/lib/twingly/url.rb
index 7e7fef6..46ac6ca 100644
--- a/lib/twingly/url.rb
+++ b/lib/twingly/url.rb
@@ -8,10 +8,12 @@ require_relative "version"

 PublicSuffix::List.default = PublicSuffix::List.parse(
   File.read(PublicSuffix::List::DEFAULT_LIST_PATH), private_domains: false)
-PublicSuffix::List.default.indexes.keys.
-  map { |name| Addressable::IDNA.to_ascii(name) }.
+PublicSuffix::List.default.each.
+  map { |rule| Addressable::IDNA.to_ascii(rule.value) }.
   select { |name| name =~ /xn\-\-/ }.
-  each { |name| PublicSuffix::List.default << PublicSuffix::Rule.factory(name) }
+  each { |name| PublicSuffix::List.default.add(PublicSuffix::Rule.factory(name), reindex: false) }
+
+PublicSuffix::List.default.reindex!

 module Twingly
   class URL

Had to pass reindex: false to add (and call PublicSuffix::List.default.reindex!), otherwise it took ~5 seconds to load the file into pry 😄

Edit: I don't know why I added the each before map though, that should not be there...

@dentarg (Collaborator Author) commented Sep 9, 2016

But I like the suggestions above about our own list, looks nice to be able to pass default_rule: nil when creating the list

@roback (Member) commented Sep 9, 2016

@roback << is alias for add, https://github.com/weppos/publicsuffix-ruby/blob/v2.0.2/lib/public_suffix/list.rb#L195

Aha, I could have kept the << then :)

@dentarg (Collaborator Author) commented Sep 10, 2016

looks nice to be able to pass default_rule: nil when creating the list

I misread that, we still pass it when doing #parse

@dentarg (Collaborator Author) commented Sep 10, 2016

Aha, I could have kept the << then :)

Oops, no, we need to use #add, I thought reindex: false was the default, but it's reindex: true.

"bundle exec rspec spec/lib/twingly/url_spec.rb" went from

    files took 1.72 seconds to load

to

    files took 0.27164 seconds to load
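
Pulling the pieces from the diff above together, a minimal sketch of the resulting pattern (all calls appear in the diff; nothing new is assumed):

    require "addressable/uri"
    require "public_suffix"

    # Collect the ASCII forms of all IDN rules, add them without rebuilding the
    # index on every insert, then reindex once at the end.
    PublicSuffix::List.default.
      map    { |rule| Addressable::IDNA.to_ascii(rule.value) }.
      select { |name| name =~ /xn\-\-/ }.
      each   { |name| PublicSuffix::List.default.add(PublicSuffix::Rule.factory(name), reindex: false) }

    PublicSuffix::List.default.reindex!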
@dentarg (Collaborator Author) commented Sep 10, 2016

This doesn't look right:

Good catch, I used your fix

@dentarg (Collaborator Author) commented Sep 12, 2016

Hmm

https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

The conversions between ASCII and non-ASCII forms of a domain name are accomplished by algorithms called ToASCII and ToUnicode. These algorithms are not applied to the domain name as a whole, but rather to individual labels. For example, if the domain name is www.example.com, then the labels are www, example, and com. ToASCII or ToUnicode are applied to each of these three separately.

I wonder if we do it correctly here...
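
For illustration (not code from this PR), here is the label-by-label idea from the quote, next to handing Addressable the whole host; the values match the hosts used earlier in this thread:

    require "addressable/uri"

    host = "www.foo.ایران.ir"

    # ToASCII is defined per label, so one way is to convert each label separately
    host.split(".").map { |label| Addressable::IDNA.to_ascii(label) }.join(".")
    # => "www.foo.xn--mgba3a4f16a.ir"

    # Addressable::IDNA.to_ascii also accepts a whole domain and converts its labels
    Addressable::IDNA.to_ascii(host)
    # => "www.foo.xn--mgba3a4f16a.ir"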

My shell is a bit fancy, and with the old versions I got this:

    $ cat `bundle show public_suffix`/data/list.txt | wc -l
    preexec: command not found: bundle
       11618

Review comment on:

# Add ASCII form of all internationalized domain names to the public suffix list
PublicSuffix::List.default.
map { |rule| Addressable::IDNA.to_ascii(rule.value) }.
Contributor:

We could try with both "pure" and "libidn" implementations just so we know we won't run into something exciting once we drop libidn.

Collaborator Author:

I tried manually with

diff --git a/lib/twingly/url.rb b/lib/twingly/url.rb
index 4193abd..9dd4581 100644
--- a/lib/twingly/url.rb
+++ b/lib/twingly/url.rb
@@ -1,5 +1,5 @@
 require "addressable/uri"
-require "addressable/idna/native"
+# require "addressable/idna/native"
 require "public_suffix"

 require_relative "url/null_url"
@@ -29,7 +29,7 @@ module Twingly
     ERRORS_TO_EXTEND = [
       Addressable::URI::InvalidURIError,
       PublicSuffix::DomainInvalid,
-      IDN::Idna::IdnaError,
+      # IDN::Idna::IdnaError,
     ]

     private_constant :ACCEPTED_SCHEMES, :ENDS_WITH_SLASH, :ERRORS_TO_EXTEND
diff --git a/twingly-url.gemspec b/twingly-url.gemspec
index f06d0d9..89f14cd 100644
--- a/twingly-url.gemspec
+++ b/twingly-url.gemspec
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|

   s.add_dependency "addressable", "~> 2"
   s.add_dependency "public_suffix", "~> 2"
-  s.add_dependency "idn-ruby", "~> 0.1"
+  # s.add_dependency "idn-ruby", "~> 0.1"

   s.add_development_dependency "rake", "~> 10"
   s.add_development_dependency "rspec", "~> 3"
~/twingly/ruby/twingly-url public_suffix-2*
$ bundle show | grep idn-ruby

and all specs pass, but I'm not sure it says much, as we just have a few IDNs in our test cases. But I think it's okay. :)

Commit added to the PR:

Catch valid URLs being normalized to something that's parsed as NullURL.
Currently, this URL isn't valid, but if we drop libidn (which we talked
about), it would be valid, due to the way Addressable 2.4.0 works.
See comment at: #66 (comment)

Adding this URL to the tests so we will notice the change in behaviour,
if/when we drop libidn.
@dentarg (Collaborator Author) commented Sep 12, 2016

Continuing the hidden conversation at #90 (comment) here

Interesting results, the counts are different

[1] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /xn\-\-/ }.count
=> 814
[2] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /\Axn\-\-/i }.count
=> 812
[3] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /^xn\-\-/i }.count
=> 812

@dentarg (Collaborator Author) commented Sep 12, 2016

[6] pry(main)> a_814 - a_812
=> ["sande.xn--mre-og-romsdal-qqb.no", "sande.xn--mre-og-romsdal-qqb.no"]

@dentarg (Collaborator Author) commented Sep 12, 2016

Hmm

[7] pry(main)> (a_814 - a_812).uniq
=> ["sande.xn--mre-og-romsdal-qqb.no"]
[8] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /xn\-\-/ }.uniq.count
=> 407
[9] pry(main)> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name =~ /\Axn\-\-/i }.uniq.count
=> 406

@dentarg (Collaborator Author) commented Sep 12, 2016

Oh, my bad, I'm in bundle console, so I've already loaded the code once... :)

@jage (Contributor) commented Sep 12, 2016

Interesting results, the counts are different

Shouldn't we check each part? (#90 (comment))

> PublicSuffix::List.default.map { |rule| Addressable::IDNA.to_ascii(rule.value) }.select { |name| name.split(".").any? { |part| part =~ /\Axn\-\-/i } }.count
814

@dentarg (Collaborator Author) commented Sep 12, 2016

Yes

@jage (Contributor) commented Sep 12, 2016

Shouldn't we check each part? (#90 (comment))

It could be done with a regex, like /(\A|\.)xn\-\-/i, but I think I prefer the explicit split.
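
A tiny check (not from the PR) that the two approaches pick out the same names:

    names = [
      "xn--mgba3a4f16a.ir",              # ACE prefix on the first label
      "sande.xn--mre-og-romsdal-qqb.no", # ACE prefix on a middle label
      "example.com",                     # no ACE prefix
    ]

    by_split = names.select { |name| name.split(".").any? { |part| part =~ /\Axn\-\-/i } }
    by_regex = names.select { |name| name =~ /(\A|\.)xn\-\-/i }

    by_split == by_regex # => true (both select the first two names only)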

Review comment on:

  list
end

private_class_method def self.punycoded_names(list)
Contributor:

An alternative could be:

private_class_method \
def self.punycoded_names(list)
  # ..
end

Not sure which I prefer though (probably the one with the newline).

Collaborator Author:

Okay


@dentarg changed the title from "Use PublicSuffix 2" to "Use PublicSuffix 2 to improve normalized behaviour (#89)" Sep 14, 2016
@dentarg changed the title from "Use PublicSuffix 2 to improve normalized behaviour (#89)" to "Use PublicSuffix 2 to improve normalized behaviour (fix #89)" Sep 14, 2016
@dentarg (Collaborator Author) commented Sep 14, 2016

@jage WDYT? Merge?

@jage (Contributor) commented Sep 14, 2016

I think we need to add a comment (in the PR) as to what changes here. Previously we had some consistency as to what tld and domain returned (always unicode); now they return unicode if the input is unicode and ascii if the input is ascii. To be honest, I think I would prefer that we always returned the same format, no matter what the input is.

Previous version:

[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fd876a72dd8 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fd8767d277c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.ایران.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "ایران.ir"

This version:

[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fc621fd5454 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fc621fb855c https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)>
[4] pry(main)> u1.to_s
=> "https://www.foo.ایران.ir/bar"
[5] pry(main)> u2.to_s
=> "https://www.foo.xn--mgba3a4f16a.ir/bar"
[6] pry(main)>
[7] pry(main)> u1.domain
=> "foo.ایران.ir"
[8] pry(main)> u2.domain
=> "foo.xn--mgba3a4f16a.ir"
[9] pry(main)>
[10] pry(main)> u1.tld
=> "ایران.ir"
[11] pry(main)> u2.tld
=> "xn--mgba3a4f16a.ir"

@dentarg (Collaborator Author) commented Sep 14, 2016

To be honest I think I would prefer that we always returned the same format, no matter what the input is.

Isn't that what we have normalization for?

@jage (Contributor) commented Sep 14, 2016

Isn't that what we have normalization for?

Yes and no, normalization is useful to compare URLs (to avoid duplicates). But if I want to print the URL in my web UI, shouldn't we print the same string regardless of whether the input was in unicode or ascii?

@dentarg (Collaborator Author) commented Sep 14, 2016

Previously we had some consistency as to what tld and domain returned (always unicode),

That just happened to be the case, it wasn't documented.

now they return unicode if the input is unicode and ascii if the input is ascii.

That is true, we can add tests for that.

@jage (Contributor) commented Sep 14, 2016

This just happened to be the case, it wasn't documented.

That doesn't change the fact.

@dentarg (Collaborator Author) commented Sep 14, 2016

But if I want to print the URL in my web UI, shouldn't we print the same string regardless of whether the input was in unicode or ascii?

Yes, that makes sense, but fortunately, we don't have the need to do that right now, so I think we are okay. I don't think that should be the default behaviour though, I think it should be a function that the user explicitly uses. Much like #71, "get URL (or parts of) in Unicode".

@dentarg (Collaborator Author) commented Sep 14, 2016

Previously we had some consistency as to what tld and domain returned (always unicode); now they return unicode if the input is unicode and ascii if the input is ascii.

One could argue we are more consistent now: what the methods return is based on the input. E.g. #host wasn't always unicode before.

Previous version:

$ bundle console
[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3ff0792d88b4 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3ff079d181f8 https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)> u1.host
=> "www.foo.ایران.ir"
[4] pry(main)> u2.host
=> "www.foo.xn--mgba3a4f16a.ir"

This version:

$ bundle console
Resolving dependencies...
[1] pry(main)> u1 = Twingly::URL.parse("https://www.foo.ایران.ir/bar")
=> #<Twingly::URL:0x3fe381504f48 https://www.foo.ایران.ir/bar>
[2] pry(main)> u2 = Twingly::URL.parse("https://www.foo.xn--mgba3a4f16a.ir/bar")
=> #<Twingly::URL:0x3fe381186484 https://www.foo.xn--mgba3a4f16a.ir/bar>
[3] pry(main)> u1.host
=> "www.foo.ایران.ir"
[4] pry(main)> u2.host
=> "www.foo.xn--mgba3a4f16a.ir"

As said, I can add tests for this.

@dentarg (Collaborator Author) commented Sep 14, 2016

Just a note about the Nameprep algorithm, from https://en.wikipedia.org/wiki/Internationalized_domain_name#ToASCII_and_ToUnicode

The function ToUnicode reverses the action of ToASCII, stripping off the ACE prefix and applying the Punycode decode algorithm. It does not reverse the Nameprep processing, since that is merely a normalization and is by nature irreversible.

@dentarg (Collaborator Author) commented Sep 14, 2016

I suggest we release this as 4.2.1 (bugfix).

Perhaps not, as @jage showed, we will change the behaviour for some methods, so 5.0.0 might make sense.

@dentarg (Collaborator Author) commented Sep 15, 2016

Could have been part of the commit message in my latest commit (15414ed), but I can write it here.

This will print userinfo, user and password for all URLs, which it
didn't do before.
* DRY up to urls example
* Print the same example methods for all URLs in the README
@jage (Contributor) left a comment

Suggestion for a new PR title: "Ensure normalized IDNA domains return ASCII strings", since I think PublicSuffix is just a small technical detail to make it happen.

Review comment on:

url.host # => "räksmörgås.макдональдс.рф"
url.origin # => "http://xn--rksmrgs-5wao1o.xn--80aalb1aicli8a5i.xn--p1ai"
url.path # => "/foo"
url.without_scheme # => "//räksmörgås.макдональдс.рф/foo"
Contributor:

How about we print the normalized version as well (for all parts)?

url.without_scheme
url.normalized.without_scheme

Since they return different strings.

Collaborator Author:

Sure, I was thinking of showing normalization examples of these URLs as another section below, but maybe inline is better?

Contributor:

If it's the same URL we compare non-normalized and normalized, I think inline (alternating) is easier to read; my assumption is that the reader wants to understand the potential difference between a normalized and a non-normalized URL.

Collaborator Author:

Would be great to not have to repeat all the parts for the normalized version, so I'm thinking a new section is easier, WDYT?

Contributor:

Yes, with the current structure it might become too many rows. We can add a new section now.

I can open an issue about improving the output; I'm thinking some of the behavior needs a table:

| Method | Non-normalized | Normalized |
| ------ | -------------- | ---------- |
| .host  | räksmörgås.макдональдс.рф | xn--rksmrgs-5wao1o.xn--80aalb1aicli8a5i.xn--p1ai |

WDYT?

Contributor:

Table seems like a nice solution!

Collaborator Author:

Hmm, I can add them as you suggested. I think it will make for the clearest examples.

Collaborator Author:

Heh, GitHub failed to show me your comments, saw them now when I did a full page reload...

@dentarg (Collaborator Author) commented Sep 15, 2016

I think our specs could also benefit from the "table-way" of doing things.
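
A rough sketch of what a "table-way" spec could look like (hypothetical, not code from this repo; the expected values are taken from the snippets in this PR):

    RSpec.describe Twingly::URL do
      {
        "https://www.foo.ایران.ir/bar"           => "foo.ایران.ir",
        "https://www.foo.xn--mgba3a4f16a.ir/bar" => "foo.xn--mgba3a4f16a.ir",
      }.each do |input, expected_domain|
        it "returns #{expected_domain} as the domain for #{input}" do
          expect(described_class.parse(input).domain).to eq(expected_domain)
        end
      end
    end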

@dentarg changed the title from "Use PublicSuffix 2 to improve normalized behaviour (fix #89)" to "Ensure normalized IDNA domains return ASCII strings" Sep 15, 2016
@jage (Contributor) commented Sep 15, 2016

LGTM!

Perhaps not, as @jage showed, we will change the behaviour for some methods, so 5.0.0 might make sense.

Yes, agree that this is a major version bump.

@dentarg merged commit 04870e0 into master Sep 15, 2016
@dentarg deleted the public_suffix-2 branch September 15, 2016 14:12
dentarg added a commit that referenced this pull request Sep 16, 2016
To release #90
Successfully merging this pull request may close these issues:

NormalizedURL#to_s returns punycode, other instance methods does not
Use PublicSuffix 2.0