Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling suffixes not in the public list #15

Closed
andersju opened this issue May 20, 2016 · 13 comments · Fixed by #17
Closed

Handling suffixes not in the public list #15

andersju opened this issue May 20, 2016 · 13 comments · Fixed by #17

Comments

@andersju
Copy link
Contributor

I think it would make sense for the library to handle domains that don't have a valid public suffix (possibly as an option).

Right now (with 0.2.0):

> PublicSuffix.registrable_domain("foobar.thistlddoesntexist")
"foobar.thistlddoesntexist"

Suggested:

> PublicSuffix.registrable_domain("foobar.thistlddoesntexist")
:undefined

Or something like that. This is how publicsuffix-erl does it, e.g.:

> :tld.domain("foobar.thistlddoesntexist")
:undefined
@myronmarston
Copy link
Contributor

The publicsuffix.org spec specifies what compliant libraries are supposed to do in this situation:

If no rules match, the prevailing rule is "*".

(See number 2 under "Algorithm"). We also rely on the test file they provide, which includes test cases for this case (see "Unlisted TLD" section). In order to pass those tests, we must follow the published spec.

@andersju
Copy link
Contributor Author

andersju commented May 20, 2016

Ah, I see! Still, perhaps it could be an optional parameter disabled by default? The library could then be used to answer a question like "Does this look like a sane public domain name?", which can sometimes be handy. I understand if you don't want to add it, though. Thanks for the module, anyway, I used publicsuffix-erl before but it gave me headaches with exrm.

Edit: Or just another function, e.g. PublicSuffix.is_suffix_valid?("foobar.commmm"). Would be great for validation purposes where you just need to do a "good enough" check.

@myronmarston
Copy link
Contributor

I used publicsuffix-erl before but it gave me headaches with exrm.

Yep, same thing happened with us. That's what led to us creating this library.

Anyhow, one of the issues with returning nil or a function like is_suffix_valid? is that we don't actually have a way to know if it's valid. As I understand the public suffix spec and rules, the reason for the If no rules match, the prevailing rule is "*" bit of the spec is that normally public suffixes have only one part. It creates for a smaller rules list if only the exceptions to that norm have to be listed. Given the language in the spec, I believe the people who maintain the public suffix list assume they do not have to list every one-part public suffix in the list--because compliant libraries will treat that as valid w/o needing a specific rule.

All that's to say: if a domain string does not match any rules in the list, I don't think that necessarily means it is invalid, and I wouldn't want to expose an API that claims it as such.

I think we can give you what you want, with a slightly different API, though:

> PublicSuffix.prevailing_rule("foobar.thistlddoesntexist")
["*"]

Basically, we can expose a function that gives you the prevailing rule. If we return ["*"] then it means we had to fallback because none of the explicitly defined rules matched. In your application, you could choose to interpret this as meaning the provided domain string is invalid.

On a side note, @benkirzhner have been discussing if we should switch to a single parse function that returns a struct. Something like:

> PublicSuffix.parse("mysite.foo.bar.co.uk")
%PublicSuffix.Domain{
  public_suffix: "co.uk",
  registrable_domain: "bar.co.uk",
  prevailing_rule: ["co", "uk"]
}

Thoughts?

@andersju
Copy link
Contributor Author

Basically, we can expose a function that gives you the prevailing rule. If we return ["*"] then it means we had to fallback because none of the explicitly defined rules matched. In your application, you could choose to interpret this as meaning the provided domain string is invalid.

That makes total sense and (since I've now learned a bit about the spec) it seems more reasonable and elegant than the solutions I've seen in similar libraries for other languages.

On a side note, @benkirzhner have been discussing if we should switch to a single parse function that returns a struct.

Sounds like a good idea! Fits well with how Elixir's URI module (which one might want to use at the same time) does it.

@myronmarston
Copy link
Contributor

Would you be interested in working up a PR to make the discussed changes, @andersju?

@andersju
Copy link
Contributor Author

Sure thing. I'm still fairly new to Elixir but I'll give it a go (might not have time until a week or two from now though).

@bkirz
Copy link
Contributor

bkirz commented May 24, 2016

Thanks for taking this on, @andersju! I just had a chat with @myronmarston, and we think a good plan for now would be to expose two new functions:

  • a general-purpose function, PublicSuffix.prevailing_rule(domain), that returns the prevailing rule as a string.
  • a predicate function, PublicSuffix.matches_explicit_rule?(domain), for explicitly checking whether or not the domain matches an existing rule.

This way, we get the flexibility of being able to see the prevailing rule (for debugging purposes, for example) while having a simpler API for the use case you originally described that doesn't require explicit knowledge of the fallback "*" case. How does that sound?

@andersju
Copy link
Contributor Author

andersju commented May 25, 2016

Sounds good! I already cooked up a quick solution (but need to write tests and docs). Question though: should PublicSuffix.matches_explicit_rule?(domain) extract just the public suffix (like public_suffix/2) and check whether that matches an explicit rule, or extract the registrable part of a domain (like registrable_domain/2) and check whether its public suffix matches an explicit rule?

I think the latter would be more useful:

iex(6)> PublicSuffix.registrable_domain("example.com")
"example.com"
iex(7)> PublicSuffix.registrable_domain("example")    
nil
iex(8)> PublicSuffix.matches_explicit_rule?("example.com")
true
iex(9)> PublicSuffix.matches_explicit_rule?("com")        
false

...but matches_explicit_rule? is perhaps then slightly misleading. What I'm after is really registrable_domain_matches_explicit_rule?, but that's a bit verbose :) Although, come to think of it, if we go with the former and matches_explicit_rule? handles nil and treats it as false you could do:

iex(7)> "example" |> PublicSuffix.registrable_domain |> PublicSuffix.matches_explicit_rule?    
false
iex(8)> "example.com" |> PublicSuffix.registrable_domain |> PublicSuffix.matches_explicit_rule?
true

What do you think?

@myronmarston
Copy link
Contributor

Question though: should PublicSuffix.matches_explicit_rule?(domain) extract just the public suffix (like public_suffix/2) and check whether that matches an explicit rule, or extract the registrable part of a domain (like registrable_domain/2) and check whether its public suffix matches an explicit rule?

I confess I don't understand the distinction you are making. Both public_suffix and registrable_domain begin by finding the prevailing rule for the given input, and they will always find the same rule given the same input. The difference between the functions is what they do with the prevailing rule: given a prevailing rule of n parts, public_suffix will return the last n parts of the input domain, whereas registrable_domain will return the last n + 1 parts.

So I don't see how public_suffix vs registrable_domain plays into whether or not a provided string matches an explicit rule.

But maybe I'm missing something?

@andersju
Copy link
Contributor Author

Basically, I want to determine "does the given string contain a registrable part and, if so, does its public suffix match an explicit rule?". So I want "example", "example.notatld", "com" to evaluate to false. But I figure a matches_explicit_rule? function ought to return false, false, true, respectively.

So, to answer the initial question, I think we could either create an additional function or simply make it possible to pipe the output from registrable_domain, as registrable_domain returns nil if there's no registrable part. I.e.:

> "example" |> PublicSuffix.registrable_domain
nil
> "example" |> PublicSuffix.registrable_domain |> PublicSuffix.matches_explicit_rule?
false

Make sense?

@andersju
Copy link
Contributor Author

Actually, PublicSuffix.matches_explicit_rule? treating nil as valid input might just make it unnecessarily ugly. For the use case in the previous post it's easy enough to handle it on your own if you want the piping ability (e.g. "example" |> PublicSuffix.registrable_domain |> (&(if &1 == nil, do: "", else: &1)).() |> PublicSuffix.matches_explicit_rule?).

@myronmarston
Copy link
Contributor

Thanks for explaining. I think I understand your use case better now. I was confused previously because it sounded like you were trying to decide if PublicSuffix.matches_explicit_rule? should first extract the registrable_domain or the public_suffix but it sounds like we're on the same page about doing neither :).

Actually, PublicSuffix.matches_explicit_rule? treating nil as valid input might just make it unnecessarily ugly.

With pattern matching, it's really not bad...we just need a function clause like:

def matches_explicit_rule?(nil), do: false

While I don't want to support every elixir type, nil seems OK; and the answer to "Does nil match an explicit rule?" is "no".

@myronmarston
Copy link
Contributor

I've released 0.3.0 with your changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants