Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing an empty host #79

Closed
SimonSapin opened this issue Jan 21, 2016 · 13 comments
Closed

Parsing an empty host #79

SimonSapin opened this issue Jan 21, 2016 · 13 comments

Comments

@SimonSapin
Copy link
Contributor

If I’m reading the parsing algorithm correctly, parsing some-non-special-scheme:/// causes the host parser to be called with an empty string. Since domain to ASCII sets VerifyDnsLength to false, asciiDomain is also empty (not failure) and IPv4 parsing is called with an empty string. There, parts is first a list of one empty string, then an empty list:

Let parts be input split on ".".
If the last item in parts is the empty string, set syntaxViolationFlag and remove the last item from parts.

Later, numbers stays empty and some of the steps are nonsensical when they talk about the last item of numbers.

So:

  • The IPv4 parser should probably return failure for the empty string as a first step
  • This would make the host of this example URL an empty domain. Should it be null instead?
@annevk
Copy link
Member

annevk commented Jan 21, 2016

Wow, I thought I accounted for this.

"host state" should account for this somehow. Making host null would make the serializer unhappy as far as I can tell. So I guess the best solution would be to just not invoke the host parser for the empty string. I.e., change "Let host be the result of host parsing buffer." to "Let host be the empty string if buffer is the empty string, and the result of host parsing buffer otherwise."

@SimonSapin
Copy link
Contributor Author

Host state already says (in 2 places):

  1. If url is special and buffer is the empty string, return failure.
  2. Let host be the result of host parsing buffer.
  3. If host is failure, return failure.
  4. Set url’s host to host,

Maybe a step could be added between 1 and 2: “If buffer is the empty string, let host be an empty domain” (an url’s host is not a string) and old 2 be made an “Otherwise”.

@annevk
Copy link
Member

annevk commented Jan 21, 2016

Yeah, that sounds pretty good.

@SimonSapin
Copy link
Contributor Author

Though I think it’s more robust to leave the host state alone and fix the IPv4 parser instead.

@annevk
Copy link
Member

annevk commented Jan 21, 2016

Why? In case something else invokes the IPv4 parser directly with the empty string?

@SimonSapin
Copy link
Contributor Author

Yes, but more generally it’s good for the thing that maintain an invariant (such as "this string is non-empty") to be close to the things that rely on that invariant.

The URL parser algorithm is full of things like that: one bit of pseudo-code relies on another bit of pseudo-code that doesn’t seem directly related to have had some non-obvious side effect. It’s not wrong, but it makes it harder for humans to read the algorithm without making mistakes.

@annevk
Copy link
Member

annevk commented Jan 22, 2016

Okay, how about we add "Assert: input is not the empty string." to the IPv4 parser and make the host parser return an empty domain if the input is the empty string.

@annevk
Copy link
Member

annevk commented Jan 22, 2016

(And you're right, fixed my way this would have tripped URL.domainToASCII().)

@annevk
Copy link
Member

annevk commented Feb 11, 2016

@SimonSapin, any thoughts?

@SimonSapin
Copy link
Contributor Author

This would work, but:

The IPv4 parser already has three different kind of outcomes: return a string (the input unchanged, used when the input is not in the expected format), return failure (used when numbers are too large) and return an IPv4 address. Why introduce a fourth? (Abort with an assertion error, a concept that this spec didn’t have so far.) Doesn’t returning failure for the empty string have the desired effect?

@nlevitt
Copy link

nlevitt commented Dec 2, 2016

Here is another way to fix. Currently this is step 3:

  1. If the last item in parts is the empty string, set syntaxViolationFlag and remove the last item from parts.

It could be changed to:

  1. If last item in parts is the empty string, set syntaxViolationFlag, and if parts has more than one item, remove the last item from parts.

After that, step 6.1 "If part is the empty string, return input." applies.

@annevk
Copy link
Member

annevk commented Dec 9, 2016

That sounds pretty good.

@annevk
Copy link
Member

annevk commented Dec 29, 2016

Note that the main problem here is being fixed by #185, but it seems good to also fix this in-depth as discussed.

annevk added a commit that referenced this issue Dec 29, 2016
This is a theoretical issue once #185 lands, but as discussed it seems
good to address this anyway. Fixes #79.
annevk added a commit that referenced this issue Jan 24, 2017
This is a theoretical issue once #185 lands, but as discussed it seems
good to address this anyway. Fixes #79.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants