Is URL parser supposed to handle https://blah.com? #261

Closed
estark37 opened this issue Mar 2, 2017 · 5 comments

estark37 commented Mar 2, 2017

In walking through the URL parser state machine (https://url.spec.whatwg.org/#concept-basic-url-parser), it seems to me that on an input of "https://blah.com", the parser will return a URL whose scheme is "https" and whose other components (including hostname) are empty.

Here's the sequence of states I see happening:

  1. scheme start state
  2. scheme state
  3. path or authority state
  4. authority state
    => At this point we consume all the remaining characters, appending them to buffer one by one. After consuming the last character, we exit step 11 (the state machine) altogether, because c is EOF, and do not re-execute the authority state. That means we never reach step 2.2 of the authority state, which continues on to the host state.

Am I missing something and/or is this intentional?
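
The reading described above can be sketched in a few lines of Python. This is only an illustrative model, not the spec's algorithm: it covers just the text after `https://`, the names `parse_authority_misread` and `EOF` are hypothetical, and the `/`, `?`, `#` branch copies buffer straight into host instead of rewinding into a separate host state.

```python
EOF = None  # sentinel standing in for the spec's EOF code point


def parse_authority_misread(rest):
    """Model the reading above: leave the loop as soon as c is EOF,
    without running the authority state one last time."""
    buffer = ""
    host = ""
    pointer = 0
    while True:
        c = rest[pointer] if pointer < len(rest) else EOF
        if c is EOF:
            # Exit the state machine; the authority state never sees EOF.
            break
        if c in "/?#":
            # Simplification: hand the buffered text over as the host.
            host, buffer = buffer, ""
            break
        buffer += c
        pointer += 1
    return host, buffer


print(parse_authority_misread("blah.com"))  # ('', 'blah.com')
```

Under this reading the host is only filled in when a delimiter follows it, which is why "https://blah.com" appears to end up with an empty host.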


sleevi commented Mar 2, 2017 via email


estark37 commented Mar 2, 2017

@sleevi the authority state doesn't have a loop within it, though. I read this algorithm as follows: after each execution of step 3 in the authority state, go back to the beginning of step 11 and switch on state again, re-entering the authority state, re-executing step 3, going back to the beginning, and so on. However, after you've consumed the last character, you do not switch on state again but instead carry on to step 12 ("If after a run pointer points to EOF code point, go to the next step.").

Maybe looping within the authority state is the intention (rather than switching on state after each completion of the steps within the state), but if so it's not clear to me from reading the algorithm.


estark37 commented Mar 2, 2017

Further, c/pointer are never incremented within the authority state, so as written we must return to the beginning of step 11 after each execution of the steps within the state in order to make progress.


sleevi commented Mar 2, 2017

You increment after the EOF check, not before. You're correct that you continue looping to the overall step 11, so for /blah.com the trace is:

pointer = /
/ is EOF? No. pointer++
c = *pointer // "b"
Authority parsing
  buffer += c // "b"
pointer = b  // This is the start of the next step 11
b is EOF? No. pointer++
c = *pointer
Authority parsing
  buffer += c  // "bl"
...  // Keep repeating step 11
pointer = m
m is EOF? No. pointer++
c = *pointer  // EOF
Authority parsing
  c is EOF, @ is not set, so:
    pointer -= len(buffer)
    buffer = ""
    c = *pointer  // "b"
    Goto host state
Host parsing
   ...
pointer = l  // This is the start of the next step 11

That's what I meant by step 2.2 of authority parsing: the EOF check happens before the increment, so EOF is still processed by the authority parsing state.
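
A minimal Python sketch of that ordering, under stated assumptions: only the authority and host states are modelled, `@`, ports, and paths are ignored, and `parse_after_slashes`, `EOF`, and `trace` are hypothetical names for illustration. The sketch runs the current state first and checks for EOF afterwards, so it rewinds by `len(buffer) + 1` to compensate for the increment at the end of each iteration.

```python
EOF = None  # sentinel standing in for the spec's EOF code point


def parse_after_slashes(rest):
    """Simplified main loop for the text that follows 'https://'."""
    state = "authority"
    buffer = ""
    host = ""
    pointer = 0
    trace = []  # (state, c) pairs, to show which state sees EOF
    while True:
        c = rest[pointer] if pointer < len(rest) else EOF
        trace.append((state, c))
        if state == "authority":
            if c is EOF or c in "/?#":
                # Rewind so the host state re-reads the buffered text.
                pointer -= len(buffer) + 1
                buffer = ""
                state = "host"
            else:
                buffer += c
        elif state == "host":
            if c is EOF or c in "/?#":
                host = buffer
                break  # later states are not modelled in this sketch
            buffer += c
        # The EOF check runs after the state has executed,
        # and before the pointer is increased.
        if pointer >= len(rest):
            break
        pointer += 1
    return host, trace


host, trace = parse_after_slashes("blah.com")
print(host)  # blah.com
```

In the returned `trace`, the entry `("authority", None)` shows the authority state being run once with EOF, which is the step that hands control to the host state.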


estark37 commented Mar 2, 2017

Ahh, I see now, thanks!

estark37 closed this as completed Mar 2, 2017