Skip to content

Conversation

Thomspoon
Copy link

@Thomspoon Thomspoon commented Apr 19, 2017

Fixed the issue with my last pull request.

The problem with the parse_userinfo right now is that it breaks for a fragment delimiter or query string. That's all fine and dandy for further parsing functions, however, this check shouldn't be done with looking for the username and password, because these values can and will be used in passwords.

For example: let url = Url::parse("mysql://root:pass#word@example.com");

Results in: thread 'main' panicked at 'called Result::unwrap()on anErr value: InvalidPort' Due to the early break at the # sign.

After the change, the result is as expected: mysql://root:pass%23word@example.com/

Unless I'm mistaken, a fragment or query will never appear this early on in a url.


This change is Reviewable

@Thomspoon Thomspoon changed the title Fixed userinfo breaking with complex passwords Fixed parse_userinfo breaking with complex passwords Apr 19, 2017
@SimonSapin
Copy link
Member

this check shouldn't be done […]

Why shouldn’t it?

This code implements the algorithm specified at https://url.spec.whatwg.org/#url-parsing. Interoperability with other implementations is important. If you think the algorithm should be changed, it should be changed in the specification first. The specification’s repository is https://github.com/whatwg/url. When asking for a change, expect to be asked for justification of why this change is desirable and why it won’t break existing content.

@Thomspoon
Copy link
Author

Why shouldn’t it?

Can you think of an instance where either of those elements would be used in the user_info section of a url?

It's obvious they cannot be because if the algorithm is identical, then the algorithm would have a bug. It's not uncommon to have an octothorp in a password, so I don't understand how this isn't more of an issue.

@SimonSapin
Copy link
Member

a://b:c#d@e doesn’t have a # in its password, it has @ in its fragment. # is a delimiter in URLs like many other ASCII punctuation characters. If you want to include # without it being a delimiter, you need to percent-encode it as %23. That’s what percent-encoding is for. This is similar to using @ in a username, which is common when some system uses email addresses as usernames. You also have to percent-encode it.

@Thomspoon
Copy link
Author

Thomspoon commented Apr 19, 2017

Testing that address with your code throws a panic.

extern crate url;

use url::{Url, percent_encoding};

fn main() {
    let url = Url::parse("a://b:c#d@e");
    println!("{:?}", url.unwrap());
}

thread 'main' panicked at 'called Result::unwrap()on anErr value: InvalidPort'

The url a://b:c is not a valid address, so fragmenting to a section of it isn't valid.

In my opinion, the specification at https://url.spec.whatwg.org/#url-parsing is a little ambiguous, and in the relative state section, they should delineate that those flags are looking at the end of a url. They don't say it, but /, ?, # would only be found at the end for fragmentation, queries, and pathing.

@SimonSapin
Copy link
Member

In my opinion, the specification at https://url.spec.whatwg.org/#url-parsing is a little ambiguous

Ambiguity is a specification bug that needs to be fixed there.

@Thomspoon
Copy link
Author

Issue created: whatwg/url#294

@nox
Copy link
Contributor

nox commented Apr 19, 2017

@Thomspoon Why aren't you just passing %23?

@Thomspoon
Copy link
Author

Thomspoon commented Apr 19, 2017

That seems like a workaround, shouldn't be a permanent fix. I don't agree with the result of the issue I created, but I guess I don't have a choice.

@nox
Copy link
Contributor

nox commented Apr 19, 2017

What would http://foo:80#bar@blah mean if # was allowed unescaped in a password? The change you want would make the grammar ambiguous.

@Thomspoon
Copy link
Author

Thomspoon commented Apr 19, 2017

The @ is the delimiter that specifies if it should be a scheme://username:password@domain:port or scheme://domain:port#fragment.

From RFC1738:

3.1. Common Internet Scheme Syntax

While the syntax for the rest of the URL may vary depending on the
particular scheme selected, URL schemes that involve the direct use
of an IP-based protocol to a specified host on the Internet use a
common syntax for the scheme-specific data:

    //<user>:<password>@<host>:<port>/<url-path>

Some or all of the parts ":@", ":",
":", and "/" may be excluded. The scheme specific
data start with a double slash "//" to indicate that it complies with
the common Internet scheme syntax. The different components obey the
following rules:

user
    An optional user name. Some schemes (e.g., ftp) allow the
    specification of a user name.

password
    An optional password. If present, it follows the user
    name separated from it by a colon.

The user name (and password), if present, are followed by a
commercial at-sign "@". Within the user and password field, any ":",
"@", or "/" must be encoded.

Doesn't mention anything about the limitations of the password itself.

@nox
Copy link
Contributor

nox commented Apr 19, 2017

This RFC is updated by RFC 3986, but we follow the URL standard anyway.

Note that the fact that it doesn't say anything about which characters are allowed or not makes it ambiguous to parse http://foo:80#bar@blah, as I stated earlier. That's one of the many reasons why RFC 3986 and then the URL standard exist.

@Thomspoon
Copy link
Author

Well, the guys over at the URL Standard don't seem to think a retroactive change would be beneficial, so I guess this is a lost cause. I'll just accept the fact that I have to use percent-encoding 🤕

@nox
Copy link
Contributor

nox commented Apr 19, 2017

Thanks for your contribution anyway!

@nox nox closed this Apr 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants