IPv4 host parser + site definition seems potentially dangerous. #560
cc @whatwg/security
A bit of a tangent, but re 1): a missing dot is filled in such that the last number is treated as an integer supplying the bytes that are still required. I.e., 127.1 ought to be "127.0.0.1".
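The fill-in behavior described above can be sketched as follows. This is a simplified model of the spec's IPv4 parser that assumes all parts are plain decimal (the real parser also accepts 0x/0X hex and leading-zero octal prefixes):

```python
def expand_ipv4_shorthand(host: str) -> str:
    """Model of the fill-in rule: with fewer than four dotted parts,
    the last number supplies all of the remaining low-order bytes."""
    parts = [int(p) for p in host.split(".")]
    assert 1 <= len(parts) <= 4
    value = 0
    # Every part except the last must fit in one byte.
    for p in parts[:-1]:
        assert p <= 255
        value = value * 256 + p
    # The last part fills the bytes that are still required.
    value = value * (256 ** (5 - len(parts))) + parts[-1]
    assert value <= 0xFFFFFFFF
    return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))
```

Under this model, `expand_ipv4_shorthand("127.1")` yields `"127.0.0.1"`, matching the behavior described above.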
This sounds reasonable to me.
Perhaps it's time to reevaluate more broadly all of the strange ways we accept spelling IP addresses?
It does seem problematic that not all registrable domains can be parsed as a domain. I don't know if that's also problematic for the public suffix (we don't currently define that as a domain, it seems), but I could be convinced. I think that does argue for solving this in the URL parser somehow, rather than in fetch, at least if the last two labels of a domain are numeric. (Although maybe we should block it there too if there are entry points that do not go through the URL parser. Unsure.) Changing the IPv4 parser itself might be an interesting alternative, but also seems quite a bit riskier?
Riskier, but more ideal? Perhaps we could add some telemetry to see how often strange spellings are used. IMO, they're basically only footguns, and it would be great if we could guide developers towards a canonical spelling of everything.
I think even if we managed to be stricter there, we would still want to do something about
@mozfreddyb: You are indeed correct - Chrome behaves that way as well. I must have made a mistake when testing it (or thought I knew how it worked in Chrome without testing). It looks to me like at least some paths in Chrome extract a URL from a site for storage quota, so Chrome, at least, has some real bugs if non-IPv4 hostnames ending in two numbers in [0,255] ever resolve. I'm not an expert in this space, but a solution that rejects these at the URL parsing layer seems safest to me. It would be simpler to reject any hostname with a final number that's not an IPv4 address, making for a more predictable web platform for developers, though a more targeted carveout might result in less fallout. I think the simplest thing to do would be to update the IPv4 parser as follows: ...
... Also, I think the IPv4 number parser has a bug - it should fail if input is the empty string, not return 0. The claim that 0..0x300 is a domain and not an IPv4 address seems to require that behavior. That having been said, I'm completely open to other ways to resolve this.
Are we sure there will never be digit TLDs? (I guess that would already be practically impossible given the status quo as that would allow two label domains with both labels being solely digits, leading to this problem.) |
I think we can make it significantly less-likely that there ever will be purely-digit TLDs by locking in browser behavior that makes them unworkable. :) |
@MattMenke2 currently the host parser is never invoked with the empty string (note the assert in step 4) and I'm pretty sure that therefore the IPv4 parser isn't either. But also, if it was the empty string, why would it not return that and return 0 instead? (Note that the number parser empty string case can only be reached due to input starting with 0x, 0X, or 0.) I don't think your suggestion works as we run the IPv4 parser on all domains. Your change would make all domains return failure basically. The minimum change is probably that whenever we return input now from the IPv4 parser, we first check if input's last label can be parsed as a number (using the IPv4 number parser or something more strict that doesn't do 0x/0X/0 prefixes, though presumably larger than 255 would still be failure) and if so, we return failure instead.
I just realized that my suggestion would not actually catch
@annevk: The IPv4 number parser (not the IPv4 parser) does seem to be invoked on the empty string in some cases - e.g., "...." is split into 5 parts, the final empty one is removed, and then the IPv4 number parser is invoked on each of the 4 remaining empty strings, and it is specified to return 0 for the empty string. No browser actually maps it to "0.0.0.0", but that seems to be what the spec requires.
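The "...." case described here can be modeled like this. It's a sketch of just the relevant steps, with the IPv4 number parser reduced to the decimal and empty-string behavior at issue:

```python
def ipv4_number(part: str) -> int:
    # Per the spec text under discussion, the empty string yields 0
    # rather than failure (the behavior flagged above as a bug).
    if part == "":
        return 0
    return int(part, 10)

def parse_dotted(host: str):
    """Split on dots, drop a single trailing empty part, and run the
    number parser on each remaining part (at most 4)."""
    parts = host.split(".")
    if parts and parts[-1] == "":
        parts.pop()  # a single trailing empty part is removed
    if len(parts) > 4:
        return None  # too many parts: not IPv4
    return [ipv4_number(p) for p in parts]
```

With this model, `"...."` splits into five empty parts, the trailing one is removed, and the number parser then runs on four empty strings, producing `[0, 0, 0, 0]`, i.e. 0.0.0.0.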
I'm not seeing why this is? My step 4 would "return input" on hostnames that don't end in a number, regardless of the number of components. The IPv4 parser is confusingly written, but its "return input" appears intended to result in continuing host parsing as a non-IPv4 host (the host parser ignores the return value if it's not an IPv4 address, though it's not clear that "returning input" means returning something that is not an IPv4 address).
This is exactly what my change to step 4 of the IPv4 parser does, no?
Why? The first step of the parts loop does:
(I did misunderstand though as you were talking about a different empty string, i.e., one resulting from splitting a sequence of dots.)
Yeah, I clearly misread that. Sorry about that. I guess the one change needed to your set of steps is that we also want to return input if the last part (at that point, i.e., after removing an initial last part empty string, if any) is the empty string. And we could also consider moving that into the host parser as being the decider between IPv4 and domain names. (Though I'm not sure how to restructure it nicely yet.)
Ah, you're right about that.
I agree that would make things cleaner. I think we should also consider anything that ends in all numbers IPv4, even if it doesn't parse with the current logic (e.g. 0.0.0.09, which my proposed modification considers a valid non-IPv4 hostname; I don't want that to go from a valid hostname to an invalid one if/when we remove base-8 support). Before writing anything up, I think I'll invest in instrumentation to get data to estimate breakage, which will take a couple of months to roll out. We'll need numbers before I can land any behavior changes, anyway. I'm thinking I'll instrument the DNS layer, rather than the URL parser, to see how often we see such domain names, and how often they resolve, instead of how often they're typed. If they don't resolve to anything, I'm fine with failing with an invalid hostname error instead of a DNS error. May need to instrument both layers, though, as I suppose the Javascript URL API does expose what URLs we consider invalid to the web platform. Thanks so much for the feedback!
I've added the logging to Chrome, and we now have some numbers. The numbers vary a lot by platform. Sorry for the delay - I didn't have as much time for this as I thought I would.

Some numbers by platform. Values are: % of total DNS attempts that are for non-numeric domains with a numeric component, % of those lookups that succeeded, and % of successful lookups that were for hostnames with only a single terminal numeric component (as the options here are ban 2+ such terminal components, ban 1+ of them, or do nothing). Values are rounded a fair bit:

- Linux: 0.0002% of lookups, 60% succeed, 99% one numeric component

For file URLs, where Chrome applies the same logic to the domain portion (on Windows), the numbers are low enough not to be a concern (it's also unclear if other browsers apply the same hostname validation logic there).

So these are surprisingly common on Linux, Android, and OSX. If we only allowed a single terminal numeric component, we would only have minimal breakage on Linux, but we would see real breakage on OSX and Android, though admittedly for under 3 in 1,000,000 DNS lookups. It's also unclear whether these were DNS lookups for actual URL requests or for something else. Note that we don't have numbers for how often URLs have these weird hostnames but don't make it to the DNS resolver (e.g., used by the Javascript URL API, or intercepted by a ServiceWorker and remapped to something more reasonable). Updating the URL spec would affect those use cases as well.

Since this can potentially cause real security issues, I do think we should go ahead and update the spec, but it is a breaking change, and it looks like there are likely some dependencies on the current behavior. Anyhow, thoughts/concerns/feedback welcome, but I plan to write a draft change to the spec. I will, of course, CC everyone who has commented here on that issue.
I've updated my proposed change at #619, feedback welcome. Tests coming. |
Could we limit this change to special schemes? I understand that there are no digit TLDs and the end of a host used for DNS is a TLD, but the host of a URL with a non-special scheme is possibly not going to be used for DNS and could have legitimate use for digits at the end. |
@achristensen07: The spec already skips IPv4 parsing for non-"special" schemes, no? (Though not IPv6 parsing) Text: "If isNotSpecial is true, then return the result of opaque-host parsing input." Special schemes are ftp, http, https, ws, wss, and file. |
Indeed! Sorry about the noise. |
DNS names and IPv4 literals are a bit of a mess. RFC 1738 said:

> The rightmost domain label will never start with a digit, though,
> which syntactically distinguishes all domain names from the IP
> addresses.

This is sensible, but was replaced by RFC 3986 with a "first-match-wins" algorithm:

> If host matches the rule for IPv4address, then it should be
> considered an IPv4 address literal and not a reg-name.

The WHATWG URL spec originally mirrored this, but with a *much* more permissive IPv4 parser, to reflect reality. ECH originally cited this, to be sure we don't accidentally mix up IPv4 literals and DNS names.

The change in RFC 3986 was a mistake and introduced a security problem. The RFC 1738 formulation said a.1.2.3.4 was a syntax error, neither a DNS name nor an IPv4 literal. RFC 3986, along with the analog copied into WHATWG, says it's a DNS name because the IPv4 parser rejects it. But this means:

- a.1.2.3.4 is a DNS name
- 1.2.3.4 is an IPv4 literal

See also whatwg/url#560

This means DNS names are no longer closed under suffixes! This causes problems for any systems (e.g. eTLD+1s) that need to suffix a DNS name.

WHATWG has since effectively reverted the RFC 3986 mistake. After whatwg/url@ab0e820, the rule in WHATWG is: if the host ends in a number (all digits, or 0x + all hex), parse it as an IPv4 literal. Otherwise it's a DNS name. This rule is finally simple enough that we can just lift it into ECH. (If we want, we can simplify this by just reverting to the RFC 1738 formulation, which should be safe in this context. It is a pity RFC 3986 made a mess here.)
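The post-ab0e820 rule, as summarized above, can be sketched compactly. This is a model of the "ends in a number" check only, not the full IPv4 parser, and it assumes a single trailing dot has already the usual treatment (it is dropped before looking at the last label):

```python
import re

def ends_in_a_number(host: str) -> bool:
    """Model of the rule summarized above: the host is treated as an
    IPv4 literal when its last non-empty label is all ASCII digits,
    or is '0x'/'0X' followed by ASCII hex digits (including a bare
    '0x'). Otherwise it is a DNS name."""
    parts = host.split(".")
    if parts and parts[-1] == "":
        parts.pop()  # ignore a single trailing dot
    if not parts:
        return False
    last = parts[-1]
    if last and all(c in "0123456789" for c in last):
        return True
    return re.fullmatch(r"0[xX][0-9A-Fa-f]*", last) is not None
```

Under this model, `a.1.2.3.4` ends in a number and is therefore sent to the IPv4 parser, where it fails, so it is no longer a valid DNS name, restoring closure under suffixes.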
Per the IPv4 parser, http://127.0.0.127.1 is a valid, non-IPv4 host. Similarly, http://foo.0.127.1 is a valid, non-IPv4 host. Per the Public Suffix List spec, the prevailing rule that should apply to them (since they have no matches in the list) is "*", so their eTLD is just "1", and so their site (defined in the fetch spec) is ("http", "127.1").
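The eTLD computation described here can be sketched as follows. This is a simplified model of just the PSL fallback path (no actual list lookup, and no handling of single-label hosts), where the function name is mine, not from the spec:

```python
def registrable_domain_no_psl_match(host: str) -> str:
    """Model of the PSL fallback: when a host matches nothing in the
    list, the prevailing rule is "*", so the public suffix (eTLD) is
    the last label and the registrable domain (eTLD+1), which the
    fetch spec's site uses, is the last two labels."""
    labels = host.split(".")
    assert len(labels) >= 2
    return ".".join(labels[-2:])
```

So both hosts above collapse to the same registrable domain, "127.1", and hence the same site, ("http", "127.1"), despite being unrelated names.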
This is strange for two reasons:
Now, admittedly, trying to resolve "127.0.0.0.1" at the DNS layer should generally fail, but the situation here seems a bit troubling.
I ran into this because Chromium is trying to store the Network Partition Key's site | origin as an origin object, but a fuzzer recently discovered an assert failure when creating an origin with a host of "0.1" (which is recognized as not being in the canonical form, "0.0.0.1"). Obviously, we could just make it hold an origin | (scheme + hostname) and adhere to the spec; the described behavior just doesn't seem that safe to me.
I'd rather just reject a host that ends in a number but isn't a valid IPv4 address, so I'm filing this to see if anyone else has any thoughts on it.