IPv4 host parser + site definition seems potentially dangerous. #560
cc @whatwg/security
A bit of a tangent, but re 1): a missing dot is filled in such that the last number is treated as an integer supplying the bytes that are still required. I.e., 127.1 ought to be "127.0.0.1".
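The fill-in behavior described above can be sketched as follows. This is a simplified model of the spec's IPv4 parser that assumes all parts are plain decimal (the real parser also accepts 0x/0X hex and leading-zero octal prefixes):

```python
def expand_ipv4_shorthand(host: str) -> str:
    """Model of the fill-in rule: with fewer than four dotted parts,
    the last number supplies all of the remaining low-order bytes."""
    parts = [int(p) for p in host.split(".")]
    assert 1 <= len(parts) <= 4
    value = 0
    # Every part except the last must fit in one byte.
    for p in parts[:-1]:
        assert p <= 255
        value = value * 256 + p
    # The last part fills the bytes that are still required.
    value = value * (256 ** (5 - len(parts))) + parts[-1]
    assert value <= 0xFFFFFFFF
    return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))
```

Under this model, `expand_ipv4_shorthand("127.1")` yields `"127.0.0.1"`, matching the behavior described above.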
This sounds reasonable to me.
Perhaps it's time to reevaluate more broadly all of the strange ways we accept spelling IP addresses?
It does seem problematic that not all registrable domains can be parsed as a domain. I don't know if that's also problematic for the public suffix (we don't currently define that as a domain, it seems), but I could be convinced. I think that does argue for solving this in the URL parser somehow, rather than in fetch, at least if the last two labels of a domain are numeric. (Although maybe we should block it there too if there are entry points that do not go through the URL parser. Unsure.) Changing the IPv4 parser itself might be an interesting alternative, but also seems quite a bit riskier?
Riskier, but more ideal? Perhaps we could add some telemetry to see how often strange spellings are used. IMO, they're basically only footguns, and it would be great if we could guide developers towards a canonical spelling of everything.
I think even if we managed to be stricter there, we would still want to do something about
@mozfreddyb: You are indeed correct - Chrome behaves that way as well. I must have made a mistake when testing it (or thought I knew how it worked in Chrome without testing). It looks to me like at least some paths in Chrome extract a URL from a site for storage quota, so Chrome, at least, has some real bugs if non-IPv4 hostnames ending in two numbers in [0,255] ever resolve. I'm not an expert in this space, but a solution that rejects these at the URL parsing layer seems safest to me. It would be simpler to reject any hostname with a final number that's not an IPv4 address, making for a more predictable web platform for developers, though a more targeted carveout might result in less fallout. I think the simplest thing to do would be to update the IPv4 parser as follows: ...
... Also, I think the IPv4 number parser has a bug - it should fail if input is the empty string, not return 0. The claim that 0..0x300 is a domain and not an IPv4 address seems to require that behavior. That having been said, I'm completely open to other ways to resolve this.
Are we sure there will never be digit TLDs? (I guess that would already be practically impossible given the status quo as that would allow two label domains with both labels being solely digits, leading to this problem.) |
I think we can make it significantly less-likely that there ever will be purely-digit TLDs by locking in browser behavior that makes them unworkable. :) |
@MattMenke2 currently the host parser is never invoked with the empty string (note the assert in step 4) and I'm pretty sure that therefore the IPv4 parser isn't either. But also, if it was the empty string, why would it not return that and return 0 instead? (Note that the number parser empty string case can only be reached due to input starting with 0x, 0X, or 0.) I don't think your suggestion works as we run the IPv4 parser on all domains. Your change would make all domains return failure basically. The minimum change is probably that whenever we return input now from the IPv4 parser, we first check if input's last label can be parsed as a number (using the IPv4 number parser or something more strict that doesn't do 0x/0X/0 prefixes, though presumably larger than 255 would still be failure) and if so, we return failure instead.
I just realized that my suggestion would not actually catch
@annevk: The IPv4 number parser (not the IPv4 parser) does seem to be invoked on the empty string in some cases - e.g., "...." is split into 5 parts, the final empty one is removed, and then the IPv4 number parser is invoked on each of the 4 remaining empty strings, and it is specified to return 0 for the empty string. No browser actually maps it to "0.0.0.0", but that seems to be what the spec requires.
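The "...." case described here can be modeled like this. It's a sketch of just the relevant steps, with the IPv4 number parser reduced to the decimal and empty-string behavior at issue:

```python
def ipv4_number(part: str) -> int:
    # Per the spec text under discussion, the empty string yields 0
    # rather than failure (the behavior flagged above as a bug).
    if part == "":
        return 0
    return int(part, 10)

def parse_dotted(host: str):
    """Split on dots, drop a single trailing empty part, and run the
    number parser on each remaining part (at most 4)."""
    parts = host.split(".")
    if parts and parts[-1] == "":
        parts.pop()  # a single trailing empty part is removed
    if len(parts) > 4:
        return None  # too many parts: not IPv4
    return [ipv4_number(p) for p in parts]
```

With this model, `"...."` splits into five empty parts, the trailing one is removed, and the number parser then runs on four empty strings, producing `[0, 0, 0, 0]`, i.e. 0.0.0.0.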
I'm not seeing why this is? My step 4 would "return input" on hostnames that don't end in a number, regardless of the number of components. The IPv4 parser is confusingly written, but its "return input" appears intended to result in continuing host parsing as a non-IPv4 host (the host parser ignores the return value if it's not an IPv4 address, though it's not clear that "returning input" means returning something that is not an IPv4 address).
This is exactly what my change to step 4 of the IPv4 parser does, no?
Why? The first step of the parts loop does:
(I did misunderstand though as you were talking about a different empty string, i.e., one resulting from splitting a sequence of dots.)
Yeah, I clearly misread that. Sorry about that. I guess the one change needed to your set of steps is that we also want to return input if the last part (at that point, i.e., after removing an initial last part empty string, if any) is the empty string. And we could also consider moving that into the host parser as being the decider between IPv4 and domain names. (Though I'm not sure how to restructure it nicely yet.)
Ah, you're right about that.
I agree that would make things cleaner. I think we should also consider anything that ends in all numbers IPv4, even if it doesn't parse with the current logic (e.g. 0.0.0.09, which my proposed modification considers a valid non-IPv4 hostname; I don't want that to go from a valid hostname to an invalid one if/when we remove base-8 support). Before writing anything up, I think I'll invest in instrumentation to get data to estimate breakage, which will take a couple of months to roll out. We'll need numbers before I can land any behavior changes, anyway. I'm thinking I'll instrument the DNS layer, rather than the URL parser, to see how often we see such domain names, and how often they resolve, instead of how often they're typed. If they don't resolve to anything, I'm fine with failing with an invalid hostname error instead of a DNS error. May need to instrument both layers, though, as I suppose the Javascript URL API does expose what URLs we consider invalid to the web platform. Thanks so much for the feedback!
I've added the logging to Chrome, and we now have some numbers. The numbers vary a lot by platform. Sorry for the delay - I didn't have as much time for this as I thought I would.

Some numbers by platform. Values are: % of total DNS attempts that are for non-numeric domains with a numeric component, % of those lookups that succeeded, and % of successful lookups that were for hostnames with only a single terminal numeric component (as the options here are ban 2+ such terminal components, ban 1+ of them, or do nothing). Values are rounded a fair bit:

- Linux: 0.0002% of lookups, 60% succeed, 99% one numeric component

For file URLs, where Chrome applies the same logic to the domain portion (on Windows), the numbers are low enough not to be a concern (it's also unclear if other browsers apply the same hostname validation logic there).

So these are surprisingly common on Linux, Android, and OSX. If we only allowed a single terminal numeric component, we would only have minimal breakage on Linux, but we would see real breakage on OSX and Android, though admittedly for under 3 in 1,000,000 DNS lookups. It's also unclear whether these were DNS lookups for actual URL requests or for something else. Note that we don't have numbers for how often URLs have these weird hostnames but don't make it to the DNS resolver (e.g., used by the Javascript URL API, or intercepted by a ServiceWorker and remapped to something more reasonable). Updating the URL spec would affect those use cases as well.

Since this can potentially cause real security issues, I do think we should go ahead and update the spec, but it is a breaking change, and it looks like there are likely some dependencies on the current behavior. Anyhow, thoughts/concerns/feedback welcome, but I plan to write a draft change to the spec. I will, of course, CC everyone who has commented here on that issue.
I've updated my proposed change at #619, feedback welcome. Tests coming. |
Could we limit this change to special schemes? I understand that there are no digit TLDs and the end of a host used for DNS is a TLD, but the host of a URL with a non-special scheme is possibly not going to be used for DNS and could have legitimate use for digits at the end. |
@achristensen07: The spec already skips IPv4 parsing for non-"special" schemes, no? (Though not IPv6 parsing) Text: "If isNotSpecial is true, then return the result of opaque-host parsing input." Special schemes are ftp, http, https, ws, wss, and file. |
Indeed! Sorry about the noise. |
DNS names and IPv4 literals are a bit of a mess. RFC 1738 said:

> The rightmost domain label will never start with a digit, though,
> which syntactically distinguishes all domain names from the IP
> addresses.

This is sensible, but was replaced by RFC 3986 with a "first-match-wins" algorithm:

> If host matches the rule for IPv4address, then it should be
> considered an IPv4 address literal and not a reg-name.

The WHATWG URL spec originally mirrored this, but with a *much* more permissive IPv4 parser, to reflect reality. ECH originally cited this, to be sure we don't accidentally mix up IPv4 literals and DNS names.

The change in RFC 3986 was a mistake and introduced a security problem. The RFC 1738 formulation said a.1.2.3.4 was a syntax error, neither a DNS name nor an IPv4 literal. RFC 3986, along with the analog copied into WHATWG, says it's a DNS name because the IPv4 parser rejects it. But this means:

- a.1.2.3.4 is a DNS name
- 1.2.3.4 is an IPv4 literal

See also whatwg/url#560

This means DNS names are no longer closed under suffixes! This causes problems for any systems (e.g. eTLD+1s) that need to suffix a DNS name.

WHATWG has since effectively reverted the RFC 3986 mistake. After whatwg/url@ab0e820, the rule in WHATWG is: if the host ends in a number (all digits, or 0x + all hex), parse it as an IPv4 literal. Otherwise it's a DNS name. This rule is finally simple enough that we can just lift it into ECH. (If we want, we can simplify this by just reverting to the RFC 1738 formulation, which should be safe in this context. It is a pity RFC 3986 made a mess here.)
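The post-ab0e820 rule, as summarized above, can be sketched compactly. This is a model of the "ends in a number" check only, not the full IPv4 parser, and it assumes a single trailing dot has already the usual treatment (it is dropped before looking at the last label):

```python
import re

def ends_in_a_number(host: str) -> bool:
    """Model of the rule summarized above: the host is treated as an
    IPv4 literal when its last non-empty label is all ASCII digits,
    or is '0x'/'0X' followed by ASCII hex digits (including a bare
    '0x'). Otherwise it is a DNS name."""
    parts = host.split(".")
    if parts and parts[-1] == "":
        parts.pop()  # ignore a single trailing dot
    if not parts:
        return False
    last = parts[-1]
    if last and all(c in "0123456789" for c in last):
        return True
    return re.fullmatch(r"0[xX][0-9A-Fa-f]*", last) is not None
```

Under this model, `a.1.2.3.4` ends in a number and is therefore sent to the IPv4 parser, where it fails, so it is no longer a valid DNS name, restoring closure under suffixes.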
Per the IPv4 parser, http://127.0.0.127.1 is a valid, non-IPv4 host. Similarly, http://foo.0.127.1 is a valid, non-IPv4 host. Per the Public Suffix List spec, the prevailing rule that should apply to them (since they have no matches in the list) is "*", so their eTLD is just "1", and so their site (defined in the fetch spec) is ("http", "127.1").
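The eTLD computation described here can be sketched as follows. This is a simplified model of just the PSL fallback path (no actual list lookup, and no handling of single-label hosts), where the function name is mine, not from the spec:

```python
def registrable_domain_no_psl_match(host: str) -> str:
    """Model of the PSL fallback: when a host matches nothing in the
    list, the prevailing rule is "*", so the public suffix (eTLD) is
    the last label and the registrable domain (eTLD+1), which the
    fetch spec's site uses, is the last two labels."""
    labels = host.split(".")
    assert len(labels) >= 2
    return ".".join(labels[-2:])
```

So both hosts above collapse to the same registrable domain, "127.1", and hence the same site, ("http", "127.1"), despite being unrelated names.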
This is strange for two reasons:
Now, admittedly, trying to resolve "127.0.0.0.1" at the DNS layer should generally fail, but the situation here seems a bit troubling.
I ran into this because Chromium is trying to store the Network Partition Key's site | origin as an origin object, but a fuzzer recently discovered an assert failure when creating an origin with a host of "0.1" (which is recognized as not being in the canonical form, "0.0.0.1"). Obviously, we could just make it hold an origin | (scheme + hostname) and adhere to the spec; the described behavior just doesn't seem that safe to me.
I'd rather just reject a host that ends in a number but isn't a valid IPv4 address, so I'm filing this to see if anyone else has any thoughts on it.