Ensure that multiline input doesn't cause Rack::Request to break #1607

Closed
wants to merge 3 commits into from

Contributor

@pvande pvande commented Feb 16, 2020

As written, the AUTHORITY regex intends to match the entire input string, and the sole usage of it expects that it will always return a match. After some consideration, there is a way to feed it input that will not match, causing errors on the subsequent line.

The issue at hand is that . in a regular expression matches everything except newlines by default, meaning that even the most generous regular expression with start- and end-of-string anchors (/\A.*\z/) cannot match a multiline string. The solution, then, is to use the m (multiline) flag on the regex, which allows . to match newlines as well.

This change also adds a test for the behavior of the \Z anchor (end-of-chomped-string) vs the \z anchor (actual end-of-string).
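For illustration, both behaviors described above can be checked in plain Ruby. This is a minimal sketch using a generic pattern and made-up sample strings, not Rack's actual `AUTHORITY` regex.

```ruby
# Sample inputs are invented for this sketch.
multiline = "example.com\nevil.test"
trailing  = "example.com\n"

# Without the m flag, "." stops at a newline, so an anchored pattern
# cannot span a multiline string.
strict_match = /\A.*\z/.match?(multiline)

# With the m flag, "." also matches newlines.
relaxed_match = /\A.*\z/m.match?(multiline)

# \Z tolerates a single trailing newline; \z demands the true end of string.
chomped_end_match = /\Aexample\.com\Z/.match?(trailing)
actual_end_match  = /\Aexample\.com\z/.match?(trailing)

puts "strict: #{strict_match}, relaxed: #{relaxed_match}"
puts "\\Z: #{chomped_end_match}, \\z: #{actual_end_match}"
```

The difference between `\Z` and `\z` matters here because a pattern anchored with `\Z` would quietly accept a host with a trailing newline.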

@jeremyevans
Contributor

I'm not sure if we want to accept hosts that are definitely incorrect. More relaxed parsing than the RFC requires may be fine, but accepting newlines may be going a bit too far. Are there clients that are actually sending such hosts?

Is the current behavior when such hosts are given an exception? If so, that sounds wrong, we should probably just skip that authority in such cases and use the next authority.

@pvande
Contributor Author

pvande commented Feb 16, 2020

I have no idea whether there are clients sending such hosts, and I would consider them broken if there were. (You might be able to fudge it through curl.) It can absolutely happen through the Ruby API, though.

I took this approach because the current implementation assumes no newlines (and throws exceptions otherwise), and it’s backwards-compatible with the previous implementation.

I think that using fallback authorities sounds great in principle, but would break backwards compatibility in subtle ways. As someone who lost over half a day to debugging why a trivial Rails upgrade was causing asset serving to fail (tl;dr: Webpacker depends on Rack::Proxy (neither updated), which uses Rack::Request#host to determine which server to proxy to, and the Rails version bump also bumped Rack, returning different results in my environment), I find maintaining backwards compatibility fairly important, even when it’s “incorrect”.

@ioquatix
Member

How would you send a host header including newlines with HTTP/1? I think it’s invalid.

@pvande
Contributor Author

pvande commented Feb 16, 2020

@ioquatix There's no question that it's invalid, and you're very likely correct that it cannot be sent over HTTP/1. Rack::Request does get used sometimes with data that doesn't come directly from an HTTP request, however, and retaining newlines is important for maintaining compatibility for that case, despite the absurdity.

To be clear, I really only see this as an issue for the 2.x line of releases – it seems completely reasonable to do more/better validations in a new major version, including:

  • IPv6 address contains non-hex characters
  • IPv6 address contains more than one elision (::)
  • IPv6 address segment contains more than four digits
  • IPv6 address contains more than eight segments
  • IPv6 address contains invalid IPv4 address
  • IPv4 address does not contain four segments
  • IPv4 segment contains a number greater than 255
  • DNS name contains a newline
  • DNS name contains a colon

@ioquatix
Member

ioquatix commented Feb 16, 2020

Fair enough, I get your point. Previously the parser was very loose, but now it’s a bit more strict. The key question is whether we have some definition of what a host is, as in Request#host. My internal definition is a string that fits the authority/host part of an absolute URI. I don’t think it was documented or specced clearly, but under this definition we shouldn’t accept newline characters (or any non-printing characters, actually), because #host should return a string suitable for making a URI, as other methods of Request do.

@pvande
Contributor Author

pvande commented Feb 16, 2020 via email

If the address cannot reasonably be used to reconstruct a URI, we return nil.  This includes generous capture patterns for IPv6 and IPv4 addresses, with validation of the captured data happening as a secondary pass.

Additionally added some testing with extreme examples of valid inputs, and a few invalid ones.
@pvande
Contributor Author

pvande commented Feb 17, 2020

The latest commit narrows acceptable inputs somewhat, and performs some additional validations on the machine-facing classes.

  • Any address containing blank characters is rejected.
  • Any address wrapped in square brackets is captured as an IPv6 address.
  • Any address comprised entirely of decimal digits and dots is captured as an IPv4 address.
  • Any other address is captured as a hostname.
  • IPv6 and IPv4 captures are tested against the appropriate regular expression from resolv (IPv6, IPv4), and are rejected if they do not match.
    • This seemed appropriate given the machine-facing nature of these addresses, the relaxed constraints implied by the regular expressions, and the goal of ultimately producing functional URIs.
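The capture-then-validate rules listed above could be sketched roughly as follows. This is an illustration only, not the code from the commit: `classify_host` is a hypothetical helper name, the sketch ignores ports, and it assumes the `Resolv::IPv4::Regex` and `Resolv::IPv6::Regex` constants from Ruby's standard library.

```ruby
require 'resolv'

# Hypothetical helper (not part of Rack): returns :ipv6, :ipv4, :hostname,
# or nil when the input is rejected.
def classify_host(host)
  return nil if host.nil? || host.empty?

  # Any address containing blank characters (including newlines) is rejected.
  return nil if host.match?(/\s/)

  if (bracketed = host.match(/\A\[(.*)\]\z/))
    # Anything in square brackets is captured as IPv6, then validated.
    Resolv::IPv6::Regex.match?(bracketed[1]) ? :ipv6 : nil
  elsif host.match?(/\A[\d.]+\z/)
    # Digits and dots only: captured as IPv4, then validated.
    Resolv::IPv4::Regex.match?(host) ? :ipv4 : nil
  else
    # Everything else is treated as a hostname.
    :hostname
  end
end

puts classify_host("[::1]").inspect
puts classify_host("127.0.0.1").inspect
puts classify_host("256.0.0.1").inspect
puts classify_host("example.com").inspect
```

Keeping the capture patterns generous and validating in a second pass means a malformed address such as `256.0.0.1` is rejected outright instead of falling through to the hostname branch.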

@ioquatix
Member

The URI RFCs specifically allow future versions of IP addresses in square brackets. Do we handle this correctly?

@pvande
Contributor Author

pvande commented Feb 17, 2020 via email

@ioquatix
Member

This in theory looks good to me but I need to do a closer review. Can you please rebase on master?

Contributor

@jeremyevans jeremyevans left a comment

Sorry for the long delay in reviewing this.

I don't think this warrants requiring the entire resolv library if we are just using a couple regexps in it. Can we copy the regexp constants and define them in lib/rack/request.rb, so we don't need to require resolv? It may be possible to inline the IPv6/IPv4 regexps into the AUTHORITY regexp so we don't need the separate checks for it in split_authority. If so, that would be best.

@ioquatix
Member

I agree with Jeremy on this point, and also just for the sake of efficiency, a single Regexp should be enough.

jeremyevans added a commit to jeremyevans/rack that referenced this pull request Mar 21, 2022
Tighten up IPv6 parsing rules using regexp extracted from resolv in stdlib,
simplified to avoid creating additional groups.

Tighten up hostname matching to graphical characters, except square brackets
(so it doesn't overlap with IPv6 parsing).

Avoid unnecessary IPv4 matching, since anything that matches as an IPv4
address would match as a hostname.

Remove unnecessary named group creation.

Don't allow trailing newlines in host names.

Fixes rack#1607

Co-authored-by: Pieter van de Bruggen <pvande@gmail.com>
@jeremyevans
Contributor

I did another review of this and I don't think all of this complexity is needed. Handling IPvFuture seems unnecessary. The odds are low that rack will have to deal with anything beyond IPv6. I don't think it makes sense to allow trailing newlines in hostnames. We don't actually use the IPv4 vs. hostname distinction, so that can be removed. We can tighten the IPv6 parsing by inlining the resolv values, so this doesn't need to depend on resolv at runtime. I submitted a pull request for my recommended approach: #1835
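As a rough illustration of the single-regexp direction described here, the sketch below matches a bracketed IPv6-like literal or a hostname made of graphical characters other than square brackets, followed by an optional port. It is an assumption-laden simplification: `AUTHORITY_SKETCH` and `split_authority_sketch` are hypothetical names, and the bracketed branch is far looser than a pattern extracted from resolv would be.

```ruby
# Hypothetical pattern for illustration; not the regexp used by Rack.
AUTHORITY_SKETCH = /
  \A
  (?<host>
    # Bracketed literal, simplified to hex digits, colons and dots
    \[[0-9A-Fa-f:.]+\]
    |
    # Any run of graphical characters except square brackets
    [[[:graph:]&&[^\[\]]]]*?
  )
  (?::(?<port>\d+))?
  \z
/x

# Returns [host, port] or nil when the authority does not match.
def split_authority_sketch(authority)
  match = AUTHORITY_SKETCH.match(authority)
  return nil unless match

  [match[:host], match[:port]&.to_i]
end

p split_authority_sketch("example.com:8080")
p split_authority_sketch("[::1]:443")
p split_authority_sketch("example.com\n")
```

Because the pattern ends in `\z` and a newline is not a graphical character, a host with an embedded or trailing newline yields `nil` instead of raising on a failed match.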

jeremyevans added a commit that referenced this pull request Mar 21, 2022