Header spoofing and CGI header name encoding. #13

GrahamDumpleton · 2014-10-15T01:11:55Z

As originally raised in:

https://mail.python.org/pipermail/web-sig/2014-October/005336.html

the CGI header name encoding convention can leave a WSGI application open to header spoofing problems.

This in particular will be an issue where a proxy is setting some special custom headers related to authentication or which includes information about a client. For example, using X-Forwarded-* headers.

The spoofing problem is because of the CGI rule around how header names are converted. That is:

Meta-variables with names beginning with "HTTP_" contain values read
from the client request header fields, if the protocol used is HTTP.
The HTTP header field name is converted to upper case, has all
occurrences of "-" replaced with "" and has "HTTP" prepended to
give the meta-variable name. The header data can be presented as
sent by the client, or can be rewritten in ways which do not change
its semantics. If multiple header fields with the same field-name
are received then the server MUST rewrite them as a single value
having the same semantics. Similarly, a header field that spans
multiple lines MUST be merged onto a single line. The server MUST,
if necessary, change the representation of the data (for example, the
character set) to be appropriate for a CGI meta-variable.

So this means that X-Forwarded-For is translated to HTTP_X_FOWARDED_FOR. The problem is that if a client itself sends X_Forwarded_For, then it would also map to the same thing.

By the rules above the two values would be concatenated if a proxy set one and the client sent the other, usually separating the values with a comma. If you are attempting to block certain clients based on this, then the header value could be poisoned and cause problems for such a scheme.

Apache 2.4, and possibly recent versions of nginx as well, will discards headers when translating to CGI names if they contain characters other than alpha numerics and dashes.

For Apache 2.4, mod_wsgi at least would therefore be protected against this issue, although latest version of mod_wsgi also applies same strategy to Apache 2.2 as well to avoid problems.

Various other WSGI servers do not currently enforce such restrictions and so would be affected.

An updated WSGI specification should perhaps say something specific about this issue and say that WSGI servers should avoid the problem by applying the same restriction.

Enforcing such a restriction may have the consequence of blocking some WSGI applications from working which rely on custom headers from working if they fall afoul of the restriction though.

rbtcollins · 2014-10-15T01:14:21Z

Perhaps we can have an unnormalised section in the environ - e.g. raw_headers[] which has the headers in whatever form the gateway received them, to permit that case?

chadwhitacre · 2015-01-13T16:36:52Z

We could also break tradition and not use CGI header encoding at all. Is there a positive reason to keep CGI header encoding apart from tradition?

gvanrossum · 2015-01-13T16:42:16Z

+1. Header encoding is a hack that's nearly as old as Python. It's rife
with issues. Let's just use case-insensitive actual header names. And some
object with attributes for the other standardized stuff. (Custom stuff can
use a subclass or an extension API.)

On Tue, Jan 13, 2015 at 8:36 AM, Chad Whitacre notifications@github.com
wrote:

We could also break tradition and not use CGI header encoding at all. Is
there a positive reason to keep CGI header encoding apart from tradition?

—
Reply to this email directly or view it on GitHub
#13 (comment)
.

--Guido van Rossum (python.org/~guido)

chadwhitacre · 2015-01-13T16:44:45Z

I propose a tuple (or list) of two-tuples of bytestrings for headers:

( (b'Host', b'www.example.com')
, (b'Accept', b'text/plain')
 )

Per #16 these should contain all the bytes that were received on the wire, in the order they were received.

chadwhitacre · 2015-01-13T16:48:43Z

Let's just use case-insensitive actual header names. And some object with attributes for the other standardized stuff. (Custom stuff can use a subclass or an extension API.)

This sounds like the turf currently occupied by the various request objects. Are we intending to tackle "one true request object" with WSGI NG? In proposing a tuples-of-bytestrings data structure for headers, I was assuming that we would still have a request object ecosystem.

gvanrossum · 2015-01-13T16:49:22Z

Make it a list; everyone and their aunt wants to manipulate this so an
immutable sequence is just a nuisance (and at works an O(N) liability).
Possibly use a namedtuple for the values with header and value attributes.
Describe how you handle whitespace (esp. around continuation lines).

On Tue, Jan 13, 2015 at 8:44 AM, Chad Whitacre notifications@github.com
wrote:

I propose a tuple (or list) of two-tuples of bytestrings for headers:

( (b'Host': b'www.example.com')
, (b'Accept': b'text/plain')
)

Per #16 #16 these
should contain all the bytes that were received on the wire, in the order
they were received.

—
Reply to this email directly or view it on GitHub
#13 (comment)
.

--Guido van Rossum (python.org/~guido)

gvanrossum · 2015-01-13T16:50:17Z

It could be Request objects all the way down. :-)

On Tue, Jan 13, 2015 at 8:48 AM, Chad Whitacre notifications@github.com
wrote:

Let's just use case-insensitive actual header names. And some object with
attributes for the other standardized stuff. (Custom stuff can use a
subclass or an extension API.)

This sounds like the turf currently occupied by the various request
objects. Are we intending to tackle "one true request object" with WSGI NG?
In proposing a tuples-of-bytestrings data structure for headers, I was
assuming that we would still have a request object ecosystem.

—
Reply to this email directly or view it on GitHub
#13 (comment)
.

--Guido van Rossum (python.org/~guido)

chadwhitacre · 2015-01-13T17:12:19Z

Describe how you handle whitespace (esp. around continuation lines).

I'm trying to think of how to do this while remaining lossless (#16), and it's causing me to think about why I care about losslessness in the first place. Because at some point, yes: we have to jettison that information. Which point?

If we try to preserve whitespace then we'd have something like:

[(b'Foo', b' bar\n baz')]

But to what end? If somebody cares about the whitespace in a header (they're researching usage of whitespace in HTTP headers? they're debugging some esoteric bug that I can't imagine?) then I suppose they could/would just drop below WSGI anyway. WSGI is for server and framework authors, not researchers. I guess that removing whitespace from tuples-of-bytestrings would result in:

[(b'Foo', b'bar baz')]

chadwhitacre · 2015-01-13T17:36:06Z

It could be Request objects all the way down. :-)

I took a quick look over at Node-land for comparison: they have a core IncomingMessage prototype that gets passed to http.Server callbacks as request, and then frameworks wrap that in their own request or req objects: Express, Sails (wraps Express, actually), Total, Koa. Fundamentally, though, every framework is using the same request object.

I guess the same is true in Python-land, we just call our base object environ for some reason. :-P

s/environ/request/g reticketed as #17.

chadwhitacre · 2015-01-13T17:46:14Z

Also: a list of [named]tuples of bytestrings would be parallel to how current WSGI handles response headers. I think we should keep them symmetrical to reduce mental overhead.

chadwhitacre · 2015-01-13T17:50:49Z

Frameworks generally implement headers objects as mapping types, and lists of tuples of bytestrings (LTBs?) are perfect for feeding to mapping types.

gvanrossum · 2015-01-13T18:15:40Z

We're in total agreement, right?

On Tue, Jan 13, 2015 at 9:50 AM, Chad Whitacre notifications@github.com
wrote:

Frameworks generally implement headers objects as mapping types, and lists
of tuples of bytestrings (LTBs?) are perfect for feeding to mapping types.

—
Reply to this email directly or view it on GitHub
#13 (comment)
.

--Guido van Rossum (python.org/~guido)

chadwhitacre · 2015-01-13T21:36:07Z

We're in total agreement, right?

Pretty close, anyway. :-) I had to change from hearing "request object" as "one true request" to "requests all the way down". I like "requests all the way down" (#17).

As to the API of said WSGI NG request object, I see some convergence on a list of namedtuples for headers, though earlier you said, "Let's just use case-insensitive actual header names," which indicates the sort of mapping that I've been thinking is best implemented at the framework layer. On the other hand, to the extent that a given WSGI server wants to inspect request headers (not just parse them), it'd seem a shame to have multiple implementations of a case-insensitive header mapping (one at the server layer, another at the framework layer).

Maybe the rubric for designing the base request object is: what request API do server authors need?

Do existing server implementations inspect request headers?

gvanrossum · 2015-01-13T21:47:18Z

Is the Pope Catholic? :-)

TBH I'm not sure I'm fully with the program where it comes to the
distinction between server and framework. (And what is middleware?)

On Tue, Jan 13, 2015 at 1:36 PM, Chad Whitacre notifications@github.com
wrote:

We're in total agreement, right?

Pretty close, anyway. :-) I had to change from hearing "request object" as
"one true request" to "requests all the way down". I like "requests all the
way down" (#17 #17).

As to the API of said WSGI NG request object, I see some convergence on a
list of namedtuples for headers, though earlier you said, "Let's just use
case-insensitive actual header names," which indicates the sort of mapping
that I've been thinking is best implemented at the framework layer. On the
other hand, to the extent that a given WSGI server wants to inspect
request headers (not just parse them), it'd seem a shame to have multiple
implementations of a case-insensitive header mapping (one at the server
layer, another at the framework layer).

Maybe the rubric for designing the base request object is: what request
API do server authors need?

Do existing server implementations inspect request headers?

—
Reply to this email directly or view it on GitHub
#13 (comment)
.

--Guido van Rossum (python.org/~guido)

chadwhitacre · 2015-01-13T22:07:29Z

Do existing server implementations inspect request headers?

Yes. Cursory inspection of a couple implementations indicates that it's mostly around body parsing, and I'm not sure mod_wsgi counts since it's in C, so it wouldn't be using a Python request API anyway (or would it?).

https://github.com/benoitc/gunicorn/blob/cc1ddf16e285460c409fe381843cae87022d9644/gunicorn/http/message.py#L102-L108
https://github.com/benoitc/gunicorn/blob/cc1ddf16e285460c409fe381843cae87022d9644/gunicorn/http/message.py#L126-L133

https://github.com/GrahamDumpleton/mod_wsgi/blob/ba67c740005ff657311809affbc3d820e86e96e1/src/server/mod_wsgi.c#L6436-L6516
https://github.com/GrahamDumpleton/mod_wsgi/blob/ba67c740005ff657311809affbc3d820e86e96e1/src/server/mod_wsgi.c#L11137-L11175

chadwhitacre · 2015-01-13T22:12:12Z

I'm not sure I'm fully with the program where it comes to the distinction between server and framework.

Servers: mod_wsgi, uwsgi, gunicorn, etc.
Frameworks: Django, Flask, Zope, Pyramid, etc.

It's not a logically necessary distinction, but it seems to be pretty evolutionarily stable.

(And what is middleware?)

I don't think middleware panned out as originally promised (an ecosystem of interchangeable components), but I think the chainability of WSGI has been a plus for scabbing together multiple frameworks, e.g.

Is this chicken scratch readable at all?

rbtcollins · 2015-01-13T23:02:38Z

Continuation lines don't exist in HTTP/2. (Continuation frames are different). Since the point of the WSGI and WSGI-NG is to be the abstraction between servers and applications, including stuff like that would be harmful.

OTOH the mangling is certainly annoying, and I agree that we can and should separate out the header fields.

https://tools.ietf.org/html/rfc7230#section-3.2 - fieldnames are just a case insensitive token, for which the BNF is:
token = 1*tchar

 tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                / DIGIT / ALPHA
                ; any VCHAR, except delimiters

(delimiters are (DQUOTE and "(),/:;<=>?@[]{}").)

HTTP_FOO is case insensitive (folded to uppercase) and, with the HTTP_ prefix added. But CGI goes further as Graham notes, and that is where the issues arise IMO - and why some servers are filtering out valid headers.

We have a choice here: we could take the minimum set that any server will accept, or we could take the RFC specified set, and accept that some servers will choose not to pass on that full set.

The latter is better I think - it preserves choice for later should we need it, at little cost.

So the mangling has three aspects: case, non-alpha characters, and the HTTP_ prefix.

If we move the header fields to a separate data structure (whether a sub component of environ, or a new parameter or whatever) we can drop the HTTP_ prefix without confusing anything.

AFAIK there isn't a case insensitive string type in Python, and I kindof think that adding one for this would be overkill. We either then cause lots of 'header.lower() == "location"' comparisons into code, or we continue to do case normalisation. I think we should continue to do case normalisation, either upper or lower case because of this.

non-alpha mangling however was only needed for passing via the process environment block (and that was limited to the variable names that can be put in there sanely) - e.g. & in a name would be -bad-. As we're in-python we can drop that entirely.

As far as the structure goes, I'm personally fond of a dict of lines:

{'location': [b'foo', b'bar',b'quux']}

We should talk about value encoding, but thats a separate issue to this.

chadwhitacre · 2015-01-13T23:57:07Z

HTTP_FOO is case insensitive (folded to uppercase)

Ah, right, of course. Derp.

I think we should continue to do case normalisation, either upper or lower case

+1

As far as the structure goes, I'm personally fond of a dict of lines:

Sure, +0, as long as the value is always a list instead of sometimes a list and sometimes a bytestring. ;-)

We should talk about value encoding, but thats a separate issue to this.

Reticketed as #18.

chadwhitacre mentioned this issue Jan 13, 2015

call it request, not environ #17

Open

chadwhitacre mentioned this issue Jan 13, 2015

lossless environ #16

Closed

chadwhitacre mentioned this issue Jan 13, 2015

talk about value encoding #18

Open

chadwhitacre mentioned this issue Jan 14, 2015

dial back the hype about middleware #20

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Header spoofing and CGI header name encoding. #13

Header spoofing and CGI header name encoding. #13

GrahamDumpleton commented Oct 15, 2014

rbtcollins commented Oct 15, 2014

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

rbtcollins commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

Header spoofing and CGI header name encoding. #13

Header spoofing and CGI header name encoding. #13

Comments

GrahamDumpleton commented Oct 15, 2014

rbtcollins commented Oct 15, 2014

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

gvanrossum commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015

rbtcollins commented Jan 13, 2015

chadwhitacre commented Jan 13, 2015