Do headers properly. #91

Lukasa · 2015-03-05T21:25:59Z

I'm sick of Python implementations insisting that headers are dictionaries. They aren't, it's a bad match. We can and should do better. This is a proposed initial implementation that's in the spirit of what Go does, which I think is a substantially better model.

Feedback is encouraged here. Note that this has been strongly influenced by urllib3/urllib3#561, urllib3/urllib3#562, urllib3/urllib3#563, and urllib3/urllib3#564.

seocam · 2015-03-05T16:29:40Z

I'd like to have more flexibility when setting headers. I use urllib3 in the core of reverse proxy implementation and I think that sending the headers in the same orders they came, or at least let the user play around with that could be good (or maybe not).

For sure what I'd like is that if a header comes twice I'd like to be able to send it in the same way it came instead of joining it in just one.

The ordered list of tuples you proposed looks to somehow meet those requirements. It always return a list and the insertion order is respected.

Lukasa · 2015-03-05T17:18:01Z

@seocam Note that the gisted implementation actually doesn't quite fit that bill. It's very opinionated: it splits all duplicate header fields out into their own line. It does this primarily to enforce some notion of 'canonical form', but it might be better to transform into canonical form at access time rather than in the underlying representation, and to allow a non-canonical representation of the header block as well.

Lukasa · 2015-03-05T21:27:10Z

Alright, I've turned this into a pull request with an initial implementation of the header mapping. Feedback is encouraged!

piotr-dobrogost · 2015-03-05T22:08:38Z

Do headers properly

You should see smile on my face after reading this... :)
Btw, is there really no Python project from which we could borrow the implementation and not reinvent the wheel?

Lukasa · 2015-03-05T22:09:16Z

@piotr-dobrogost Possibly. =) I'm open to suggestions!

dimaqq · 2015-03-06T07:40:07Z

btw., requests comes with case-insensitive http header field map, which automatically combines repeated header fields into one value.

The practice of automatic combining is questionable though, because some values, e.g. dates, already have a comma, thus the application must know which fields are single-item and which are lists.

P.S. while on the subject, let me clarify the terminology:
a request has a single header
header comprises multiple fields
each field has a single value
value may comprise multiple elements separated by commas.

While it's very common to see "headers," that's not actually correct.

Lukasa · 2015-03-06T08:39:26Z

@dimaqq Yes it does, but requests has a slightly different need than I have.

What I'm trying to write here is a header representation that maintains as much of the information as possible that came off the wire, while also representing the headers in a format that reflects their semantics. In practice, this data structure isn't there yet, but ideally it ought to be possible to put headers in from the wire and get them back out the other side in exactly the same form.

This isn't that important for requests because by-and-large requests does not need that kind of flexibility, but it's hugely important for a low-level library like hyper. In particular, it's important that when using hyper you can construct a web request a web server will actually use, which may involve some very specific ordering and casing requirements.

As to terminology, you are quite right, but in practice calling them HTTP headers leads to less confusion than calling them HTTP header fields. One of those unfortunate areas where the 'correct' usage is less clear than the 'incorrect' one.

Lukasa · 2015-03-06T09:00:14Z

Werkzeug has a fundamentally very similar data structure to this one, which suggests the sanity of the approach (if Armin is doing it, it feels like it's probably the right thing to do!). See it here.

It differs from mine in a couple of key areas:

__getitem__ returns the first header inserted with a given key. That keeps dict-like semantics, which is useful for Werkzeug, but I consider it to be an attractive nuisance. To get lists out of the Headers object requires calling getlist instead. I don't think I want to do this.
it allows indexing operations as well as named lookup. That's nice, it keeps the 'iterable of tuples' semantic in place, so I think we should do that.
It allows adding of arguments to headers. I think that's a nice shiny feature but fundamentally not that helpful in this case: let a higher-level library worry about it.
it rejects newlines in headers values. That's technically not the right thing to do, I'd rather not concern myself with it.
it has a set method, which is roughly the same as del h[key]; h[key] = value in my implementation. A nice helper and a trivial addition if it appears to be useful, though I don't think it will.
a to_wsgi_list method that turns the header fields into the correct types. Something like this would be useful for hyper.
also has a useful str method.

Altogether I don't think I want to take Werkzeug's implementation wholesale because it carries a lot of unnecessary complexity, but there are some good lessons to learn there.

Interestingly, I think Werzeug's implementation is much more likely to be useful for urllib3 and friends, where its additional function and dictiness might be really helpful. Worth thinking about.

wtolson · 2015-03-06T16:23:22Z

I've been working on a similar header data structure with a focus on first no loss of information about the header and second consistancy with urllib3. This approach also has the advantage of consistent behaviour between python 2/3 and maintains the original case/order which allows you to fully reconstruct header.

Lukasa · 2015-03-06T18:07:00Z

@wtolson I'd love to see it in urllib3. =) However, I think hyper will stick with a 'list of tuples'-based implementation for now, because it's easier to reason about the correctness. As I said above, I think higher-level libraries absolutely should use a different data structure that more correctly replicates their needs, and yours looks great for urllib3.

wtolson · 2015-03-06T18:15:40Z

Thanks for the feedback! Working on a urllib3 PR now. :)

ml31415 · 2015-03-07T10:17:29Z

I guess it might be worth to clarify the design goals a bit more. Some people might consider speed more important, than losing e.g. sort information in the header. Many client implementations won't really care about ordering, and maybe not even capitalization, as long as they can extract cookies and authentication successfully. Especially clients designed for high throughput, webspiders etc. will care more about speed than perfection. Anything that wastes considerably more time than usual dict lookups probably will be irrelevant there. I'd also like to note, that for clients not using the common python httplib, but a faster c-version, header handling can be one of the larger time wasters. Using httplib, this is so slow, that the header object doesn't matter that much anymore.

To me, it looks like you're looking at this more from an server or proxy perspective, constructing headers to be sent out over the wire, and you perfectly want to control that process. This is totally fine, but it might require a quite different header object, than clients. So this should be stated as a design goal.

Lukasa · 2015-03-07T10:22:10Z

@ml31415 Agreed, so let me specify the design goals:

It must be possible to exactly control the format of the headers emitted on the wire.
It must be possible to interact with the headers exactly as they arrived on the wire.
It must be possible to obtain all header values in a canonical form (list of values).
It must be possible to perform a 'lookup' on a header key and retrieve its values (there is no requirement that this be O(1)).

These are the goals for a low-level representation. In particular, hyper will allow any iterable of pairs as a header representation if chosen, but this will be available for those who require it.

ml31415 · 2015-03-07T10:57:37Z

This sounds absolutely reasonable for hyper. For urllib3, I'd definitely would like to see O(1) for field lookup. What I'd also like to see, when different client implementations would share the same header objects, instead of copying the fields again and again, to create their own header objects, doing 99% the same as the original one. Would be nice, if there would be some kind of reliable API for a header, so I could just pass such a header from geventhttpclient right into urllib3 or requests, without any monkeypatching.

Lukasa · 2015-03-07T11:48:48Z

What I'd also like to see, when different client implementations would share the same header objects, instead of copying the fields again and again, to create their own header objects, doing 99% the same as the original one.

Would be nice, if there would be some kind of reliable API for a header, so I could just pass such a header from geventhttpclient right into urllib3 or requests, without any monkeypatching.

IMO, the only acceptable reliable API is a list of tuples in canonical form (one header, one value, comma-separated lists broken out). Any other API represents a loss of information, and any really intelligent header implementation should be able to produce exactly that. There is no requirement that they maintain all the information that such a representation provides (e.g. ordering), but they really need to be able to consume that form, because it's the lowest-common denominator structure. hyper's header map will consider that form to be the default result of iterating over it, which means it should be safe to immediately pass it into another header structure.

hyper will also always be able to cope with any implementation that provides it such an iterable: in fact, it will guarantee that the ordering and structure returned by that iterable will be represented in its wire-encoding, both for HTTP/1.1 and HTTP/2. Any higher-level layer that wants to transform headers (e.g. to allow O(1) lookup) needs to be able to consume that representation, ideally in its constructor.

Lukasa · 2015-03-07T11:50:50Z

IMO, the only acceptable reliable API is a list of tuples in canonical form (one header, one value, comma-separated lists broken out).

Note that, in practice, hyper will be more generous than that, and will allow headers in non-canonical form to be provided to it and will preserve that structure internally. However, it will only emit headers in canonical form in the default iteration model, requiring an alternative iteration mode to emit the 'raw' headers.

Lukasa · 2015-03-07T18:00:36Z

This initial implementation is good enough for now, so let's use it. If anyone who has been tracking this has further suggestions or requests, please open another issue: your feedback is welcome!

Do headers properly.

ml31415 · 2015-03-07T19:31:54Z

Just for the sake of elaborating my motivation for a speedy client header implementation: I had written a first multidict implementation for geventhttpclient. Doing some benchmarking, it reduced the requests per second against a local nginx machine from 4k to something like 2.5k. Before it had just a plain dict, overwriting duplicates. It got back to around 3.6k after a bunch of tuning, with some similar approach as for urllib3 now. For urllib3 the impact is far less, as httplib is already slow as hell with header parsing, around 500 requests per second for my machine. (Yup, high time to upgrade ...).

For a client to provide reconstructability of the headers, I guess the cheapest and most general option would be, to simply have the header stored as a plain string. Or like httplib is already doing it, as a list of lines. Having another intermediate raw header implementation will just add overhead for a very rare usecase. Discussing client issues might be the wrong place here, so sorry if this is offtopic from a hyper point of view.

piotr-dobrogost · 2015-03-07T21:15:11Z

What I'd also like to see, when different client implementations would share the same header objects, instead of copying the fields again and again, to create their own header objects, doing 99% the same as the original one. Would be nice, if there would be some kind of reliable API for a header, so I could just pass such a header from geventhttpclient right into urllib3 or requests, without any monkeypatching.

This.

Lukasa · 2015-03-07T22:25:16Z

Discussing client issues might be the wrong place here, so sorry if this is offtopic from a hyper point of view.

Not at all. hyper is a client implementation, and it should do what it can to be fast. However, it will always be more important to me to be correct than to be fast: I rely on PyPy to make me fast. ;)

The rule with hyper's implementation is that I will have a test suite that judges correctness, and any conforming implementation will be acceptable. Performance enhancements are welcome and should be made, but not at the cost of correctness.

However, note again that for making requests hyper will always allow any iterable to be used. All it cares about is that it can get a sequence of name-value pairs. For responses, it will initially use the HTTPHeaderMap made available here, but the current implementation of that code stores exactly what it received from the wire with no extra work done on it, so it is basically trivial to replace that with any mapping the user wants. If you think it's worth doing, I can even make it a natural part of the API, hanging a 'HeaderClass' field on one of the classes.

ml31415 · 2015-03-08T07:44:40Z

That's a noble trust in PyPy :) Would be awesome indeed, if it would fix O(n) to O(1) automatically! I totally agree with correctness first, but the only term I'd apply this to is spec conformity here. In that light, full reconstructability might already be some additional desire, required maybe for debugging. And for that, I'd probably prefer raw data == plain header string over any other intermediate representation. In case of malformated headers which mess up the parsing to the intermediate representation, this might be required anyways.

Lukasa · 2015-03-08T08:22:45Z

Would be awesome indeed, if it would fix O(n) to O(1) automatically!

Haha, yes, that would be quite the trick!

It's worth noting that what I'm really relying on here is that even though the O(n) stuff is inarguably slower than the O(1) stuff, in practice on reasonably sized headers it's not that much slower. As they say, all algorithms are fast for small n, and most of the time headers are small in the number of header fields they contain (though they may have very long fields).

To test this assertion, let me provide you with the profiling script here. This script tests hyper's header map, urllib3's header dict (from 1.10.2, e.g. the one that is actually a bit broken), and a raw dict. It uses the following header list (drawn from a Facebook API request):

headers = [
    ("content-length", "14684"),
    ("x-content-type-options", "nosniff"),
    ("content-encoding", "gzip"),
    ("expires", "Tue, 29 Oct 2013 02:44:39 GMT"),
    ("vary", "Accept-Encoding"),
    ("last-modified", "Sun, 28 Oct 2012 21:37:35 GMT"),
    ("connection", "keep-alive"),
    ("cache-control", "public, max-age=31067635"),
    ("date", "Sat, 03 Nov 2012 12:50:44 GMT"),
    ("access-control-allow-origin", "*"),
    ("content-type", "text/css; charset=utf-8"),
    ("x-fb-debug", "Qc0GcUiwi3io8aSRIdXaahYr6KKhphvV6NlN8vo/bD4="),
]

I ran it twice, once on Python 3 and once on PyPy.

You should note several things about this test. Firstly, it provides the biggest advantage possible to urllib3 and the dict, because there are in fact no repeated fields here. Thus, hyper is doing a lot of extra work that it doesn't need to. Additionally, hyper will encode these all to bytestrings, and will preserve their original structure, meaning that it does a ton more work on all accesses.

Regardless, here are the results for 10,000 runs of each test:

Python 3.4.3:

'test_creation' ((<class 'hyper.common.headers.HTTPHeaderMap'>,), {}) 0.36 sec
'test_creation' ((<class 'urllib3._collections.HTTPHeaderDict'>,), {}) 0.15 sec
'test_creation' ((<class 'dict'>,), {}) 0.02 sec
'test_access' ((<hyper.common.headers.HTTPHeaderMap object at 0x108c4db38>,), {}) 0.10 sec
'test_access' ((HTTPHeaderDict({'date': 'Sat, 03 Nov 2012 12:50:44 GMT', 'cache-control': 'public, max-age=31067635', 'content-length': '14684', 'content-type': 'text/css; charset=utf-8', 'expires': 'Tue, 29 Oct 2013 02:44:39 GMT', 'content-encoding': 'gzip', 'last-modified': 'Sun, 28 Oct 2012 21:37:35 GMT', 'x-content-type-options': 'nosniff', 'vary': 'Accept-Encoding', 'x-fb-debug': 'Qc0GcUiwi3io8aSRIdXaahYr6KKhphvV6NlN8vo/bD4=', 'access-control-allow-origin': '*', 'connection': 'keep-alive'}),), {}) 0.01 sec
'test_access' (({'date': 'Sat, 03 Nov 2012 12:50:44 GMT', 'cache-control': 'public, max-age=31067635', 'content-length': '14684', 'content-type': 'text/css; charset=utf-8', 'expires': 'Tue, 29 Oct 2013 02:44:39 GMT', 'content-encoding': 'gzip', 'connection': 'keep-alive', 'x-content-type-options': 'nosniff', 'vary': 'Accept-Encoding', 'x-fb-debug': 'Qc0GcUiwi3io8aSRIdXaahYr6KKhphvV6NlN8vo/bD4=', 'last-modified': 'Sun, 28 Oct 2012 21:37:35 GMT', 'access-control-allow-origin': '*'},), {}) 0.00 sec

PyPy 2.5.0:

'test_creation' ((<class 'hyper.common.headers.HTTPHeaderMap'>,), {}) 0.12 sec
'test_creation' ((<class 'urllib3._collections.HTTPHeaderDict'>,), {}) 0.09 sec
'test_creation' ((<type 'dict'>,), {}) 0.01 sec
'test_access' ((<hyper.common.headers.HTTPHeaderMap object at 0x0000000107664fe0>,), {}) 0.02 sec
'test_access' ((HTTPHeaderDict({'content-length': '14684', 'x-content-type-options': 'nosniff', 'content-encoding': 'gzip', 'expires': 'Tue, 29 Oct 2013 02:44:39 GMT', 'vary': 'Accept-Encoding', 'last-modified': 'Sun, 28 Oct 2012 21:37:35 GMT', 'connection': 'keep-alive', 'cache-control': 'public, max-age=31067635', 'date': 'Sat, 03 Nov 2012 12:50:44 GMT', 'access-control-allow-origin': '*', 'content-type': 'text/css; charset=utf-8', 'x-fb-debug': 'Qc0GcUiwi3io8aSRIdXaahYr6KKhphvV6NlN8vo/bD4='}),), {}) 0.01 sec
'test_access' (({'content-length': '14684', 'x-content-type-options': 'nosniff', 'content-encoding': 'gzip', 'expires': 'Tue, 29 Oct 2013 02:44:39 GMT', 'vary': 'Accept-Encoding', 'last-modified': 'Sun, 28 Oct 2012 21:37:35 GMT', 'connection': 'keep-alive', 'cache-control': 'public, max-age=31067635', 'date': 'Sat, 03 Nov 2012 12:50:44 GMT', 'access-control-allow-origin': '*', 'content-type': 'text/css; charset=utf-8', 'x-fb-debug': 'Qc0GcUiwi3io8aSRIdXaahYr6KKhphvV6NlN8vo/bD4='},), {}) 0.01 sec

What conclusions can we draw?

Creating hyper's header map is expensive, particularly on Python 3. That's not really a surprise: on Python 3 the entire header map is strings, so hyper has to encode every single one. That's necessary because urllib3 cannot handle being given bytestrings on Python 3 (more strictly it can't handle returning them). Note that it's a bit more than twice the cost of urllib3 on Python 3 (where it has to encode every byte), and is only a tiny bit more expensive on PyPy (where it does not).
On Python 3, again, the lookup is quite a bit more expensive, but the PyPy result is revealing. The PyPy numbers make it quite clear that the reason the hyper lookup is expensive is that it is all Python code. Moving to the JIT compiler eliminates most of the overhead, and the hyper lookup becomes a tiny fraction more expensive than the dict.

This is what I was getting at when I said that I'd rely on PyPy to make me fast. hyper's header map does a lot more work, and so it will always be slower than a dict. We absolutely should make it as fast as possible without sacrificing that correctness.

However, it's misleading to see the O(n) lookup and conclude 'slow'. On representative header sets, it's simply not much slower than the dict lookup: certainly not enough to be the bottleneck in your average code. Additionally, users that care can feel free to replace the hyper map with something faster if they are prepared to make those tradeoffs.

We should always strive to be faster and more efficient, but we should also always know what's slowing us down. What's slowing hyper down is mostly the Python code in the stack, rather than the list-based approach.

ml31415 · 2015-03-08T11:07:11Z

You're surely right, that pypy is doing a great job in speeding up the lookup, and that O(n) might be even faster than O(1) for small enough n. Though, not everyone is using pypy for different reasons, so there might still be a desire for reasonably fast implementations for cpython.

Independent of the specific implementation, what I still would like to see is some interchangeably API for a header object, with some well defined guarantees, so everyone may be able to use his preferred implementation, and still be able to interoperate with other client implementations seamlessly.

With the recent change to urllib3.response, I can now monkeypatch / replace httplib with geventhttpclient and the nginx http parser, and feed a HTTPHeaderDict right into the response object. urllib3 checks, if it's already a HTTPHeaderDict passed to it from the response object. If it's a subclass of HTTPHeaderDict, then it just reuses the object, instead of sillily copying all the header content, and creating a new header object. With requests, as far as I understand it, the header object is reread again unfortunately, instead of reusing HTTPHeaderDict. This behaviour and interchangeability would be desireable for the whole toolchain imho.

Edit: Clarified above paragraph.

Lukasa · 2015-03-08T12:20:43Z

Though, not everyone is using pypy for different reasons

That's true, but unfortunate. I consider PyPy to be my preferred interpreter at this point, as do many other developers, and I strongly believe that it is the future of Python.

What I still would like to see is some interchangeably API for a header object.

Well, requests, urllib3 and hyper are a bit of a cabal: the same developers work on all three projects. So I'm sure we can get to an agreement on a common base class that defines an API we can all work from. @shazow @sigmavirus24, does that sound like a worthwhile idea?

ml31415 · 2015-03-08T12:28:01Z

I expressed myself in a somewhat confused way, glad you understood it anyways 👍

shazow · 2015-03-08T18:47:35Z

🐥

Lukasa mentioned this pull request Mar 5, 2015

duplicate/multiple headers are lost urllib3/urllib3#561

Closed

Initial header set implementation.

0ac6123

KeyErrors should have messages.

bab718f

Lukasa added 6 commits March 7, 2015 12:14

Switch to storing headers in 'raw' form

721a239

Allow retrieval of 'raw' headers.

36e0ee9

Move to bytestrings everywhere for headers.

2384ce0

Add documentation for the header map.

57cd931

Extra headers test: non-strings.

0a7c957

Unicode literals for headers tests.

3c7981d

Lukasa added a commit that referenced this pull request Mar 7, 2015

Merge pull request #91 from Lukasa/headerdict

8cab7eb

Do headers properly.

Lukasa merged commit 8cab7eb into development Mar 7, 2015

Lukasa deleted the headerdict branch March 7, 2015 18:00

Do headers properly. #91

Do headers properly. #91

Uh oh!

Conversation

Lukasa commented Mar 5, 2015

Uh oh!

seocam commented Mar 5, 2015

Uh oh!

Lukasa commented Mar 5, 2015

Uh oh!

Lukasa commented Mar 5, 2015

Uh oh!

piotr-dobrogost commented Mar 5, 2015

Uh oh!

Lukasa commented Mar 5, 2015

Uh oh!

dimaqq commented Mar 6, 2015

Uh oh!

Lukasa commented Mar 6, 2015

Uh oh!

Lukasa commented Mar 6, 2015

Uh oh!

wtolson commented Mar 6, 2015

Uh oh!

Lukasa commented Mar 6, 2015

Uh oh!

wtolson commented Mar 6, 2015

Uh oh!

ml31415 commented Mar 7, 2015

Uh oh!

Lukasa commented Mar 7, 2015

Uh oh!

ml31415 commented Mar 7, 2015

Uh oh!

Lukasa commented Mar 7, 2015

Uh oh!

Lukasa commented Mar 7, 2015

Uh oh!

Lukasa commented Mar 7, 2015

Uh oh!

ml31415 commented Mar 7, 2015

Uh oh!

piotr-dobrogost commented Mar 7, 2015

Uh oh!

Lukasa commented Mar 7, 2015

Uh oh!

ml31415 commented Mar 8, 2015

Uh oh!

Lukasa commented Mar 8, 2015

Uh oh!

ml31415 commented Mar 8, 2015

Uh oh!

Lukasa commented Mar 8, 2015

Uh oh!

ml31415 commented Mar 8, 2015

Uh oh!

shazow commented Mar 8, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants