Do headers properly. #91
|
I'd like more flexibility when setting headers. I use urllib3 at the core of a reverse proxy implementation, and I think that sending the headers in the same order they arrived, or at least letting the user control that order, could be good (or maybe not). What I'd definitely like is that if a header arrives twice, I can send it on the same way it came instead of joining it into one. The ordered list of tuples you proposed looks like it meets those requirements: it always returns a list, and the insertion order is respected. |
|
@seocam Note that the gisted implementation actually doesn't quite fit that bill. It's very opinionated: it splits all duplicate header fields out into their own line. It does this primarily to enforce some notion of 'canonical form', but it might be better to transform into canonical form at access time rather than in the underlying representation, and to allow a non-canonical representation of the header block as well. |
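To make the distinction concrete, here is a minimal sketch of a container that keeps the non-canonical representation underneath and only interprets it at access time. It is not the gisted implementation: the class name `RawHeaders` and its methods are hypothetical, chosen purely for illustration.

```python
class RawHeaders:
    """Hypothetical container that keeps header fields exactly as inserted."""

    def __init__(self, pairs=()):
        # Underlying representation: an ordered list of (name, value) tuples.
        self._pairs = list(pairs)

    def add(self, name, value):
        # Append without merging, so duplicates and ordering survive.
        self._pairs.append((name, value))

    def get_all(self, name):
        # Field names are case-insensitive; lookup is a linear scan.
        wanted = name.lower()
        return [v for n, v in self._pairs if n.lower() == wanted]

    def raw_items(self):
        # Non-canonical view: exactly what was put in, in insertion order.
        return list(self._pairs)


headers = RawHeaders()
headers.add("Set-Cookie", "a=1")
headers.add("Accept", "text/html, application/json")
headers.add("set-cookie", "b=2")
```

With this shape, `headers.get_all("Set-Cookie")` yields both cookie values in order, while `headers.raw_items()` still returns the fields with their original case and position, which is what a proxy would need in order to replay them.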
|
Alright, I've turned this into a pull request with an initial implementation of the header mapping. Feedback is encouraged! |
You should see smile on my face after reading this... :) |
|
@piotr-dobrogost Possibly. =) I'm open to suggestions! |
|
By the way, the practice of automatic combining is questionable, because some values (dates, for example) already contain a comma, so the application must know which fields are single-item and which are lists. P.S. While on the subject, let me clarify the terminology: although it's very common to see "headers," that's not actually correct. |
|
@dimaqq Yes it does, but requests has a slightly different need than I have. What I'm trying to write here is a header representation that maintains as much of the information as possible that came off the wire, while also representing the headers in a format that reflects their semantics. In practice, this data structure isn't there yet, but ideally it ought to be possible to put headers in from the wire and get them back out the other side in exactly the same form. This isn't that important for requests because by-and-large requests does not need that kind of flexibility, but it's hugely important for a low-level library like As to terminology, you are quite right, but in practice calling them HTTP headers leads to less confusion than calling them HTTP header fields. One of those unfortunate areas where the 'correct' usage is less clear than the 'incorrect' one. |
|
Werkzeug has a fundamentally very similar data structure to this one, which suggests the sanity of the approach (if Armin is doing it, it feels like it's probably the right thing to do!). See it here. It differs from mine in a couple of key areas:
Altogether I don't think I want to take Werkzeug's implementation wholesale because it carries a lot of unnecessary complexity, but there are some good lessons to learn there. Interestingly, I think Werkzeug's implementation is much more likely to be useful for urllib3 and friends, where its additional function and dictiness might be really helpful. Worth thinking about. |
|
I've been working on a similar header data structure, focused first on losing no information about the header and second on consistency with urllib3. This approach also has the advantage of consistent behaviour between Python 2 and 3, and it maintains the original case and order, which allows you to fully reconstruct the header. |
|
@wtolson I'd love to see it in urllib3. =) However, I think |
|
Thanks for the feedback! Working on a urllib3 PR now. :) |
|
I guess it might be worth clarifying the design goals a bit more. Some people might consider speed more important than preserving, for example, ordering information in the header. Many client implementations won't really care about ordering, and maybe not even capitalization, as long as they can extract cookies and authentication successfully. Clients designed for high throughput in particular (web spiders and the like) will care more about speed than perfection. Anything that costs considerably more time than ordinary dict lookups will probably be a non-starter there. I'd also like to note that for clients that don't use the common Python httplib but a faster C version, header handling can be one of the larger time sinks. With httplib, parsing is so slow that the header object doesn't matter much anymore. To me, it looks like you're approaching this more from a server or proxy perspective, constructing headers to be sent out over the wire, and you want to control that process precisely. This is totally fine, but it might require quite a different header object than clients need. So this should be stated as a design goal. |
|
@ml31415 Agreed, so let me specify the design goals:
These are the goals for a low-level representation. In particular, |
|
This sounds absolutely reasonable for |
IMO, the only acceptable reliable API is a list of tuples in canonical form (one header, one value, comma-separated lists broken out). Any other API represents a loss of information, and any really intelligent header implementation should be able to produce exactly that. There is no requirement that they maintain all the information that such a representation provides (e.g. ordering), but they really need to be able to consume that form, because it's the lowest-common denominator structure.
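As a rough illustration of what that canonical form looks like, here is a minimal sketch that breaks comma-separated lists out into one tuple per value. The function name `canonicalize` and the `UNSPLITTABLE` set are assumptions made for this example, not part of any library discussed here. The set exists because some fields (dates, cookies) contain commas that are not list separators, and the naive `split(",")` below would also mishandle commas inside quoted strings.

```python
# Assumption: fields whose values may legitimately contain commas and
# therefore must not be split. This set is illustrative, not exhaustive.
UNSPLITTABLE = {"set-cookie", "date", "expires", "last-modified", "if-modified-since"}


def canonicalize(pairs):
    """Return (name, value) tuples with comma-separated lists broken out."""
    result = []
    for name, value in pairs:
        if name.lower() in UNSPLITTABLE:
            # Keep single-item fields whole, commas included.
            result.append((name, value.strip()))
            continue
        for item in value.split(","):
            item = item.strip()
            if item:  # skip empty list elements such as "a,,b"
                result.append((name, item))
    return result


wire = [
    ("Accept", "text/html, application/json"),
    ("Date", "Tue, 15 Nov 1994 08:12:31 GMT"),
    ("Accept", "*/*"),
]
canonical = canonicalize(wire)
```

Here `canonical` holds three `Accept` tuples and one untouched `Date` tuple, in the order the values appeared on the wire. A richer header object can consume that list and discard whatever it does not need, such as ordering.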
|
Note that, in practice, |
|
This initial implementation is good enough for now, so let's use it. If anyone who has been tracking this has further suggestions or requests, please open another issue: your feedback is welcome! |
|
Just to elaborate on my motivation for a speedy client header implementation: I had written a first multidict implementation for geventhttpclient. In some benchmarking, it reduced the requests per second against a local nginx machine from 4k to something like 2.5k. Before that it had just a plain dict, overwriting duplicates. It got back to around 3.6k after a bunch of tuning, using an approach similar to the one urllib3 uses now. For urllib3 the impact is far smaller, as httplib is already slow as hell at header parsing: around 500 requests per second on my machine. (Yup, high time to upgrade ...). For a client to provide reconstructability of the headers, I guess the cheapest and most general option would be to simply store the header as a plain string. Or, as httplib already does, as a list of lines. Having another intermediate raw header implementation will just add overhead for a very rare use case. This might be the wrong place to discuss client issues, so sorry if this is off-topic from a hyper point of view. |
This. |
Not at all. hyper is a client implementation, and it should do what it can to be fast. However, it will always be more important to me to be correct than to be fast: I rely on PyPy to make me fast. ;) The rule with hyper's implementation is that I will have a test suite that judges correctness, and any conforming implementation will be acceptable. Performance enhancements are welcome and should be made, but not at the cost of correctness. However, note again that for making requests |
|
That's a noble trust in PyPy :) It would be awesome indeed if it turned O(n) into O(1) automatically! I totally agree with correctness first, but the only thing I'd apply that term to here is spec conformity. In that light, full reconstructability might already be an additional desire, required maybe for debugging. And for that, I'd probably prefer raw data, i.e. the plain header string, over any intermediate representation. In the case of malformed headers that break parsing into the intermediate representation, this might be required anyway. |
Haha, yes, that would be quite the trick! It's worth noting that what I'm really relying on here is that even though the O(n) stuff is inarguably slower than the O(1) stuff, in practice on reasonably sized headers it's not that much slower. As they say, all algorithms are fast for small n, and most of the time headers are small in the number of header fields they contain (though they may have very long fields). To test this assertion, let me provide you with the profiling script here. This script tests I ran it twice, once on Python 3 and once on PyPy. You should note several things about this test. Firstly, it provides the biggest advantage possible to urllib3 and the dict, because there are in fact no repeated fields here. Thus, Regardless, here are the results for 10,000 runs of each test: Python 3.4.3: PyPy 2.5.0: What conclusions can we draw?
This is what I was getting at when I said that I'd rely on PyPy to make me fast. However, it's misleading to see the O(n) lookup and conclude 'slow'. On representative header sets, it's simply not much slower than the dict lookup: certainly not enough to be the bottleneck in your average code. Additionally, users that care can feel free to replace the We should always strive to be faster and more efficient, but we should also always know what's slowing us down. What's slowing |
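For readers who want to try this kind of comparison themselves, here is a minimal micro-benchmark sketch. It is not the profiling script referenced above: the header set and function names are made up for illustration, and it only contrasts a linear scan over a list of tuples with a dict lookup on a small block with no repeated fields.

```python
import timeit

# Hypothetical, representative-sized header block with no repeated fields.
FIELDS = [
    ("Host", "example.com"),
    ("User-Agent", "sketch/0.1"),
    ("Accept", "*/*"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Content-Type", "text/html; charset=utf-8"),
    ("Content-Length", "1024"),
    ("Cache-Control", "no-cache"),
    ("Connection", "keep-alive"),
]

as_list = list(FIELDS)
as_dict = {name.lower(): value for name, value in FIELDS}


def list_lookup(name):
    # O(n): scan every field, comparing names case-insensitively.
    wanted = name.lower()
    return [v for n, v in as_list if n.lower() == wanted]


def dict_lookup(name):
    # O(1) on average: a single hash lookup on the lowercased name.
    return as_dict[name.lower()]


# Look up the last field, which is the worst case for the linear scan.
for fn in (list_lookup, dict_lookup):
    seconds = timeit.timeit(lambda: fn("Connection"), number=10000)
    print(f"{fn.__name__}: {seconds:.4f}s for 10,000 lookups")
```

The absolute numbers depend entirely on the interpreter and machine, so no output is shown here. The useful comparison is how the gap between the two functions changes between CPython and PyPy, and as the number of fields grows.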
|
You're surely right that PyPy does a great job of speeding up the lookup, and that O(n) might even be faster than O(1) for small enough n. Still, not everyone uses PyPy, for various reasons, so there may still be a desire for reasonably fast implementations on CPython. Independent of the specific implementation, what I would still like to see is an interchangeable API for a header object, with some well-defined guarantees, so that everyone can use their preferred implementation and still interoperate seamlessly with other client implementations. With the recent change to Edit: Clarified above paragraph. |
That's true, but unfortunate. I consider PyPy to be my preferred interpreter at this point, as do many other developers, and I strongly believe that it is the future of Python.
Well, requests, urllib3 and hyper are a bit of a cabal: the same developers work on all three projects. So I'm sure we can get to an agreement on a common base class that defines an API we can all work from. @shazow @sigmavirus24, does that sound like a worthwhile idea? |
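One possible shape for such a base class is sketched below. This is only an illustration of the idea: the names `BaseHeaders` and `ListHeaders` and the choice of methods are hypothetical, not an agreed API for requests, urllib3, or hyper.

```python
from abc import ABC, abstractmethod


class BaseHeaders(ABC):
    """Hypothetical shared interface; names are illustrative only."""

    @abstractmethod
    def add(self, name, value):
        """Append a field without replacing existing ones."""

    @abstractmethod
    def get_all(self, name):
        """Return every value for ``name`` (case-insensitive), in order."""

    @abstractmethod
    def items(self):
        """Return all (name, value) tuples in insertion order."""

    def __contains__(self, name):
        # Shared behaviour built only on the abstract methods.
        return bool(self.get_all(name))


class ListHeaders(BaseHeaders):
    """One possible implementation: correctness first, O(n) lookups."""

    def __init__(self):
        self._pairs = []

    def add(self, name, value):
        self._pairs.append((name, value))

    def get_all(self, name):
        wanted = name.lower()
        return [v for n, v in self._pairs if n.lower() == wanted]

    def items(self):
        return list(self._pairs)


h = ListHeaders()
h.add("Via", "1.1 proxy-a")
h.add("via", "1.1 proxy-b")
```

A faster dict-backed implementation could subclass the same base, and callers that rely only on `add`, `get_all`, and `items` would not need to know which one they were handed.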
|
I expressed myself in a somewhat confused way, glad you understood it anyways 👍 |
|
🐥 |
I'm sick of Python implementations insisting that headers are dictionaries. They aren't, it's a bad match. We can and should do better. This is a proposed initial implementation that's in the spirit of what Go does, which I think is a substantially better model.
Feedback is encouraged here. Note that this has been strongly influenced by urllib3/urllib3#561, urllib3/urllib3#562, urllib3/urllib3#563, and urllib3/urllib3#564.