Feature suggestion: Preserve Header Order #2803
Comments
+1 to add this feature. There are actually two separate features:
I don't see any downsides in keeping response header order, +1 to make it the default. (2) is a bit more problematic because this feature may increase the memory usage of in-memory request queues. We also need to make sure the order of request headers stored in disk queues is preserved after a save/load cycle. Python 3.6 could make it easier and more memory efficient with its new dict implementation, but we shouldn't take this change lightly. I'm -0.5 on merging a pull request with an implementation which just uses OrderedDict for all Request.headers unconditionally. The Headers class is used for both Request and Response. If we are to make (2) optional and (1) the default, there should be an option not to preserve header order (maybe two Headers class versions). See also: #2784
@kmike perhaps specify the headers as a sequence of tuples in the Request instance, then after getting a request off the queue, test whether the headers are a tuple/list and convert them to an OrderedCaselessDict at that point?
Seems a bit hacky, but that could be the indicator for "I want the header order preserved" as well.
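A minimal sketch of that idea, assuming nothing about Scrapy internals: `coerce_headers` is a hypothetical helper, not an existing Scrapy function, and it only shows how the type of the headers argument could double as the "preserve order" flag. Case-insensitive lookup is left out to keep it short.

```python
from collections import OrderedDict


def coerce_headers(headers):
    """Hypothetical helper: return (mapping, preserve_order).

    A list/tuple of (name, value) pairs is read as "keep this exact order";
    any other mapping is passed through with no ordering promise.
    """
    if isinstance(headers, (list, tuple)):
        ordered = OrderedDict()
        for name, value in headers:
            ordered[name] = value
        return ordered, True
    return dict(headers), False


# Headers given as a sequence of tuples keep the order they were written in.
pairs = [("Host", "example.com"), ("User-Agent", "my-bot"), ("Accept", "*/*")]
ordered_headers, keep_order = coerce_headers(pairs)
```

Because the conversion would happen only after the request leaves the queue, the queued object could stay a plain sequence of tuples until then.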
+1 for this. User-specified header order for requests also seems a lot more important than response header order. This has become a popular bot detection mechanism lately. I'll take a look and see if I can put together an MR. Are we sure that the underlying libraries preserve the header order? I think we should also preserve casing instead of making it an OrderedCaselessDict, since this is also behaviour that differs between browsers.
Any news on this, or a workaround?
bump ;) ?
I recently stumbled on a need to implement this: I found one site fingerprinting HTTP clients based on header order and behaving differently if you had an unusual header order. I implemented feature 2 from @kmike's comment.
I did a very basic implementation which just defines a hardcoded header order and then uses OrderedDict for all Request.headers unconditionally. You can see it here, so if anyone needs this, go and try it. It worked for me, but I can't guarantee it will work for you. To make it part of Scrapy we'll have to ensure a couple of things
Let me know what you think about this.
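For readers who want the general shape of the hardcoded-order approach, here is a rough sketch. It is not the implementation linked above: `HEADER_ORDER` and `reorder_headers` are made-up names, and the order itself is only an illustrative guess, not a verified browser fingerprint.

```python
from collections import OrderedDict

# Illustrative order only; a real one would be copied from the client being imitated.
HEADER_ORDER = ["Host", "User-Agent", "Accept", "Accept-Language",
                "Accept-Encoding", "Connection"]


def reorder_headers(headers, order=HEADER_ORDER):
    """Return an OrderedDict with known headers first, in the given order.

    Headers not listed in `order` keep their relative position after the
    known ones, because sorted() is stable.
    """
    rank = {name.lower(): index for index, name in enumerate(order)}
    items = sorted(headers.items(),
                   key=lambda item: rank.get(item[0].lower(), len(rank)))
    return OrderedDict(items)


reordered = reorder_headers({
    "Accept": "*/*",
    "X-Custom": "1",
    "Host": "example.com",
    "User-Agent": "my-bot",
})
```

Whether the order survives all the way to the wire still depends on the download handler and the libraries underneath it, which this sketch does not address.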
I know this was brought up some time ago (issue #223 in 2013, I believe). At that time it was seen as unnecessary, with the respondent asking for some example sites which relied on header order. While I cannot say for sure that some of the sites I've had issues with are giving me problems because of header order, there have been a few papers (I can only recall this one for now: http://www.letmetrackyou.org/paper.pdf) which mention header construction/order as a way to fingerprint browsers, or in this case to determine whether the UA string has been spoofed. I have tried switching CaselessDict to inherit from OrderedDict and switching all references in the class from dict to OrderedDict, and the Headers class seems to (in this very limited example) construct the headers appropriately.
Would there be any drawbacks in making this change?
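To make the proposed change concrete, here is a simplified stand-in, assuming a caseless dict that only lowercases keys. It is not Scrapy's actual CaselessDict, just a sketch of what "inherit from OrderedDict instead of dict" could look like; the class name is hypothetical.

```python
from collections import OrderedDict


class OrderedCaselessDict(OrderedDict):
    """Simplified stand-in: case-insensitive keys, insertion order kept."""

    def __init__(self, seq=None):
        super().__init__()
        if seq:
            self.update(seq)

    def normkey(self, key):
        # Keys are stored lowercased, so the original casing is lost here.
        return key.lower()

    def __getitem__(self, key):
        return OrderedDict.__getitem__(self, self.normkey(key))

    def __setitem__(self, key, value):
        OrderedDict.__setitem__(self, self.normkey(key), value)

    def __delitem__(self, key):
        OrderedDict.__delitem__(self, self.normkey(key))

    def __contains__(self, key):
        return OrderedDict.__contains__(self, self.normkey(key))

    def get(self, key, default=None):
        return OrderedDict.get(self, self.normkey(key), default)

    def update(self, seq):
        items = seq.items() if isinstance(seq, dict) else seq
        for key, value in items:
            self[key] = value


headers = OrderedCaselessDict([("User-Agent", "my-bot"),
                               ("Accept", "*/*"),
                               ("Host", "example.com")])
```

Iterating over `headers` yields the keys in insertion order, while lookups such as `headers["ACCEPT"]` stay case-insensitive.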