Feature suggestion: Preserve Header Order #2803

Open
rjbks opened this issue Jun 23, 2017 · 7 comments

Comments

@rjbks

rjbks commented Jun 23, 2017

I know this was brought up some time ago (issue #223 in 2013, I believe). At that time it was seen as unnecessary, with the respondent asking for example sites that relied on header order. While I cannot say for sure that the sites I've had issues with are giving me problems because of header order, there have been a few papers (I can only recall this one for now: http://www.letmetrackyou.org/paper.pdf) which mention header construction/order as a way to fingerprint browsers or, in this case, to determine whether the UA string has been spoofed. I tried switching CaselessDict to inherit from OrderedDict and changing all references in the class from dict to OrderedDict, and the Headers class seems (in this very limited example) to construct the headers appropriately.

Would there be any drawbacks in making this change?
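
For illustration, here is a minimal standalone sketch of the change described above. It is not Scrapy's actual CaselessDict: the class name OrderedCaselessDict and the choice of lowercasing keys are assumptions made for the example. It only shows that a case-insensitive mapping built on OrderedDict keeps insertion order.

```python
from collections import OrderedDict
from collections.abc import Mapping


class OrderedCaselessDict(OrderedDict):
    """Hypothetical case-insensitive mapping that keeps insertion order."""

    def __init__(self, seq=None):
        super().__init__()
        if seq:
            self.update(seq)

    @staticmethod
    def normkey(key):
        # Assumption: keys are normalized by lowercasing.
        return key.lower()

    def __setitem__(self, key, value):
        super().__setitem__(self.normkey(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self.normkey(key))

    def __contains__(self, key):
        return super().__contains__(self.normkey(key))

    def get(self, key, default=None):
        return super().get(self.normkey(key), default)

    def update(self, seq):
        # Route every assignment through __setitem__ so keys are normalized.
        items = seq.items() if isinstance(seq, Mapping) else seq
        for key, value in items:
            self[key] = value


headers = OrderedCaselessDict(
    [("User-Agent", "demo/1.0"), ("Accept", "*/*"), ("Host", "example.com")]
)
print(list(headers))  # keys come back in the order they were inserted
```

Lookups such as headers["ACCEPT"] still succeed, while iteration follows insertion order rather than hash order.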

@kmike
Member

kmike commented Jun 23, 2017

+1 to add this feature.

There are actually two separate features:

  1. preserve order of headers received from a website, and
  2. use user-specified header order when sending requests to websites.

I don't see any downsides in keeping response header order, +1 to make it default.

(2) is a bit more problematic because this feature may increase memory usage of in-memory request queues. We also need to make sure the order of request headers stored in disk queues is preserved after a save/load cycle. Python 3.6 could make it easier and more memory efficient with its new dict implementation, but we shouldn't take this change lightly. I'm -0.5 on merging a pull request with an implementation which just uses OrderedDict for all Request.headers unconditionally.

The Headers class is used for both Request and Response. If we are to make (2) optional and (1) default, there should be an option not to preserve header order (maybe two Headers class versions).
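
The save/load concern can be checked with a small round-trip test. The sketch below assumes a pickle-based serializer and uses a temporary file as a stand-in for a disk queue; the helper name save_and_load is hypothetical and is not part of Scrapy.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


def save_and_load(obj):
    """Pickle obj to a temporary file and read it back (one save/load cycle)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        os.remove(path)


headers = OrderedDict(
    [("Host", "example.com"), ("User-Agent", "demo/1.0"), ("Accept", "*/*")]
)
restored = save_and_load(headers)

# The header order should be identical before and after the cycle.
assert list(restored.items()) == list(headers.items())
```

A check like this would need to be repeated for every serializer the disk queues support, since each one may treat mapping order differently.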

See also: #2784

@rjbks
Author

rjbks commented Jun 23, 2017

@kmike perhaps specify the headers as a sequence of tuples in the Request instance, then, after getting one off the queue, test for an instance of tuple/list and convert it to an OrderedCaselessDict at that point?

@rjbks
Author

rjbks commented Jun 23, 2017

Seems a bit hacky, but that could also be the indicator for "I want the header order preserved".
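
A rough sketch of that opt-in idea follows. The function name build_headers is hypothetical, and a plain OrderedDict stands in for the caseless variant. Note that on Python 3.7+ a plain dict also keeps insertion order, so there the distinction would mostly be about which container type (and memory cost) gets used.

```python
from collections import OrderedDict


def build_headers(headers):
    """Pick the header container based on how the user passed the headers.

    A list/tuple of (name, value) pairs is treated as "preserve this order";
    anything else falls back to a plain dict.
    """
    if isinstance(headers, (list, tuple)):
        return OrderedDict(headers)
    return dict(headers or {})


ordered = build_headers([("Host", "example.com"), ("Accept", "*/*")])
plain = build_headers({"Accept": "*/*"})

print(type(ordered).__name__, type(plain).__name__)
```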

@Glennvd

Glennvd commented Mar 19, 2018

+1 for this. User-specified header order for requests also seems a lot more important than response header order. This has become a popular bot detection mechanism lately. I'll see whether I can put together an MR. Are we sure that the underlying libraries preserve the header order?

I think we should also preserve casing instead of making it an OrderedCaselessDict, since this is also behaviour that differs between browsers.
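
One way to keep both order and casing is to look names up case-insensitively while storing the name exactly as the user wrote it. The class below is only a sketch under that assumption; CasePreservingHeaders is a hypothetical name, and the code relies on plain dicts keeping insertion order (guaranteed from Python 3.7).

```python
class CasePreservingHeaders:
    """Hypothetical header container: caseless lookup, original casing kept."""

    def __init__(self, pairs=()):
        # Maps lowercased name -> (name as originally written, value).
        self._store = {}
        for name, value in pairs:
            self[name] = value

    def __setitem__(self, name, value):
        self._store[name.lower()] = (name, value)

    def __getitem__(self, name):
        return self._store[name.lower()][1]

    def __contains__(self, name):
        return name.lower() in self._store

    def items(self):
        # Pairs in insertion order, with the casing the user supplied.
        return list(self._store.values())


headers = CasePreservingHeaders(
    [("user-agent", "demo/1.0"), ("Accept-Encoding", "gzip")]
)
print(headers.items())
```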

@doprdele

Any news on this or workaround?

@TheRealAstroboy

bump ;) ?

@pawelmhm
Contributor

pawelmhm commented Dec 10, 2020

I recently stumbled on a need to implement this: I found one site that fingerprints HTTP clients based on header order and behaves differently if the header order is unusual. I implemented feature 2 from @kmike's comment:

use user-specified header order when sending requests to websites.

I did a very basic implementation which just defines a hardcoded header order and then uses OrderedDict for all Request.headers unconditionally. You can see it here, so if anyone needs this, give it a try. It worked for me, but I can't guarantee it will work for you.
https://gist.github.com/pawelmhm/176a4d01aea93c65bd64155c761fcc7d
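
To illustrate the general approach (this is not the code from the gist), a hardcoded order can be applied with a stable sort. The HEADER_ORDER list and the reorder_headers helper are hypothetical names chosen for this sketch, and the order shown is just an example, not any particular browser's.

```python
from collections import OrderedDict

# Hypothetical preferred order; a real one would be copied from a target browser.
HEADER_ORDER = [
    "Host",
    "User-Agent",
    "Accept",
    "Accept-Language",
    "Accept-Encoding",
    "Connection",
]


def reorder_headers(headers, order=HEADER_ORDER):
    """Return an OrderedDict with known headers first, in the given order."""
    rank = {name.lower(): i for i, name in enumerate(order)}
    # sorted() is stable, so headers missing from `order` keep their
    # relative positions and end up after all the known ones.
    items = sorted(
        headers.items(), key=lambda kv: rank.get(kv[0].lower(), len(rank))
    )
    return OrderedDict(items)


sent = reorder_headers(
    {"Accept": "*/*", "X-Custom": "1", "Host": "example.com", "User-Agent": "demo"}
)
print(list(sent))
```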

To make it part of Scrapy, we'll have to take care of a couple of things:

  • benchmark memory usage of in-memory request queues
  • make sure order of request headers stored in disk queues is preserved after save/load cycle
  • think about the API for users to define their header order: should it be enabled by default, and should there be some default header order for Scrapy? It is probably not needed by default, so it could be an extra feature that is enabled on request. Perhaps it could even be made part of a separate Scrapy plugin that helps with dealing with fingerprinting?
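
As a starting point for the first item, the memory cost could be estimated with tracemalloc. The sketch below is an assumption-laden approximation: the sample headers, the measure helper, and a plain list standing in for an in-memory request queue are all made up for the example, and real Request objects carry much more than their headers.

```python
import tracemalloc
from collections import OrderedDict

# Made-up sample headers for the measurement.
SAMPLE = [
    ("Host", "example.com"),
    ("User-Agent", "demo/1.0"),
    ("Accept", "*/*"),
    ("Accept-Language", "en"),
    ("Accept-Encoding", "gzip, deflate"),
]


def measure(factory, count=10000):
    """Return bytes allocated while building `count` header containers."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    queue = [factory(SAMPLE) for _ in range(count)]  # stand-in for a queue
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del queue
    return after - before


if __name__ == "__main__":
    print("dict:       ", measure(dict))
    print("OrderedDict:", measure(OrderedDict))
```

The absolute numbers depend on the Python version and platform, so they are mainly useful for comparing the two containers on the same interpreter.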

Let me know what you think about this.
