[MRG] Referrer policies in RefererMiddleware #2306
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2306 +/- ##
=========================================
- Coverage 83.93% 83.34% -0.6%
=========================================
Files 161 161
Lines 8920 9060 +140
Branches 1316 1290 -26
=========================================
+ Hits 7487 7551 +64
- Misses 1176 1240 +64
- Partials 257 269 +12
        if stripped is not None:
            return urlunparse(stripped)

    def strip_url(self, req_or_resp, origin_only=False):
kmike (Member), Oct 18, 2016:
What do you think about making it a utility function? It can be more reusable and easier to test this way (even doctests could be enough). I'd make it accept a URL or `ParseResult` instead of a scrapy Request or Response.
redapple (Author), Oct 18, 2016:
Good idea @kmike, I even had a similar use-case for https://github.com/umrashrf/scrapy/pull/1/files#diff-c1b844ba341115f81f50b33fd691fcbbR14
redapple (Author), Oct 18, 2016:
@kmike, are you thinking of an addition to `w3lib.url`?
Also, the current output is a `ParseResult` tuple, mainly for comparing origins.
Does it make sense to have the utility function serialize/urlunparse on demand, with some argument?
Or should we just compare origins as strings? (That is probably the cleanest, as I don't really like returning different non-None types.)
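For context, a rough sketch of such a utility on top of the Python 3 stdlib (the name and signature here are illustrative only; the function that eventually landed in Scrapy is `scrapy.utils.url.strip_url`, whose exact signature may differ):

```python
from urllib.parse import urlparse, urlunparse

def strip_url(url, origin_only=False, strip_default_port=True):
    """Strip credentials (and, optionally, default HTTP/HTTPS ports, or
    everything but the origin) from ``url``, a string or ParseResult."""
    parts = urlparse(url) if isinstance(url, str) else url
    netloc = parts.hostname or ''
    if parts.port and not (strip_default_port and
                           (parts.scheme, parts.port) in (('http', 80),
                                                          ('https', 443))):
        netloc += ':%d' % parts.port
    if origin_only:
        # Keep only scheme://host[:port]/ for origin comparisons.
        return urlunparse((parts.scheme, netloc, '/', '', '', ''))
    # Drop credentials and the fragment; keep path, params and query.
    return urlunparse((parts.scheme, netloc, parts.path,
                       parts.params, parts.query, ''))
```

Accepting either a string or a `ParseResult` keeps the function reusable both from middleware code (which already parses URLs) and from tests.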
@@ -103,3 +103,35 @@ def guess_scheme(url):
         return any_to_uri(url)
     else:
         return add_http_if_no_scheme(url)
+
+
+def strip_url_credentials(url, origin_only=False, keep_fragments=False):
kmike (Member), Oct 21, 2016:
Could you please add a docstring to this function?
The function is named 'strip credentials', but it also strips the port, which I think is not intuitive.
redapple (Author), Oct 21, 2016:
Right, it strips standard HTTP and HTTPS port numbers; this is mainly for comparing origins.
A docstring is indeed needed.
Does it make sense to make this behavior switchable through an argument?
kmike (Member), Oct 21, 2016:
Yeah, it should be either an argument or a separate function. If it becomes an argument, then I think it is better to rename the function (not sure about the new name).
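Doctest-style examples for the sketch above, covering the switchable `strip_default_port` behavior discussed here (again illustrative, not the merged API):

```python
>>> strip_url('https://user:secret@example.com:443/page?q=1#frag')
'https://example.com/page?q=1'
>>> strip_url('https://example.com:8443/page', origin_only=True)
'https://example.com:8443/'
>>> strip_url('http://example.com:80/x', strip_default_port=False)
'http://example.com:80/x'
```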
I'm adding 2 policies that appeared in the working draft: …
Alright, @kmike, @eliasdorneles, @lopuhin, @dangra, I think this is ready for review.
(I'm not sure why I use …)
   using ``file://`` or ``s3://`` scheme.

.. warning::
   By default, Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
kmike (Member), Jan 18, 2017:
"By default" looks redundant.
.. warning::
   Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
   will send a non-empty "Referer" header from any ``https://`` to any ``https://`` URL,
   even if the domain is different.
kmike (Member), Jan 18, 2017:
If that's what browsers do by default, then I think it makes sense to note that, to make the warning less scary.
- a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
  either a custom one or one of the built-in ones
  (see ``scrapy.spidermiddlewares.referer``),
kmike (Member), Jan 18, 2017:
Is there a page in the docs we can reference?
redapple (Author), Jan 26, 2017:
What kind of reference do you have in mind? The built-in classes?
We could also define an interface for policies, so people know what to implement.
redapple (Author), Jan 26, 2017:
@kmike, an alternative is to use a table:

$ git diff
diff --git a/docs/topics/spider-middleware.rst b/docs/topics/spider-middleware.rst
index 9010ae7..46a84d0 100644
--- a/docs/topics/spider-middleware.rst
+++ b/docs/topics/spider-middleware.rst
@@ -346,21 +346,21 @@ This setting accepts:
 - a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
   either a custom one or one of the built-in ones
   (see ``scrapy.spidermiddlewares.referer``),
-- or one of the standard W3C-defined string values:
-
-  - `"no-referrer"`_,
-  - `"no-referrer-when-downgrade"`_
-    (the W3C-recommended default, used by major web browsers),
-  - `"same-origin"`_,
-  - `"origin"`_,
-  - `"strict-origin"`_,
-  - `"origin-when-cross-origin"`_,
-  - `"strict-origin-when-cross-origin"`_,
-  - or `"unsafe-url"`_
-    (not recommended).
-
-It can also be the non-standard value ``"scrapy-default"`` to use
-Scrapy's default referrer policy.
+- or one of the standard W3C-defined string values
+
+======================================= ======================================================================== =======================================================
+String value                            Class name
+======================================= ======================================================================== =======================================================
+`"no-referrer"`_                        ``'scrapy.spidermiddlewares.referer.NoReferrerPolicy'``
+`"no-referrer-when-downgrade"`_         ``'scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy'``    the W3C-recommended default, used by major web browsers
+`"same-origin"`_                        ``'scrapy.spidermiddlewares.referer.SameOriginPolicy'``
+`"origin"`_                             ``'scrapy.spidermiddlewares.referer.OriginPolicy'``
+`"strict-origin"`_                      ``'scrapy.spidermiddlewares.referer.StrictOriginPolicy'``
+`"origin-when-cross-origin"`_           ``'scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy'``
+`"strict-origin-when-cross-origin"`_    ``'scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy'``
+`"unsafe-url"`_                         ``'scrapy.spidermiddlewares.referer.UnsafeUrlPolicy'``                  NOT recommended
+``"scrapy-default"``                    ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``            Scrapy's default policy (see below)
+======================================= ======================================================================== =======================================================
 
 Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
 with the addition that "Referer" is not sent if the parent request was
kmike (Member), Feb 21, 2017:
I like such a table.
Also, adding autodocs for these policy classes looks helpful: they already have docstrings, it is just a matter of putting several ``autoclass`` directives into the Scrapy rst docs.
kmike (Member), Feb 21, 2017:
As for "see ``scrapy.spidermiddlewares.referer``": the question was how a user would find this `scrapy.spidermiddlewares.referer`; there is no link or anything.
kmike (Member), Feb 21, 2017:
Documenting the policy interface sounds like too much to me; I'm not sure this is something users may want to extend.
Default: ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``

.. reqmeta:: referrer_policy
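For illustration, the setting documented in this snippet would be used like this in a project's settings.py (a brief sketch; the string values are the ones from the table above):

```python
# settings.py: pick a referrer policy project-wide,
# either by W3C string value...
REFERRER_POLICY = 'same-origin'
# ...or by class path (custom ReferrerPolicy subclasses work too):
# REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.SameOriginPolicy'
```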
kmike (Member), Jan 18, 2017:
It could make sense to use the same spelling everywhere in Scrapy. Currently the option name (REFERER_POLICY) has one 'r' while the meta key name has 'rr'.
kmike (Member), Jan 18, 2017:
Aha, I see, we already have REFERER_ENABLED.
redapple (Author), Jan 26, 2017:
I'm OK with changing to `REFERRER_POLICY`. (It's a shame about that spelling mistake in the RFCs.)
kmike (Member), Feb 22, 2017:
We're losing either way - now the option names are inconsistent :)
kmike (Member), Feb 22, 2017:
I guess the way to go is to rename the other option, and the middleware itself as well; of course, it can be done later.
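And the per-request override via the meta key under discussion, as a minimal sketch (the spider name and URLs are made up):

```python
import scrapy

class ReferrerDemoSpider(scrapy.Spider):
    name = 'referrer-demo'  # hypothetical spider
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Override the project-wide REFERRER_POLICY for this one request
        # via the meta key introduced in this PR.
        yield scrapy.Request(
            'https://other.example/page',
            meta={'referrer_policy': 'same-origin'},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        self.logger.info('Referer sent: %r',
                         response.request.headers.get('Referer'))
```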
   Scrapy's default referrer policy.

Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
with the addition that "Referrer" is not sent if the parent request was
kmike (Member), Jan 18, 2017:
The header name is Referer.
   using ``file://`` or ``s3://`` scheme.

.. warning::
   Scrapy's default referrer policy—just like `"no-referrer-when-downgrade"`_,
kmike (Member), Jan 18, 2017:
A nitpick - there should be spaces before and after a dash :)
@kmike, I rebased and updated the docs with …
Another thing could be to move the policies to scrapy/extensions, similar to HTTP cache policies (with …)
        try:
            self.default_policy = _policy_classes[policy.lower()]
        except:
            raise NotConfigured("Unknown referrer policy name %r" % policy)
kmike (Member), Feb 22, 2017:
I think it is better to raise an error and stop the crawl instead of silently disabling the middleware if a policy is unknown.
redapple (Author), Feb 22, 2017:
Good point!
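A sketch of the suggested change (`load_policy` is a hypothetical helper; in the PR this lookup lives in the middleware's `__init__`, and `_policy_classes` maps lower-cased policy names to policy classes):

```python
def load_policy(policy, _policy_classes):
    """Resolve a policy name to a policy class, failing hard on unknown names."""
    try:
        return _policy_classes[policy.lower()]
    except KeyError:
        # Catching only KeyError (instead of a bare `except:`) avoids
        # swallowing unrelated bugs, and raising a hard error stops the
        # crawl rather than silently disabling the middleware via
        # NotConfigured.
        raise RuntimeError("Unknown referrer policy name %r" % policy)
```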
        policy_name = to_native_str(
            response.headers.get('Referrer-Policy', '').decode('latin1'))

        cls = _policy_classes.get(policy_name.lower(), self.default_policy)
kmike (Member), Feb 22, 2017:
What do you think about warning for unknown policies? They can be typos.
redapple (Author), Feb 22, 2017:
Makes sense.
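A sketch of what that warning could look like (`resolve_policy` is a hypothetical helper; the PR does the lookup inline in the middleware):

```python
import logging

logger = logging.getLogger(__name__)

def resolve_policy(policy_name, _policy_classes, default_policy):
    """Map a Referrer-Policy header value to a policy class,
    warning (not failing) on unknown values."""
    if not policy_name:
        return default_policy
    cls = _policy_classes.get(policy_name.lower())
    if cls is None:
        # Header values come from remote servers, so unlike the setting
        # (which should fail hard), a typo here only merits a warning.
        logger.warning("Unknown referrer policy %r; using default policy",
                       policy_name)
        return default_policy
    return cls
```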
    NOREFERRER_SCHEMES = LOCAL_SCHEMES

    def referrer(self, response, request):
kmike (Member), Feb 22, 2017:
It seems the API requires response and request (not response_url and request_url) only because of urlparse_cached: you wanted to avoid duplicate computations. It leads to some ugly code when a fake Response is created. Is this correct, or are there cases where response and request are indeed needed?
I wonder how big the effect is, given that urlparse has its own internal cache (using URLs as keys, not responses/requests).
redapple (Author), Feb 22, 2017:
Right @kmike, that was the reason.
I cannot think right now of a real need for handling requests/responses directly. Meta information and HTTP headers ("Referrer-Policy") are to be interpreted at the middleware level, not at the policy level, I believe.
Indeed, the fake response is quite ugly.
Let me clean it up with URLs all along.
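A sketch of the URL-based shape being agreed on here (method and argument names follow this thread; this is not the PR's exact code):

```python
from urllib.parse import urlparse

class SameOriginPolicy:
    """Sketch: policies take plain URL strings, so the middleware never
    needs to build a fake Response just to call urlparse_cached()."""

    def referrer(self, response_url, request_url):
        if self.origin(response_url) == self.origin(request_url):
            return self.stripped_referrer(response_url)
        return None

    def origin(self, url):
        parts = urlparse(url)
        return (parts.scheme, parts.hostname, parts.port)

    def stripped_referrer(self, url):
        # Drop credentials and the fragment before using a URL as referrer.
        parts = urlparse(url)
        netloc = parts.hostname or ''
        if parts.port:
            netloc = '%s:%d' % (netloc, parts.port)
        return parts._replace(netloc=netloc, fragment='').geturl()
```

With this shape the middleware resolves Request/Response objects to URLs once, and the policies stay trivially unit-testable.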
@kmike, I think I have addressed all your comments (and rebased).
@redapple, tests are failing - could you please check it?
@kmike, done. Sorry about that.
                         strip_default_port=True,
                         origin_only=origin_only)

    def origin(self, r):
kmike (Member), Mar 2, 2017:
Is there a reason some of the arguments are named `r` while others are named `url`?
redapple (Author), Mar 2, 2017:
Not really. I kept request/response when the 2 arguments are needed. But I can use "url" in others.
kmike (Member), Mar 2, 2017:
Hm, what's the reason the arguments are still named request/response, and not something like request_url? Maybe I got a bit too used to type hints :)
redapple (Author), Mar 2, 2017:
Most probably laziness and readability. I can change that too if you prefer.
kmike (Member), Mar 2, 2017:
Could you please change it? When looking at the code, I expect scrapy Request and Response objects to be passed to the .referrer method; to figure out that they are URLs, you need to find the calling code. I agree that `request_url` looks worse, but IMHO it can help readability here.
redapple (Author), Mar 2, 2017:
Done.
Looks great, thanks @redapple!
Fixes #2142

I've based the tests and implementation on https://www.w3.org/TR/referrer-policy/

By default, with this change, `RefererMiddleware` does not send a `Referer` header (referrer value):
- for `file://` or `s3://` requests,
- from an `https://` response to an `http://` request.

However, it does send a referrer value for any `https://` to another `https://` URL (as browsers do, `"no-referrer-when-downgrade"` being the default policy). This needs discussion.

The user can change the policy per-request, using a new `referrer_policy` meta key, with values from "no-referrer" / "no-referrer-when-downgrade" / "same-origin" / "origin" / "origin-when-cross-origin" / "unsafe-url".

Still missing:
- `urlparse_cached()`
- a `REFERER_DEFAULT_POLICY` setting
- handling the `Referrer-Policy` response header -- is this even used in practice by web servers?

To handle redirects, I added a signal handler on `request_scheduled`. I did not find a way to test both the redirect middleware and the referer middleware (I did not search too much though). Suggestions to do that are welcome.
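For reference, the signal-handler approach described above might look roughly like this (a sketch, not the PR's exact code; `redirect_urls` is the meta key Scrapy's redirect middleware uses to record prior URLs, and the Referer recomputation itself is elided):

```python
from scrapy import signals

class RefererMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # Redirected requests are created inside the redirect middleware and
        # never pass through process_spider_output, so hook request_scheduled
        # to fix up their Referer header as well.
        crawler.signals.connect(mw.request_scheduled,
                                signal=signals.request_scheduled)
        return mw

    def request_scheduled(self, request, spider):
        redirected_urls = request.meta.get('redirect_urls', [])
        if redirected_urls:
            # This request came from a redirect chain: recompute the
            # Referer value for the new URL here (elided).
            ...
```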