[MRG] Referrer policies in RefererMiddleware #2306
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2306 +/- ##
=========================================
- Coverage 83.93% 83.34% -0.6%
=========================================
Files 161 161
Lines 8920 9060 +140
Branches 1316 1290 -26
=========================================
+ Hits 7487 7551 +64
- Misses 1176 1240 +64
- Partials 257 269 +12
scrapy/spidermiddlewares/referer.py (outdated)

        if stripped is not None:
            return urlunparse(stripped)

    def strip_url(self, req_or_resp, origin_only=False):
What do you think about making it a utility function? It can be more reusable and easier to test this way (even doctests could be enough). I'd make it accept a URL or ParseResult instead of a scrapy Request or Response.
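A rough sketch of what that utility could look like (the name and signature here are illustrative, not the final API from the PR):

from urllib.parse import ParseResult, urlparse, urlunparse

def strip_url(url, origin_only=False):
    """Sketch of a reusable URL-stripping helper.

    Accepts a URL string or a ParseResult, as suggested above,
    instead of a scrapy Request or Response.
    """
    parts = url if isinstance(url, ParseResult) else urlparse(url)
    if origin_only:
        # Keep scheme://netloc/ only; drop path, params, query, fragment.
        parts = parts._replace(path='/', params='', query='', fragment='')
    return urlunparse(parts)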
Good idea @kmike, I even had a similar use case for https://github.com/umrashrf/scrapy/pull/1/files#diff-c1b844ba341115f81f50b33fd691fcbbR14
@kmike, are you thinking of an addition to w3lib.url?
Also, the current output is a ParseResult tuple, mainly for comparing origins.
Does it make sense to have the utility function serialize/urlunparse on demand, with some argument?
Or should we just compare origins as strings? (which is probably the cleanest as I don't really like returning different non-None types)
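For the compare-as-strings option, a minimal sketch (the helper name is hypothetical, and a real version would also strip credentials and default ports):

from urllib.parse import urlparse

def origin_str(url):
    """Sketch: serialize the origin of a URL as a plain string."""
    parts = urlparse(url)
    return '%s://%s' % (parts.scheme, parts.netloc)

# Origins compare as plain strings, so the helper returns a single type:
assert origin_str('https://example.com/a?x=1') == origin_str('https://example.com/b')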
@kmike, I moved some of it to scrapy.utils.url. What do you think?
scrapy/utils/url.py (outdated)

@@ -103,3 +103,35 @@ def guess_scheme(url):
        return any_to_uri(url)
    else:
        return add_http_if_no_scheme(url)


def strip_url_credentials(url, origin_only=False, keep_fragments=False):
Could you please add a docstring to this function?
The function is named 'strip credentials', but it also strips the port; I think this is not intuitive.
Right, it strips standard HTTP and HTTPS port numbers. This is mainly for comparing origins.
A docstring is needed, indeed.
Does it make sense to make this behavior switchable through an argument?
Yeah, it should be either an argument or a separate function. If it becomes an argument then I think it is better to rename the function (not sure about the new name).
@kmike, code updated with the new name for the helper: strip_url()
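For reference, usage of the renamed helper could look roughly like this (the keyword arguments are the ones visible in later diff hunks; the exact defaults and outputs are an assumption):

from scrapy.utils.url import strip_url

# Assuming credentials and default ports are stripped by default:
print(strip_url('https://user:secret@example.com:443/page?q=1'))
# e.g. 'https://example.com/page?q=1'

# origin_only reduces the URL to its origin, for origin comparison:
print(strip_url('https://example.com/page?q=1', origin_only=True))
# e.g. 'https://example.com/'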
I'm adding 2 policies that appeared in the working draft: …
(force-pushed from 41eb742 to 31dd441)
Alright, @kmike, @eliasdorneles, @lopuhin, @dangra, I think this is ready for review.
(I'm not sure why I use …)
docs/topics/spider-middleware.rst (outdated)

    using ``file://`` or ``s3://`` scheme.

.. warning::
    By default, Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
"By default" looks redundant
docs/topics/spider-middleware.rst (outdated)

.. warning::
    Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
    will send a non-empty "Referer" header from any ``https://`` to any ``https://`` URL,
    even if the domain is different.
If that's what browsers do by default then I think it makes sense to note that, to make the warning less scary.
(force-pushed from c4e8297 to f43be28)
docs/topics/spider-middleware.rst (outdated)

- a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
  either a custom one or one of the built-in ones
  (see ``scrapy.spidermiddlewares.referer``),
Is there a page in the docs we can reference?
What kind of reference do you have in mind? built-in classes?
We could also define an interface for policies, so people know what to implement.
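A rough sketch of what such an interface could look like, using the class and attribute names visible in the diff hunks below (the exact referrer() signature is still debated further down in this thread):

class ReferrerPolicy(object):
    """Sketch of a base interface for referrer policies."""

    # Schemes for which no referrer should ever be sent
    # (NOREFERRER_SCHEMES appears in a diff hunk below).
    NOREFERRER_SCHEMES = ('file', 's3')

    def referrer(self, response_url, request_url):
        """Return the "Referer" value to send, or None to send nothing."""
        raise NotImplementedError

class NoReferrerPolicy(ReferrerPolicy):
    """The "no-referrer" policy: never send the header."""
    def referrer(self, response_url, request_url):
        return None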
@kmike, an alternative is to use a table:
$ git diff
diff --git a/docs/topics/spider-middleware.rst b/docs/topics/spider-middleware.rst
index 9010ae7..46a84d0 100644
--- a/docs/topics/spider-middleware.rst
+++ b/docs/topics/spider-middleware.rst
@@ -346,21 +346,21 @@ This setting accepts:
 - a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
   either a custom one or one of the built-in ones
   (see ``scrapy.spidermiddlewares.referer``),
-- or one of the standard W3C-defined string values:
-
-  - `"no-referrer"`_,
-  - `"no-referrer-when-downgrade"`_
-    (the W3C-recommended default, used by major web browsers),
-  - `"same-origin"`_,
-  - `"origin"`_,
-  - `"strict-origin"`_,
-  - `"origin-when-cross-origin"`_,
-  - `"strict-origin-when-cross-origin"`_,
-  - or `"unsafe-url"`_
-    (not recommended).
-
-It can also be the non-standard value ``"scrapy-default"`` to use
-Scrapy's default referrer policy.
+- or one of the standard W3C-defined string values
+
+======================================= ======================================================================== =======================================================
+String value                            Class name
+======================================= ======================================================================== =======================================================
+`"no-referrer"`_                        ``'scrapy.spidermiddlewares.referer.NoReferrerPolicy'``
+`"no-referrer-when-downgrade"`_         ``'scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy'``    the W3C-recommended default, used by major web browsers
+`"same-origin"`_                        ``'scrapy.spidermiddlewares.referer.SameOriginPolicy'``
+`"origin"`_                             ``'scrapy.spidermiddlewares.referer.OriginPolicy'``
+`"strict-origin"`_                      ``'scrapy.spidermiddlewares.referer.StrictOriginPolicy'``
+`"origin-when-cross-origin"`_           ``'scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy'``
+`"strict-origin-when-cross-origin"`_    ``'scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy'``
+`"unsafe-url"`_                         ``'scrapy.spidermiddlewares.referer.UnsafeUrlPolicy'``                  NOT recommended
+``"scrapy-default"``                    ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``            Scrapy's default policy (see below)
+======================================= ======================================================================== =======================================================
 
 Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
 with the addition that "Referer" is not sent if the parent request was
I like such a table.
Also, adding autodocs for these policy classes looks helpful - they already have docstrings, so it is just a matter of putting several autoclass directives into the Scrapy rst docs.
as for "see scrapy.spidermiddlewares.referer
", the question was how would user find this scrapy.spidermiddlewares.referer
- there is no link or anything.
Documenting the policy interface sounds like too much to me; I'm not sure this is something users may want to extend.
docs/topics/spider-middleware.rst (outdated)

Default: ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``

.. reqmeta:: referrer_policy
It could make sense to use the same spelling everywhere in Scrapy. Currently the option name (REFERER_POLICY) has one 'r' while the meta key name has 'rr'.
aha, I see, we already have REFERER_ENABLED
I'm ok with changing to REFERRER_POLICY. (It's a shame about that spelling mistake in the RFCs.)
We're losing either way - now the option names are inconsistent :)
I guess the way to go is to rename the other option and the middleware itself as well; of course, it can be done later.
docs/topics/spider-middleware.rst (outdated)

    Scrapy's default referrer policy.

Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
with the addition that "Referrer" is not sent if the parent request was
header name is Referer
docs/topics/spider-middleware.rst (outdated)

    using ``file://`` or ``s3://`` scheme.

.. warning::
    Scrapy's default referrer policy—just like `"no-referrer-when-downgrade"`_,
a nitpick - there should be spaces before and after a dash :)
(force-pushed from ebc2f8f to 4e78d19)
@kmike, I rebased, updated the docs with …
Another thing could be to move the policies to scrapy/extensions, similar to HTTP cache policies (with …)
scrapy/spidermiddlewares/referer.py (outdated)

        try:
            self.default_policy = _policy_classes[policy.lower()]
        except:
            raise NotConfigured("Unknown referrer policy name %r" % policy)
I think it is better to raise an error and stop the crawl instead of silently disabling the middleware if a policy is unknown.
Good point!
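A sketch of that change (the exception type is my choice for illustration; _policy_classes is the name-to-class mapping from the diff hunk above, stubbed here):

_policy_classes = {}  # stub: maps lowercase policy names to policy classes

def resolve_policy(policy):
    try:
        return _policy_classes[policy.lower()]
    except KeyError:
        # Fail loudly and stop the crawl: an unknown policy name is most
        # likely a typo, and NotConfigured would only silently disable
        # the middleware.
        raise RuntimeError('Unknown referrer policy name %r' % policy)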
scrapy/spidermiddlewares/referer.py (outdated)

        policy_name = to_native_str(
            response.headers.get('Referrer-Policy', '').decode('latin1'))

        cls = _policy_classes.get(policy_name.lower(), self.default_policy)
What do you think about warning on unknown policies? They could be typos.
Makes sense.
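Something along these lines (a sketch; the function and logger names are illustrative, not from the PR):

import logging

logger = logging.getLogger(__name__)

def policy_from_header(policy_name, policy_classes, default_policy):
    """Resolve a Referrer-Policy header value, warning on unknown names."""
    if policy_name and policy_name.lower() not in policy_classes:
        # An unknown name is likely a typo, either in the response header
        # or in a setting; warn instead of failing silently.
        logger.warning('Unknown referrer policy %r; using default policy',
                       policy_name)
    return policy_classes.get(policy_name.lower(), default_policy)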
scrapy/spidermiddlewares/referer.py (outdated)

    NOREFERRER_SCHEMES = LOCAL_SCHEMES

    def referrer(self, response, request):
It seems the API requires response and request (not response_url and request_url) only because of urlparse_cached - you wanted to avoid duplicate computations. It leads to some ugly code where a fake Response is created. Is this correct, or are there cases where response and request are indeed needed?
I wonder how big the effect is, given that urlparse has its own internal cache (keyed by URLs, not responses/requests).
Right @kmike , that was the reason.
I cannot think right now of a real need for handling requests/responses directly. Meta information and HTTP headers ("Referrer-Policy") are to be interpreted at the middleware level, not at the policy level, I believe.
Indeed the fake response is quite ugly.
Let me clean it up with URLs all along.
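To illustrate the URL-based shape (a sketch only; a real policy would also strip the URL via the helper discussed above before using it as the referrer value):

from urllib.parse import urlparse

class SameOriginPolicy(object):
    """Sketch of the "same-origin" policy working on plain URLs."""

    def origin(self, url):
        parts = urlparse(url)
        return (parts.scheme, parts.hostname, parts.port)

    def referrer(self, response_url, request_url):
        # Send a referrer only when both URLs share the same origin.
        if self.origin(response_url) == self.origin(request_url):
            return response_url
        return None

policy = SameOriginPolicy()
print(policy.referrer('https://example.com/a', 'https://example.com/b'))  # sent
print(policy.referrer('https://example.com/a', 'https://other.org/'))     # None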
(force-pushed from c3c69d9 to efa5003)
@kmike, I think I have addressed all your comments (and rebased).
@redapple tests are failing - could you please check?
@kmike, done. Sorry about that.
scrapy/spidermiddlewares/referer.py (outdated)

                        strip_default_port=True,
                        origin_only=origin_only)

    def origin(self, r):
Is there a reason some of the arguments are named r while others are named url?
Not really. I kept request/response where the two arguments are needed, but I can use "url" in the others.
Hm, what's the reason the arguments are still named request/response and not something like request_url? Maybe I got a bit too used to type hints :)
Most probably laziness and readability. I can change that too if you prefer.
Could you please change it? When looking at the code I expect scrapy Request and Response objects to be passed to the .referrer method; to figure out that they are urls you need to find the calling code. I agree that request_url looks worse, but imho it can help readability here.
Done.
Looks great, thanks @redapple!
Fixes #2142
I've based the tests and implementation on https://www.w3.org/TR/referrer-policy/
By default, with this change, RefererMiddleware does not send a Referer header (referrer value):

- for file:// or s3://,
- for an https:// response to an http:// request.

However, it does send a referrer value for any https:// to another https:// (as browsers do, "no-referrer-when-downgrade" being the default policy). This needs discussion.
Users can change the policy per-request, using a new referrer_policy meta key, with values from "no-referrer" / "no-referrer-when-downgrade" / "same-origin" / "origin" / "origin-when-cross-origin" / "unsafe-url".

Still missing:

- using urlparse_cached()
- a REFERER_DEFAULT_POLICY setting
- reading the policy from response headers (Referrer-Policy) -- is this even used in practice by web servers?

To handle redirects, I added a signal handler on request_scheduled. I did not find a way to test both the redirect middleware and the referer middleware together (I did not search too much though). Suggestions to do that are welcome.
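As a usage note, overriding the policy for a single request via the referrer_policy meta key described above could look like this (the spider name and URLs are placeholders):

import scrapy

class ReferrerDemoSpider(scrapy.Spider):
    name = 'referrer-demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Per-request policy override via the new referrer_policy meta key:
        yield scrapy.Request(
            'https://example.com/next',
            meta={'referrer_policy': 'same-origin'},
            callback=self.parse,
        )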