[MRG] Referrer policies in RefererMiddleware #2306

Merged
merged 41 commits into from Mar 2, 2017

@redapple
Contributor

@redapple redapple commented Oct 5, 2016

Fixes #2142

I've based the tests and implementation on https://www.w3.org/TR/referrer-policy/

By default, with this change, RefererMiddleware does not send a Referer header (referrer value):

  • when the source response is file:// or s3://,
  • nor from https:// response to an http:// request.

However, it does send a referrer value from any https:// response to any other https:// request (as browsers do, "no-referrer-when-downgrade" being the default policy).
This needs discussion.

Users can change the policy per request, using a new referrer_policy meta key, with one of the values "no-referrer", "no-referrer-when-downgrade", "same-origin", "origin", "origin-when-cross-origin" or "unsafe-url".
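
The default behaviour described above can be sketched in plain Python. This is only an illustration of the stated rules, not the code in this PR: `default_referrer` and `strip_for_referrer` are hypothetical names, and the sketch models nothing beyond the scheme checks listed in the description.

```python
from urllib.parse import urlparse, urlunparse

# Source schemes that never produce a referrer, per the description above.
LOCAL_SCHEMES = ("file", "s3")


def strip_for_referrer(url):
    """Drop credentials and the fragment before using *url* as a referrer."""
    p = urlparse(url)
    netloc = p.hostname or ""
    if p.port is not None:
        netloc += ":%d" % p.port
    return urlunparse((p.scheme, netloc, p.path or "/", p.params, p.query, ""))


def default_referrer(response_url, request_url):
    """Return the Referer value to send, or None to send no header."""
    src = urlparse(response_url).scheme
    dst = urlparse(request_url).scheme
    if src in LOCAL_SCHEMES:
        return None  # file:// or s3:// source: no referrer
    if src == "https" and dst != "https":
        return None  # downgrade from https:// to http://
    # Everything else, including https:// to a different https:// domain.
    return strip_for_referrer(response_url)
```

Note that the last branch is what makes a cross-domain https:// to https:// request carry the full source URL.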

Still missing:

  • Try to use urlparse_cached()
  • Make default policy customizable through a REFERER_DEFAULT_POLICY setting
  • Test custom policies
  • Docs updates
  • Add tests for policy given by response headers (Referrer-Policy) -- is this even used in practice by web servers?
  • Handle referrers during redirects

To handle redirects, I added a signal handler on request_scheduled. I did not find a way to test both redirect middleware and referer middleware (I did not search too much though).
Suggestions to do that are welcome.

@codecov-io

@codecov-io codecov-io commented Oct 5, 2016

Codecov Report

Merging #2306 into master will decrease coverage by 0.6%.
The diff coverage is 94.44%.

@@            Coverage Diff            @@
##           master    #2306     +/-   ##
=========================================
- Coverage   83.93%   83.34%   -0.6%     
=========================================
  Files         161      161             
  Lines        8920     9060    +140     
  Branches     1316     1290     -26     
=========================================
+ Hits         7487     7551     +64     
- Misses       1176     1240     +64     
- Partials      257      269     +12
Impacted Files                           Coverage Δ
scrapy/utils/url.py                      100% <100%> (ø)
scrapy/settings/default_settings.py      98.63% <100%> (ø)
scrapy/spidermiddlewares/referer.py      93% <93.98%> (+8.39%)
scrapy/core/downloader/handlers/s3.py    62.9% <0%> (-32.26%)
scrapy/utils/boto.py                     46.66% <0%> (-26.67%)
scrapy/utils/gz.py                       85.29% <0%> (-14.71%)
scrapy/link.py                           86.36% <0%> (-13.64%)
scrapy/core/downloader/tls.py            80% <0%> (-8.58%)
scrapy/_monkeypatches.py                 50% <0%> (-7.15%)
scrapy/utils/iterators.py                95.45% <0%> (-4.55%)
... and 17 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware Referrer policies in RefererMiddleware Oct 12, 2016
        if stripped is not None:
            return urlunparse(stripped)

    def strip_url(self, req_or_resp, origin_only=False):

@kmike

kmike Oct 18, 2016
Member

What do you think about making it a utility function? It can be more reusable and easier to test this way (even doctests could be enough). I'd make it accept a URL or ParseResult instead of a Scrapy Request or Response.

@redapple

redapple Oct 18, 2016
Author Contributor

@kmike , are you thinking of an addition to w3lib.url?
Also, the current output is a ParseResult tuple, mainly for comparing origins.
Does it make sense to have the utility function serialize/urlunparse on demand, with some argument?
Or should we just compare origins as strings? (which is probably the cleanest as I don't really like returning different non-None types)
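
The "compare origins as strings" option can be sketched as follows. This is a minimal illustration under assumptions: `origin` and `same_origin` are hypothetical helper names, only the standard HTTP and HTTPS ports are treated as defaults, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Serialize the origin of *url* as 'scheme://host[:port]'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port
    # Keep the port only when it differs from the scheme's default,
    # so that http://example.com and http://example.com:80 compare equal.
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = "%s:%d" % (host, port)
    return "%s://%s" % (parts.scheme, host)


def same_origin(url_a, url_b):
    """Compare two URLs by their serialized origins."""
    return origin(url_a) == origin(url_b)
```

Returning a single string type keeps the helper's contract simple, which is the concern raised above about returning different non-None types.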

@redapple

redapple Oct 19, 2016
Author Contributor

@kmike , I moved some of it to scrapy.utils.url. What do you think?

@@ -103,3 +103,35 @@ def guess_scheme(url):
        return any_to_uri(url)
    else:
        return add_http_if_no_scheme(url)


def strip_url_credentials(url, origin_only=False, keep_fragments=False):

@kmike

kmike Oct 21, 2016
Member

Could you please add a docstring to this function?

The function is named 'strip credentials', but it also strips the port, which I think is not intuitive.

@redapple

redapple Oct 21, 2016
Author Contributor

Right, it strips standard HTTP and HTTPS port numbers. This is mainly for comparing origins.
Docstring is needed indeed.
Does it make sense to have this feature as switchable through an argument?
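
One way to make each stripping step switchable is a keyword argument per step. The sketch below is an assumption-laden illustration, not the helper that ended up in scrapy.utils.url: the name `strip_url_sketch` and its exact behaviour are hypothetical, and only ports 80 and 443 count as defaults.

```python
from urllib.parse import urlparse, urlunparse

DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_url_sketch(url, strip_credentials=True, strip_default_port=True,
                     origin_only=False, strip_fragment=True):
    """Return *url* with the selected components removed."""
    p = urlparse(url)
    netloc = p.netloc
    # Credentials are never part of an origin, so origin_only implies this.
    if (strip_credentials or origin_only) and "@" in netloc:
        netloc = netloc.rsplit("@", 1)[-1]
    # Drop ":80" for http and ":443" for https; keep any other port.
    if (strip_default_port and p.port is not None
            and p.port == DEFAULT_PORTS.get(p.scheme)):
        netloc = netloc.rsplit(":", 1)[0]
    return urlunparse((
        p.scheme,
        netloc,
        "/" if origin_only else p.path,
        "" if origin_only else p.params,
        "" if origin_only else p.query,
        "" if (origin_only or strip_fragment) else p.fragment,
    ))
```

With explicit flags, the function name no longer has to promise one specific behaviour, which addresses the naming concern raised in the review.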

@kmike

kmike Oct 21, 2016
Member

Yeah, it should be either an argument or a separate function. If it becomes an argument then I think it is better to rename the function (not sure about the new name).

@redapple

redapple Oct 25, 2016
Author Contributor

@kmike , code updated with new name for helper strip_url()

@redapple redapple changed the title Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Nov 16, 2016
@redapple redapple added this to the v1.3 milestone Nov 16, 2016
@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple
Contributor Author

@redapple redapple commented Jan 17, 2017

I'm adding 2 policies that appeared in the working draft: "strict-origin" and "strict-origin-when-cross-origin".

@redapple redapple force-pushed the redapple:referrer-policy branch from 41eb742 to 31dd441 Jan 17, 2017
@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple
Contributor Author

@redapple redapple commented Jan 18, 2017

Alright, @kmike , @eliasdorneles , @lopuhin , @dangra , I think this is ready for review.
Thanks in advance!

@redapple
Contributor Author

@redapple redapple commented Jan 18, 2017

(I'm not sure why I use ftps in some tests. It's not supported by Scrapy)

using ``file://`` or ``s3://`` scheme.

.. warning::
    By default, Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,

@kmike

kmike Jan 18, 2017
Member

"By default" looks redundant

.. warning::
    Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
    will send a non-empty "Referer" header from any ``https://`` to any ``https://`` URL,
    even if the domain is different.

@kmike

kmike Jan 18, 2017
Member

If that's what browsers do by default then I think it makes sense to note that, to make a warning less scary.

@redapple redapple force-pushed the redapple:referrer-policy branch from c4e8297 to f43be28 Jan 18, 2017

- a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
  either a custom one or one of the built-in ones
  (see ``scrapy.spidermiddlewares.referer``),

@kmike

kmike Jan 18, 2017
Member

Is there a page in the docs we can refer to?

@redapple

redapple Jan 26, 2017
Author Contributor

What kind of reference do you have in mind? built-in classes?

We could also define an interface for policies, so people know what to implement.

@redapple

redapple Jan 26, 2017
Author Contributor

@kmike , an alternative is to use a table:

$ git diff
diff --git a/docs/topics/spider-middleware.rst b/docs/topics/spider-middleware.rst
index 9010ae7..46a84d0 100644
--- a/docs/topics/spider-middleware.rst
+++ b/docs/topics/spider-middleware.rst
@@ -346,21 +346,21 @@ This setting accepts:
 - a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
   either a custom one or one of the built-in ones
   (see ``scrapy.spidermiddlewares.referer``),
-- or one of the standard W3C-defined string values:
-
-  - `"no-referrer"`_,
-  - `"no-referrer-when-downgrade"`_
-    (the W3C-recommended default, used by major web browsers),
-  - `"same-origin"`_,
-  - `"origin"`_,
-  - `"strict-origin"`_,
-  - `"origin-when-cross-origin"`_,
-  - `"strict-origin-when-cross-origin"`_,
-  - or `"unsafe-url"`_
-    (not recommended).
-
-It can also be the non-standard value ``"scrapy-default"`` to use
-Scrapy's default referrer policy.
+- or one of the standard W3C-defined string values
+
+=======================================  ========================================================================  =======================================================
+String value                             Class name
+=======================================  ========================================================================  =======================================================
+`"no-referrer"`_                         ``'scrapy.spidermiddlewares.referer.NoReferrerPolicy'``
+`"no-referrer-when-downgrade"`_          ``'scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy'``      the W3C-recommended default, used by major web browsers
+`"same-origin"`_                         ``'scrapy.spidermiddlewares.referer.SameOriginPolicy'``
+`"origin"`_                              ``'scrapy.spidermiddlewares.referer.OriginPolicy'``
+`"strict-origin"`_                       ``'scrapy.spidermiddlewares.referer.StrictOriginPolicy'``
+`"origin-when-cross-origin"`_            ``'scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy'``
+`"strict-origin-when-cross-origin"`_     ``'scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy'``
+`"unsafe-url"`_                          ``'scrapy.spidermiddlewares.referer.UnsafeUrlPolicy'``                    NOT recommended
+``"scrapy-default"``                     ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``              Scrapy's default policy (see below)
+=======================================  ========================================================================  =======================================================
 
 Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
 with the addition that "Referer" is not sent if the parent request was
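
The string-to-class mapping in the proposed table can be sketched as a plain lookup. The class paths below are copied from the table; the names `POLICY_CLASS_PATHS` and `resolve_policy_path` are hypothetical, and treating any unrecognized value as a dotted class path is an assumption made for this illustration.

```python
# String values and class paths as listed in the proposed docs table.
_BASE = "scrapy.spidermiddlewares.referer."
POLICY_CLASS_PATHS = {
    "no-referrer": _BASE + "NoReferrerPolicy",
    "no-referrer-when-downgrade": _BASE + "NoReferrerWhenDowngradePolicy",
    "same-origin": _BASE + "SameOriginPolicy",
    "origin": _BASE + "OriginPolicy",
    "strict-origin": _BASE + "StrictOriginPolicy",
    "origin-when-cross-origin": _BASE + "OriginWhenCrossOriginPolicy",
    "strict-origin-when-cross-origin": _BASE + "StrictOriginWhenCrossOriginPolicy",
    "unsafe-url": _BASE + "UnsafeUrlPolicy",
    "scrapy-default": _BASE + "DefaultReferrerPolicy",
}


def resolve_policy_path(value):
    """Map a policy string to its class path; pass other values through.

    Assumption: a value that is not a known policy string is already a
    dotted path to a policy class (built-in or custom).
    """
    return POLICY_CLASS_PATHS.get(value.lower(), value)
```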

@kmike

kmike Feb 21, 2017
Member

I like this table.

Also, adding autodocs for these policy classes looks helpful: they already have docstrings, so it is just a matter of adding several autoclass directives to the Scrapy rst docs.

@kmike

kmike Feb 21, 2017
Member

As for "see scrapy.spidermiddlewares.referer", the question was how a user would find scrapy.spidermiddlewares.referer: there is no link or anything.

@kmike

kmike Feb 21, 2017
Member

Documenting the policy interface sounds like too much to me; I'm not sure this is something users would want to extend.


Default: ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``

.. reqmeta:: referrer_policy

@kmike

kmike Jan 18, 2017
Member

It could make sense to use the same spelling everywhere in Scrapy. Currently the option name (REFERER_POLICY) has one 'r' while the meta key name has 'rr'.

@kmike

kmike Jan 18, 2017
Member

aha, I see, we already have REFERER_ENABLED

@redapple

redapple Jan 26, 2017
Author Contributor

I'm ok with changing to REFERRER_POLICY. (It's a shame about that spelling mistake in the RFCs.)

@kmike

kmike Feb 22, 2017
Member

We're losing either way - now option names are inconsistent :)

@kmike

kmike Feb 22, 2017
Member

I guess the way to go is to rename the other option and a middleware itself as well; of course, it can be done later.

Scrapy's default referrer policy.

Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
with the addition that "Referrer" is not sent if the parent request was

@kmike

kmike Jan 18, 2017
Member

header name is Referer

using ``file://`` or ``s3://`` scheme.

.. warning::
Scrapy's default referrer policy—just like `"no-referrer-when-downgrade"`_,

@kmike

kmike Jan 18, 2017
Member

a nitpick - there should be spaces before and after a dash :)

@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Feb 21, 2017
@redapple redapple force-pushed the redapple:referrer-policy branch from ebc2f8f to 4e78d19 Feb 21, 2017
@redapple
Contributor Author

@redapple redapple commented Feb 21, 2017

@kmike , I rebased, updated the docs with autoclass directives and renamed the setting to REFERRER_POLICY.

@redapple
Contributor Author

@redapple redapple commented Feb 21, 2017

Another thing could be to move the policies to scrapy/extensions, similar to HTTP cache policies (with scrapy/scrapy/extensions/httpcache.py). What do you think?

        try:
            self.default_policy = _policy_classes[policy.lower()]
        except:
            raise NotConfigured("Unknown referrer policy name %r" % policy)

@kmike

kmike Feb 22, 2017
Member

I think it is better to raise an error and stop the crawl instead of silently disabling the middleware if a policy is unknown.

@redapple

redapple Feb 22, 2017
Author Contributor

Good point!

        policy_name = to_native_str(
            response.headers.get('Referrer-Policy', '').decode('latin1'))

        cls = _policy_classes.get(policy_name.lower(), self.default_policy)

@kmike

kmike Feb 22, 2017
Member

What do you think about warning for unknown policies? They can be typos.

@redapple

redapple Feb 22, 2017
Author Contributor

Makes sense.
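
The two review points (fail hard on an unknown policy in the setting, warn on an unknown policy in a response header) could look roughly like this. It is a sketch only: the function names, the exception type and the warning category are assumptions chosen for illustration, not what the PR uses.

```python
import warnings

KNOWN_POLICIES = frozenset([
    "no-referrer", "no-referrer-when-downgrade", "same-origin", "origin",
    "strict-origin", "origin-when-cross-origin",
    "strict-origin-when-cross-origin", "unsafe-url", "scrapy-default",
])


def policy_from_setting(name):
    """The setting is under the user's control: an unknown name is fatal."""
    normalized = name.lower()
    if normalized not in KNOWN_POLICIES:
        raise ValueError("Unknown referrer policy name %r" % name)
    return normalized


def policy_from_header(value, default):
    """A header comes from the server: warn and fall back to *default*."""
    normalized = value.strip().lower()
    if not normalized:
        return default  # no Referrer-Policy header was sent
    if normalized not in KNOWN_POLICIES:
        warnings.warn("Unknown Referrer-Policy value %r, using %r"
                      % (value, default), RuntimeWarning)
        return default
    return normalized
```

The asymmetry is deliberate in this sketch: a typo in the project's own configuration should stop the crawl, while a malformed header from a remote site should not.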


    NOREFERRER_SCHEMES = LOCAL_SCHEMES

    def referrer(self, response, request):

@kmike

kmike Feb 22, 2017
Member

It seems the API requires response and request (not response_url and request_url) only because of urlparse_cached - you wanted to avoid duplicate computations. It leads to some ugly code when a fake Response is created. Is this correct, or are there cases where response and request are indeed needed?

I wonder how large the effect is, given that urlparse has its own internal cache (using URLs as keys, not responses/requests).

@redapple

redapple Feb 22, 2017
Author Contributor

Right @kmike , that was the reason.
I cannot think right now of a real need for handling request/responses directly. Meta information and HTTP headers ("Referrer-Policy") are to be interpreted at middleware level, and not at policy level I believe.

Indeed the fake response is quite ugly.
Let me clean it up with URLs all along.

@redapple redapple force-pushed the redapple:referrer-policy branch from c3c69d9 to efa5003 Mar 1, 2017
@redapple
Contributor Author

@redapple redapple commented Mar 1, 2017

@kmike , I think I have addressed all your comments. (and rebased)

@kmike
Member

@kmike kmike commented Mar 1, 2017

@redapple tests are failing - could you please check it?

@redapple
Contributor Author

@redapple redapple commented Mar 2, 2017

@kmike , done. Sorry about that.

@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Mar 2, 2017
                         strip_default_port=True,
                         origin_only=origin_only)

    def origin(self, r):

@kmike

kmike Mar 2, 2017
Member

Is there a reason some of the arguments are named r while others are named url?

@redapple

redapple Mar 2, 2017
Author Contributor

Not really. I kept request/response when the 2 arguments are needed. But I can use "url" in others.

@kmike

kmike Mar 2, 2017
Member

Hm, what's the reason arguments are still named request/response, and not something like request_url? Maybe I got a bit too used to type hints :)

@redapple

redapple Mar 2, 2017
Author Contributor

Most probably laziness and readability. I can change that too if you prefer.

@kmike

kmike Mar 2, 2017
Member

Could you please change it? When looking at the code I expect Scrapy Request and Response objects to be passed to the .referrer method; to figure out that they are URLs you need to find the calling code. I agree that request_url looks worse, but IMHO it can help readability here.

@redapple

redapple Mar 2, 2017
Author Contributor

Done.
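
A policy whose method takes URL strings, with argument names that say so, might look like the sketch below. It illustrates the naming discussed in this thread using the W3C "origin-when-cross-origin" rule; the class and helper names are hypothetical, and the local-scheme checks of the real middleware are left out.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url):
    """(scheme, host, effective port) triple used to compare origins."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme)


def _netloc(url):
    """Host plus any non-default port, without credentials."""
    scheme, host, port = _origin(url)
    if port is None or port == DEFAULT_PORTS.get(scheme):
        return host or ""
    return "%s:%d" % (host, port)


class OriginWhenCrossOriginSketch:
    name = "origin-when-cross-origin"

    def referrer(self, response_url, request_url):
        # Both arguments are plain URL strings, not Response/Request objects.
        parts = urlsplit(response_url)
        if _origin(response_url) == _origin(request_url):
            # Same origin: full URL without credentials or fragment.
            return urlunsplit((parts.scheme, _netloc(response_url),
                               parts.path or "/", parts.query, ""))
        # Cross origin (including a scheme change): origin only.
        return urlunsplit((parts.scheme, _netloc(response_url), "/", "", ""))
```

Because the policy only ever sees strings, it can be exercised without constructing fake Response objects.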

@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Mar 2, 2017
@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Mar 2, 2017
@kmike kmike merged commit 7e8453c into scrapy:master Mar 2, 2017
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@kmike
Member

@kmike kmike commented Mar 2, 2017

Looks great, thanks @redapple!


3 participants