
[MRG] Referrer policies in RefererMiddleware #2306

Merged
merged 41 commits into scrapy:master on Mar 2, 2017

Conversation

Contributor

@redapple redapple commented Oct 5, 2016

Fixes #2142

I've based the tests and implementation on https://www.w3.org/TR/referrer-policy/

By default, with this change, RefererMiddleware does not send a Referer header (referrer value):

  • when the source response is file:// or s3://,
  • nor from an https:// response to an http:// request.

However, it does send a referrer value from any https:// URL to another https:// URL, even cross-domain (as browsers do; "no-referrer-when-downgrade" is the default policy).
This needs discussion.
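The default rule described above can be sketched as a plain scheme check. This is a minimal illustration with a hypothetical helper name, not code from this PR:

```python
from urllib.parse import urlparse

# Hypothetical sketch of the "no-referrer-when-downgrade" rule: send the
# Referer header unless navigation moves from a TLS-protected URL to a
# plain-text one (the "downgrade" case).
def should_send_referrer(response_url, request_url):
    from_secure = urlparse(response_url).scheme == 'https'
    to_secure = urlparse(request_url).scheme == 'https'
    # Only the https:// -> http:// transition suppresses the referrer.
    return not (from_secure and not to_secure)
```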

Users can change the policy per request, using a new referrer_policy meta key, with values from "no-referrer" / "no-referrer-when-downgrade" / "same-origin" / "origin" / "origin-when-cross-origin" / "unsafe-url".
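For contrast with the default, a "same-origin" decision can be sketched as follows. This is a hypothetical helper, not the PR's SameOriginPolicy class; the real policy also strips credentials and fragments from the value it sends:

```python
from urllib.parse import urlparse

# Sketch of the "same-origin" policy: send the referrer only when the
# source and target share a scheme and host; otherwise send nothing.
def same_origin_referrer(response_url, request_url):
    from_parts = urlparse(response_url)
    to_parts = urlparse(request_url)
    if (from_parts.scheme, from_parts.netloc) == (to_parts.scheme, to_parts.netloc):
        return response_url  # real policies would strip this URL first
    return None
```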

Still missing:

  • Try to use urlparse_cached()
  • Make default policy customizable through a REFERER_DEFAULT_POLICY setting
  • Test custom policies
  • Docs updates
  • Add tests for policy given by response headers (Referrer-Policy) -- is this even used in practice by web servers?
  • Handle referrers during redirects

To handle redirects, I added a signal handler on request_scheduled. I did not find a way to test the redirect middleware and the referer middleware together (I did not search much, though).
Suggestions to do that are welcome.


codecov-io commented Oct 5, 2016

Codecov Report

Merging #2306 into master will decrease coverage by 0.6%.
The diff coverage is 94.44%.

@@            Coverage Diff            @@
##           master    #2306     +/-   ##
=========================================
- Coverage   83.93%   83.34%   -0.6%     
=========================================
  Files         161      161             
  Lines        8920     9060    +140     
  Branches     1316     1290     -26     
=========================================
+ Hits         7487     7551     +64     
- Misses       1176     1240     +64     
- Partials      257      269     +12
Impacted Files Coverage Δ
scrapy/utils/url.py 100% <100%> (ø)
scrapy/settings/default_settings.py 98.63% <100%> (ø)
scrapy/spidermiddlewares/referer.py 93% <93.98%> (+8.39%)
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%)
scrapy/utils/boto.py 46.66% <0%> (-26.67%)
scrapy/utils/gz.py 85.29% <0%> (-14.71%)
scrapy/link.py 86.36% <0%> (-13.64%)
scrapy/core/downloader/tls.py 80% <0%> (-8.58%)
scrapy/_monkeypatches.py 50% <0%> (-7.15%)
scrapy/utils/iterators.py 95.45% <0%> (-4.55%)
... and 17 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7b49b9c...fad499a.

@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware Referrer policies in RefererMiddleware Oct 12, 2016
if stripped is not None:
    return urlunparse(stripped)

def strip_url(self, req_or_resp, origin_only=False):
Member
@kmike kmike Oct 18, 2016

What do you think about making it a utility function? It could be more reusable and easier to test that way (even doctests could be enough). I'd make it accept a URL or ParseResult instead of a Scrapy Request or Response.


Contributor Author

@kmike , are you thinking of an addition to w3lib.url?
Also, the current output is a ParseResult tuple, mainly for comparing origins.
Does it make sense to have the utility function serialize/urlunparse on demand, with some argument?
Or should we just compare origins as strings? (That is probably the cleanest, as I don't really like returning different non-None types.)
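A utility along these lines could look like the sketch below. This is an assumption-laden illustration using only the standard library; its signature differs from the strip_url() helper the PR eventually adds (which takes strip_default_port and origin_only arguments):

```python
from urllib.parse import urlparse, urlunparse

# Illustrative URL-stripping helper for origin comparison: always drops
# credentials and default HTTP/HTTPS ports; origin_only reduces the URL
# to scheme://host[:port]/; fragments are dropped unless requested.
def strip_url(url, origin_only=False, keep_fragments=False):
    parts = urlparse(url)
    netloc = parts.hostname or ''
    # keep only non-default ports (80 for http, 443 for https)
    if parts.port and not ((parts.scheme == 'http' and parts.port == 80) or
                           (parts.scheme == 'https' and parts.port == 443)):
        netloc += ':%d' % parts.port
    return urlunparse((
        parts.scheme,
        netloc,  # credentials are always removed
        '/' if origin_only else parts.path,
        '' if origin_only else parts.params,
        '' if origin_only else parts.query,
        parts.fragment if keep_fragments and not origin_only else '',
    ))
```

Returning a string here sidesteps the ParseResult-vs-string question: origins can be compared directly with `==`.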

Contributor Author

@kmike , I moved some of it to scrapy.utils.url. What do you think?

@@ -103,3 +103,35 @@ def guess_scheme(url):
return any_to_uri(url)
else:
return add_http_if_no_scheme(url)


def strip_url_credentials(url, origin_only=False, keep_fragments=False):
Member

Could you please add a docstring to this function?

The function is named 'strip credentials', but it also strips the port; I think this is not intuitive.

Contributor Author

Right, it strips standard HTTP and HTTPS port numbers. This is mainly for comparing origins.
A docstring is needed indeed.
Does it make sense to make this behaviour switchable through an argument?

Member

Yeah, it should be either an argument or a separate function. If it becomes an argument then I think it is better to rename the function (not sure about the new name).

Contributor Author

@kmike , code updated with new name for helper strip_url()

@redapple redapple changed the title Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Nov 16, 2016
@redapple redapple added this to the v1.3 milestone Nov 16, 2016
@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple
Contributor Author

I'm adding 2 policies that appeared in the working draft:

@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Jan 17, 2017
@redapple
Contributor Author

Alright, @kmike , @eliasdorneles , @lopuhin , @dangra , I think this is ready for review.
Thanks in advance!

@redapple
Contributor Author

(I'm not sure why I used ftps in some tests; it's not supported by Scrapy.)

using ``file://`` or ``s3://`` scheme.

.. warning::
By default, Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
Member

"By default" looks redundant

.. warning::
Scrapy's default referrer policy, just like `"no-referrer-when-downgrade"`_,
will send a non-empty "Referer" header from any ``https://`` to any ``https://`` URL,
even if the domain is different.
Member

If that's what browsers do by default then I think it makes sense to note that, to make a warning less scary.


- a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
either a custom one or one of the built-in ones
(see ``scrapy.spidermiddlewares.referer``),
Member

Is there a page in the docs we can reference to?

Contributor Author

What kind of reference do you have in mind? built-in classes?

We could also define an interface for policies, so people know what to implement.

Contributor Author

@kmike , an alternative is to use a table:

$ git diff
diff --git a/docs/topics/spider-middleware.rst b/docs/topics/spider-middleware.rst
index 9010ae7..46a84d0 100644
--- a/docs/topics/spider-middleware.rst
+++ b/docs/topics/spider-middleware.rst
@@ -346,21 +346,21 @@ This setting accepts:
 - a path to a ``scrapy.spidermiddlewares.referer.ReferrerPolicy`` subclass,
   either a custom one or one of the built-in ones
   (see ``scrapy.spidermiddlewares.referer``),
-- or one of the standard W3C-defined string values:
-
-  - `"no-referrer"`_,
-  - `"no-referrer-when-downgrade"`_
-    (the W3C-recommended default, used by major web browsers),
-  - `"same-origin"`_,
-  - `"origin"`_,
-  - `"strict-origin"`_,
-  - `"origin-when-cross-origin"`_,
-  - `"strict-origin-when-cross-origin"`_,
-  - or `"unsafe-url"`_
-    (not recommended).
-
-It can also be the non-standard value ``"scrapy-default"`` to use
-Scrapy's default referrer policy.
+- or one of the standard W3C-defined string values
+
+=======================================  ========================================================================  =======================================================
+String value                             Class name
+=======================================  ========================================================================  =======================================================
+`"no-referrer"`_                         ``'scrapy.spidermiddlewares.referer.NoReferrerPolicy'``
+`"no-referrer-when-downgrade"`_          ``'scrapy.spidermiddlewares.referer.NoReferrerWhenDowngradePolicy'``      the W3C-recommended default, used by major web browsers
+`"same-origin"`_                         ``'scrapy.spidermiddlewares.referer.SameOriginPolicy'``
+`"origin"`_                              ``'scrapy.spidermiddlewares.referer.OriginPolicy'``
+`"strict-origin"`_                       ``'scrapy.spidermiddlewares.referer.StrictOriginPolicy'``
+`"origin-when-cross-origin"`_            ``'scrapy.spidermiddlewares.referer.OriginWhenCrossOriginPolicy'``
+`"strict-origin-when-cross-origin"`_     ``'scrapy.spidermiddlewares.referer.StrictOriginWhenCrossOriginPolicy'``
+`"unsafe-url"`_                          ``'scrapy.spidermiddlewares.referer.UnsafeUrlPolicy'``                    NOT recommended
+``"scrapy-default"``                     ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``              Scrapy's default policy (see below)
+=======================================  ========================================================================  =======================================================
 
 Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
 with the addition that "Referer" is not sent if the parent request was

Member

I like such table.

Also, adding autodocs for these policy classes looks helpful - they already have docstrings; it is just a matter of adding a few autoclass directives to the Scrapy rst docs.

Member

As for "see scrapy.spidermiddlewares.referer": the question was how a user would find this scrapy.spidermiddlewares.referer - there is no link or anything.

Member

Documenting the policy interface sounds like too much to me; I'm not sure this is something users may want to extend.


Default: ``'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'``

.. reqmeta:: referrer_policy
Member

It could make sense to use the same spelling everywhere in Scrapy. Currently the option name (REFERER_POLICY) has one 'r' while the meta key name has 'rr'.

Member

aha, I see, we already have REFERER_ENABLED

Contributor Author

I'm ok with changing to REFERRER_POLICY. (It's a shame about that spelling mistake in the RFCs.)

Member

We're losing either way - now the option names are inconsistent :)

Member

I guess the way to go is to rename the other option and the middleware itself as well; of course, that can be done later.

Scrapy's default referrer policy.

Scrapy's default referrer policy is a variant of `"no-referrer-when-downgrade"`_,
with the addition that "Referrer" is not sent if the parent request was
Member

header name is Referer

using ``file://`` or ``s3://`` scheme.

.. warning::
Scrapy's default referrer policy—just like `"no-referrer-when-downgrade"`_,
Member

a nitpick - there should be spaces before and after a dash :)

@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Feb 21, 2017
@redapple
Contributor Author

@kmike , I rebased, updated the docs with autoclass directives and renamed the setting to REFERRER_POLICY.

@redapple
Contributor Author

Another option could be to move the policies to scrapy/extensions, similar to the HTTP cache policies (in scrapy/extensions/httpcache.py). What do you think?

try:
    self.default_policy = _policy_classes[policy.lower()]
except KeyError:
    raise NotConfigured("Unknown referrer policy name %r" % policy)
Member

I think it is better to raise an error and stop the crawl instead of silently disabling the middleware if a policy is unknown.

Contributor Author

Good point!
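A fail-fast variant could look like this sketch. The _policy_classes mapping below is a stand-in with made-up values, not the PR's actual dict:

```python
# Sketch of failing loudly on a bad policy setting instead of raising
# NotConfigured, which would silently disable the middleware.
# The mapping is illustrative only.
_policy_classes = {
    'no-referrer': 'NoReferrerPolicy',
    'no-referrer-when-downgrade': 'NoReferrerWhenDowngradePolicy',
    'same-origin': 'SameOriginPolicy',
}

def load_policy(name):
    try:
        return _policy_classes[name.lower()]
    except KeyError:
        # Stop the crawl with a clear error rather than continue without
        # the middleware.
        raise RuntimeError('Unknown referrer policy name %r' % name)
```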

policy_name = to_native_str(
    response.headers.get('Referrer-Policy', '').decode('latin1'))

cls = _policy_classes.get(policy_name.lower(), self.default_policy)
Member

What do you think about warning for unknown policies? They can be typos.

Contributor Author

Makes sense.
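Resolving the Referrer-Policy response header with a warning for unrecognized values might look like this sketch (hypothetical helper; the set of names mirrors the W3C values listed elsewhere in this PR):

```python
import logging

logger = logging.getLogger(__name__)

KNOWN_POLICIES = {
    'no-referrer', 'no-referrer-when-downgrade', 'same-origin', 'origin',
    'strict-origin', 'origin-when-cross-origin',
    'strict-origin-when-cross-origin', 'unsafe-url',
}

# Fall back to the default policy on empty or unknown header values,
# warning on unknown ones since they may be typos.
def policy_from_header(header_value, default='no-referrer-when-downgrade'):
    name = (header_value or '').strip().lower()
    if not name:
        return default
    if name not in KNOWN_POLICIES:
        logger.warning('Unknown Referrer-Policy value %r; using %r',
                       header_value, default)
        return default
    return name
```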


NOREFERRER_SCHEMES = LOCAL_SCHEMES

def referrer(self, response, request):
Member
@kmike kmike Feb 22, 2017

It seems the API requires response and request (not response_url and request_url) only because of urlparse_cached - you wanted to avoid duplicate computations. It leads to some ugly code when a fake Response is created. Is this correct, or are there cases where response and request are indeed needed?

I wonder how large the effect is, given that urlparse has its own internal cache (keyed on URLs, not responses/requests).

Contributor Author

Right @kmike , that was the reason.
I cannot think right now of a real need for handling requests/responses directly. Meta information and HTTP headers ("Referrer-Policy") are to be interpreted at the middleware level, not at the policy level, I believe.

Indeed the fake response is quite ugly.
Let me clean it up with URLs all along.
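The caching trade-off under discussion can be illustrated with a URL-keyed cache built on functools.lru_cache; this is a hypothetical stand-in, whereas Scrapy's urlparse_cached keys on Request/Response objects:

```python
from functools import lru_cache
from urllib.parse import urlparse

# URL-keyed parse cache: repeated calls with the same URL string return
# the same ParseResult object without re-parsing.
@lru_cache(maxsize=1024)
def cached_urlparse(url):
    return urlparse(url)
```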

@redapple
Contributor Author

redapple commented Mar 1, 2017

@kmike , I think I have addressed all your comments. (and rebased)

@kmike
Member

kmike commented Mar 1, 2017

@redapple tests are failing - could you please check it?

@redapple
Contributor Author

redapple commented Mar 2, 2017

@kmike , done. Sorry about that.

@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Mar 2, 2017
strip_default_port=True,
origin_only=origin_only)

def origin(self, r):
Member
@kmike kmike Mar 2, 2017

Is there a reason some of the arguments are named r while others are named url?

Contributor Author

Not really. I kept request/response when both arguments are needed. But I can use "url" in the others.

Member

Hm, what's the reason the arguments are still named request/response, and not something like request_url? Maybe I've become a bit too used to type hints :)

Contributor Author

Most probably laziness and readability. I can change that too if you prefer.

Member

Could you please change it? When looking at the code I expect Scrapy Request and Response objects to be passed to the .referrer method; to figure out that they are URLs you need to find the calling code. I agree that request_url looks worse, but IMHO it can help readability here.

Contributor Author

Done.

@redapple redapple changed the title [MRG] Referrer policies in RefererMiddleware [WIP] Referrer policies in RefererMiddleware Mar 2, 2017
@redapple redapple changed the title [WIP] Referrer policies in RefererMiddleware [MRG] Referrer policies in RefererMiddleware Mar 2, 2017
@kmike kmike merged commit 7e8453c into scrapy:master Mar 2, 2017
@kmike
Member

kmike commented Mar 2, 2017

Looks great, thanks @redapple!
