Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] RFC2616 policy enhancements + tests #1151

Merged
merged 7 commits into from Jul 11, 2015
Merged

Conversation

@marven
Copy link
Contributor

@marven marven commented Apr 11, 2015

Continuation of #994 that includes tests for the enhancements made

@curita
Copy link
Member

@curita curita commented Apr 13, 2015

Thanks for the tests @marven!

As discussed in #994, we should add scrapy/contrib/downloadermiddleware/httpcache.py to tests/py3-ignores.txt (and probably tests/test_downloadermiddleware_httpcache.py too if not already there), that's why the travis build is failing.

A nitpick, can we rebase @jameysharp commits in this PR instead of merging them, so we can avoid the merge commit ce38129? Merge commits in Scrapy right now are only done when merging PRs to scrapy:master.

@marven
Copy link
Contributor Author

@marven marven commented Apr 14, 2015

Rebased commits and added scrapy/contrib/downloadermiddleware/httpcache.py to tests/py3-ignores.txt

@curita
Copy link
Member

@curita curita commented Apr 14, 2015

Last thing we need for these changes is some documentation for the two new settings HTTPCACHE_ALWAYS_STORE and HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS and probably update the rfc2616 policy entry (docs/topics/downloader-middleware.rst#rfc2616-policy). Jamey left pretty good comments in the commit messages introducing those settings (32948cb and 18da090), I think we can use those descriptions and add them to the HTTPCache middleware settings section (docs/topics/downloader-middleware.rst#httpcache-middleware-settings).

Would you mind to add that if you have the time?

@marven
Copy link
Contributor Author

@marven marven commented Apr 17, 2015

Updated the docs

@@ -575,6 +576,26 @@ Default: ``False``
If enabled, will compress all cached data with gzip.
This setting is specific to the Filesystem backend.

HTTPCACHE_ALWAYS_STORE

This comment has been minimized.

@kmike

kmike Apr 17, 2015
Member

.. setting:: HTTPCACHE_ALWAYS_STORE should be added above

This comment has been minimized.

@marven

marven Apr 18, 2015
Author Contributor

Woops, added.

@curita
Copy link
Member

@curita curita commented Apr 20, 2015

+1 to merge

@curita curita changed the title RFC2616 policy enhancements + tests [MRG+1] RFC2616 policy enhancements + tests Apr 20, 2015
Default: ``[]``

List of Cache-Control directives to be ignored.
Cache-Control directives in requests are not filtered.

This comment has been minimized.

@kmike

kmike Apr 21, 2015
Member

I think docs should be expanded using @jameysharp's commit comments. Without reading the following comment the reason HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS exists was far from clear to me:

Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.

Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.

@@ -373,6 +373,7 @@ what is implemented:
* Revalidate stale responses based on `Last-Modified` response header
* Revalidate stale responses based on `ETag` response header
* Set `Date` header for any received response missing it
* Support `max-stale` cache-control directive in requests

This comment has been minimized.

@kmike

kmike Apr 21, 2015
Member

To use this feature developer must know details of RFC2616 - not many people read this 178-page manuscript. @jameysharp's commit comments were great, e.g.

This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
conformant with the HTTP spec.

I think we shouldn't loose these comments. Something like "Add Cache-Control: max-stale=600 to Request headers to ..."

See also: RFC2616, 14.9.3


If enabled, will cache pages unconditionally.
This setting still respects "Cache-Control: no-store" directives in
responses.

This comment has been minimized.

@kmike

kmike Apr 21, 2015
Member

It doesn't describe why one want to use it instead of DummyCachePolicy.

See the commit comment:

A spider may wish to have all responses available in the cache, for
future use with "Cache-Control: max-stale", for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

This setting still respects "Cache-Control: no-store" directives in
responses. If you don't want that, filter "no-store" out of the
Cache-Control headers in responses you feed to the cache middleware.

@marven
Copy link
Contributor Author

@marven marven commented Apr 22, 2015

@kmike, updated the docs based on your comments

@dangra
Copy link
Member

@dangra dangra commented May 15, 2015

LGTM - needs rebasing.

jameysharp and others added 7 commits Dec 29, 2014
This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
conformant with the HTTP spec.
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.

Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.
A spider may wish to have all responses available in the cache, for
future use with "Cache-Control: max-stale", for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

This setting still respects "Cache-Control: no-store" directives in
responses. If you don't want that, filter "no-store" out of the
Cache-Control headers in responses you feed to the cache middleware.
Unlike specifying "Cache-Control: no-cache", if the request specifies
"max-age=0", then the cached validators will be used if possible to
avoid re-fetching unchanged pages.

That said, it's still useful to be able to specify "no-cache" on the
request, in cases where the origin server may have changed page contents
without changing validators.
Add `scrapy/downloadermiddlewares/httpcache.py` to `tests/py3-ignores.txt
@marven
Copy link
Contributor Author

@marven marven commented Jun 1, 2015

@dangra, I've rebased the commits

@dangra
Copy link
Member

@dangra dangra commented Jun 1, 2015

well, I guess we have to wait for v1.1 now that v1.0.0rc1 was tagged.

@nyov
Copy link
Contributor

@nyov nyov commented Jul 11, 2015

y u no MRG?
:D

@curita
Copy link
Member

@curita curita commented Jul 11, 2015

Because we totally forgot 😅

curita added a commit that referenced this pull request Jul 11, 2015
[MRG+1] RFC2616 policy enhancements + tests
@curita curita merged commit d706310 into scrapy:master Jul 11, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@nyov
Copy link
Contributor

@nyov nyov commented Jul 11, 2015

awesome :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.