[MRG+1] RFC2616 policy enhancements + tests #1151
Thanks for the tests @marven! As discussed in #994, we should add A nitpick, can we rebase @jameysharp's commits in this PR instead of merging them, so we can avoid the merge commit ce38129? Merge commits in Scrapy right now are only done when merging PRs to
Rebased commits and added
Last thing we need for these changes is some documentation for the two new settings. Would you mind adding that if you have the time?
Updated the docs
@@ -575,6 +576,26 @@ Default: ``False``
If enabled, will compress all cached data with gzip.
This setting is specific to the Filesystem backend.

HTTPCACHE_ALWAYS_STORE
kmike
Apr 17, 2015
Member
.. setting:: HTTPCACHE_ALWAYS_STORE
should be added above
marven
Apr 18, 2015
Author
Contributor
Woops, added.
+1 to merge
Default: ``[]``

List of Cache-Control directives to be ignored.
Cache-Control directives in requests are not filtered.
kmike
Apr 21, 2015
Member
I think docs should be expanded using @jameysharp's commit comments. Without reading the following comment the reason HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS exists was far from clear to me:
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.
Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.
We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.
@@ -373,6 +373,7 @@ what is implemented:
* Revalidate stale responses based on `Last-Modified` response header
* Revalidate stale responses based on `ETag` response header
* Set `Date` header for any received response missing it
* Support `max-stale` cache-control directive in requests
kmike
Apr 21, 2015
Member
To use this feature, a developer must know the details of RFC2616, and not many people read this 178-page manuscript. @jameysharp's commit comments were great, e.g.
This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
conformant with the HTTP spec.
I think we shouldn't lose these comments. Something like "Add Cache-Control: max-stale=600
to Request headers to ..."
See also: RFC2616, 14.9.3
If enabled, will cache pages unconditionally.
This setting still respects "Cache-Control: no-store" directives in
responses.
kmike
Apr 21, 2015
Member
It doesn't describe why one would want to use it instead of DummyCachePolicy.
See the commit comment:
A spider may wish to have all responses available in the cache, for
future use with "Cache-Control: max-stale", for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.
This setting still respects "Cache-Control: no-store" directives in
responses. If you don't want that, filter "no-store" out of the
Cache-Control headers in responses you feed to the cache middleware.
@kmike, updated the docs based on your comments
LGTM - needs rebasing.
This allows spiders to be configured with the full RFC2616 cache policy, but avoid revalidation on a request-by-request basis, while remaining conformant with the HTTP spec.
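As a rough illustration of the `max-stale` behaviour described in the commit message above, here is a minimal stdlib-only sketch of how a cache might decide whether a stale entry is acceptable for a given request. It follows the wording of RFC2616, 14.9.3, and is not Scrapy's actual implementation; the helper names `parse_cache_control` and `accepts_stale` are hypothetical.

```python
def parse_cache_control(header):
    """Split a Cache-Control value into {directive: value or None}."""
    directives = {}
    for part in header.split(","):
        name, sep, value = part.strip().partition("=")
        name = name.strip().lower()
        if name:
            directives[name] = value.strip() if sep else None
    return directives


def accepts_stale(request_cache_control, seconds_past_expiry):
    """Return True if the request tolerates a response this stale.

    Hypothetical helper: `seconds_past_expiry` is how long ago the cached
    response stopped being fresh.
    """
    directives = parse_cache_control(request_cache_control)
    if "max-stale" not in directives:
        return False
    limit = directives["max-stale"]
    if limit is None:
        # Bare "max-stale": a stale response of any age is acceptable.
        return True
    try:
        return seconds_past_expiry <= int(limit)
    except ValueError:
        # Malformed value: this sketch treats it as "do not accept stale".
        return False
```

In a spider, the request side of this would presumably look like `scrapy.Request(url, headers={'Cache-Control': 'max-stale=600'})`, which asks the cache to serve an entry that expired up to 600 seconds ago instead of revalidating it.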
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get upset at the traffic a spider can generate if it respects those directives. Allow the spider's author to selectively ignore Cache-Control directives that are known to be unimportant for the sites being crawled. We assume that the spider will not issue Cache-Control directives in requests unless it actually needs them, so directives in requests are not filtered.
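The filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the list of ignored directives is an arbitrary example value for `HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS`, and `effective_directives` is a hypothetical helper, not the function the middleware uses.

```python
# Example value only: which directives to ignore depends on the sites crawled.
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ["no-cache", "no-store"]


def effective_directives(response_cache_control, ignored):
    """Parse a response Cache-Control value, dropping ignored directives.

    Hypothetical helper. Request headers are never passed through this
    filter, matching the "directives in requests are not filtered" rule.
    """
    ignored = {name.lower() for name in ignored}
    kept = {}
    for part in response_cache_control.split(","):
        name, sep, value = part.strip().partition("=")
        name = name.strip().lower()
        if name and name not in ignored:
            kept[name] = value.strip() if sep else None
    return kept
```

With the example setting, a response carrying `Cache-Control: no-store, max-age=3600` would be treated as if it only said `max-age=3600`.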
A spider may wish to have all responses available in the cache, for future use with "Cache-Control: max-stale", for instance. The DummyPolicy caches all responses but never revalidates them, and sometimes a more nuanced policy is desirable. This setting still respects "Cache-Control: no-store" directives in responses. If you don't want that, filter "no-store" out of the Cache-Control headers in responses you feed to the cache middleware.
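A possible configuration for the always-store case, followed by a sketch of the storage decision as the commit message describes it. The policy path assumes a Scrapy version where the cache policies live under `scrapy.extensions.httpcache`; `should_store` and its `cacheable_otherwise` parameter are hypothetical stand-ins for the policy's remaining checks.

```python
# settings.py fragment (illustrative values)
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"  # path is an assumption
HTTPCACHE_ALWAYS_STORE = True


def should_store(response_cache_control, always_store, cacheable_otherwise):
    """Decide whether a response goes into the cache (hypothetical sketch).

    `cacheable_otherwise` stands in for whatever the policy would decide
    without the always-store setting.
    """
    directives = {
        part.strip().partition("=")[0].strip().lower()
        for part in response_cache_control.split(",")
    }
    if "no-store" in directives:
        # "no-store" in the response still wins over always-store.
        return False
    if always_store:
        return True
    return cacheable_otherwise
```

Unlike the DummyPolicy, responses stored this way remain subject to freshness checks and revalidation on later requests.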
Unlike specifying "Cache-Control: no-cache", if the request specifies "max-age=0", then the cached validators will be used if possible to avoid re-fetching unchanged pages. That said, it's still useful to be able to specify "no-cache" on the request, in cases where the origin server may have changed page contents without changing validators.
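The difference between the two request directives can be sketched like this: with "max-age=0" the cached validators are turned into conditional request headers, while "no-cache" forces an unconditional fetch. The helper name `revalidation_headers` is hypothetical, and the sketch assumes cached response headers are available as a plain dict.

```python
def revalidation_headers(request_cache_control, cached_headers):
    """Return conditional headers to send when re-fetching a cached page.

    Hypothetical helper: an empty dict means "fetch unconditionally".
    """
    names = {
        part.strip().partition("=")[0].strip().lower()
        for part in request_cache_control.split(",")
    }
    if "no-cache" in names:
        # Validators may be unreliable on this server, so ignore them.
        return {}
    conditional = {}
    if "ETag" in cached_headers:
        conditional["If-None-Match"] = cached_headers["ETag"]
    if "Last-Modified" in cached_headers:
        conditional["If-Modified-Since"] = cached_headers["Last-Modified"]
    return conditional
```

If the server answers such a conditional request with 304 Not Modified, the cached body can be reused and the page is not downloaded again.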
Add `scrapy/downloadermiddlewares/httpcache.py` to `tests/py3-ignores.txt`
@dangra, I've rebased the commits
well, I guess we have to wait for v1.1 now that v1.0.0rc1 was tagged.
y u no MRG?
Because we totally forgot
awesome :)
Continuation of #994 that includes tests for the enhancements made