[MRG+1] RFC2616 policy enhancements + tests #1151
Conversation
Thanks for the tests @marven! A nitpick: as discussed in #994, can we rebase @jameysharp's commits in this PR instead of merging them, so we can avoid the merge commit ce38129? Merge commits in Scrapy are currently only done when merging PRs.
Rebased the commits.
The last thing we need for these changes is some documentation for the two new settings. Would you mind adding that if you have the time?
Updated the docs.
@@ -575,6 +576,26 @@ Default: ``False``
If enabled, will compress all cached data with gzip.
This setting is specific to the Filesystem backend.

HTTPCACHE_ALWAYS_STORE
.. setting:: HTTPCACHE_ALWAYS_STORE
should be added above
Woops, added.
+1 to merge
Default: ``[]``

List of Cache-Control directives to be ignored.
Cache-Control directives in requests are not filtered.
I think the docs should be expanded using @jameysharp's commit comments. Without reading the following, the reason HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS exists was far from clear to me:
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives. Allow the spider's author to selectively ignore Cache-Control
directives that are known to be unimportant for the sites being crawled.
We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.
@kmike, updated the docs based on your comments.
LGTM, needs rebasing.
This allows spiders to be configured with the full RFC2616 cache policy, but avoid revalidation on a request-by-request basis, while remaining conformant with the HTTP spec.
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get upset at the traffic a spider can generate if it respects those directives. Allow the spider's author to selectively ignore Cache-Control directives that are known to be unimportant for the sites being crawled. We assume that the spider will not issue Cache-Control directives in requests unless it actually needs them, so directives in requests are not filtered.
A spider may wish to have all responses available in the cache, for future use with "Cache-Control: max-stale", for instance. The DummyPolicy caches all responses but never revalidates them, and sometimes a more nuanced policy is desirable. This setting still respects "Cache-Control: no-store" directives in responses. If you don't want that, filter "no-store" out of the Cache-Control headers in responses you feed to the cache middleware.
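Taken together, the two settings described above might appear in a project's ``settings.py`` like this. This is a minimal sketch, not part of the PR itself: the ``scrapy.extensions.httpcache`` module path matches current Scrapy releases (older versions used a different path), and the directive list is only an illustrative choice.

```python
# Sketch: enable HTTP caching with the full RFC2616 policy plus the two
# settings added in this PR. Module path and directive list are assumptions.
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Store every response (except "no-store") so a later crawl can reuse
# them, e.g. with "Cache-Control: max-stale" requests.
HTTPCACHE_ALWAYS_STORE = True

# Response directives known to be unimportant for the crawled sites;
# listing "no-store" here also lifts the exception noted above.
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ["no-cache", "no-store"]
```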
Unlike specifying "Cache-Control: no-cache", if the request specifies "max-age=0", then the cached validators will be used if possible to avoid re-fetching unchanged pages. That said, it's still useful to be able to specify "no-cache" on the request, in cases where the origin server may have changed page contents without changing validators.
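The request-side distinction above can be sketched with plain header dicts (hypothetical helper names, suitable as the ``headers`` argument of a Scrapy ``Request``):

```python
def revalidate_headers():
    # "max-age=0": cached validators (ETag / Last-Modified) may still be
    # used, so unchanged pages are not re-downloaded in full.
    return {"Cache-Control": "max-age=0"}

def bypass_cache_headers():
    # "no-cache": always re-fetch, for origin servers that may change
    # page contents without changing validators.
    return {"Cache-Control": "no-cache"}

def allow_stale_headers():
    # "max-stale": accept a cached copy of any age, relying on
    # HTTPCACHE_ALWAYS_STORE having kept it.
    return {"Cache-Control": "max-stale"}
```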
Add `scrapy/downloadermiddlewares/httpcache.py` to `tests/py3-ignores.txt`.
@dangra, I've rebased the commits.
Well, I guess we have to wait for v1.1 now that v1.0.0rc1 was tagged.
y u no MRG?
Because we totally forgot 😅
awesome :)
Continuation of #994 that includes tests for the enhancements made.