Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] RFC2616 policy enhancements + tests #1151

Merged
merged 7 commits into from Jul 11, 2015
Merged

[MRG+1] RFC2616 policy enhancements + tests #1151

merged 7 commits into from Jul 11, 2015

Conversation

marven
Copy link
Contributor

@marven marven commented Apr 11, 2015

Continuation of #994 that includes tests for the enhancements made

@curita
Copy link
Member

curita commented Apr 13, 2015

Thanks for the tests @marven!

As discussed in #994, we should add scrapy/contrib/downloadermiddleware/httpcache.py to tests/py3-ignores.txt (and probably tests/test_downloadermiddleware_httpcache.py too if not already there), that's why the travis build is failing.

A nitpick, can we rebase @jameysharp commits in this PR instead of merging them, so we can avoid the merge commit ce38129? Merge commits in Scrapy right now are only done when merging PRs to scrapy:master.

@marven
Copy link
Contributor Author

marven commented Apr 14, 2015

Rebased commits and added scrapy/contrib/downloadermiddleware/httpcache.py to tests/py3-ignores.txt

@curita
Copy link
Member

curita commented Apr 14, 2015

Last thing we need for these changes is some documentation for the two new settings HTTPCACHE_ALWAYS_STORE and HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS and probably update the rfc2616 policy entry (docs/topics/downloader-middleware.rst#rfc2616-policy). Jamey left pretty good comments in the commit messages introducing those settings (32948cb and 18da090), I think we can use those descriptions and add them to the HTTPCache middleware settings section (docs/topics/downloader-middleware.rst#httpcache-middleware-settings).

Would you mind to add that if you have the time?

@marven
Copy link
Contributor Author

marven commented Apr 17, 2015

Updated the docs

@@ -575,6 +576,26 @@ Default: ``False``
If enabled, will compress all cached data with gzip.
This setting is specific to the Filesystem backend.

HTTPCACHE_ALWAYS_STORE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. setting:: HTTPCACHE_ALWAYS_STORE should be added above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, added.

@curita
Copy link
Member

curita commented Apr 20, 2015

+1 to merge

@curita curita changed the title RFC2616 policy enhancements + tests [MRG+1] RFC2616 policy enhancements + tests Apr 20, 2015
Default: ``[]``

List of Cache-Control directives to be ignored.
Cache-Control directives in requests are not filtered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think docs should be expanded using @jameysharp's commit comments. Without reading the following comment the reason HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS exists was far from clear to me:

Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.

Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.

@marven
Copy link
Contributor Author

marven commented Apr 22, 2015

@kmike, updated the docs based on your comments

@dangra
Copy link
Member

dangra commented May 15, 2015

LGTM - needs rebasing.

jameysharp and others added 7 commits June 1, 2015 18:06
This allows spiders to be configured with the full RFC2616 cache policy,
but avoid revalidation on a request-by-request basis, while remaining
conformant with the HTTP spec.
Sites often set "no-store", "no-cache", "must-revalidate", etc., but get
upset at the traffic a spider can generate if it respects those
directives.

Allow the spider's author to selectively ignore Cache-Control directives
that are known to be unimportant for the sites being crawled.

We assume that the spider will not issue Cache-Control directives in
requests unless it actually needs them, so directives in requests are
not filtered.
A spider may wish to have all responses available in the cache, for
future use with "Cache-Control: max-stale", for instance. The
DummyPolicy caches all responses but never revalidates them, and
sometimes a more nuanced policy is desirable.

This setting still respects "Cache-Control: no-store" directives in
responses. If you don't want that, filter "no-store" out of the
Cache-Control headers in responses you feed to the cache middleware.
Unlike specifying "Cache-Control: no-cache", if the request specifies
"max-age=0", then the cached validators will be used if possible to
avoid re-fetching unchanged pages.

That said, it's still useful to be able to specify "no-cache" on the
request, in cases where the origin server may have changed page contents
without changing validators.
Add `scrapy/downloadermiddlewares/httpcache.py` to `tests/py3-ignores.txt
@marven
Copy link
Contributor Author

marven commented Jun 1, 2015

@dangra, I've rebased the commits

@dangra
Copy link
Member

dangra commented Jun 1, 2015

well, I guess we have to wait for v1.1 now that v1.0.0rc1 was tagged.

@nyov
Copy link
Contributor

nyov commented Jul 11, 2015

y u no MRG?
:D

@curita
Copy link
Member

curita commented Jul 11, 2015

Because we totally forgot 😅

curita added a commit that referenced this pull request Jul 11, 2015
[MRG+1] RFC2616 policy enhancements + tests
@curita curita merged commit d706310 into scrapy:master Jul 11, 2015
@nyov
Copy link
Contributor

nyov commented Jul 11, 2015

awesome :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants