
Attribute to control offsite filtering #3691

Open

kasun wants to merge 4 commits into master
Conversation

@kasun (Contributor) commented Mar 15, 2019

This is a fix for #3690.

@codecov (bot) commented Mar 15, 2019

Codecov Report

Merging #3691 into master will increase coverage by 0.17%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3691      +/-   ##
==========================================
+ Coverage   84.54%   84.71%   +0.17%     
==========================================
  Files         167      168       +1     
  Lines        9420     9460      +40     
  Branches     1402     1407       +5     
==========================================
+ Hits         7964     8014      +50     
+ Misses       1199     1188      -11     
- Partials      257      258       +1
Impacted Files                          Coverage Δ
scrapy/spidermiddlewares/offsite.py     100% <100%> (ø) ⬆️
scrapy/utils/trackref.py                83.78% <0%> (-2.71%) ⬇️
scrapy/http/__init__.py                 100% <0%> (ø) ⬆️
scrapy/http/request/json_request.py     93.75% <0%> (ø)
scrapy/settings/default_settings.py     98.64% <0%> (ø) ⬆️
scrapy/item.py                          98.48% <0%> (+0.07%) ⬆️
scrapy/extensions/feedexport.py         84.9% <0%> (+6.43%) ⬆️

@raphapassini (Contributor) commented:

LGTM

docs/topics/request-response.rst (review comments resolved)
docs/topics/spider-middleware.rst (review comments resolved)
@@ -28,7 +28,7 @@ def from_crawler(cls, crawler):
     def process_spider_output(self, response, result, spider):
         for x in result:
             if isinstance(x, Request):
-                if x.dont_filter or self.should_follow(x, spider):
+                if x.meta.get('allow_offsite_requests', False) or self.should_follow(x, spider):
A Member commented:

I am thinking this would keep this change backward-compatible:

Suggested change:

-                if x.meta.get('allow_offsite_requests', False) or self.should_follow(x, spider):
+                if x.meta.get('allow_offsite_requests', x.dont_filter) or self.should_follow(x, spider):

If you agree, then I suggest you also revise your test changes so that existing tests stay unchanged and you simply add additional test scenarios.
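The effect of this suggestion can be sketched in isolation (a minimal sketch; the helper function and its name are illustrative, only the meta key and `dont_filter` come from the diff above):

```python
def skips_offsite_filter(meta, dont_filter):
    # Reviewer's variant: dont_filter remains the default value, so
    # requests that bypassed the filter before this change still do,
    # unless the meta key explicitly overrides it.
    return meta.get('allow_offsite_requests', dont_filter)

print(skips_offsite_filter({}, True))    # True: old dont_filter behavior preserved
print(skips_offsite_filter({'allow_offsite_requests': True}, False))  # True: new opt-in
print(skips_offsite_filter({}, False))   # False: normal filtering applies
```

Because `dont_filter` is the fallback, existing spiders see no behavior change, which is the point of the suggestion.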

@kasun (author) replied:

I'm not too sure about this, even with the advantage of backward compatibility, because when we want to duplicate a request but not skip the offsite filter (which I assume is the case for the majority of instances where we set dont_filter=True), we would have to add meta={'allow_offsite': False}, which is extra work. Plus there would be two ways to skip offsite filtering.

Hence I'm proposing to do this in a backward-incompatible way if possible.

@Gallaecio (Member) left a comment:

I’m not opposed to a backward-incompatible change here, but let’s get more feedback.

@Gallaecio Gallaecio dismissed their stale review March 23, 2019 19:00

Author might have a point about making this backward-incompatible on purpose being the best choice, but I am not sure.
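To make the trade-off under discussion concrete, here is a minimal, self-contained sketch of the check as proposed in this PR's diff (the `Request` stub and `should_follow` helper are simplified stand-ins for illustration, not Scrapy's actual implementations):

```python
class Request:
    # Simplified stand-in for scrapy.Request: just the attributes
    # the offsite check looks at.
    def __init__(self, url, dont_filter=False, meta=None):
        self.url = url
        self.dont_filter = dont_filter
        self.meta = meta or {}

def should_follow(request, allowed_domains):
    # Stand-in for OffsiteMiddleware.should_follow: keep requests
    # whose host matches or is a subdomain of an allowed domain.
    host = request.url.split('//', 1)[-1].split('/', 1)[0]
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)

def passes_offsite_filter(request, allowed_domains):
    # The check as proposed in this PR: a per-request meta flag,
    # rather than dont_filter, decides whether the offsite filter
    # is skipped.
    if request.meta.get('allow_offsite_requests', False):
        return True
    return should_follow(request, allowed_domains)

# An offsite request is dropped unless the meta flag is set:
r1 = Request('http://other.example/')
r2 = Request('http://other.example/', meta={'allow_offsite_requests': True})
print(passes_offsite_filter(r1, ['scrapy.org']))  # False
print(passes_offsite_filter(r2, ['scrapy.org']))  # True
```

Decoupling the flag from `dont_filter` means duplicate-filtering and offsite-filtering can be controlled independently, at the cost of the backward incompatibility debated above.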
