new allow_offsite parameter in OffsiteMiddleware #6151
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6151      +/-   ##
==========================================
+ Coverage   85.00%   88.52%   +3.52%
==========================================
  Files         161      159       -2
  Lines       11962    11582     -380
  Branches     1872     1885      +13
==========================================
+ Hits        10168    10253      +85
+ Misses       1512     1000     -512
- Partials      282      329      +47
I think this will show deprecation warnings for all requests with dont_filter=True, even ones that use it to skip the dupefilter?
Yes, but only once, as https://docs.python.org/3/library/warnings.html states.
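For reference, a minimal sketch of the "only once" behavior using only the standard library. `DemoDeprecationWarning`, `check`, and the two `run_*` helpers are hypothetical names made up for this illustration, not Scrapy code. It assumes the "default" filter action, under which Python reports the first occurrence of a matching warning per location (module plus line number); with `stacklevel=2` that location is the caller's line.

```python
import warnings


class DemoDeprecationWarning(Warning):
    """Hypothetical category standing in for a library's deprecation warning."""


def check(flag):
    # stacklevel=2 attributes the warning to the caller's line, not this one.
    if flag:
        warnings.warn("flag is deprecated", DemoDeprecationWarning, stacklevel=2)
    return True


def run_same_site(n):
    """Trigger the warning n times from one call site; return how many were reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("default")
        for _ in range(n):
            check(True)  # same module and line number on every iteration
    return len(caught)


def run_two_sites():
    """Trigger the warning from two different lines; return how many were reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("default")
        check(True)
        check(True)  # a different line, so a different location
    return len(caught)
```

So "once" here means once per call site: a loop that issues many requests from one line would report a single warning, while distinct call sites would each report their own.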
scrapy/spidermiddlewares/offsite.py (outdated)
            ScrapyDeprecationWarning,
            stacklevel=2,
        )
        return True
Hm, I wonder if it'd be better to just keep supporting dont_filter, without a deprecation warning.
First, the flag says "dont_filter", and we respect it, and don't filter out the request. An argument can be made that "don't filter" only applies to the deduplication filter, not to other types of filtering. I think that's valid, but that's not the current flag behavior, and it's also not in the flag name (it's not "dont_deduplicate").
Second, the user doesn't control all the dont_filter flags; Scrapy and other components can set this flag too. For example, the default start_requests implementation uses dont_filter=True.
It seems it's not possible to "deprecate the dont_filter flag in OffsiteMiddleware", because the user might be setting this flag not for OffsiteMiddleware but for other Scrapy components, and the request may still end up in OffsiteMiddleware.
Makes sense. In that case, we need to clarify that the dont_filter value also affects some middlewares, not just the Scheduler.
1d557a5 to acba118
We have tests for the offsite downloader mw, but they aren't a copy of the spider mw ones, so I'll add a test for this flag later.
I'm proposing a new allow_offsite parameter in OffsiteMiddleware. Currently, the middleware relies on the Request.dont_filter attribute being set to True to allow offsite requests. However, it seems that we cannot rely on this flag directly, since setting it to True could result in multiple duplicated requests, as per its definition. There are cases where we may want to visit an offsite page but still want duplicate requests to be filtered afterwards.
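A rough sketch of the intended check, written without importing Scrapy so it can run on its own. `FakeRequest` and `is_allowed` are hypothetical stand-ins, and carrying the flag as an `allow_offsite` key in `meta` is an assumption made for this illustration rather than a statement of the final API. The point is that the offsite opt-in is separate from `dont_filter`, so an offsite request can be allowed while `dont_filter` stays False and deduplication still applies.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request; only the fields used below."""
    url: str
    dont_filter: bool = False
    meta: dict = field(default_factory=dict)


def is_allowed(request, allowed_domains):
    """Return True if the request should pass the offsite check."""
    # Assumed explicit opt-in: does not touch dont_filter, so the
    # duplicate filter keeps working for this request.
    if request.meta.get("allow_offsite"):
        return True
    # Existing behavior discussed in this PR: dont_filter also bypasses the check.
    if request.dont_filter:
        return True
    # Otherwise the host must be an allowed domain or one of its subdomains.
    host = urlparse(request.url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Offsite page allowed explicitly, with deduplication left enabled.
offsite = FakeRequest("https://other.org/page", meta={"allow_offsite": True})
print(is_allowed(offsite, ["example.com"]), offsite.dont_filter)
```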
Fixes #6366, fixes #3690, closes #3691.