-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not filter duplicate requests within a redirect chain #4314
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #4314 +/- ##
=======================================
Coverage 88.63% 88.64%
=======================================
Files 160 160
Lines 11614 11624 +10
Branches 1890 1891 +1
=======================================
+ Hits 10294 10304 +10
Misses 994 994
Partials 326 326
|
Should this be removed from 2.7? |
@wRAR yes; removed. I acknowledge it's looking close, but not having this feature doesn't look like a blocker for 2.7 release. |
fingerprints = request.meta.get("redirect_fingerprints", set()) | ||
assert self.crawler.request_fingerprinter is not None | ||
fingerprint = self.crawler.request_fingerprinter.fingerprint(request) | ||
redirected.meta["redirect_fingerprints"] = fingerprints | {fingerprint} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me.
However, I wonder if it'd cause issues in practice, in case someone uses response.meta.copy()
. Unlike most other meta keys, it seems this one may affect the behavior significantly.
Do you think there is a room for making it less likely to happen? E.g. try to clear redirect_fingerprints once the response without a redirect is received? It may be a bit tricky, because more than 1 middleware handle redirects. Or would it cause more edge cases than it'd solve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we clean it in scenarios where the redirect middlewares return response
, i.e. do not handle a redirect, there is the problem that the fact that one middleware does not handle a redirect does not mean that the other does not. So I discarded that approach.
I thought about removing them from the first of the middlewares (last on response processing), but then if that middleware is disabled, the cleanup would not happen.
We could also create a middleware specifically for the purpose of this cleanup. It feels weird, but it also feels like the cleanest approach to deal with this middleware duality and optionality.
Any thoughts or additional ideas?
Fixes #1225, closes #5793.
The approach is based on @immerrr’s #1225 (comment).
However, because request fingerprinting takes into account information beyond the URL, I created a new meta key just for this purpose, where redirect chain fingerprints are stored as a set.
These changes seem to fix @immerrr’s code example to reproduce the issue (#1225 (comment)).
Pending:
Post-merge work:
dont_filter=True
for requests made fromstart_urls
#3276