Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not filter duplicate requests within a redirect chain #4314

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Feb 6, 2020

Fixes #1225, closes #5793.

The approach is based on @immerrr’s #1225 (comment).

However, because request fingerprinting takes into account information beyond the URL, I created a new meta key just for this purpose, where redirect chain fingerprints are stored as a set.

These changes seem to fix @immerrr’s code example to reproduce the issue (#1225 (comment)).

Pending:

Post-merge work:

@codecov
Copy link

codecov bot commented Feb 6, 2020

Codecov Report

Merging #4314 (f6807f7) into master (34e01a8) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4314   +/-   ##
=======================================
  Coverage   88.63%   88.64%           
=======================================
  Files         160      160           
  Lines       11614    11624   +10     
  Branches     1890     1891    +1     
=======================================
+ Hits        10294    10304   +10     
  Misses        994      994           
  Partials      326      326           
Files Coverage Δ
scrapy/downloadermiddlewares/redirect.py 95.69% <100.00%> (+0.46%) ⬆️
scrapy/dupefilters.py 94.28% <100.00%> (+0.08%) ⬆️

@Gallaecio Gallaecio changed the title Do not filter duplicate requests within a redirect chain [WIP] Do not filter duplicate requests within a redirect chain Jul 17, 2020
@Gallaecio Gallaecio added this to the 2.7 milestone Jun 29, 2022
@wRAR
Copy link
Member

wRAR commented Oct 13, 2022

Should this be removed from 2.7?

@kmike
Copy link
Member

kmike commented Oct 13, 2022

@wRAR yes; removed. I acknowledge it's looking close, but not having this feature doesn't look like a blocker for 2.7 release.

@Gallaecio Gallaecio changed the title [WIP] Do not filter duplicate requests within a redirect chain Do not filter duplicate requests within a redirect chain Dec 28, 2023
fingerprints = request.meta.get("redirect_fingerprints", set())
assert self.crawler.request_fingerprinter is not None
fingerprint = self.crawler.request_fingerprinter.fingerprint(request)
redirected.meta["redirect_fingerprints"] = fingerprints | {fingerprint}
Copy link
Member

@kmike kmike Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me.
However, I wonder if it'd cause issues in practice, in case someone uses response.meta.copy(). Unlike most other meta keys, it seems this one may affect the behavior significantly.

Do you think there is a room for making it less likely to happen? E.g. try to clear redirect_fingerprints once the response without a redirect is received? It may be a bit tricky, because more than 1 middleware handle redirects. Or would it cause more edge cases than it'd solve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we clean it in scenarios where the redirect middlewares return response, i.e. do not handle a redirect, there is the problem that the fact that one middleware does not handle a redirect does not mean that the other does not. So I discarded that approach.

I thought about removing them from the first of the middlewares (last on response processing), but then if that middleware is disabled, the cleanup would not happen.

We could also create a middleware specifically for the purpose of this cleanup. It feels weird, but it also feels like the cleanest approach to deal with this middleware duality and optionality.

Any thoughts or additional ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dupefilter skips a request when a page is redirected to itself
3 participants