
Crawl spider crawls previously visited URL on redirect #34

Closed
shaneaevans opened this issue Sep 9, 2011 · 1 comment

Comments

@shaneaevans
Member

Previously reported by michaelvmata on Trac http://dev.scrapy.org/ticket/299

If a crawl spider is redirected to an already visited page, it will still crawl it.

From the mailing list http://groups.google.com/group/scrapy-users/browse_thread/thread/ee9ad68f5dbacc6d:

"...the dupe filter only catches requests after they leave the spider, so redirected pages are ignored by the dupe filter.

Since the dupefilter and the redirect middleware components are decoupled now, it would be awkward to implement what you suggest, but nevertheless I think it would be useful"

@dangra
Member

dangra commented Jan 29, 2013

The dupe filter is part of the scheduler now, so any redirected request goes through it too and is discarded if already seen.
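The scheduler-level behavior dangra describes can be sketched with a minimal fingerprint-based dupe filter, in the spirit of Scrapy's `RFPDupeFilter`. This is a simplified illustration, not Scrapy's actual implementation: the fingerprinting here just hashes method and URL, whereas Scrapy also canonicalizes the URL and can include headers and body.

```python
import hashlib


def request_fingerprint(method: str, url: str) -> str:
    # Simplified stand-in for Scrapy's request fingerprinting:
    # hash the HTTP method and URL together.
    return hashlib.sha1(f"{method}:{url}".encode()).hexdigest()


class DupeFilter:
    """Scheduler-level dupe filter: every request that enters the
    scheduler -- including requests re-enqueued by the redirect
    middleware -- is checked against the set of seen fingerprints,
    so a redirect to an already-visited page is discarded."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method: str, url: str) -> bool:
        fp = request_fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


df = DupeFilter()
# First crawl of the page: not seen, gets scheduled.
assert not df.request_seen("GET", "http://example.com/page")
# Later, another URL redirects to the same page; because the
# redirected request re-enters the scheduler, it is caught.
assert df.request_seen("GET", "http://example.com/page")
```

Because the check happens in the scheduler rather than between the spider and the downloader, it no longer matters which component (spider callback or redirect middleware) produced the request.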

dangra closed this as completed Jan 29, 2013