
Crawl spider crawls previously visited URL on redirect #34

Closed
shaneaevans opened this issue Sep 9, 2011 · 1 comment

Comments

@shaneaevans
Member

Previously reported by michaelvmata on Trac http://dev.scrapy.org/ticket/299

If a crawl spider is redirected to an already visited page, it will still crawl it.

From the mailing list http://groups.google.com/group/scrapy-users/browse_thread/thread/ee9ad68f5dbacc6d:

"...the dupe filter only catches requests after they leave the spider, so redirected pages are ignored by the dupe filter.

Since the dupefilter and the redirect middleware components are decoupled now, it would be awkward to implement what you suggest, but nevertheless I think it would be useful"

@dangra
Member

dangra commented Jan 29, 2013

The dupe filter is part of the scheduler now, so any redirected request goes through it too and is discarded if already seen.
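The scheduler-level behavior dangra describes can be sketched with a minimal fingerprint-based dupe filter, in the spirit of Scrapy's `RFPDupeFilter`. This is a simplified illustration, not Scrapy's actual implementation: the fingerprinting here just hashes method and URL, whereas Scrapy also canonicalizes the URL and can include headers and body.

```python
import hashlib


def request_fingerprint(method: str, url: str) -> str:
    # Simplified stand-in for Scrapy's request fingerprinting:
    # hash the HTTP method and URL together.
    return hashlib.sha1(f"{method}:{url}".encode()).hexdigest()


class DupeFilter:
    """Scheduler-level dupe filter: every request that enters the
    scheduler -- including requests re-enqueued by the redirect
    middleware -- is checked against the set of seen fingerprints,
    so a redirect to an already-visited page is discarded."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method: str, url: str) -> bool:
        fp = request_fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


df = DupeFilter()
# First crawl of the page: not seen, gets scheduled.
assert not df.request_seen("GET", "http://example.com/page")
# Later, another URL redirects to the same page; because the
# redirected request re-enters the scheduler, it is caught.
assert df.request_seen("GET", "http://example.com/page")
```

Because the check happens in the scheduler rather than between the spider and the downloader, it no longer matters which component (spider callback or redirect middleware) produced the request.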

dangra closed this as completed Jan 29, 2013