Scheduler's duplicate filtering is too silent #105

Closed
amckinlay opened this Issue · 6 comments

4 participants

@amckinlay

In version 14.2, the duplicate filter should detect whether the URL it is ignoring has previously been sent to a Spider; if it has not, it should not silently ignore the duplicate URL. A recent Request I made had a 302 redirect to an identical URL before redirecting to a unique URL. It took half an hour before I realized that the scheduler was ignoring the first redirect; there were no debug messages or exceptions raised. I set dont_filter to True, which fixed the situation, as expected.

Scrapy should either raise a quiet exception of some sort, or check whether a duplicate URL has previously been sent to a spider, instead of silently ignoring duplicates.

P.S.: I don't know why the site sent me to a duplicate URL, but the response had modified cookies.
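
For reference, the dont_filter workaround mentioned above is just a flag on the Request; a minimal sketch, with a placeholder URL:

from scrapy.http import Request

# dont_filter=True tells the scheduler's duplicate filter to accept this
# request even if an identical fingerprint has already been seen.
req = Request('http://example.com/redirect-target', dont_filter=True)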

@pablohoffman

Cookies are ignored when calculating the request fingerprint; see the code here: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/request.py#L18

We should consider including cookies in the fingerprint, but in my experience it would also affect some sites that continuously change cookies (ASP.NET, I'm looking at you).
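
To make that concrete, here is a minimal sketch of both variants using the request_fingerprint helper linked above; its include_headers argument exists in later Scrapy releases, so treat its availability here as an assumption, and the URL and cookie value are placeholders:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request('http://example.com/login', headers={'Cookie': 'session=abc'})

# Default fingerprint: built from the method, canonicalized URL and body.
# Headers (and therefore cookies) are ignored, so two requests to the same
# URL that differ only in cookies collapse to the same fingerprint.
fp_default = request_fingerprint(req)

# Opt-in variant: fold the Cookie header into the fingerprint, so the second
# redirect to the same URL would no longer be treated as a duplicate.
fp_with_cookies = request_fingerprint(req, include_headers=['Cookie'])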

About making the duplicates filter more verbose when it drops requests: the problem is that a lot of duplicates are usually filtered out in the common case, so the log would be too verbose and barely usable. We could add a setting to enable this behavior, but if it is disabled by default it will fail its goal (which is to raise awareness of what is happening).

@amckinlay

I think a setting for verbosity and a setting to include cookies in the fingerprint would be ideal. Verbosity for debugging would be very convenient, especially after the final request. If including cookies in the fingerprint is not reliable enough, then maybe another setting listing URLs for the duplicate filter to ignore would be more useful? I'm thinking of old sites with stupidly complicated auth sequences that redirect you to a URL twice (I guess with different cookies).
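
A rough sketch of that last idea as a custom dupefilter. RFPDupeFilter and the DUPEFILTER_CLASS setting are real; the DUPEFILTER_IGNORE_PATTERNS setting and the class below are invented for illustration, and the import path has moved between Scrapy versions (scrapy.dupefilter vs scrapy.dupefilters):

import re

from scrapy.dupefilters import RFPDupeFilter


class WhitelistDupeFilter(RFPDupeFilter):
    """Never treat requests matching the configured URL patterns as duplicates."""

    def __init__(self, path=None, patterns=()):
        super().__init__(path)
        self.patterns = [re.compile(p) for p in patterns]

    @classmethod
    def from_settings(cls, settings):
        # DUPEFILTER_IGNORE_PATTERNS is an invented setting name.
        return cls(patterns=settings.getlist('DUPEFILTER_IGNORE_PATTERNS'))

    def request_seen(self, request):
        if any(p.search(request.url) for p in self.patterns):
            return False  # never filter whitelisted URLs
        return super().request_seen(request)

It would be enabled with DUPEFILTER_CLASS = 'myproject.dupefilters.WhitelistDupeFilter' in settings.py, where myproject stands for whatever the project module is.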

@dangra dangra referenced this issue from a commit in dangra/scrapy
@dangra dangra Improve duplicate request logging. #105
* log first N discarded duplicate requests
* track discards by dupefilter in logstats extension
5a6f269
@dangra
Owner

Duplicate requests are very common for spiders that implement aggressive link extraction, such as those based on CrawlSpider, which is the default spider class. Logging every discarded request in this case is too much, and that is why they are not only discarded silently but also dropped without notifying the spider by triggering the request's errback.

In the past we logged too much for items and downloaded pages, and we settled on logging rates instead of a line per downloaded page or scraped item. What about adding discarded requests to the LogStats log line?

The commit attached to this issue implements that, plus logging of the first N requests discarded by the dupefilter.
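
Not the commit itself, but a rough sketch of the "log only the first N discarded duplicates" behavior it describes; the class name and the cap of 100 are invented, and stdlib logging is used here for brevity (Scrapy's own log module at the time worked differently):

import logging

logger = logging.getLogger(__name__)


class FirstNDuplicateLogger:
    """Log the first max_logs discarded duplicates, then go quiet."""

    def __init__(self, max_logs=100):
        self.max_logs = max_logs
        self.logged = 0

    def duplicate_discarded(self, request):
        if self.logged < self.max_logs:
            logger.debug("Discarded duplicate request: %s", request)
            self.logged += 1
            if self.logged == self.max_logs:
                logger.debug("Disabled logging of duplicate requests")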

@dangra
Owner

This is how it looks with the attached patch:

2012-04-22 00:37:30-0300 [followall] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min), discarded 0 requests (at 0 req/min)
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/scrapy-cloud.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/tour.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/about.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/autoscraping.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/proxyhub.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/pricing.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/faq.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Filtered offsite request to 'doc.scrapy.org': <GET http://doc.scrapy.org/en/latest/topics/spiders.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/services.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Discarded duplicate request: <GET http://scrapinghub.com/consulting-faq.html>
2012-04-22 00:37:32-0300 [followall] DEBUG: Disabled logging of duplicate requests
@artemdevel

The solution looks good to me; it would be cool if the duplicates count could be configured via a setting.

@pablohoffman

Why not add a new request_dropped signal and catch it in LogStats?
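
A sketch of that proposal: an extension that connects to a request_dropped signal and feeds a counter into the stats that LogStats reads. The signal is only being proposed in this thread (it exists in later Scrapy releases), and the 'scheduler/dropped' stats key is invented:

from scrapy import signals


class DroppedRequestCounter:
    """Count requests dropped by the scheduler so their rate can be reported."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.request_dropped,
                                signal=signals.request_dropped)
        return ext

    def request_dropped(self, request, spider):
        # Invented stats key; LogStats would read it alongside its page and
        # item counts to print a discarded-requests rate.
        self.stats.inc_value('scheduler/dropped')

Like other extensions, it would be wired in through the EXTENSIONS setting.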
