Endless loop with duplicate values in query string #137

Closed
fcoelho opened this Issue May 25, 2012 · 7 comments

fcoelho commented May 25, 2012

Pages whose query strings contain duplicate key/value pairs are treated as different pages.

If the first page has a link to the second and the second a link to the third, scrapy enters an infinite loop, requesting each page in succession. I've put a simple page here to test the issue. Using a CrawlSpider with a rule like Rule(SgmlLinkExtractor(), callback='parse_page', follow=True) will cause this infinite recursion. Here is a sample spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class LoopingSpider(CrawlSpider):
    name = "loop"
    start_urls = ["http://fcoelho.alwaysdata.net/scrapy/page.php"]
    allowed_domains = ["fcoelho.alwaysdata.net"]
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print "Page: %s" % response.url
        return []

It seems the problem ultimately arises because scrapy.utils.url.canonicalize_url only sorts the query string keys but doesn't remove duplicates. As far as I can tell, such duplicates are "wrong" and shouldn't be counted as normal web page behavior. Would it be sane to remove arguments from the query string if their key is repeated?
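
For reference, a minimal way to see that behaviour (the URL is made up, and the exact output may differ between Scrapy versions, but the duplicate pair survives either way):

from scrapy.utils.url import canonicalize_url

# Query arguments get sorted, but the repeated a=1 pair is kept, so these
# two URLs remain distinct after canonicalization.
print(canonicalize_url("http://example.com/page.php?b=2&a=1&a=1"))
print(canonicalize_url("http://example.com/page.php?b=2&a=1"))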

pablohoffman (Owner) commented Aug 24, 2012

There can be duplicate keys in url arguments, see:
http://en.wikipedia.org/wiki/Query_string#Web_forms

So canonicalize_url() cannot just remove them by default: even if in some cases it makes no difference, doing so could well affect the content of the given page and actually point to a different resource.
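
For instance, a multi-select form field legitimately submits the same key several times, and the server may treat the values as a list (a generic illustration, not tied to any particular site):

try:
    from urlparse import parse_qs      # Python 2
except ImportError:
    from urllib.parse import parse_qs  # Python 3

# Both values are meaningful here; dropping one would point to a different resource.
print(parse_qs("tag=python&tag=scrapy"))  # {'tag': ['python', 'scrapy']}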

dangra (Owner) commented Jan 29, 2013

canonicalize_url is following the expected behavior.

dangra closed this Jan 29, 2013

sartian commented Oct 17, 2013

Hi fcoelho, did you find a workable strategy for coping with this problem of infinite / generated link loops?

While this isn't a problem with canonicalize_url per se, I've been working on a Scrapy crawler and have hit several websites that go into an endless loop like this. For now I am limiting the problem using DEPTH_LIMIT, but I'd love to know if you found a workaround. If I had to take a stab at it, I'd parse the URL and check for duplicate argument values (multiple arguments with the same key are valid, but I can't think of any case where identical key/value pairs would be critical for a web page).
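
A rough sketch of that check, just to make the idea concrete (the helper name is mine, not anything from Scrapy):

try:
    from urlparse import urlparse, parse_qsl      # Python 2
except ImportError:
    from urllib.parse import urlparse, parse_qsl  # Python 3

def has_duplicate_query_pairs(url):
    """Return True if the exact same key=value pair occurs more than once."""
    pairs = parse_qsl(urlparse(url).query, keep_blank_values=True)
    return len(pairs) != len(set(pairs))

# has_duplicate_query_pairs("http://example.com/page.php?a=1&a=1")  -> True
# has_duplicate_query_pairs("http://example.com/page.php?a=1&a=2")  -> False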

I did find a way to keep track of duplicate documents by generating a checksum for the human-visible document text, which I use in my own custom downloader middleware.

import hashlib
from scrapy.selector import HtmlXPathSelector

# ... (inside the custom downloader middleware class) ...

    def _get_response_checksum(self, response):
        """Strip HTML tags and return a body-text checksum for duplicate detection."""
        hxs = HtmlXPathSelector(response)
        # Join all text nodes outside <script> elements and collapse whitespace
        body = " ".join(" ".join(hxs.select("//*[name() != 'script']/text()").extract()).split())
        return hashlib.md5(body.encode('utf-8')).hexdigest()

The crawler I am working on crawls one site at a time and uses a modified version of the HTTPCACHE middleware. I have a separate content processing pipeline (external to Scrapy), and the feeder for that pipeline scans documents and builds a hash table of documents using the checksum as the key. The values are the document URLs. Given a set of URLs, my current strategy is to choose the shortest URL and ignore the rest.
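
Roughly, the bookkeeping looks like this (the names below are mine, not from the actual pipeline):

# checksum -> all URLs whose extracted text hashed to that value
docs_by_checksum = {}

def record(checksum, url):
    docs_by_checksum.setdefault(checksum, []).append(url)

def preferred_url(checksum):
    # Of all URLs that produced identical text, keep the shortest one
    return min(docs_by_checksum[checksum], key=len)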

All that said, I'd much rather not download 'duplicate' documents to begin with, so I am very interested in a strategy that can catch URLs like this and remove them from the processing/download stream; on some sites it means the difference between a crawl taking 5 minutes and taking 45 minutes.

fcoelho commented Nov 18, 2013

Hi @sartian, unfortunately I haven't. At the time only a very small subset of pages had that issue, and I just had the offending page removed from the search somehow, because it wasn't that relevant for my task. To be honest I don't remember exactly what I did, but I do remember that for me it wasn't a 5-versus-45-minutes difference, so I just went with it. Sorry I can't be of more help.

sartian commented Nov 18, 2013

Hi @fcoelho, thanks anyway! I have yet to come up with a good solution to this particular issue myself, but for now we're working around it. :)

mattfullerton commented Sep 25, 2014

There is a similar detection algorithm in Heritrix, the crawler used for archive.org:
https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/deciderules/PathologicalPathDecideRule.java

After some fiddling (I am not good with regex), a suitable regex for Python (which could be added to a Scrapy middleware) might be:
.*?\&(.*?\&)\1{1,}.*

Changing the 1 inside {1,} sets the minimum number of consecutive repetitions required (and the pattern assumes that, when there is only one repetition, another argument comes after it). Improvements would be appreciated.
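
A quick sanity check of the pattern with made-up URLs (the variable name and expected output are my own illustration of how I understand the regex to behave):

import re

pathological_args = re.compile(r'.*?\&(.*?\&)\1{1,}.*')

# A run of identical "a=1&" segments followed by another argument matches...
print(bool(pathological_args.match("http://example.com/page.php?x=0&a=1&a=1&a=1&y=2")))  # True
# ...while an ordinary query string does not.
print(bool(pathological_args.match("http://example.com/page.php?a=1&b=2")))  # False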

I already have a blacklist implemented inside a customized robots.txt middleware where I can specify forbidden URL parts for certain sites. I will add this check and see whether I have any more trouble with repeating URL arguments. Something like:

import re

from scrapy import log
from scrapy.exceptions import IgnoreRequest

def __init__(self, crawler):
    ...
    # Patterns for repeated "arg&" sequences and repeated "/segment/" sequences
    self.stoprepetitionsrearg = re.compile(ur'.*?\&(.*?\&)\1{1,}.*')
    self.stoprepetitionsreslash = re.compile(ur'.*?\/(.*?\/)\1{1,}.*')
    ...

def process_request(self, request, spider):
    ...
    # Check for silly repeating arguments or path segments
    if (self.stoprepetitionsrearg.match(request.url) is not None
            or self.stoprepetitionsreslash.match(request.url) is not None):
        log.msg(format="URL is suspicious: %(request)s",
                level=log.DEBUG, request=request)
        raise IgnoreRequest
    ...
