Duplicated URLs are crawled twice #302

Devhercule · 2018-07-31T14:58:56Z

What is the current behavior?

Duplicated urls are not skipped. The same url is crawled twice.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    onSuccess: result => {
      console.log(result);
    },
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox']
  });

  // Queue the same URL twice; with skipDuplicates: true the second entry should be skipped.
  await crawler.queue([
    { url: 'https://www.example.com/' },
    { url: 'https://www.example.com/' },
  ]);

  await crawler.onIdle();
  await crawler.close();
})();

What is the expected behavior?

Crawled urls should be skipped even if they come from the queue.

Please tell us about your environment:

Version: latest
Platform / OS version: CentOS 7.1
Node.js version: v8.4.0

davidebaldini · 2018-09-25T12:22:12Z

The reason might lie in helper.js:

static generateKey(options) {
  const json = JSON.stringify(pick(options, PICKED_OPTION_FIELDS), Helper.jsonStableReplacer);
  return Helper.hash(json).substring(0, MAX_KEY_LENGTH);
}

Uniqueness is determined by a hash of the JSON.stringify() output, but that method doesn't guarantee a constant key order, so two equivalent option objects could produce different keys.

I'm looking for opinions. See https://github.com/substack/json-stable-stringify
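
To illustrate the concern about key order, here is a minimal sketch. It is not the library's code: `stableStringify` and `sha1` are hypothetical helpers, and SHA-1 is only an assumed stand-in for whatever `Helper.hash` uses. Whether this is what actually happens inside `generateKey` is unconfirmed, since the snippet above already passes a replacer named `jsonStableReplacer`.

```javascript
const crypto = require('crypto');

// Hypothetical helper: serialize with object keys sorted at every level.
// Assumes JSON-safe values (no undefined, functions, or cycles).
function stableStringify(value) {
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringify).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map(key => JSON.stringify(key) + ':' + stableStringify(value[key]))
      .join(',');
    return '{' + body + '}';
  }
  return JSON.stringify(value);
}

// Assumed hash function, for illustration only.
const sha1 = text => crypto.createHash('sha1').update(text).digest('hex');

// Same options, different insertion order.
const a = { url: 'https://www.example.com/', maxDepth: 3 };
const b = { maxDepth: 3, url: 'https://www.example.com/' };

const plainEqual = sha1(JSON.stringify(a)) === sha1(JSON.stringify(b));
const stableEqual = sha1(stableStringify(a)) === sha1(stableStringify(b));

console.log(plainEqual);  // false: key order leaks into the hash
console.log(stableEqual); // true: sorted keys give the same hash
```

If the keys really can differ this way, sorting them before hashing (as json-stable-stringify does) would remove that source of duplicates.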

BubuAnabelas · 2018-10-20T18:24:55Z

Same as #299
@yujiosaka should look into this.

SuperFireFoxy · 2019-10-11T11:45:51Z

In headless mode it keeps reporting 302.

popstas · 2020-03-05T15:46:51Z

I found two reasons:

1. maxConcurrency > 1: the same page is requested in parallel threads.
2. A page that redirects is deduplicated by its source URL, not its target. You can skip these URLs by setting skipRequestedRedirect: true.
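
The first reason can be pictured with a simplified model. This is a sketch of a generic check-then-act race, not the crawler's actual implementation: `crawlAll` and its `markBeforeRequest` flag are hypothetical, and the timer only stands in for a page load.

```javascript
// Simplified model: each URL is checked against a "crawled" set before it is requested.
// If the URL is only recorded after the request finishes, parallel requests for the
// same URL all pass the check.
async function crawlAll(urls, markBeforeRequest) {
  const crawled = new Set();
  const requests = [];

  const crawl = async url => {
    if (crawled.has(url)) return;            // duplicate check
    if (markBeforeRequest) crawled.add(url); // claim the URL before any await
    requests.push(url);
    await new Promise(resolve => setTimeout(resolve, 10)); // stands in for the page load
    crawled.add(url);                        // late marking
  };

  await Promise.all(urls.map(crawl));        // all URLs start in parallel
  return requests;
}

const url = 'https://www.example.com/';
crawlAll([url, url], false).then(r => console.log(r.length)); // 2: both requests ran
crawlAll([url, url], true).then(r => console.log(r.length));  // 1: second was skipped
```

In this model the duplicate disappears once the URL is recorded before the first asynchronous step, which is why a low concurrency can hide the problem.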

kulikalov · 2020-10-17T07:42:13Z

Is anyone considering creating a PR?

iamprageeth · 2022-06-19T06:36:19Z

Just posting here hoping this helps someone. It is true that it crawls duplicate URLs when concurrency > 1. So here is what I did.

1. First, create a SQLite database.
2. In the RequestStarted event, insert the current URL.
3. In the preRequest function (you can pass this function along with the options object), check whether there is a record of the current URL. If there is, the URL has already been crawled or is still being crawled, so return false. The crawler will skip the URL.
4. In the RequestRetried and RequestFailed events, delete the URL. That allows the crawler to try it again.
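
The steps above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the commenter's code: it swaps the SQLite database for an in-memory `Set`, `createUrlTracker` is a hypothetical helper, and it records the URL inside the `preRequest` check itself (rather than in RequestStarted) so that checking and inserting happen in one synchronous step.

```javascript
// Hypothetical helper; not part of the crawler library.
function createUrlTracker() {
  const seen = new Set(); // URLs that are in flight or already crawled

  return {
    // Meant to be passed as the preRequest option: returning false skips the request.
    preRequest(options) {
      if (seen.has(options.url)) return false;
      seen.add(options.url);
      return true;
    },
    // Meant to be called from the RequestRetried / RequestFailed handlers,
    // so the crawler is allowed to try the URL again.
    release(url) {
      seen.delete(url);
    },
  };
}

const tracker = createUrlTracker();
const first = tracker.preRequest({ url: 'https://www.example.com/' });
const second = tracker.preRequest({ url: 'https://www.example.com/' });
tracker.release('https://www.example.com/');
const afterRelease = tracker.preRequest({ url: 'https://www.example.com/' });

console.log(first, second, afterRelease); // true false true
```

An in-memory set only works within a single process; a database like the one described above would be needed if the state has to survive restarts or be shared between crawlers.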

kulikalov added the bug label Oct 17, 2020
