duplicated url are crawled twice #302

Open
Devhercule opened this issue Jul 31, 2018 · 6 comments

@Devhercule

Devhercule commented Jul 31, 2018

What is the current behavior?

Duplicated urls are not skipped. The same url is crawled twice.

If the current behavior is a bug, please provide the steps to reproduce

const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    onSuccess: (result => {
      console.log(result);
    }),
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox']
  });
  
  await crawler.queue([
    { url: 'https://www.example.com/' },
    { url: 'https://www.example.com/' },
  ]);

  await crawler.onIdle();
  await crawler.close();
})();

What is the expected behavior?

Crawled urls should be skipped even if they come from the queue.

Please tell us about your environment:

  • Version: latest
  • Platform / OS version: CentOS 7.1
  • Node.js version: v8.4.0
@davidebaldini

The reason might lie in helper.js:

static generateKey(options) {
    const json = JSON.stringify(pick(options, PICKED_OPTION_FIELDS), Helper.jsonStableReplacer);
    return Helper.hash(json).substring(0, MAX_KEY_LENGTH);
  }

Uniqueness is determined by hashing the output of JSON.stringify(), but JSON.stringify() doesn't guarantee a consistent key order, so two equivalent option objects could produce different keys.

I'm looking for opinions. See https://github.com/substack/json-stable-stringify
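
To illustrate the ordering concern, here is a minimal sketch of an order-independent key. It is not the library's code: `stableStringify` and `keyFor` are hypothetical names, and the SHA-256 hash and 10-character key length are arbitrary choices for the example. It handles only plain JSON data (no functions or cyclic objects).

```javascript
const crypto = require('crypto');

// Hypothetical helper: serialize a value with object keys sorted at every
// level, so logically equal objects always yield the same string.
function stableStringify(value) {
  if (Array.isArray(value)) {
    // JSON represents undefined array entries as null.
    return '[' + value.map(v => (v === undefined ? 'null' : stableStringify(v))).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value)
      .filter(k => value[k] !== undefined) // JSON drops undefined properties
      .sort();
    return '{' + keys.map(k => JSON.stringify(k) + ':' + stableStringify(value[k])).join(',') + '}';
  }
  return JSON.stringify(value);
}

// Hypothetical key function: hash the stable string and truncate it.
function keyFor(options) {
  return crypto.createHash('sha256').update(stableStringify(options)).digest('hex').substring(0, 10);
}

const a = { url: 'https://www.example.com/', maxDepth: 3 };
const b = { maxDepth: 3, url: 'https://www.example.com/' };

console.log(JSON.stringify(a) === JSON.stringify(b)); // false: insertion order differs
console.log(keyFor(a) === keyFor(b));                 // true: keys are sorted first
```

Whether this is the actual cause here is unconfirmed; the sketch only shows what a stable serialization would guarantee.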

@BubuAnabelas

Same as #299
@yujiosaka should look into this.

@SuperFireFoxy

In headless mode it keeps reporting 302.

@popstas

popstas commented Mar 5, 2020

I found two reasons:

  1. maxConcurrency > 1: the same page is requested in parallel before either request has been recorded.
  2. A page that redirects is deduplicated by its source URL, not its target. You can skip these URLs by setting skipRequestedRedirect: true
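
Taken together, the two causes suggest options along these lines. This is a sketch, not a confirmed fix: `dedupeOptions` is a hypothetical name, and setting `maxConcurrency: 1` is an assumption drawn from the first cause (it trades crawl speed for serialized requests).

```javascript
// Hypothetical options object; spread it into the HCCrawler.launch({...}) call
// from the original report.
const dedupeOptions = {
  skipDuplicates: true,        // already set in the original report
  maxConcurrency: 1,           // assumption: one request at a time avoids the parallel duplicate
  skipRequestedRedirect: true, // skips URLs already seen as part of a redirect chain
};

// Example wiring (not executed here):
//   const crawler = await HCCrawler.launch({ ...dedupeOptions, maxDepth: 3 });
```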

@kulikalov kulikalov added the bug label Oct 17, 2020
@kulikalov
Contributor

kulikalov commented Oct 17, 2020

Is anyone considering creating a PR?

@iamprageeth

Just posting here hoping this helps someone. It is true that it crawls duplicate URLs when concurrency > 1. Here is what I did:

  1. First, create a SQLite database.
  2. In the RequestStarted event, insert the current URL.
  3. In the preRequest function (you can pass this function along with the options object), check whether there is a record of the current URL. If there is, the URL has already been crawled or is still being crawled, so return false and the crawler will skip it.
  4. In the RequestRetried and RequestFailed events, delete the URL so that the crawler can try it again.
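
The steps above can be sketched roughly as follows. To keep the example self-contained, an in-memory `Set` stands in for the SQLite table (an assumption: it does not persist across restarts), and `createUrlTracker`, `markStarted`, and `release` are hypothetical names.

```javascript
// In-memory stand-in for the SQLite table of started URLs.
function createUrlTracker() {
  const seen = new Set();
  return {
    // Step 2: record the URL when its request starts.
    markStarted(options) {
      seen.add(options.url);
    },
    // Step 3: returning false tells the crawler to skip the request.
    preRequest(options) {
      return !seen.has(options.url);
    },
    // Step 4: forget the URL so a retry or later attempt is allowed.
    release(options) {
      seen.delete(options.url);
    },
  };
}

// Hypothetical wiring, assuming the events are exposed as 'requeststarted',
// 'requestretried' and 'requestfailed'. The failure event may pass an error
// rather than the request options, so adapt that handler as needed.
//   const tracker = createUrlTracker();
//   const crawler = await HCCrawler.launch({ preRequest: tracker.preRequest });
//   crawler.on('requeststarted', tracker.markStarted);
//   crawler.on('requestretried', tracker.release);
//   crawler.on('requestfailed', tracker.release);
```

A database is still the better fit if the record has to survive a restart or be shared between processes.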
