Is there a simple way to re-queue a page for crawling? Many sites apply request rate limiting (HTTP 429 status code), and typically it is just a matter of putting the page back in the queue for a later retry.
An alternative would be a way to rate-limit the crawler beyond maxConcurrency, perhaps a global maximum in requests per second (with the ability to pass values below 1 for even slower crawling).
Setting maxConcurrency to 1 still crawls too quickly.
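For illustration, here is a minimal sketch of the kind of global throttle being asked for, assuming each outgoing request is wrapped by the caller; `createThrottle` is a hypothetical helper and not part of website-scraper:

```javascript
// Hypothetical helper (not part of website-scraper): enforce a minimum
// gap between requests, so a rate of 0.5 requests/s means a 2000 ms gap.
function createThrottle(requestsPerSecond) {
  const interval = 1000 / requestsPerSecond; // 0.5 req/s -> 2000 ms
  let chain = Promise.resolve();
  return function throttle(task) {
    const run = chain.then(() => task());
    // Only release the next task after the previous one has finished
    // and the enforced delay has elapsed.
    chain = run.catch(() => {}).then(
      () => new Promise((resolve) => setTimeout(resolve, interval))
    );
    return run;
  };
}

// Usage (fetch is global in Node 18+):
// const throttle = createThrottle(0.5); // slower than one request per second
// const res = await throttle(() => fetch('https://example.com'));
```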
To achieve retries I suggest checking the `request` option. website-scraper uses the got module internally to make HTTP requests, and I suppose it's possible to configure got to retry when a request fails.
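A minimal sketch of that suggestion, assuming the `request` options are forwarded to got as described above; got's retry option can be given as an object, and 429 is among the status codes it retries by default:

```javascript
const scrape = require('website-scraper');

scrape({
  urls: ['https://example.com'],
  directory: './downloaded-site',
  // Assumption: these options are passed through to got. got's retry
  // option accepts a retry limit and the HTTP status codes that should
  // trigger a retry (429 is in its defaults).
  request: {
    retry: {
      limit: 5,
      statusCodes: [408, 429, 500, 502, 503, 504]
    }
  }
}).catch((err) => console.error(err));
```

As far as I know, got also honors the Retry-After header when scheduling retries, which is what most 429 responses send, so this may cover the re-queue case without changes to the crawler itself.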
This issue has been automatically closed because there has been no response from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.