Handle seed redirects #475

ikreymer · 2024-02-24T07:09:40Z

A seed URL may redirect to a different page via a 3xx response. For prefix- or domain-scoped crawls, the page reached after the redirect may be out of scope, causing the crawl to end early.
E.g., https://nytimes.com/ redirects to https://www.nytimes.com/, so if the former is specified as a seed with scopeType prefix, the crawl will end after the first page.

To be as flexible as possible, the crawler should check whether the seed page's final URL differs from the original seed URL and, if so, add the new URL as a seed as well, with the same scoping rules.
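A minimal sketch of the idea, using the nytimes.com example from above. The `SketchSeed` type and both helper functions are hypothetical stand-ins for illustration, not the crawler's actual seed class or scope logic, and prefix scope is approximated here as a plain string-prefix match.

```typescript
// Hypothetical, simplified seed shape (an assumption, not the crawler's ScopedSeed).
interface SketchSeed {
  url: string;
  scopeType: "prefix" | "domain";
}

// Assumed approximation of prefix scope: a URL is in scope if it starts with the seed URL.
function inPrefixScope(seed: SketchSeed, url: string): boolean {
  return url.startsWith(seed.url);
}

// If the loaded page ended up at a different URL, return a copy of the seed
// pointing at the final URL with the same scope type; otherwise return null.
function seedForRedirect(seed: SketchSeed, finalUrl: string): SketchSeed | null {
  if (finalUrl === seed.url) {
    return null;
  }
  return { url: finalUrl, scopeType: seed.scopeType };
}

const origSeed: SketchSeed = { url: "https://nytimes.com/", scopeType: "prefix" };
const redirectedTo = "https://www.nytimes.com/";
const discoveredLink = "https://www.nytimes.com/section/world";

// With only the original seed, the discovered link is out of scope.
const inScopeBefore = inPrefixScope(origSeed, discoveredLink);

// After adding the redirect target as an extra seed, the link is in scope.
const extraSeed = seedForRedirect(origSeed, redirectedTo);
const allSeeds = extraSeed ? [origSeed, extraSeed] : [origSeed];
const inScopeAfter = allSeeds.some((s) => inPrefixScope(s, discoveredLink));
```

Here `inScopeBefore` is false and `inScopeAfter` is true, which is the difference between the crawl ending after the first page and continuing.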

- if a seed page redirects (page response != seed url), then add the final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL, store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure page is marked as failed if page.url() starts with chrome-error://
- fixes #475
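The checks described in this change can be sketched roughly as follows. `classifySeedLoad` and the `SeedLoadOutcome` type are hypothetical names used only for illustration; the real change works with `page.url()` and `newScopeSeed()` on `ScopedSeed`, whose exact signatures are not shown in this thread.

```typescript
// Hypothetical outcome type for a seed page load (an assumption for this sketch).
type SeedLoadOutcome = "failed" | "redirected" | "ok";

const CHROME_ERROR_PREFIX = "chrome-error://";

// pageUrl stands in for the value of page.url() after navigation.
function classifySeedLoad(seedUrl: string, pageUrl: string): SeedLoadOutcome {
  // Chrome shows its own error page under chrome-error://, so treat that as a failure
  // before doing any redirect comparison.
  if (pageUrl.startsWith(CHROME_ERROR_PREFIX)) {
    return "failed";
  }
  // A final URL that differs from the seed URL means the seed redirected;
  // the caller would then add the final URL as a new seed with the same scope.
  if (pageUrl !== seedUrl) {
    return "redirected";
  }
  return "ok";
}

const outcomeError = classifySeedLoad("https://nytimes.com/", "chrome-error://chromewebdata/");
const outcomeRedirect = classifySeedLoad("https://nytimes.com/", "https://www.nytimes.com/");
const outcomeSame = classifySeedLoad("https://www.nytimes.com/", "https://www.nytimes.com/");
```

Checking for the error page first matters: an error page URL also differs from the seed URL, and it should not be added as a new seed.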

- Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via #475). Extra seeds are added to Redis and also included in the serialization.
- Fixes #491
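As a rough illustration of the serialization described here, the 'finished' list is a set difference. The `SketchSavedState` shape and the `serializeSketchState` function are assumptions for this sketch, not the crawler's actual saved-state format, and Redis is left out entirely.

```typescript
// Hypothetical saved-state shape (an assumption, not the crawler's real format).
interface SketchSavedState {
  queued: string[];
  failed: string[];
  finished: string[];
  extraSeeds: string[];
}

// finished = seen - failed - queued
function computeFinished(seen: string[], failed: string[], queued: string[]): string[] {
  const excluded = new Set<string>([...failed, ...queued]);
  return seen.filter((url) => !excluded.has(url));
}

function serializeSketchState(
  seen: string[],
  failed: string[],
  queued: string[],
  extraSeeds: string[],
): SketchSavedState {
  return {
    queued,
    failed,
    finished: computeFinished(seen, failed, queued),
    // Seeds added dynamically from redirects are carried along so they survive a restart.
    extraSeeds,
  };
}

const sketchState = serializeSketchState(
  ["https://example.com/a", "https://example.com/b", "https://example.com/c", "https://example.com/d"],
  ["https://example.com/c"],
  ["https://example.com/d"],
  ["https://www.example.com/"],
);
```

In this example `sketchState.finished` contains only the `/a` and `/b` URLs, since `/c` failed and `/d` is still queued.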

- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done
- fixes #508 (for 1.0.3 release)
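The limit adjustment can be sketched as below. The function name, its parameters, and the convention that a limit of 0 means "no limit" are all assumptions made for this example; the thread does not show how the crawler actually stores these counts.

```typescript
// Hypothetical limit check. A redirected seed contributes two entries to the seen
// list (the original seed URL and the redirect target) but only one crawled page,
// so the extra seeds are subtracted before comparing against the page limit.
function isAtPageLimit(seenCount: number, extraSeedCount: number, pageLimit: number): boolean {
  // Assumption for this sketch: a limit of 0 or less means no limit.
  if (pageLimit <= 0) {
    return false;
  }
  return seenCount - extraSeedCount >= pageLimit;
}

// Limit of 2 pages, one redirected seed: seen holds the seed URL and its redirect target.
const naiveAtLimit = 2 >= 2; // without the subtraction, the crawl would stop after one page
const adjustedAtLimit = isAtPageLimit(2, 1, 2); // one real page so far, limit not reached
const atLimitLater = isAtPageLimit(3, 1, 2); // two real pages, limit reached
```

Without the subtraction, a crawl with a redirecting seed stops one page short of its limit, which is the behavior reported in #508.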


… a redirect following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the actual page count.

… a redirect (#1649) Following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the total page count.

ikreymer self-assigned this Feb 24, 2024

ikreymer mentioned this issue Feb 24, 2024

new seed on redirect + error page check: #476

Merged

ikreymer closed this as completed in fba4730 Mar 5, 2024

ikreymer mentioned this issue Mar 15, 2024

Fix Save/Load State #495

Merged

ikreymer mentioned this issue Mar 24, 2024

Seed redirect causes one less URL to be crawled due to limit #508

Closed

ikreymer mentioned this issue Apr 4, 2024

fix issue with incorrect number of total pages if any of the seeds is a redirect webrecorder/browsertrix#1649

Merged

