Handle seed redirects #475
Comments
ikreymer
added a commit
that referenced
this issue
Feb 24, 2024
- if a seed page redirects (page response != seed url), then add the final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL, store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure page is marked as failed if page.url() starts with chrome-error://
- fixes #475
Merged
ikreymer
added a commit
that referenced
this issue
Mar 16, 2024
- Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via #475). Extra seeds are added to Redis and also included in the serialization.

Fixes #491
ikreymer
added a commit
that referenced
this issue
Mar 24, 2024
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508 (for 1.0.3 release)
ikreymer
added a commit
that referenced
this issue
Mar 26, 2024
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508 (for 1.0.3 release)
ikreymer
added a commit
to webrecorder/browsertrix
that referenced
this issue
Apr 4, 2024
… a redirect following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the actual page count.
ikreymer
added a commit
to webrecorder/browsertrix
that referenced
this issue
Apr 4, 2024
… a redirect (#1649) Following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the total page count.
A seed URL may redirect to a different URL via a 3xx response. For crawls with prefix or domain scope, the page reached after the redirect may then be out of scope, causing the crawl to end early.
For example, https://nytimes.com/ redirects to https://www.nytimes.com/, so if the former is specified as a seed with scopeType prefix, the crawl will end after the first page.
To be as flexible as possible, the crawler should check whether the seed page's final URL differs from the original seed URL and, if it does, add the new URL as a seed as well, with the same scoping rules.
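A rough sketch of the proposed behavior, not the crawler's actual code: the `Seed` class, `withUrl()`, `handleSeedRedirect()` and `isInAnyScope()` below are hypothetical names made up for illustration, and prefix scope is simplified to a plain string-prefix match on everything up to the last `/` of the seed URL.

```typescript
// Hypothetical, simplified stand-in for a scoped seed (not the real ScopedSeed class).
type ScopeType = "page" | "prefix";

class Seed {
  readonly url: string;
  readonly scopeType: ScopeType;
  readonly exclude: RegExp[];
  readonly prefix: string;

  constructor(url: string, scopeType: ScopeType, exclude: RegExp[] = []) {
    this.url = url;
    this.scopeType = scopeType;
    this.exclude = exclude;
    // Assumption: prefix scope covers everything up to and including the last "/".
    this.prefix = url.slice(0, url.lastIndexOf("/") + 1);
  }

  // Duplicate this seed for a different URL, keeping the same scoping rules.
  withUrl(url: string): Seed {
    return new Seed(url, this.scopeType, this.exclude);
  }

  isInScope(url: string): boolean {
    if (this.exclude.some((rx) => rx.test(url))) {
      return false;
    }
    return this.scopeType === "page"
      ? url === this.url
      : url.startsWith(this.prefix);
  }
}

// Called after a seed page has loaded: if the final URL differs from the
// seed URL, register it as an extra seed with the same scope.
// Returns the new seed, or null if nothing needed to be added.
function handleSeedRedirect(
  seeds: Seed[],
  orig: Seed,
  finalUrl: string,
): Seed | null {
  if (finalUrl === orig.url || seeds.some((s) => s.url === finalUrl)) {
    return null;
  }
  const extra = orig.withUrl(finalUrl);
  seeds.push(extra);
  return extra;
}

// A URL is queued only if at least one seed considers it in scope.
function isInAnyScope(seeds: Seed[], url: string): boolean {
  return seeds.some((s) => s.isInScope(url));
}
```

With only `https://nytimes.com/` as a prefix-scoped seed, a link such as `https://www.nytimes.com/section/world` is out of scope; once `handleSeedRedirect()` has added `https://www.nytimes.com/` as an extra seed, the same link is in scope and the crawl can continue.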