Handle seed redirects #475

Closed
ikreymer opened this issue Feb 24, 2024 · 0 comments

A seed URL may redirect to a different URL via a 3xx response. For prefix- or domain-scoped crawls, the page reached after the redirect may then be out of scope, which causes the crawl to end early.
For example, https://nytimes.com/ redirects to https://www.nytimes.com/, so if the former is specified as a seed with scopeType prefix, the crawl ends after the first page.

To be as flexible as possible, the crawler should check whether the final URL of the seed page differs from the original seed URL and, if so, add the new URL as a seed as well, with the same scoping rules.
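
The behaviour proposed above could look roughly like the sketch below. The `Seed` interface, `inScope`, and `addRedirectSeed` are hypothetical, simplified names for illustration and are not the crawler's actual API; only the nytimes.com example and the `prefix` scope type come from the issue text.

```typescript
// Hypothetical, simplified seed type; the crawler's real seed class differs.
interface Seed {
  url: string;
  scopeType: "prefix";
  exclude: string[];
}

// Prefix scope (simplified): a URL is in scope if it starts with the seed URL
// and matches none of the exclusion substrings.
function inScope(seed: Seed, url: string): boolean {
  if (!url.startsWith(seed.url)) {
    return false;
  }
  return !seed.exclude.some((pattern) => url.includes(pattern));
}

// If the loaded page URL differs from the seed URL, add a copy of the seed
// that points at the final URL and keeps the same scoping rules.
function addRedirectSeed(seeds: Seed[], seed: Seed, finalUrl: string): Seed[] {
  if (finalUrl === seed.url || seeds.some((s) => s.url === finalUrl)) {
    return seeds; // no redirect, or the redirect target is already a seed
  }
  return [...seeds, { ...seed, url: finalUrl }];
}

const originalSeed: Seed = {
  url: "https://nytimes.com/",
  scopeType: "prefix",
  exclude: [],
};
const discoveredLink = "https://www.nytimes.com/section/world";

// Before the fix: links found on the redirected page are out of scope.
const inScopeBefore = [originalSeed].some((s) => inScope(s, discoveredLink));

// After the fix: the redirect target becomes a seed with the same scope.
const allSeeds = addRedirectSeed(
  [originalSeed],
  originalSeed,
  "https://www.nytimes.com/",
);
const inScopeAfter = allSeeds.some((s) => inScope(s, discoveredLink));
```

With only the original seed, `inScopeBefore` is `false`, which is why the crawl stops after one page; once the redirect target is added, `inScopeAfter` is `true`.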

@ikreymer ikreymer self-assigned this Feb 24, 2024
ikreymer added a commit that referenced this issue Feb 24, 2024
- if a seed page redirects (page response != seed url), then add the final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL, store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure page is marked as failed if page.url() starts with chrome-error://
- fixes #475
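
As a rough illustration of the `chrome-error://` check described in this commit message, the sketch below classifies a page by its final URL. The `pageStatus` helper and the `PageStatus` type are hypothetical names; the only detail taken from the commit is that a page whose URL starts with `chrome-error://` is treated as failed.

```typescript
// Prefix Chrome uses for its internal error page when a navigation fails.
const CHROME_ERROR_PREFIX = "chrome-error://";

type PageStatus = "loaded" | "failed";

// Hypothetical helper: treat any page that ended up on a chrome-error:// URL
// as failed, so that it is never promoted to a redirect seed.
function pageStatus(pageUrl: string): PageStatus {
  return pageUrl.startsWith(CHROME_ERROR_PREFIX) ? "failed" : "loaded";
}

const failedExample = pageStatus("chrome-error://chromewebdata/");
const loadedExample = pageStatus("https://www.nytimes.com/");
```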
ikreymer added a commit that referenced this issue Mar 16, 2024
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
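
One way to picture the serialization change described above: the 'finished' list is derived from the seen list by removing failed and queued URLs, and the extra seeds are saved alongside it. The `SavedState` shape and the `serializeState` function are assumptions made for this sketch; the crawler's actual state format and Redis layout are not shown in this thread.

```typescript
// Hypothetical shape of the saved crawl state.
interface SavedState {
  finished: string[];
  queued: string[];
  failed: string[];
  extraSeeds: string[];
}

// finished = seen - (queued + failed); extra seeds are stored as-is so that
// redirect seeds survive a restart.
function serializeState(
  seen: Set<string>,
  queued: string[],
  failed: string[],
  extraSeeds: string[],
): SavedState {
  const notFinished = new Set<string>([...queued, ...failed]);
  const finished = Array.from(seen).filter((url) => !notFinished.has(url));
  return { finished, queued, failed, extraSeeds };
}

const savedState = serializeState(
  new Set(["https://a.example/", "https://b.example/", "https://c.example/"]),
  ["https://b.example/"],
  ["https://c.example/"],
  ["https://www.a.example/"],
);
```

Here `savedState.finished` contains only `https://a.example/`, since the other two seen URLs are still queued or have failed.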
ikreymer added a commit that referenced this issue Mar 24, 2024
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is done

fixes #508
(for 1.0.3 release)
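
The limit adjustment can be sketched as follows. Because a redirected seed adds an entry to the seen list without representing an additional crawled page, the count of extra seeds is subtracted before comparing against the page limit. `isAtPageLimit` and its parameters are hypothetical, and treating a limit of 0 as "no limit" is an assumption of this sketch rather than something stated in the commit.

```typescript
// Hypothetical helper: compare the effective page count against the limit.
function isAtPageLimit(
  seenCount: number,
  extraSeedCount: number,
  pageLimit: number,
): boolean {
  if (pageLimit <= 0) {
    return false; // assumption: 0 (or less) means the crawl is unlimited
  }
  // Redirect seeds are in the seen list but should not count as pages.
  return seenCount - extraSeedCount >= pageLimit;
}

// 10 seen entries, one of which is a redirect seed: 9 real pages so far.
const limitReachedEarly = isAtPageLimit(10, 1, 10);
const limitReached = isAtPageLimit(11, 1, 10);
```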
ikreymer added a commit that referenced this issue Mar 26, 2024
)

- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is
done

fixes #508
(for 1.0.3 release)
ikreymer added a commit to webrecorder/browsertrix that referenced this issue Apr 4, 2024
… a redirect

following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get the actual page count.
ikreymer added a commit to webrecorder/browsertrix that referenced this issue Apr 4, 2024
… a redirect (#1649)

Following changes in webrecorder/browsertrix-crawler#475,
webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get
the total page count.