Skip to content

Conversation

@tw4l
Copy link
Member

@tw4l tw4l commented Oct 2, 2023

Fixes #1180

Screenshot of option as presented in UI:

Screen Shot 2023-10-02 at 11 37 52 AM

Checkbox does not appear for seeded crawls.

Crawls that fail this way currently show as "Partial Complete", and the fatal message is clearly visible in the Error Logs.

@tw4l tw4l requested review from Shrinks99, SuaYoo and ikreymer October 2, 2023 15:44
Copy link
Collaborator

@Shrinks99 Shrinks99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@ikreymer
Copy link
Member

ikreymer commented Oct 2, 2023

I think the intention of the --failOnFailedSeed in the crawler is that the whole crawl should interrupt and fail immediately if any of the seeds fail. I see that as having the crawl be marked as failed in Browsertrix Cloud, rather than just doing a graceful stop. The reason to do this is that you need all URLs in the list to succeed, otherwise this is a failure.
The default behavior is to ignore the failed URLs and get as many as possible.

If we also add a failed threshold setting, that could then be use for more fine-grained control, eg. with this setting, and 100 URLs, 99/100 pages crawled would be considered a failure. But, if we a a fail threshold, then that could be set to 5, and 95/100 would be a failure by 99/100 would be success.

@SuaYoo
Copy link
Member

SuaYoo commented Oct 2, 2023

Nitpick and happy to move this topic to Discord, but could we move towards sentence case for long checkbox labels? Personally I find the long checkbox labels in title case hard to read, which I do think is validated rather than only personal preference. I also remember seeing that it makes localization/translation more difficult down the road, though I can't find any resources to back that up.

@ikreymer
Copy link
Member

ikreymer commented Oct 3, 2023

With webrecorder/browsertrix-crawler#402, the --failOnFailedSeed should also mark crawl as failed (needs a bit more testing).

tw4l and others added 2 commits October 3, 2023 10:44
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
@ikreymer
Copy link
Member

ikreymer commented Oct 4, 2023

Tested with latest crawler, works as expected.

@ikreymer ikreymer merged commit b1ead61 into main Oct 4, 2023
@ikreymer ikreymer deleted the issue-1180-fail-on-failed-seed-url-list branch October 4, 2023 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add --failOnFailedSeed crawler arg as option

5 participants