[docs] recrawl and excludes #394
I don't think the logs currently indicate whether something was included or excluded because of a rule. When I look at the logs generated with --logging debug --logLevel --context, the websites that were excluded don't show up at all. To find out whether something was excluded (or included), I open the logs and search for the website. Sometimes a website gets captured even though it doesn't appear in the logs, so I use replayweb.page to check there too.
Currently there's no way to partially re-crawl with browsertrix-crawler. In our Browsertrix Cloud system you can use the archiveweb.page Chrome extension to manually capture content that wasn't crawled and then combine it with the crawl in a Collection, which replays together and can be downloaded as a single (nested) WACZ file.
Currently, exclusions are not logged. We could possibly log these as debug messages so that they're optionally available, but that's not yet implemented.
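As a rough sketch of what that could look like (this is illustrative only, not browsertrix-crawler's actual code; the names `isExcluded` and `logger` are made up for the example), an exclusion check that emits an optional debug message might be:

```javascript
// Hypothetical sketch: checking a URL against exclusion regexes and
// logging the decision at debug level so it is optionally visible.
// None of these names come from the real browsertrix-crawler API.
function isExcluded(url, excludeRegexes, logger) {
  for (const rx of excludeRegexes) {
    if (rx.test(url)) {
      // Debug-level log: only shown when debug logging is enabled
      logger.debug(`Excluded ${url} by rule ${rx}`);
      return true;
    }
  }
  return false;
}

// Minimal usage: collect debug messages in an array for demonstration
const messages = [];
const logger = { debug: (msg) => messages.push(msg) };

const excluded = isExcluded(
  "https://example.com/private/page",
  [/\/private\//],
  logger,
);
```

With debug logging routed this way, a user grepping the debug log would see one line per excluded URL, which is exactly the visibility the original question asks for.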
Failed pages are pages that return a 4xx or 5xx status code, or where the page load times out. If anything is captured, it will be included in the WACZ, and each page should also show up in the
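The failed-page rule described above can be sketched as a small predicate (again illustrative only; `isFailedPage` is a made-up name, not part of browsertrix-crawler):

```javascript
// Hypothetical sketch of the failed-page rule: a page counts as failed
// if the load timed out, or if the HTTP status is 4xx or 5xx.
function isFailedPage(statusCode, timedOut) {
  return timedOut || (statusCode >= 400 && statusCode < 600);
}
```

So a 404 or a 503 is failed, a 200 is not, and a timeout is failed regardless of status.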
I have a request regarding the documentation.
There are three topics that are underdocumented. It would be useful for people (like me) if docs were available for these: