
[docs] recrawl and excludes #394

Open
wsdookadr opened this issue Sep 26, 2023 · 2 comments

wsdookadr commented Sep 26, 2023

I have a request regarding the documentation.

There are three topics that are under-documented. It would be useful for people (like me) if docs were available for these:

  1. recrawls: how to do them (this was asked before here)
  2. excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples (see the sketch after this list)
  3. in the Crawler Statistics, what does the count for "failed" mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when the pageLoadTimeout expired?
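
For point 2, here is a minimal sketch of what I assume the usual crawler semantics are (a URL must match at least one include and no exclude); whether browsertrix-crawler works exactly like this is what I'd like the docs to confirm:

```python
import re

# Hypothetical patterns for illustration only; the exact matching rules
# used by browsertrix-crawler should be confirmed in its documentation.
includes = [r"https://example\.com/blog/.*"]
excludes = [r".*\?replytocom=.*", r".*/tag/.*"]

def in_scope(url: str) -> bool:
    # Assumed semantics: a URL must match at least one include
    # and must not match any exclude.
    if not any(re.search(p, url) for p in includes):
        return False
    return not any(re.search(p, url) for p in excludes)

for url in [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-1?replytocom=42",
    "https://example.com/about",
]:
    print(url, "->", "in scope" if in_scope(url) else "excluded")
```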

pato-pan commented Oct 24, 2023

I don't think the logs currently indicate whether something was included or excluded because of a rule. When I look at the logs with --logging debug --logLevel --context, the websites that were excluded don't show up at all.

To find out whether something was excluded (or included), I open the logs and search for the website there. Sometimes a website gets captured even if it doesn't show up in the logs, so I also use replayweb.page to check.
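
If it helps, this is roughly how I grep for a URL across the crawl logs. It assumes the logs are JSON lines; the record layout here is just a guess, so treat it as a sketch rather than a documented format:

```python
import json
import sys

# Usage: python find_url_in_logs.py crawl.log https://example.com/page
# Assumes a JSON-lines log file; falls back to a plain substring match
# if a line is not valid JSON.
log_path, target = sys.argv[1], sys.argv[2]

hits = 0
with open(log_path, encoding="utf-8") as fh:
    for line in fh:
        try:
            haystack = json.dumps(json.loads(line))
        except json.JSONDecodeError:
            haystack = line
        if target in haystack:
            hits += 1
            print(line.rstrip())

if hits == 0:
    print(f"{target} not found in {log_path}; it may have been excluded "
          "before logging, so also check the WACZ in replayweb.page")
```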

tw4l assigned tw4l and Shrinks99 and unassigned tw4l on Oct 25, 2023

tw4l commented Oct 25, 2023

  • recrawls: how to do them (this was asked before here)

Currently there's no way to partially re-crawl with browsertrix-crawler. In our Browsertrix Cloud system you can use the archiveweb.page Chrome extension to manually capture content that wasn't crawled and then combine it with the crawl in a Collection, which replays together and can be downloaded as a single (nested) WACZ file.

  • excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples

Currently exclusions are not logged. We could possibly log these as debug messages so that they're optionally available, but that's not yet implemented.

  • in the Crawler Statistics, what does the count for "failed" mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when the pageLoadTimeout expired?

Failed pages are pages that return a 4xx or 5xx status code, or that hit the page load timeout. If anything is captured, it will be included in the WACZ, and each page should also show up in the pages.jsonl file within the WACZ with a load state indicator showing the last successful step for the page (e.g. content loaded, full page loaded, behaviors run).
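
To check this yourself, something like the sketch below can list the per-page records. It assumes the WACZ is a ZIP containing pages/pages.jsonl and that records may carry status / loadState fields; verify those field names against your own output:

```python
import json
import sys
from zipfile import ZipFile

# Usage: python inspect_pages.py crawl.wacz
# Prints URL, HTTP status, and load state for each page record.
wacz_path = sys.argv[1]

with ZipFile(wacz_path) as wacz:
    # Assumed location of the page list inside the WACZ package.
    with wacz.open("pages/pages.jsonl") as fh:
        for raw in fh:
            rec = json.loads(raw)
            if "url" not in rec:
                continue  # skip the header record
            print(rec["url"],
                  "status:", rec.get("status", "?"),
                  "loadState:", rec.get("loadState", "?"))
```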

Shrinks99 added the question (Further information is requested) label on Oct 25, 2023