
[docs] recrawl and excludes #394

Open
wsdookadr opened this issue Sep 26, 2023 · 2 comments

wsdookadr commented Sep 26, 2023

I have a request regarding the documentation.

There are three topics that are under-documented. It would be useful for people (like me) if docs were available for these:

  1. recrawls: how to do them (this was asked before here)
  2. excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples (see the sketch after this list)
  3. in the Crawler Statistics, what does the count for "failed" mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when the pageLoadTimeout expired?
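
For point 2, here is a minimal sketch of what I assume the usual crawler semantics are (a URL must match at least one include and no exclude); whether browsertrix-crawler works exactly like this is what I'd like the docs to confirm:

```python
import re

# Hypothetical patterns for illustration only; the exact matching rules
# used by browsertrix-crawler should be confirmed in its documentation.
includes = [r"https://example\.com/blog/.*"]
excludes = [r".*\?replytocom=.*", r".*/tag/.*"]

def in_scope(url: str) -> bool:
    # Assumed semantics: a URL must match at least one include
    # and must not match any exclude.
    if not any(re.search(p, url) for p in includes):
        return False
    return not any(re.search(p, url) for p in excludes)

for url in [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-1?replytocom=42",
    "https://example.com/about",
]:
    print(url, "->", "in scope" if in_scope(url) else "excluded")
```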

pato-pan commented Oct 24, 2023

I don't think the logs currently indicate whether something was included or excluded because of a rule. When I look at the logs with --logging debug --logLevel --context, the websites that were excluded don't show up at all.

To find out whether something was excluded (or included), I open the logs and search for the website there. Sometimes a website gets captured even if it doesn't show up in the logs, so I also use replayweb.page to check.
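
If it helps, this is roughly how I grep for a URL across the crawl logs. It assumes the logs are JSON lines; the record layout here is just a guess, so treat it as a sketch rather than a documented format:

```python
import json
import sys

# Usage: python find_url_in_logs.py crawl.log https://example.com/page
# Assumes a JSON-lines log file; falls back to a plain substring match
# if a line is not valid JSON.
log_path, target = sys.argv[1], sys.argv[2]

hits = 0
with open(log_path, encoding="utf-8") as fh:
    for line in fh:
        try:
            haystack = json.dumps(json.loads(line))
        except json.JSONDecodeError:
            haystack = line
        if target in haystack:
            hits += 1
            print(line.rstrip())

if hits == 0:
    print(f"{target} not found in {log_path}; it may have been excluded "
          "before logging, so also check the WACZ in replayweb.page")
```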

tw4l assigned tw4l and Shrinks99 and unassigned tw4l on Oct 25, 2023

tw4l commented Oct 25, 2023

  • recrawls: how to do them (this was asked before here)

Currently there's no way to partially re-crawl with browsertrix-crawler. In our Browsertrix Cloud system you can use the archiveweb.page Chrome extension to manually capture content that wasn't crawled and then combine it with the crawl in a Collection, which replays together and can be downloaded as a single (nested) WACZ file.

  • excludes: how they actually work in combination with includes, how to check the logs to see whether something was actually excluded, and usage examples

Currently exclusions are not logged. We could possibly log these as debug messages so that they're optionally available, but that's not yet implemented.

  • in the Crawler Statistics, what does the count for "failed" mean? More specifically, are pages that exceed pageLoadTimeout still stored in the WARC in partial form, or are they discarded altogether? Is a "failed" page defined as one that was still loading external resources when the pageLoadTimeout expired?

Failed pages are pages that return a 4xx or 5xx status code, or that hit the page load timeout. If anything is captured, it will be included in the WACZ, and each page should also show up in the pages.jsonl file within the WACZ with a load state indicator showing the last successful step for the page (e.g. content loaded, full page loaded, behaviors run).
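
To check this yourself, something like the sketch below can list the per-page records. It assumes the WACZ is a ZIP containing pages/pages.jsonl and that records may carry status / loadState fields; verify those field names against your own output:

```python
import json
import sys
from zipfile import ZipFile

# Usage: python inspect_pages.py crawl.wacz
# Prints URL, HTTP status, and load state for each page record.
wacz_path = sys.argv[1]

with ZipFile(wacz_path) as wacz:
    # Assumed location of the page list inside the WACZ package.
    with wacz.open("pages/pages.jsonl") as fh:
        for raw in fh:
            rec = json.loads(raw)
            if "url" not in rec:
                continue  # skip the header record
            print(rec["url"],
                  "status:", rec.get("status", "?"),
                  "loadState:", rec.get("loadState", "?"))
```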

Shrinks99 added the question (Further information is requested) label on Oct 25, 2023