Dev 1.0.0 -> Main #482
Commits on Nov 8, 2023
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343.
Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.).
- WARC writing support via the TS-based warcio.js library.
- Generates a single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused.
- Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest().
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via the browser network stack with Network.loadNetworkResource(), and node-based async fetch via fetch().
- Direct async fetch() capture of non-HTML URLs.
- Waiting for all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as the WARC is being written (not yet in use).
- Removed pywb, using cdxj-indexer for the --generateCDX option.
Commit 877d9f5
Commits on Nov 9, 2023
Commit af1e086
Commits on Nov 10, 2023
Add Prettier to the repo, and format all the files! (#428)
This adds Prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
Commit 2a49406
follow-up to #428: update ignore files (#431)
- actually update lint/prettier/git ignore files with scratch, crawls, test-crawls, behaviors, as needed
Commit 783d006
Raise size limit for large HTML pages (#430)
Previously, responses >2MB were streamed to disk and an empty response was returned to the browser, to avoid holding large responses in memory. This limit was too small, as some HTML pages may be >2MB, resulting in no content being loaded. This PR sets different limits:
- HTML, as well as other JS necessary for the page to load: 25MB
- All other content: 5MB
Also includes some more type fixing.
Commit ab0f66a
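The two limits from #430 can be pictured with a small sketch. Only the 25MB and 5MB figures come from the commit message; the constant names, the `ResourceKind` categories, and both functions are hypothetical and are not the crawler's actual code.

```typescript
// Hypothetical sketch: only the 25MB and 5MB figures come from the commit message.
const MB = 1024 * 1024;
const MAX_ESSENTIAL_SIZE = 25 * MB; // HTML and JS needed for the page to load
const MAX_OTHER_SIZE = 5 * MB; // all other content

// Simplified, assumed categories; the crawler's real classification may differ.
type ResourceKind = "document" | "script" | "other";

// Largest response size that is still buffered in memory for this kind.
function maxInMemorySize(kind: ResourceKind): number {
  return kind === "other" ? MAX_OTHER_SIZE : MAX_ESSENTIAL_SIZE;
}

// True when a response exceeds its limit and should be streamed to disk instead.
function shouldStreamToDisk(kind: ResourceKind, contentLength: number): boolean {
  return contentLength > maxInMemorySize(kind);
}
```

Under these assumptions a 3MB HTML page stays in memory (and so still loads in the browser), while a 6MB image is streamed to disk.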
Commits on Nov 13, 2023
logging: don't log filtered out direct fetch attempt as error (#432)
When calling directFetchCapture and aborting the response via an exception, throw `new Error("response-filtered-out");` so that it can be ignored. This exception is only used for direct capture, and should not be logged as an error. Rethrow and handle it in the calling function to indicate that direct fetch is skipped.
Commit 3972942
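A minimal sketch of the pattern described in #432, written synchronously for brevity. The `"response-filtered-out"` message and the `directFetchCapture` name come from the commit message; the function bodies, the `isAllowed` and `logError` parameters, and `tryDirectFetch` are assumptions made for illustration.

```typescript
const FILTERED_OUT = "response-filtered-out";

// Stand-in for the real capture step: throws the sentinel error when the
// response is filtered out (the isAllowed callback is hypothetical).
function directFetchCapture(url: string, isAllowed: (url: string) => boolean): string {
  if (!isAllowed(url)) {
    throw new Error(FILTERED_OUT);
  }
  return "captured " + url;
}

// Caller: a filtered-out response means "direct fetch skipped" and is not logged;
// any other exception is reported as a real error.
function tryDirectFetch(
  url: string,
  isAllowed: (url: string) => boolean,
  logError: (msg: string) => void,
): boolean {
  try {
    directFetchCapture(url, isAllowed);
    return true;
  } catch (e) {
    if (e instanceof Error && e.message === FILTERED_OUT) {
      return false;
    }
    logError("direct fetch failed: " + String(e));
    return false;
  }
}
```

Matching on the error message keeps the sentinel cheap to throw; a dedicated error subclass would be an equally reasonable choice.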
Fix potential for pending list never being processed (#433)
Due to an optimization, the numPending() call assumed that queueSize() would be called to update the cached queue size. However, in the current worker code, this is not the case. Remove caching of the queue size and just check the queue size in numPending(), to ensure the pending list is always processed.
Commit 0d51e03
more specific types additions (#434)
- add QueueEntry for type of json object stored in Redis
- and PageCallbacks for callback type
- use Crawler type
Commit 456155e
Commits on Nov 15, 2023
Add types + validation for log context options (#435)
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
Commit 19dac94
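One way the pieces named in #435 could fit together is sketched below. The names `LogContext`, `LOG_CONTEXT_TYPES`, and `formatErr` come from the commit message; the three context values, the `isLogContext` helper, and the shape of the returned dict are assumptions, since the commit message does not list them.

```typescript
// Illustrative subset only: the real list of log contexts is not given above.
const LOG_CONTEXT_TYPES = ["general", "worker", "recorder"] as const;

// Union type derived from the array, so the type and the validation list stay in sync.
type LogContext = (typeof LOG_CONTEXT_TYPES)[number];

// Hypothetical helper for validating a --context argument value.
function isLogContext(value: string): value is LogContext {
  return (LOG_CONTEXT_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Convert an unknown caught value (likely an Error) to a plain dict for logging.
// The field names here are assumed, not taken from the crawler.
function formatErr(e: unknown): Record<string, string> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack || "" };
  }
  return { message: String(e) };
}
```

Accepting `unknown` in `formatErr` is what lets catch handlers pass the caught value straight through without an explicit `any` annotation.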
Commits on Nov 16, 2023
Commit e9ed7a4
Commits on Dec 8, 2023
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is created
- support custom WARC prefix via --warcPrefix, prepended to new WARC filename, test via basic_crawl.test.js
- filename template for new files is: `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}` with `$ts` replaced at new file creation time with current timestamp
Improved support for long (non-terminating) responses, such as from live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if stream already flushed
- add timeout to final fetch tasks to avoid hanging indefinitely on finish
- fix adding `WARC-Truncated` header, need to set after stream is finished to determine if it's been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of collection (no need to be there).
Commit 3323262
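The filename template from #440 can be exercised in isolation. The template string itself is taken from the commit message; the function names and the timestamp format (digits of the ISO date) are assumptions, as the commit message does not say how `$ts` is rendered.

```typescript
// Build the template once per writer; "$ts" (no braces) stays literal in the
// template string and is filled in later, when a new file is created.
function warcFilenameTemplate(
  prefix: string,
  crawlId: string,
  workerid: number,
  gzip: boolean,
): string {
  return `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
}

// Assumed timestamp format: all digits of the ISO date, e.g. 20231208010203004.
function formatTimestamp(date: Date): string {
  return date.toISOString().replace(/[^\d]/g, "");
}

// Replace "$ts" with the timestamp at file creation time.
function resolveWarcFilename(template: string, date: Date): string {
  return template.replace("$ts", formatTimestamp(date));
}
```

Deferring the `$ts` substitution means each rollover gets a fresh timestamp while the prefix, crawl ID, and worker ID stay fixed.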
Commits on Dec 13, 2023
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it compiles
- if it fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with fatal exit code (17)
Commit 703835a
Commits on Jan 3, 2024
Commit 63c884f
Commit db2dbe0
Commits on Jan 15, 2024
Generate urn:pageinfo:<page url> records (#458)
Generate records for each page, containing a list of resources and their status codes, to aid in future diffing/comparison.
- Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle non-GET requests
- Adds `writeSingleRecord` to WARCWriter
Fixes #457
Commit 2fc0f67
Commits on Jan 17, 2024
skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/request pairs are not written to WARC (#460)
Allows for skipping network traffic that doesn't need to be stored, as it is not necessary or will result in incorrect replay (e.g. a 304 instead of a 200).
Commit 18ffb3d
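The skip rule in #460 amounts to a small predicate. The methods and status codes below are the ones listed in the commit title; the function and constant names are hypothetical.

```typescript
// Methods and statuses whose request/response pairs are not written to the WARC.
const SKIPPED_METHODS: readonly string[] = ["HEAD", "OPTIONS"];
const SKIPPED_STATUSES: readonly number[] = [206, 304];

// True when the pair should be skipped rather than stored.
function shouldSkipSaving(method: string, status: number): boolean {
  return (
    SKIPPED_METHODS.indexOf(method.toUpperCase()) !== -1 ||
    SKIPPED_STATUSES.indexOf(status) !== -1
  );
}
```

For example, storing a 304 would leave replay with no body for that URL, so skipping it avoids recording a response that cannot stand in for the original 200.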
Commit f4ecaa8
Commit 298deac
Commits on Feb 10, 2024
Add arg to write pages to Redis (#464)
Fixes #462
Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and Redis).
Also include timestamp (as ISO date) in `pageinfo:` records.
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Commit bdffa79
Commits on Feb 16, 2024
Page Resources: Include Cached Resources (#465)
Ensure cached resources (that are not written to WARC) are still included in the `urn:pageinfo:...` records. This will make it easier to track which resources are actually *loaded* from a given page.
Tests: add test to ensure pageinfo records for webrecorder.net and webrecorder.net/about include cached resources.
Commit 96f3c40
Commit 46eb02d
Commits on Feb 18, 2024
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
Commit e8f2073
Misc Page Resource/Recorder Fixes (#467)
- recorder: don't attempt to record response with mime type `text/event-stream` (will not terminate).
- resources: don't track non http/https resources.
- resources: store page timestamp on first resources URL match, in case multiple responses for same page encountered.
Commit 8d2d79a
Commits on Feb 20, 2024
Include resource type + mime type in page resources list (#468)
The `urn:pageinfo:<url>` record now includes the mime type + resource type (from Chrome) along with the status code for each resource, for better filtering / comparison.
Commit a512e92
Commits on Feb 21, 2024
Set warc prefix via WARC_PREFIX env var (#470)
In addition to the `--warcPrefix` flag, also support the WARC_PREFIX env var, which takes precedence.
Bump to 1.0.0-beta.4
Commit a5e9395
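The precedence rule in #470 is simple enough to show directly. The `WARC_PREFIX` name and the rule that it wins over `--warcPrefix` come from the commit message; the function name, the explicit `env` parameter (used instead of reading the process environment), and the empty-string fallback are assumptions.

```typescript
// Resolve the WARC prefix: the WARC_PREFIX env var takes precedence over the
// --warcPrefix CLI value. The "" fallback when neither is set is an assumption.
function resolveWarcPrefix(
  env: Record<string, string | undefined>,
  cliPrefix?: string,
): string {
  return env["WARC_PREFIX"] || cliPrefix || "";
}
```

Passing the environment in as an argument keeps the sketch testable; a real implementation would likely read it from the process environment.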
Commits on Feb 22, 2024
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
Commit 51660cd
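Taken together, #458, #464, #467, #468, and #471 describe what a pageinfo record carries: a per-resource status code, mime type, and resource type, an ISO timestamp, and a `counts: {jsErrors: number}` field. The sketch below is one possible shape under those constraints. Apart from the `urn:pageinfo:` prefix and `counts.jsErrors`, every field and function name here is an assumption and may not match the crawler's actual record layout.

```typescript
// Assumed per-resource entry: status code plus optional mime and resource type.
interface PageResource {
  status: number;
  mime?: string;
  type?: string;
}

// Assumed record layout; only `counts.jsErrors` is named in the commit messages.
interface PageInfoRecord {
  url: string;
  ts: string; // ISO date
  urls: { [url: string]: PageResource };
  counts: { jsErrors: number };
}

// Record URI for a page, following the urn:pageinfo:<page url> convention.
function pageInfoUrn(pageUrl: string): string {
  return "urn:pageinfo:" + pageUrl;
}

function newPageInfo(pageUrl: string, loadedAt: Date): PageInfoRecord {
  return { url: pageUrl, ts: loadedAt.toISOString(), urls: {}, counts: { jsErrors: 0 } };
}

// Track a resource, ignoring anything that is not http/https (as in #467).
function addResource(info: PageInfoRecord, url: string, resource: PageResource): void {
  if (!/^https?:/.test(url)) {
    return;
  }
  info.urls[url] = resource;
}

// Count a JS error reported on the page console.
function recordJsError(info: PageInfoRecord): void {
  info.counts.jsErrors += 1;
}
```

Keeping the resources keyed by URL makes later diffing between two crawls of the same page a matter of comparing two maps.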
Commits on Feb 23, 2024
Commit d36564e
Commits on Feb 24, 2024
warcwriter: better filehandle init on first use (#474)
Ensure the warcwriter file is initialized on first use, instead of throwing an error. It was being initialized from writeRecordPair() but not from writeSingleRecord().
Commit cdd047d
Commits on Feb 28, 2024
Include WARC prefix for screenshots and text WARCs (#473)
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to `screenshots.warc.gz` and `text.warc.gz`
Commit dd48251
new seed on redirect + error page check: (#476)
- if a seed page redirects (page response != seed url), then add the final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL, store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure page is marked as failed if page.url() starts with chrome-error://
- fixes #475
Commit fba4730
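The redirect and error-page checks from #476 could look roughly like the sketch below. `ScopedSeed`, `newScopeSeed()`, and the `chrome-error://` prefix come from the commit message; the constructor fields, `seedForRedirect`, and `isChromeErrorPage` are hypothetical simplifications.

```typescript
// Simplified stand-in for the crawler's ScopedSeed; the fields are assumed.
class ScopedSeed {
  constructor(
    readonly url: string,
    readonly include: string[],
    readonly exclude: string[],
  ) {}

  // Duplicate this seed with a different URL, keeping the original includes / excludes.
  newScopeSeed(url: string): ScopedSeed {
    return new ScopedSeed(url, this.include, this.exclude);
  }
}

// If the seed page redirected (final URL differs from the seed URL), return a new
// seed for the final URL with the same scope; otherwise return null.
function seedForRedirect(seed: ScopedSeed, finalUrl: string): ScopedSeed | null {
  return finalUrl !== seed.url ? seed.newScopeSeed(finalUrl) : null;
}

// A page whose URL starts with chrome-error:// failed to load and is marked as failed.
function isChromeErrorPage(pageUrl: string): boolean {
  return pageUrl.indexOf("chrome-error://") === 0;
}
```

Copying the scope onto the redirected seed keeps links on the final site in scope, which would otherwise be filtered out when the original seed URL no longer matches.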
Commits on Feb 29, 2024
store page statusCode if not 200 (#477)
- don't treat non-200 pages as errors, still extract text, take screenshots, and run behaviors
- only consider actual page load errors, e.g. chrome-error:// page url, as errors
Commit c348de2
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors 0.5.3, which will add support for behaviors to add links.
Simplify adding links by adding the links directly, instead of batching to 500 links. Errors are already being logged if queueing a new URL fails.
Commit 184f4a2
Commit dd78457
Commits on Mar 5, 2024
Fail on status code option + requeue fix (#480)
Add fail on status code option, --failOnInvalidStatus, to treat non-200 responses as failures. Can be useful especially when combined with --failOnFailedSeed or --failOnFailedLimit.
requeue: ensure requeued urls are requeued with same depth/priority, not 0
Commit 4520e9e
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add resourceType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention.
Fixes #451
Commit 5a47cc4
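The header described in #481 can be sketched in a few lines. The `WARC-Resource-Type` header name and the lowercasing come from the commit message; the function name and the plain-object return type are assumptions.

```typescript
// Build the extra WARC header from a CDP Network.ResourceType value
// (e.g. "Document", "XHR"), lowercased to match puppeteer/playwright naming.
function resourceTypeHeader(resourceType: string): Record<string, string> {
  return { "WARC-Resource-Type": resourceType.toLowerCase() };
}
```

With this, a CDP value of `Document` would be written as `WARC-Resource-Type: document`.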
Commit 63cedbc
resourceType lowercase fix: (#483)
Follow-up to #481:
- check reqresp.resourceType with lowercase value
- just set message based on resourceType value
Commit 65133c9