Dev 1.0.0 -> Main #482

Merged
merged 37 commits into from Mar 5, 2024

Commits on Nov 8, 2023

  1. Use new browser-based archiving mechanism instead of pywb proxy (#424)

    Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
    via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
    with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 
    
    Changes include:
    - Recorder class for capturing CDP network traffic for each page.
    - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
    - WARC writing support via TS-based warcio.js library.
    - Generates single WARC file per worker (still need to add size rollover).
    - Request interception via Fetch.requestPaused
    - Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
    - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
    async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
    via fetch()
    - Direct async fetch() capture of non-HTML URLs
    - Waiting for all requests to finish before moving on to the next page, up to the page timeout.
    - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
    - Removed pywb; cdxj-indexer is now used for the --generateCDX option.
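The interception and rewriting flow listed above can be sketched roughly as follows. This is a minimal illustration, not the crawler's Recorder class: the `CdpLike` interface, the `Rewriter` hook, and both function names are assumptions made for the example. Only the CDP method and event names (`Fetch.enable`, `Fetch.requestPaused`, `Fetch.getResponseBody`, `Fetch.fulfillRequest`, `Fetch.continueRequest`) are real protocol names.

```typescript
// Minimal CDP session shape (an assumption for this sketch). Puppeteer's
// CDPSession is assumed to offer compatible send()/on() methods.
interface CdpLike {
  send(method: string, params?: Record<string, unknown>): Promise<any>;
  on(event: string, handler: (params: any) => void): void;
}

// Hypothetical rewrite hook: returns replacement text, or null to leave
// the response body unchanged.
type Rewriter = (url: string, body: string) => string | null;

async function handlePaused(
  cdp: CdpLike,
  params: any,
  rewrite: Rewriter,
): Promise<void> {
  const requestId: string = params.requestId;
  const res = await cdp.send("Fetch.getResponseBody", { requestId });
  // Only text bodies are rewritten here; base64 (binary) bodies pass through.
  const rewritten: string | null = res.base64Encoded
    ? null
    : rewrite(params.request.url, res.body);
  if (rewritten === null) {
    await cdp.send("Fetch.continueRequest", { requestId });
    return;
  }
  await cdp.send("Fetch.fulfillRequest", {
    requestId,
    responseCode: params.responseStatusCode,
    responseHeaders: params.responseHeaders,
    // CDP expects a base64 body; btoa() only handles Latin-1 text.
    body: btoa(rewritten),
  });
}

async function enableInterception(
  cdp: CdpLike,
  rewrite: Rewriter,
): Promise<void> {
  cdp.on("Fetch.requestPaused", (params: any) => {
    void handlePaused(cdp, params, rewrite);
  });
  // Pause every request at the Response stage so the body can be inspected.
  await cdp.send("Fetch.enable", {
    patterns: [{ urlPattern: "*", requestStage: "Response" }],
  });
}
```

A real recorder would also need to handle redirects (which have no body), large or streaming responses, and errors from the CDP calls; those are left out here.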
    ikreymer committed Nov 8, 2023
    877d9f5

Commits on Nov 9, 2023

  1. TypeScript Conversion (#425)

    Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426
    
    ---------
    Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
    Co-authored-by: emma <hi@emma.cafe>
    ikreymer committed Nov 9, 2023
    af1e086

Commits on Nov 10, 2023

  1. Add Prettier to the repo, and format all the files! (#428)

    This adds prettier to the repo, and sets up the pre-commit hook to
    auto-format as well as lint.
    Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
    emma-sg committed Nov 10, 2023
    2a49406
  2. follow-up to #428: update ignore files (#431)

    - actually update lint/prettier/git ignore files with scratch, crawls, test-crawls, behaviors, as needed
    ikreymer committed Nov 10, 2023
    783d006
  3. Raise size limit for large HTML pages (#430)

    Previously, responses >2MB were streamed to disk and an empty response was returned to the browser,
    to avoid holding large responses in memory.
    This limit was too small, as some HTML pages may be >2MB, resulting in no content being loaded.
    
    This PR sets different limits:
    - HTML, as well as other JS necessary for the page to load: 25MB
    - All other content: 5MB
    
    Also includes some more type fixing
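
A per-response limit check along these lines would implement the two tiers described above. This is a hedged sketch: the constant names, the helper names, the choice of which resource types count as essential, and the use of decimal megabytes are all assumptions for illustration. Only the 25MB and 5MB values come from the commit message.

```typescript
// Limits from the commit message; constant names are hypothetical, and
// decimal megabytes are an assumption.
const MAX_ESSENTIAL_SIZE = 25 * 1000 * 1000; // HTML and JS needed for page load
const MAX_OTHER_SIZE = 5 * 1000 * 1000; // all other content

// Resource types assumed (for this sketch only) to be essential for page load.
const ESSENTIAL_TYPES: readonly string[] = ["document", "script"];

function maxInMemorySize(resourceType: string): number {
  return ESSENTIAL_TYPES.indexOf(resourceType.toLowerCase()) >= 0
    ? MAX_ESSENTIAL_SIZE
    : MAX_OTHER_SIZE;
}

// True when the body should be streamed to disk instead of held in memory.
function shouldStreamToDisk(resourceType: string, contentLength: number): boolean {
  return contentLength > maxInMemorySize(resourceType);
}
```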
    ikreymer committed Nov 10, 2023
    ab0f66a

Commits on Nov 13, 2023

  1. logging: don't log filtered out direct fetch attempt as error (#432)

    When directFetchCapture aborts the response via an exception,
    throw `new Error("response-filtered-out");`
    so that it can be ignored. This exception is only used for direct
    capture and should not be logged as an error; rethrow it and
    handle it in the calling function to indicate that the direct fetch was skipped
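
The sentinel-error pattern described here can be sketched as below. The sketch is simplified to synchronous code, and the function bodies, the `tryDirectFetch` wrapper, and the HTML filter condition are assumptions for illustration. Only the `"response-filtered-out"` message and the `directFetchCapture` name come from the commit message.

```typescript
const FILTERED_OUT = "response-filtered-out";

// Simplified, synchronous stand-in for the direct fetch step.
function directFetchCapture(url: string, contentType: string): string {
  if (contentType.startsWith("text/html")) {
    // Assumed filter: HTML is left to the browser, so abort direct capture.
    throw new Error(FILTERED_OUT);
  }
  return "captured " + url;
}

// Returns true if the URL was captured directly, false if it was skipped
// or failed. Only unexpected failures are passed to logError.
function tryDirectFetch(
  url: string,
  contentType: string,
  logError: (message: string) => void,
): boolean {
  try {
    directFetchCapture(url, contentType);
    return true;
  } catch (e) {
    if (e instanceof Error && e.message === FILTERED_OUT) {
      return false; // expected: direct fetch skipped, nothing to log
    }
    logError("direct fetch failed: " + String(e));
    return false;
  }
}
```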
    ikreymer committed Nov 13, 2023
    3972942
  2. Fix potential for pending list never being processed (#433)

    Due to an optimization, numPending() assumed that queueSize() would
    be called to update the cached queue size. However, in the current worker
    code, this is not the case. Remove caching of the queue size and just check
    the queue size in numPending(), to ensure the pending list is always processed.
    ikreymer committed Nov 13, 2023
    0d51e03
  3. more specific types additions (#434)

    - add QueueEntry for type of json object stored in Redis
    - and PageCallbacks for callback type
    - use Crawler type
    ikreymer committed Nov 13, 2023
    456155e

Commits on Nov 15, 2023

  1. Add types + validation for log context options (#435)

    - add LogContext type and enumerate all log contexts
    - also add LOG_CONTEXT_TYPES array to validate --context arg
    - rename errJSON -> formatErr, convert unknown (likely Error) to dict
    - make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
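
One common way to get both a union type and a runtime validation array from a single list is shown below. This is a sketch of the general technique, not the crawler's logger: the context names are placeholders, and `validateContexts` is a hypothetical helper. `LogContext`, `LOG_CONTEXT_TYPES`, and `formatErr` are names from the commit message, but their bodies here are assumed.

```typescript
// Placeholder context names; the real list is not shown in this commit log.
const LOG_CONTEXT_TYPES = ["general", "worker", "recorder"] as const;

// Union type derived from the array, so the two cannot drift apart.
type LogContext = (typeof LOG_CONTEXT_TYPES)[number];

function isLogContext(value: string): value is LogContext {
  return (LOG_CONTEXT_TYPES as readonly string[]).indexOf(value) >= 0;
}

// Hypothetical validator for values passed via the --context arg.
function validateContexts(values: string[]): LogContext[] {
  const result: LogContext[] = [];
  for (const value of values) {
    if (!isLogContext(value)) {
      throw new Error("invalid log context: " + value);
    }
    result.push(value);
  }
  return result;
}

// Accepts unknown (as caught in a catch handler) and returns a plain dict.
function formatErr(e: unknown): Record<string, string> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack ?? "" };
  }
  return { message: String(e) };
}
```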
    ikreymer committed Nov 15, 2023
    19dac94

Commits on Nov 16, 2023

  1. Merge 0.12.2 into dev-1.0.0

    ikreymer committed Nov 16, 2023
    e9ed7a4

Commits on Dec 8, 2023

  1. WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
    
    Support for rollover size and custom WARC prefix templates:
    - reenable --rolloverSize (default to 1GB) for when a new WARC is
    created
    - support custom WARC prefix via --warcPrefix, prepended to new WARC
    filename, test via basic_crawl.test.js
    - filename template for new files is:
    `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
    with `$ts` replaced at new file creation time with current timestamp
    
    Improved support for long (non-terminating) responses, such as from
    live-streaming:
    - add a size to CDP takeStream to ensure data is streamed in fixed
    chunks, defaulting to 64k
    - change shutdown order: first close browser, then finish writing all
    WARCs to ensure any truncated responses can be captured.
    - ensure WARC is not rewritten after it is done, skip writing records if
    stream already flushed
      - add timeout to final fetch tasks to avoid hanging forever on finish
    - fix adding `WARC-Truncated` header, need to set after stream is
    finished to determine if it's been truncated
    - move temp download `tmp-dl` dir to main temp folder, outside of
    collection (no need to be there).
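
The filename template and rollover check described above can be illustrated as follows. This is a sketch under stated assumptions: the function names, the options interface, and the timestamp format (all digits of the UTC ISO date) are hypothetical. Only the template string itself follows the commit message.

```typescript
interface WarcNameOptions {
  prefix: string;
  crawlId: string;
  workerid: number;
  gzip: boolean;
}

// Built once per writer; "$ts" is left in place as a literal placeholder.
function warcFilenameTemplate(o: WarcNameOptions): string {
  return `${o.prefix}-${o.crawlId}-$ts-${o.workerid}.warc${o.gzip ? ".gz" : ""}`;
}

// Assumed timestamp format: all digits of the UTC ISO date string.
function timestampDigits(now: Date): string {
  return now.toISOString().replace(/[^\d]/g, "");
}

// Called each time a new WARC file is opened.
function nextWarcFilename(template: string, now: Date): string {
  return template.replace("$ts", timestampDigits(now));
}

// A new file is started once the current one reaches the rollover size.
function needsRollover(bytesWritten: number, rolloverSize: number): boolean {
  return bytesWritten >= rolloverSize;
}
```

Keeping `$ts` unresolved until file creation means one template can serve every rollover of the same worker, with only the timestamp changing between files.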
    ikreymer committed Dec 8, 2023
    3323262

Commits on Dec 13, 2023

  1. detect invalid custom behaviors on load: (#450)

    - on first page, attempt to evaluate the behavior class to ensure it
    compiles
    - if it fails to compile, log the exception as fatal and exit
    - update behavior gathering code to keep track of behavior filename
    - tests: add test for invalid behavior which causes crawl to exit with
    fatal exit code (17)
    ikreymer committed Dec 13, 2023
    703835a

Commits on Jan 3, 2024

  1. 63c884f
  2. bump to 1.0.0-beta.1

    update yarn.lock
    ikreymer committed Jan 3, 2024
    db2dbe0

Commits on Jan 15, 2024

  1. Generate urn:pageinfo:<page url> records (#458)

    Generate records for each page, containing a list of resources and their
    status codes, to aid in future diffing/comparison.
    
    Generates a `urn:pageinfo:<page url>` record for each page
    - Adds POST / non-GET request canonicalization from warcio to handle
    non-GET requests
    - Adds `writeSingleRecord` to WARCWriter
    
    Fixes #457
    ikreymer committed Jan 15, 2024
    2fc0f67

Commits on Jan 17, 2024

  1. skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/request pairs are not written to WARC (#460)
    
    Allows for skipping network traffic that doesn't need to be stored, as
    it is not necessary/will result in incorrect replay (eg. 304 instead of
    a 200).
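
The skip rule can be expressed as a small predicate. This is an illustrative sketch: the function name and signature are hypothetical, and only the listed methods and status codes come from the commit title.

```typescript
// True when a request/response pair should not be written to the WARC.
function shouldSkipSaving(method: string, status: number): boolean {
  const m = method.toUpperCase();
  if (m === "HEAD" || m === "OPTIONS") {
    return true; // no body worth replaying
  }
  // 206 is a partial body; 304 would replay as "not modified" instead of a 200.
  return status === 206 || status === 304;
}
```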
    ikreymer committed Jan 17, 2024
    18ffb3d
  2. f4ecaa8
  3. add fix from 0.12.4 - puppeteer-core to 20.8.2

    bump to 1.0.0-beta.2
    ikreymer committed Jan 17, 2024
    298deac

Commits on Feb 10, 2024

  1. Add arg to write pages to Redis (#464)

    Fixes #462 
    
    Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add
    pages to the database for each crawl.
    Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
    Also include timestamp (as ISO date) in `pageinfo:` records
    
    ---------
    Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
    tw4l committed Feb 10, 2024
    bdffa79

Commits on Feb 16, 2024

  1. Page Resources: Include Cached Resources (#465)

    Ensure cached resources (that are not written to WARC) are still
    included in the `urn:pageinfo:...` records. This will make it easier to
    track which resources are actually *loaded* from a given page.
    
    Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about
    include cached resources
    ikreymer committed Feb 16, 2024
    96f3c40
  2. 46eb02d

Commits on Feb 18, 2024

  1. Update Browser Image (#466)

    - Update to Brave browser (1.62.165)
    - Update page resource test to reflect latest Brave behavior
    ikreymer committed Feb 18, 2024
    e8f2073
  2. Misc Page Resource/Recorder Fixes (#467)

    - recorder: don't attempt to record response with mime type
    `text/event-stream` (will not terminate).
    - resources: don't track non http/https resources.
    - resources: store page timestamp on first resources URL match, in case
    multiple responses for same page encountered.
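
The first two filters above amount to simple checks on the mime type and URL scheme. This is a minimal sketch with hypothetical function names, not the recorder's actual code.

```typescript
// Server-sent event streams never terminate, so they are not recorded.
function isRecordableMime(mime: string): boolean {
  return !/^\s*text\/event-stream\s*(;|$)/i.test(mime);
}

// Only http/https resources are tracked in the page resource list.
function isTrackableResource(url: string): boolean {
  return /^https?:\/\//i.test(url);
}
```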
    ikreymer committed Feb 18, 2024
    8d2d79a

Commits on Feb 20, 2024

  1. Include resource type + mime type in page resources list (#468)

    The `urn:pageinfo:<url>` record now includes the mime type + resource type
    (from Chrome) along with status code for each resource, for better
    filtering / comparison.
    ikreymer committed Feb 20, 2024
    a512e92

Commits on Feb 21, 2024

  1. Set warc prefix via WARC_PREFIX env var (#470)

    In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
    which takes precedence.
    Bump to 1.0.0-beta.4
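
The precedence rule can be sketched as a small resolver. This is an illustration only: `resolveWarcPrefix` is a hypothetical helper, and falling back to an empty string when neither value is set is an assumption.

```typescript
// env is passed in explicitly (e.g. process.env) to keep the helper testable.
function resolveWarcPrefix(
  cliPrefix: string | undefined,
  env: Record<string, string | undefined>,
): string {
  // WARC_PREFIX takes precedence over --warcPrefix.
  return env["WARC_PREFIX"] || cliPrefix || "";
}
```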
    ikreymer committed Feb 21, 2024
    a5e9395

Commits on Feb 22, 2024

  1. pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
    
    Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
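
Taken together with the earlier pageinfo commits in this list (status codes, ISO timestamp, mime type and resource type), the record might be modeled as below. This is a hypothetical shape for illustration: only `counts: {jsErrors: number}` and the `urn:pageinfo:<page url>` URI form are stated in the commit messages, and every other field and function name is an assumption.

```typescript
interface PageResourceSketch {
  status: number;
  mime?: string; // mime type of the response
  type?: string; // resource type as reported by Chrome
}

interface PageInfoSketch {
  url: string;
  ts: string; // page timestamp as an ISO date
  urls: Record<string, PageResourceSketch>;
  counts: { jsErrors: number };
}

function pageInfoUri(pageUrl: string): string {
  return "urn:pageinfo:" + pageUrl;
}

function newPageInfo(url: string, ts: Date): PageInfoSketch {
  return { url, ts: ts.toISOString(), urls: {}, counts: { jsErrors: 0 } };
}

function addResource(
  info: PageInfoSketch,
  resourceUrl: string,
  resource: PageResourceSketch,
): void {
  info.urls[resourceUrl] = resource;
}

// Called once per JS error seen in the page console.
function addJsError(info: PageInfoSketch): void {
  info.counts.jsErrors += 1;
}
```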
    ikreymer committed Feb 22, 2024
    51660cd

Commits on Feb 23, 2024

  1. d36564e

Commits on Feb 24, 2024

  1. warcwriter: better filehandle init on first use (#474)

    Ensure the warcwriter file is initialized on first use, instead of throwing an error
    - was initializing from writeRecordPair() but not writeSingleRecord()
    ikreymer committed Feb 24, 2024
    cdd047d

Commits on Feb 28, 2024

  1. Include WARC prefix for screenshots and text WARCs (#473)

    Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
    `screenshots.warc.gz` and `text.warc.gz`
    ikreymer committed Feb 28, 2024
    dd48251
  2. new seed on redirect + error page check: (#476)

    - if a seed page redirects (page response != seed url), then add the
    final url as a new seed with same scope
    - add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
    store original includes / excludes
    - also add check for 'chrome-error://' URLs for the page, and ensure
    page is marked as failed if page.url() starts with chrome-error://
    - fixes #475
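
The redirect handling can be sketched as follows. `ScopedSeedSketch` is a stand-in with assumed fields, not the real `ScopedSeed` class, and `seedAfterLoad` is a hypothetical helper. Only the `newScopeSeed()` name and the `chrome-error://` check come from the commit message.

```typescript
class ScopedSeedSketch {
  url: string;
  include: RegExp[];
  exclude: RegExp[];
  depth: number;

  constructor(url: string, include: RegExp[], exclude: RegExp[], depth: number) {
    this.url = url;
    this.include = include;
    this.exclude = exclude;
    this.depth = depth;
  }

  // Duplicate this seed with a different URL, keeping the original scope rules.
  newScopeSeed(url: string): ScopedSeedSketch {
    return new ScopedSeedSketch(url, this.include, this.exclude, this.depth);
  }
}

// Returns a new seed if the page redirected, or null if no new seed is needed.
function seedAfterLoad(seed: ScopedSeedSketch, finalUrl: string): ScopedSeedSketch | null {
  if (finalUrl.startsWith("chrome-error://")) {
    return null; // page load failed; nothing to add
  }
  return finalUrl === seed.url ? null : seed.newScopeSeed(finalUrl);
}
```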
    ikreymer committed Feb 28, 2024
    fba4730

Commits on Feb 29, 2024

  1. store page statusCode if not 200 (#477)

    don't treat non-200 pages as errors, still extract text, take
    screenshots, and run behaviors
    only consider actual page load errors, eg. chrome-error:// page url, as
    errors
    ikreymer committed Feb 29, 2024
    c348de2
  2. Ensure links added via behaviors also get processed (#478)

    Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
    0.5.3, which will add support for behaviors to add links.
    
    Simplify adding links by adding the links directly, instead of
    batching to 500 links. Errors are already being logged if queueing a new
    URL fails.
    ikreymer committed Feb 29, 2024
    184f4a2
  3. dd78457

Commits on Mar 5, 2024

  1. Fail on status code option + requeue fix (#480)

    Add fail on status code option, --failOnInvalidStatus to treat non-200
    responses as failures. Can be useful especially when combined with
    --failOnFailedSeed or --failOnFailedLimit
    
    requeue: ensure requeued urls are requeued with same depth/priority, not
    0
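
Combined with the earlier change that stopped treating non-200 pages as errors, the failure decision could look like the sketch below. The `PageResultSketch` shape and the `isPageFailure` helper are hypothetical. The option name and the non-200 rule follow the commit message.

```typescript
interface PageResultSketch {
  url: string; // final page URL after load
  status: number; // HTTP status of the page response
}

function isPageFailure(page: PageResultSketch, failOnInvalidStatus: boolean): boolean {
  // An actual load error is always a failure.
  if (page.url.startsWith("chrome-error://")) {
    return true;
  }
  // Non-200 pages only count as failures when --failOnInvalidStatus is set.
  return failOnInvalidStatus && page.status !== 200;
}
```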
    ikreymer committed Mar 5, 2024
    4520e9e
  2. warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
    
    Add resourceType value from
    https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
    as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
    fixes #451
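
Setting the header is a small transformation of the CDP value. This is a sketch with a hypothetical helper name and a plain object standing in for the WARC header map. The header name and the lowercasing come from the commit message.

```typescript
// Returns a copy of the headers with WARC-Resource-Type added, if known.
function withResourceType(
  warcHeaders: Record<string, string>,
  resourceType: string | undefined,
): Record<string, string> {
  if (!resourceType) {
    return warcHeaders;
  }
  // CDP reports values such as "Document" or "XHR"; store them lowercased.
  return { ...warcHeaders, "WARC-Resource-Type": resourceType.toLowerCase() };
}
```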
    ikreymer committed Mar 5, 2024
    5a47cc4
  3. 63cedbc
  4. resourceType lowercase fix: (#483)

    follow-up to #481: check reqresp.resourceType against the lowercase value, and just
    set the message based on the resourceType value
    ikreymer committed Mar 5, 2024
    65133c9