Use js-wacz to create WACZ files #484
Comments
Related js-wacz PR: harvard-lil/js-wacz#89
This was referenced Mar 22, 2024
Closed
ikreymer added a commit that referenced this issue Mar 26, 2024
Previously, there was the main WARCWriter as well as utility WARCResourceWriter that was used for screenshots, text, pageinfo and only generated resource records. This separate WARC writing path did not generate CDX, but used appendFile() to append new WARC records to an existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is done through a single WARCWriter, which uses a writable stream to append records, and can also generate CDX on the fly. This change is a pre-requisite to the js-wacz conversion (#484) since all WARCs need to have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
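For illustration only, the single-writer idea described in the commit message can be sketched as below. This is a simplified sketch, not the crawler's actual WARCWriter: the `SingleStreamWriter` class, the `IndexEntry` shape, and the JSON-per-line index format are all hypothetical, and real WARC serialization and compression are left out. It only shows how appending every record through one writable stream lets the writer know each record's offset and length, so an index entry can be produced at write time instead of by reindexing later.

```typescript
import { Writable } from "node:stream";

// Hypothetical, simplified index entry (not the real CDX/CDXJ schema).
interface IndexEntry {
  url: string;
  offset: number; // byte offset of the record within the output
  length: number; // byte length of the serialized record
}

// Hypothetical writer: every record type goes through the same stream.
class SingleStreamWriter {
  private out: Writable;
  private offset = 0;
  readonly index: IndexEntry[] = [];

  constructor(out: Writable) {
    this.out = out;
  }

  // Append one already-serialized record and index it in the same step.
  writeRecord(url: string, record: Uint8Array): void {
    this.out.write(record);
    this.index.push({ url, offset: this.offset, length: record.byteLength });
    this.offset += record.byteLength;
  }

  // Render the index as one JSON object per line (simplified).
  indexLines(): string[] {
    return this.index.map((entry) => JSON.stringify(entry));
  }
}

// Usage: an in-memory sink stands in for the WARC file stream.
const sink = new Writable({
  write(_chunk, _encoding, callback) {
    callback();
  },
});
const demoWriter = new SingleStreamWriter(sink);
demoWriter.writeRecord("https://example.com/", Buffer.from("response"));
demoWriter.writeRecord("urn:text:https://example.com/", Buffer.from("page text"));
console.log(demoWriter.indexLines().join("\n"));
```

Because resource-style records (screenshots, text, page info) pass through the same `writeRecord` call as everything else in this sketch, none of them can end up in the output without a matching index entry.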
Closed in favor of #674
Improvements for 1.0.0 branch of crawler:
/tmp-cdx rather than reindexing from WARCs
--generateCDX from temp-cdx/ rather than having to reindex from the WARCs
/tmp-cdx after no longer needed