Use js-wacz to create WACZ files #484

Closed
tw4l opened this issue Mar 5, 2024 · 2 comments

tw4l commented Mar 5, 2024

Improvements for 1.0.0 branch of crawler:

  • Switch from using py-wacz to js-wacz for WACZ generation
  • Pass in indexes from /tmp-cdx rather than reindexing from WARCs
  • Support creating indexes with --generateCDX from temp-cdx/ rather than having to reindex from the WARCs
  • Delete /tmp-cdx once it is no longer needed
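
As a rough illustration of the second and third items, the sketch below combines already-written per-WARC index files into one sorted index instead of re-reading the WARCs. It assumes the temporary index files are CDXJ text with one entry per line; the helper name `mergeCdxLines` and its input shape are hypothetical, and this is not how js-wacz or the crawler necessarily does it.

```typescript
// Hypothetical helper: the name and input shape are assumptions for illustration.
// Each input string is the full text of one per-WARC CDXJ file.
function mergeCdxLines(fileContents: string[]): string[] {
  const lines: string[] = [];
  for (const content of fileContents) {
    for (const line of content.split("\n")) {
      const trimmed = line.trim();
      // Skip blank lines, e.g. the trailing newline at the end of a file.
      if (trimmed.length > 0) {
        lines.push(trimmed);
      }
    }
  }
  // Default sort compares UTF-16 code units, giving a plain lexicographic order.
  return lines.sort();
}

// Example input: two small index files with made-up entries.
const merged = mergeCdxLines([
  "com,example)/b 20240305000000 {}\n",
  "com,example)/a 20240305000000 {}\n\n",
]);
console.log(merged.join("\n"));
```

Because the entries already exist on disk, the only work left is concatenating and sorting text lines, which is the saving the issue is after.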

tw4l commented Mar 7, 2024

Related js-wacz PR: harvard-lil/js-wacz#89

ikreymer added a commit that referenced this issue Mar 26, 2024
Previously, there was the main WARCWriter as well as a utility
WARCResourceWriter, which was used for screenshots, text, and pageinfo
and generated only resource records. This separate WARC writing path
did not generate CDX, but used appendFile() to append new WARC records
to an existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is
done through a single WARCWriter, which uses a writable stream to
append records and can also generate CDX on the fly. This change is a
prerequisite for the js-wacz conversion (#484), since all WARCs need to
have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
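
The idea in the commit message, one writer that appends every record and indexes it in the same step, can be sketched roughly as follows. This is an in-memory illustration only: the class name `SingleWriterSketch`, the `CdxEntry` fields, and the method names are hypothetical and are not the crawler's actual WARCWriter API. Real WARC records also carry headers and are usually compressed, which is omitted here.

```typescript
// Hypothetical sketch: names and record layout are illustrative only.
interface CdxEntry {
  url: string;
  timestamp: string;
  filename: string;
  offset: number; // byte offset of the record within the file
  length: number; // byte length of the record
}

class SingleWriterSketch {
  private filename: string;
  private chunks: Uint8Array[] = [];
  private offset = 0;
  readonly cdx: CdxEntry[] = [];

  constructor(filename: string) {
    this.filename = filename;
  }

  // Append one serialized record and record its index entry at the same time,
  // so no later pass over the file is needed to build the index.
  writeRecord(url: string, timestamp: string, payload: string): void {
    const bytes = new TextEncoder().encode(payload);
    this.chunks.push(bytes);
    this.cdx.push({
      url,
      timestamp,
      filename: this.filename,
      offset: this.offset,
      length: bytes.length,
    });
    this.offset += bytes.length;
  }

  get bytesWritten(): number {
    return this.offset;
  }
}

// Page responses and resource-style records go through the same path.
const writer = new SingleWriterSketch("rec-0.warc");
writer.writeRecord("https://example.com/", "20240326000000", "abc");
writer.writeRecord("urn:pageinfo:https://example.com/", "20240326000001", "defgh");
console.log(writer.cdx);
```

Routing every record type through one append path is what guarantees that each WARC has a matching index, which the commit describes as the prerequisite for the js-wacz conversion.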

tw4l commented Aug 26, 2024

Closed in favor of #674

tw4l closed this as completed Aug 26, 2024
ikreymer closed this as not planned Aug 26, 2024