-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#799] Add wacz archives to collection #800
Conversation
Add wacz archives to collections using wb-manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kuechensofa! This is a great first step in getting WACZ support into pywb and much appreciated! :)
I've left a few small comments. In addition, wb-manager add /path/to/my.wacz
doesn't seem to be working for me testing locally with Python 3.10.9. I think there might be an encoding issue somewhere. Here's what I'm seeing:
❯ wb-manager add wacz-test /path/to/my.wacz
Error: unknown 1-field cdx format: [b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xbdis\x1a\xc9\xb6.\xfc\xfd\xfe\x8a\x13;bw\xdc\x13{c\xe5<\xec\x08\xc7\xb9\x12\x021K\x08\x81\x80/\x8e\x1c\x011J\xa0\xf1\xc6\xfd\xef\xef*\r6\x94d\x0b!\xd4\xb6\xe3\xb5\xbb\xdb\rEQY\xe4\x1a\x9eg\xadZ\xb9\xd2M\xc7\xff\xeeM\xa7\xbdQ0\xb3\xc1\xfc\xdfq:Y\xcc\xff{\xc7\xcd\xe7\xff\x13\xcdx0\xba\xfd|1\xb5\xd3\xc5\xf4?\x0c\xa1\x7f\xc3\x7f\x83\x7fsx!\x11\xfa/\x82\x08E\x18#\xac\x89\xc2\xe8\xbf\xfe\xef?./F\xff\xf8\xcf\x7f\xfd\xa3\xbfX\xcc\xe6\xff\xd9\xd9\xb9\xbf\xd2\xa7o\x97\xfe\xe4\xa6\xe3\xe5\xeb\x1e\xbf|\xdd\x7f\xfc\xfb\xbf\xfe1\x1e\x8cCr\xa9E\xb8Y$_I\x8e\xcd\x17fq9O\x8e\x92\x87\x93\xfc\xa0\x17\xe6\x8b\xe4\xc0\xbco\xf0\x7fH\xb3\xd3\xc8\x1d\xe6\x0b\xf5\xd3n\xfb\xb4I\x0e\x8f\x9a\x85\xca!\x15\x07\x8dJ\x9d\x9e\xb2R3\xf9\xca(Lz\x8b~\xf2\x15\xcc9I\x8eLc\x9c\x87\xfb\x8bP*\xf0\xfdu\xe3\x00N3\x0f7p\x11\\f\xf9w"\xcd\xa9b\x98fxD\x91[K\x94G\xf2\xd3\xb5\xb9p\x9fzw\xc9\x97\xe1\x0b\xd3\x0b\xbf\xbf|k\x84\x8b\xffx\xc1=\xd3\x82[d<\xf1\x8a\x19M\x15']
I tried with a few WACZ files and got the same thing every time.
It would be nice to have some tests for the new functionality as well. If you feel comfortable giving that a shot, please go ahead, or I can add some when we have this PR about ready to merge.
collection_index_name = os.path.basename(collection_index_path) | ||
collection_index_temp_path = os.path.join(tempdir, collection_index_name) | ||
|
||
if os.path.exists(collection_index_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, this might be a bit more defensive than needed since wb-manager init
creates the index directory. What's the scenario where the index directory doesn't exist for a collection? If there is one and the check is needed, should we create it here if it's missing?
Thanks for reviewing my pull request! :) Are there any public wacz archives available that are suitable for testing? I couldn't reproduce the bug you mentioned with wacz archives that I created with ArchiveWeb.page yet, but I would like to add some test cases for the feature and fix the error with the cdx rewriting. |
My pleasure! Thanks for the PR! :) I'm not aware of public WACZ collections but there are some samples in the py-wacz repository's test fixtures: https://github.com/webrecorder/py-wacz/tree/main/tests/fixtures. I did a little digging and it seems that the issue is happening only when WACZ files have gzipped CDX indexes - that would make sense with why WACZ files created by archiveweb.page are being added fine and those created by browsertrix-crawler are not. You can create a WACZ that will fail using browsertrix-crawler in Docker like so:
The WACZ file will be output to Detecting if the index is gzipped and reading it properly if so might add some additional complexity. If you want to take a stab at it, you're welcome to, but we could also simply catch the exception and print a warning for now and then we could add the ability to read gzipped CDX in a separate PR. Up to you! |
Log an error message when adding an invalid wacz archive and abort the adding but don't raise an IOError
* uncompress gziped indexes before merging them * add tests for manager
* remove check for full path in log message assertion to make the test work in different testing environments
Thanks for the hints. I now open gziped index files using gzip and tested the behaviour with an wacz archive I created with the browsertrix-crawler. I also added some tests for the feature that I added. I wasn't really sure where a good place could be store some test wacz archives and choose the sample_archive directory for now. There I copied a valid and an invalid wacz archive from py-wacz repo. Do you think it's sufficient to detect whether it's a gziped CDX only by the file's extension? |
@kuechensofa This is looking great, thank you! Including the test WACZ files in the sample_archive directory makes sense to me, and the tests are much appreciated! I had a chat with Ilya and would like to request one more small change. When we eventually release pywb 2.8, we are planning to add WACZ support that doesn't require unpacking the WACZ file. So eventually we'll want to be able to In order to plan for that future, we'd like to add an So expected behavior would be:
How does that sound to you? |
@tw4l Thanks for the feedback! :) I can implement the changes you suggested and they sound good to me. |
Adds --uncompress-wacz flag to prepare future implementation of adding wacz files to a collection without unpacking them.
@kuechensofa apologies for the delay, but this is looking great! After discussion this seemed like a major feature to us, so we will be including it in the 2.8 release alongside support for uncompressed WACZs. Thanks again for the PR and for all your efforts! |
Description
Expand
wb-manager
features to allow adding wacz archives to a collection.wb-manager
extracts the warc files to a temporary directory, copies them to the collection archive directory, renames them to the wacz filename with indices, rewrites the wacz index and adds it to the collection index.Motivation and Context
See issue #799
Screenshots (if appropriate):
Types of changes
Checklist: