[#799] Add wacz archives to collection #800

kuechensofa · 2023-01-19T21:16:45Z

Description

Expand wb-manager features to allow adding wacz archives to a collection. wb-manager extracts the warc files to a temporary directory, copies them to the collection archive directory, renames them to the wacz filename with indices, rewrites the wacz index and adds it to the collection index.

Motivation and Context

See issue #799

Screenshots (if appropriate):

Types of changes

Replay fix (fixes a replay specific issue)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added or updated tests to cover my changes.
All new and existing tests passed.

Add wacz archives to collections using wb-manager

tw4l

Thank you @kuechensofa! This is a great first step in getting WACZ support into pywb and much appreciated! :)

I've left a few small comments. In addition, wb-manager add /path/to/my.wacz doesn't seem to be working for me testing locally with Python 3.10.9. I think there might be an encoding issue somewhere. Here's what I'm seeing:

❯ wb-manager add wacz-test /path/to/my.wacz
Error: unknown 1-field cdx format: [b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xbdis\x1a\xc9\xb6.\xfc\xfd\xfe\x8a\x13;bw\xdc\x13{c\xe5<\xec\x08\xc7\xb9\x12\x021K\x08\x81\x80/\x8e\x1c\x011J\xa0\xf1\xc6\xfd\xef\xef*\r6\x94d\x0b!\xd4\xb6\xe3\xb5\xbb\xdb\rEQY\xe4\x1a\x9eg\xadZ\xb9\xd2M\xc7\xff\xeeM\xa7\xbdQ0\xb3\xc1\xfc\xdfq:Y\xcc\xff{\xc7\xcd\xe7\xff\x13\xcdx0\xba\xfd|1\xb5\xd3\xc5\xf4?\x0c\xa1\x7f\xc3\x7f\x83\x7fsx!\x11\xfa/\x82\x08E\x18#\xac\x89\xc2\xe8\xbf\xfe\xef?./F\xff\xf8\xcf\x7f\xfd\xa3\xbfX\xcc\xe6\xff\xd9\xd9\xb9\xbf\xd2\xa7o\x97\xfe\xe4\xa6\xe3\xe5\xeb\x1e\xbf|\xdd\x7f\xfc\xfb\xbf\xfe1\x1e\x8cCr\xa9E\xb8Y$_I\x8e\xcd\x17fq9O\x8e\x92\x87\x93\xfc\xa0\x17\xe6\x8b\xe4\xc0\xbco\xf0\x7fH\xb3\xd3\xc8\x1d\xe6\x0b\xf5\xd3n\xfb\xb4I\x0e\x8f\x9a\x85\xca!\x15\x07\x8dJ\x9d\x9e\xb2R3\xf9\xca(Lz\x8b~\xf2\x15\xcc9I\x8eLc\x9c\x87\xfb\x8bP*\xf0\xfdu\xe3\x00N3\x0f7p\x11\\f\xf9w"\xcd\xa9b\x98fxD\x91[K\x94G\xf2\xd3\xb5\xb9p\x9fzw\xc9\x97\xe1\x0b\xd3\x0b\xbf\xbf|k\x84\x8b\xffx\xc1=\xd3\x82[d<\xf1\x8a\x19M\x15']

I tried with a few WACZ files and got the same thing every time.

It would be nice to have some tests for the new functionality as well. If you feel comfortable giving that a shot, please go ahead, or I can add some when we have this PR about ready to merge.

pywb/manager/manager.py

tw4l · 2023-01-24T15:36:33Z

pywb/manager/manager.py

+        collection_index_name = os.path.basename(collection_index_path)
+        collection_index_temp_path = os.path.join(tempdir, collection_index_name)
+
+        if os.path.exists(collection_index_path):


Unless I'm missing something, this might be a bit more defensive than needed since wb-manager init creates the index directory. What's the scenario where the index directory doesn't exist for a collection? If there is one and the check is needed, should we create it here if it's missing?

kuechensofa · 2023-01-25T17:09:22Z

Thanks for reviewing my pull request! :) Are there any public wacz archives available that are suitable for testing? I couldn't reproduce the bug you mentioned with wacz archives that I created with ArchiveWeb.page yet, but I would like to add some test cases for the feature and fix the error with the cdx rewriting.

tw4l · 2023-01-25T19:19:24Z

My pleasure! Thanks for the PR! :) I'm not aware of public WACZ collections but there are some samples in the py-wacz repository's test fixtures: https://github.com/webrecorder/py-wacz/tree/main/tests/fixtures.

I did a little digging and it seems that the issue is happening only when WACZ files have gzipped CDX indexes - that would make sense with why WACZ files created by archiveweb.page are being added fine and those created by browsertrix-crawler are not. You can create a WACZ that will fail using browsertrix-crawler in Docker like so:

$ docker pull webrecorder/browsertrix-crawler
$ docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateCDX --generateWACZ --collection test

The WACZ file will be output to crawls/collection/test/test.wacz from the directory you ran the Docker command from.

Detecting if the index is gzipped and reading it properly if so might add some additional complexity. If you want to take a stab at it, you're welcome to, but we could also simply catch the exception and print a warning for now and then we could add the ability to read gzipped CDX in a separate PR. Up to you!

Log an error message when adding an invalid wacz archive and abort the adding but don't raise an IOError

* uncompress gziped indexes before merging them * add tests for manager

* remove check for full path in log message assertion to make the test work in different testing environments

kuechensofa · 2023-01-26T11:27:19Z

Thanks for the hints. I now open gziped index files using gzip and tested the behaviour with an wacz archive I created with the browsertrix-crawler. I also added some tests for the feature that I added. I wasn't really sure where a good place could be store some test wacz archives and choose the sample_archive directory for now. There I copied a valid and an invalid wacz archive from py-wacz repo.

Do you think it's sufficient to detect whether it's a gziped CDX only by the file's extension?

tw4l · 2023-01-31T19:53:18Z

@kuechensofa This is looking great, thank you! Including the test WACZ files in the sample_archive directory makes sense to me, and the tests are much appreciated!

I had a chat with Ilya and would like to request one more small change. When we eventually release pywb 2.8, we are planning to add WACZ support that doesn't require unpacking the WACZ file. So eventually we'll want to be able to wb-manager add my.wacz and have it add the WACZ directly to a collection where it'll be replayable without extracting WARCS. We feel your PR is a great starting point for adding WACZ support in 2.7.x on the way, and that even after adding support for zipped WACZs, it'd be valuable to keep the option to extract the WARCS and indices.

In order to plan for that future, we'd like to add an --uncompress-wacz CLI flag to wb-manager add that would trigger the behavior added in this PR.

So expected behavior would be:

$ wb-manager add example.wacz: raises NotImplementedError for now, eventually will add WACZ directly to collection without unpacking
$ wb-manager add --uncompress-wacz example.wacz: Unpacks WARCS and indices and adds them to collection

How does that sound to you?

kuechensofa · 2023-02-02T14:00:25Z

@tw4l Thanks for the feedback! :)

I can implement the changes you suggested and they sound good to me.

Adds --uncompress-wacz flag to prepare future implementation of adding wacz files to a collection without unpacking them.

tw4l · 2023-02-15T18:06:16Z

@kuechensofa apologies for the delay, but this is looking great! After discussion this seemed like a major feature to us, so we will be including it in the 2.8 release alongside support for uncompressed WACZs. Thanks again for the PR and for all your efforts!

[webrecorder#799] Add wacz archives to collection

44ed693

Add wacz archives to collections using wb-manager

tw4l self-requested a review January 19, 2023 21:26

tw4l requested changes Jan 24, 2023

View reviewed changes

kuechensofa added 2 commits January 25, 2023 18:03

Log ignored archive files when adding archives to collection

c3884cf

Do not override existing warc files when adding wacz archive

df57635

kuechensofa added 4 commits January 25, 2023 23:16

Show invalid archive message only when there were invalid archives

bc15307

Log error when adding invalid wacz

a9e2f4e

Log an error message when adding an invalid wacz archive and abort the adding but don't raise an IOError

Fix bug merging gziped index files

9ab110f

* uncompress gziped indexes before merging them * add tests for manager

Update manager test to work in different environments

cfe0de5

* remove check for full path in log message assertion to make the test work in different testing environments

Add --uncompress-wacz flag

8e71821

Adds --uncompress-wacz flag to prepare future implementation of adding wacz files to a collection without unpacking them.

tw4l approved these changes Feb 15, 2023

View reviewed changes

tw4l merged commit 454486b into webrecorder:main Feb 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#799] Add wacz archives to collection #800

[#799] Add wacz archives to collection #800

kuechensofa commented Jan 19, 2023 •

edited

Loading

tw4l left a comment

tw4l Jan 24, 2023

kuechensofa commented Jan 25, 2023

tw4l commented Jan 25, 2023 •

edited

Loading

kuechensofa commented Jan 26, 2023

tw4l commented Jan 31, 2023

kuechensofa commented Feb 2, 2023

tw4l commented Feb 15, 2023 •

edited

Loading

[#799] Add wacz archives to collection #800

[#799] Add wacz archives to collection #800

Conversation

kuechensofa commented Jan 19, 2023 • edited Loading

Description

Motivation and Context

Screenshots (if appropriate):

Types of changes

Checklist:

tw4l left a comment

Choose a reason for hiding this comment

tw4l Jan 24, 2023

Choose a reason for hiding this comment

kuechensofa commented Jan 25, 2023

tw4l commented Jan 25, 2023 • edited Loading

kuechensofa commented Jan 26, 2023

tw4l commented Jan 31, 2023

kuechensofa commented Feb 2, 2023

tw4l commented Feb 15, 2023 • edited Loading

kuechensofa commented Jan 19, 2023 •

edited

Loading

tw4l commented Jan 25, 2023 •

edited

Loading

tw4l commented Feb 15, 2023 •

edited

Loading