Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#799] Add wacz archives to collection #800

Merged
merged 8 commits into from
Feb 15, 2023
Merged

Conversation

kuechensofa
Copy link
Contributor

@kuechensofa kuechensofa commented Jan 19, 2023

Description

Expand wb-manager features to allow adding wacz archives to a collection. wb-manager extracts the warc files to a temporary directory, copies them to the collection archive directory, renames them to the wacz filename with indices, rewrites the wacz index and adds it to the collection index.

Motivation and Context

See issue #799

Screenshots (if appropriate):

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

Add wacz archives to collections using wb-manager
@tw4l tw4l self-requested a review January 19, 2023 21:26
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kuechensofa! This is a great first step in getting WACZ support into pywb and much appreciated! :)

I've left a few small comments. In addition, wb-manager add /path/to/my.wacz doesn't seem to be working for me testing locally with Python 3.10.9. I think there might be an encoding issue somewhere. Here's what I'm seeing:

❯ wb-manager add wacz-test /path/to/my.wacz
Error: unknown 1-field cdx format: [b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xbdis\x1a\xc9\xb6.\xfc\xfd\xfe\x8a\x13;bw\xdc\x13{c\xe5<\xec\x08\xc7\xb9\x12\x021K\x08\x81\x80/\x8e\x1c\x011J\xa0\xf1\xc6\xfd\xef\xef*\r6\x94d\x0b!\xd4\xb6\xe3\xb5\xbb\xdb\rEQY\xe4\x1a\x9eg\xadZ\xb9\xd2M\xc7\xff\xeeM\xa7\xbdQ0\xb3\xc1\xfc\xdfq:Y\xcc\xff{\xc7\xcd\xe7\xff\x13\xcdx0\xba\xfd|1\xb5\xd3\xc5\xf4?\x0c\xa1\x7f\xc3\x7f\x83\x7fsx!\x11\xfa/\x82\x08E\x18#\xac\x89\xc2\xe8\xbf\xfe\xef?./F\xff\xf8\xcf\x7f\xfd\xa3\xbfX\xcc\xe6\xff\xd9\xd9\xb9\xbf\xd2\xa7o\x97\xfe\xe4\xa6\xe3\xe5\xeb\x1e\xbf|\xdd\x7f\xfc\xfb\xbf\xfe1\x1e\x8cCr\xa9E\xb8Y$_I\x8e\xcd\x17fq9O\x8e\x92\x87\x93\xfc\xa0\x17\xe6\x8b\xe4\xc0\xbco\xf0\x7fH\xb3\xd3\xc8\x1d\xe6\x0b\xf5\xd3n\xfb\xb4I\x0e\x8f\x9a\x85\xca!\x15\x07\x8dJ\x9d\x9e\xb2R3\xf9\xca(Lz\x8b~\xf2\x15\xcc9I\x8eLc\x9c\x87\xfb\x8bP*\xf0\xfdu\xe3\x00N3\x0f7p\x11\\f\xf9w"\xcd\xa9b\x98fxD\x91[K\x94G\xf2\xd3\xb5\xb9p\x9fzw\xc9\x97\xe1\x0b\xd3\x0b\xbf\xbf|k\x84\x8b\xffx\xc1=\xd3\x82[d<\xf1\x8a\x19M\x15']

I tried with a few WACZ files and got the same thing every time.

It would be nice to have some tests for the new functionality as well. If you feel comfortable giving that a shot, please go ahead, or I can add some when we have this PR about ready to merge.

pywb/manager/manager.py Show resolved Hide resolved
pywb/manager/manager.py Show resolved Hide resolved
collection_index_name = os.path.basename(collection_index_path)
collection_index_temp_path = os.path.join(tempdir, collection_index_name)

if os.path.exists(collection_index_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, this might be a bit more defensive than needed since wb-manager init creates the index directory. What's the scenario where the index directory doesn't exist for a collection? If there is one and the check is needed, should we create it here if it's missing?

@kuechensofa
Copy link
Contributor Author

Thanks for reviewing my pull request! :) Are there any public wacz archives available that are suitable for testing? I couldn't reproduce the bug you mentioned with wacz archives that I created with ArchiveWeb.page yet, but I would like to add some test cases for the feature and fix the error with the cdx rewriting.

@tw4l
Copy link
Member

tw4l commented Jan 25, 2023

My pleasure! Thanks for the PR! :) I'm not aware of public WACZ collections but there are some samples in the py-wacz repository's test fixtures: https://github.com/webrecorder/py-wacz/tree/main/tests/fixtures.

I did a little digging and it seems that the issue is happening only when WACZ files have gzipped CDX indexes - that would make sense with why WACZ files created by archiveweb.page are being added fine and those created by browsertrix-crawler are not. You can create a WACZ that will fail using browsertrix-crawler in Docker like so:

$ docker pull webrecorder/browsertrix-crawler
$ docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateCDX --generateWACZ --collection test

The WACZ file will be output to crawls/collection/test/test.wacz from the directory you ran the Docker command from.

Detecting if the index is gzipped and reading it properly if so might add some additional complexity. If you want to take a stab at it, you're welcome to, but we could also simply catch the exception and print a warning for now and then we could add the ability to read gzipped CDX in a separate PR. Up to you!

Log an error message when adding an invalid wacz archive and abort the adding but don't raise an IOError
* uncompress gziped indexes before merging them
* add tests for manager
* remove check for full path in log message assertion to make the test work in different testing environments
@kuechensofa
Copy link
Contributor Author

Thanks for the hints. I now open gziped index files using gzip and tested the behaviour with an wacz archive I created with the browsertrix-crawler. I also added some tests for the feature that I added. I wasn't really sure where a good place could be store some test wacz archives and choose the sample_archive directory for now. There I copied a valid and an invalid wacz archive from py-wacz repo.

Do you think it's sufficient to detect whether it's a gziped CDX only by the file's extension?

@tw4l
Copy link
Member

tw4l commented Jan 31, 2023

@kuechensofa This is looking great, thank you! Including the test WACZ files in the sample_archive directory makes sense to me, and the tests are much appreciated!

I had a chat with Ilya and would like to request one more small change. When we eventually release pywb 2.8, we are planning to add WACZ support that doesn't require unpacking the WACZ file. So eventually we'll want to be able to wb-manager add my.wacz and have it add the WACZ directly to a collection where it'll be replayable without extracting WARCS. We feel your PR is a great starting point for adding WACZ support in 2.7.x on the way, and that even after adding support for zipped WACZs, it'd be valuable to keep the option to extract the WARCS and indices.

In order to plan for that future, we'd like to add an --uncompress-wacz CLI flag to wb-manager add that would trigger the behavior added in this PR.

So expected behavior would be:

  • $ wb-manager add example.wacz: raises NotImplementedError for now, eventually will add WACZ directly to collection without unpacking
  • $ wb-manager add --uncompress-wacz example.wacz: Unpacks WARCS and indices and adds them to collection

How does that sound to you?

@kuechensofa
Copy link
Contributor Author

@tw4l Thanks for the feedback! :)

I can implement the changes you suggested and they sound good to me.

Adds --uncompress-wacz flag to prepare future implementation of adding wacz files to a collection without unpacking them.
@tw4l
Copy link
Member

tw4l commented Feb 15, 2023

@kuechensofa apologies for the delay, but this is looking great! After discussion this seemed like a major feature to us, so we will be including it in the 2.8 release alongside support for uncompressed WACZs. Thanks again for the PR and for all your efforts!

@tw4l tw4l merged commit 454486b into webrecorder:main Feb 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants