Pywb 2.0.0 doesn't work with WARCs generated by Wget 1.19 #294

Closed
ghost opened this Issue Feb 7, 2018 · 7 comments

ghost commented Feb 7, 2018

Commands to reproduce:

pip3 install -U pywb

wget --warc-file=example.com --warc-cdx "http://example.com/" -O /dev/null
# this creates example.com.warc.gz, example.com.cdx

wb-manager init example

cp example.com.warc.gz collections/example/archive/
cp example.com.cdx collections/example/indexes/

wb-manager cdx-convert collections/example/indexes/
# type 'y'

wayback

Then go to localhost:8080/example/, type in example.com, click the record, and you should get an error like the following (or at least I do):

Error Details:

{'args': {'coll': 'example', 'type': 'replay'}, 'error': '{"message": ": invalid literal for int() with base 10: \'example.com.warc.gz\'", "errors": {"WARCPathLoader": ": invalid literal for int() with base 10: \'example.com.warc.gz\'"}}'}
Member

ikreymer commented Feb 7, 2018

Looks like the CDX format used by wget is one that pywb doesn't currently support.

The recommended way is to let pywb reindex the WARC (this also means fewer commands to type!):

wb-manager init example
wb-manager add example example.com.warc.gz

ghost commented Feb 7, 2018

@ikreymer That was actually what I tried first: just do wb-manager add. However, if I do that, then when I search for example.com in the archive, it doesn't seem to find anything (0 captures of http://example.com), even though the index.cdxj file seems to be generated correctly. I'm not sure if this is a problem on my end. Can you reproduce it?

If it helps:
OS: macOS 10.13.2
wget: 1.19.4
python: 3.6.4
Edited to add details.

ghost commented Feb 7, 2018

For what it's worth, I don't see anything in Webrecorder Player either. However, warcat is able to extract the content.

ghost commented Feb 8, 2018

Also, I was able to reproduce the issue using wget 1.19.4 compiled from source on a Debian VM. However, there seems to be no issue when using wget 1.18.

Moreover, I believe I have determined the source of the problem:

Wget 1.18 gives WARC output like:

WARC-Target-URI: http://example.com/

However, 1.19.4 outputs the following:

WARC-Target-URI: <http://example.com/>

So you can actually access the page, but you have to type <http://example.com/> into the input field of pywb to get the page!

The wget maintainers seem to have noticed this:

http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html ("Validity of angle brackets around WARC-Target-URI value")
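
To check whether a given WARC is affected, the header lines can be scanned for bracketed values. The following is a minimal sketch using only the Python standard library; `find_bracketed_target_uris` is a hypothetical helper and not part of pywb or warcio. It assumes a gzip-compressed WARC and, as a simplification, looks at every line instead of parsing record boundaries, so a payload line that happens to start with the header name would also match.

```python
import gzip

TARGET_HEADER = b"warc-target-uri:"


def find_bracketed_target_uris(warc_gz):
    """Return the WARC-Target-URI values that are wrapped in angle brackets.

    `warc_gz` may be a path or an open binary file object.
    """
    bracketed = []
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are usually laid out (one member per record).
    with gzip.open(warc_gz, "rb") as stream:
        for line in stream:
            if line.lower().startswith(TARGET_HEADER):
                value = line.split(b":", 1)[1].strip().decode("utf-8", "replace")
                if value.startswith("<") and value.endswith(">"):
                    bracketed.append(value)
    return bracketed
```

Calling `find_bracketed_target_uris("example.com.warc.gz")` on a file written by wget 1.19.4 should then list the bracketed URIs, while a file from wget 1.18 should give an empty list.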

ghost changed the title from Playback error ("invalid literal for int() with base 10") to Pywb 2.0.0 doesn't work with WARCs generated by Wget 1.19 · Feb 8, 2018

Member

ikreymer commented Feb 9, 2018

Nice find, thanks for looking into this!
This would indeed cause the problem. Most WARCs I've seen do not have the brackets, and most WARC tools do not produce brackets there. This seems to be an unfortunate side effect of ambiguity, and it looks like the brackets will be removed in a future wget version.

It might make sense to strip the brackets now since there are WARCs being created with them in place.

Would you mind attaching a sample WARC with brackets to this issue?

ghost commented Feb 9, 2018

Here's an example WARC with brackets: example.com.warc.gz

N0taN3rd added a commit to N0taN3rd/warcio that referenced this issue Oct 4, 2018

Wget 1.19 produces WARCs with WARC-Target-URI field values wrapped in angle brackets, which does not follow the WARC spec (see http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html for more details).

This causes warcio and projects that depend on warcio (e.g. pywb when producing CDX indexes, see webrecorder/pywb#294) to use the incorrect value for the WARC-Target-URI field.

Added `_ensure_target_uri_format` method to `warcio.recordloader.ArcWarcRecordLoader` that detects whether the value of the WARC-Target-URI field is affected by the Wget bug and corrects it.
Updated `test_archiveiterator.py` to account for this change and added a small WARC (example-wget-bad-target-uir.warc.gz) that is affected by this bug.

ikreymer added a commit to webrecorder/warcio that referenced this issue Oct 5, 2018

Support reading and sanitizing wget 1.19.4 created WARCs with WARC-Target-URI: <uri>

Wget 1.19 incorrectly produces WARCs with WARC-Target-URI field values wrapped in angle brackets (see http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html for more details).

This caused warcio and downstream projects (e.g. CDX indexing in webrecorder/pywb#294) to use an incorrect value for the WARC-Target-URI field.

Added `_ensure_target_uri_format` method to `warcio.recordloader.ArcWarcRecordLoader` that detects whether the value of the WARC-Target-URI field is affected by the Wget bug and corrects it.
Updated `test_archiveiterator.py` to account for this change and added a small WARC (example-wget-bad-target-uir.warc.gz) that is affected by this bug.
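
The kind of normalization described in this commit message can be sketched roughly as follows. This is an illustration only and not the actual warcio code: `strip_target_uri_brackets` and `normalize_headers` are hypothetical names, and headers are assumed to be plain `(name, value)` pairs.

```python
def strip_target_uri_brackets(value):
    """Return the URI without one pair of enclosing angle brackets, if present."""
    value = value.strip()
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


def normalize_headers(headers):
    """Return a copy of (name, value) pairs with WARC-Target-URI cleaned up.

    All other headers are passed through unchanged.
    """
    return [
        (name, strip_target_uri_brackets(value))
        if name.lower() == "warc-target-uri"
        else (name, value)
        for name, value in headers
    ]
```

Only a single enclosing pair is removed, so a URI that never had brackets is returned as it was.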

Member

ikreymer commented Oct 11, 2018

warcio 1.6.x now supports reading the wget WARCs with WARC-Target-URI wrapped in angle brackets. (This has been in wget for a while and is quite widespread).

The change is in warcio, which can be upgraded independently of pywb with pip install -U warcio.

The example WARC above and other wget-generated WARCs will then be read correctly by pywb.
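
A quick way to check whether the installed warcio is recent enough is sketched below. It assumes, based only on the comment above, that 1.6 is the first release line with the fix. The helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata


def parse_version(text):
    """Turn '1.6.2' into (1, 6, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_bracket_fix(version_text, minimum=(1, 6)):
    """True if the version is at least `minimum` (assumed to be 1.6 here)."""
    return parse_version(version_text) >= minimum


def installed_warcio_version():
    """Return the installed warcio version string, or None if it is missing."""
    try:
        return metadata.version("warcio")
    except metadata.PackageNotFoundError:
        return None
```

If `installed_warcio_version()` returns a version for which `has_bracket_fix` is false, running `pip install -U warcio` as described above should be enough.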

@ikreymer ikreymer closed this Oct 11, 2018
