Pywb 2.0.0 doesn't work with WARCs generated by Wget 1.19 #294
Comments
|
Looks like the CDX format used by The recommended way is to let pywb reindex the WARC (and this is also fewer commands to type!): |
|
@ikreymer That was actually what I tried first: just do If it helps: |
|
For what it's worth, I don't see anything in Webrecorder Player either. However, warcat is able to extract the content. |
|
Also, I was able to reproduce the issue using wget 1.19.4 compiled from source on a Debian VM. However, there seems to be no issue when using wget 1.18. Moreover, I believe I have determined the source of the problem: Wget 1.18 gives WARC output like: However, 1.19.4 outputs the following: So you can actually access the page, but you have to type The wget maintainers seem to have noticed this: http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html ("Validity of angle brackets around WARC-Target-URI value") |
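To check whether a given WARC is affected, a rough scan for bracketed `WARC-Target-URI` values can help. The following is a minimal sketch using only the Python standard library, not part of pywb or warcio; the function name `find_bracketed_target_uris` is hypothetical, and the line-based scan is a heuristic that could also match a payload line that happens to start with the header name.

```python
import gzip


def find_bracketed_target_uris(path):
    """Return WARC-Target-URI values in a .warc.gz that are wrapped in <...>.

    Hypothetical helper for illustration only; it scans line by line rather
    than parsing records, so treat the result as a hint, not a validation.
    """
    prefix = b"WARC-Target-URI:"
    hits = []
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are typically laid out (one member per record).
    with gzip.open(path, "rb") as fh:
        for line in fh:
            if line.startswith(prefix):
                value = line[len(prefix):].strip().decode("utf-8", "replace")
                if value.startswith("<") and value.endswith(">"):
                    hits.append(value)
    return hits
```

An empty list suggests the file does not carry the bracketed form; any entries returned are the raw header values as written by the tool that produced the WARC.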
|
Nice find, thanks for looking into this! It might make sense to strip the brackets now since there are WARCs being created with them in place. Would you mind attaching a sample WARC with brackets to this issue? |
|
Here's an example WARC with brackets: example.com.warc.gz |
… angle brackets, which does not follow the WARC spec (see http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html for more details). This causes warcio and projects that depend on warcio (e.g. CDX index generation, webrecorder/pywb#294) to use the incorrect value for the WARC-Target-URI field. Added `_ensure_target_uri_format` method to `warcio.recorderloader.ArcWarcRecordLoader` that detects whether the value of the WARC-Target-URI field is affected by the Wget bug and corrects it. Updated `test_archiveiterator.py` to account for this change and added a small WARC (example-wget-bad-target-uir.warc.gz) that is affected by this bug.
…rget-URI: <uri> Wget 1.19 incorrectly produces WARCs with WARC-Target-URI field values wrapped in angle brackets (see http://lists.gnu.org/archive/html/bug-wget/2017-11/msg00050.html for more details). This caused warcio and downstream projects (e.g. CDX indexing in webrecorder/pywb#294) to use an incorrect value for the WARC-Target-URI field. Added `_ensure_target_uri_format` method to `warcio.recorderloader.ArcWarcRecordLoader` that detects whether the value of the WARC-Target-URI field is affected by the Wget bug and corrects it. Updated `test_archiveiterator.py` to account for this change and added a small WARC (example-wget-bad-target-uir.warc.gz) that is affected by this bug.
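The kind of normalization described in the commit message can be sketched as follows. This is an illustration of the idea only, not the actual body of `_ensure_target_uri_format` in warcio; the function name `strip_target_uri_brackets` is hypothetical.

```python
def strip_target_uri_brackets(value):
    """Return a WARC-Target-URI value without a surrounding <...> pair.

    Hypothetical helper: values that are not wrapped in angle brackets
    are returned unchanged apart from trimmed whitespace.
    """
    value = value.strip()
    # Only strip when both brackets are present, so a stray "<" or ">"
    # inside an otherwise normal value is left alone.
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value
```

Requiring both the opening and closing bracket keeps the change conservative: well-formed values pass through untouched, so the same code path can handle WARCs from any writer.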
|
warcio 1.6.x now supports reading the wget WARCs with WARC-Target-URI wrapped in angle brackets (this behavior has been in wget for a while and is quite widespread). The change is in warcio, which can be upgraded independently of pywb with The above example WARC and other wget WARCs will then be read correctly by pywb. |
Commands to reproduce:
Then go to localhost:8080/example/, type in example.com, click the record, and you should get an error like the following (or at least I do):