This repository has been archived by the owner on Sep 17, 2020. It is now read-only.

Cannot read warc create by wget 1.19.4 #57

Closed
joshuaavalon opened this issue Mar 22, 2018 · 6 comments

Comments

@joshuaavalon

Version 1.0.9 64 bit
OS: Windows

warc.sh

USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
SAVE_HOST="$1"
DATE=`date +%Y-%m-%d`
WARC_NAME="$SAVE_HOST-$DATE"
 
wget \
    -e robots=off --mirror --page-requisites \
    --waitretry 5 --timeout 60 --tries 5 --wait 1 \
    --warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
    -U "$USER_AGENT" "$SAVE_HOST"

sh warc.sh example.com

I have tested with wget 1.17.1 and wget 1.19.4. The program can only read the WARC created by 1.17.1. It shows a blank page for the WARC created by 1.19.4.

@vitorio

vitorio commented May 9, 2018

wget 1.19 writes WARC-Target-URI headers with angle brackets around the URL, which breaks some software. Maybe that's what's happening here? If you're able, perhaps you could try rewriting the WARC without those brackets and see if that fixes it?

Here's some example Python code to do that using warcio:

>>> from warcio.archiveiterator import ArchiveIterator
>>> from warcio.warcwriter import WARCWriter
>>> output = open('lfes-not-in-ia-1.warc.gz', 'wb')
>>> writer = WARCWriter(output, gzip=True)
>>> with open('brackets.lfes-not-in-ia-1.warc.gz', 'rb') as stream:
...     for record in ArchiveIterator(stream):
...         if 'WARC-Target-URI' in record.rec_headers:
...             record.rec_headers['WARC-Target-URI'] = record.rec_headers['WARC-Target-URI'].lstrip('<').rstrip('>')
...         writer.write_record(record)
...
>>> output.close()
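
For anyone who would rather save this as a file than type it into the interpreter, the same idea can be sketched as a small script. Treat it as a sketch under assumptions: the names `strip_angle_brackets` and `rewrite_warc` are made up for illustration, it uses warcio's `get_header` and `replace_header` rather than the dict-style access in the session above, and it removes the brackets only when both the leading `<` and the trailing `>` are present.

```python
def strip_angle_brackets(uri):
    """Return uri without its surrounding '<' and '>' (hypothetical helper).

    The value is returned unchanged unless both brackets are present.
    """
    if uri.startswith('<') and uri.endswith('>'):
        return uri[1:-1]
    return uri


def rewrite_warc(src_path, dst_path):
    """Copy src_path to dst_path, un-bracketing each WARC-Target-URI."""
    # Imported here so the helper above can be tried without warcio installed.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    with open(src_path, 'rb') as stream, open(dst_path, 'wb') as output:
        # gzip=True writes a compressed WARC, so dst_path should end in .warc.gz
        writer = WARCWriter(output, gzip=True)
        for record in ArchiveIterator(stream):
            uri = record.rec_headers.get_header('WARC-Target-URI')
            if uri is not None:
                record.rec_headers.replace_header(
                    'WARC-Target-URI', strip_angle_brackets(uri))
            writer.write_record(record)
```

After `pip install warcio`, calling `rewrite_warc('brackets.lfes-not-in-ia-1.warc.gz', 'lfes-not-in-ia-1.warc.gz')` should produce the same kind of output file as the interactive session.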

@joshuaavalon
Author

@vitorio Yes it works.

@nvanderperren

I have the same issue. I'm not experienced enough to rewrite the WARC myself. Can I use the Python code? (And how do I do this?)

@ikreymer
Member

ikreymer commented Oct 19, 2018

Good news! We've recently added support for wget 1.19.4 WARCs in our WARC reader library (webrecorder/warcio#42) so these types of WARCs should "just work" without any changes.

The next update release of Webrecorder Player will include this fix.
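
To tell whether a particular WARC is affected at all, a rough check is to scan it for bracketed WARC-Target-URI header lines. The following is a minimal standard-library sketch, not part of warcio or Webrecorder Player: the function name `count_bracketed_target_uris` is made up for illustration, and the scan is line-based rather than a real WARC parse, so a payload that happens to contain a matching line would be counted too.

```python
import gzip
import re

# Matches a header line such as: WARC-Target-URI: <http://example.com/>
BRACKETED = re.compile(rb'^WARC-Target-URI:\s*<.*>\s*$', re.IGNORECASE)


def count_bracketed_target_uris(path):
    """Count lines that look like bracketed WARC-Target-URI headers."""
    # Read .gz files through gzip; treat anything else as an uncompressed WARC.
    opener = gzip.open if path.endswith('.gz') else open
    count = 0
    with opener(path, 'rb') as stream:
        for line in stream:
            if BRACKETED.match(line):
                count += 1
    return count
```

A result of 0 suggests the file has no bracketed target URIs and the blank page has some other cause.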

@nvanderperren

This is great news! Thank you!

@ikreymer
Member

We've just released 1.6.0 (https://github.com/webrecorder/webrecorderplayer-electron/releases/tag/v1.6.0) and you should now be able to open wget 1.19.4+ WARCs.
