Opening Up WARCs with warcio
===========================

This notebook contains examples of using the [`warcio`](https://github.com/webrecorder/warcio) Python library to open up and analyse WARC files. The library is under active development, has few dependencies, supports Python 3, and is pretty straightforward to use. There are quite a few [other WARC implementations](https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem), but `warcio` is a good one to start with.

First, we need to make sure it's installed!

In [3]:
!pip install warcio

Collecting warcio
  Downloading https://files.pythonhosted.org/packages/90/c4/86bc02bc3bc33c34ab24e52af8a1c34eb6e03e7cd5b3904057ebcea311da/warcio-1.7.1-py2.py3-none-any.whl (41kB)
Installing collected packages: warcio
Successfully installed warcio-1.7.1


To see what we can do, we need an example WARC to work with. To keep things simple, we'll use a WARC file generated by an attempt to capture a single, fairly complicated web page: the Wikipedia home page during the [SOPA Blackout](https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA). See the [test suite entry for this capture](https://github.com/ukwa/webarchive-test-suite/tree/master/wikipedia-sopa-blackout-2012) for more details.

You can override this with your own file if you're running this notebook locally.

In [25]:
warc_file = 'example-warcs/sopa-wikipedia-homepage.warc.gz'

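If you'd rather not edit the cell each time, one option is to read the path from an environment variable and fall back to the bundled example. This is only a sketch: the `WARC_FILE` variable name and the `pick_warc_file` helper are made up for illustration and are not part of `warcio` or this notebook.

```python
import os

# Same default path as used in the cell above.
DEFAULT_WARC = 'example-warcs/sopa-wikipedia-homepage.warc.gz'


def pick_warc_file(default=DEFAULT_WARC, env=os.environ):
    """Return the path from the (hypothetical) WARC_FILE variable, or the default."""
    # An unset or empty variable falls through to the default path.
    return env.get('WARC_FILE') or default


warc_file = pick_warc_file()
print(warc_file)
```

Starting Jupyter with something like `WARC_FILE=/path/to/your.warc.gz jupyter notebook` would then point the rest of the notebook at your own file.
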
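Now we can use the `warcio` library to open up this file and iterate through the records. For each record we print its offset in the file, its type, its content length, its content type and its format. For request and response records we also print the WARC headers and the HTTP headers.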

In [26]:
from warcio.archiveiterator import ArchiveIterator

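with open(warc_file, 'rb') as stream:
    # check_digests=True asks warcio to verify the digests recorded in each record
    iterator = ArchiveIterator(stream, check_digests=True)
    for record in iterator:
        # Offset of this record in the file, then its type, Content-Length, content type and format
        print(iterator.offset, record.rec_type, record.length, record.content_type, record.format)
        # Print the WARC headers and the HTTP headers for request and response records
        if record.rec_type in ('request', 'response'):
            print(record.rec_headers)
            print(record.http_headers)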

0 warcinfo 266 application/warc-fields warc
437 request 751 application/http;msgtype=request warc
WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:2C66DE4B-7C2A-4A18-A139-A33FC97FE1F4>
WARC-Date: 2012-01-18T14:31:20Z
Content-Length: 751
Content-Type: application/http;msgtype=request
WARC-Block-Digest: sha1:ES236IRD3H6BC4DFUBNVXATRIFFF6RTO
WARC-Target-URI: http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.UserBuckets%2CmarkAsHelpful%7Cext.UserBuckets.AccountCreationUserBucket%7Cext.articleFeedback.startup%7Cext.articleFeedbackv5.startup%7Cext.gadget.wmfFR2011Style%7Cjquery.autoEllipsis%2CcheckboxShiftClick%2CclickTracking%2CcollapsibleTabs%2Ccookie%2CdelayedBind%2ChighlightText%2Cjson%2CmakeCollapsible%2CmessageBox%2CmwPrototypes%2Cplaceholder%2Csuggestions%2CtabIndex%7Cmediawiki.language%2Cuser%2Cutil%7Cmediawiki.legacy.ajax%2Cmwsuggest%2Cwikibits%7Cmediawiki.page.ready&skin=vector&version=20120118T020454Z&*
WARC-Warcinfo-ID: <urn:uuid:E02

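The first number printed for each record above is its starting offset in the compressed file, while `record.length` is the `Content-Length` of the uncompressed record block. The two are not directly comparable: the first request record reports a length of 751, but the next record may start fewer than 751 bytes later, because each record is compressed on disk. To get the on-disk size of a record, we can take the gap between its offset and the offset of the record that follows. Here is a minimal, pure-Python sketch of that arithmetic, using the first five record offsets from this example file. The `record_lengths` helper is a name chosen for illustration, not something `warcio` provides.

```python
def record_lengths(offsets, end_offset):
    """Return the on-disk length of each record.

    offsets:    starting offsets of consecutive records, in file order
    end_offset: offset just past the last record in `offsets`
    """
    boundaries = list(offsets) + [end_offset]
    # Each length is the distance from one boundary to the next.
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


# The first four records start at these offsets; the fifth starts at 7407.
offsets = [0, 437, 1165, 1682]
lengths = record_lengths(offsets, end_offset=7407)
print(lengths)  # [437, 728, 517, 5725]
```

The next cell does the same calculation directly against the `ArchiveIterator`, collecting the type, offset and on-disk length of every record in the file.
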
In [20]:
with open(warc_file, 'rb') as stream:
    iterator = ArchiveIterator(stream, check_digests=True)
    rec_info = []
    for record in iterator:
        # The previous record's on-disk length is the gap up to this record's offset
        if rec_info:
            rec_info[-1]['length'] = iterator.offset - rec_info[-1]['offset']
        rec_info.append({'type': record.rec_type, 'offset': iterator.offset})

    # And add in the last length calculation (skipped if the file held no records):
    if rec_info:
        rec_info[-1]['length'] = iterator.offset - rec_info[-1]['offset']

    print(len(rec_info), rec_info)

164 [{'type': 'warcinfo', 'offset': 0, 'length': 437}, {'type': 'request', 'offset': 437, 'length': 728}, {'type': 'request', 'offset': 1165, 'length': 517}, {'type': 'response', 'offset': 1682, 'length': 5725}, {'type': 'request', 'offset': 7407, 'length': 503}, {'type': 'response', 'offset': 7910, 'length': 22962}, {'type': 'request', 'offset': 30872, 'length': 411}, {'type': 'response', 'offset': 31283, 'length': 639}, {'type': 'request', 'offset': 31922, 'length': 548}, {'type': 'response', 'offset': 32470, 'length': 1393}, {'type': 'request', 'offset': 33863, 'length': 549}, {'type': 'response', 'offset': 34412, 'length': 3417}, {'type': 'request', 'offset': 37829, 'length': 549}, {'type': 'response', 'offset': 38378, 'length': 3549}, {'type': 'request', 'offset': 41927, 'length': 551}, {'type': 'response', 'offset': 42478, 'length': 886}, {'type': 'request', 'offset': 43364, 'length': 556}, {'type': 'response', 'offset': 43920, 'length': 1667}, {'type': 'request', 'offset': 45587