Unpacking WARCs with Warcat
==========================

This tutorial shows how to unpack a WARC file into a directory using [Warcat](https://pypi.org/project/Warcat/).

You can download this notebook, or run it on Binder using this link: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ukwa/opendata/beta?filepath=content%2Fnotebooks%2Funpacking-warcs-with-warcat.ipynb)

Installing Warcat
------------------------

We're using command-line calls rather than Python code here, so all the commands are prefixed with `!` (the notebook convention for running a shell command from a code cell).

First, we need to ensure the library is installed:

In [1]:
!pip install warcat

Collecting warcat
  Downloading https://files.pythonhosted.org/packages/51/78/3abb1702eae1ac1dec44a0d1d366ff10394679894b7a2acc6b6efd0db898/Warcat-2.2.5.tar.gz (57kB)
Collecting isodate (from warcat)
  Downloading https://files.pythonhosted.org/packages/9b/9f/b36f7774ff5ea8e428fdcfc4bb332c39ee5b9362ddd3d40d9516a55221b2/isodate-0.6.0-py2.py3-none-any.whl (45kB)
Building wheels for collected packages: warcat
  Running setup.py bdist_wheel for warcat ... done
  Stored in directory: /home/anj/.cache/pip/wheels/66/3a/66/b507615861da008d33d8a8db9d54a032dd9bfbb0137baac73c
Successfully built warcat
Installing collected packages: isodate, warcat
Successfully installed isodate-0.6.0 warcat-2.2.5
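
If you would rather confirm the install from Python than read the pip log, one option is to ask the import system whether the module can be found. This is only a minimal sketch: the helper name `is_installed` is made up for illustration, and the module name `warcat` is taken from the `python -m warcat` call used in this notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


# Prints True or False depending on whether the pip install above succeeded.
print("warcat installed:", is_installed("warcat"))
```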


Now we can use it:

In [2]:
!python -m warcat -h

usage: __main__.py [-h] [--version] [--output FILE] [--gzip]
                   [--force-read-gzip] [--verbose] [--record RECORD]
                   [--preserve-block] [--output-dir OUTPUT_DIR] [--progress]
                   [--keep-going]
                   command [file [file ...]]

Tool for handling Web ARChive (WARC) files.

positional arguments:
  command               A command to run. Use "help" for a list.
  file                  Filename of file to be read.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --output FILE, -o FILE
                        Output to FILE instead of standard out
  --gzip, -z            When outputting a file, use gzip compression
  --force-read-gzip     Instead of guessing by filename, force reading
                        archives as gzip compressed
  --verbose             Increase verbosity. Can be used more than once.
  --record RECO

Unpacking an example WARC
-------------------------------------------

The `warcs` folder contains a test WARC:

In [3]:
!ls warcs

flashfrozen-jwat-recompressed.warc.gz


We can unpack it into individual files, like this:

In [5]:
!python -m warcat extract warcs/flashfrozen-jwat-recompressed.warc.gz --output-dir unpacked-warcs --progress

Done. 82 records processed.
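
Outside a notebook, the same extraction can be driven from a plain Python script. The sketch below simply wraps the command shown above using `subprocess`; the function names `build_extract_command` and `extract_warc` are hypothetical helpers, and only the `extract` command and the `--output-dir` and `--progress` options from the help text are used.

```python
import subprocess
import sys


def build_extract_command(warc_path, output_dir, progress=True):
    """Build the argument list for `python -m warcat extract ...`."""
    cmd = [sys.executable, "-m", "warcat", "extract", warc_path,
           "--output-dir", output_dir]
    if progress:
        cmd.append("--progress")
    return cmd


def extract_warc(warc_path, output_dir, progress=True):
    """Run the extraction; raises CalledProcessError if warcat exits non-zero."""
    return subprocess.run(
        build_extract_command(warc_path, output_dir, progress), check=True
    )


# Example call (assumes warcat is installed and the file exists):
# extract_warc("warcs/flashfrozen-jwat-recompressed.warc.gz", "unpacked-warcs")
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems if a WARC filename contains spaces.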


So, let's see what we've got!

In [6]:
!find unpacked-warcs -type f

unpacked-warcs/bits.wikimedia.org/images/wikimedia-button.png
unpacked-warcs/bits.wikimedia.org/static-1.21wmf6/skins/common/images/magnify-clip.png
unpacked-warcs/bits.wikimedia.org/static-1.21wmf6/skins/common/images/poweredby_mediawiki_88x31.png
unpacked-warcs/bits.wikimedia.org/static-1.21wmf6/skins/vector/images/search-ltr.png_303-4
unpacked-warcs/bits.wikimedia.org/geoiplookup
unpacked-warcs/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=ext.gadget.DRN-wizard,ReferenceTooltips,charinsert,teahouse%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint,shared%7Cmw.PopU_e0af80
unpacked-warcs/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=jquery.ui.button,core,dialog,draggable,mouse,position,resizable,widget&skin=vector&version=20121210T190559Z&_
unpacked-warcs/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=ext.Experiments.experiments,lib%7Cext.UserBuckets,eventLogging,markAsHelpful,postEdit%7Cext.articleFee

If we look closely at the full output, we can see the main `unpacked-warcs/en.wikipedia.org/wiki/Mona_Lisa` file, which captures the contents of https://en.wikipedia.org/wiki/Mona_Lisa, along with all the other resources required by that page.

Because we've unpacked the HTTP responses into files, we don't have the original HTTP headers, like the `Content-Type`. This means it's not obvious what the different types of files are. While it's better to use that original metadata, it's also possible to deal with this by running the files through a format identification tool, like [`file`](https://en.wikipedia.org/wiki/File_(command)).
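
To illustrate what "using the original metadata" could look like, here is a rough, standard-library-only sketch that reads the `Content-Type` of each response directly from the WARC rather than from the unpacked files. It does not use Warcat's own Python API. It assumes a gzip-compressed WARC with simple, unfolded header lines, and the function names `read_headers` and `iter_response_types` are invented for this example. For real work, a proper WARC library would be the safer choice.

```python
import gzip
import io


def read_headers(stream):
    """Read 'Name: value' lines up to a blank line; keys are lower-cased."""
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            return headers
        name, _, value = line.decode("utf-8", "replace").partition(":")
        headers[name.strip().lower()] = value.strip()


def iter_response_types(stream):
    """Yield (target URI, HTTP Content-Type) for each 'response' record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of the WARC
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        warc_headers = read_headers(stream)
        # The record block is exactly Content-Length bytes long.
        block = stream.read(int(warc_headers.get("content-length", 0)))
        if warc_headers.get("warc-type") != "response":
            continue
        http = io.BytesIO(block)
        http.readline()  # skip the HTTP status line
        http_headers = read_headers(http)
        yield warc_headers.get("warc-target-uri"), http_headers.get("content-type")


# Example use (assumes the test file from this tutorial is present):
# with gzip.open("warcs/flashfrozen-jwat-recompressed.warc.gz", "rb") as f:
#     for uri, content_type in iter_response_types(f):
#         print(content_type, uri)
```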

If you have the `file` command available, you should be able to generate a textual summary like this:

```
find unpacked-warcs -type f -exec file {} \;
```

or a summary in terms of MIME types like this:

```
find unpacked-warcs -type f -exec file --mime {} \;
```
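
If `file` is not available, a very rough approximation can be sketched in Python by checking a handful of well-known file signatures ("magic numbers"). This is an illustration of the idea rather than a replacement for `file`, which recognises far more formats; the signature table and the names `guess_type` and `summarise` are assumptions made for this example.

```python
import os
from collections import Counter

# Only a few common signatures; anything else is reported as generic binary data.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
]


def guess_type(path):
    """Guess a MIME type from the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(512)
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    lowered = head.lstrip().lower()
    if lowered.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def summarise(root):
    """Count the files under `root` by guessed MIME type."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts[guess_type(os.path.join(dirpath, name))] += 1
    return counts


# Example use: summarise("unpacked-warcs").most_common()
```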
