Unpacking a WARC with Warcat
--------------------------------

In this notebook, you will see how to extract web resources from WARC files as individual files.

We will use the [Warcat](https://pypi.org/project/Warcat/) Python package, which provides tools for managing WARC files.

First, we need to install Warcat.

## Installation

You can install Warcat with the following command. As is standard in Jupyter notebooks, shell commands are prefixed with a `!`. This notebook uses only command-line programs, so every command cell starts that way.


In [2]:
!pip install Warcat

Collecting Warcat
  Using cached https://files.pythonhosted.org/packages/51/78/3abb1702eae1ac1dec44a0d1d366ff10394679894b7a2acc6b6efd0db898/Warcat-2.2.5.tar.gz
Collecting isodate (from Warcat)
  Using cached https://files.pythonhosted.org/packages/9b/9f/b36f7774ff5ea8e428fdcfc4bb332c39ee5b9362ddd3d40d9516a55221b2/isodate-0.6.0-py2.py3-none-any.whl
Building wheels for collected packages: Warcat
  Building wheel for Warcat (setup.py) ... done
  Created wheel for Warcat: filename=Warcat-2.2.5-cp37-none-any.whl size=34778 sha256=206557e1a2816979e0f9e17e817bc71b40e2099e64e383525b84d19bf2594b71
  Stored in directory: /home/anj/.cache/pip/wheels/66/3a/66/b507615861da008d33d8a8db9d54a032dd9bfbb0137baac73c
Successfully built Warcat
Installing collected packages: isodate, Warcat
Successfully installed Warcat-2.2.5 isodate-0.6.0


### Check it's working

Once it's installed, you can check it's working by looking at the command-line options:

In [19]:
!python -m warcat -h

usage: __main__.py [-h] [--version] [--output FILE] [--gzip]
                   [--force-read-gzip] [--verbose] [--record RECORD]
                   [--preserve-block] [--output-dir OUTPUT_DIR] [--progress]
                   [--keep-going]
                   command [file [file ...]]

Tool for handling Web ARChive (WARC) files.

positional arguments:
  command               A command to run. Use "help" for a list.
  file                  Filename of file to be read.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --output FILE, -o FILE
                        Output to FILE instead of standard out
  --gzip, -z            When outputting a file, use gzip compression
  --force-read-gzip     Instead of guessing by filename, force reading
                        archives as gzip compressed
  --verbose             Increase verbosity. Can be used more than once.
  --record RECO
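
If you would rather confirm the installation from Python than from the shell, a minimal sketch using only the standard library follows. The helper name `is_importable` is made up for this illustration, and the check only tells you that a module named `warcat` can be found, not that the command-line tool behaves correctly.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no such top-level module is found.
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package installs an importable module called "warcat",
# as suggested by the `python -m warcat` invocation above.
print("warcat importable:", is_importable("warcat"))
```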

## Extraction 

We need a WARC to experiment with. This notebook comes with a suitable WARC that contains a copy of [a page from Wikipedia](https://en.wikipedia.org/wiki/Mona_Lisa) from 2013. 

Now we can go on to inspect and unpack the WARC file. The following command will unpack the test WARC into a folder called `unpacked-warc`.


In [4]:
!python -m warcat extract example-warcs/flashfrozen-jwat-recompressed.warc.gz --output-dir unpacked-warc

This works silently. If you want more feedback, you could try:

    python -m warcat extract [input.warc.gz] --output-dir unpacked-warc --progress

or

    python -m warcat extract [input.warc.gz] --output-dir unpacked-warc --verbose
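
If you want to drive the same extraction from Python code rather than a `!` cell, one possible sketch is to assemble the command line shown above and hand it to `subprocess`. The function names `build_extract_command` and `extract_warc` are hypothetical helpers for this example; only the `extract` command and the `--output-dir`, `--progress` and `--verbose` flags come from the help text above.

```python
import subprocess
import sys


def build_extract_command(warc_path, output_dir, progress=False, verbose=False):
    """Assemble the argument list for `python -m warcat extract ...`."""
    command = [sys.executable, "-m", "warcat", "extract", warc_path,
               "--output-dir", output_dir]
    if progress:
        command.append("--progress")
    if verbose:
        command.append("--verbose")
    return command


def extract_warc(warc_path, output_dir, **options):
    """Run the extraction in a child process.

    check=True makes subprocess raise CalledProcessError if the
    child process exits with a non-zero status.
    """
    subprocess.run(build_extract_command(warc_path, output_dir, **options),
                   check=True)
```

For example, `extract_warc("example-warcs/flashfrozen-jwat-recompressed.warc.gz", "unpacked-warc", progress=True)` would run the same extraction as the cell above with progress reporting turned on. Passing the arguments as a list avoids shell quoting problems with unusual file names.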
   
Once unpacked, we can easily list all the files:

In [12]:
!find unpacked-warc -type f

unpacked-warc/bits.wikimedia.org/images/wikimedia-button.png
unpacked-warc/bits.wikimedia.org/static-1.21wmf6/skins/common/images/magnify-clip.png
unpacked-warc/bits.wikimedia.org/static-1.21wmf6/skins/common/images/poweredby_mediawiki_88x31.png
unpacked-warc/bits.wikimedia.org/static-1.21wmf6/skins/vector/images/search-ltr.png_303-4
unpacked-warc/bits.wikimedia.org/geoiplookup
unpacked-warc/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=ext.gadget.DRN-wizard,ReferenceTooltips,charinsert,teahouse%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint,shared%7Cmw.PopU_e0af80
unpacked-warc/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=jquery.ui.button,core,dialog,draggable,mouse,position,resizable,widget&skin=vector&version=20121210T190559Z&_
unpacked-warc/bits.wikimedia.org/en.wikipedia.org/load.php_debug=false&lang=en&modules=ext.Experiments.experiments,lib%7Cext.UserBuckets,eventLogging,markAsHelpful,postEdit%7Cext.articleFeedback.st

You can also go further on the command line. For example, if the `file` utility is installed, you could run:

    find unpacked-warc -type f -exec file {} \;

This would examine every file and report the format it appears to be.
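
For a rough overview without the `file` utility, the sketch below counts the unpacked files by a MIME type guessed from each file name. This is a different and weaker technique than `file`, which inspects file contents: names without a recognised extension (such as `geoiplookup`, or names with a suffix appended after `.png`) end up as `unknown`. The function name `summarise_types` is made up for this example.

```python
import mimetypes
from collections import Counter
from pathlib import Path


def summarise_types(root):
    """Count files under `root` by the MIME type guessed from their names."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # guess_type looks only at the file name, never at the contents.
            guessed, _encoding = mimetypes.guess_type(path.name)
            counts[guessed or "unknown"] += 1
    return counts
```

Calling `summarise_types("unpacked-warc").most_common()` would then give a list of `(type, count)` pairs, most frequent first.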


## Verification

The Warcat package also provides a `verify` command for checking WARC files:

In [18]:
!python -m warcat verify example-warcs/flashfrozen-jwat-recompressed.warc.gz

Record <urn:uuid:2A7045F9-87FF-4EBA-88B3-D801A17A0FBE> failed validation
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/warcat/tool.py", line 282, in action
    action(record)
  File "/opt/conda/lib/python3.7/site-packages/warcat/tool.py", line 304, in verify_id_uniqueness
    raise VerifyProblem('Duplicate record ID.')
warcat.tool.VerifyProblem: ('Duplicate record ID.', None, True)
Validation failed. Problems: 1.


Interestingly, this revealed a problem with the WARC record IDs in this file: the verifier found a duplicate record ID.
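
To see which record IDs are repeated, a rough cross-check can be done with the standard library alone. The sketch below is not how Warcat performs its verification. It simply scans every line of the file for a `WARC-Record-ID` header and counts the values, so a payload that happens to contain such a line would be counted too. The function name `find_duplicate_record_ids` is made up for this example.

```python
import gzip
from collections import Counter


def find_duplicate_record_ids(warc_path):
    """Return {record_id: count} for WARC-Record-ID values seen more than once."""
    # Assumption: compressed WARCs are recognised by a ".gz" file name suffix.
    opener = gzip.open if warc_path.endswith(".gz") else open
    counts = Counter()
    with opener(warc_path, "rb") as stream:
        for line in stream:
            # Split "Name: value" on the first colon only, because the
            # value itself (e.g. <urn:uuid:...>) contains colons.
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"warc-record-id":
                counts[value.strip().decode("utf-8", "replace")] += 1
    return {record_id: n for record_id, n in counts.items() if n > 1}
```

Running `find_duplicate_record_ids("example-warcs/flashfrozen-jwat-recompressed.warc.gz")` should list any ID that occurs more than once, which can then be compared with the record named in the verification output above.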
