warc extractor

recrm edited this page Jan 19, 2015 · 1 revision

warc-extractor.py

Warc-extractor.py is a tool for filtering and extracting files from warc archive files. The script serves three purposes:

  • Provide basic information as to what a collection of warc files contain.
  • Create new warc files containing only filtered elements of old warc files.
  • Dump the file contents of a warc file to disk.

Basic usage


python3 warc-extractor.py

Running the program without any arguments scans all of the warc files in the current directory and outputs some basic information about those files.

Warc-extractor.py accepts an unlimited number of filter options. A filter option controls which warc entries the script scans.

python3 warc-extractor.py warc-type:request

In the above example the script will output basic information about all of the warc entries where the warc header 'warc-type' is set to request (case insensitive). Substrings are allowed in the second part, so 'warc-type:requ' would be equivalent, while 'warc-type:re' would return 'request' and 'response' entries, along with any other type whose name contains 're' (such as 'resource' or 'revisit').
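The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: the function name matches_filter and the plain-dict representation of a warc entry's headers are assumptions made for the example.

```python
def matches_filter(headers, filter_text):
    """Return True if the named header contains the filter value.

    Hypothetical helper: header names and values are compared
    case-insensitively, and the value only needs to be a substring.
    """
    name, _, value = filter_text.partition(":")
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    header_value = lowered.get(name.lower())
    return header_value is not None and value.lower() in header_value


request_entry = {"WARC-Type": "request"}
response_entry = {"WARC-Type": "response"}

print(matches_filter(request_entry, "warc-type:requ"))   # True
print(matches_filter(response_entry, "warc-type:requ"))  # False
print(matches_filter(response_entry, "warc-type:re"))    # True
```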

Many warc entries also contain HTTP headers, which can be filtered on by prefixing the header name with 'http:'.

python3 warc-extractor.py http:content-type:pdf

The above script finds all warc entries that contain PDFs. Specifically, it would filter out any warc entry that does not contain an HTTP header 'content-type' that contains the string 'pdf'. (Note: inputting any HTTP filter implicitly filters out any warc entry that does not contain an HTTP request or response.)

There is also some information found in an HTTP object's version line. This information can be accessed via some special operators: error, command, path, status, version. The most important of these is error.

python3 warc-extractor.py http:error:200

The above script would filter out any HTTP responses that did not return error code 200, as well as implicitly remove HTTP requests which do not contain error codes.
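To make the special operators more concrete, here is a rough sketch of how the first line of an HTTP object could be split into those fields. It is not taken from the script. The function name parse_first_line is hypothetical, and the mapping of 'status' to the reason phrase (for example "OK") is an assumption; only 'error' being the numeric code is stated above.

```python
def parse_first_line(line):
    """Split the first line of an HTTP message into named fields.

    Hypothetical helper. Responses yield version/error/status,
    requests yield command/path/version.
    """
    parts = line.strip().split(" ", 2)
    if parts[0].upper().startswith("HTTP/"):
        # Response line, e.g. "HTTP/1.1 200 OK"
        fields = {"version": parts[0]}
        if len(parts) > 1:
            fields["error"] = parts[1]   # numeric code, e.g. "200"
        if len(parts) > 2:
            fields["status"] = parts[2]  # assumed: the reason phrase
        return fields
    # Request line, e.g. "GET /index.html HTTP/1.1"
    return dict(zip(("command", "path", "version"), parts))


print(parse_first_line("HTTP/1.1 200 OK"))
print(parse_first_line("GET /index.html HTTP/1.1"))
```

A request line produces no 'error' field at all, which is why a filter such as http:error:200 also drops every HTTP request.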

Negative searches are also allowed.

python3 warc-extractor.py \!http:content-type:pdf

The above script would return all warc entries that do not contain PDFs. (Note: the '\' character is required because '!' is a reserved character in bash.)
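A small sketch of how a leading '!' could invert a filter is shown below. It is an illustration under assumptions, not the script's code: the names header_contains and passes are hypothetical, headers are modelled as a plain dict, and requiring every supplied filter to hold at once is assumed rather than documented. Bash strips the backslash, so the program itself receives the argument as '!warc-type:request'.

```python
def header_contains(headers, name, value):
    """Case-insensitive substring test on one header (hypothetical helper)."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    return value.lower() in lowered.get(name.lower(), "")


def passes(headers, filters):
    """Return True if the entry satisfies every filter (assumed AND logic)."""
    for text in filters:
        negated = text.startswith("!")
        name, _, value = text.lstrip("!").partition(":")
        # A normal filter must match; a negated filter must not match.
        if header_contains(headers, name, value) == negated:
            return False
    return True


entry = {"WARC-Type": "response", "Content-Type": "text/dns"}
print(passes(entry, ["!warc-type:request"]))                       # True
print(passes(entry, ["warc-type:response", "!content-type:dns"]))  # False
```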

Once you have verified that the script is only grabbing the warc entries you need, the contents of the found warc entries can be dumped in two different ways.

python3 warc-extractor.py some:filter -dump warc

The above script would create a new warc file containing only the filtered elements.

python3 warc-extractor.py some:filter -dump content

The above script would attempt to extract the contents of the filtered entries. (Note: the -dump content option implicitly adds "warc-type:response" and "content-type:application/http" to the filters, because warc entries that do not match these filters do not contain file-like objects.)
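For readers wondering what "extracting the contents" involves, the sketch below separates the HTTP headers from the file body of a single response. It is a simplified illustration, not the script's implementation: the function name split_http_response is hypothetical, and it assumes the payload is a raw HTTP response without chunked transfer encoding or compression.

```python
def split_http_response(payload: bytes):
    """Split a raw HTTP response into (status line, headers, body).

    Hypothetical helper; assumes an unchunked, uncompressed payload.
    """
    # Headers end at the first blank line; everything after is the file.
    head, _, body = payload.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


raw = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4 example"
status_line, headers, body = split_http_response(raw)
print(status_line)
print(headers["content-type"])
print(len(body), "bytes of file content")
```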


Other Arguments


  • -h

    • Outputs the command line help screen.
    • example: python3 warc-extractor.py -h
  • -string

    • Limits which .warc files the extractor looks in.
    • example: python3 warc-extractor.py -string archive
    • (Will only look in .warc files whose filenames contain the string "archive".)
  • -path

    • Changes which folder the extractor looks in for .warc files.
    • example: python3 warc-extractor.py -path /path/to/folder
    • (Looks in folder /path/to/folder to find warc files.)
  • -output_path

    • Changes the folder dumped files are placed in.
    • example: python3 warc-extractor.py -output_path /path/to/folder
    • (All dumped files will be placed in /path/to/folder)
  • -output

    • Changes the name of the warc file the extractor outputs to.
    • example: python3 warc-extractor.py -output new-warc.warc
    • (new-warc.warc will be created instead of output.warc)
  • -dump

    • Triggers output of data. Defaults to no output.
    • Choices are 'content' and 'warc'.
    • 'warc' will output all warc entries that remain after filter to 'output.warc'.
    • 'content' will output the saved file in all warc entries that remain after filter.
    • example: python3 warc-extractor.py -dump content
  • -silence

    • Boolean flag. Silences collection of index data and prevents the script from writing to the terminal.
  • -error

    • Debugging command, see troubleshooting below.
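The option list above could be expressed with Python's argparse roughly as follows. This is a hypothetical reconstruction for illustration only, not the script's real parser. The option names, the 'output.warc' default and the current-directory default come from this page; the function name build_parser, the help texts and the remaining defaults are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction; not copied from warc-extractor.py.
    parser = argparse.ArgumentParser(description="Filter and extract warc entries.")
    parser.add_argument("filter", nargs="*", help="filters such as warc-type:request")
    parser.add_argument("-string", default="", help="substring a .warc filename must contain")
    parser.add_argument("-path", default=".", help="folder to search for .warc files")
    parser.add_argument("-output_path", help="folder dumped files are placed in")
    parser.add_argument("-output", default="output.warc", help="name of the new warc file")
    parser.add_argument("-dump", choices=["content", "warc"], help="what to write out")
    parser.add_argument("-silence", action="store_true", help="suppress terminal output")
    parser.add_argument("-error", action="store_true", help="write failing entries to error.warc")
    return parser


args = build_parser().parse_args(["-dump", "content", "http:content-type:pdf"])
print(args.dump, args.filter, args.output)
```

The -h option does not need to be declared because argparse adds it automatically.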

Examples


To create a warc file containing all HTTP responses that are not file-like objects.

python3 warc-extractor.py -dump warc warc-type:response \!content-type:application/http

To dump all PDFs from a warc file to disk.

python3 warc-extractor.py -dump content http:content-type:pdf

To dump everything a warc file contains to disk (that is, every file returned with HTTP code 200).

python3 warc-extractor.py -dump content http:error:200

Troubleshooting


Warc files are complicated and huge. Creating a single script that can properly handle all of the many strange and wonderful objects that might be hidden in a warc file is a large undertaking. Because of this, bugs are inevitable.

The script has an -error option designed to make dealing with problematic warc entries a bit easier. If the -error tag is supplied, the script will do its best to skip all entries that cause errors and then write all problematic entries to a new warc file, 'error.warc'. Should the script fail, please try running it again with the -error tag and then upload the resulting 'error.warc' file along with the bug report.
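The skip-and-collect behaviour described above follows a common pattern, sketched below. This is not the script's code: the function name process_entries and its arguments are hypothetical, and writing the collected entries out to 'error.warc' is left out of the example.

```python
def process_entries(entries, handle):
    """Apply handle() to each entry, setting aside the ones that fail.

    Hypothetical helper: returns (results, problematic_entries).
    """
    results, problematic = [], []
    for entry in entries:
        try:
            results.append(handle(entry))
        except Exception:
            # Keep the failing entry so it can be saved for a bug report.
            problematic.append(entry)
    return results, problematic


# int() stands in for real entry processing; "x" triggers an error.
results, problematic = process_entries(["1", "x", "3"], int)
print(results)      # [1, 3]
print(problematic)  # ['x']
```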

There are many possible problems a warc file could contain that are not limited to specific entries. In these situations the -error tag will not prevent the error and will not create the error.warc file. In these cases please still fill out a bug report. However, the problem is unlikely to be fixed unless I can get access to the warc file that created the problem.

One final note, this script was programmed and tested on a Linux platform. In theory it should work on any platform that Python 3 works on; however, I make no guarantees. Help on this issue would be greatly appreciated.
