Skip to content
Python script to create CDX index files of WARC data
Arc Python
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
tests Add test case that illustrates warctools problem when Content-Type Apr 23, 2015
.travis.yml
LICENSE Add LICENSE. Fix typo. Closes #2 Nov 8, 2013
README.md
cdx_writer.py

README.md

cdx_writer.py

Python script to create CDX index files of WARC data.

Build Status

Usage

Usage: cdx_writer.py [options] warc.gz

Options:

-h, --help                  show this help message and exit
--format=FORMAT             A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path             Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX   Path prefix for warc file name in the 'g' field.
                            Useful if you are going to relocate the warc.gz file
                            after processing it.
--all-records               By default we only index http responses. Use this flag
                            to index all WARC records in the file
--screenshot-mode           Special Wayback Machine mode for handling WARCs
                            containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE     Output json file containing statistics

Output is written to stdout. The first line of output is the CDX header. This header line begins with a space so that the cdx file can be passed through sort while keeping the header at the top.

Format

The supported format options are:

M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *

* in alexa-made dat file
** in alexa-made dat file meta-data line

More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php

Installation

Unfortunately, this script is not propery packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.

Differences between cdx_writer.py and access-access cdx files

The CDX files produced by the archive-access package produce different CDX lines in these cases:

Differences in SURTs:

  • archive-access doesn't encode the %7F character in SURTs

Differences in MIME Type:

  • archive-access does not parse mime type for large warc payloads, and just returns 'unk'
  • If the HTTP Content-Type header is sent with a blank value, archive-access returns the value of the previous header as the mime type. cdx_writer.py returns 'unk' in this case. Example WARC Record (returns "close" as the mime type): ...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n

Differences in Redirect urls:

  • archive-access does not escape whitespace, cdx_writer.py uses %20 escaping so we can split these files on whitespace.
  • archive-access removes unicode characters from redirect urls, cdx_writer.py version keeps them
  • archive-access does not decode html entities in redirect urls
  • archive-access sometimes does not turn relative URLs into absolute urls
  • archive-access sometimes does not remove /../ from redirect urls
  • archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
  • cdx_writer.py only looks for http-equiv=refresh meta tag inside head elements

Differences in Meta Tags:

  • cdx_writer.py only looks for meta tags in the head element
  • archive-access version doesn't parse multiple html meta tags, only the first one
  • archive-access misses FI meta tags sometimes
  • cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order

Differences in HTTP Response Codes

  • archive-access returns response code 0 if HTTP header line contains unicode: HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...
Something went wrong with that request. Please try again.