Shepherding our web archives from crawl to access.
Front-end for the UK Web Archive
w3act is an annotation and curation tool for building web archive collections
A pulsating crawl engine built around Heritrix 3.
Django app for calling PhantomJS.
WARC and ARC indexing and discovery tools.
ClamD in a container
Dockerised Heritrix based on LBS stable builds.
An OpenWayback in Docker.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Run the tinycdxserver in a Docker container
A Wayback RemoteResourceIndex server using RocksDB
Modules for Heritrix 3.
Prototype Solr-powered web archive exploration UI.
Repository of documentation about the open datasets published by the UK Web Archive.
Web archive collection manager
Experiments in testable, scalable crawler architectures
A web archive browser built on HBase
Web Archiving Domain Crawl Analysis Scripts
A simple site that uses GitHub pages to host resources for testing crawlers.
An acid test suite for crawlers.
Run warcprox inside Docker
GROBID (GeneRation Of BIbliographic Data) in a Docker container.
Hopefully offsetting some of the difficulties of writing to WARCs (multiple open files, size limits, etc.).
Brozzler in a Docker container
Tracking the fortunes of our archived URLs.