A new user interface for the UK Web Archive
The UKWA Heritrix3 custom modules and Docker builder.
Shepherding our web archives from crawl to access.
The dockerized ensemble of services that run most of the UKWA crawl and ingest processes.
Core Python Web Archiving Toolkit for replay and recording of web archives
Dashboard and monitoring system for the UK Web Archive
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
w3act is an annotation and curation tool for building web archive collections
Serves our WARC files for playback, wherever they may lie.
This module builds our Waybacks in the various configurations we require.
A RESTful API for rendering web pages in PhantomJS
A Python wrapper around the Heritrix API.
Trifecta Docker image
Repository of documentation about the open datasets published by the UK Web Archive.
WARC and ARC indexing and discovery tools.
A simple web service for viewing crawl logs.
Apache Hadoop HttpFS for CDH3
Public documentation about the technical architecture of the UK Web Archive
The dockerized ensemble of services that provide main user access to UKWA material.
Yet Another Docker Container for Apache Zeppelin
A containerised Dat server for experimental dataset hosting.
Internal UKWA website nginx service
Hadoop running in a container, with HttpFS enabled.
A web archive browser built on HBase
Utilities for working with WARC files stored on HDFS.
Web Archiving Domain Crawl Analysis Scripts
Luigi tasks for running Hadoop jobs and managing material held on HDFS