Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Updated May 29, 2024 - Java
Stanford's fork of iipc/openwayback, which is used on our "swap" (Stanford Web Archiving Portal) machines (in use on the swap VM as of 6/2020). See also sul-dlss/swap, which is intended as a replacement.
Partition (W)ARC Files by MIME Type and Year
This module builds our Waybacks in the various configurations we require.
HTTP/S proxy server that replays content from a web archive
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Chrome debugging protocol client for Java