acid-crawl

An acid test suite for web archives.

The idea would be to run a workflow along these lines:

1. Spin up a Python server with test files and behaviour.
   - Big files, little files, embeds, JavaScript tests, dodgy server behaviour, etc.
2. Spin up a warcprox proxy to record what happens.
3. Unpack and configure Heritrix3 with a sample job configuration that tests all modules of interest.
4. Start up Heritrix (via its API) and get it crawling the test site.
5. Perform various standard operations, like pausing, checkpointing, and resuming from a checkpoint.
6. At the end, check that the logs are sane and that the output contains what is expected.
7. Use the warcprox recordings to cross-check the WARC contents from Heritrix3.
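
The final cross-check step could look roughly like the sketch below. It assumes both sets of WARC records have already been reduced to a plain mapping of URL to payload digest (reading the WARC files themselves is not shown). The function names `payload_digest` and `cross_check` are hypothetical and not part of warcprox or Heritrix3.

```python
import hashlib


def payload_digest(body: bytes) -> str:
    """Return a SHA-1 hex digest of a response body."""
    return hashlib.sha1(body).hexdigest()


def cross_check(proxy_records: dict, crawler_records: dict) -> dict:
    """Compare two {url: digest} maps and report where they disagree.

    proxy_records   -- what the recording proxy saw (assumed input shape)
    crawler_records -- what the crawler wrote to its WARCs (assumed input shape)
    """
    proxy_urls = set(proxy_records)
    crawler_urls = set(crawler_records)
    shared = proxy_urls & crawler_urls
    return {
        # URLs the proxy recorded but the crawler did not store
        "missing_from_crawler": sorted(proxy_urls - crawler_urls),
        # URLs the crawler stored that never passed through the proxy
        "missing_from_proxy": sorted(crawler_urls - proxy_urls),
        # URLs present in both, but with different payloads
        "digest_mismatch": sorted(
            url for url in shared if proxy_records[url] != crawler_records[url]
        ),
    }


if __name__ == "__main__":
    proxy = {"http://test/a": payload_digest(b"one")}
    crawler = {"http://test/a": payload_digest(b"two")}
    print(cross_check(proxy, crawler))
```

An empty list under every key would mean the two recordings agree on both the set of URLs and their payloads.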

This repo is intended to provide Docker images containing servers that deliver the test resources, along with the expected results. The testing processes will live elsewhere (for now at least).

Cases

Normal GETs, 200s, redirects, etc.
Byte-range requests:
- Presumably we should grab the whole thing in this case?
- See this related OpenWayback issue
206 Partial Content
- See this example
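
To make the byte-range case concrete, here is a minimal sketch of how a test server might turn a single-range `Range` header into a 206 Partial Content response. It is an illustration only, not code from this repo: `resolve_range` and `partial_response` are hypothetical names, multi-range requests are not handled, and a stricter server would ignore a malformed header and answer 200 with the full body rather than 416.

```python
import re

# Matches single-range headers such as "bytes=0-99", "bytes=100-" or "bytes=-50".
_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def resolve_range(header: str, size: int):
    """Resolve a Range header to an inclusive (start, end) pair, or None."""
    match = _RANGE_RE.match(header.strip())
    if not match or size <= 0:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form: "bytes=-N" asks for the final N bytes.
        length = int(last)
        if length == 0:
            return None
        return max(size - length, 0), size - 1
    start = int(first)
    if start >= size:
        return None
    # An open-ended or over-long range is clamped to the last byte.
    end = min(int(last), size - 1) if last else size - 1
    if end < start:
        return None
    return start, end


def partial_response(body: bytes, header: str):
    """Return (status, headers, payload) for a GET carrying a Range header."""
    span = resolve_range(header, len(body))
    if span is None:
        # Simplification: malformed and unsatisfiable ranges both get 416.
        return 416, {"Content-Range": f"bytes */{len(body)}"}, b""
    start, end = span
    headers = {
        "Content-Range": f"bytes {start}-{end}/{len(body)}",
        "Content-Length": str(end - start + 1),
    }
    return 206, headers, body[start:end + 1]
```

A crawler that stores the 206 response as-is ends up with only a fragment of the resource, which is why the question of grabbing the whole thing matters for archiving.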

Components

acid-simple-resources

This is a simple Java web application that serves simple static resources of various kinds.

acid-crawl-selftest

This is a simple test system that shows how to fire up the test server and request resources.
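
As a rough illustration of that pattern (start a server, request resources, check what comes back), the sketch below uses only the Python standard library. It is a stand-in and not the actual acid-crawl-selftest code: the paths in `RESOURCES`, the `/redirect` route, and the helper names `start_server` and `fetch` are all assumptions made for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical test resources; the real components define their own.
RESOURCES = {
    "/small.txt": (b"hello archive\n", "text/plain"),
    "/big.bin": (b"\0" * (1024 * 1024), "application/octet-stream"),
}


class ResourceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/redirect":
            # A redirect case: send the client on to a normal resource.
            self.send_response(302)
            self.send_header("Location", "/small.txt")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        entry = RESOURCES.get(self.path)
        if entry is None:
            self.send_error(404)
            return
        body, content_type = entry
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_server():
    """Start the server on a free local port; return (server, base_url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), ResourceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_address[1]}"


def fetch(url):
    """Fetch a URL, following redirects; return (status, final_url, body)."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=10) as response:
        return response.status, response.geturl(), response.read()


if __name__ == "__main__":
    server, base = start_server()
    try:
        print(fetch(base + "/redirect")[:2])
    finally:
        server.shutdown()
        server.server_close()
```

Binding to port 0 lets the operating system pick a free port, which avoids clashes when several test runs share a machine.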

Simulating Bad Servers

Some work has been done on simulating a very badly behaved server.

The original idea was to proxy some requests to cynic in order to test how the crawler copes with bad server behaviour. Unfortunately, cynic depends on select.poll() functionality that does not work under OS X. Cynic uses watched files to implement the server sockets and spawns a child process per request, so it is difficult to disentangle things.

pip install bottle cynic wsgiproxy

Alternatively, the hamms package offers similar functionality and may be simpler to use.
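
For a sense of what one kind of bad behaviour looks like on the wire, the sketch below hand-rolls a single misbehaving response with the standard library: the server promises more bytes in `Content-Length` than it sends, then hangs up. This is not how cynic or hamms work internally and uses neither package's API; `serve_truncated_once` and `probe` are hypothetical names for this example.

```python
import http.client
import socket
import threading


def serve_truncated_once(declared_length=100, actual_body=b"short"):
    """Accept one connection, promise more bytes than are sent, then hang up.

    Returns the local port the one-shot server is listening on.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def handle():
        conn, _ = listener.accept()
        with conn:
            conn.recv(65536)  # read (and ignore) the request
            head = (
                "HTTP/1.1 200 OK\r\n"
                f"Content-Length: {declared_length}\r\n"
                "Content-Type: text/plain\r\n"
                "Connection: close\r\n\r\n"
            )
            # The body is deliberately shorter than the declared length.
            conn.sendall(head.encode("ascii") + actual_body)
        listener.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def probe(port):
    """Request "/" and report whether the body arrived complete or truncated."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        try:
            return "complete", response.read()
        except http.client.IncompleteRead as err:
            # err.partial holds the bytes received before the connection closed.
            return "truncated", err.partial
    finally:
        conn.close()


if __name__ == "__main__":
    print(probe(serve_truncated_once()))
```

A crawler under test would ideally notice the truncation and record it, rather than silently storing the short body as if it were complete.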
