Glossary

Naomi Dushay edited this page May 5, 2017 · 4 revisions

Archive-It

"A subscription web archiving service from the Internet Archive that helps organizations to harvest, build, and preserve collections of digital content. Through our user friendly web application Archive-It partners can collect, catalog, and manage their collections of archived content with 24/7 access and full text search available for their use as well as their patrons. Content is hosted and stored at the Internet Archive data centers."

https://archive-it.org/learn-more/

Note that we use https://partner.archive-it.org as the URL to the API ...

bagit

A file packaging format, with accompanying tools, that we use to bundle a set of WARC files together.

Crawl

Web crawl objects are single objects in the Stanford Digital Repository (SDR) that represent a capture of the web content associated with a URL. Each one is a wrapper around the set of WARC files that pertain to that URL.

heritrix

We do SUL web captures using Heritrix. This creates WARC files, which we then accession into SDR crawl objects and make available via SWAP. We have made no local modifications to the Heritrix code base, so there is no sul-dlss GitHub code to document here.

We may also use Heritrix to capture recent web data to add to existing crawl objects.

hrwa_manager

(this code no longer works for us)

This Java code was used to "download" WARC files from Archive-It, which were then accessioned as crawl objects. It is made available to the robots in the app subdirectory of the /was_unaccessioned_data mount point.

More information is in consul: https://consul.stanford.edu/display/WARC/Downloading+WARCs+from+Archive-It

OpenWayback

An open source project that grew out of the Internet Archive's Wayback Machine. It is written in Java, and is the key software used by web archives worldwide to display archived websites in the user's browser.

Seed

Web seed objects are simple Stanford Digital Repository objects meant to be exposed in discovery systems such as SearchWorks to allow users to interact with our preserved crawl objects.

SWAP

Stanford Web Archive Portal (SWAP) machines make individual web crawls available for end-user interaction via the OpenWayback software. Stanford modifications include changing the earliest year allowed and some GUI changes.

WARC

"The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.

"The WARC format is a revision of the Internet Archive's ARC File Format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.

"WARC is now recognised by most national library systems as the standard to follow for web archival."

https://en.wikipedia.org/wiki/Web_ARChive
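
To make the "aggregate archive file" idea concrete, here is a minimal sketch of the record layout in Python: each record is a version line, a set of named header fields, a blank line, and a content block whose size is given by Content-Length. The function names are hypothetical and exist only for illustration, and the sketch assumes an uncompressed WARC/1.0 stream. WARC files are very often gzip-compressed, so a dedicated WARC library would be the more robust choice for real files.

```python
import io


def build_warc_record(record_type, target_uri, payload, record_id, date):
    """Serialize one WARC/1.0 record (hypothetical helper, for illustration only)."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <{record_id}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    # Header block ends with a blank line; the record ends with two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line where a record should start: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line marks the end of the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes so payloads containing newlines survive.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload
```

Reading the content block by byte count rather than line by line is the important design point: the payload is arbitrary captured content, so only Content-Length reliably marks where it ends.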

WAS

Web Archiving Service(s)

WASAPI

Web Archiving Systems API

was-downloader

VM used to download WARCs, which are then set up to be accessioned as crawl objects. It is a separate box because we compute checksums on the downloaded WARCs to verify their integrity, and this requires processing power that shouldn't be competing with, say, was-robots or Heritrix crawling.
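
The checksum step can be sketched roughly as follows. This is an illustration only, not the code that runs on was-downloader: the function names are hypothetical, and SHA-1 as the default algorithm is an assumption (the expected digest would come from whatever service supplied the WARC).

```python
import hashlib


def file_checksum(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex, algorithm="sha1"):
    """Return True if the file's digest matches the expected value (case-insensitive)."""
    return file_checksum(path, algorithm).lower() == expected_hex.strip().lower()
```

Because every byte of every WARC has to be read and hashed, the cost grows with the total size of the download, which is consistent with giving this work its own machine.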

was-registrar

Rails app for registering Web Archiving Service (WAS) objects, both seeds and crawls, in the Stanford Digital Repository. Crawls are registered via a rake task; seeds are registered via the Rails GUI.

Registering an object (seed or crawl) kicks off the appropriate robots (see was_robot_suite) to accession that object.

was_robot_suite

was_robot_suite is DOR workflow robot code for Web Archiving Service object accessioning and dissemination.

There is information about the specific workflows and their steps in the README of the was_robot_suite code. There are workflows for both seed and crawl objects.

was-thumbnail-service

Rails app to create and serve thumbnails for Web Archiving Seed objects in DOR. These thumbnails are intended to be used by discovery environments that include the seed objects, such as SearchWorks.

WASMetadataExtractor

Used by the was_robot_suite for accessioning crawl objects, WASMetadataExtractor extracts descriptive metadata from web archiving ARC and WARC files.

It is a Java project that builds a jar, which is deployed via Jenkins, Puppet, and the robot suite.