OAI-PMH harvester in shell.


The OAI-PMH Shell Harvester harvests OAI-PMH targets. It supports multiple configurable targets, each of which can be updated individually. It can also execute a preset command for each record it updates or deletes.

Installing shell-harvester:
 - dependencies: xsltproc, bc, curl, coreutils (namely for md5sum and shuf)
 - git clone git://github.com/wimmuskee/shell-oaiharvester.git
 - use Makefile or:
  - cp oaiharvester /usr/bin/oaiharvester
  - mkdir /usr/share/shell-oaiharvester
  - cp libs/* /usr/share/shell-oaiharvester/.
  - mkdir /etc/shell-oaiharvester
  - cp config.xml /etc/shell-oaiharvester/config.xml
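
Before installing, it can help to confirm that the dependencies listed above are on the PATH. The following is only a sketch and not part of shell-oaiharvester; the function name check_deps is made up for illustration, and md5sum and shuf stand in for the coreutils package.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with shell-oaiharvester.
# Prints the commands from its argument list that are not on PATH
# and returns non-zero if any are missing.
check_deps() {
    missing=""
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            missing="$missing $cmd"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all dependencies found"
}
```

Usage: check_deps xsltproc bc curl md5sum shuf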

Using shell-harvester:

1. Configure the harvester
   Edit the paths in the config.xml file. Records are stored in
   <recordpath>/<repository_id>.

2. Configure a repository
   Edit a repository in the config file. By default, the example file
   in /etc/shell-oaiharvester is used. You can use your own by calling:
   $ oaiharvester -c <config file>

   Some notes on the options:
    - deletecmd: executed before a record is deleted, optional
    - updatecmd: executed after a record is updated, optional
      - ${identifier} holds the identifier
      - ${filename} is the identifier with optional xz extension
      - ${path} holds the absolute path of the record
      - ${responsedatetime} holds the full OAI page responseDate
      - ${responsedate} holds responsedatetime in YYYY-MM-DD format
    - repository-specific update and delete commands are executed
      before the generic ones, if available
    - set, from, until, conditional and recordpath are optional
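
As an illustration of how the placeholders above might be used, here is a sketch of a small helper that an updatecmd could call. It is not part of shell-oaiharvester: the function name log_update, its argument order and the INDEX_FILE variable with its default path are all hypothetical. It also assumes that the placeholders are expanded before the command runs, so that updatecmd could be set to something like: log_update "${identifier}" "${path}" "${responsedate}"

```shell
#!/bin/sh
# Hypothetical updatecmd helper, not shipped with shell-oaiharvester.
# Appends one tab-separated line per updated record to an index file.
# INDEX_FILE and its default location are assumptions for this sketch.
log_update() {
    identifier=$1      # expected to come from ${identifier}
    path=$2            # expected to come from ${path}
    responsedate=$3    # expected to come from ${responsedate}
    index_file=${INDEX_FILE:-/tmp/updated-records.tsv}
    printf '%s\t%s\t%s\n' "$responsedate" "$identifier" "$path" >> "$index_file"
}
```

Passing the values as arguments keeps the helper independent of how the harvester sets its variables.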

3. Run the harvester
   $ oaiharvester -r <repository_id>

4. Only retrieve identifiers
   $ oaiharvester -r <repository_id> -n

5. Validate the repository without harvesting
   $ oaiharvester -r <repository_id> -t
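
Since each repository is harvested with its own call, several targets can be updated in a loop. The sketch below uses only the -r option shown above; the function name harvest_all is made up, and it assumes that oaiharvester exits with a non-zero status on failure, which this README does not state.

```shell
#!/bin/sh
# Hypothetical wrapper, not shipped with shell-oaiharvester.
# Runs the harvester once per repository id given as an argument and
# returns non-zero if any run reported failure.
# Assumption: oaiharvester signals failure through its exit status.
harvest_all() {
    failed=0
    for repo in "$@"; do
        if ! oaiharvester -r "$repo"; then
            echo "harvest failed for: $repo" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Usage: harvest_all <repository_id> <repository_id> ...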

*. Unless customized, the log file of the harvest process is stored at
   /tmp/shell-harvester/log.csv. The logger uses a simple csv format with
   a line for each downloaded page.
   YYYY-MM-DD HH:MM:SS,repository,record count,download time,process time
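
Because each line follows the field order shown above, the log can be summarized with standard tools. The following sketch counts downloaded pages and sums the record count per repository. The function name summarize_log is made up, and it assumes the log has no header row and that no field contains a comma.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with shell-oaiharvester.
# Reads a harvester log (field 2 = repository, field 3 = record count)
# and prints one line per repository: repository,pages,records
summarize_log() {
    awk -F, '
        { pages[$2]++; records[$2] += $3 }
        END {
            for (r in records)
                printf "%s,%d,%d\n", r, pages[r], records[r]
        }
    ' "$1" | sort
}
```

Usage: summarize_log /tmp/shell-harvester/log.csv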

   The logtype can be configured in the harvester config file: either
   "combined", which puts all instance logs in one file (default), or
   "instance", which writes a separate log file for each started instance.