OAI-PMH harvester in shell.
Shell XSLT Makefile M4
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
libs
COPYING
Makefile.am
README
config.xml
configure.ac
oaiharvester

README

The OAI-PMH Shell Harvester is able to harvest OAI-PMH targets. It supports multiple configurable targets which can be updated individually. Furthermore, it is able to execute a preset command for each record it updates or deletes.

Installing shell-harvester:
 - dependencies: xsltproc, bc, curl, coreutils (namely for md5sum and shuf)
 - git clone git://github.com/wimmuskee/shell-oaiharvester.git
 - use Makefile or:
  - cp oaiharvester /usr/bin/oaiharvester
  - mkdir /usr/share/shell-oaiharvester
  - cp libs/* /usr/share/shell-oaiharvester/.
  - cp config.xml /etc/shell-oaiharvester/config.xml

Using shell-harvester:

1. Configure the harvester
   Edit the paths in the config.xml file, 
   records are stored in <recordpath>/<repository_id> .

2. Configure a repository
   Edit a repository in the config file. By default, the example file
   in /etc/shell-oaiprovider is used. You use your own by calling:
   $ oaiharvester -c <config file>

   Some notes on the options:
    - deletecmd: executed before a record is deleted, optional
    - updatecmd: executed after a record is updated, optional
      - ${identifier} holds the identifier
      - ${filename) is the identifier with optional xz extension
      - ${path} holds the absolute path of the record
      - ${responsedatetime} the full oai page responsedate
      - ${responsedate} the YYYY-MM-DD format of responsedatetime
    - repository update and delete cmd's are executed before
      the generic ones if available
    - set, from, until, conditional and recordpath are optional

3. Run the harvester
   $ oaiharvester -r <repository_id>

4. Only retrieve identifiers
   $ oaiharvester -r <repository_id> -n

5. Not harvest, but validate the repository
   $ oaiharvester -r <repository_id> -t

*. Unless customized, the log file of the harvest process is stored at
   /tmp/shell-harvester/log.csv. The logger uses a simple csv format with
   a line for each downloaded page.
   YYYY-MM-DD HH:MM:SS,repository,record count,download time,process time

   The logtype can be configured in the harvester config file, either
   "combined" putting all instance logs in one file (default), or
   "instance", and have a log file for each started instance