#DPLAH Hydra Head README
DPLAH is a proof-of-concept aggregator for OAI-PMH metadata with additional features specific to the DPLA's discovery interface needs.
- Hydra and therefore everything Hydra needs
- Redis key-value store
- xsltproc for processing XSLT at the command line
- tcl for supporting make test in Redis
- Apache for serving harvesting logs in the browser, and for later production deployment
##Clone the source files
Clone the Hydra head from the Git repository:
git clone https://github.com/tulibraries/dplah.git
Execute the remaining tasks from the Hydra head application directory:
Then run the setup script to quickly get your environment running:
./setup.sh # you may need to `chmod +x` the setup file first
The full details of configuring the application can be seen below in the Configuration section.
To use this head, create a file under `config/dpla.yml` (you may copy the example file, `config/dpla.yml.example`) and add the following:

    harvest_data_directory: "/path/to/metadata/from/contributors"
    converted_foxml_directory: "/path/to/foxml/converted/metadata"
    pid_prefix: "changeme"
    partner: "Name of hub"
    human_log_path: '/abs/path/to/log/files'
    human_log_url: 'http://fullurltologfiles'
    human_catalog_url: 'http://fullurltocatalog'
    email_sender: "email@example.com"
    email_recipient: "firstname.lastname@example.org"
    noharvest_stopword: "string_in_record_metadata_that_signals_not_to_harvest"
    passthrough_url: "sub.example.com"

Substitute your own values as follows for the YML fields:
- `harvest_data_directory` refers to the absolute location on your filesystem where you want the application to store harvested, unprocessed metadata from providers
- `converted_foxml_directory` refers to the absolute location on your filesystem where you want the application to store converted FOXML files generated from the harvested metadata, for ingestion into Fedora
- `pid_prefix` refers to the namespace for your Fedora objects
- `partner` refers to the name of the DPLA Partner hub
- `human_log_path` refers to the absolute path on the filesystem where the OAI management logs should go
- `human_log_url` refers to the full clickable URL path to the directory on your web server where OAI management logs live
- `human_catalog_url` refers to the full clickable URL path to the Blacklight catalog on your web server
- `email_sender` refers to the sending address for harvest utility email reports
- `email_recipient` refers to the recipient address for harvest utility email reports
- `noharvest_stopword` refers to an optional string that the aggregator can use as a signal to skip any record containing that string in its metadata, excluding it from ingest
- `passthrough_url` refers to the URL of a passthrough workflow agent signified in identifiers (not needed unless you use the passthrough workflow)
##Start up locally
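Before starting up, it can help to confirm that `config/dpla.yml` defines every field described above. The following is a minimal sketch, not part of the application: `REQUIRED_KEYS`, `missing_keys`, and `check_config` are hypothetical names, and the sketch assumes the file is a flat key/value mapping like the example. It treats `noharvest_stopword` and `passthrough_url` as optional, since the field descriptions say they are.

```ruby
require 'yaml'

# Keys taken from the example config; noharvest_stopword and passthrough_url
# are described as optional, so they are not required here (assumption).
REQUIRED_KEYS = %w[
  harvest_data_directory converted_foxml_directory pid_prefix partner
  human_log_path human_log_url human_catalog_url
  email_sender email_recipient
].freeze

# Returns the required keys that are absent or blank in the parsed config.
def missing_keys(config)
  REQUIRED_KEYS.select { |key| config[key].to_s.strip.empty? }
end

# Hypothetical helper: loads a YAML file and reports its missing keys.
def check_config(path)
  config = YAML.safe_load(File.read(path)) || {}
  missing_keys(config)
end
```

From the application root, `puts check_config('config/dpla.yml')` would print any fields that still need values; an empty result means every required field is set.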
To start up locally, be sure that your `pid_prefix` as defined in `config/dpla.yml` matches the pid prefix in your `fedora.fcfg` file under
Install the necessary Ruby gems:
bundle install
Ensure you have jetty installed and configured, and all tables are migrated (can use the following commands if needed):

    rake db:migrate
    rails g hydra:jetty
    rake jetty:config

Start the jetty server:
rake jetty:start
Give jetty time to start up (about 10-30 seconds) before starting the Rails application:
rails server -d -b 127.0.0.1
In your web browser, go to http://localhost:3000 to verify that the Hydra head is up and running. If you get a Solr connection error, wait a few more seconds for Jetty to complete its startup and try again.
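Rather than guessing how long Jetty needs, the wait can be scripted. The sketch below polls a URL until it answers, then gives up after a fixed number of attempts. `HTTP_PROBE` and `wait_until_ready` are hypothetical names, not part of this application, and the Solr URL in the usage note assumes the hydra-jetty default port of 8983.

```ruby
require 'net/http'
require 'uri'

# Default probe: true when the URL answers with an HTTP 2xx or 3xx response.
HTTP_PROBE = lambda do |url|
  begin
    response = Net::HTTP.get_response(URI(url))
    response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
  rescue StandardError
    false # connection refused, timeout, etc.: not ready yet
  end
end

# Polls until the probe succeeds or the attempts run out.
# Returns true if the service became ready, false otherwise.
def wait_until_ready(url, attempts: 30, interval: 2, probe: HTTP_PROBE)
  attempts.times do
    return true if probe.call(url)
    sleep interval
  end
  false
end
```

For example, `wait_until_ready('http://localhost:8983/solr')` could be called before launching `rails server`. The probe is passed in as a parameter so the loop can be exercised without a running server.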
##Redis and Resque
Harvest and delete jobs are backgrounded once assigned through the dashboard, with the use of Redis and Resque. In order to do this, Redis must be installed and configured.
- Tutorial on installing/configuring Redis on Ubuntu and CentOS from projecthydra-labs hydradam
- This Hydra head uses the resque-pool gem to make configuration of queues easier. See the `config/resque-pool.yml` file for the default configuration. It is recommended that users configure two queues for instances of this application, one for Harvest jobs and one for DumpReindex jobs. Both are set up by default, with two workers allocated per job. Adjust the number of workers as needed for your application.
- To initialize a resque pool's workers, issue the following command at the terminal:
env TERM_CHILD=1 VVERBOSE=1 COUNT='NUM' QUEUE=NAME_OF_QUEUE bundle exec rake resque:workers
  - `NUM` refers to the number of workers you want to activate in this pool
  - `NAME_OF_QUEUE` refers to the name of the queue you are initializing (e.g., harvest)
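If you start several pools, the command above can be wrapped in a small Ruby helper so that the queue name and worker count are filled in consistently. This is only a sketch: `resque_worker_command` and `start_resque_workers` are hypothetical names, and the helper sets the same environment variables as the shell command shown above.

```ruby
# Builds the environment and argument list for one pool of Resque workers,
# mirroring the shell command shown above.
def resque_worker_command(queue, count)
  unless count.is_a?(Integer) && count > 0
    raise ArgumentError, 'count must be a positive integer'
  end

  env = {
    'TERM_CHILD' => '1',
    'VVERBOSE'   => '1',
    'COUNT'      => count.to_s,
    'QUEUE'      => queue.to_s
  }
  [env, %w[bundle exec rake resque:workers]]
end

# Runs the workers from the current directory (assumed to be the Rails root).
def start_resque_workers(queue, count)
  env, argv = resque_worker_command(queue, count)
  system(env, *argv)
end
```

Calling `start_resque_workers('harvest', 2)` would then be equivalent to running the shell command with `COUNT='2'` and `QUEUE=harvest`.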
- Start and verify Redis.
- Clone the repository locally.
- Run Hydra generators/set up jetty/tomcat, etc.
- Start Fedora/Solr services.
- Initiate the Resque queues you'll be using for your backgrounded harvest and delete tasks -- these default to "harvest" and "delete."
- Start the Rails server.
###Managing OAI Seeds
- To begin harvesting, go to the relative path "/providers" (the "Manage OAI Seeds" dashboard -- you will need to create a devise user account first if using the default setup) and input data for an OAI-PMH harvestable repository (see "OAI Resources" below for tips on getting started)
- Save the provider.
- From the dashboard, click the "Harvest from OAI Source" button in the OAI seeds table. You should see the ajax spinner and a prompt to check the harvesting logs. Click this link to go to the directory listing page of OAI management logs. These are textual logs created every time an OAI seed action is performed from within the dashboard (that is, harvest, delete, etc). Harvesting/ingesting and deleting from the index can take a while, especially for seeds with many records, so you can monitor the progress of an ingest by refreshing the textual log periodically.
- Go to your Hydra head in the browser to see if the metadata was harvested and has been made discoverable.
####More Actions for OAI Seeds
- From the dashboard, you may delete all records from a collection, from an institution, or delete all records in the aggregator. You can also re-harvest records from seeds (not deleting before doing so will result in the older ones being overwritten), and harvest everything available from all seeds via the actions underneath the OAI seeds table. The "harvest all" and "delete all" tasks will take a while if you have many seeds, and many records in the index respectively.
- To harvest just the raw OAI from a specific OAI seed in the application, run the following in the terminal:
NUM= the ID of the provider/OAI seed that you are attempting to harvest
This will harvest the raw OAI available from the seed, save it as XML (broken into separate files on the resumption token), and place it in the directory defined by "harvest_data_directory" in `config/dpla.yml`. This can be handy if you are trying to troubleshoot XML-related issues with a seed's OAI content as it is being delivered to the hub.
- To harvest all metadata and view the immediately-delivered OAI-PMH before it is converted and ingested, go to the command line from the root of your Rails application and run the following:
If the provider is harvestable, you should see metadata fly past you in the terminal. If there are errors with the metadata, you should (hopefully) see those too. Go to the path on your filesystem specified in dpla.yml under the "harvest_data_directory" value, and you should see an XML file containing the harvested metadata in one large file.
- To see the FOXML output before it is ingested into the repository, go to the command line from the root of your Rails application and run the following:
You should see a message displaying the absolute path to your harvested XML, stating that it was "converted." If you see errors, make sure you have xsltproc installed on your system. If you do and you still see errors, the XML may be invalid or contain problematic characters.
- To harvest and ingest all records from all OAI seeds present in the application, run the following in the terminal:
This will harvest, convert, normalize, and ingest all OAI-PMH records from all OAI seeds in the Hydra head into the repository.
- To remove all records from the aggregator index, run the following in the terminal:
This will delete all harvested records from the local repository.
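The raw harvest described above splits its output on the OAI-PMH resumption token. For readers troubleshooting a seed, the paging loop behind that behavior can be sketched as follows. This is an illustration of the protocol, not this application's harvesting code: `resumption_token`, `list_records_url`, and `each_oai_page` are hypothetical names, and `oai_dc` is assumed as the metadata prefix. Per OAI-PMH, the first request carries `metadataPrefix`, each follow-up request carries only `resumptionToken`, and an empty or absent token marks the last page.

```ruby
require 'rexml/document'
require 'uri'

# Extracts the resumption token from one OAI-PMH response, or nil when the
# response has no token or an empty one (which marks the final page).
def resumption_token(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '//*[local-name()="resumptionToken"]')
  token = node && node.text.to_s.strip
  token.nil? || token.empty? ? nil : token
end

# Builds the ListRecords URL for the first page or a continuation page.
def list_records_url(base_url, metadata_prefix, token = nil)
  params = { 'verb' => 'ListRecords' }
  if token
    params['resumptionToken'] = token # exclusive argument in OAI-PMH
  else
    params['metadataPrefix'] = metadata_prefix
  end
  "#{base_url}?#{URI.encode_www_form(params)}"
end

# Yields each page of XML in order. `fetch` is any callable that maps a URL
# to a response body, e.g. ->(url) { Net::HTTP.get(URI(url)) }.
def each_oai_page(base_url, fetch:, metadata_prefix: 'oai_dc', max_pages: 10_000)
  token = nil
  max_pages.times do
    xml = fetch.call(list_records_url(base_url, metadata_prefix, token))
    yield xml
    token = resumption_token(xml)
    break if token.nil?
  end
end
```

Writing each yielded page to its own file would reproduce the one-file-per-token layout. The `max_pages` cap is a safeguard against a repository that never returns an empty token.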
##Tests
From within the root of your Rails application, run the following:

    rake jetty:config
    rake jetty:start
    redis-server
    rake spec

This will run the whole test suite, which may take several minutes, especially the first time.
This application currently tests baseline functionality of Providers and OAI Records. Additional code for scheduled harvests and less-easily-tested features of XML conversion are pending, so there should be 160 examples, 30 pending, 0 failures out of the box, if all system dependencies are up and running.
##OAI Resources
- OAI-PMH for Beginners Tutorial
- Sample OAI-PMH Requests from the Library of Congress
- Open Archives Registered Data Providers
  - Note that not all of the providers listed there necessarily still offer OAI-PMH metadata in a harvestable format
Some code in this project (and much research) is based on the talented Chris Beer's work in the following invaluable repos:
This software has been developed by and is brought to you by the Hydra community. Learn more at the Project Hydra website.