HOS-MetadataTransformations

Automated workflow for harvesting, transforming and indexing of metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.

Use case

  1. Harvest metadata in different standards (Dublin Core, DataCite, ...) from multiple OAI-PMH endpoints
  2. Transform harvested data with specific rules for each source to produce normalized and enriched data
  3. Load transformed data into a Solr search index (which serves as a backend for a discovery system, e.g. HOS-TYPO3-find)
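
The three steps above can be sketched as a dry run that only prints one plausible command per stage. This is an illustration under assumptions, not the code of the scripts in bin/: the function name `plan_workflow`, the intermediate file names and the JSON load format are hypothetical, and `oai_dc` is used only because every OAI-PMH endpoint has to offer it.

```shell
# Dry run: print one plausible command per stage instead of executing it.
# File names and the JSON load format are assumptions made for illustration.
plan_workflow() {
  codename="$1"   # short name of the source, e.g. ediss
  endpoint="$2"   # OAI-PMH base URL
  solr="$3"       # URL of the Solr core

  # 1. Harvest: metha-sync caches the records, metha-cat prints them as XML
  echo "metha-sync -format oai_dc ${endpoint}"
  echo "metha-cat -format oai_dc ${endpoint} > ${codename}.xml"

  # 2. Transform: the source-specific rules are applied in OpenRefine
  echo "# apply the OpenRefine rules for ${codename}: ${codename}.xml -> ${codename}.json"

  # 3. Load: post the result to Solr's update handler and commit
  echo "curl -H 'Content-Type: application/json' --data-binary @${codename}.json '${solr}/update?commit=true'"
}

plan_workflow ediss http://ediss.sub.uni-hamburg.de/oai2/oai2.php http://localhost:8983/solr/hos
```

Printing the commands instead of running them keeps the sketch free of side effects; the real workflow is started with the scripts described under Usage.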

Data Flow

mermaid flowchart

Source: flowchart.mmd (try mermaid live editor)

Preview

preview

Features

System requirements

Installation

Tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS.

Install git:

sudo apt install git

Clone this git repository:

git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations

Install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:

sudo ./install.sh
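
After the install script has finished, a quick check like the following can confirm that the command-line tools from the package list are on the PATH. It is a minimal sketch: the helper `missing_tools` is hypothetical, and metha, OpenRefine and Solr are not checked because this README does not say where install.sh puts them.

```shell
# Print every command from the argument list that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools java zip curl jq)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names end up on one line
  echo "not found on PATH:" $missing
else
  echo "java, zip, curl and jq are available"
fi
```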

Configure Solr schema:

./init-solr-schema.sh

Usage

Data will be available after first run at:

Run workflow with data source "ediss" and load data into local Solr (-s) and local OpenRefine service (-d):

bin/ediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333

Run workflow with all data sources in parallel and load data into local Solr (-s) and local OpenRefine service (-d):

./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333

Run workflow with all data sources and load data into two external Solr cores (-s) and external OpenRefine service (-d):

./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
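
The last example passes -s twice. One way a script can accept a repeated option is to append each value inside a getopts loop, as in the sketch below. The function and variable names are hypothetical and the real run.sh may parse its arguments differently; only the option letters -s, -d and -x are taken from this README.

```shell
# Collect repeated -s values plus single -d and -x values.
# Results are left in the variables solr_urls, openrefine_url and max_age_days.
parse_args() {
  solr_urls=""
  openrefine_url=""
  max_age_days=""
  OPTIND=1   # reset so the function can be called more than once
  while getopts "s:d:x:" opt; do
    case "$opt" in
      s) solr_urls="${solr_urls:+$solr_urls }$OPTARG" ;;  # append, space separated
      d) openrefine_url="$OPTARG" ;;
      x) max_age_days="$OPTARG" ;;
      *) echo "usage: [-s solr_url]... [-d openrefine_url] [-x days]" >&2
         return 1 ;;
    esac
  done
}

parse_args -s http://localhost:8983/solr/hos -s http://localhost:8983/solr/hos2 -d http://localhost:3333
for url in $solr_urls; do
  echo "would load into: $url"
done
```

A space-separated string is enough here because URLs contain no unescaped spaces; a bash array would be the more robust choice for arbitrary values.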

Solr authentication

If your external Solr is secured with username/password (Basic Authentication Plugin), you can provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in the username and password:

cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
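
The chmod 400 above makes the file readable by its owner only. A script that uses the credentials could verify this before each run; the helper below is a hypothetical sketch and not part of this repository. It reads the permission string from `ls -ld` and rejects the file if any group or other bit is set.

```shell
# Succeed only if the file exists and grants nothing to group or others.
check_credentials_file() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "credentials file not found: $file" >&2
    return 1
  fi
  # Characters 5-10 of the mode string are the group and other bits,
  # e.g. "-r--------" gives "------".
  shared_bits="$(ls -ld -- "$file" | cut -c5-10)"
  if [ "$shared_bits" != "------" ]; then
    echo "credentials file is accessible to other users: $file" >&2
    return 1
  fi
  return 0
}

# Example call (the path is the one used in this README):
# check_credentials_file cfg/solr/credentials || exit 1
```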

Cronjob

Example of a daily cron job at 00:35 that runs the workflow with all data sources, loads data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x):

command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
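
The last line of the snippet above first drops any existing crontab entry that contains the command and then appends the new job, so running it again does not create duplicates. The same idea is shown below as a small filter that works on crontab text from standard input; the function name `merge_cron_job` is hypothetical, and `grep -F` is used as the current spelling of `fgrep`.

```shell
# Read a crontab from stdin and write it back out with exactly one entry
# for the given command.
#   $1 = command line of the job
#   $2 = schedule, e.g. "35 0 * * *"
merge_cron_job() {
  # Drop lines that already contain the command (fixed string, ignoring case).
  # grep exits non-zero when nothing is left, which is not an error here.
  grep -F -i -v -- "$1" || true
  # Append the job with the requested schedule.
  echo "$2 $1"
}

# Usage against the live crontab (commented out so the sketch changes nothing):
# crontab -l 2>/dev/null | merge_cron_job "$command" "35 0 * * *" | crontab -
```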

Add a data source

  • Step 1: Harvest new OAI-PMH endpoint and load data into OpenRefine. Example for a new data source called yourdatasource with OAI-PMH endpoint http://ediss.sub.uni-hamburg.de/oai2/oai2.php:
./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
  • Step 2: Explore the data in OpenRefine at http://localhost:3333 (project yourdatasource_new) and create transformations until the data looks right and fits the Solr schema.

  • Step 3: Extract the OpenRefine project history in JSON format and save it in a subdirectory of cfg/, e.g. cfg/yourdatasource/transformation.json.

  • Step 4: Copy an existing bash shell script (e.g. bin/ediss.sh) to bin/yourdatasource.sh and edit line 17 (codename of the source, e.g. yourdatasource) and line 18 (URL of the OAI-PMH endpoint, e.g. http://ediss.sub.uni-hamburg.de/oai2/oai2.php). If you load a large dataset, you may need to allocate more memory to OpenRefine (line 19).

cp -a bin/ediss.sh bin/yourdatasource.sh
gedit bin/yourdatasource.sh
  • Step 5: Run your shell script (or the full workflow):
bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
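
Before the first run of a new source, a short pre-flight check can catch the most common slips from steps 3 and 4. The helper below is a hypothetical sketch: it assumes the file layout used in the examples above (bin/CODENAME.sh and cfg/CODENAME/transformation.json) and validates the JSON only if jq is installed.

```shell
# Check that the script and the transformation rules for a source are in place.
#   $1 = codename of the source
#   $2 = repository root (defaults to the current directory)
check_source() {
  codename="$1"
  root="${2:-.}"
  script="$root/bin/$codename.sh"
  rules="$root/cfg/$codename/transformation.json"

  if [ ! -x "$script" ]; then
    echo "missing or not executable: $script" >&2
    return 1
  fi
  if [ ! -s "$rules" ]; then
    echo "missing or empty: $rules" >&2
    return 1
  fi
  # "jq empty" parses the file and prints nothing; it fails on invalid JSON.
  if command -v jq >/dev/null 2>&1; then
    if ! jq empty "$rules" >/dev/null 2>&1; then
      echo "not valid JSON: $rules" >&2
      return 1
    fi
  fi
  echo "ok: $codename"
}

# Example call from the repository root:
# check_source yourdatasource
```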