Combine is a Django application to facilitate the harvesting, transformation, analysis, and publishing of metadata records by Service Hubs for inclusion in the Digital Public Library of America (DPLA).
The name "Combine", pronounced /kämˌbīn/, is a nod to the combine harvester used in farming famous for, "combining three separate harvesting operations - reaping, threshing, and winnowing - into a single process" Instead of grains, we have metadata records! These metadata records may come in a variety of metadata formats, various states of transformation, and may or may not be valid in the context of a particular data model. Like the combine equipment used for farming, this application is designed to provide a single point of interaction for multiple steps along the way of harvesting, transforming, and analyzing metadata in preperation for inclusion in DPLA.
Documentation is available at Read the Docs.
Combine requires a number of server components, dependencies, and configurations to be in place before it will run, as it leverages Apache Spark, among other applications, for processing on the backend. There are two deployment options.
A GitHub repository Combine-Docker exists to help stand up an instance of Combine as a series of interconnected Docker containers.
Alternatively, the Combine-playbook repository has been created to assist with provisioning a server with everything necessary, and in place, to run Combine. It provides routes for server provisioning via Vagrant and/or Ansible. Please visit the Combine-playbook repository for more information about installation.
Combine itself is a Django application.
The system configuration is stored in MySQL. This includes users, organizations, record groups, jobs, transformations, validation scenarios, and so on.
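As a rough illustration, a Django settings entry pointing Combine at MySQL looks like the following; the database name, user, and password here are placeholders, not Combine's shipped defaults.

```python
# settings.py -- illustrative values only, not Combine's shipped defaults
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'combine',
        'USER': 'combine_user',
        'PASSWORD': 'changeme',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}
```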
The harvested and transformed Records themselves are stored in MongoDB, which handles large volumes of records better than MySQL would.
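For a sense of the pattern, here is a minimal pymongo sketch; the host, port, collection, and field names are illustrative, not necessarily Combine's internal schema.

```python
from pymongo import MongoClient

# connect to the MongoDB instance backing Combine (host and port are assumptions)
client = MongoClient('localhost', 27017)
db = client['combine']

# count Records attached to a hypothetical Job
print(db.record.count_documents({'job_id': 42}))
```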
We use ElasticSearch for indexing and searching the contents of Records.
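A hedged sketch of searching mapped fields with the Elasticsearch Python client; the index name and mapped field name are hypothetical, and the exact client call varies slightly across client versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# search a hypothetical per-Job index for a mapped field value
results = es.search(
    index='j42',
    body={'query': {'match': {'mods_titleInfo_title': 'Detroit'}}}
)
for hit in results['hits']['hits']:
    print(hit['_id'])
```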
Celery runs background tasks that don't deal with large-scale data, like prepping job reruns or importing/exporting state.
Redis simply keeps track of Celery's task queue, acting as its message broker.
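The shape of this arrangement, sketched with a hypothetical task; the broker URL and task body are assumptions, not Combine's actual configuration.

```python
from celery import Celery

# a minimal Celery app using Redis as the broker
app = Celery('combine_tasks', broker='redis://localhost:6379/0')

@app.task
def export_state(job_id):
    # placeholder body for a background task like exporting a Job's state
    print(f'exporting state for job {job_id}')

# queued from Django with: export_state.delay(42)
```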
Livy provides a REST interface that makes it easier to interact with Spark.
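From Python, Livy's REST API looks roughly like this; the host and port assume Livy's default of 8998, and the submitted code is just a placeholder.

```python
import requests

LIVY = 'http://localhost:8998'  # Livy's default port

# start a PySpark session
session = requests.post(f'{LIVY}/sessions', json={'kind': 'pyspark'}).json()

# in practice, poll the session until it reports an 'idle' state, then submit code
statement = requests.post(
    f"{LIVY}/sessions/{session['id']}/statements",
    json={'code': 'spark.range(10).count()'},
).json()
print(statement)
```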
Spark runs all the Jobs that harvest or alter Records, providing scalability across large record sets.
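For flavor, the kind of PySpark code such a Job might execute; the data and logic here are purely illustrative, not Combine's actual job code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('combine_sketch').getOrCreate()

# toy Records as (record_id, valid) rows
records = spark.createDataFrame(
    [('rec-1', True), ('rec-2', False), ('rec-3', True)],
    ['record_id', 'valid'],
)

# count Records that passed validation
print(records.filter(records.valid).count())
```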
Hadoop simply sits behind Spark, supporting its processing on the backend.
Field Mappers let you control how a Record is mapped from XML to key/value pairs (JSON).
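A minimal sketch of the idea, flattening element paths into keys with lxml; this is an illustration of the XML-to-key/value pattern, not Combine's built-in mapper.

```python
from lxml import etree

def map_fields(xml_string):
    """Flatten an XML Record into key/value pairs (illustrative sketch)."""
    root = etree.fromstring(xml_string.encode())
    tree = etree.ElementTree(root)
    mapped = {}
    for element in root.iter():
        if element.text and element.text.strip():
            # derive a key from the element's path, e.g. 'record_title'
            key = tree.getpath(element).strip('/').replace('/', '_')
            mapped[key] = element.text.strip()
    return mapped

print(map_fields('<record><title>Detroit</title></record>'))
# {'record_title': 'Detroit'}
```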
Transformations let you take a Record in one format and turn it into a new Record in another format.
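One common way to express such a transformation is XSLT; here is a toy stylesheet applied with lxml. The stylesheet is made up for illustration, not one of Combine's shipped scenarios.

```python
from lxml import etree

# a made-up stylesheet that rewrites <title> as <dc:title>
xslt_doc = etree.fromstring(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:template match="/record">
    <dc:title><xsl:value-of select="title"/></dc:title>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(xslt_doc)

record = etree.fromstring(b'<record><title>Detroit</title></record>')
print(etree.tostring(transform(record), pretty_print=True).decode())
```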
You can run Validation Scenarios against the Records in a Job to find out which Records do or do not meet their requirements.
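As a sketch of the concept, a hypothetical validation test written in Python that checks for a non-empty title element; the test and element names are assumptions for illustration.

```python
from lxml import etree

def test_has_title(record_xml):
    # hypothetical validation test: pass if the Record has a non-empty <title>
    root = etree.fromstring(record_xml.encode())
    return any(
        t.text and t.text.strip()
        for t in root.findall('.//title')
    )

print(test_has_title('<record><title>Detroit</title></record>'))  # True
print(test_has_title('<record/>'))                                # False
```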
Record Identifier Transformation Scenarios (RITS) are used to transform a Record's identifier, which is used for publishing and for uniqueness checks.
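A hypothetical sketch of such a transformation in Python, stripping an OAI-style prefix from an identifier with a regular expression; the pattern and identifier are illustrative only.

```python
import re

def transform_identifier(record_id):
    # hypothetical regex-based transformation: strip an OAI-style prefix
    return re.sub(r'^oai:[^:]+:', '', record_id)

print(transform_identifier('oai:example.org:record/123'))  # 'record/123'
```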