Darren Hardy edited this page Nov 16, 2016 · 47 revisions

The dor_indexing_app is the primary API for indexing DOR objects into the DOR Index in the Solr cloud. Its purpose is to keep data consistent between DOR objects and the Solr index via an automated pipeline. Its challenges are to ensure that all DOR objects can be indexed (i.e., handling unexpected data problems) and to keep the latency down and throughput up.

Architecture

  • All incoming requests are synchronous (i.e., each request blocks until finished)
  • All DOR objects are indexed in parallel (i.e., not batched)
  • All DOR objects are read from Fedora and the Workflow service
  • All Solr documents are written to the Solr cloud

Reindexing

The /dor/reindex/:pid route (see API documentation) basically does the following:

obj = Dor.load_instance pid # initializes an ActiveFedora object, e.g., Dor::Item, Dor::Collection, Dor::AdminPolicyObject
solr_doc = obj.to_solr      # loads all datastreams and related objects via ActiveFedora
Dor::SearchService.solr.add(solr_doc, options) # updates the Solr index via RSolr

We have a detailed description of the indexing itself. Note that it covers "bulk indexing" in great detail; bulk indexing is outside the normal reindexing pipeline and is used for building an index "from scratch."

We have logging for the time taken by each API call to /dor/reindex/:pid. There is some instrumentation (i.e., benchmarking, metrics, etc.) at the sub-route level. The logic has these 3 main parts:

  • (a) reading the object from Fedora and the Workflow Service,
  • (b) converting the object into a Solr document, and
  • (c) updating the Solr index.

Notably, the distinction between (a) and (b) is blurred by the "lazy loading" approach that ActiveFedora uses to load all the datastreams for a given object. You can view a sample Solr document for an object by adding ".json" to the end of the Argo object view page, e.g., https://argo.stanford.edu/catalog/druid:bb021tj7970.json
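As a rough illustration of sub-route timing for parts (a), (b), and (c), the sketch below wraps each phase in Benchmark.realtime from Ruby's standard library. The method name timed_reindex and the three callables are hypothetical stand-ins for Dor.load_instance, to_solr, and the Solr add call; this is not the app's actual instrumentation. Because of lazy loading, much of the Fedora read time would likely show up under the conversion phase rather than the load phase.

```ruby
require 'benchmark'

# Times the three phases of a reindex and returns elapsed seconds per phase.
# loader, converter, and updater are hypothetical stand-ins (assumptions) for:
#   loader    ~ Dor.load_instance(pid)
#   converter ~ obj.to_solr
#   updater   ~ Dor::SearchService.solr.add(solr_doc)
def timed_reindex(pid, loader:, converter:, updater:)
  timings = {}
  obj = nil
  doc = nil
  timings[:load]    = Benchmark.realtime { obj = loader.call(pid) }
  timings[:to_solr] = Benchmark.realtime { doc = converter.call(obj) }
  timings[:update]  = Benchmark.realtime { updater.call(doc) }
  timings
end

# Usage with stub callables so the sketch runs without Fedora or Solr.
timings = timed_reindex(
  'druid:bb021tj7970',
  loader:    ->(pid) { { pid: pid } },
  converter: ->(obj) { { id: obj[:pid] } },
  updater:   ->(doc) { doc }
)
timings.each { |phase, secs| puts format('%-8s %.6fs', phase, secs) }
```

In a real run, comparing the three numbers per request would show which phase dominates the latency for a given object.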

Messaging

The dor_indexing_app's incoming traffic is solely from an ActiveMQ consumer. Fedora sends ActiveMQ messages on every update or delete to an object to the fedora.apim.update topic, as does the Workflow Service on every change in status.

There is a single consumer of those messages that translates them into GET requests on /dor/reindex/:pid. You can see in the ActiveMQ configuration that the fedora.apim.update topic is consumed by doing a reindexing API call to dor-indexing-app. The messages are aggregated into batches by this ActiveMQ consumer, but the effectiveness of the batching and its ability to de-duplicate messages are unknown (as of 11/4/16).

The fedora.apim.update topic also receives delete object messages (called purgeObject). The consumer handles these messages by issuing a GET request to the /dor/delete_from_index/:pid API.
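The mapping from messages to API routes described above could be sketched as follows. The message shape (a hash with :pid and :method keys) and the method names route_for and routes_for_batch are assumptions for illustration; the real consumer lives in the ActiveMQ configuration, and its actual de-duplication behavior is unknown. The sketch keeps one request per pid within a batch, with the latest message for that pid deciding the route.

```ruby
# Hypothetical message shape: { pid: 'druid:...', method: 'purgeObject' }
# Returns the dor-indexing-app route that a single message maps to.
def route_for(message)
  if message[:method] == 'purgeObject'
    "/dor/delete_from_index/#{message[:pid]}"
  else
    "/dor/reindex/#{message[:pid]}"
  end
end

# Collapses a batch to one route per pid. Hash insertion order keeps the
# first-seen pid order, while the last message for a pid sets its route.
def routes_for_batch(messages)
  messages.each_with_object({}) { |m, by_pid| by_pid[m[:pid]] = route_for(m) }.values
end

batch = [
  { pid: 'druid:aa111bb2222', method: 'modifyDatastreamByValue' },
  { pid: 'druid:aa111bb2222', method: 'modifyDatastreamByValue' },
  { pid: 'druid:cc333dd4444', method: 'purgeObject' }
]
puts routes_for_batch(batch)
```

Letting the last message win means an object that is updated and then purged within the same batch results in a single delete request rather than a reindex followed by a delete.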

Note that we have two ActiveMQ brokers running (an "a" node and a "b" node). They are not actively load balanced but are configured for failover. That is, if sending a message to the "a" node fails, then the message is sent to the "b" node. This failover method is configured in the Fedora and Workflow Service ActiveMQ configuration files:

failover:(tcp://mqhost1,tcp://mqhost2)?timeout=5000
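The try-"a"-then-"b" behavior described above can be illustrated with a small sketch. It only models the ordering; the real failover is handled inside the ActiveMQ client transport, not by application code. The method name send_with_failover and the sender callable are hypothetical.

```ruby
# Tries each broker in order and returns the first broker that accepted the
# message. sender is a hypothetical callable that raises on delivery failure.
def send_with_failover(brokers, message, sender:)
  errors = []
  brokers.each do |broker|
    begin
      sender.call(broker, message)
      return broker
    rescue StandardError => e
      errors << "#{broker}: #{e.message}"
    end
  end
  raise "all brokers failed (#{errors.join('; ')})"
end

# Usage: simulate the "a" node being down so the "b" node receives the message.
brokers = ['tcp://mqhost1', 'tcp://mqhost2']
flaky_sender = ->(broker, _msg) { raise IOError, 'connection refused' if broker == 'tcp://mqhost1' }
puts send_with_failover(brokers, 'reindex druid:bb021tj7970', sender: flaky_sender)
```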

There are ActiveMQ dashboards available at /activemqweb/ and /hawtio/, and dor-indexing-app has a /dor/queue_size route that will return the current incoming queue size.

Stack

Gems

These are some of the notable gems in the stack (versions are in Gemfile.lock):

  • dor-services:
    • This holds all the application logic for converting a DOR object into a Solr document
  • dor-workflow-service:
    • Used by dor-services to get information about the workflows datastream
    • This is the API to our Workflow Service HTTP API
  • ActiveFedora (on the 8.x branch, the latest for the Fedora 3 releases):
    • Used by dor-services to do CRUD operations on Fedora objects
    • This is the ActiveRecord-like API to objects stored in Fedora
  • rubydora:
    • Used by ActiveFedora
    • This is the API to Fedora's HTTP API -- Fedora v3 only
  • RSolr:
    • Used by dor-services and ActiveFedora to query and index Solr documents
    • This is the API to Solr's HTTP API
  • rest-client:
    • Used by several gems for HTTP request/response processing
    • This is an HTTP client gem
  • rails:
    • The webapp platform

External services

  • Fedora (for reading)
  • Workflow Service (for reading)
  • Solr (for writing, although it apparently also issues read/query requests)
  • ActiveMQ (generates incoming traffic)