Indexing EAD in ArcLight

Gregory Wiedeman edited this page Dec 19, 2023 · 26 revisions

Now that you have your ArcLight application up and running, we need to index data into it.

EAD requirements

Currently, ArcLight's indexer expects the following:

  • Valid and well-formed EAD 2002 according to its XSD schema. If we can't parse the finding aid, we can't index it. (Indexing DTD-compliant EAD 2002 might work, but we can't guarantee it.)
  • All components have at least a <unittitle/> or <unitdate/>. Without either, we won't be able to display anything!
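The second requirement can be checked before indexing. The following is a minimal sketch, not part of ArcLight: it assumes components are the EAD 2002 elements <c> and <c01> through <c12>, and that the title or date sits directly inside each component's <did>. The method name is hypothetical.

```ruby
require 'rexml/document'

# Matches EAD 2002 component element names: <c> and <c01> through <c12>.
COMPONENT_NAME = /\Ac(0[1-9]|1[0-2])?\z/

# Returns the components that have neither a <unittitle> nor a <unitdate>
# directly inside their <did>. (Hypothetical helper, not an ArcLight API.)
def components_missing_title_and_date(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//*').select do |el|
    next false unless el.name.match?(COMPONENT_NAME)

    did = el.elements.to_a.find { |child| child.name == 'did' }
    labels = did ? did.elements.to_a.map(&:name) : []
    (labels & %w[unittitle unitdate]).empty?
  end
end
```

For example, components_missing_title_and_date(File.read('path/to/ead.xml')).size reports how many components would have nothing to display.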

EAD recommendations

  • Components should all have unique IDs. These IDs are used as "slugs" in the identifiers of the documents in ArcLight. Keeping them stable allows an EAD to be updated and re-indexed while each component keeps its URL (preserving user bookmarks, etc.). We will mint IDs for components that do not have them, but the minted ID is derived from the component's position within the hierarchy of the EAD. This means that if components are moved around, the metadata at a given URL may change in unexpected ways. See Customizing behavior of indexing components w/o IDs below for more info.
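To see how far a finding aid is from this recommendation, a short report along these lines may help. It is only a sketch: the method name is hypothetical, and it assumes component IDs are carried in the id attribute of <c> and <c01> through <c12> elements.

```ruby
require 'rexml/document'

# Matches EAD 2002 component element names: <c> and <c01> through <c12>.
COMPONENT_TAG = /\Ac(0[1-9]|1[0-2])?\z/

# Counts components without an id attribute and lists id values that
# occur more than once. (Hypothetical helper, not an ArcLight API.)
def component_id_report(xml)
  doc = REXML::Document.new(xml)
  components = REXML::XPath.match(doc, '//*').select { |el| el.name.match?(COMPONENT_TAG) }
  ids = components.map { |el| el.attributes['id'] }
  {
    missing: ids.count(&:nil?),
    duplicates: ids.compact.tally.select { |_id, count| count > 1 }.keys
  }
end
```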

Download sample EAD

First we need to download or access our EADs. Let's create a directory within our application where we can store them.

mkdir eads

Now let's add some data there.

# This command will save one of our test datasets to the directory you just created
wget -P eads/ https://raw.githubusercontent.com/projectblacklight/arclight/main/spec/fixtures/ead/nlm/alphaomegaalpha.xml

Repository configuration

Next we need to run our indexing task and tell it which "Repository" the EAD file belongs to. By default, your ArcLight application should have a generated config/repositories.yml file, which contains information about the repositories for your instance. For example, we want to link the EAD alphaomegaalpha.xml to the first repository in that file, nlm:

nlm:
  name: 'National Library of Medicine. History of Medicine Division'
  description: 'NLM’s History of Medicine Division collects, preserves, makes available, and interprets for diverse audiences one of the world’s richest collections of historical material related to human health and disease.'
  building: 'Building 38, Room 1E-21'
  address1: '8600 Rockville Pike'
  address2: ''
  city: 'Bethesda'
  state: 'MD'
  zip: '20894'
  country: 'USA'
  phone: ''
  contact_info: 'hmdref@nlm.nih.gov'
  thumbnail_url: "https://collections.nlm.nih.gov/pageturnerserver/ajaxp?theurl=http://localhost:8080/fedora/get/nlm:nlmuid-101421040-img/THUMB"
  google_request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
  google_request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"

We recommend that your config/repositories.yml contain only the repositories for which you have EADs to index.

Configuring a repository for Google Form Requests

ArcLight repositories can be configured so that items are requestable through Google Forms. To enable this functionality, add the following keys to your repository in config/repositories.yml under the request_types key (nested under google_form, as in the example below):

  • request_url - the URL of the user-facing version of your request form
  • request_mappings - an encoded string that maps your custom form fields to ArcLight fields. The configurable ArcLight fields are:
    • collection_name
    • collection_creator
    • eadid
    • containers

To get the Google Form field identifiers, use the "pre-filled" form to generate a URL whose query string has a format similar to the request_mappings format. See Google Forms support for more information.

An example of a correctly configured form looks like this:

  request_types:
    google_form:
      request_url: 'https://docs.google.com/a/stanford.edu/forms/d/e/1FAIpQLSeOamhY_IcFw4sPnz0ddwWWkrPaHbM5wp7JVbOLOL_mIusEyw/viewform'
      request_mappings: "document_url=entry.1980510262&collection_name=entry.619150170&collection_creator=entry.14428541&eadid=entry.996397105&containers=entry.1125277048&title=entry.862815208"
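The request_mappings value is laid out like a URL query string, so it can be inspected with Ruby's standard library. This sketch only illustrates the format; it does not claim to reproduce how ArcLight parses the value internally.

```ruby
require 'uri'

mappings = 'document_url=entry.1980510262&collection_name=entry.619150170&' \
           'collection_creator=entry.14428541&eadid=entry.996397105&' \
           'containers=entry.1125277048&title=entry.862815208'

# Decode the query-string style value into ArcLight field => Google Forms entry ID.
field_to_entry = URI.decode_www_form(mappings).to_h

field_to_entry.each { |field, entry| puts "#{field} -> #{entry}" }
```

Printing the pairs this way makes it easier to spot a field that was mapped to the wrong form entry.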

Configuring a repository for Aeon Web EAD requests

ArcLight repositories can be configured so that items are requestable through Aeon Web EAD requests. To enable this functionality, add the following keys to your repository in config/repositories.yml under the request_types key (nested under aeon_web_ead, as in the example below):

  • request_url - the URL of the Aeon instance that will handle the request
  • request_mappings - an encoded string of query parameters for your request. A value can be a method name, which is used to supply the EAD URL.

An example of a correctly configured repository looks like this:

  request_types:
    aeon_web_ead:
      request_url: 'https://sample.request.com'
      request_mappings: "Action=10&Form=31&Value=ead_url"

Indexing a single file

ArcLight now uses Traject for indexing EAD XML files. We can use the traject command to index our EAD.

REPOSITORY_ID=nlm bundle exec traject -u http://127.0.0.1:8983/solr/blacklight-core -i xml -c lib/arclight/traject/ead2_config.rb eads/alphaomegaalpha.xml

Traject requires arguments that specify XML input, the Solr instance, the path to the EAD file, and the path to the ArcLight Traject config file. Optionally, you can specify a REPOSITORY_ID environment variable to index an EAD under a certain repository.

  • -i xml specifies that we are indexing an XML file
  • -u denotes the URL to the Solr core
  • -c denotes the path to the ArcLight Traject config file
    • If you installed ArcLight as a gem, you can use $GEM_HOME to find the location. The config file should be at $GEM_HOME/gems/arclight-x.x.x/lib/arclight/traject/ead2_config.rb
    • If you are running from within a cloned repo, the config file is at lib/arclight/traject/ead2_config.rb
  • The final argument is the path to the XML file you are indexing
  • (optional) The REPOSITORY_ID environment variable indexes the EAD file under a repository as defined in repositories.yml.
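If you would rather not work out the versioned gem directory by hand, RubyGems can report it. This is a sketch that assumes the gem is installed under the name arclight and keeps the config at the relative path given above; the method name is hypothetical.

```ruby
require 'rubygems'

# Returns the path to ead2_config.rb inside the installed gem, or nil when
# the gem is not installed in the current environment.
# (Hypothetical helper; the relative path is taken from the list above.)
def ead2_config_path(gem_name = 'arclight')
  spec = Gem::Specification.find_by_name(gem_name)
  File.join(spec.gem_dir, 'lib', 'arclight', 'traject', 'ead2_config.rb')
rescue Gem::LoadError
  nil
end

puts ead2_config_path || 'arclight gem not found'
```

Run it with bundle exec ruby from your application directory so that the lookup uses the gems in your bundle, then pass the printed path to traject's -c option.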

Optionally, you can shorten this command a bit by using the rails arclight:index task, which will find the config file for you.

FILE=eads/alphaomegaalpha.xml SOLR_URL=http://127.0.0.1:8983/solr/blacklight-core REPOSITORY_ID=nlm bundle exec rails arclight:index

You can also export the arguments as environment variables so that they do not need to be repeated on every invocation.

export SOLR_URL=http://127.0.0.1:8983/solr/blacklight-core
export REPOSITORY_ID=nlm
FILE=eads/alphaomegaalpha.xml bundle exec rails arclight:index

Adding more finding aids and repositories

You can add new repositories to the config/repositories.yml file. The key that begins a repository is the same value you will use as the REPOSITORY_ID in the indexing rake task.

We recommend that you organize EADs by repository, placing each repository's files in a directory named after the repository's key. Then run the arclight:index_dir rake task with the DIR and REPOSITORY_ID environment variables to index all of the files into the same repository:

# this assumes there's a directory of EAD files at /tmp/sul-spec and a repository configured with the ID "sul-spec"
DIR=/tmp/sul-spec REPOSITORY_ID=sul-spec bundle exec rake arclight:index_dir
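With one directory per repository, the commands for a full index can be generated rather than typed. The sketch below assumes a hypothetical layout of eads/<repository key>/ and only prints the commands; it does not run them.

```ruby
require 'shellwords'

# Builds one arclight:index_dir command per subdirectory of base_dir, using
# the directory name as the REPOSITORY_ID. Plain files are skipped.
# (Hypothetical helper; the directory layout is an assumption.)
def index_dir_commands(base_dir)
  Dir.children(base_dir).sort.filter_map do |name|
    dir = File.join(base_dir, name)
    next unless File.directory?(dir)

    "DIR=#{Shellwords.escape(dir)} REPOSITORY_ID=#{Shellwords.escape(name)} " \
      'bundle exec rake arclight:index_dir'
  end
end

puts index_dir_commands('eads') if Dir.exist?('eads')
```

Each directory name must match a key in config/repositories.yml for the printed command to index into the intended repository.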

Configuring Downloads for Collections

The config/downloads.yml file configures the download links we provide for resources that can be generated from metadata indexed into the collection (e.g. PDF and EAD links). When you use the template key (instead of the href key), accessors from the SolrDocument class can be interpolated using the Ruby string format %{method_name}. This allows an ArcLight implementer to interpolate values from existing accessors or to write their own for custom URL generation (note that non-URL values will be URL escaped).

There is a default configuration that you can use to configure behavior for all collections.

default:
  pdf:
    template: http://example.com/%{unitid}.pdf
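The %{...} placeholder is standard Ruby string formatting. The sketch below shows the substitution with plain Ruby and an escaped value; it illustrates the mechanism only, and the exact escaping ArcLight applies is an assumption here.

```ruby
require 'uri'

template = 'http://example.com/%{unitid}.pdf'

# Escape the value as a URL form component, then substitute it into the template.
unitid = URI.encode_www_form_component('MS C 271')
url = format(template, unitid: unitid)

puts url # prints http://example.com/MS+C+271.pdf
```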

Collection specific behavior can be configured using the <unitid>. For example, if you have a Collection with the <unitid> of "MS C 271", you would provide links to the downloads and their sizes like so (note this is not using interpolation so a plain href key can be provided):

MS C 271:
  pdf:
    href: 'http://example.com/MS+C+271.pdf'
    size: '1.23MB'
  ead:
    href: 'http://example.com/MS+C+271.xml'
    size: 123456

If you need to remove links to a specific collection (or disable by default and enable for specific collections) you can set the disabled key to true. Note: the generated downloads.yml disables links by default.

MS C 271:
  disabled: true

The size of the download can be hardcoded in the size key (as above), or the name of an accessor on the Solr document can be provided (as a string). For instance, if your SolrDocument class has a #finding_aid_size method that returns the size of a file, you can reference it and its value will be used as the size in the download link text. Providing a size is optional.

MS C 271:
  pdf:
    template: http://example.com/%{pdf_id}.pdf
    size: finding_aid_size
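The accessors referenced above might look roughly like this. The class here is a stand-in so that the sketch runs on its own; in an application the methods would be added to the existing SolrDocument class. The Solr field names pdf_id_ssi and pdf_size_isi are hypothetical.

```ruby
# Stand-in for the application's SolrDocument class (an assumption made so
# that this example is self-contained).
class SolrDocument
  def initialize(fields = {})
    @fields = fields
  end

  # Hypothetical accessor interpolated by the template as %{pdf_id}.
  def pdf_id
    @fields['pdf_id_ssi']
  end

  # Hypothetical accessor referenced by "size: finding_aid_size".
  # Formats a byte count as megabytes with two decimal places.
  def finding_aid_size
    bytes = @fields['pdf_size_isi'].to_i
    format('%.2fMB', bytes / 1_000_000.0)
  end
end
```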

There are custom values that can be interpolated into the URL as well. Currently this includes repository_id which is the key that is being used in the repositories.yml configuration for that document's repository.

Because this uses string interpolation, the accessor can also return the entire URL (in which case the URL will not be escaped as other values are).

MS C 271:
  pdf:
    template: '%{finding_aid_url}'

Advanced: Using another Solr instance

If you have another Solr instance that you are using that's not on the default location on localhost, you can provide the SOLR_URL environment variable to index into that service:

SOLR_URL=http://solr.example.com/solr FILE=myead.xml REPOSITORY_ID=myid bundle exec rake arclight:index

Advanced: Purging your Solr instance

Normal indexing with ArcLight overwrites your existing content in Solr. If your content has changed, however, you may want to remove all of your Solr documents and then re-index your current content.

bundle exec rake arclight:destroy_index_docs
bundle exec rake arclight:index ...

Advanced: Customizing behavior of indexing components w/o IDs

While it is highly recommended that you index EAD that has consistent IDs for all components, we do mint an ID for you if we encounter a component without an ID. This can be customized in a few ways.

By default, the indexer builds something similar to an XPath to the component (including positional indexes so that each component always gets a unique value) and hashes it with SHA1 to create a hexdigest. This digest is then added to the ID of the collection to generate the document ID (as with documents that have their own IDs).

It's possible to use another algorithm by updating Arclight::HashAbsoluteXpath.hash_algorithm

Arclight::HashAbsoluteXpath.hash_algorithm = Digest::SHA256

This can be any object that will respond to #hexdigest with the value to be hashed as the parameter and return the hashed value.
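As an illustration of "any object", here is a hypothetical module that satisfies that contract by shortening a SHA-256 digest. The name ShortSha256 and the 16-character length are assumptions made for the example.

```ruby
require 'digest'

# Hypothetical replacement algorithm: responds to #hexdigest(value) and
# returns the first 16 hex characters of the SHA-256 digest.
module ShortSha256
  def self.hexdigest(value)
    Digest::SHA256.hexdigest(value)[0, 16]
  end
end
```

It would be assigned in the same way as above: Arclight::HashAbsoluteXpath.hash_algorithm = ShortSha256. Shorter digests give shorter URLs but raise the chance of two components colliding, so weigh the length carefully.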

An entirely different strategy can also be used by updating Arclight::MissingIdStrategy.selected

Arclight::MissingIdStrategy.selected = MyMissingIdStrategy

The class being used as a strategy can take the XML node as a parameter to the initializer and must return the minted ID (minus the collection ID, which will be automatically added) in response to the #to_hexdigest method.
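A strategy class meeting that description could be as small as the following sketch. Hashing the serialized node is an assumption chosen for illustration; it means the minted ID changes whenever the component's content is edited, which may or may not suit your data.

```ruby
require 'digest'

# Hypothetical strategy: receives the component's XML node and mints an ID
# from a SHA1 digest of the node's serialized form.
class MyMissingIdStrategy
  def initialize(node)
    @node = node
  end

  # Returns the minted ID without the collection ID prefix.
  def to_hexdigest
    Digest::SHA1.hexdigest(@node.to_s)
  end
end
```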

Advanced: component_identifier_format in ArcLight

ArcLight has default configuration for how IDs are minted for components from an EAD. These are used internally in the application for navigation. As of a breaking change in this PR, the default format for component IDs includes an underscore: <root_id>_<ref_id>. Here root_id is the root EAD document, and ref_id is a particular component in the hierarchy. In practice this looks something like umich-bhl-851981_aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16.

How to customize component_identifier_format

(This is also the process to retain the default format that existed prior to PR)

Provide the component_identifier_format setting in the ead2_component_config.rb file. You need to provide this as a Ruby “named format string”.

Our default looks like this:

provide 'component_identifier_format', '%<root_id>s_%<ref_id>s'

Examples of customization: If instead of umich-bhl-851981_aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16, you want aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16 (no root prefix), you could provide the following traject setting:

provide 'component_identifier_format', '%<ref_id>s'

If you want to retain the previous default format, you could provide the following traject setting (no underscore):

provide 'component_identifier_format', '%<root_id>s%<ref_id>s'
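These named format strings can be tried out in plain Ruby before you change the traject setting. The sketch uses the example identifiers from above with Ruby's Kernel#format; it does not show ArcLight's internal call.

```ruby
ids = { root_id: 'umich-bhl-851981', ref_id: 'aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16' }

with_underscore = format('%<root_id>s_%<ref_id>s', ids)
ref_only        = format('%<ref_id>s', ids)
no_underscore   = format('%<root_id>s%<ref_id>s', ids)

puts with_underscore # umich-bhl-851981_aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16
puts ref_only        # aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16
puts no_underscore   # umich-bhl-851981aspace_ffa8f2e89cab96c9fa8c25b55ddb1e16
```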

For implementers upgrading beyond v1.0.1

To incorporate the changes merged in this PR and start to use underscores in your component IDs, you will need to run a full reindex because these IDs are stored in Solr. Without doing this, navigating an EAD tree of components will break with routes not found (ArcLight will be looking for /catalog/aoa271_aspace_24d96d896c187b4e90ebb6c910f0462f when your component is stored as /catalog/aoa271aspace_24d96d896c187b4e90ebb6c910f0462f).

To retain the old-style IDs, customize the component_identifier_format as described above before running a full reindex.