Princeton Prosody Archive
The Princeton Prosody Archive is a Drupal-based site that allows users to browse, search, correct, and annotate a collection of several thousand works on the study of poetic meter and verse form. The current version of the archive includes mostly monographs written in English that are in the public domain in the United States. The code in this repository (and related repositories) was developed by Travis Brown, first as an independent contractor and then in his role as Assistant Director at the Maryland Institute for Technology in the Humanities, as part of a partnership between the Princeton Prosody Archive and MITH.
Most of the code developed for the project has been moved from this repository to a more general HathiTrust utilities project maintained by MITH (all Prosody Archive-specific code remains here).
Text and Structural Metadata
Most of the volumes in the archive are from the HathiTrust Digital Library. The Library delivered the textual content of these volumes to the Prosody Archive as a set of Pairtree archives containing METS records describing the page structure of each volume and zip files containing the individual page text (which in most cases was produced by an optical character recognition system without human supervision or correction).
The Prosody Archive has developed software for working with Pairtree archives (and specifically archives following the conventions used by the HathiTrust) in both Haskell and Scala. This repository also includes code for reading volume structure (including page types, numbers, etc.) out of the structure maps in the METS files.
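As a rough illustration of what working with a Pairtree archive involves, the sketch below maps a volume identifier to its directory path. It is not the project's code: the object and method names are hypothetical, the sample identifier is illustrative only, and it assumes the usual HathiTrust layout in which the identifier's namespace prefix is followed by a pairtree_root directory. It uses only the Scala standard library.

```scala
object PairtreePaths {
  // Characters that the Pairtree specification hex-encodes as ^hh.
  private val hexEncoded: Set[Char] = "\"*+,<=>?\\^|".toSet

  /** Clean an identifier using the two-step Pairtree character mapping. */
  def clean(id: String): String =
    id.getBytes("UTF-8").map { b =>
      val c = (b & 0xff).toChar
      if (c < 0x21 || c > 0x7e || hexEncoded(c)) f"^${b & 0xff}%02x"
      else c match {
        case '/'   => "="
        case ':'   => "+"
        case '.'   => ","
        case other => other.toString
      }
    }.mkString

  /**
   * Directory for a volume, given an identifier of the form "namespace.localId".
   * Assumes the identifier contains at least one dot (a MatchError is thrown otherwise).
   */
  def volumePath(htid: String): String = {
    val Array(namespace, localId) = htid.split("\\.", 2)
    val cleaned = clean(localId)
    // Split the cleaned identifier into two-character path segments.
    val segments = cleaned.grouped(2).mkString("/")
    s"$namespace/pairtree_root/$segments/$cleaned"
  }
}
```

For example, `PairtreePaths.volumePath("uc2.ark:/13960/t4mk66f1d")` yields `uc2/pairtree_root/ar/k+/=1/39/60/=t/4m/k6/6f/1d/ark+=13960=t4mk66f1d`. Under the layout assumed here, the METS record and page zip file for the volume would sit inside that directory.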
In some cases we have had to recover missing files or other data from the
HathiTrust Data API (application programming interface).
During the first
months of this phase of work on the Prosody Archive, access to the first
version of the Data API was disabled. Version 2 requires validation via a
somewhat unusual flavor of OAuth (one-legged
authentication) that is not supported by most OAuth client libraries.
The Prosody Archive has developed generalized code for one-legged
OAuth authentication in both Haskell (built on the
and Scala (built on
The Prosody Archive relies much more heavily on the HathiTrust Bibliographic API, since the metadata in the METS files is entirely structural and does not include any bibliographical information about authors, titles, dates of publication, etc. The Archive has developed code for requesting JSON from the Bibliographic API and parsing it into record and item metadata (see the Bibliographic API documentation for a full definition of these terms in this context; in short each record describes a bibliographic entity, while items are physical volumes).
Much of the bibliographic metadata of interest to the archive is not directly available in the JSON, but is instead embedded in the JSON as MARC records (serialized as XML and placed in a JSON string; if this sounds like an encoding nightmare, it is). The Archive has developed Scala code that uses the Argonaut library to parse the JSON and MARC4J to parse this embedded MARC XML.
Aligning Data and Metadata
Much of the code in this repository is designed to solve the problem of gathering pieces of data and metadata from all of these different sources into a single coherent model. Because processing all of this data for thousands of volumes can be time-consuming, the code uses Scalaz's disjunction and validation sum types to model failure at the value level instead of relying on exceptions. While this involves some syntactic overhead, it means you can start a processing run and come back ten minutes later to find a clean, comprehensive list of validation errors instead of a single useless stack trace.
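The idea of modeling failure as a value can be sketched without Scalaz. The example below swaps in the standard library's Either for Scalaz's validation type and accumulates every error across a batch instead of stopping at the first one. The Volume type, the field names, and the error messages are hypothetical and chosen only for illustration; the project's actual model is not shown here.

```scala
object ValidationSketch {
  // Hypothetical record type, for illustration only.
  final case class Volume(id: String, title: String, year: Int)

  def requireTitle(id: String, raw: String): Either[String, String] =
    if (raw.trim.nonEmpty) Right(raw.trim) else Left(s"$id: missing title")

  def parseYear(id: String, raw: String): Either[String, Int] =
    try Right(raw.trim.toInt)
    catch { case _: NumberFormatException => Left(s"$id: bad year '$raw'") }

  /** Check every field and keep all of the errors, not just the first. */
  def validate(id: String, title: String, year: String): Either[List[String], Volume] =
    (requireTitle(id, title), parseYear(id, year)) match {
      case (Right(t), Right(y)) => Right(Volume(id, t, y))
      case (t, y)               => Left(List(t, y).collect { case Left(e) => e })
    }

  /** Validate a whole batch, returning all errors alongside all valid volumes. */
  def validateAll(rows: List[(String, String, String)]): (List[String], List[Volume]) = {
    val results = rows.map { case (id, title, year) => validate(id, title, year) }
    val errors = results.collect { case Left(es) => es }.flatten
    val volumes = results.collect { case Right(v) => v }
    (errors, volumes)
  }
}
```

A run over a batch with one bad row still processes the other rows and reports both problems with the bad one. Scalaz's validation type provides the same accumulation through its applicative instance, with less hand-written pattern matching.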
The user-facing parts of the Archive are built on Drupal 7.
The Prosody Archive site uses the Drupal Bibliography Module to provide the Drupal data model and views for its contents. We extract the MARC XML records, enrich them with additional metadata from the METS files and Bibliographic API JSON, and load them into Drupal via the Bibliography Module's import functionality.
Similarly, in the first version of the site, we used the Solr backend for the Drupal Search API module, but this also proved unwieldy at the scale of several thousand volumes, and difficult to integrate with the faceted search provided by the Bibliography Module. In the current version we have followed the model that MITH used in the development of the Shelley-Godwin Archive, in which the search functionality is mostly managed on the client side in a Backbone.js application, which communicates with Solr through a proxy that only allows read-only queries.
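The read-only restriction on the proxy can be pictured as a simple allow-list check. The sketch below is an assumption about how such a rule might look, not a description of the proxy used by the site: the object name, the rule itself (GET requests to the select handler only), and the sample paths are all hypothetical.

```scala
object ReadOnlyProxyRule {
  // Hypothetical allow-list: only the Solr select handler is exposed.
  private val allowedHandlers = Set("select")

  /** True when the request is a GET whose final path segment is an allowed handler. */
  def isAllowed(method: String, path: String): Boolean = {
    // The path is assumed to arrive without its query string.
    val handler = path.split('/').filter(_.nonEmpty).lastOption
    method.equalsIgnoreCase("GET") && handler.exists(allowedHandlers)
  }
}
```

Under this rule a request such as `GET /solr/ppa/select` would be forwarded, while `POST /solr/ppa/update` or any other handler would be rejected before reaching Solr.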
In order to set up a new installation, you need to have the HathiTrust Pairtree structures
and Bibliographic API JSON metadata saved locally. You also need a spreadsheet that
lists the volumes that you want to be included in the archive (and which collections they
should be included in). See the
metadata directory for the Prosody Archive spreadsheet,
and see the MITH HathiTrust utilities repository for
information about how to access the Bibliographic API, etc.
You should have an installation of Drupal 7 available (there is currently a configured
but unloaded backup of the database in
The Drupal theme and module in the
drupal directory should be copied to the appropriate location (generally
sites/all/modules) and enabled via Drush.
solr-search/assets directory should be copied to
require.js file to
Next you'll need to create the MARC record files for import. First compile the application:
And then run the MARC generation:
java -jar core/target/scala-2.10/ppa-assembly-0.0.0-SNAPSHOT.jar \
  --marc metadata/ppa-volumes.xlsx
Next you'll need to navigate to the Drupal installation and run the Drush script to perform the import:
drush php-script ~/code/projects/prosody/scripts/import-marc.php
This will take a few minutes (and will display some warnings). Next run the following to add the relationships between records, volumes, and collections:
java -jar core/target/scala-2.10/ppa-assembly-0.0.0-SNAPSHOT.jar \
  --itemize metadata/ppa-volumes.xlsx
And finally run the Solr indexer:
java -jar core/target/scala-2.10/ppa-assembly-0.0.0-SNAPSHOT.jar \
  --index metadata/ppa-volumes.xlsx
This will take up to several hours, so you may want to use
nohup to avoid
broken connection errors:
nohup java -jar core/target/scala-2.10/ppa-assembly-0.0.0-SNAPSHOT.jar \
  --index metadata/ppa-volumes.xlsx &
And then the installation will be ready for use.
Note about version control
This repository uses Git subtrees
to include the history of customized third-party projects (the Drupal theme and
modules and the Solr client application). This allows upstream changes to be
incorporated with (relative) ease—just check out the relevant branch, pull from
the upstream repository, handle any conflicts, etc., switch back to
git merge --squash -s subtree --no-commit subproject-branch
See the documentation for more information about how subtrees work.
The MITH HathiTrust utilities library and the
project in this repository are released under the Apache License, Version 2.0.
The customized theme and modules in the
drupal directory are released under the
GNU General Public License, Version 2.
solr-search directory is released under the
MIT License. Please see the individual projects
in these directories for full information about copyright and licensing.