PDF Batch Ingest Guide

Eben English edited this page Sep 12, 2019 · 5 revisions

NewspaperWorks provides functionality for batch ingest of issue-level PDF files via a command-line rake task.

How to run it

To invoke the rake task, run the following command from the root directory of your application:

$ rake newspaper_works:ingest_pdf_issues -- --path=/path/to/your/pdf/batch

In addition to path, the rake task also accepts arguments for admin_set, depositor, and visibility, as in:

$ rake newspaper_works:ingest_pdf_issues -- --path=/path/to/your/pdf/batch --admin_set=admin_set/default --depositor=admin_user@example.com --visibility=open
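As an illustration of how flags of this shape can be read, here is a minimal sketch using Ruby's standard OptionParser. It is not taken from the NewspaperWorks source; the method name `parse_ingest_args` is hypothetical, and the actual task may parse its arguments differently.

```ruby
require 'optparse'

# Hypothetical helper (not part of NewspaperWorks): parse the flags that
# follow the `--` separator on the command line into a Hash.
def parse_ingest_args(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('--path=PATH')        { |v| options[:path] = v }
    opts.on('--admin_set=ID')     { |v| options[:admin_set] = v }
    opts.on('--depositor=USER')   { |v| options[:depositor] = v }
    opts.on('--visibility=VALUE') { |v| options[:visibility] = v }
  end
  # parse (without !) leaves the caller's array untouched
  parser.parse(argv)
  options
end

args = ['--path=/path/to/your/pdf/batch', '--visibility=open']
p parse_ingest_args(args)
```

Flags that are not supplied are simply absent from the returned Hash, which leaves room for the defaults described in the notes further down.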

What it does

When run, the rake task will:

  1. Create a NewspaperTitle object for the publication represented in the batch
  2. Iterate over the directories in the batch, creating a NewspaperIssue object for each PDF file
  3. Split the PDF into constituent pages, creating a NewspaperPage object for each
  4. Perform page-level OCR and word-coordinate analysis of the page text
  5. Attach existing page-level derivatives (ALTO, PDF, JSON, etc.) to the NewspaperPage objects
  6. Index OCR text to Solr for full-text searching
  7. Create a word-coordinate JSON derivative file to facilitate page-image search hit highlighting
  8. Add metadata to the created objects based on the directory and file names:
    • LCCN
    • publication title
    • place of publication
    • publication date
    • edition number

Notes:

  • If a NewspaperTitle object with the batch's LCCN already exists, the newly created objects will be associated with that existing NewspaperTitle.
  • If no admin_set is specified, the default AdminSet (admin_set/default) will be used.
  • If no depositor is specified, objects will have a depositor value of User.batch_user.user_key by default.
  • If no visibility is specified, objects will have a visibility value of open by default.
  • A log of the batch process is written to log/ingest.log in your application.
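The defaulting rules in these notes can be summarized in a short sketch. The method `resolve_ingest_options` and its `batch_user_key` parameter are hypothetical names used only for illustration; in a real application the fallback depositor would come from User.batch_user.user_key, which is passed in here as a plain string so the example runs on its own.

```ruby
# Hypothetical helper (not part of NewspaperWorks): fill in the documented
# defaults for any option the caller did not supply.
def resolve_ingest_options(opts, batch_user_key:)
  {
    path:       opts.fetch(:path), # required; raises KeyError if missing
    admin_set:  opts[:admin_set]  || 'admin_set/default',
    depositor:  opts[:depositor]  || batch_user_key,
    visibility: opts[:visibility] || 'open'
  }
end

resolved = resolve_ingest_options(
  { path: '/path/to/your/pdf/batch' },
  batch_user_key: 'batchuser@example.com' # placeholder value
)
p resolved
```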

Prerequisites

The ingest script makes the following assumptions:

  1. There MUST be a main/home parent directory that contains all PDF files.
  2. The name of this main/home directory MUST correspond to the LCCN for the publication.
  3. Each PDF file MUST correspond to a single edition of a single issue.
  4. Each PDF file MUST be named according to the following convention: YYYYMMDDEE.pdf, where:
    • YYYY represents the 4-digit year
    • MM represents the 2-digit month
    • DD represents the 2-digit day
    • EE represents the 2-digit edition number (default is 01)
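Before running a batch, it can help to check file and directory names against these conventions. The sketch below is one possible pre-flight check, not the parser NewspaperWorks itself uses; `parse_issue_filename` and `lccn_from_batch_path` are hypothetical names.

```ruby
require 'date'

# YYYYMMDDEE.pdf: 4-digit year, 2-digit month, 2-digit day, 2-digit edition
ISSUE_FILENAME = /\A(\d{4})(\d{2})(\d{2})(\d{2})\.pdf\z/

# Hypothetical helper: returns the publication date and edition number
# encoded in a file name, or nil if the name does not follow the convention.
def parse_issue_filename(filename)
  match = ISSUE_FILENAME.match(File.basename(filename))
  return nil unless match

  year, month, day, edition = match.captures.map(&:to_i)
  return nil unless Date.valid_date?(year, month, day)

  { publication_date: Date.new(year, month, day), edition_number: edition }
end

# Hypothetical helper: the batch directory's own name is taken as the LCCN.
def lccn_from_batch_path(path)
  File.basename(path)
end

p parse_issue_filename('2019091201.pdf')
p parse_issue_filename('issue-one.pdf') # does not match the convention
```

Names that fail the pattern or encode an impossible date (for example, month 13) return nil, so a batch can be screened for misnamed files before ingest.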

For an example of a PDF batch, see newspaper_works_fixtures.