Image Format and Derivative Notes

Jacob Reed edited this page Aug 5, 2020 · 7 revisions

This page gives a general description of how NewspaperWorks handles derivative and image creation for different ingest scenarios.

Ingest scenarios supported by NewspaperWorks

  • PDF issue upload (single or batch)
    • Automatic creation of constituent child Page works.
    • All derivatives enumerated below are created for each Page; a thumbnail is created for the Issue from the PDF.
  • NDNP batch ingest
    • Supports original NDNP batches that include a TIFF primary, as well as sample batches with the TIFF original omitted (in which case NewspaperWorks creates a TIFF based on the other images included in the sample batch for the page).
    • Pre-existing derivatives (ALTO, JP2, PDF) are ingested.
    • NewspaperIssue PDFs are created from PDF derivatives ingested into and stored on NewspaperPage works; issue PDFs are stored as a primary file.
    • All derivatives enumerated below are created for each page; a thumbnail is created for the issue from the compiled PDF.
  • TIFF or JP2 page batch upload
    • Automatic creation of parent Issue works.
    • All derivatives enumerated below are created for each page; a thumbnail is created for the Issue from the compiled PDF.

What Hyrax Does for Derivatives, Out of the Box

Hyrax does the following:

  • Stores derivatives on filesystem, not in Fedora.
    • In a "pairtree" directory structure (based on FileSet id) within tmp/derivatives directory of your app.
  • Keys derivatives per-FileSet on a "destination name", which is usually file extension.
    • Limitation: only one derivative file per extension.
  • Provides a means to lookup derivative storage path by FileSet id and destination name.
  • Ships a derivative creation service (run via the actor stack and asynchronous jobs) that:
    • is responsible for creating a thumbnail, and in some cases attempts to extract plain full text using Solr/Tika (for PDF primary files);
    • relies on the mime-type of the primary file, detected during characterization, to determine which kind of derivatives to create.
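The pairtree lookup described above can be sketched in a few lines of Ruby. This is a minimal illustration, not Hyrax's own code: the `derivative_path` helper is hypothetical, and it assumes the file extension is the same as the destination name (the page says this is usually, not always, the case).

```ruby
# Hypothetical helper: builds a pairtree-style path for a derivative.
# Assumption: the extension is the same string as the destination name.
def derivative_path(root, file_set_id, destination_name)
  # Split the id into two-character segments; the last may be one character.
  pairs = file_set_id.scan(/..?/)
  # All but the last segment become directories; the last segment
  # prefixes the file name.
  file_name = "#{pairs.last}-#{destination_name}.#{destination_name}"
  File.join(root, *pairs[0..-2], file_name)
end

puts derivative_path('tmp/derivatives', 'abc123xyz', 'jp2')
# prints tmp/derivatives/ab/c1/23/xy/z-jp2.jp2
```

Because the file name is derived from the destination name alone, two derivatives with the same destination name for one FileSet would map to the same path, which is the one-file-per-extension limitation noted above.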

Hyrax 2.x Limitations

  • The derivative service architecture of Hyrax is pluggable, but only for purposes of replacement. To achieve non-monolithic derivative service plugins, NewspaperWorks provides a configurable intermediary service that can run multiple derivative creation plugins.
  • Hyrax does not store derivatives in Fedora (instead, it uses a pairtree on the filesystem).
  • Hyrax does not create any image derivatives besides the thumbnail.
  • Hyrax does not appear to offer a configuration option for whether text extraction runs for PDF primary files. Much of this out-of-the-box text extraction may duplicate the more precise text extraction done by NewspaperWorks derivative plugins.

Added Value that NewspaperWorks Provides

  • NewspaperWorks generates constituent child pages (Newspaper Page work, as ordered members) of a Newspaper Issue work, and in these page works stores most derivative content.
  • NewspaperWorks provides a pluggable derivative service responsible for running, in sequence, both the default Hyrax derivative service and a handful of derivative service components that output the following types for Newspaper Pages:
    • JP2 Image
      • For grayscale or monochrome images, JP2 images conform as closely as possible to NDNP JP2 profile.
      • JP2 images are created using OpenJPEG 2.0 tools.
      • Details on JP2 profile configuration are provided later in this page.
    • JSON word coordinates (produced by running tesseract OCR and extracting the coordinates from the resulting hOCR).
    • Plain text (also via tesseract, preserving line breaks found in OCR)
      • This is indexed in Solr for page and issue works.
    • ALTO 2.0 XML
      • Also produced via tesseract; contains word coordinates, but no block or line segmentation. This conversion is lossy relative to the hOCR original, but is sufficient to preserve word coordinates for highlighting.
    • PDF (if primary file is not a PDF, creates 150ppi page images within)
    • Thumbnail (via Hyrax default plugin)
  • Some ingests (e.g. NDNP batch, TIFF/JP2 batch) will build an issue PDF from the constituent page PDF derivatives, once available.
    • Because in some cases one must wait for the page PDFs to be ready, there is a Rails 5.1+ dependency to support ActiveJob retry, used by the job composing multi-page issue PDFs.
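As an illustration of the JSON word-coordinates derivative listed above, the following sketch pulls word bounding boxes out of hOCR and serializes them. It is a simplified stand-in, not the NewspaperWorks implementation: the `word_coordinates` helper, the regular expression, and the output shape (a `coords` hash mapping each word to a list of `[x, y, width, height]` boxes) are all assumptions made for the example. It also assumes hOCR in which each `ocrx_word` span contains plain text with no nested markup.

```ruby
require 'json'

# Assumed hOCR shape: <span class='ocrx_word' ... title='bbox x1 y1 x2 y2; ...'>word</span>
WORD_PATTERN = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)<\/span>/

# Hypothetical helper: returns { word => [[x, y, width, height], ...] }.
def word_coordinates(hocr)
  coords = Hash.new { |hash, word| hash[word] = [] }
  hocr.scan(WORD_PATTERN) do |x1, y1, x2, y2, text|
    word = text.strip
    next if word.empty?
    left, top, right, bottom = [x1, y1, x2, y2].map(&:to_i)
    # hOCR gives two corners; convert to origin plus size.
    coords[word] << [left, top, right - left, bottom - top]
  end
  coords
end

sample = <<~HOCR
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Daily</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 88'>News</span>
HOCR

puts JSON.generate('coords' => word_coordinates(sample))
```

A production implementation would more likely use an XML parser than a regular expression; the regex keeps this sketch free of extra dependencies.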

The Hyrax Derivative Process, Out of the Box

  • (Hyrax) CharacterizeJob queues CreateDerivativesJob after characterization is complete.
  • (Hyrax) CreateDerivativesJob calls
    • FileSet.create_derivatives implemented by
      • Hyrax::FileSet::Derivatives mixin, which delegates the create_derivatives call to →
        • Hyrax::DerivativeService, a configurable runner of one or more actual derivative services, including →
          • Hyrax::FileSetDerivativesService, a class that selects what to create (using hydra-derivatives) based on mime-type (which presumes characterization of the work's files is already done).
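The mime-type selection at the end of this chain can be pictured roughly as follows. This is a deliberately reduced sketch covering only the two cases this page mentions (a thumbnail, plus full-text extraction for PDFs); the `derivative_kinds_for` method and its symbols are hypothetical and do not mirror the actual Hyrax::FileSetDerivativesService code.

```ruby
# Hypothetical dispatcher: maps a characterized mime-type to the kinds of
# derivatives that would be attempted. Simplified for illustration.
def derivative_kinds_for(mime_type)
  case mime_type
  when 'application/pdf' then %i[thumbnail extracted_text]
  when %r{\Aimage/}      then %i[thumbnail]
  else []
  end
end

p derivative_kinds_for('application/pdf') # prints [:thumbnail, :extracted_text]
p derivative_kinds_for('image/tiff')      # prints [:thumbnail]
```

Since the decision depends entirely on the mime-type, nothing useful can be selected until characterization has recorded it, which is why CharacterizeJob queues CreateDerivativesJob rather than the two running independently.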

How NewspaperWorks Extends the Hyrax Derivative Processing

  • While Hyrax::DerivativeService is "pluggable", it only allows a single plugin/service to run the actual derivative creation on its behalf.
  • NewspaperWorks modifies this arrangement by making that single "plugin" its NewspaperWorks::PluggableDerivativeService, which is in turn configured to support multiple service plugins that run in sequence. This is configured in lib/newspaper_works/engine in NewspaperWorks, but your application may similarly extend it with additional derivative plugins, as long as each injected service has an interface equivalent to Hyrax::DerivativeService (as all derivative service plugins shipped by NewspaperWorks do).
  • For NewspaperPage works, these plugins create a variety of text and image derivatives.
  • NewspaperWorks uses internal file attachment components to support pre-existing derivatives ingested from NDNP batch data.
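The arrangement described above, one service that fans out to several plugins in sequence, can be sketched as below. All class names here (`SequentialDerivativeService`, `RecordingPlugin`, `ThumbnailPlugin`, `JP2Plugin`) are hypothetical stand-ins rather than the NewspaperWorks classes, and the plugin interface shown (`valid?`, `create_derivatives`, `cleanup_derivatives`) is an assumed minimal subset of what a Hyrax-compatible derivative service exposes.

```ruby
# Hypothetical composite: looks like one derivative service from the
# outside, but delegates to each plugin in order.
class SequentialDerivativeService
  def initialize(file_set, plugins)
    @services = plugins.map { |plugin| plugin.new(file_set) }
  end

  def valid?
    @services.any?(&:valid?)
  end

  def create_derivatives(filename)
    @services.select(&:valid?).each { |service| service.create_derivatives(filename) }
  end

  def cleanup_derivatives
    @services.select(&:valid?).each(&:cleanup_derivatives)
  end
end

# Stand-in plugin that only records that it ran.
class RecordingPlugin
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end

  def create_derivatives(filename)
    @file_set[:log] << "#{self.class.name}: #{filename}"
  end

  def cleanup_derivatives; end
end

class ThumbnailPlugin < RecordingPlugin; end
class JP2Plugin < RecordingPlugin; end

file_set = { log: [] } # a plain Hash standing in for a FileSet
service = SequentialDerivativeService.new(file_set, [ThumbnailPlugin, JP2Plugin])
service.create_derivatives('page.tiff')
puts file_set[:log]
```

Keeping the composite's interface identical to that of a single plugin is what lets it be installed in the one slot Hyrax provides.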

Image Creation Trivia

  • Most scanned images of historic newspaper pages cannot be relied upon to have physical page dimensions (cm, inches) — both in TIFF images and PDF documents of scans. As such, pixels-per-inch is an imprecise measure. NewspaperWorks assumes that most pages, without any other geometry to work with, are 11.0" wide (Tabloid page size).
  • JP2 images are created by command-line tools of OpenJPEG 2.x (opj_compress, with fallback support to 1.x image_to_j2k command). Most TIFF images are transformed into intermediate/temporary NetPBM pgm/ppm files before encoding to JP2, due to tool input limitations.
  • JP2 creation flow looks like this:

(Figure: flow diagram for NewspaperWorks::JP2DerivativeService.create_derivatives)
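The 11.0-inch width assumption noted above leads to simple arithmetic for estimating resolution. The sketch below shows that arithmetic only; the method names are hypothetical and the rounding choice is an assumption, not something taken from the NewspaperWorks code.

```ruby
# Assumed page width when an image carries no reliable physical dimensions.
ASSUMED_PAGE_WIDTH_INCHES = 11.0

# Estimated pixels-per-inch for a scan of the given pixel width.
def estimated_ppi(pixel_width, page_width_inches = ASSUMED_PAGE_WIDTH_INCHES)
  pixel_width / page_width_inches
end

# Pixel width needed to render the assumed page at a target resolution.
def pixel_width_at(ppi, page_width_inches = ASSUMED_PAGE_WIDTH_INCHES)
  (ppi * page_width_inches).round
end

puts estimated_ppi(3300)  # prints 300.0
puts pixel_width_at(150)  # prints 1650
```

Under this assumption, a 150ppi page image of the kind embedded in the PDF derivative would be 1650 pixels wide.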

System dependencies for image processing by NewspaperWorks

  • Ubuntu 18.04 ("bionic"), Ubuntu 16.04 ("xenial"), or later:
    • ghostscript
    • poppler-utils
    • tesseract-ocr
    • libopenjp2-tools
    • imagemagick
  • Ubuntu 14.04 ("trusty") substitutions:
    • libopenjpeg-tools (1.x, in lieu of libopenjp2-tools 2.x)
    • Note: Ubuntu 14.04 is no longer officially supported or tested by the NewspaperWorks development team.
  • macOS / homebrew:
    • openjpeg
    • imagemagick
    • poppler
    • tesseract

JP2 profiles (for OpenJPEG)

  • NewspaperWorks uses OpenJPEG tools to encode JPEG 2000 images, instead of kakadu (not open-source) or jasper. This choice is driven in large part by the availability of OpenJPEG in most major Linux distributions.

Below are example OpenJPEG commands used internally by NewspaperWorks. They are included here for reference and/or verification of conformity (e.g. NDNP).

Encoding NDNP Profile Grayscale JP2 with OpenJPEG 2.x opj_compress

opj_compress -i out.pgm -o out_2k.jp2 \
  -d 0,0 -b 64,64 -n 6 -p RLCP -t 1024,1024 -I -M 1 \
  -r 64,53.821,45.249,40,32,26.911,22.630,20,16,14.286,11.364,10,8,6.667,5.556,4.762,4,3.333,2.857,2.500,2,1.667,1.429,1.190,1

Encoding Color JP2 with OpenJPEG opj_compress

opj_compress -i color.ppm -o color_2k.jp2 \
  -d 0,0 -b 64,64 -n 6 -p RPCL -t 1024,1024 -I -M 1 \
  -r 2.4,1.48331273,.91673033,.56657224,.35016049,.21641118,.13374944,.0944,.08266171
  • The examples presume a pgm/ppm source file, which NewspaperWorks creates from the original TIFF or PDF primary files.
  • image_to_j2k is the fallback command (OpenJPEG 1.x), usually for older OS distributions.
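For reference, the two opj_compress invocations above can be assembled programmatically. The sketch below uses only the flags and rate lists shown on this page; the `opj_compress_args` helper and the constant names are hypothetical, and this is not a copy of how NewspaperWorks builds its command internally.

```ruby
require 'shellwords'

# Rate lists copied from the example commands on this page.
NDNP_GRAY_RATES = '64,53.821,45.249,40,32,26.911,22.630,20,16,14.286,11.364,' \
                  '10,8,6.667,5.556,4.762,4,3.333,2.857,2.500,2,1.667,1.429,1.190,1'
COLOR_RATES = '2.4,1.48331273,.91673033,.56657224,.35016049,.21641118,' \
              '.13374944,.0944,.08266171'

# Hypothetical helper: returns the argument list for opj_compress.
# The grayscale and color examples differ only in progression order
# (-p) and rates (-r).
def opj_compress_args(source, destination, grayscale:)
  [
    'opj_compress', '-i', source, '-o', destination,
    '-d', '0,0', '-b', '64,64', '-n', '6',
    '-p', grayscale ? 'RLCP' : 'RPCL',
    '-t', '1024,1024', '-I', '-M', '1',
    '-r', grayscale ? NDNP_GRAY_RATES : COLOR_RATES
  ]
end

# Print the command line rather than running it, so the sketch works
# without OpenJPEG installed.
puts Shellwords.join(opj_compress_args('out.pgm', 'out_2k.jp2', grayscale: true))
```

To actually run the encoder, the array could be passed to `system(*args)`, which avoids shell quoting issues with file names.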