Image Format and Derivative Notes
Jacob Reed edited this page Aug 5, 2020 · 7 revisions
This page describes, in general terms, how NewspaperWorks handles derivative and image creation for different ingest scenarios.
- PDF issue upload (single or batch)
- Automatic creation of constituent child Page works.
- All derivatives enumerated below are created for each Page; a thumbnail is created for the Issue from the PDF.
- NDNP batch ingest
- Supports original NDNP batches including a TIFF primary, as well as sample batches with the TIFF original omitted (in which case, NewspaperWorks creates a TIFF for use, based on other images included in the sample batch for the page).
- Pre-existing derivatives (ALTO, JP2, PDF)
- NewspaperIssue PDFs are created from PDF derivatives ingested into and stored on NewspaperPage works; issue PDFs are stored as a primary file.
- All derivatives enumerated below are created for each page; a thumbnail is created for the issue from the compiled PDF.
- TIFF or JP2 page batch upload
- Automatic creation of parent Issue works.
- All derivatives enumerated below are created for each page; a thumbnail is created for the Issue from the compiled PDF.
Hyrax does the following:
- Stores derivatives on filesystem, not in Fedora.
- In a "pairtree" directory structure (based on FileSet id) within the tmp/derivatives directory of your app.
- Keys derivatives per-FileSet on a "destination name", which is usually file extension.
- Limitation: only one derivative file per extension.
- Provides a means to lookup derivative storage path by FileSet id and destination name.
- Ships a derivative creation service (run via the actor stack and asynchronous jobs) that:
- is responsible for creating a thumbnail, and in some cases attempts to extract plain full text using Solr/Tika (for PDF primary files);
- relies on the mime-type of the primary file, detected during characterization, to determine which kind of derivatives to create.
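The pairtree layout and the lookup by FileSet id and destination name described above can be illustrated with a minimal sketch. The helper name `pairtree_derivative_path`, the sample id, and the exact file naming are assumptions made for illustration; this is not Hyrax's own API, and the real layout should be checked against your Hyrax version.

```ruby
require "pathname"

# Hypothetical helper (not part of Hyrax): derive a pairtree-style path for a
# derivative from a FileSet id and a destination name.
# Assumption: the id is split into two-character segments, and the last
# segment is joined to the file name with a hyphen.
def pairtree_derivative_path(root, file_set_id, destination_name, extension = destination_name)
  # "abc123xyz" -> ["ab", "c1", "23", "xy", "z"]
  segments = file_set_id.scan(/..?/)
  file_name = "#{segments.last}-#{destination_name}.#{extension}"
  Pathname.new(root).join(*segments[0..-2]).join(file_name)
end

# One path per (FileSet id, destination name) pair, which is why only one
# derivative per destination name can be stored for a given FileSet.
puts pairtree_derivative_path("tmp/derivatives", "abc123xyz", "jp2")
puts pairtree_derivative_path("tmp/derivatives", "abc123xyz", "thumbnail", "jpeg")
```

Because the path depends only on the id and the destination name, two derivatives sharing a destination name for the same FileSet would map to the same file.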
- The derivative service architecture of Hyrax is pluggable, but only for purposes of replacement. To achieve non-monolithic derivative service plugins, NewspaperWorks provides a configurable intermediary service that can run multiple derivative creation plugins.
- Hyrax does not store derivatives in Fedora (instead, it uses a pairtree on the filesystem).
- Hyrax does not create any image derivatives other than the thumbnail.
- Hyrax does not appear to make text extraction for PDF primary files configurable. Much of this out-of-the-box text extraction may duplicate the more precise text extraction done by NewspaperWorks derivative plugins.
- NewspaperWorks generates constituent child pages (Newspaper Page work, as ordered members) of a Newspaper Issue work, and in these page works stores most derivative content.
- NewspaperWorks provides a pluggable derivative service responsible for running, in sequence, both the default Hyrax derivative service and a handful of derivative service components that output the following types for Newspaper Pages:
- JP2 Image
- For grayscale or monochrome images, JP2 images conform as closely as possible to NDNP JP2 profile.
- JP2 images are created using OpenJPEG 2.0 tools.
- Details on JP2 profile configuration are provided later in this page.
- JSON word coordinates (extracted from the hOCR output produced by running tesseract OCR).
- Plain text (also via tesseract, preserving line breaks found in OCR)
- This is indexed in Solr for page and issue works.
- ALTO 2.0 XML (also via tesseract, containing word coordinates, but without block or line segmentation). This conversion is lossy relative to the hOCR original, but is sufficient to preserve word coordinates for highlighting.
- PDF (if primary file is not a PDF, creates 150ppi page images within)
- Thumbnail (via Hyrax default plugin)
- Some ingests (e.g. NDNP batch, TIFF/JP2 batch) will build an issue PDF from the constituent page PDF derivatives, once available.
- Because in some cases one must wait for the page PDFs to be ready, there is a Rails 5.1+ dependency to support ActiveJob retry, used by the job composing multi-page issue PDFs.
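Returning to the JSON word coordinates derivative listed above, the following is a rough sketch of how word boxes might be pulled out of tesseract-style hOCR. The function name `word_coordinates`, the regex-based parsing, and the output shape (`word` plus `[x, y, width, height]`) are assumptions for illustration only; the actual NewspaperWorks plugin and its JSON schema may differ, and a real implementation would more likely use an XML parser.

```ruby
require "json"

# Matches tesseract-style hOCR word spans, e.g.
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 93'>Daily</span>
WORD_SPAN = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>(.*?)<\/span>/m

# Hypothetical helper: returns one hash per word with its bounding box
# converted from hOCR corners (x0 y0 x1 y1) to [x, y, width, height].
def word_coordinates(hocr)
  words = hocr.scan(WORD_SPAN).map do |x0, y0, x1, y1, inner|
    text = inner.gsub(/<[^>]+>/, "").strip # drop inline markup such as <strong>
    left, top = x0.to_i, y0.to_i
    { "word" => text, "coordinates" => [left, top, x1.to_i - left, y1.to_i - top] }
  end
  words.reject { |w| w["word"].empty? }
end

sample = <<~HOCR
  <span class='ocr_line' title='bbox 30 90 400 120'>
    <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 93'>Daily</span>
    <span class='ocrx_word' id='word_1_2' title='bbox 110 92 190 116; x_wconf 90'>News</span>
  </span>
HOCR

puts JSON.generate("words" => word_coordinates(sample))
```

Keeping only words and boxes mirrors the point made above about lossy conversion: block and line structure is discarded, but enough remains to highlight search hits on a page image.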
- (Hyrax) CharacterizeJob queues CreateDerivativesJob after characterization is complete.
- (Hyrax) CreateDerivativesJob calls →
- FileSet.create_derivatives, implemented by →
- the Hyrax::FileSet::Derivatives mixin, which delegates the create_derivatives call to →
- Hyrax::DerivativeService, a configurable runner of one or more actual derivative services, including →
- Hyrax::FileSetDerivativesService, a class that selects what to create (using hydra-derivatives) based on mime-type (which presumes characterization of the work's files is already done).
- While Hyrax::DerivativeService is "pluggable", it only allows a single plugin/service to run actual derivative creation on its behalf.
- NewspaperWorks modifies this arrangement by making that single "plugin" its NewspaperWorks::PluggableDerivativeService, which is in turn configured to support multiple service plugins that run in sequence. This is configured in lib/newspaper_works/engine in NewspaperWorks, but may similarly be extended with additional derivative plugins by your application, as long as the injected service has an interface equivalent to Hyrax::DerivativeService (which all derivative service plugins shipped by NewspaperWorks do).
- For NewspaperPage works, these plugins create a variety of text and image derivatives.
- NewspaperWorks uses internal file attachment components to support pre-existing derivatives ingested from NDNP batch data.
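The idea of a single "plugin" that fans out to several plugins in sequence can be sketched in plain Ruby. All class names below (`SequentialDerivativeRunner`, `ToyPlugin`, `ToyJP2`, `ToyText`) are hypothetical, and the method set (`valid?`, `create_derivatives`, `cleanup_derivatives`) is an assumption about what "an interface equivalent to Hyrax::DerivativeService" means; this is not the NewspaperWorks implementation.

```ruby
# Hypothetical runner: presents one service-like object to the caller,
# while delegating to several plugin classes in the order given.
class SequentialDerivativeRunner
  def initialize(file_set, plugin_classes)
    @services = plugin_classes.map { |klass| klass.new(file_set) }
  end

  # The runner applies if at least one plugin applies to this file set.
  def valid?
    @services.any?(&:valid?)
  end

  # Run every applicable plugin, in configured order.
  def create_derivatives(filename)
    @services.select(&:valid?).each { |service| service.create_derivatives(filename) }
  end

  def cleanup_derivatives
    @services.select(&:valid?).each(&:cleanup_derivatives)
  end
end

# Toy plugins that only record what they were asked to do.
class ToyPlugin
  LOG = []

  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end

  def create_derivatives(filename)
    LOG << "#{self.class.name}: #{filename}"
  end

  def cleanup_derivatives; end
end

class ToyJP2 < ToyPlugin; end
class ToyText < ToyPlugin; end

runner = SequentialDerivativeRunner.new(:some_file_set, [ToyJP2, ToyText])
runner.create_derivatives("page.tif")
puts ToyPlugin::LOG
```

Because the runner exposes the same methods as each plugin, it can occupy the single slot the host framework offers while the list of plugin classes stays configurable.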
- Most scanned images of historic newspaper pages cannot be relied upon to have physical page dimensions (cm, inches) — both in TIFF images and PDF documents of scans. As such, pixels-per-inch is an imprecise measure. NewspaperWorks assumes that most pages, without any other geometry to work with, are 11.0" wide (Tabloid page size).
- JP2 images are created by command-line tools of OpenJPEG 2.x (opj_compress, with fallback support to the 1.x image_to_j2k command). Most TIFF images are transformed into intermediate/temporary NetPBM pgm/ppm files before encoding to JP2, due to tool input limitations.
- JP2 creation flow looks like this:
- Ubuntu 18.04 ("bionic"), Ubuntu 16.04 ("xenial"), or later:
- ghostscript
- poppler-utils
- tesseract-ocr
- libopenjp2-tools
- imagemagick
- Ubuntu 14.04 (trusty) substitutions:
- libopenjpeg-tools (1.x, in lieu of libopenjp2-tools 2.x)
- Note: Ubuntu 14.04 is no longer officially supported or tested by the NewspaperWorks development team.
- macOS / homebrew:
- openjpeg
- imagemagick
- poppler
- tesseract
- NewspaperWorks uses OpenJPEG tools to encode JPEG 2000, instead of kakadu (not open-source) or jasper. This choice is driven in large part by the availability of OpenJPEG in most major Linux distributions.
Below are example OpenJPEG commands of the kind NewspaperWorks uses internally. They are included here for reference and/or verification of conformity (e.g. NDNP).
opj_compress -i out.pgm -o out_2k.jp2 \
-d 0,0 -b 64,64 -n 6 -p RLCP -t 1024,1024 -I -M 1 \
-r 64,53.821,45.249,40,32,26.911,22.630,20,16,14.286,11.364,10,8,6.667,5.556,4.762,4,3.333,2.857,2.500,2,1.667,1.429,1.190,1
opj_compress -i color.ppm -o color_2k.jp2 \
-d 0,0 -b 64,64 -n 6 -p RPCL -t 1024,1024 -I -M 1 \
-r 2.4,1.48331273,.91673033,.56657224,.35016049,.21641118,.13374944,.0944,.08266171
- Examples presume a pgm/ppm source file, which NewspaperWorks creates from original TIFF or PDF primary files.
- image_to_j2k is the fallback command (1.x), usually for older OS distributions.
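As a rough sketch, the two example commands above could be assembled programmatically as follows. The function name `opj_compress_command` and the choice to pick the color or grayscale profile from the source file extension (.ppm vs .pgm) are assumptions for illustration; the flags and rate lists are copied from the examples above, and NewspaperWorks' own code may decide differently.

```ruby
require "shellwords"

# Rate lists copied from the example commands above.
GRAY_RATES = "64,53.821,45.249,40,32,26.911,22.630,20,16,14.286,11.364,10,8," \
             "6.667,5.556,4.762,4,3.333,2.857,2.500,2,1.667,1.429,1.190,1"
COLOR_RATES = "2.4,1.48331273,.91673033,.56657224,.35016049,.21641118," \
              ".13374944,.0944,.08266171"

# Hypothetical helper: build the opj_compress argument list for a NetPBM source.
# Assumption: a .ppm source is treated as color, anything else as grayscale.
def opj_compress_command(source, dest)
  color = File.extname(source).downcase == ".ppm"
  [
    "opj_compress", "-i", source, "-o", dest,
    "-d", "0,0", "-b", "64,64", "-n", "6",
    "-p", color ? "RPCL" : "RLCP",
    "-t", "1024,1024", "-I", "-M", "1",
    "-r", color ? COLOR_RATES : GRAY_RATES
  ]
end

# Print the commands rather than running them, so the sketch works
# without OpenJPEG installed.
puts opj_compress_command("out.pgm", "out_2k.jp2").shelljoin
puts opj_compress_command("color.ppm", "color_2k.jp2").shelljoin
```

Building the command as an array means it can be passed to `system(*command)` without shell quoting concerns, should you want to execute it.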