Article segmentation in METS ALTO

Eben English edited this page Dec 17, 2019 · 3 revisions

Article segmentation in newspaper digitization may take a variety of forms. Due to the expense associated with this level of digitization and analysis, not many institutions have content in this format, so establishing standard models of content representation is tricky. This guide should be considered a work in progress.

METS + ALTO with article segmentation

In this case, image files are created at the page level. Each page has at least one type of image file (typically either TIF or JP2, sometimes both) and a corresponding ALTO XML file. There is also an accompanying issue-level METS XML file that records the page file order and provides article-level metadata for each article in the issue.

Page-level ALTO

The key elements in a page-level ALTO file are <TextBlock> (typically paragraphs) and <ComposedBlock> (images, possibly with accompanying captions).

<TextBlock> example:

<alto>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="4754" WIDTH="3716" PC="0.911">
      <PrintSpace ID="P1_PS00001" HPOS="146" VPOS="191" WIDTH="3475" HEIGHT="4323">
        <TextBlock ID="P1_TB00013" HPOS="834" VPOS="620" WIDTH="926" HEIGHT="178" language="en" STYLEREFS="TXT_6 PAR_LEFT">
          <TextLine ID="P1_TL00130" HPOS="837" VPOS="622" WIDTH="923" HEIGHT="75">
            <String ID="P1_ST00791" HPOS="837" VPOS="622" WIDTH="537" HEIGHT="72" CONTENT="Hovenanian" WC="0.79" CC="0651020030"/>

To obtain the text, read the @CONTENT attribute of each <String> element found under <TextBlock><TextLine>.
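
As a minimal sketch of this text extraction (not a definitive implementation), the Python below uses the standard library's xml.etree.ElementTree. The function names local_name and text_block_text are hypothetical, and the sample is a trimmed version of the example above with a second <String> ("said", ID P1_ST00792) invented purely for illustration. Tag names are compared without their namespace, on the assumption that real ALTO files are usually namespaced. Hyphenation and spacing elements are ignored here.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the second <String> is invented for illustration only.
ALTO_SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace ID="P1_PS00001">
        <TextBlock ID="P1_TB00013">
          <TextLine ID="P1_TL00130">
            <String ID="P1_ST00791" CONTENT="Hovenanian"/>
            <String ID="P1_ST00792" CONTENT="said"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def text_block_text(alto_root, block_id):
    """Return the text of one TextBlock, one output line per TextLine.

    Returns None when no TextBlock with the given ID exists.
    """
    for block in alto_root.iter():
        if local_name(block.tag) == "TextBlock" and block.get("ID") == block_id:
            lines = []
            for line in block.iter():
                if local_name(line.tag) != "TextLine":
                    continue
                # Join the CONTENT of every String in this line with spaces.
                words = [
                    s.get("CONTENT", "")
                    for s in line.iter()
                    if local_name(s.tag) == "String"
                ]
                lines.append(" ".join(words))
            return "\n".join(lines)
    return None


root = ET.fromstring(ALTO_SAMPLE)
print(text_block_text(root, "P1_TB00013"))
```

Looking blocks up by @ID keeps this usable with the METS structural metadata, which refers to blocks by the same @ID values.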

<ComposedBlock> example:

<alto>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="4754" WIDTH="3716" PC="0.911">
      <PrintSpace ID="P1_PS00001" HPOS="146" VPOS="191" WIDTH="3475" HEIGHT="4323">
        <ComposedBlock ID="P1_CB00001" HPOS="152" VPOS="547" WIDTH="565" HEIGHT="648" STYLEREFS="TXT_12 PAR_LEFT" TYPE="Illustration">
          <GraphicalElement ID="P1_CB00001_SUB" HPOS="152" VPOS="547" WIDTH="565" HEIGHT="648"/>
        </ComposedBlock>

Issue-level METS

The METS file contains bibliographic metadata for each article, and structural metadata indicating which <TextBlock> and <ComposedBlock> elements from the page-level ALTO files belong to each article.

Specifically, the <area @BEGIN> attribute value corresponds to the @ID value of <TextBlock> and <ComposedBlock> elements from the page-level ALTO files.

Bibliographic metadata for articles example:

<mets>
  <dmdSec ID="MODSMD_ARTICLE1">
    <mdWrap MIMETYPE="text/xml" MDTYPE="MODS" LABEL="Bibliographic meta-data of article 0">
      <xmlData>
        <MODS:mods>
          <MODS:titleInfo ID="MODSMD_ARTICLE1_TI1" xml:lang="en">
            <MODS:title>Council reverses itself; will pay Liacos’ fee for Largey investigation</MODS:title>
          </MODS:titleInfo>
          <MODS:language>
            <MODS:languageTerm type="code" authority="rfc3066">en</MODS:languageTerm>
          </MODS:language>
        </MODS:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <dmdSec ID="MODSMD_PICT1">
    <mdWrap MIMETYPE="text/xml" MDTYPE="MODS" LABEL="Bibliographic meta-data of picture">
      <xmlData>
        <MODS:mods>
          <MODS:titleInfo ID="MODSMD_PICT1_TI1" xml:lang="en">
            <MODS:title>CORCORAN</MODS:title>
          </MODS:titleInfo>
        </MODS:mods>
      </xmlData>
    </mdWrap>
  </dmdSec>

Structural metadata example:

<mets>
  <structMap LABEL="Logical Structure" TYPE="LOGICAL">
    <div ID="DIVL1" TYPE="Newspaper" LABEL="Cambridge Chronicle no.  04.01.1973">
      <div ID="DIVL2" TYPE="VOLUME" DMDID="MODSMD_PRINT MODSMD_ELEC" LABEL="Cambridge Chronicle no. 04.01.1973">
        <div ID="DIVL3" TYPE="ISSUE" DMDID="MODSMD_ISSUE1" LABEL="Cambridge Chronicle no. 04.01.1973">
        <div ID="DIVL12" TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1" LABEL="Council reverses itself; will pay Liacos’ fee for Largey investigation">
          <div ID="DIVL13" TYPE="HEADING">
            <div ID="DIVL14" TYPE="TITLE">
              <fptr>
                <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00006"/>
              </fptr>
            </div>
          </div>
          <div ID="DIVL15" TYPE="BODY">
            <div ID="DIVL16" TYPE="BODY_CONTENT">
              <div ID="DIVL17" TYPE="PARAGRAPH" ORDER="1">
                <div ID="DIVL18" TYPE="TEXT">
                  <fptr>
                    <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00007"/>
                  </fptr>
                </div>
              </div>
              <div ID="DIVL19" TYPE="PARAGRAPH" ORDER="2">
                <div ID="DIVL20" TYPE="TEXT">
                  <fptr>
                    <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00008"/>
                  </fptr>
                </div>
              </div>
            </div>
            <div ID="DIVL21" TYPE="ILLUSTRATION" ORDER="1" DMDID="MODSMD_PICT1" LABEL="CORCORAN">
              <div ID="DIVL22" TYPE="IMAGE">
                <fptr>
                  <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_CB00001"/>
                </fptr>
              </div>
              <div ID="DIVL23" TYPE="CAPTION">
                <fptr>
                  <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00009"/>
                </fptr>
              </div>
            </div>
          </div>
        </div>
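
To illustrate how the structural metadata could be read, here is a rough Python sketch that collects, for each <div TYPE="ARTICLE">, the ordered (FILEID, BEGIN) pairs of its <area> elements. The function name article_block_refs is hypothetical, the sample is a shortened form of the example above, and namespaces are stripped on the assumption that real METS files are namespaced.

```python
import xml.etree.ElementTree as ET

# Shortened form of the structural metadata example above.
METS_SAMPLE = """<mets>
  <structMap LABEL="Logical Structure" TYPE="LOGICAL">
    <div ID="DIVL12" TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1">
      <div ID="DIVL14" TYPE="TITLE">
        <fptr>
          <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00006"/>
        </fptr>
      </div>
      <div ID="DIVL22" TYPE="IMAGE">
        <fptr>
          <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_CB00001"/>
        </fptr>
      </div>
    </div>
  </structMap>
</mets>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def article_block_refs(mets_root):
    """Map each ARTICLE div ID to its (FILEID, BEGIN) pairs in document order."""
    articles = {}
    for div in mets_root.iter():
        if local_name(div.tag) == "div" and div.get("TYPE") == "ARTICLE":
            articles[div.get("ID")] = [
                (area.get("FILEID"), area.get("BEGIN"))
                for area in div.iter()
                if local_name(area.tag) == "area"
            ]
    return articles


print(article_block_refs(ET.fromstring(METS_SAMPLE)))
```

Each BEGIN value can then be matched against the @ID of a <TextBlock> or <ComposedBlock> in the ALTO file identified by FILEID.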

Notes

We will need to:

  • use the corresponding <TextBlock> and <ComposedBlock> elements to establish which NewspaperPage object(s) should be associated with a NewspaperArticle
  • extract the bounding box values of the <TextBlock> and <ComposedBlock> elements that make up an article
    • What type of structure would this have?
    • Data modeling could get tricky -- articles may have multiple bounding boxes on multiple pages, so we need to be able to associate a bounding box (or array of boxes) with a NewspaperPage id.
    • Should these be combined into a single uber-box for each page?
    • Should these values be stored as metadata on NewspaperArticle objects?
    • Should we create an additional JSON derivative file to hold this data?
      • Would be stored with NewspaperArticle
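
On the "single uber-box" question, a small Python sketch of how the boxes on one page could be merged. It assumes each box is encoded as [minX, minY, width, height] in ALTO pixel coordinates; the function name union_box is hypothetical. The two input boxes are the <TextBlock> and <ComposedBlock> coordinates from the ALTO examples above.

```python
def union_box(boxes):
    """Return the smallest [minX, minY, width, height] box enclosing all boxes."""
    min_x = min(x for x, y, w, h in boxes)
    min_y = min(y for x, y, w, h in boxes)
    max_x = max(x + w for x, y, w, h in boxes)
    max_y = max(y + h for x, y, w, h in boxes)
    return [min_x, min_y, max_x - min_x, max_y - min_y]


# P1_TB00013 and P1_CB00001 from the ALTO examples.
print(union_box([[834, 620, 926, 178], [152, 547, 565, 648]]))
# -> [152, 547, 1608, 648]
```

One trade-off to weigh: when an article is not laid out as a rectangle, the merged box will also cover regions that belong to neighbouring articles, so keeping the individual boxes preserves more information.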

Possible data structure:

  • we don't try to list actual Page ids here; the outer array just keeps the pages in order
  • boxes encoded as [minX, minY, width, height]
{
  "pages": [
    [
      [834, 620, 926, 178],
      [146, 191, 3475, 4323]
    ],
    [
      [753, 834, 549, 404]
    ]
  ]
}
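
A sketch of how such a structure might be assembled in Python, under several assumptions: the function names block_boxes and article_pages are hypothetical, the (FILEID, BEGIN) pairs have already been read from the METS file, each FILEID corresponds to one page, and pages appear in the order they are first referenced. The sample ALTO content reuses the block coordinates from the examples above.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed ALTO sample using the block coordinates from the examples above.
ALTO_SAMPLE = """<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace ID="P1_PS00001">
        <TextBlock ID="P1_TB00013" HPOS="834" VPOS="620" WIDTH="926" HEIGHT="178"/>
        <ComposedBlock ID="P1_CB00001" HPOS="152" VPOS="547" WIDTH="565" HEIGHT="648"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def block_boxes(alto_root):
    """Map each TextBlock/ComposedBlock ID to [HPOS, VPOS, WIDTH, HEIGHT]."""
    boxes = {}
    for el in alto_root.iter():
        if local_name(el.tag) in ("TextBlock", "ComposedBlock"):
            # Coordinates may be written as floats, so parse before truncating.
            boxes[el.get("ID")] = [
                int(float(el.get(attr)))
                for attr in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            ]
    return boxes


def article_pages(refs, alto_roots):
    """Group an article's boxes by page, in order of first reference.

    refs: ordered (FILEID, BEGIN) pairs for one article.
    alto_roots: dict mapping FILEID to a parsed ALTO root element.
    """
    boxes_by_file = {fid: block_boxes(root) for fid, root in alto_roots.items()}
    pages = {}  # dicts keep insertion order, so page order is preserved
    for file_id, block_id in refs:
        pages.setdefault(file_id, []).append(boxes_by_file[file_id][block_id])
    return {"pages": list(pages.values())}


refs = [("ALTO00001", "P1_TB00013"), ("ALTO00001", "P1_CB00001")]
result = article_pages(refs, {"ALTO00001": ET.fromstring(ALTO_SAMPLE)})
print(json.dumps(result))
```

The resulting dictionary serializes directly with json.dumps, which would suit either a metadata field or a separate JSON derivative file.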