tacman/nitf-parser

NITF Parser

A PHP library for parsing NITF (News Industry Text Format) XML documents into a flat, searchable structure optimized for Meilisearch and similar full-text search engines.

Installation

composer require tacman/nitf-parser

Requirements

  • PHP 8.4+

Quick Start

use Tacman\NTF\NTF;

// Parse from file
$ntf = NTF::fromFile('article.xml');

// Or from XML string
$ntf = NTF::fromXml($xmlString);

// Or from a zip archive containing multiple NITF files
foreach (NTF::fromZip('articles.zip') as $ntf) {
    echo $ntf->headline;
}

// Get a flat array ready for indexing
$searchable = $ntf->toSearchable();

Available Fields

The NTF class provides these public properties:

| Property | Type | Description |
|----------|------|-------------|
| $id | string | Document ID (from doc-id/@id-string) |
| $headline | string | Main headline (from hl1) |
| $subhead | string | Sub-headline (from hl2) |
| $byline | string | Author byline |
| $summary | string | Article summary/abstract |
| $body | string | Full body text (all <p> elements joined) |
| $keywords | string[] | Keywords from key-list |
| $categories | array | Classifications as ['type' => '...', 'value' => '...'] |
| $images | array | Media references with source, name, mimeType |
| $publishedAt | ?DateTime | Publication date |
| $modifiedAt | ?DateTime | Last modification date |
| $section | ?string | Publication section |
| $type | ?string | Publication type |

Meilisearch Integration

The toSearchable() method returns a flat array ready for direct indexing:

$ntf = NTF::fromFile('article.xml');
$searchable = $ntf->toSearchable();

// Index directly into Meilisearch.
// $client is assumed to be a Meilisearch\Client instance (meilisearch/meilisearch-php).
$client->index('articles')->addDocuments([$searchable]);

The searchable array includes every field, with these conversions:

  • publishedAt and modifiedAt as ISO 8601 strings
  • keywords as an array
  • categories and images as JSON arrays
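
As a rough illustration of the flattening described above, the sketch below converts a small document object into a plain array with ISO 8601 date strings. `ArticleSketch` and its field selection are hypothetical stand-ins, not the library's NTF class, and the real toSearchable() output may differ in keys and detail.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a parsed document; NOT the library's NTF class.
final class ArticleSketch
{
    public function __construct(
        public string $id,
        public string $headline,
        public array $keywords = [],
        public array $categories = [],
        public ?DateTimeInterface $publishedAt = null,
    ) {}

    /** Flatten to scalars and plain arrays so the result can be JSON-encoded. */
    public function toSearchable(): array
    {
        return [
            'id' => $this->id,
            'headline' => $this->headline,
            'keywords' => array_values($this->keywords),
            'categories' => array_values($this->categories),
            // DateTimeInterface::ATOM is an ISO 8601 compatible format; null stays null
            'publishedAt' => $this->publishedAt?->format(DateTimeInterface::ATOM),
        ];
    }
}

$doc = new ArticleSketch(
    id: 'abc123',
    headline: 'Big Game Today',
    keywords: ['#news', '#sports'],
    categories: [['type' => 'section', 'value' => 'sports']], // illustrative values
    publishedAt: new DateTimeImmutable('2026-01-15T00:01:00Z'),
);

$searchable = $doc->toSearchable();
echo json_encode($searchable, JSON_PRETTY_PRINT), PHP_EOL;
```

Keeping dates as strings and everything else as scalars or plain arrays means the result can be passed to json_encode() without custom serializers.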

Zip File Processing

Process large archives efficiently using the generator:

// Iterate through all NITF files in a zip
$count = 0;
foreach (NTF::fromZip('archive.zip') as $ntf) {
    $count++;
    // Process each document
}

// Or get all as an array
$all = NTF::allFromZip('archive.zip');

The zip parser:

  • Only processes .xml files
  • Skips invalid XML files silently
  • Uses a generator for memory efficiency
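
The behaviour listed above can be approximated with PHP's ZipArchive and SimpleXML. This is a minimal sketch under stated assumptions, not the library's code: `xmlEntriesFromZip()` is a hypothetical helper, it yields raw SimpleXMLElement objects rather than NTF instances, and it requires the zip and simplexml extensions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: lazily yields one SimpleXMLElement per well-formed .xml entry.
 *
 * @return Generator<string, SimpleXMLElement> keyed by entry name
 */
function xmlEntriesFromZip(string $zipPath): Generator
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open zip archive: $zipPath");
    }

    // Collect libxml errors internally instead of emitting warnings
    $previous = libxml_use_internal_errors(true);
    try {
        for ($i = 0; $i < $zip->numFiles; $i++) {
            $name = $zip->getNameIndex($i);
            if ($name === false || strtolower(pathinfo($name, PATHINFO_EXTENSION)) !== 'xml') {
                continue; // only .xml entries are considered
            }
            $contents = $zip->getFromIndex($i);
            $xml = $contents === false ? false : simplexml_load_string($contents);
            libxml_clear_errors();
            if ($xml === false) {
                continue; // skip unreadable or malformed entries silently
            }
            yield $name => $xml; // one document in memory at a time
        }
    } finally {
        libxml_use_internal_errors($previous);
        $zip->close();
    }
}

// Usage: build a throwaway archive with one valid, one malformed, one non-XML entry.
$path = sys_get_temp_dir() . '/nitf_sketch_' . uniqid() . '.zip';
$archive = new ZipArchive();
$archive->open($path, ZipArchive::CREATE);
$archive->addFromString('a.xml', '<nitf><body/></nitf>');
$archive->addFromString('bad.xml', '<nitf>');
$archive->addFromString('readme.txt', 'not xml');
$archive->close();

$names = [];
foreach (xmlEntriesFromZip($path) as $name => $xml) {
    $names[] = $name;
}
unlink($path);
print_r($names);
```

Because the function is a generator, each entry is read and parsed only when the loop asks for it, which is what keeps memory use flat on large archives.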

Example

Given a NITF XML file:

<?xml version="1.0" encoding="UTF-8"?>
<nitf xmlns="http://iptc.org/std/NITF/2006-10-18/">
  <head>
    <docdata>
      <doc-id id-string="abc123"/>
      <date.release norm="2026-01-15T00:01:00Z"/>
      <key-list>
        <keyword key="#news"/>
        <keyword key="#sports"/>
      </key-list>
    </docdata>
    <pubdata type="web" position.section="news/sports"/>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>Big Game Today</hl1>
        <hl2>Preview and analysis</hl2>
      </hedline>
      <byline>By John Smith</byline>
    </body.head>
    <body.content>
      <p>First paragraph of the article...</p>
      <p>Second paragraph...</p>
    </body.content>
  </body>
</nitf>

You get:

$ntf->id;          // "abc123"
$ntf->headline;    // "Big Game Today"
$ntf->subhead;     // "Preview and analysis"
$ntf->byline;      // "By John Smith"
$ntf->body;        // "First paragraph of the article...\n\nSecond paragraph..."
$ntf->keywords;    // ["#news", "#sports"]
$ntf->section;     // "news/sports"
$ntf->publishedAt; // DateTime object
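
To show how values like these can be read from a NITF document, here is a minimal sketch using PHP's built-in SimpleXML on a shortened copy of the XML above. It is not the library's implementation; the XPath expressions, the `n` prefix, and the `$first` helper are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

$xmlString = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<nitf xmlns="http://iptc.org/std/NITF/2006-10-18/">
  <head>
    <docdata>
      <doc-id id-string="abc123"/>
      <key-list>
        <keyword key="#news"/>
        <keyword key="#sports"/>
      </key-list>
    </docdata>
  </head>
  <body>
    <body.head>
      <hedline><hl1>Big Game Today</hl1></hedline>
    </body.head>
    <body.content>
      <p>First paragraph of the article...</p>
      <p>Second paragraph...</p>
    </body.content>
  </body>
</nitf>
XML;

$xml = new SimpleXMLElement($xmlString);

// The document declares a default namespace, so XPath needs an explicit prefix.
$xml->registerXPathNamespace('n', 'http://iptc.org/std/NITF/2006-10-18/');

// Hypothetical helper: first XPath match as a string, or '' when nothing matches.
$first = static fn (string $path): string => (string) ($xml->xpath($path)[0] ?? '');

$id = $first('//n:doc-id/@id-string');
$headline = $first('//n:hl1');
$keywords = array_map('strval', $xml->xpath('//n:keyword/@key'));

// Join paragraph text with blank lines, matching the body format shown above.
$body = implode("\n\n", array_map(
    static fn (SimpleXMLElement $p): string => trim((string) $p),
    $xml->xpath('//n:body.content/n:p')
));

var_dump($id, $headline, $keywords, $body);
```

The namespace registration is the step most easily missed: without it, unprefixed XPath queries against a document with a default namespace match nothing.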

Testing

./vendor/bin/phpunit

License

MIT

About

PHP parser for a subset of the NITF XML newspaper exchange format
