SimpleMarcParser

SimpleMarcParser is currently a minimal MARC21/XML parser for use with QuiteSimpleXMLElement, with support for the MARC21 Bibliographic, Authority and Holdings formats.

Note: This project is not actively developed anymore, but I will still process issues. The aim of this project was to produce “simple” JSON representations of MARC21 records. I'm now working on php-marc, a wrapper for File_MARC.

Example:

require_once('vendor/autoload.php');

use Danmichaelo\QuiteSimpleXMLElement\QuiteSimpleXMLElement,
    Scriptotek\SimpleMarcParser\Parser;

// Fetch a single record over SRU, looked up by ISBN.
$data = file_get_contents('http://sru.bibsys.no/search/biblio?' . http_build_query(array(
    'version' => '1.2',
    'operation' => 'searchRetrieve',
    'recordSchema' => 'marcxchange',
    'query' => 'bs.isbn="0-521-43291-x"'
)));

// Wrap the response and register the namespaces used in the XPath below.
$doc = new QuiteSimpleXMLElement($data);
$doc->registerXPathNamespaces(array(
    'srw' => 'http://www.loc.gov/zing/srw/',
    'marc' => 'http://www.loc.gov/MARC21/slim',
    'd' => 'http://www.loc.gov/zing/srw/diagnostic/'
));

$parser = new Parser();

// Parse the first MARC record in the response.
$record = $parser->parse($doc->first('/srw:searchRetrieveResponse/srw:records/srw:record/srw:recordData/marc:record'));

print $record->title . "\n";

// Each subject is an array with (at least) the keys 'term' and 'system'.
foreach ($record->subjects as $subject) {
    print $subject['term'] . ' (' . $subject['system'] . ")\n";
}

Transformation/normalization

This parser aims to produce machine-actionable output and applies some irreversible transformations to achieve this. The transformation rules expect AACR2-like records and are tested mainly against the Norwegian edition of AACR2 (Norske katalogregler), but may work with other editions as well.

Examples:

  • title is a combination of 245 $a and $b, separated by " : ".
  • year is an integer extracted from 260 $c by taking the first four-digit integer found (c2013 → 2013, 2009 [i.e. 2008] → 2009; this might be a bit rough…)
  • pages is an integer extracted from 300 $a. The raw value, useful for e.g. non-verbal content, is stored in extent
  • creators[].name are transformed from 'Lastname, Firstname' to 'Firstname Lastname'
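
The rules above can be illustrated with a few small functions. This is only a sketch of the kind of normalization described, not the parser's actual code: the function names (`joinTitle`, `extractYear`, `extractPages`, `invertName`) are hypothetical, and the page rule assumes the count is followed by a designator such as "s." or "p.".

```php
<?php
// Hypothetical helpers illustrating the rules above; NOT the library's own code.

/** Join 245 $a and $b with the ISBD separator " : ". */
function joinTitle(string $a, ?string $b = null): string
{
    // Strip trailing ISBD punctuation (" :", " /") left in the subfield values.
    $clean = fn (string $s): string => rtrim($s, ' :/');

    return ($b === null || $b === '')
        ? $clean($a)
        : $clean($a) . ' : ' . $clean($b);
}

/** Return the first four-digit run in 260 $c as an integer, or null if none. */
function extractYear(string $c): ?int
{
    return preg_match('/\d{4}/', $c, $m) === 1 ? (int) $m[0] : null;
}

/**
 * Return the page count from 300 $a, or null if none is found.
 * Assumption: the count is followed by "s.", "p." or "pp.".
 */
function extractPages(string $extent): ?int
{
    return preg_match('/(\d+)\s*(?:s|pp|p)\b/i', $extent, $m) === 1 ? (int) $m[1] : null;
}

/** Turn "Lastname, Firstname" into "Firstname Lastname"; leave other names as-is. */
function invertName(string $name): string
{
    $parts = explode(', ', $name, 2);

    return count($parts) === 2 ? $parts[1] . ' ' . $parts[0] : $name;
}
```

Returning null when nothing matches keeps "no year" distinct from a bogus value, and the raw 300 $a string remains available for cases such as non-book material where no page count applies.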

Form and material

Form and material are encoded in the leader and in control fields 006, 007 and 008. Encoding this information in a format that makes sense is a work in progress.

Electronic and printed material are currently distinguished by the Boolean-valued electronic key.

Printed book:

{
	"material": "book",
	"electronic": false
}

Electronic book:

{
	"material": "book",
	"electronic": true
}
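
As a rough illustration of how such a structure could be derived from the leader and control fields, here is a minimal sketch. The character positions follow MARC 21 (Leader/06 and /07, 007/00, 008/23), but the function `describeForm` and the mapping it applies are assumptions made for this example; the parser's own logic may differ and covers more cases.

```php
<?php
// Hypothetical sketch; the mapping is an assumption, NOT the library's actual logic.

/**
 * @param string   $leader    the 24-character record leader
 * @param string[] $fields007 values of all 007 fields (may be empty)
 * @param string   $field008  value of the 008 field
 */
function describeForm(string $leader, array $fields007, string $field008): array
{
    // Leader/06 "a" (language material) with Leader/07 "m" (monograph) => book.
    $isBook = substr($leader, 6, 1) === 'a' && substr($leader, 7, 1) === 'm';

    // 007/00 "c" marks an electronic resource.
    $electronic = false;
    foreach ($fields007 as $f007) {
        if (substr($f007, 0, 1) === 'c') {
            $electronic = true;
        }
    }

    // 008/23 (form of item for books): o = online, q = direct electronic, s = electronic.
    if (in_array(substr($field008, 23, 1), ['o', 'q', 's'], true)) {
        $electronic = true;
    }

    return [
        'material' => $isBook ? 'book' : 'unknown',
        'electronic' => $electronic,
    ];
}

// A printed monograph: no 007 field and a blank form-of-item position.
echo json_encode(describeForm('00000nam a2200000 a 4500', [], str_repeat(' ', 40))), "\n";
// prints {"material":"book","electronic":false}
```

Checking both 007 and 008 is deliberate: records for e-books do not always carry a 007 field, so the form-of-item code in 008 serves as a fallback.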
