Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Corpus-tagged Text Converter

Circle CI

A PHP library for converting files tagged with corpus metadata to JSON, PHP, or XML.

Screenshot of Conversion


Corpus linguistics researchers use a markup-like syntax to provide metadata about texts. For consumption by applications, this syntax needs to be converted into a more universal, machine-readable format. The format chosen was JSON.

Basic Usage

The included /demo/index.php file contains a conversion form demonstration.

Make your code aware of the TagConverter class via your favorite method (e.g., use or require)

Then pass a string of text into the class:

$text = TagConverter::json('<MyTag: 123>My tagged text here');
echo $text;
// Returns {"MyTag":"123","text":"My tagged text here"}

$text = TagConverter::php('<MyTag: 123>My tagged text here');
echo $text;
// Returns array('MyTag' => '123', 'text' => 'My tagged text here')

$text = TagConverter::xml('<MyTag: 123>My tagged text here');
echo $text;
// Returns <?xml version="1.0"?><root><MyTag>123</MyTag><text>My tagged text here</text></root>

Expected input format

The corpus style tagging syntax expected by the library is defined as follows:

  1. Tags must be wrapped in < and >
  2. Tag names and tag values may only alphanumeric characters, spaces, underscores, and hypens.
  3. Tag names must be separated from tag values by a :
  4. Spaces at the beginning at end of tag names or tag values are ignored; spaces within tag values will be preserved
  5. Everything not wrapped in < and > will be considered "text"
Status Tag Example Explanation
Good <MyTag:SomeText>
Good <My Tag:Some Text> Spaces in tag names & values OK
Good < My Tag : Some Text > Spaces padding tag names & values OK
Good < My-Tag : Some_Text > Underscores & hyphens OK
Good ```< My-Tag : Value 1 Value 2 >```
Good < My-Tag : Value 1 ; Value 2 > Semicolon separators for multiple values
Bad < My/Tag : Some:Text > Other characters not OK


Unit Tests can be run (after composer install) by executing vendor/bin/phpunit