Michael Wayne Goodman edited this page Mar 23, 2015 · 6 revisions

Codecs

The xigt.codecs package contains modules for reading and writing Xigt corpora. Currently there is only one, xigtxml, which handles the XigtXML format.

Corpora using the [Xigt Data Model](Data Model) can be stored losslessly in a Xigt serialization format. Corpora in other formats can also be imported into the Xigt data model, and Xigt corpora can be exported to other formats, but information may be lost in either direction.

XML is currently the only serialization format, but a codec for a non-XML format such as JSON could also be implemented. Codecs should generally follow the convention of Python's pickle module and implement load(), loads(), dump(), and dumps() functions.
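As an illustration of the four-function convention, here is a minimal sketch of what a JSON codec module might look like. It is not part of Xigt: the corpus is represented by a plain list of dicts as a hypothetical stand-in for a XigtCorpus, and the field names are made up. The argument order of dump() follows the signatures documented on this page (file object first), which differs from pickle.dump(obj, file).

```python
import io
import json

# Hypothetical stand-in: a corpus is a list of plain dicts. A real codec
# would convert to and from XigtCorpus objects instead.


def loads(s):
    """Return the corpus decoded from the JSON string s."""
    return json.loads(s)


def load(fh):
    """Return the corpus decoded from the text file object fh."""
    return json.load(fh)


def dumps(xc, indent=2):
    """Return a JSON string representation of the corpus xc."""
    return json.dumps(xc, indent=indent, ensure_ascii=False)


def dump(fh, xc, indent=2):
    """Write the corpus xc to the text file object fh."""
    fh.write(dumps(xc, indent=indent))


# Round trip through an in-memory stream.
corpus = [{"id": "i1", "tiers": [{"type": "words", "items": ["el", "perro"]}]}]
buffer = io.StringIO()
dump(buffer, corpus)
buffer.seek(0)
round_tripped = load(buffer)
```

Defining load() and dump() in terms of file objects, with loads() and dumps() as the string counterparts, lets calling code switch between codecs without changing anything except the module it imports.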

XigtXML

XigtXML is the canonical format for encoding Xigt data.

Reading

XigtXML data can be loaded from a file on disk or streamed (e.g. through a pipeline). The load() function is able to read a corpus incrementally, which can be useful for very large files.

# codecs.xigtxml.load(fh, mode='full')

Return a XigtCorpus loaded from the file object fh, which could be from a file on disk, or a stream like sys.stdin. The optional mode parameter determines how the file will be loaded, with possible values as follows:

| mode | description |
| ---- | ----------- |
| full | load the entire corpus into memory |
| incremental | load one Igt upon iteration and store it in memory |
| transient | load one Igt upon iteration, but don't store it |

Both the incremental and transient modes allow code to begin working with Igts as soon as they are read. The transient mode does not keep the loaded Igts in memory after it has read them, so you can read through an entire corpus with near-constant memory usage.
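The difference between keeping and discarding each Igt can be sketched with the standard library's streaming parser. This is only an illustration of the idea, not Xigt's actual implementation: iter_igts() and its keep flag are hypothetical names, the sample document is invented, and the elements yielded are plain ElementTree elements rather than Igt objects.

```python
import io
import xml.etree.ElementTree as ET


def iter_igts(fh, keep=False):
    """Yield each <igt> element from fh as soon as its end tag is parsed.

    keep=True loosely mirrors 'incremental' (parsed elements stay in memory);
    keep=False loosely mirrors 'transient' (each element is emptied once the
    consumer asks for the next one).
    """
    for _event, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag == "igt":
            yield elem
            if not keep:
                # Drop the element's children, attributes, and text. The
                # emptied element itself still hangs off the root.
                elem.clear()


# Invented sample data for demonstration only.
sample = io.BytesIO(
    b'<xigt-corpus>'
    b'<igt id="i1"><tier id="w" type="words"/></igt>'
    b'<igt id="i2"/>'
    b'</xigt-corpus>'
)
igt_ids = [igt.get("id") for igt in iter_igts(sample)]
```

Because each element is cleared only when the generator resumes, the consumer must finish with one Igt before requesting the next, which is the same constraint the transient mode implies.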

# codecs.xigtxml.loads(s)

Return a XigtCorpus loaded from the string s. Otherwise, this behaves the same as load() with the default mode='full'.

Writing

# codecs.xigtxml.dump(fh, xc, encoding='utf-8', indent=2)

Write the XigtCorpus xc to the file object fh, which could be a file on disk, or a stream like sys.stdout. The optional encoding parameter determines the character encoding to use for the data and, for all values except unicode, outputs an XML encoding declaration (e.g. <?xml version="1.0" encoding="utf-8"?>).

The optional indent parameter determines whether the XML is pretty-printed. Any non-zero integer value results in a newline after each element, and each nesting level is indented by that many spaces. All levels are then de-indented by 2 spaces (regardless of the non-zero value of indent) so that, with the default value of 2, <igt> elements begin at column 0. For example, when indent is 4, <igt> elements begin at column 2, <tier> elements begin at column 6, and so on.
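The column arithmetic described above can be written out as a small helper. start_column() is a hypothetical function for illustration, not part of Xigt; it assumes <igt> is at nesting depth 1 and <tier> at depth 2, and clamping at column 0 for small indent values is an assumption the text does not address.

```python
def start_column(depth, indent=2):
    """Column at which an element at the given nesting depth begins.

    Applies to non-zero indent values only. Each level is indented by
    `indent` spaces, then everything is shifted left by 2 spaces.
    """
    # Clamping at 0 is an assumption for cases such as indent=1.
    return max(0, indent * depth - 2)


# Depth 1 is <igt>, depth 2 is <tier> (assumed nesting for this sketch).
default_columns = [start_column(depth) for depth in (1, 2)]
wide_columns = [start_column(depth, indent=4) for depth in (1, 2)]
```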

# codecs.xigtxml.dumps(xc, encoding='unicode', indent=2)

Return a string representation of the XigtCorpus xc. The default encoding is unicode, which means an XML encoding declaration will not be printed. Otherwise, this is similar to dump().
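The link between the encoding value and the XML declaration can be illustrated with the standard library's ElementTree, which also accepts 'unicode' as a special encoding. This is a sketch of the general idea rather than Xigt's own serializer: serialize() is a hypothetical helper, the root element name is used only as sample data, and the xml_declaration argument requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def serialize(elem, encoding="unicode"):
    """Serialize elem, emitting an XML declaration unless encoding is 'unicode'."""
    declare = encoding != "unicode"
    return ET.tostring(elem, encoding=encoding, xml_declaration=declare)


root = ET.Element("xigt-corpus")  # sample element for demonstration
as_text = serialize(root)                     # str, no declaration
as_bytes = serialize(root, encoding="utf-8")  # bytes, with declaration
```

In ElementTree, 'unicode' yields a text string and any real encoding yields bytes, which is one reason a declaration only makes sense in the latter case.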