Codecs
The xigt.codecs package contains modules (currently only one: XigtXML) for reading and writing Xigt corpora.
Corpora using the [Xigt Data Model](Data Model) can be stored losslessly in a Xigt serialization format. Corpora in other formats can also be imported into the Xigt data model, and Xigt corpora can be exported to other formats, but either conversion may lose information.
While XML is currently the only format, a codec for a non-XML format (e.g. JSON) could be implemented. Codecs should generally follow the API of Python's pickle module and implement load(), loads(), dump(), and dumps() functions.
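As a rough illustration of that four-function shape, here is a minimal sketch of a hypothetical JSON codec. It is not part of xigt: the corpus is stood in for by a plain dict, and a real codec would convert to and from XigtCorpus objects instead. The argument order of dump() mirrors the xigtxml.dump(fh, xc) signature documented on this page, which differs from pickle.dump(obj, file).

```python
import io
import json

# Hypothetical JSON codec; a plain dict stands in for a XigtCorpus.

def load(fh):
    """Read a corpus from the file object fh."""
    return json.load(fh)

def loads(s):
    """Read a corpus from the string s by delegating to load()."""
    return load(io.StringIO(s))

def dump(fh, xc, indent=2):
    """Write the corpus xc to the file object fh."""
    json.dump(xc, fh, indent=indent, ensure_ascii=False)

def dumps(xc, indent=2):
    """Return the corpus xc as a string by delegating to dump()."""
    buf = io.StringIO()
    dump(buf, xc, indent=indent)
    return buf.getvalue()

# Round trip: what is dumped can be loaded back unchanged.
corpus = {"igts": [{"id": "i1", "tiers": []}]}
restored = loads(dumps(corpus))
```

Implementing the string variants on top of the file variants keeps the parsing and serialization logic in one place.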
XigtXML is the canonical format for encoding Xigt data.
XigtXML data can be loaded from a file on disk or streamed (e.g. through a pipeline). The load() function is able to read a corpus incrementally, which can be useful for very large files.
# codecs.xigtxml.load(fh, mode='full')
Return a XigtCorpus loaded from the file object fh, which could be a file on disk or a stream like sys.stdin. The optional mode parameter determines how the file will be loaded, with possible values as follows:
mode | description |
---|---|
full | load the entire corpus into memory |
incremental | load one Igt upon iteration and store it in memory |
transient | load one Igt upon iteration, but don't store it |
Both the incremental and transient modes allow code to begin working with Igts as soon as they are read. The transient mode does not keep the loaded Igts in memory after it has read them, so you can read through an entire corpus with near-constant memory usage.
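A minimal usage sketch of transient loading, assuming xigt is installed and that iterating over the returned corpus yields one Igt at a time, as the table above describes. The helper name count_igts is hypothetical.

```python
def count_igts(fh):
    """Count the Igts in a XigtXML stream without retaining them."""
    # Imported inside the function so this sketch can be read and
    # imported even where xigt is not installed.
    from xigt.codecs import xigtxml

    # In transient mode each Igt is handed over once during iteration
    # and is not stored in the corpus afterwards.
    corpus = xigtxml.load(fh, mode='transient')
    return sum(1 for _igt in corpus)

# Hypothetical usage, reading a corpus streamed through a pipeline:
#     import sys
#     print(count_igts(sys.stdin))
```

Because a transient corpus does not keep what it has read, any per-Igt work has to happen inside the loop; use mode='incremental' if the Igts need to be revisited later.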
# codecs.xigtxml.loads(s)
Return a XigtCorpus loaded from the string s. Otherwise this is the same as load() with the default mode='full' parameter.
# codecs.xigtxml.dump(fh, xc, encoding='utf-8', indent=2)
Write the XigtCorpus xc to the file object fh, which could be a file on disk or a stream like sys.stdout. The optional encoding parameter determines the character encoding to use for the data. For all values except unicode, an XML encoding declaration (e.g. <?xml version="1.0" encoding="utf-8"?>) is also written.
The optional indent parameter determines whether the XML is pretty-printed. Any non-zero integer value will result in newlines after each element. The default value of 2 indents each level by two spaces. All levels are de-indented by 2 spaces (regardless of any non-zero value of indent) so that <igt> elements begin at column 0 for the default case. For example, when indent is 4, the <igt> elements begin at column 2, <tier> elements begin at column 6, etc.
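The indentation rule above can be worked through as arithmetic. The sketch below is not taken from the xigt source; the function name start_column and the clamp at column 0 are assumptions made for illustration, and nesting depth is counted so that <igt> is depth 1 and <tier> is depth 2.

```python
def start_column(depth, indent=2):
    """Column at which an element at the given nesting depth begins.

    depth 1 is <igt>, depth 2 is <tier>, and so on.
    """
    if not indent:
        return 0  # indent of 0: no pretty-printing
    # Each level is indented by `indent` spaces, then every level is
    # shifted left by 2. Clamping at 0 is an assumption of this sketch.
    return max(0, indent * depth - 2)

default_igt = start_column(1)             # indent=2: 2*1 - 2 = 0
default_tier = start_column(2)            # indent=2: 2*2 - 2 = 2
wide_igt = start_column(1, indent=4)      # 4*1 - 2 = 2
wide_tier = start_column(2, indent=4)     # 4*2 - 2 = 6
```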
# codecs.xigtxml.dumps(xc, encoding='unicode', indent=2)
Return a string representation of the XigtCorpus xc. The default encoding is unicode, which means an XML encoding declaration will not be printed. Otherwise, this is similar to dump().