# From TEI to the spaCy world and back

First, let us look at this oversimplified TEI document:

In [None]:
!pip install standoffconverter
!pip install spacy

In [2]:
from lxml import etree
from standoffconverter import Converter

input_xml = '''<TEI>
    <teiHeader>
    </teiHeader>
    <text>
        <body>
            <p>1 2 3 4 5 6 7 9 10</p>
            <p> 11 12 13 14</p>
        </body>
    </text>
</TEI>'''

We will first parse it with lxml and then initialize the `Converter`.

In [3]:
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.fromstring(input_xml, parser=parser)
converter = Converter(tree)
converter.ensure_cache()

With this converter, you have access to several data structures, for example:

* the simple standoff table `converter.table`
* the tree `converter.text_el`
* just the text `converter.plain`
* the JSON of all annotations `converter.json`
* the collapsed standoff table `converter.collapsed_table` (which I like a lot)

In [4]:
converter.table.head()

Unnamed: 0,sos,text
0,"[text, body, p]",1
1,"[text, body, p]"," "
2,"[text, body, p]",2
3,"[text, body, p]"," "
4,"[text, body, p]",3


In [5]:
converter.text_el

<Element text at 0x119e9e548>

In [6]:
converter.plain

'1 2 3 4 5 6 7 9 10 11 12 13 14'

In [7]:
converter.json

'[{"tag": "text", "attrib": {}, "begin": 0, "end": 30, "depth": 0}, {"tag": "body", "attrib": {}, "begin": 0, "end": 30, "depth": 1}, {"tag": "p", "attrib": {}, "begin": 0, "end": 18, "depth": 2}, {"tag": "p", "attrib": {}, "begin": 18, "end": 30, "depth": 2}]'
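The offsets in this JSON can be reproduced by hand, which may help to see what "standoff" means here. The following is a minimal sketch using only the standard library, not the code `standoffconverter` actually runs: the helper `to_standoff` is hypothetical, and skipping whitespace-only text nodes is an assumption that mimics the `remove_blank_text=True` parser option used above.

```python
import xml.etree.ElementTree as ET

def to_standoff(root):
    """Return (plain_text, annotations) for an element tree.

    Hypothetical helper for illustration; not part of standoffconverter.
    """
    chunks = []
    annotations = []
    pos = 0

    def keep(text):
        # Assumption: whitespace-only nodes are ignored, similar to
        # lxml's remove_blank_text=True.
        nonlocal pos
        if text and text.strip():
            chunks.append(text)
            pos += len(text)

    def visit(el, depth):
        begin = pos
        keep(el.text)
        for child in el:
            visit(child, depth + 1)
            keep(child.tail)
        annotations.append({
            "tag": el.tag, "attrib": dict(el.attrib),
            "begin": begin, "end": pos, "depth": depth,
        })

    visit(root, 0)
    # Order by start offset, outer elements first.
    annotations.sort(key=lambda a: (a["begin"], a["depth"]))
    return "".join(chunks), annotations

text_el = ET.fromstring(
    "<text>\n  <body>\n"
    "    <p>1 2 3 4 5 6 7 9 10</p>\n"
    "    <p> 11 12 13 14</p>\n"
    "  </body>\n</text>"
)
plain, annotations = to_standoff(text_el)
for a in annotations:
    print(a["tag"], a["begin"], a["end"], a["depth"])
```

Each element becomes one record with a `begin` and `end` character offset into the plain text, so the markup is stored next to the text rather than inside it.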

In [8]:
converter.collapsed_table

Unnamed: 0,context,text
0,"[text, body, p]",1 2 3 4 5 6 7 9 10
1,"[text, body, p]",11 12 13 14
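One way to think about the collapsed table is as a run-length grouping: consecutive characters that sit inside exactly the same elements are merged into one row. The sketch below illustrates that idea with the standard library only. The function `collapse` is hypothetical and is not how `standoffconverter` is necessarily implemented; the annotation list is copied from the JSON output shown earlier.

```python
from itertools import groupby

plain = "1 2 3 4 5 6 7 9 10 11 12 13 14"
annotations = [
    {"tag": "text", "begin": 0, "end": 30, "depth": 0},
    {"tag": "body", "begin": 0, "end": 30, "depth": 1},
    {"tag": "p", "begin": 0, "end": 18, "depth": 2},
    {"tag": "p", "begin": 18, "end": 30, "depth": 2},
]

def collapse(plain, annotations):
    """Group consecutive characters that share the same enclosing elements."""
    def context_at(pos):
        # Use annotation indices, not tag names, so that two adjacent
        # <p> elements are kept apart even though their tags are equal.
        covering = [i for i, a in enumerate(annotations)
                    if a["begin"] <= pos < a["end"]]
        return tuple(sorted(covering, key=lambda i: annotations[i]["depth"]))

    rows = []
    for key, group in groupby(range(len(plain)), key=context_at):
        positions = list(group)
        tags = [annotations[i]["tag"] for i in key]
        rows.append((tags, plain[positions[0]:positions[-1] + 1]))
    return rows

rows = collapse(plain, annotations)
for tags, text in rows:
    print(tags, repr(text))
```

With the two paragraphs of the example document this yields two rows, one per `<p>`, which matches the shape of the table above.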


To illustrate how to process the text with spaCy and then pass the token-level information back to the standoff representation, we annotate all numbers that are divisible by 2 with the imaginary `<divisibleby2>` tag.
In the next cell, the plain text from the converter is tokenized with spaCy, and each token is classified as divisible or not divisible.

In [9]:
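from spacy.tokenizer import Tokenizer
import spacy

def tokenize(str_):
    # A blank English pipeline supplies the vocab; a Tokenizer built from the
    # vocab alone has no extra rules, so it splits on whitespace.
    nlp = spacy.blank('en')
    return Tokenizer(nlp.vocab)(str_)

def it_annotations(doc, labels):
    # Yield (begin, end, label) character spans for every labelled token.
    for token, label in zip(doc, labels):
        begin, end = token.idx, token.idx + len(token)
        if label is not None:
            yield begin, end, label

candidates = tokenize(converter.plain)
# Even numbers get a label; all other tokens get None and are skipped later.
labels = ['divisible_by2' if int(tok.text) % 2 == 0 else None for tok in candidates]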

With `token.idx`, spaCy keeps track of the character offset of each token. That way, we can recover the position of the token in the plain text afterwards. Here, we use `converter.add_inline` to add annotations at the character level.
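To see what these character offsets look like without installing spaCy, the following sketch approximates whitespace tokenization with a regular expression. It is a stand-in, not spaCy itself: `whitespace_tokens` is a hypothetical helper, and `match.start()` plays the role that `token.idx` plays above.

```python
import re

def whitespace_tokens(text):
    """Yield (begin, end, token) for every run of non-whitespace characters."""
    for match in re.finditer(r"\S+", text):
        yield match.start(), match.end(), match.group()

plain = "1 2 3 4 5 6 7 9 10 11 12 13 14"

# Keep only the spans of tokens that are divisible by 2.
even_spans = [(begin, end)
              for begin, end, tok in whitespace_tokens(plain)
              if int(tok) % 2 == 0]
print(even_spans)

# Slicing the plain text with a span returns the original token.
print([plain[begin:end] for begin, end in even_spans])
```

Because the offsets index into the same plain text that the converter exposes, they can be handed back to it unchanged.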

In [10]:
# Wrap each labelled character span in a <divisibleby2> element.
for begin, end, label in it_annotations(candidates, labels):
    converter.add_inline(
        begin=begin,
        end=end,
        tag="divisibleby2",
        depth=None,
        attrib={}
    )

In [11]:
converter.collapsed_table

Unnamed: 0,context,text
0,"[text, body, p]",1
1,"[text, body, p, divisibleby2]",2
2,"[text, body, p]",3
3,"[text, body, p, divisibleby2]",4
4,"[text, body, p]",5
5,"[text, body, p, divisibleby2]",6
6,"[text, body, p]",7 9
7,"[text, body, p, divisibleby2]",10
8,"[text, body, p]",11
9,"[text, body, p, divisibleby2]",12


In [12]:
etree.tostring(converter.text_el).decode("utf-8")

'<text><body><p>1 <divisibleby2>2</divisibleby2> 3 <divisibleby2>4</divisibleby2> 5 <divisibleby2>6</divisibleby2> 7 9 <divisibleby2>10</divisibleby2></p><p> 11 <divisibleby2>12</divisibleby2> 13 <divisibleby2>14</divisibleby2></p></body></text>'
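The step from character spans back to inline XML can also be sketched by hand for a single paragraph. This is a simplified illustration with the standard library, assuming non-overlapping spans that all lie inside one element; `wrap_spans` is a hypothetical helper, and `converter.add_inline` handles the general case across the whole tree.

```python
import xml.etree.ElementTree as ET

def wrap_spans(tag, text, spans, inline_tag):
    """Build <tag> from text, wrapping each (begin, end) span in <inline_tag>.

    Assumes the spans do not overlap.
    """
    parent = ET.Element(tag)
    cursor = 0
    last = None  # most recently created inline element
    for begin, end in sorted(spans):
        gap = text[cursor:begin]
        if last is None:
            parent.text = gap   # text before the first inline element
        else:
            last.tail = gap     # text between two inline elements
        last = ET.SubElement(parent, inline_tag)
        last.text = text[begin:end]
        cursor = end
    rest = text[cursor:]
    if last is None:
        parent.text = rest
    else:
        last.tail = rest
    return parent

p = wrap_spans("p", "1 2 3 4", [(2, 3), (6, 7)], "divisibleby2")
xml = ET.tostring(p, encoding="unicode")
print(xml)
```

The text between spans ends up in `.text` and `.tail` attributes, which is how both lxml and `xml.etree` represent mixed content.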