# From TEI to the spaCy world and back

First, let us look at this oversimplified TEI document:

In [1]:
!pip install standoffconverter
!pip install spacy



In [2]:
from lxml import etree
from standoffconverter import Standoff, View

input_xml = '''<TEI>
    <teiHeader>
    </teiHeader>
    <text>
        <body>
            <p>1 2 3 4 5 6 7 9 10</p>
            <p> 11 12 13 14 </p>
        </body>
    </text>
</TEI>'''

We first parse it with lxml and then initialize the `Standoff` converter.

In [3]:
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.fromstring(input_xml, parser=parser)
so = Standoff(tree)

With this converter, you have access to several data structures, for example:

* the simple standoff table `so.table`
* the tree `so.text_el`
* just the text `so.plain`
* the JSON of all annotations `so.json`
* the collapsed standoff table `so.collapsed_table` (which I like a lot)

In [4]:
so.table.df.head()

Unnamed: 0,position,row_type,el,depth,text
0,0,open,"[[[], []]]",0.0,
1,0,open,"[[], []]",1.0,
2,0,open,[],2.0,
3,0,text,,,1 2 3 4 5 6 7 9 10
4,18,close,[],2.0,


In [5]:
so.text_el

<Element text at 0x11a16ee40>

In [6]:
so.plain

'1 2 3 4 5 6 7 9 10 11 12 13 14 '

In [7]:
so.json

'[{"tag": "text", "attrib": {}, "begin": 0, "end": 31, "depth": 0}, {"tag": "body", "attrib": {}, "begin": 0, "end": 31, "depth": 1}, {"tag": "p", "attrib": {}, "begin": 0, "end": 18, "depth": 2}, {"tag": "p", "attrib": {}, "begin": 18, "end": 31, "depth": 2}]'
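Because `so.json` is a plain JSON string, the `begin`/`end` offsets can be checked directly against `so.plain`. The following is a minimal sketch that does not call standoffconverter at all: it assumes the two strings are copied verbatim from the outputs above, and the variable names (`annotations_json`, `spans`) are made up for illustration.

```python
import json

# Values copied from the outputs of `so.plain` and `so.json` shown above.
plain = '1 2 3 4 5 6 7 9 10 11 12 13 14 '
annotations_json = (
    '[{"tag": "text", "attrib": {}, "begin": 0, "end": 31, "depth": 0}, '
    '{"tag": "body", "attrib": {}, "begin": 0, "end": 31, "depth": 1}, '
    '{"tag": "p", "attrib": {}, "begin": 0, "end": 18, "depth": 2}, '
    '{"tag": "p", "attrib": {}, "begin": 18, "end": 31, "depth": 2}]'
)

annotations = json.loads(annotations_json)

# Pair each annotation's tag with the slice of plain text it covers.
spans = [(a["tag"], plain[a["begin"]:a["end"]]) for a in annotations]

for tag, text in spans:
    print(f"{tag}: {text!r}")
```

Each annotation is simply a tag plus a half-open character range into the plain text, which is what makes it possible to hand the text to an external tool and map its results back.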

In [8]:
so.collapsed_table

Unnamed: 0,context,text
0,"[[[<Element p at 0x11a16eec0>, <Element p at 0...",1 2 3 4 5 6 7 9 10
1,"[[[<Element p at 0x11a16eec0>, <Element p at 0...",11 12 13 14


To illustrate how to process the text with spaCy and then pass token-level information back to the standoff, we annotate all numbers that are divisible by 2 with the imaginary `<divisibleby2>` tag.
In the next cell, the plain text from the converter is tokenized with spaCy, and each token is classified as divisible or non-divisible.

In [9]:
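import spacy
from spacy.tokenizer import Tokenizer

def tokenize(str_):
    # A blank English pipeline is enough: only its vocab is needed
    nlp = spacy.blank('en')
    # A Tokenizer without extra rules splits on whitespace
    return Tokenizer(nlp.vocab)(str_)

def it_annotations(doc, labels):
    # Yield (begin, end, label) character spans for every labelled token
    for token, label in zip(doc, labels):
        begin, end = token.idx, token.idx + len(token)
        if label is not None:
            yield begin, end, label

view = View(so)
plain = view.get_plain()
candidates = tokenize(plain)
# Label even numbers; all other tokens get None
labels = ['divisible_by2' if int(tok.text) % 2 == 0 else None for tok in candidates]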

With `token.idx`, spaCy keeps track of the character offset of each token, so we can recover the token's position in the plain text afterwards. Here, we add character-level annotations with `so.add_inline`.

In [10]:
for begin, end, label in it_annotations(candidates, labels):
    # Map offsets in the view's plain text to positions in the standoff table
    so.add_inline(
        begin=view.get_table_pos(begin),
        end=view.get_table_pos(end),
        tag="divisibleby2",
        depth=None,
        attrib={}
    )
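
The character spans produced by `it_annotations` can be reproduced without spaCy to see exactly which offsets end up annotated. This is only a sketch under stated assumptions: the regular expression `\S+` stands in for the whitespace tokenizer, `whitespace_tokens` is a hypothetical helper that is not part of spaCy or standoffconverter, and `plain` is copied from the output of `so.plain` above.

```python
import re

def whitespace_tokens(text):
    """Yield (begin, end, token) for each whitespace-delimited token."""
    for match in re.finditer(r"\S+", text):
        # match.start() plays the role of spaCy's token.idx
        yield match.start(), match.end(), match.group()

# Copied from the output of `so.plain` above.
plain = '1 2 3 4 5 6 7 9 10 11 12 13 14 '

# Keep only the spans whose token is an even number.
even_spans = [
    (begin, end)
    for begin, end, token in whitespace_tokens(plain)
    if int(token) % 2 == 0
]

# Slicing the plain text with the spans recovers the annotated tokens.
even_tokens = [plain[begin:end] for begin, end in even_spans]
print(even_spans)
print(even_tokens)
```

The spans are half-open ranges (`begin` inclusive, `end` exclusive), the same convention as `token.idx` and `token.idx + len(token)` in the cell above.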

In [11]:
so.collapsed_table

Unnamed: 0,context,text
0,"[[[<Element p at 0x15a538a00>, <Element p at 0...",1
1,"[[[<Element p at 0x15a538a00>, <Element p at 0...",2
2,"[[[<Element p at 0x15a538a00>, <Element p at 0...",3
3,"[[[<Element p at 0x15a538a00>, <Element p at 0...",4
4,"[[[<Element p at 0x15a538a00>, <Element p at 0...",5
5,"[[[<Element p at 0x15a538a00>, <Element p at 0...",6
6,"[[[<Element p at 0x15a538a00>, <Element p at 0...",7 9
7,"[[[<Element p at 0x15a538a00>, <Element p at 0...",10
8,"[[[<Element p at 0x15a538a00>, <Element p at 0...",11
9,"[[[<Element p at 0x15a538a00>, <Element p at 0...",12


In [12]:
etree.tostring(so.text_el).decode("utf-8")

'<text><body><p>1 <divisibleby2>2</divisibleby2> 3 <divisibleby2>4</divisibleby2> 5 <divisibleby2>6</divisibleby2> 7 9 <divisibleby2>10</divisibleby2></p><p> 11 <divisibleby2>12</divisibleby2> 13 <divisibleby2>14</divisibleby2> </p></body></text>'
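
As a quick sanity check, the serialized result can be parsed again to confirm that only even numbers were wrapped and that the text itself is unchanged. The sketch below assumes the string is copied from the output above and uses the standard library's `xml.etree.ElementTree` instead of lxml only to stay dependency-free; the names `result_xml`, `tagged`, and `plain_again` are illustrative.

```python
import xml.etree.ElementTree as ET

# Copied from the serialized output above.
result_xml = (
    '<text><body><p>1 <divisibleby2>2</divisibleby2> 3 '
    '<divisibleby2>4</divisibleby2> 5 <divisibleby2>6</divisibleby2> 7 9 '
    '<divisibleby2>10</divisibleby2></p><p> 11 <divisibleby2>12</divisibleby2> '
    '13 <divisibleby2>14</divisibleby2> </p></body></text>'
)

root = ET.fromstring(result_xml)

# Text content of every <divisibleby2> element, in document order.
tagged = [el.text for el in root.iter("divisibleby2")]

# Concatenating all text nodes should give back the original plain text.
plain_again = "".join(root.itertext())

print(tagged)
print(repr(plain_again))
```

If the round trip worked, `plain_again` equals the earlier value of `so.plain`: the new inline elements add markup without altering a single character of the text.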