# SourceData Augmentation

In this notebook we look at how to read the data encoded in XML files, using the classes, methods and functions already available in soda-roberta.

In [1]:
from smtag.encoder import XMLEncoder
import os
from lxml.etree import fromstring, Element

In [2]:
XML_FOLDER = '/app/data/xml/sd_panels/'

In [3]:
subsets = ["train", "eval", "test"]

In [4]:
source_file_path = os.path.join(XML_FOLDER, subsets[0]) + ".txt"
source_file_path

'/app/data/xml/sd_panels/train.txt'

Next we look at what the text of a single panel looks like in XML.

We take a single line to make it easier to understand and try out the XML encoder.

In [5]:
with open(source_file_path) as f:
    lines = f.readlines()
    print(type(lines))

line = lines[2]
print(len(lines))

<class 'list'>
48771


We use the `lxml.etree.fromstring` function to convert the `str` line into an XML element.

In [101]:
xml_example = fromstring(line)
xml_example

<Element sd-panel at 0xffff5b4002c0>

The XPath expression `sd-tag` selects the `sd-tag` child elements of the panel, so that we can inspect the attributes of each tag in the text. We are interested in `sd-tag` because these elements carry the information to be labeled.

In [102]:
test = xml_example.xpath("sd-tag")
for item in test:
    print(item.attrib)


{'id': 'sdTag15', 'source': 'sdapi', 'category': 'None', 'entity_type': 'protein', 'role': 'assayed', 'text': 'ATZ polymers', 'ext_ids': 'P01009', 'ext_dbs': '', 'in_caption': 'True', 'ext_names': 'SERPINA1', 'ext_tax_ids': '9606', 'ext_tax_names': 'Homo sapiens', 'ext_urls': 'https://www.uniprot.org/uniprot/'}
{'id': 'sdTag16', 'source': 'sdapi', 'category': 'assay', 'entity_type': 'None', 'role': 'None', 'text': 'immunoisolated', 'ext_ids': 'BAO_0002508', 'ext_dbs': '', 'in_caption': 'True', 'ext_names': 'immunoprecipitation', 'ext_tax_ids': '', 'ext_tax_names': '', 'ext_urls': 'https://bioportal.bioontology.org/ontologies/BAO/?p=classes&conceptid=http%3A%2F%2Fwww.bioassayontology.org%2Fbao%23'}
{'id': 'sdTag17', 'source': 'sdapi', 'category': 'None', 'entity_type': 'cell', 'role': 'component', 'text': 'MEF', 'ext_ids': 'CL:2000042', 'ext_dbs': '', 'in_caption': 'True', 'ext_names': 'embryonic fibroblast', 'ext_tax_ids': '10090', 'ext_tax_names': 'Mus musculus', 'ext_urls': ' https:/

In [103]:
test = xml_example.xpath("sd-tag")
print(f"{'Entity'}\t\t{'Category'}\t\t{'Role'}\t\t{'Text'}\t\t{'Extended names'}")
for item in test:
    if item.attrib.get("entity_type", None):
        print(f'{item.attrib["entity_type"][0:4]}\t\t{item.attrib["category"]}\t\t{item.attrib["role"]}\t\t{item.attrib["text"]}\t\t{item.attrib["ext_names"]}')


Entity		Category		Role		Text		Extended names
prot		None		assayed		ATZ polymers		SERPINA1
None		assay		None		immunoisolated		immunoprecipitation
cell		None		component		MEF		embryonic fibroblast
mole		None		intervention		BafA1		bafilomycin A1
mole		None		intervention		BafA1		bafilomycin A1
None		assay		None		Immunoprecipitation		immunoprecipitation
None		assay		None		IP		immunoprecipitation
prot		None		assayed		ATZ polymers		SERPINA1
prot		None		reporter		HA		
None		assay		None		western blot		western blot
None		assay		None		WB		western blot


In [104]:
xml_encoder = XMLEncoder(xml_example)

In [105]:
for element in xml_encoder.element.itertext():
    print(element)

E 
ATZ polymers
 
immunoisolated
 from lysates of WT 
MEF
 mock treated (lane 1), incubated for 12 h with 
BafA1
 (lane 2) and 4 h after 
BafA1
 wash-out (lane 3). 
Immunoprecipitation
 (
IP
) of 
ATZ polymers
 with polymer-specific 2C1 antibody, transfer on PVDF membrane, revealed with anti-
HA
 antibody on 
western blot
 (
WB
). 
F
 Quantification of 
E
, n=3, mean ± SEM. Unpaired two-tailed 
t
-test, ns P>0.05, * P<0.05.
 Data information: Scale bars: 10 μm. 


In [106]:
text = []
for element in xml_encoder.element.itertext():
    text.append(element)
"".join(text)

'E ATZ polymers immunoisolated from lysates of WT MEF mock treated (lane 1), incubated for 12 h with BafA1 (lane 2) and 4 h after BafA1 wash-out (lane 3). Immunoprecipitation (IP) of ATZ polymers with polymer-specific 2C1 antibody, transfer on PVDF membrane, revealed with anti-HA antibody on western blot (WB). F Quantification of E, n=3, mean ± SEM. Unpaired two-tailed t-test, ns P>0.05, * P<0.05. Data information: Scale bars: 10 μm. '

At this point we have a plain string of characters that can be tagged using the information in the `sd-tag` elements. From here on we could begin to write a proper word-labelling algorithm.
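As a rough illustration of how the tag information could be projected onto the string, the sketch below walks a toy panel and assigns one label per character. It uses the standard-library `xml.etree.ElementTree` as a stand-in for `lxml` (the `fromstring`, `.text`, `.tail` and `.attrib` calls behave the same way here). The helper `char_labels` and the toy XML are hypothetical and are not part of `smtag`.

```python
import xml.etree.ElementTree as ET


def char_labels(element, attribute="entity_type", outside="O"):
    """Return the inner text and one label per character (hypothetical helper)."""
    chars, labels = [], []

    def emit(text, label):
        # Append every character of `text` together with its label.
        for ch in text or "":
            chars.append(ch)
            labels.append(label)

    def walk(node, current):
        # Characters inside an sd-tag take the value of the chosen attribute.
        # The data stores missing values as the string 'None'.
        if node.tag == "sd-tag":
            value = node.attrib.get(attribute, "None")
            current = outside if value == "None" else value
        emit(node.text, current)
        for child in node:
            walk(child, current)
            # The tail follows the child but still belongs to `node`.
            emit(child.tail, current)

    walk(element, outside)
    return "".join(chars), labels


# Toy panel, made up for illustration only.
toy = ET.fromstring(
    '<sd-panel>E <sd-tag entity_type="protein">ATZ</sd-tag> in '
    '<sd-tag entity_type="cell">MEF</sd-tag>.</sd-panel>'
)
text, labels = char_labels(toy)
for ch, label in zip(text, labels):
    print(repr(ch), label)
```

The text and the label list always have the same length, which is the property a later character-to-word grouping step relies on.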

To generate augmented data we need to go back to the XML and change the names of the different entities.

## Creating augmented data using the template of a given caption for panelization

What we have seen up to now is a summary of how to get to the point where we can begin working with the data from the `xml` files.

Now we will see how to get the data into a form in which we can edit parts of it. Since the XML tags encode the labels we need for the text, most of the work consists of extracting these labels and modifying them.

The main problem is that, in the case of panels, the characters or words that identify the different panels are not standardized and are not contained in `xml` tags. This makes the data augmentation procedure a bit more complicated.

Note that `sd_panels` stores the information of just one panel at a time, while `sd_fig` stores all the panels of a figure. The latter looks like the better option for data augmentation in the panelization task.

In [116]:
line

'<fig id="22315"><title>.</title><label>Figure 2</label><graphic href="https://api.sourcedata.io/file.php?figure_id=22315"/><sd-panel panel_id="65030">A <sd-tag id="sdTag38" source="sdapi" category="None" entity_type="cell" role="component" text="HEK293" ext_ids="CVCL_0045" ext_dbs="" in_caption="True" ext_names="HEK293" ext_tax_ids="9606" ext_tax_names="Homo sapiens" ext_urls="https://identifiers.org/cellosaurus:">HEK293</sd-tag> cells transfected with empty vector (lanes 1, 6), <sd-tag id="sdTag39" source="sdapi" category="None" entity_type="gene" role="intervention" text="ATZ" ext_ids="5265" ext_dbs="" in_caption="True" ext_names="SERPINA1" ext_tax_ids="9606" ext_tax_names="Homo sapiens" ext_urls="http://www.ncbi.nlm.nih.gov/gene/">ATZ</sd-tag>-<sd-tag id="sdTag40" source="sdapi" category="None" entity_type="gene" role="reporter" text="HA" ext_ids="" ext_dbs="" in_caption="True" ext_names="" ext_tax_ids="" ext_tax_names="" ext_urls="">HA</sd-tag> (2, 7), <sd-tag id="sdTag41" source=

In [7]:
XML_FOLDER = '/app/data/xml/sd_fig/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[0]) + ".txt"

with open(source_file_path) as f:
    lines = f.readlines()
    print(type(lines))

line = lines[0]

xml_example = fromstring(line)
print(len(lines))

<class 'list'>
12108


We now try to get the first characters of each panel. We expect them to be the panel ID in the text: `A`, `B` and so on. Unfortunately, it is not that easy in this case: the output jumps from `A` directly to `E`.

The reason is that the other panel IDs are wrapped in `strong` tags, which makes things more difficult.

In [143]:
test = xml_example.xpath("sd-panel")
for item in test:
    print(item.text)

    
xml_encoder = XMLEncoder(xml_example)    
    
text = []
for element in xml_encoder.element.itertext():
    text.append(element)
"".join(text)    

A Intracellular 
E 
G-K Same as 


".Figure 1A Intracellular localization of total (HA) and polymeric ATZ (2C1) in WT MEF mock-treated, Confocal Laser Scanning Microscopy (CLSM). B Same as A for MEF exposed to 50 nM BafA1 for 12 h. C Same as A, 4 h after BafA1 wash-out. D Quantification of ATZ-positive, LAMP1-positive endolysosomes (EL) (n=13, 10, 11 cells, respectively). One-way ANOVA and Dunnett's multiple comparisons test, ns P>0.05, **** P<0.0001. Data information: Scale bars: 10 μm. E ATZ polymers immunoisolated from lysates of WT MEF mock treated (lane 1), incubated for 12 h with BafA1 (lane 2) and 4 h after BafA1 wash-out (lane 3). Immunoprecipitation (IP) of ATZ polymers with polymer-specific 2C1 antibody, transfer on PVDF membrane, revealed with anti-HA antibody on western blot (WB). F Quantification of E, n=3, mean ± SEM. Unpaired two-tailed t-test, ns P>0.05, * P<0.05. Data information: Scale bars: 10 μm. G-K Same as B in WT MEF, in cells exposed to 20 mM CST and in Cnx-, Crt- and ERp57-KO MEF. L Quantificati

We check the child elements and see only a few of them. This means that iterating over `sd-panel` does not give us everything, because the `sd-panel` elements contain further nested elements. We need the XPath to these nested elements to get what we want, that is, the first character of each panel.

In [144]:
for element in xml_encoder.element:
    print(element)

<Element title at 0xffff5b4b8d80>
<Element label at 0xffff5b4b8f80>
<Element graphic at 0xffff5b4b8940>
<Element sd-panel at 0xffff70073440>
<Element sd-panel at 0xffff5b6b9dc0>
<Element sd-panel at 0xffff5b19ed40>


In [148]:
test = xml_example.xpath("sd-panel/p/strong")
for item in test:
    print(item.text)

B
A
C
A
D
F
E
L


We now check how easy it would be to pick up the panel IDs in the figure programmatically, with a small statistical experiment: we `.split()` the text of each panel and collect the first token to see whether this already gives us the panel ID.

In [192]:
XML_FOLDER = '/app/data/xml/sd_panels/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[2]) + ".txt"

with open(source_file_path) as f:
    lines = f.readlines()
first_chars = []
for line in lines: 
    xml_example = fromstring(line)  
    xml_encoder = XMLEncoder(xml_example)
    text = []
    for element in xml_encoder.element.itertext():
        text.append(element)
    ids = "".join(text).split()[0]
    if len(ids) < 5:
        first_chars.append(ids)

In [193]:
from collections import Counter

In [194]:
import numpy as np
counter = Counter(first_chars)
print(len(lines), np.array(list(counter.values())).sum())
sorted(counter.items())


7178 5298


[("'", 2),
 ('(A', 10),
 ('(A)', 369),
 ('(A),', 1),
 ('(A,', 35),
 ('(A1)', 1),
 ('(A2)', 1),
 ('(A3)', 1),
 ('(B', 5),
 ('(B)', 292),
 ('(B),', 1),
 ('(B).', 1),
 ('(B,', 14),
 ('(B-', 1),
 ('(B1)', 1),
 ('(C', 3),
 ('(C)', 284),
 ('(C,', 19),
 ('(C2)', 1),
 ('(D', 4),
 ('(D)', 238),
 ('(D,', 11),
 ('(D;', 1),
 ('(E', 6),
 ('(E)', 172),
 ('(E,', 20),
 ('(E1,', 1),
 ('(F', 1),
 ('(F)', 146),
 ('(F,', 3),
 ('(G', 2),
 ('(G)', 107),
 ('(G,', 8),
 ('(H)', 66),
 ('(H,', 5),
 ('(I', 2),
 ('(I)', 42),
 ('(I,', 6),
 ('(J)', 30),
 ('(J,', 3),
 ('(K)', 19),
 ('(K,', 5),
 ('(L)', 13),
 ('(L,', 1),
 ('(M', 1),
 ('(M)', 11),
 ('(M,', 1),
 ('(N)', 9),
 ('(N,', 1),
 ('(O)', 3),
 ('(O,', 1),
 ('(P)', 2),
 ('(P,', 2),
 ('(Q)', 1),
 ('(Q,', 1),
 ('(R)', 1),
 ('(S)', 2),
 ('(T)', 1),
 ('(U)', 1),
 ('(U,', 1),
 ('(V)', 1),
 ('(a)', 90),
 ('(a,', 5),
 ('(ai)', 1),
 ('(b)', 81),
 ('(b,', 1),
 ('(c)', 68),
 ('(c,', 5),
 ('(d)', 50),
 ('(d,', 1),
 ('(e)', 44),
 ('(e).', 1),
 ('(e,', 4),
 ('(f)', 30),
 ('(f,

We can see great variability in how panels are identified in the figure captions. We also see cases in which no text is available to identify the panel. It will be important to find the best way to prepare the training data with this in mind.
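One possible way to tame this variability is to map the first token onto a single panel letter before counting. The following is only a sketch: the pattern `PANEL_TOKEN` and the helper `normalize_panel_token` are hypothetical, and merging lowercase and uppercase labels is an assumption that may not suit every caption style.

```python
import re
from collections import Counter

# Optional "(", one letter, optional digit, optional ")", optional trailing punctuation.
PANEL_TOKEN = re.compile(r"^\(?([A-Za-z])\d?\)?[.,;:-]?$")


def normalize_panel_token(token):
    """Return the upper-cased panel letter, or None if the token does not look like one."""
    match = PANEL_TOKEN.match(token)
    return match.group(1).upper() if match else None


# A few of the variants seen in the counter above.
tokens = ["(A)", "(A,", "(A1)", "(B-", "(a)", "(B).", "'", "(ai)"]
normalized = [normalize_panel_token(t) for t in tokens]
counts = Counter(n for n in normalized if n is not None)
print(normalized)
print(counts)
```

Note that a bare single-letter word such as the article `A` at the start of a caption would also match, so the result still needs a sanity check against the data.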

In [196]:
line

'<sd-panel panel_id="60831">(D-F) <sd-tag id="sdTag413" source="sdapi" category="assay" entity_type="None" role="None" text="Immunohistological" ext_ids="BAO_0000415" ext_dbs="" in_caption="True" ext_names="immunohistochemistry" ext_tax_ids="" ext_tax_names="" ext_urls="https://bioportal.bioontology.org/ontologies/BAO/?p=classes&amp;conceptid=http%3A%2F%2Fwww.bioassayontology.org%2Fbao%23">Immunohistological</sd-tag> analysis of <sd-tag id="sdTag414" source="sdapi" category="None" entity_type="tissue" role="component" text="splenic" ext_ids="UBERON:0002106" ext_dbs="" in_caption="True" ext_names="spleen" ext_tax_ids="" ext_tax_names="" ext_urls="https://www.ebi.ac.uk/ols/ontologies/uberon/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_">splenic</sd-tag> sections <sd-tag id="sdTag415" source="sdapi" category="assay" entity_type="None" role="None" text="stained" ext_ids="OBI_0302887" ext_dbs="" in_caption="True" ext_names="staining" ext_tax_ids="" ext_tax_names="" ext_urls="ht

Let us combine the panel and figure information to create a training set where we know that the panels belong to the figures.
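As a sketch of what "knowing that the panels belong to the figure" could mean in practice, the code below records the character offset at which each `sd-panel` starts inside the inner text of its figure. It uses the standard-library `xml.etree.ElementTree` in place of `lxml`. The helper `panel_start_offsets` and the toy figure are hypothetical; this is not the way `XMLEncoder` computes its offsets.

```python
import xml.etree.ElementTree as ET


def panel_start_offsets(fig):
    """Return the figure's inner text and the start offset of each sd-panel."""
    parts, starts = [], []
    position = 0

    def add(text):
        nonlocal position
        if text:
            parts.append(text)
            position += len(text)

    def walk(node):
        if node.tag == "sd-panel":
            # The panel starts wherever the running text currently ends.
            starts.append(position)
        add(node.text)
        for child in node:
            walk(child)
            add(child.tail)

    walk(fig)
    return "".join(parts), starts


# Toy figure with one plain panel label and one wrapped in <strong>.
fig = ET.fromstring(
    "<fig><label>Figure 1</label>"
    "<sd-panel>A First panel.</sd-panel> "
    "<sd-panel><p><strong>B</strong> Second.</p></sd-panel></fig>"
)
fig_text, starts = panel_start_offsets(fig)
print(fig_text)
print(starts, [fig_text[i] for i in starts])
```

Because the offsets are computed on the full inner text, they do not depend on whether the panel label sits directly in `sd-panel` or inside nested tags such as `strong`.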

In [226]:
XML_FOLDER = '/app/data/xml/sd_fig/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[2]) + ".txt"

with open(source_file_path) as f:
    lines = f.readlines()
# first_chars = []
# for line in lines: 
#     xml_example = fromstring(line)  
#     xml_encoder = XMLEncoder(xml_example)
#     text = []
#     for element in xml_encoder.element.itertext():
#         text.append(element)
#     ids = "".join(text).split()[0]
#     if len(ids) < 5:
#         first_chars.append(ids)

line = lines[7]
xml_example = fromstring(line)
xml_encoder = XMLEncoder(xml_example)
test = xml_example.xpath("*")
for item in test:
    print(item.text)

.
Figure 6
None
A, B Triple 
Triple 
Triple 
G Quantification of 
H Quantification of double-positive 


from collections import Counter

In [245]:
XML_FOLDER = '/app/data/xml/sd_fig/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[2]) + ".txt"

with open(source_file_path) as f:
    lines = f.readlines()
    first_chars = []
    for line in lines: 
        xml_example = fromstring(line)  
        xml_encoder = XMLEncoder(xml_example)
        text = []
        for element in xml_encoder.element.itertext():
            text.append(element)

        print("".join(text))
        

.Figure 1A Intracellular localization of total (HA) and polymeric ATZ (2C1) in WT MEF mock-treated, Confocal Laser Scanning Microscopy (CLSM). B Same as A for MEF exposed to 50 nM BafA1 for 12 h. C Same as A, 4 h after BafA1 wash-out. D Quantification of ATZ-positive, LAMP1-positive endolysosomes (EL) (n=13, 10, 11 cells, respectively). One-way ANOVA and Dunnett's multiple comparisons test, ns P>0.05, **** P<0.0001. Data information: Scale bars: 10 μm. E ATZ polymers immunoisolated from lysates of WT MEF mock treated (lane 1), incubated for 12 h with BafA1 (lane 2) and 4 h after BafA1 wash-out (lane 3). Immunoprecipitation (IP) of ATZ polymers with polymer-specific 2C1 antibody, transfer on PVDF membrane, revealed with anti-HA antibody on western blot (WB). F Quantification of E, n=3, mean ± SEM. Unpaired two-tailed t-test, ns P>0.05, * P<0.05. Data information: Scale bars: 10 μm. G-K Same as B in WT MEF, in cells exposed to 20 mM CST and in Cnx-, Crt- and ERp57-KO MEF. L Quantificatio

.Figure 4Real time PCR analysis of P. aeruginosa specific immune response genes in naive and 1-undecene odor exposed N2 worms. n = 3. * P ≤ 0.05, ** P ≤ 0.01 as determined by two-tailed unpaired t-test. Error bars indicate SEM. The negative values are arrived at by representing FC value less than 1 as (-1/FC). Real time PCR analysis of irg-1, irg-2 and irg-3 genes in N2 and zip-2(tm4248) worms exposed to 1-undecene odor upon respective naive worms. n = 3. ns (not significant) P > 0.05, * P ≤ 0.05, ** P ≤ 0.01 as determined by two-tailed unpaired t-test. Error bars indicate SEM. irg-1p::GFP induction in naive worms and worms exposed to P. aeruginosa (6 h) and 1-undecene odor (2 h). Scale bar = 500 µm. Real time PCR analysis of irg-1, irg-2 and irg-3 genes in N2, odr-3(n2150) and odr-3(n2046) worms exposed to 1-undecene odor upon respective naive worms. n = 3. * P ≤ 0.05, ** P ≤ 0.01 as determined by two-tailed unpaired t-test. Error bars indicate SEM. Real time PCR analysis of irg-1, ir

.Figure 4T22-GFP-H6-FdU depletes CXCR4+ cancer cells from SW1417 CRC tumor tissue after a 100 µg single dose administration. Note the reduction in CXCR4+ cell fraction in the tumor 24h after injection, their almost complete elimination at 48h and the re-emergence of CXCR4+ cells 72h post-administration, using anti-CXCR4 IHC. In contrast, the CXCR4+ cancer cell fraction (CXCR4+ CCF) in tumor tissue remains constant along time after free oligo-FdU treatment. The three day time-lapse for CXCR4+ tumor cell re-appearance defines the dosage interval used in a repeated dose schedule of nanoconjugate administration in the experiments to evaluate its antimetastatic effect. (N=5: 5 mice/group; 1 samples/mouse). Scale bar, 50 µm. Data expressed as mean±s.e.m Significant reduction in the number of spheroid formed (C. optical microscope) and their bioluminescence emission (D, IVIS Spectrum 200), generated by 1x106 disaggregated cells (cultured in stem cell conditioned media and low-adhesion plates)

## Generating HuggingFace datasets for panelization

In [248]:
TEXT = """Figure 2 Rendering images. Rendering a high resolution figure from a set of subfigures. 
(A) Whole figure layout (consisting of three subfigures, denoted A, B and C, in the low resolution page space. 
(B) Actual subfigure dimensions. (C-D) (Reconstructed high resolution figure."""


import re
# Raw strings avoid invalid-escape warnings for "\(" and "\)".
re.findall(r"\(.?\)", TEXT), re.findall(r"\(.?-.?\)", TEXT)

(['(A)', '(B)'], ['(C-D)'])
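The two patterns above could be merged into one that also reports where each label occurs and expands ranges such as `(C-E)` into individual letters. This is a minimal sketch under the assumption that panel labels are single letters in parentheses. `PANEL_LABEL`, `find_panel_labels` and the caption are made up for illustration.

```python
import re

# One letter in parentheses, optionally followed by "-<letter>" for a range.
PANEL_LABEL = re.compile(r"\(([A-Za-z])(?:-([A-Za-z]))?\)")


def find_panel_labels(text):
    """Return (start offset, list of panel letters) for every label found."""
    found = []
    for match in PANEL_LABEL.finditer(text):
        first, last = match.group(1), match.group(2)
        if last is None:
            letters = [first]
        else:
            # Expand "C-E" into C, D, E; a reversed range yields an empty list.
            letters = [chr(code) for code in range(ord(first), ord(last) + 1)]
        found.append((match.start(), letters))
    return found


caption = "(A) Layout. (B) Dimensions. (C-E) Reconstructed figure."
panel_labels = find_panel_labels(caption)
print(panel_labels)
```

The start offsets could later be used to mark the corresponding token as the beginning of a panel.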

## Generating HuggingFace datasets for Roles or syntactic segmentation

In [1]:
from smtag.xml2labels import SourceDataCodes as sdc


In [2]:
sdc.SMALL_MOL_ROLES.all_labels

['CONTROLLED_VAR', 'MEASURED_VAR']

In [11]:
XML_FOLDER = '/app/data/xml/sd_fig/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[1]) + ".txt"

code_maps = [sdc.BORING, sdc.ENTITY_TYPES, sdc.GENEPROD_ROLES, sdc.PANELIZATION, sdc.SMALL_MOL_ROLES]

def innertext(xml):
    return "".join([t for t in xml.itertext()])

with open(source_file_path) as f:
    lines = f.readlines()
    #print(lines[20])
    
    for line in lines:
        xml_example = fromstring(line)

        xml_encoder = XMLEncoder(xml_example)
        #print(xml_encoder.element.itertext())
        inner_text = innertext(xml_encoder.element)

        label_dict_chars, label_dict_words = {}, {}
        label_dict_chars['text'] = list(inner_text)
        if inner_text.startswith("Antigenicity"):
            break
    
    for code_map in code_maps:
        # At this point we have a tag for each character.
        # It is here where I should put chars together into words
        words, label_words = [], []
        xml_encoded = xml_encoder.encode(code_map)
        label_dict_chars[code_map.name] = xml_encoded['label_ids']

        word = ''
        label_word = ''
        for i, char in enumerate(label_dict_chars['text']):
            if char.isalnum():
                word += char
                label_word += str(label_dict_chars[code_map.name][i]).replace("None", "O")
            elif char == " ":
                if word not in [""]:
                    words.append(word)
                    label_words.append(label_word[0])
                word = ''
                label_word = ''
            else:
                if word not in [""]:
                    words.append(word)
                    label_words.append(label_word[0])
                words.append(char)
                label_words.append(str(label_dict_chars[code_map.name][i]).replace("None", "O"))
                word = ''
                label_word = ''
                
        if code_map.name == "panel_start":
            print(words)
            print(label_words)
            stop  # deliberately undefined name: raises NameError to halt the cell here
        #label_dict_words[code_map.name] = label_words
        iob2_labels = []
        print(code_map)
        
        for idx, label in enumerate(label_words):
            if code_map.name == "panel_start":
                iob2_labels.append("O")
            
            if code_map.name != "panel_start":    
                if label == "O":
                    iob2_labels.append(label)

                if (label != "O"):
                    if idx == 0:
                        iob2_labels.append(code_map.iob2_labels[int(label)*2])
                    if (idx > 0) and (label_words[idx-1] != label):
                        iob2_labels.append(code_map.iob2_labels[int(label)*2])
                    if (idx > 0) and (label_words[idx-1] == label):
                        iob2_labels.append(code_map.iob2_labels[int(label)*2-1])
                
        label_dict_words[code_map.name] = iob2_labels
        iob2_labels = []
       
                
        #print(words, label_words)
    label_dict_words['words'] = words
    
    print(label_dict_words)
        
    for key in label_dict_words.keys():
        print(len(label_dict_words[key]))

SourceDataCodes.BORING
SourceDataCodes.ENTITY_TYPES
SourceDataCodes.GENEPROD_ROLES
['Antigenicity', 'of', 'the', '2019', '-', 'nCoV', 'RBD', '.', 'Figure', '4', '.', 'The', 'SARS', '-', 'CoV', 'RBD', 'is', 'shown', 'as', 'a', 'white', 'molecular', 'surface', '(', 'PDB', 'ID', ':', '2AJF', ')', ',', 'with', 'residues', 'that', 'vary', 'in', 'the', '2019', '-', 'nCoV', 'RBD', 'colored', 'red', '.', 'The', 'ACE2', 'binding', 'site', 'is', 'outlined', 'with', 'a', 'black', 'dotted', 'line', '.', 'A', 'biolayer', 'interferometry', 'sensorgram', 'that', 'shows', 'binding', 'to', 'ACE2', 'by', 'the', '2019', '-', 'nCoV', 'RBD', '-', 'SD1', '.', 'Binding', 'data', 'are', 'shown', 'as', 'a', 'black', 'line', 'and', 'the', 'best', 'fit', 'of', 'the', 'data', 'to', 'a', '1', ':', '1', 'binding', 'model', 'is', 'shown', 'in', 'red', '.', 'Biolayer', 'interferometry', 'to', 'measure', 'cross', '-', 'reactivity', 'of', 'the', 'SARS', '-', 'CoV', 'RBD', '-', 'directed', 'antibodies', 'S230', ',', 'm3

NameError: name 'stop' is not defined
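The IOB2 step attempted in the cell above can be sketched independently of `SourceDataCodes`. The helper `to_iob2` below is hypothetical and works on plain class names per word rather than on the numeric label ids returned by the encoder. It assumes that two adjacent words with the same class belong to the same entity.

```python
def to_iob2(labels, outside="O"):
    """Convert one class label per word into IOB2 tags."""
    iob2 = []
    previous = outside
    for label in labels:
        if label == outside:
            iob2.append(outside)
        elif label == previous:
            # Same class as the previous word: continue the entity.
            iob2.append(f"I-{label}")
        else:
            # First word of a new entity.
            iob2.append(f"B-{label}")
        previous = label
    return iob2


word_labels = ["O", "GENEPROD", "GENEPROD", "O", "CELL", "GENEPROD"]
iob2_tags = to_iob2(word_labels)
print(iob2_tags)
```

Each word receives exactly one tag, so the output list always has the same length as the list of words, which makes the length check at the end of the cell trivial to satisfy.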

In [32]:
XML_FOLDER = '/app/data/xml/sd_fig/'
subsets = ["train", "eval", "test"]
source_file_path = os.path.join(XML_FOLDER, subsets[1]) + ".txt"

code_maps = [sdc.BORING, sdc.ENTITY_TYPES, sdc.GENEPROD_ROLES, sdc.PANELIZATION, sdc.SMALL_MOL_ROLES]

def innertext(xml):
    return "".join([t for t in xml.itertext()])

with open(source_file_path) as f:
    lines = f.readlines()
    #print(lines[20])
    
    for line in lines:
        xml_example = fromstring(line)

        xml_encoder = XMLEncoder(xml_example)
        #print(xml_encoder.element.itertext())
        inner_text = innertext(xml_encoder.element)

        label_dict_chars, label_dict_words = {}, {}
        label_dict_chars['text'] = list(inner_text)
        if inner_text.startswith("Antigenicity"):
            break
    
    code_map = sdc.PANELIZATION
    # At this point we have a tag for each character.
    # It is here where I should put chars together into words
    words, label_words = [], []
    xml_encoded = xml_encoder.encode(code_map)
    label_dict_chars[code_map.name] = xml_encoded['label_ids']
        
    output = ["O"] * len(xml_encoded['label_ids'])
    offsets = xml_encoded["offsets"]
    for offset in offsets:
        output[offset[0]] = "B-PANEL_START"
        

    word, label_word = '', ''
    for i, char in enumerate(label_dict_chars['text']):
        if char.isalnum():
            word += char
            label_word += str(output[i])
        elif char == " ":
            if word not in [""]:
                words.append(word)
                if "B-PANEL_START" in label_word:
                    label_words.append("B-PANEL_START")
                else:
                    label_words.append("O")
            word = ''
            label_word = ''
        else:
            if word not in [""]:
                words.append(word)
                if "B-PANEL_START" in label_word:
                    label_words.append("B-PANEL_START")
                else:
                    label_words.append("O")
            words.append(char)
            label_words.append(output[i])
            word = ''
            label_word = ''
                               
    for word, label in zip(words, label_words):
        print(word, label)
    


Antigenicity O
of O
the O
2019 O
- O
nCoV O
RBD O
. O
Figure O
4 O
. O
The B-PANEL_START
SARS O
- O
CoV O
RBD O
is O
shown O
as O
a O
white O
molecular O
surface O
( O
PDB O
ID O
: O
2AJF O
) O
, O
with O
residues O
that O
vary O
in O
the O
2019 O
- O
nCoV O
RBD O
colored O
red O
. O
The O
ACE2 O
binding O
site O
is O
outlined O
with O
a O
black O
dotted O
line O
. O
A B-PANEL_START
biolayer O
interferometry O
sensorgram O
that O
shows O
binding O
to O
ACE2 O
by O
the O
2019 O
- O
nCoV O
RBD O
- O
SD1 O
. O
Binding O
data O
are O
shown O
as O
a O
black O
line O
and O
the O
best O
fit O
of O
the O
data O
to O
a O
1 O
: O
1 O
binding O
model O
is O
shown O
in O
red O
. O
Biolayer B-PANEL_START
interferometry O
to O
measure O
cross O
- O
reactivity O
of O
the O
SARS O
- O
CoV O
RBD O
- O
directed O
antibodies O
S230 O
, O
m396 O
and O
80R O
. O
Sensortips O
with O
immobilized O
antibodies O
were O
dipped O
into O
wells O
containing O
2019 O
- O
nCoV O
RBD O
- O
SD1 O
and O
the O
resulting
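The character-to-word grouping used in the cell above could be factored into a small helper. The sketch below is a hypothetical rewrite, not the `smtag` implementation. It differs from the cell in two ways that are assumptions on my side: it drops any whitespace character rather than only `" "`, and it flushes the last word when the text ends without punctuation.

```python
def chars_to_words(text, char_labels, outside="O"):
    """Group characters into words and keep one label per word."""
    words, labels = [], []
    word, word_labels = "", []

    def flush():
        nonlocal word, word_labels
        if word:
            words.append(word)
            # A word is tagged if any of its characters carries a label.
            tagged = [label for label in word_labels if label != outside]
            labels.append(tagged[0] if tagged else outside)
        word, word_labels = "", []

    for char, label in zip(text, char_labels):
        if char.isalnum():
            word += char
            word_labels.append(label)
        else:
            flush()
            if not char.isspace():
                # Punctuation becomes a token of its own.
                words.append(char)
                labels.append(label)
    flush()
    return words, labels


sample = "A RBD-SD1."
sample_labels = ["B-PANEL_START"] + ["O"] * (len(sample) - 1)
words, word_labels = chars_to_words(sample, sample_labels)
for word, label in zip(words, word_labels):
    print(word, label)
```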

In [17]:
["O"] * 3

['O', 'O', 'O']

In [51]:
sdc.ENTITY_TYPES.constraints.get(1)

{'label': 'SMALL_MOLECULE',
 'tag': 'sd-tag',
 'attributes': {'entity_type': ['molecule']}}

In [10]:
ord('ß')

223