## Using xmltodict

https://pypi.org/project/xmltodict/

In [1]:
import xmltodict

In [2]:
with open("selecao_xml_br/rbepid/v20n1/1980-5497-rbepid-20-01-00115.xml") as f:
    xml_document_dict = xmltodict.parse(f.read())

In [3]:
xml_document_dict["article"]["front"]["article-meta"]

OrderedDict([('article-id',
              OrderedDict([('@pub-id-type', 'doi'),
                           ('#text', '10.1590/1980-5497201700010010')])),
             ('article-categories',
              OrderedDict([('subj-group',
                            OrderedDict([('@subj-group-type', 'heading'),
                                         ('subject',
                                          'ARTIGOS ORIGINAIS')]))])),
             ('title-group',
              OrderedDict([('article-title',
                            'Ingestão de energia e nutrientes segundo consumo de alimentos fora do lar na Região Nordeste: uma análise do Inquérito Nacional de Alimentação 2008-2009')])),
             ('contrib-group',
              OrderedDict([('contrib',
                            [OrderedDict([('@contrib-type', 'author'),
                                          ('name',
                                           OrderedDict([('surname',
                                                 

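The same key holds sometimes a list and sometimes a single value, depending on how many child elements the XML has, which makes the result tricky to walk. It may be easier to work directly from the ElementTree.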
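The list-versus-value inconsistency can be smoothed over with a small helper. The sketch below uses hand-written dictionaries that only imitate the shape of the output above; both the sample data and the `as_list` helper are illustrative and are not part of xmltodict.

```python
def as_list(value):
    """Return a list: [] for None, the value itself if already a list, else [value]."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hypothetical fragments shaped like xmltodict output:
# a single <contrib> child yields a dict, several yield a list of dicts.
single = {"contrib-group": {"contrib": {"name": {"surname": "Cavalcante"}}}}
several = {"contrib-group": {"contrib": [
    {"name": {"surname": "Cavalcante"}},
    {"name": {"surname": "Moreira"}},
]}}


def surnames(meta):
    # Normalizing first lets the same loop handle both shapes
    contribs = as_list(meta["contrib-group"]["contrib"])
    return [contrib["name"]["surname"] for contrib in contribs]


single_surnames = surnames(single)
several_surnames = surnames(several)
print(single_surnames)
print(several_surnames)
```

`xmltodict.parse` also accepts a `force_list` argument naming keys that should always become lists, but every such key has to be known in advance, so the walking code stays schema-dependent either way.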

## ElementTree to lines

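The goal here is to "serialize" an XML document into text lines, one for each non-empty text node in the document tree, with each line prefixed by the path of tags and attributes that leads to that text.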

In [4]:
from operator import concat
from functools import reduce
from lxml import etree

In [5]:
with open("selecao_xml_br/rbepid/v20n1/1980-5497-rbepid-20-01-00115.xml") as f:
    doctree = etree.parse(f)

In [6]:
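def etree_to_lines(root):
    """Return one line per non-empty text node under root.

    Each line is the path of tags (with their sorted attributes) down to
    the text node, followed by "#text=" and the repr of the stripped text.
    """
    result = []
    def step(branch, path):
        path += "/" + branch.tag
        for k, v in sorted(branch.items()):
            path += f"@{k}={v!r}"
        # node() yields every child node, text included, in document order
        for node in branch.xpath("node()"):
            if not isinstance(node, str):
                step(node, path)
            else:
                node_strip = node.strip()
                if node_strip:
                    result.append(f"{path}#text={node_strip!r}")
    step(root, "")
    return result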
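
To try the idea without lxml or the sample files, here is a rough variant of the same traversal that uses only the standard library's `xml.etree.ElementTree`. It is a sketch, not a drop-in replacement: the name `etree_to_lines_stdlib` and the inline XML are made up for illustration, and text is collected through `.text` and `.tail` because the standard library has no `xpath("node()")`.

```python
import xml.etree.ElementTree as ET


def etree_to_lines_stdlib(root):
    """One line per non-empty text node, in document order."""
    result = []

    def step(branch, path):
        path += "/" + branch.tag
        for k, v in sorted(branch.items()):
            path += f"@{k}={v!r}"

        def emit(text):
            text = (text or "").strip()
            if text:
                result.append(f"{path}#text={text!r}")

        emit(branch.text)        # text before the first child
        for child in branch:
            step(child, path)
            emit(child.tail)     # text right after this child, owned by branch

    step(root, "")
    return result


# Hand-written sample in the style of an <aff> element
sample = ET.fromstring(
    '<aff id="aff1"><label>I</label>'
    '<institution content-type="orgname">Example University</institution>'
    '<country country="BR">Brazil</country></aff>'
)
sample_lines = etree_to_lines_stdlib(sample)
for line in sample_lines:
    print(line)
```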

In [7]:
list(etree_to_lines(doctree.xpath("//aff")[0]))

["/aff@id='aff1'/label#text='I'",
 "/aff@id='aff1'/institution@content-type='original'#text='Universidade de Fortaleza - Fortaleza (CE), Brasil.'",
 "/aff@id='aff1'/institution@content-type='normalized'#text='Universidade de Fortaleza'",
 "/aff@id='aff1'/institution@content-type='orgname'#text='Universidade de Fortaleza'",
 "/aff@id='aff1'/addr-line/named-content@content-type='city'#text='Fortaleza'",
 "/aff@id='aff1'/addr-line/named-content@content-type='state'#text='CE'",
 "/aff@id='aff1'/country@country='BR'#text='Brazil'"]

In [8]:
reduce(concat, map(etree_to_lines, doctree.iter("aff")))

["/aff@id='aff1'/label#text='I'",
 "/aff@id='aff1'/institution@content-type='original'#text='Universidade de Fortaleza - Fortaleza (CE), Brasil.'",
 "/aff@id='aff1'/institution@content-type='normalized'#text='Universidade de Fortaleza'",
 "/aff@id='aff1'/institution@content-type='orgname'#text='Universidade de Fortaleza'",
 "/aff@id='aff1'/addr-line/named-content@content-type='city'#text='Fortaleza'",
 "/aff@id='aff1'/addr-line/named-content@content-type='state'#text='CE'",
 "/aff@id='aff1'/country@country='BR'#text='Brazil'",
 "/aff@id='aff2'/label#text='II'",
 "/aff@id='aff2'/institution@content-type='original'#text='Curso de Mestrado Acadêmico em Nutrição e Saúde (CMANS) da Universidade Estadual do Ceará - Fortaleza (CE), Brasil.'",
 "/aff@id='aff2'/institution@content-type='normalized'#text='Universidade Estadual do Ceará'",
 "/aff@id='aff2'/institution@content-type='orgname'#text='Universidade Estadual do Ceará'",
 "/aff@id='aff2'/addr-line/named-content@content-type='city'#text='

In [9]:
lines = reduce(concat, map(etree_to_lines, doctree.xpath("//front//aff|//front//contrib")))
lines

["/contrib@contrib-type='author'/name/surname#text='Cavalcante'",
 "/contrib@contrib-type='author'/name/given-names#text='Jessica Brito'",
 "/contrib@contrib-type='author'/xref@ref-type='aff'@rid='aff1'/sup#text='I'",
 "/contrib@contrib-type='author'/name/surname#text='Moreira'",
 "/contrib@contrib-type='author'/name/given-names#text='Tyciane Maria Vieira'",
 "/contrib@contrib-type='author'/xref@ref-type='aff'@rid='aff1'/sup#text='I'",
 "/contrib@contrib-type='author'/name/surname#text='Mota'",
 "/contrib@contrib-type='author'/name/given-names#text='Caroline da Costa'",
 "/contrib@contrib-type='author'/xref@ref-type='aff'@rid='aff1'/sup#text='I'",
 "/contrib@contrib-type='author'/name/surname#text='Pontes'",
 "/contrib@contrib-type='author'/name/given-names#text='Carolinne Reinaldo'",
 "/contrib@contrib-type='author'/xref@ref-type='aff'@rid='aff1'/sup#text='I'",
 "/contrib@contrib-type='author'/name/surname#text='Bezerra'",
 "/contrib@contrib-type='author'/name/given-names#text='Ilana 

## Approximate regex matching

The external [regex](https://pypi.org/project/regex/) Python library has an *Approximate “fuzzy” matching* feature that might be useful here.

In [10]:
import regex

In [11]:
match_surname = regex.match(r"/contrib@contrib-type='author'/name/surname#text=(?<surname>.*)$", lines[0])
match_surname

<regex.Match object; span=(0, 61), match="/contrib@contrib-type='author'/name/surname#text='Cavalcante'">

In [12]:
match_surname.groupdict()

{'surname': "'Cavalcante'"}

The captured value is still a Python string literal, quotes included, so it needs to be evaluated to get the actual text:

In [13]:
from ast import literal_eval

In [14]:
{k: literal_eval(v) for k, v in match_surname.groupdict().items()}

{'surname': 'Cavalcante'}
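
This round trip works because the serialized lines are written with `!r` (that is, `repr`), and `literal_eval` undoes `repr` for plain strings, including ones that contain quotes or line breaks. A small check with made-up values:

```python
from ast import literal_eval

# Made-up values covering quotes and a line break
names = ["Cavalcante", "D'Ávila", 'say "hi"', "line\nbreak"]

for name in names:
    print(repr(name))  # each repr stays on a single line

round_trip = [literal_eval(repr(name)) for name in names]
```

Since `repr` escapes line breaks, every serialized value stays on one line, which is what allows the lines to be joined with `"\n"` and searched as one string.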

Let's try now with something that has typos:

In [15]:
example_surname = "/contribb@contrib-type='autor'/name/surrname#text='Bellini'"

In [16]:
regex_surname_fuzzy = regex.compile(r"""(?xb)
  /(?:contrib){e<=2}
    @(?:contrib-type){e<=2}
      ='(?:author){e<=2}'
  /(?:name/surname){e<=2}
  \#text=(?<surname>.*)$
""")

In [17]:
match_surname_fuzzy = regex_surname_fuzzy.match(example_surname)
match_surname_fuzzy

<regex.Match object; span=(0, 59), match="/contribb@contrib-type='autor'/name/surrname#text='Bellini'", fuzzy_counts=(0, 2, 1)>

In [18]:
{k: literal_eval(v) for k, v in match_surname_fuzzy.groupdict().items()}

{'surname': 'Bellini'}
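
In the match above, `fuzzy_counts=(0, 2, 1)` counts substitutions, insertions and deletions, in that order: two inserted letters (`contribb`, `surrname`) and one deleted letter (`autor`). The `(?b)` flag asks for the best match rather than the first acceptable one. A minimal sketch of the `{e<=N}` constraint on its own, using made-up test strings:

```python
import regex

# At most one error (substitution, insertion or deletion) is allowed
author_fuzzy = regex.compile(r"(?b)(?:author){e<=1}")

close = author_fuzzy.fullmatch("autor")   # one letter missing
far = author_fuzzy.fullmatch("editor")    # too many differences

print(close.fuzzy_counts if close else None)
print(far)
```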

## Extracting data using a list of regexes

In [19]:
from itertools import chain

In [20]:
def etree_line_gen(root):
    def step(branch, path):
        path += "/" + branch.tag
        for k, v in sorted(branch.items()):
            path += f"@{k}={v!r}"
        #yield path
        for node in branch.xpath("node()"):
            if not isinstance(node, str):
                yield from step(node, path)
            else:
                node_strip = node.strip()
                if node_strip:
                    yield f"{path}#text={node_strip!r}"
    yield from step(root, "")

In [21]:
def apply_etree_regexes(root, regexes):
    # Search the whole serialized element at once; each regex that
    # matches contributes its named groups to the resulting dict.
    lines = "\n".join(etree_line_gen(root))
    matches_gen = (r.search(lines) for r in regexes)
    matches_pairs = chain.from_iterable(match.groupdict().items()
                                        for match in matches_gen if match)
    return {k: literal_eval(v) for k, v in matches_pairs}

In [22]:
aff_regexes = list(map(regex.compile, [
    "/aff@(?:id){e<=1}=(?<id>'.*?')",
    "(?b)/(?:institution){e<=2}@(?:content-type){e<=4}='(?:orgname){e<=3}'#text=(?<institution_orgname>.*)",
    "(?b)/(?:institution){e<=2}@(?:content-type){e<=4}='(?:normalized){e<=4}'#text=(?<institution_orgname_rewritten>.*)",
    "(?b)/(?:institution){e<=2}@(?:content-type){e<=4}='(?:orgdiv){e<=2}1'#text=(?<institution_orgdiv1>.*)",
    "(?b)/(?:institution){e<=2}@(?:content-type){e<=4}='(?:orgdiv){e<=2}2'#text=(?<institution_orgdiv2>.*)",
    "(?b)/(?:institution){e<=2}@(?:content-type){e<=4}='(?:original){e<=3}'#text=(?<institution_original>.*)",
    "(?b)/(?:email){e<=1}#text=(?<contrib_email>.*)",
]))

In [23]:
with open("selecao_xml_br/rbort/v48n1/0102-3616-rbort-48-01-0041.xml") as f:
    unverified_doctree = etree.parse(f)

In [24]:
for aff in unverified_doctree.iter("aff"):
    print(apply_etree_regexes(aff, aff_regexes))

{'id': 'aff1', 'institution_orgname': 'Paraná Club', 'institution_orgdiv1': 'Ninho da Gralha Player Training Center'}
{'id': 'aff2', 'institution_orgname': 'Paraná Clube', 'institution_orgdiv1': 'Centro de Formação de Atletas Ninho da Gralha'}


In [25]:
print(etree.tostring(unverified_doctree.xpath("//aff")[1], encoding="utf-8").decode("utf-8"))

<aff xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" id="aff2">
					<label>1</label>
				Médico Ortopedista e Traumatologista do
					<institution content-type="orgdiv1">Centro de Formação de Atletas Ninho da Gralha</institution>
					<institution content-type="orgname">Paraná Clube</institution>
					<addr-line>
						<named-content content-type="city">Quatro Barras</named-content>,
		<named-content content-type="state">Paraná</named-content>
					</addr-line>
					<country>Brasil</country>
				</aff>
			


Now with a typo in an attribute name (`ref-typr` instead of `ref-type` in the first `<xref>`):

In [26]:
print(etree.tostring(unverified_doctree.xpath("//contrib")[1], encoding="utf-8").decode("utf-8"))

<contrib xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" contrib-type="author">
					<name>
						<surname>Carvalho</surname>
						<given-names>Daniel Augusto de</given-names>
					</name>
					<xref ref-typr="aff" rid="aff2">1</xref>
					<xref ref-type="corresp" rid="cor2">*</xref>
				</contrib>
				


In [27]:
contrib_regexes = list(map(regex.compile, [
    r"(?b)/(?:contrib){e<=1}.*/(?:surname){e<=2}#text=(?<contrib_surname>.*)",
    r"(?b)/(?:contrib){e<=1}.*(?:given-names){e<=3}#text=(?<contrib_given_names>.*)",
    r"(?b)/(?:contrib){e<=1}.*(?:prefix){e<=2}#text=(?<contrib_prefix>.*)",
    r"(?b)/(?:contrib){e<=1}.*(?:suffix){e<=2}#text=(?<contrib_suffix>.*)",
    r"(?b)/(?:xref){e<=1}@(?:ref-type){e<=2}='(?:aff){e<=1}'@(?:rid){e<=1}=(?<xref_aff>'.*?')",
    r"(?b)/(?:xref){e<=1}@(?:ref-type){e<=2}='(?:corresp){e<=1}'@(?:rid){e<=1}=(?<xref_corresp>'.*?')",
]))

In [28]:
for contrib in unverified_doctree.iter("contrib"):
    print(apply_etree_regexes(contrib, contrib_regexes))

{'contrib_surname': 'Carvalho', 'contrib_given_names': 'Daniel Augusto de', 'xref_aff': 'aff1', 'xref_corresp': 'cor1'}
{'contrib_surname': 'Carvalho', 'contrib_given_names': 'Daniel Augusto de', 'xref_aff': 'aff2', 'xref_corresp': 'cor2'}
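
Note that the second `<contrib>` still yields `xref_aff` even though its attribute is misspelled as `ref-typr`. The sketch below isolates that difference between exact and fuzzy matching. The `line` is hand-written in the serialized format used above, and the exact pattern is only a simplified stand-in written for this comparison.

```python
import re
from ast import literal_eval

import regex

# Hand-written line with the misspelled attribute name "ref-typr"
line = "/contrib@contrib-type='author'/xref@ref-typr='aff'@rid='aff2'#text='1'"

# An exact pattern does not tolerate the typo
exact = re.search(r"/xref@ref-type='aff'@rid=(?P<xref_aff>'.*?')", line)

# The fuzzy pattern allows up to two errors in "ref-type"
fuzzy = regex.search(
    r"(?b)/(?:xref){e<=1}@(?:ref-type){e<=2}='(?:aff){e<=1}'"
    r"@(?:rid){e<=1}=(?<xref_aff>'.*?')",
    line,
)
xref_aff = literal_eval(fuzzy.group("xref_aff")) if fuzzy else None

print(exact)
print(xref_aff)
```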


## Finding `<aff>` and `<contrib>` nodes using the *regex approximation* approach

In [29]:
def etree_tag_line_gen(root, start=""):
    start += "/" + root.tag
    yield start, root
    for node in root.iterchildren():
        yield from etree_tag_line_gen(node, start)

In [30]:
document_regex_searchers = {k: regex.compile(v).search for k, v in [
    ("article_id", r"/(?:front){e<=1}/.*/(?:article-id){e<=2}$"),
    ("contrib", r"/(?:front){e<=1}/(?:article-meta){e<=4}/.*/(?:contrib){e<=2}$"),
    ("aff", r"/(?:front){e<=1}/(?:article-meta){e<=4}/(?:aff){e<=1}$"),
]}

In [31]:
for tag_line, element in etree_tag_line_gen(doctree.getroot()):
    for k, searcher in document_regex_searchers.items():
        if searcher(tag_line):
            print(k, tag_line, element)

article_id /article/front/article-meta/article-id <Element article-id at 0x7f87fd4fc888>
contrib /article/front/article-meta/contrib-group/contrib <Element contrib at 0x7f87fd4fc8c8>
contrib /article/front/article-meta/contrib-group/contrib <Element contrib at 0x7f87fd4fc808>
contrib /article/front/article-meta/contrib-group/contrib <Element contrib at 0x7f87fd4fc9c8>
contrib /article/front/article-meta/contrib-group/contrib <Element contrib at 0x7f87fd4fc8c8>
contrib /article/front/article-meta/contrib-group/contrib <Element contrib at 0x7f87fd4fc808>
aff /article/front/article-meta/aff <Element aff at 0x7f87fd4fc988>
aff /article/front/article-meta/aff <Element aff at 0x7f87fd4fc808>


## Building `<aff>` and `<contrib>` matching pairs

In [32]:
from collections import defaultdict

In [33]:
def get_article_id_contrib_aff(root):
    result = defaultdict(list)
    for tag_line, element in etree_tag_line_gen(root):
        for k, searcher in document_regex_searchers.items():
            if searcher(tag_line):
                result[k].append(element)
    return result

In [34]:
{k: len(v) for k, v in get_article_id_contrib_aff(doctree.getroot()).items()}

{'article_id': 1, 'contrib': 5, 'aff': 2}
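
One possible next step, sketched here as an assumption rather than as this notebook's actual continuation, is to join each contributor to its affiliation by comparing the extracted `xref_aff` value with the `id` extracted from each `<aff>`. The dictionaries below are copied from the outputs printed earlier for the second file; `pair_contribs_with_affs` is a hypothetical helper name.

```python
def pair_contribs_with_affs(contribs, affs):
    """Pair each contrib dict with the aff dict whose id equals its xref_aff.

    Contribs without a matching affiliation are paired with None.
    """
    affs_by_id = {aff["id"]: aff for aff in affs if "id" in aff}
    return [(contrib, affs_by_id.get(contrib.get("xref_aff")))
            for contrib in contribs]


# Dictionaries as printed by apply_etree_regexes earlier in this notebook
affs = [
    {"id": "aff1", "institution_orgname": "Paraná Club",
     "institution_orgdiv1": "Ninho da Gralha Player Training Center"},
    {"id": "aff2", "institution_orgname": "Paraná Clube",
     "institution_orgdiv1": "Centro de Formação de Atletas Ninho da Gralha"},
]
contribs = [
    {"contrib_surname": "Carvalho", "contrib_given_names": "Daniel Augusto de",
     "xref_aff": "aff1", "xref_corresp": "cor1"},
    {"contrib_surname": "Carvalho", "contrib_given_names": "Daniel Augusto de",
     "xref_aff": "aff2", "xref_corresp": "cor2"},
]

pairs = pair_contribs_with_affs(contribs, affs)
for contrib, aff in pairs:
    print(contrib["contrib_surname"], "->", aff["institution_orgname"] if aff else None)
```

An exact comparison of identifiers is the simplest choice; whether the identifiers themselves would also need approximate matching is left open here.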

To be continued...