# Using scielo-clea to analyze an XML file

The version in this example is [scielo-clea v0.1.0](https://pypi.org/project/scielo-clea/0.1.0/).

In [1]:
from clea.core import Article
from clea.join import aff_contrib_inner_join

This XML file has a `<contrib>` element with two `<xref>` elements, one for each available `<aff>`:

In [2]:
aff_contrib_inner_join(Article("1984-4689-zool-35-e14641.xml"))

[{'institution_orgname': 'Smithsonian Tropical Research Institute',
  'journal_publisher_id': 'zool',
  'addr_state': '',
  'issn_epub': '1984-4689',
  'xref_aff': 'aff1; aff2',
  'contrib_role': '',
  'xref_aff_text': '1; 2',
  'publisher_name': 'Sociedade Brasileira de Zoologia',
  'institution_original': 'Smithsonian Tropical Research Institute, Box 0843-03092, Balboa, Republic of Panama.',
  'contrib_prefix': '',
  'journal_title': 'Zoologia (Curitiba)',
  'xref_corresp_text': '*',
  'addr_city': 'Balboa',
  'issn_ppub': '1984-4670',
  'aff_text': '1 Smithsonian Tropical Research Institute, Box 0843-03092, Balboa, Republic of Panama. Smithsonian Tropical Research Institute Smithsonian Tropical Research Institute Balboa Panama',
  'contrib_type': 'author',
  'article_title': 'Thermal tolerance of the zoea I stage of four Neotropical crab species (Crustacea: Decapoda)',
  'contrib_given_names': 'Adriana P.',
  'label': '1',
  'institution_orgdiv1': '',
  'contrib_degrees': '',
  'con

It's hard to extract high-level knowledge just by looking at a huge JSON-like representation like this one. The goal should be to let an algorithm help with that.
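
As a minimal sketch of that idea, the rows can be summarized by counting how many of them carry a non-empty value in each field. The `sample_rows` list below is a hand-made stand-in: it has only three of the keys shown in the output above, and its second row is hypothetical. In the notebook it would be the list returned by `aff_contrib_inner_join`. The helper name `count_filled_fields` is an assumption for illustration, not part of scielo-clea.

```python
import collections


def count_filled_fields(rows):
    """Count, for each key, how many rows have a non-empty value."""
    counter = collections.Counter()
    for row in rows:
        counter.update(key for key, value in row.items() if value)
    return counter


# Hand-made stand-in for the output of aff_contrib_inner_join (assumption);
# the second row is hypothetical and only illustrates an empty field.
sample_rows = [
    {"addr_city": "Balboa", "addr_state": "", "contrib_type": "author"},
    {"addr_city": "", "addr_state": "", "contrib_type": "author"},
]
filled = count_filled_fields(sample_rows)
print(filled.most_common())
```

Fields that are never filled simply don't appear in the counter, so a lookup for them yields zero.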

# Generate a CSV with all information from `scielo-clea`

In [3]:
import collections, csv, pathlib, multiprocessing

In [4]:
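def path2rows(path):
    """Return the joined rows of one XML file, or no rows if it can't be processed."""
    try:
        return aff_contrib_inner_join(Article(str(path)))
    except Exception:  # Skip invalid files, but don't swallow KeyboardInterrupt
        return []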

In [5]:
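def multiprocess_row_generator(paths):
    """Yield the rows of every path, processing the files in a worker pool."""
    with multiprocessing.Pool() as pool:
        # imap keeps the input order and yields results lazily
        for rows in pool.imap(path2rows, paths):
            yield from rows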

In [6]:
%%time
paths = pathlib.Path("selecao_xml_br").glob("**/*.xml")
rows = iter(multiprocess_row_generator(paths))
row = next(rows)
header = sorted(row.keys())
with open("inner_join_2018-06-04.csv", "w", newline="") as output_file:
    cw = csv.writer(output_file)
    cw.writerow(header)
    cw.writerow(row[col_name] for col_name in header)
    for row in rows:
        cw.writerow(row[col_name] for col_name in header)

CPU times: user 6.99 s, sys: 934 ms, total: 7.92 s
Wall time: 11min 32s


The generated `inner_join_2018-06-04.csv` has 74,024,497 bytes (about 71 MiB).
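
A file of this size is easier to explore by tallying one column at a time than by reading it row by row. The sketch below shows one way to do that with the standard library only. The helper name `count_column_values` is an assumption for illustration, and the tiny in-memory CSV is made-up data that only reuses two column names and values from the example output; with the real file, one would pass `open("inner_join_2018-06-04.csv", newline="")` instead.

```python
import collections
import csv
import io


def count_column_values(csv_file, column):
    """Tally how often each value appears in one column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    return collections.Counter(row[column] for row in reader)


# Tiny in-memory stand-in for the generated file (made-up rows, assumption)
fake_csv = io.StringIO(
    "addr_city,journal_title\n"
    "Balboa,Zoologia (Curitiba)\n"
    "Balboa,Zoologia (Curitiba)\n"
    ",Zoologia (Curitiba)\n"
)
city_counts = count_column_values(fake_csv, "addr_city")
print(city_counts.most_common())
```

Since `csv.DictReader` streams the rows, only the counter is kept in memory, not the whole file.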