# Applying FastText directly on ISO data

In [1]:
from functools import reduce
from pathlib import Path
import re

import pandas as pd

from ioisis import iso

## Loading ISO data with [`ioisis`](https://github.com/scieloorg/ioisis) 0.1.0 and Pandas

This is the directory with the ISO data.
It can be the mounted `isis2mongo` source data directory,
a copy of it,
or a symbolic link to the actual path.

In [2]:
iso_root_dir = "./iso_files/2019-11-13/"

The directory should follow the `COLLECTION/DATASET.iso` structure,
where `COLLECTION` is the three-letter lowercase code of a collection,
and `DATASET` is the name of a dataset
(`title`, `artigo`, `bib4cit` or `issue`).
For now we'll use just the `artigo.iso` files:

In [3]:
iso_file_names_map = {iso_file.parts[-2]: str(iso_file.resolve())
                      for iso_file in Path(iso_root_dir).glob("???/artigo.iso")}
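
As a minimal sketch of what the glob pattern above selects,
the snippet below builds a throwaway directory tree
and applies the same dictionary comprehension to it.
The collection codes and the `long` directory are made up for illustration.
Only directories whose name has exactly three characters
and that contain an `artigo.iso` file end up in the mapping.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout, created only for this illustration
    for rel_path in ["scl/artigo.iso", "scl/title.iso",
                     "arg/artigo.iso", "long/artigo.iso"]:
        file_path = Path(tmp_dir) / rel_path
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()

    # Same comprehension as above, applied to the temporary directory
    demo_map = {iso_file.parts[-2]: str(iso_file.resolve())
                for iso_file in Path(tmp_dir).glob("???/artigo.iso")}

print(sorted(demo_map))  # ['arg', 'scl']
```

The `title.iso` file and the four-letter `long` directory are ignored,
since neither matches `???/artigo.iso`.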

These functions let us get the first records
from each of these datasets.
We are going to use only the records of the *header* type,
since the `artigo.iso` data files have more than one schema
(one can think of it as a distinct "table"
for each value of the field `706`).

In [4]:
def filter_headers(iterator):
    """Yield only the records whose field 706 is "h" (header)."""
    for el in iterator:
        if el["706"] == ["h"]:
            yield el

In [5]:
def take(n, iterator):
    """Return a list with at most the first n items of the iterator."""
    return [el for unused, el in zip(range(n), iterator)]
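
The two helpers above can be tried without any ISO file.
This is just a sketch with made-up records
that mimic the `defaultdict(list)` structure of the real ones;
the `fake_record` helper, the record types other than `"h"`
and the PIDs are assumptions for illustration only.

```python
from collections import defaultdict


def filter_headers(iterator):
    for el in iterator:
        if el["706"] == ["h"]:
            yield el


def take(n, iterator):
    return [el for unused, el in zip(range(n), iterator)]


def fake_record(rec_type, pid):
    # Hypothetical helper: a record with only the fields 706 and 2
    return defaultdict(list, {"706": [rec_type], "2": [pid]})


records = iter([
    fake_record("o", "a"),
    fake_record("h", "b"),
    fake_record("f", "c"),
    fake_record("h", "d"),
    fake_record("h", "e"),
])

first_two = take(2, filter_headers(records))
print([rec["2"] for rec in first_two])  # [['b'], ['d']]

# The pipeline is lazy: the last record has not been consumed yet
remaining = [rec["2"] for rec in records]
print(remaining)  # [['e']]
```

Since `range(n)` is the first argument of `zip`,
the iteration stops before pulling an extra record from the source.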

Let's get the first record from the SciELO Brazil ISO file
to see what it looks like:

In [6]:
scl1, = take(1, filter_headers(iso.iter_records(iso_file_names_map["scl"])))
scl1

defaultdict(list,
            {'1': ['br1.1'],
             '2': ['S0044-5967(04)03400101'],
             '4': ['v34n1'],
             '10': ['^rND^1A01^nSilvio Roberto Miranda dos^sSantos',
              '^rND^1A01^nIzildinha de Souza^sMiranda',
              '^rND^1A01^nManoel Malheiros^sTourinho'],
             '12': ['Estimativa de biomassa de sistemas agroflorestais das várzeas do rio juba, Cametá, Pará^lpt',
              'Biomass estimation of agroforestry systems of the Juba river floodplain in Cametá, Pará^len'],
             '14': ['^f01^l08'],
             '30': ['Acta Amaz.'],
             '31': ['34'],
             '32': ['1'],
             '35': ['0044-5967'],
             '38': ['ILUS', 'TAB'],
             '40': ['pt'],
             '42': ['1'],
             '49': ['AA010'],
             '58': ['Projeto Várzea/UFRA'],
             '65': ['20040000'],
             '70': ['Universidade Federal Rural da Amazônia-UFRA^iA01^cBelém^sPA^pBrasil^e<a href="mailto:izildinha@ufra.

It's a dictionary of lists of strings,
in the `{field_id: [field, field, field, ...]}` format.
Each field string has multiple subfields,
where each subfield starts with a `^` symbol
followed by its single-character identifier.
This regular expression converts a field string
to a list of subfield pairs
(similar to the format type 2 of
[isis2json](https://github.com/scieloorg/isis2json)):

In [7]:
get_all_subfields_pairs = re.compile(r"(?:^|\^(.))([^^]*)").findall
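
To make the behavior of this regular expression concrete,
here is a small sketch applying it to two field strings
taken from the record shown above
(the affiliation is shortened, leaving out the e-mail subfield).
The second part assumes Python 3.7 or newer,
where a non-empty match can start just after an empty one.

```python
import re

get_all_subfields_pairs = re.compile(r"(?:^|\^(.))([^^]*)").findall

# The text before the first "^" is paired with an empty identifier
aff = "Universidade Federal Rural da Amazônia-UFRA^iA01^cBelém^sPA^pBrasil"
aff_pairs = get_all_subfields_pairs(aff)
print(aff_pairs)
# [('', 'Universidade Federal Rural da Amazônia-UFRA'), ('i', 'A01'),
#  ('c', 'Belém'), ('s', 'PA'), ('p', 'Brasil')]

# A field starting with "^" yields an empty leading pair,
# which can be dropped by filtering out the empty values
contrib = "^rND^1A01^nManoel Malheiros^sTourinho"
contrib_dict = {k: v for k, v in get_all_subfields_pairs(contrib) if v}
print(contrib_dict)
# {'r': 'ND', '1': 'A01', 'n': 'Manoel Malheiros', 's': 'Tourinho'}
```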

What matters most for us
is a way to get the affiliation-contributor pairs
with their data joined together in a "first normal form" sense,
keeping the affiliation data even if it has no linked contributor.
We can get a dataframe like that with Pandas:

In [8]:
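def field_df(record, field_id):
    """Build a dataframe with a row for each occurrence of the field
    and a column for each subfield, named field_id + subfield_id
    ("_" stands for the text before the first subfield)."""
    return pd.DataFrame([
        {k: v for k, v in get_all_subfields_pairs(field) if v}
        for field in record[field_id]
    ]).rename(columns=lambda name: field_id + (name or "_"))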

In [9]:
scl1_affs = field_df(scl1, "70")
scl1_affs

Unnamed: 0,70_,70i,70c,70s,70p,70e
0,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,"<a href=""mailto:izildinha@ufra.edu.br"">izildin..."


In [10]:
scl1_contribs = field_df(scl1, "10")
scl1_contribs

Unnamed: 0,10r,101,10n,10s
0,ND,A01,Silvio Roberto Miranda dos,Santos
1,ND,A01,Izildinha de Souza,Miranda
2,ND,A01,Manoel Malheiros,Tourinho


In [11]:
scl1_aff_contrib_pairs = pd.merge(scl1_affs, scl1_contribs,
                                  how="left", left_on="70i", right_on="101")
scl1_aff_contrib_pairs

Unnamed: 0,70_,70i,70c,70s,70p,70e,10r,101,10n,10s
0,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,"<a href=""mailto:izildinha@ufra.edu.br"">izildin...",ND,A01,Silvio Roberto Miranda dos,Santos
1,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,"<a href=""mailto:izildinha@ufra.edu.br"">izildin...",ND,A01,Izildinha de Souza,Miranda
2,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,"<a href=""mailto:izildinha@ufra.edu.br"">izildin...",ND,A01,Manoel Malheiros,Tourinho
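
In this record every contributor is linked to the single affiliation,
so the effect of `how="left"` isn't visible.
The sketch below uses made-up data (organization names and surnames
are placeholders) in which the affiliation `A02` has no contributor,
to show that its row is still kept.

```python
import pandas as pd

# Made-up data: no contributor is linked to the affiliation A02
demo_affs = pd.DataFrame({"70_": ["Org One", "Org Two"],
                          "70i": ["A01", "A02"]})
demo_contribs = pd.DataFrame({"101": ["A01", "A01"],
                              "10s": ["Santos", "Miranda"]})

# Left merge: every affiliation row appears at least once in the result
demo_pairs = pd.merge(demo_affs, demo_contribs,
                      how="left", left_on="70i", right_on="101")
print(demo_pairs)
```

The result has three rows: two for `A01` (one for each contributor)
and one for `A02`, whose contributor columns are filled with `NaN`.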


The ISO data has less information
than what we would get from the XML data
(e.g. it has neither the citations nor the issue information),
but it has some other information, like the PID.
We can "join" (in a cartesian product sense)
the multiple values of some fields:

In [12]:
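def cartesian_df(*dfs, free_name="_empty"):
    """Cartesian product of the dataframes, joined on a constant
    temporary index (free_name must not be the name of a column)."""
    return reduce(
        pd.DataFrame.join,
        [df.assign(**{free_name: 0}).set_index(free_name) for df in dfs],
    ).reset_index(drop=True)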

In [13]:
scl1_other = cartesian_df(field_df(scl1, "2"), field_df(scl1, "12"), field_df(scl1, "936"))
scl1_other

Unnamed: 0,2_,12_,12l,936i,936y,936o
0,S0044-5967(04)03400101,Estimativa de biomassa de sistemas agroflorest...,pt,0044-5967,2004,1
1,S0044-5967(04)03400101,Biomass estimation of agroforestry systems of ...,en,0044-5967,2004,1
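
As a small sanity check of the cartesian product,
this sketch applies `cartesian_df` to two made-up dataframes
and compares the result with `pd.merge(..., how="cross")`.
The cross merge is only an alternative shown for comparison,
it isn't what this notebook uses,
and it assumes pandas 1.2 or newer.

```python
from functools import reduce

import pandas as pd


def cartesian_df(*dfs, free_name="_empty"):
    return reduce(
        pd.DataFrame.join,
        [df.assign(**{free_name: 0}).set_index(free_name) for df in dfs],
    ).reset_index(drop=True)


# Made-up dataframes with 2 and 3 rows
letters = pd.DataFrame({"letter": ["a", "b"]})
numbers = pd.DataFrame({"number": [1, 2, 3]})

product = cartesian_df(letters, numbers)
print(len(product))  # 6

# Assumption: pandas >= 1.2, where how="cross" is available
cross = pd.merge(letters, numbers, how="cross")
same_rows = (set(product.itertuples(index=False, name=None))
             == set(cross.itertuples(index=False, name=None)))
print(same_rows)  # True
```

Every dataframe gets the same constant index value,
so joining on the index pairs each row with every row of the other one.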


That article has a title in more than one language.
We can use the cartesian product again to join these rows
with the affiliation-contributor pairs
in order to get a dataframe from a single ISO record,
and we can also replace these field/subfield IDs
with more meaningful names:

In [14]:
COLUMN_NAMES = {
    "70_": "orgname",
    "701": "orgdiv1",
    "702": "orgdiv2",
    "703": "orgdiv3",
    "704": "normalized",
    "708": "c8",
    "709": "original",
    "70e": "email",
    "70c": "city",
    "70d": "division",
    "70s": "state",
    "70p": "country",
    "70q": "country_full",
    "70z": "zip",
    "70i": "affid",
    "70l": "label",
    "101": "affid",
    "10s": "surname",
    "10n": "names",
    "10p": "prefix",
    "10z": "suffix",
    "10r": "role",
    "10k": "orcid",
    "2_": "pid",
    "12_": "title",
    "12l": "title_lang",
    "936i": "issn",
    "936y": "year",
    "936o": "number",
}

In [15]:
def record2df(record):
    field_dfs = {field_id: field_df(record, field_id)
                 for field_id in ["70", "10", "2", "12", "936"]}
    return cartesian_df(
        pd.merge(field_dfs.pop("70"), field_dfs.pop("10"),
                 how="left", left_on="70i", right_on="101"),
        *field_dfs.values(),
    ).rename(columns=COLUMN_NAMES)

In [16]:
record2df(scl1).drop(columns=["email"])

Unnamed: 0,orgname,affid,city,state,country,role,affid.1,names,surname,pid,title,title_lang,issn,year,number
0,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Silvio Roberto Miranda dos,Santos,S0044-5967(04)03400101,Estimativa de biomassa de sistemas agroflorest...,pt,0044-5967,2004,1
1,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Silvio Roberto Miranda dos,Santos,S0044-5967(04)03400101,Biomass estimation of agroforestry systems of ...,en,0044-5967,2004,1
2,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Izildinha de Souza,Miranda,S0044-5967(04)03400101,Estimativa de biomassa de sistemas agroflorest...,pt,0044-5967,2004,1
3,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Izildinha de Souza,Miranda,S0044-5967(04)03400101,Biomass estimation of agroforestry systems of ...,en,0044-5967,2004,1
4,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Manoel Malheiros,Tourinho,S0044-5967(04)03400101,Estimativa de biomassa de sistemas agroflorest...,pt,0044-5967,2004,1
5,Universidade Federal Rural da Amazônia-UFRA,A01,Belém,PA,Brasil,ND,A01,Manoel Malheiros,Tourinho,S0044-5967(04)03400101,Biomass estimation of agroforestry systems of ...,en,0044-5967,2004,1
