# Analyzing the DOI / PID

In [1]:
import pandas as pd

The previous experiments analyzed two CSV files,
but their rows were never joined or matched,
so we don't know when a row of the manually normalized CSV
and a row of the CSV that Clea generated from the raw XML documents
refer to the same article.
The CSV created with Clea has $89451$ distinct rows,
yet most of them lack an `article_publisher_id`:

In [2]:
dataset = pd.read_csv("inner_join_2018-06-04.csv",
                      dtype=str,
                      keep_default_na=False) \
            .drop_duplicates()

In [3]:
dataset.shape

(89451, 39)

In [4]:
# Every column was read as str with "" for missing data,
# so a cell is "filled in" when it differs from the empty string.
pd.DataFrame({
    "filled_in": dataset.ne("").sum(),
    "distinct": dataset.apply(lambda col: col[col != ""].nunique()),
})

| column | filled_in | distinct |
|---|---|---|
| addr_city | 79922 | 2728 |
| addr_country | 85968 | 266 |
| addr_country_code | 67724 | 124 |
| addr_postal_code | 1533 | 425 |
| addr_state | 68545 | 671 |
| aff_email | 14973 | 10067 |
| aff_id | 89451 | 60 |
| aff_text | 89451 | 49959 |
| article_doi | 89428 | 20550 |
| article_publisher_id | 25059 | 5598 |


The table above shows, for each column,
the number of filled-in values and the number of distinct filled-in values.
Null/empty values aren't counted.

We can use several fields to try to find a match,
but the two main ones are the article IDs:
`article_doi` and `article_publisher_id`.
They should be paired as $1:1$,
and there are at least two public sources of information
that can be used for matching those fields:

- SciELO's ArticleMeta, whose Python client `articlemetaapi`
  is available in
  [PyPI](https://pypi.org/project/articlemetaapi/) and
  [GitHub](https://github.com/scieloorg/articlemetaapi);

- Crossref, whose
  [API documentation](https://github.com/Crossref/rest-api-doc)
  is quite explicit about its request rate limit policy.
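
The $1:1$ expectation between `article_doi` and `article_publisher_id`
can be checked directly on the rows that have both fields filled in.
The sketch below is only an illustration:
the `one_to_one_violations` helper and the toy values are hypothetical,
not part of the original analysis.

```python
import pandas as pd


def one_to_one_violations(df, left, right):
    """Return the distinct (left, right) pairs that break a 1:1 mapping.

    Rows with an empty value in either column are ignored,
    since they carry no pairing information.
    """
    filled = (df[left] != "") & (df[right] != "")
    pairs = df.loc[filled, [left, right]].drop_duplicates()
    # A value that shows up in more than one distinct pair breaks the mapping
    bad_left = pairs[left].duplicated(keep=False)
    bad_right = pairs[right].duplicated(keep=False)
    return pairs[bad_left | bad_right]


# Toy data, made up for illustration (not taken from the CSV)
toy = pd.DataFrame({
    "article_doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/c", ""],
    "article_publisher_id": ["S1", "S1", "S2", "S2", "S3"],
})
violations = one_to_one_violations(toy, "article_doi", "article_publisher_id")
print(violations)
```

An empty result would mean that every filled-in DOI is paired with exactly one PID and vice versa.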

As most rows have `article_doi` filled in,
it'll be used to find the missing `article_publisher_id` values.
Yet, even if we also do the reverse (get `article_doi` from `article_publisher_id`),
a few rows still wouldn't get filled.
$23$ rows don't have a DOI (`article_doi` is empty),
but most of these have an `article_publisher_id`:

In [5]:
no_doi_dataset = dataset[dataset["article_doi"] == ""]
no_doi_dataset["article_publisher_id"]

63008    S0365-05962013000100001
63009    S0365-05962013000100001
63010    S0365-05962013000100001
63011    S0365-05962013000100001
63038    S0365-05962013000100001
63039    S0365-05962013000100001
63040    S0365-05962013000100001
63041    S0365-05962013000100001
86122                           
86123                           
86124                           
86125                           
86126                           
86831    S1984-29612012000100016
86832    S1984-29612012000100016
86833    S1984-29612012000100016
86834    S1984-29612012000100016
86835    S1984-29612012000100016
86836    S1984-29612012000100016
86837    S1984-29612012000100016
86838    S1984-29612012000100016
86839    S1984-29612012000100016
86840    S1984-29612012000100016
Name: article_publisher_id, dtype: object
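
Once a DOI-to-PID mapping has been collected from an external source,
filling the empty `article_publisher_id` cells is a plain lookup.
This is a minimal sketch under assumptions:
the `fill_missing_pid` helper, the `doi_to_pid` dictionary and the toy values
are hypothetical, and building the mapping itself is not shown.

```python
import pandas as pd


def fill_missing_pid(df, doi_to_pid):
    """Fill empty article_publisher_id cells using a DOI -> PID mapping.

    Cells that are already filled are left untouched,
    and DOIs missing from the mapping keep an empty PID.
    """
    result = df.copy()
    missing = result["article_publisher_id"] == ""
    looked_up = result.loc[missing, "article_doi"].map(doi_to_pid).fillna("")
    result.loc[missing, "article_publisher_id"] = looked_up
    return result


# Toy data, made up for illustration (not taken from the CSV)
toy = pd.DataFrame({
    "article_doi": ["10.1/a", "10.1/b", ""],
    "article_publisher_id": ["", "S2", ""],
})
filled = fill_missing_pid(toy, {"10.1/a": "S1"})
print(filled)
```

Rows with neither a DOI nor a PID, like the last toy row, stay empty under this approach.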

The same analysis can be performed on the manually normalized CSV file:

In [6]:
ndata = pd.read_csv("aff_n15.csv",
                    sep="|",
                    dtype=str,
                    keep_default_na=False) \
          .drop_duplicates()

In [7]:
ndata.shape

(506786, 15)

In [8]:
# Every column was read as str with "" for missing data,
# so a cell is "filled in" when it differs from the empty string.
pd.DataFrame({
    "filled_in": ndata.ne("").sum(),
    "distinct": ndata.apply(lambda col: col[col != ""].nunique()),
})

| column | filled_in | distinct |
|---|---|---|
| coleção | 506786 | 1 |
| PID | 506786 | 263431 |
| ano de publicação | 506786 | 98 |
| tipo de documento | 506786 | 15 |
| título | 506786 | 360 |
| número | 506786 | 3182 |
| normalizado | 506786 | 2 |
| id de afiliação | 506548 | 138 |
| instituição original | 506785 | 7303 |
| paises original | 306760 | 279 |


There's no DOI (`article_doi`) column in this CSV,
but every single row has a `PID` (`article_publisher_id`) value.
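
Since `PID` and `article_publisher_id` hold the same kind of identifier,
the two tables can be matched at the article level with a merge on those columns.
The sketch below uses two made-up toy frames in place of `dataset` and `ndata`;
it is an assumption about how the matching could be done,
not the procedure actually used.

```python
import pandas as pd

# Toy stand-ins for the two CSVs (values made up for illustration)
clea = pd.DataFrame({
    "article_publisher_id": ["S1", "S1", "", "S3"],
    "aff_text": ["Univ. A", "Univ. B", "Univ. C", "Univ. D"],
})
normalized = pd.DataFrame({
    "PID": ["S1", "S2"],
    "instituição original": ["Universidade A", "Universidade X"],
})

# Rows without a PID can't be matched, so they are left out of the merge
merged = clea[clea["article_publisher_id"] != ""].merge(
    normalized,
    left_on="article_publisher_id",
    right_on="PID",
    how="outer",
    indicator=True,  # adds a "_merge" column: both / left_only / right_only
)
counts = merged["_merge"].value_counts()
print(counts)
```

This only pairs rows that belong to the same article:
an article with several affiliations yields every combination of its rows,
so matching individual affiliations would need further fields.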

## Crossref

The goal is to get the Crossref data for a given DOI,
so that we can find the PID for *almost* every row in Clea's output CSV,
and hopefully match the rows from both CSV files.

In order to do so, a `fetch_crossref.py` script was created,
which tries to comply with Crossref's rate limit policy
while asynchronously downloading the full data
for every DOI in the `article_doi` column of the input CSV.
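
For reference, the Crossref REST API serves the metadata of a single work
at `https://api.crossref.org/works/{DOI}`,
and a `mailto` query parameter is one way to get into the polite pool.
The sketch below only builds such a URL;
the `crossref_url` helper is hypothetical,
and whether `fetch_crossref.py` builds its requests this way is an assumption.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi, email=None):
    """Build the Crossref URL for a single DOI.

    The DOI is percent-encoded (keeping its slashes),
    and the e-mail, if given, goes in the "mailto" query parameter.
    """
    url = CROSSREF_WORKS + quote(doi, safe="/")
    if email:
        url += "?" + urlencode({"mailto": email})
    return url


# Made-up DOI and e-mail, for illustration only
example_url = crossref_url("10.1234/abc", email="me@example.org")
print(example_url)
```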

In [9]:
!./fetch_crossref.py --help

Usage: fetch_crossref.py [OPTIONS] CSV_FILE

Options:
  -e, --email TEXT           E-mail for using the polite pool.
  -i, --interval INTEGER     Minimum duration in seconds between bursts.
  -l, --limit INTEGER        Maximum number of requests allowed in an
                             interval.
  --update / --no-update     Update limit/interval following the X-Rate-Limit
                             headers.
  -I, --interval-rate FLOAT  Wait this rate times the interval between bursts.
  -L, --limit-rate FLOAT     Fetch this rate times the request limit in a
                             burst.
  -d, --out-dir DIRECTORY    Directory to store the collected JSON files.
  --help                     Show this message and exit.
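
The options above suggest a burst-based schedule:
send up to `limit` requests, wait `interval` seconds, and repeat,
with the two `*-rate` options scaling those numbers.
The sketch below illustrates that reading of the help text,
without performing any request.
Both helper functions are hypothetical, not the script's actual code,
and the header names `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval`
are the ones Crossref is assumed to send.

```python
def parse_rate_limit(headers, limit, interval):
    """Update (limit, interval) from the X-Rate-Limit response headers.

    `headers` is a plain dict here, so the lookup is case-sensitive;
    missing headers keep the current values.
    The interval header is assumed to look like "1s".
    """
    new_limit = int(headers.get("X-Rate-Limit-Limit", limit))
    raw_interval = str(headers.get("X-Rate-Limit-Interval", interval))
    return new_limit, int(raw_interval.rstrip("s"))


def burst_schedule(n_requests, limit, interval,
                   limit_rate=1.0, interval_rate=1.0):
    """Return (start_time_in_seconds, burst_size) pairs for n_requests."""
    burst = max(1, int(limit * limit_rate))  # requests per burst
    wait = interval * interval_rate          # seconds between bursts
    schedule = []
    start = 0.0
    remaining = n_requests
    while remaining > 0:
        size = min(burst, remaining)
        schedule.append((start, size))
        remaining -= size
        start += wait
    return schedule


# Made-up header values, for illustration only
limit, interval = parse_rate_limit(
    {"X-Rate-Limit-Limit": "50", "X-Rate-Limit-Interval": "1s"},
    limit=10, interval=2,
)
schedule = burst_schedule(120, limit, interval,
                          limit_rate=0.5, interval_rate=2.0)
print(schedule)
```

Using a `limit_rate` below $1$ and an `interval_rate` above $1$
keeps the request rate under the advertised limit, trading speed for safety.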
