# Matching and normalizing PID/DOI

This resumes [2018-07-12 experiments](experiments_2018-07-12.ipynb)
on analyzing the PID and DOI columns from the CSV files,
which should be seen as keys to match rows from distinct CSV files.

## Creating a "PID from DOI" CSV from Crossref data

From the previous experiments, we've got $20$ thousand files
from the [Crossref API](https://github.com/Crossref/rest-api-doc)
using the `fetch_crossref.py` script:

In [1]:
!ls crossref-by-doi/*.json | wc -l

20364


We can create a single CSV from these JSON files
with two columns: DOI and PID.
However, the `;` character isn't a recommended separator
since some DOI values have it:

In [2]:
!cat crossref-by-doi/*.json | jq -r .message.DOI | grep ';' | wc -l

36


But none of them has a comma:

In [3]:
!cat crossref-by-doi/*.json | jq -r .message.DOI | grep ',' | wc -l

0


This shell script should create the desired CSV:

In [4]:
!( \
  echo DOI,PID && \
  cat crossref-by-doi/*.json \
    | jq -r '.message | select(has("alternative-id")) | "\(.DOI),\(.["alternative-id"][0])"' \
) > crossref-pid-from-doi.csv
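
The same extraction can be sketched in Python. This is only an illustration, not the script used above: `write_doi_pid_csv` is a hypothetical helper, and the sample documents are made up to mimic the shape of the Crossref output. Using the `csv` module has the side benefit of quoting any field that happens to contain the separator.

```python
import csv
import io


def write_doi_pid_csv(documents, out):
    """Write a DOI,PID CSV from parsed Crossref JSON documents.

    Entries lacking "alternative-id" are skipped, like the jq filter does.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["DOI", "PID"])
    for doc in documents:
        message = doc["message"]
        if "alternative-id" in message:
            writer.writerow([message["DOI"], message["alternative-id"][0]])


# Hypothetical sample documents, only shaped like the Crossref output
sample = [
    {"message": {"DOI": "10.1000/a,b",
                 "alternative-id": ["S0000-00002018000100001"]}},
    {"message": {"DOI": "10.1000/no-pid"}},
]
buffer = io.StringIO()
write_doi_pid_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With real data, `documents` could come from `json.load` applied to each file in `crossref-by-doi/`.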

Loading the data:

In [5]:
import pandas as pd

In [6]:
doi_pid = pd.read_csv("crossref-pid-from-doi.csv",
                      dtype=str,
                      keep_default_na=False)
pd.concat([doi_pid.head(), doi_pid.tail()])

|       | DOI | PID |
|------:|-----|-----|
| 0 | 10.1590/0103-6440201801735 | S0103-64402018000100060 |
| 1 | 10.1016/j.bjan.2013.03.013 | S0034709413000640 |
| 2 | 10.1016/j.bjan.2013.03.019 | S0034709413000986 |
| 3 | 10.1016/j.bjan.2013.04.007 | S0034709413000974 |
| 4 | 10.1016/j.bjane.2012.06.011 | S0104001414000426 |
| 17374 | 10.5935/1808-8694.20140015 | S1808869414500158 |
| 17375 | 10.5935/1808-8694.20140016 | S180886941450016X |
| 17376 | 10.5935/1808-8694.20140017 | S1808869414500171 |
| 17377 | 10.5935/1808-8694.20140018 | S1808869414500183 |
| 17378 | 10.1590/0037-8682-0155-2015 | S0037-86822015000800001 |


The DOI values aren't normalized either: there's a single duplicate row in the data:

In [7]:
doi_pid.shape[0], doi_pid.drop_duplicates().shape[0]

(17379, 17378)

But the relationship is $1:1$:

In [8]:
doi_pid["DOI"].unique().shape[0], doi_pid["PID"].unique().shape[0]

(17378, 17378)
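
The counts work as a check here because both equal the number of distinct rows. A more direct test can be sketched in plain Python; `is_one_to_one` is a hypothetical helper, and the pairs below are just two rows from the table above plus made-up variations.

```python
from collections import defaultdict


def is_one_to_one(pairs):
    """Return True if no key maps to two values and no value to two keys."""
    by_left, by_right = defaultdict(set), defaultdict(set)
    for left, right in pairs:
        by_left[left].add(right)
        by_right[right].add(left)
    return (all(len(found) == 1 for found in by_left.values())
            and all(len(found) == 1 for found in by_right.values()))


pairs = [
    ("10.1590/0103-6440201801735", "S0103-64402018000100060"),
    ("10.1016/j.bjan.2013.03.013", "S0034709413000640"),
    # An exact duplicate row doesn't break the relationship
    ("10.1016/j.bjan.2013.03.013", "S0034709413000640"),
]
clean = is_one_to_one(pairs)
# Made-up extra row: a second PID for an already seen DOI
broken = is_one_to_one(pairs + [("10.1016/j.bjan.2013.03.013", "S0000")])
```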

The duplicate is probably due to some prefix,
like the ones in these non-normalized DOI values:

In [9]:
!ls crossref-by-doi/ | grep -v ^1

010.1590_0103-6440201801735.json
_10.1590_1414-431X20122453.json
Doi: 10.4025_actasciagron.v37i1.17686.json
_dx.doi.org_10.1590_0037-8682-0155-2015.json


## DOI value cleaning

The four values found from the files can also be found directly from the CSV:

In [10]:
dataset = pd.read_csv("inner_join_2018-06-04.csv",
                      dtype=str,
                      keep_default_na=False) \
            .drop_duplicates()

In [11]:
dataset[~dataset["article_doi"].str.startswith("1")]["article_doi"].unique().tolist()

['/dx.doi.org/10.1590/0037-8682-0155-2015',
 'Doi: 10.4025/actasciagron.v37i1.17686',
 '',
 'S0365-05962013000700001',
 '010.1590/0103-6440201801735',
 '/10.1590/1414-431X20122453']

The empty entry was expected
(Crossref wasn't queried for it, as it wouldn't make sense).
The `S`-prefixed DOI (perhaps a valid PID) looks invalid,
and wasn't found when querying the Crossref API.
However, there are thousands of DOIs that either weren't found
or have a different value in the Crossref API,
so we need to perform some kind of DOI normalization
before joining the dataframes:

In [12]:
len(dataset[~dataset["article_doi"].isin(doi_pid["DOI"])]["article_doi"].unique())

9344

First, let's see how many values in the CSV don't follow the
[DOI numbering specification](https://www.doi.org/doi_handbook/2_Numbering.html):

In [13]:
import re

In [14]:
udois = dataset["article_doi"].drop_duplicates()
invalid_dois = udois[~udois.apply(re.compile(r"10(\.\d+)+/.+$").match).apply(bool)]
invalid_dois

33746    /dx.doi.org/10.1590/0037-8682-0155-2015
40790      Doi: 10.4025/actasciagron.v37i1.17686
45760              1590/TEM-1980-542X2018v240107
45912               1590/1983-1447.2016.01.54105
63008                                           
63917                    S0365-05962013000700001
72612                010.1590/0103-6440201801735
81931                 /10.1590/1414-431X20122453
93953                       105028/jatm.v5i1.163
Name: article_doi, dtype: object

Only $8$ non-empty values are invalid, most of them because of prefix issues.
This regex should fix the invalid DOI values (except the PID one):

In [15]:
doi_regex = re.compile(r"(?:.*?10\.?|^)(\d+(?:.\d+)+/.+)$")
invalid_dois.apply(doi_regex.search).apply(lambda val: val and "10." + val.groups()[0])

33746         10.1590/0037-8682-0155-2015
40790    10.4025/actasciagron.v37i1.17686
45760    10.1590/TEM-1980-542X2018v240107
45912     10.1590/1983-1447.2016.01.54105
63008                                None
63917                                None
72612          10.1590/0103-6440201801735
81931           10.1590/1414-431X20122453
93953               10.5028/jatm.v5i1.163
Name: article_doi, dtype: object

But that's a rather small number of values.
How about the $9344$ distinct values
that didn't match any DOI in the DOI to PID dataframe?
Let's compare each Crossref DOI with its file name:
how many files don't match?

In [16]:
!for f in crossref-by-doi/*.json ; do \
  doi="$(jq -r .message.DOI "$f")" ; \
  if [ crossref-by-doi/"${doi//\//_}".json != "$f" ] ; then \
    echo 1 ; \
  fi ; \
done | wc -l

6491


Fewer than $9$ thousand, but that's still too many.
Let's see the first ones not matching
(each row has a DOI from Crossref
 and the file name created by replacing the `/` by `_`):

In [17]:
!i=0; \
for f in crossref-by-doi/*.json ; do \
  doi="$(jq -r .message.DOI "$f")" ; \
  if [ crossref-by-doi/"${doi//\//_}".json != "$f" ] ; then \
    echo "$doi" "$f" ; \
    i=$[$i + 1] ; \
  fi ; \
  test $i -ge 5 && break ; \
done

10.1590/0103-6440201801735 crossref-by-doi/010.1590_0103-6440201801735.json
10.11606/issn.2316-901x.v0i61p103-121 crossref-by-doi/10.11606_issn.2316-901X.v0i61p103-121.json
10.11606/issn.2316-901x.v0i61p122-139 crossref-by-doi/10.11606_issn.2316-901X.v0i61p122-139.json
10.11606/issn.2316-901x.v0i61p140-158 crossref-by-doi/10.11606_issn.2316-901X.v0i61p140-158.json
10.11606/issn.2316-901x.v0i61p14-17 crossref-by-doi/10.11606_issn.2316-901X.v0i61p14-17.json


These are distinct because of the `X` case:
Crossref seems to output it in lower case.
Putting the file name in lowercase for the comparison yields:

In [18]:
!for f in crossref-by-doi/*.json ; do \
  doi="$(jq -r .message.DOI "$f")" ; \
  if [ crossref-by-doi/"${doi//\//_}".json != "${f,,}" ] ; then \
    echo "$doi" "$f" ; \
  fi ; \
done

10.1590/0103-6440201801735 crossref-by-doi/010.1590_0103-6440201801735.json
10.1590/1414-431x20122453 crossref-by-doi/_10.1590_1414-431X20122453.json
10.4025/actasciagron.v37i1.17686 crossref-by-doi/Doi: 10.4025_actasciagron.v37i1.17686.json
10.1590/0037-8682-0155-2015 crossref-by-doi/_dx.doi.org_10.1590_0037-8682-0155-2015.json
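
The comparison made by the shell loops can be sketched in Python as well. The helper names `expected_name` and `name_matches` are hypothetical, and unlike the shell version this sketch lowercases both sides, not just the file name.

```python
def expected_name(doi):
    """File name the DOI would get by replacing "/" with "_"."""
    return doi.replace("/", "_") + ".json"


def name_matches(doi, file_name, ignore_case=True):
    """Tell whether a file name is the one expected for the given DOI."""
    expected = expected_name(doi)
    if ignore_case:
        return expected.lower() == file_name.lower()
    return expected == file_name


# Pairs taken from the outputs above (directory prefix removed)
case_only = ("10.11606/issn.2316-901x.v0i61p103-121",
             "10.11606_issn.2316-901X.v0i61p103-121.json")
bad_prefix = ("10.1590/0103-6440201801735",
              "010.1590_0103-6440201801735.json")

strict = name_matches(*case_only, ignore_case=False)
relaxed = name_matches(*case_only)
prefixed = name_matches(*bad_prefix)
```

Only the prefix problem survives the case-insensitive comparison, which is what the second shell loop shows.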


Only $4$ not matching. Doing the same with the dataframes:

In [19]:
lower_doi = dataset["article_doi"].str.lower()
dataset[~lower_doi.isin(doi_pid["DOI"])]["article_doi"].drop_duplicates()

858      10.4301/S1807-17752015000100005
861      10.4301/S1807-17752015000100006
863      10.4301/S1807-17752015000100008
865      10.4301/S1807-17752015000100002
868      10.4301/S1807-17752015000100001
869      10.4301/S1807-17752015000100009
872      10.4301/S1807-17752015000100007
875      10.4301/S1807-17752015000100004
876      10.4301/S1807-17752015000100003
879      10.4301/S1807-17752017000100002
880      10.4301/S1807-17752017000100001
882      10.4301/S1807-17752017000100005
886      10.4301/S1807-17752017000100006
888      10.4301/S1807-17752017000100004
891      10.4301/S1807-17752017000100003
892      10.4301/S1807-17752014000100007
894      10.4301/S1807-17752014000100010
896      10.4301/S1807-17752014000100008
898      10.4301/S1807-17752014000100002
901      10.4301/S1807-17752014000100009
903      10.4301/S1807-17752014000100011
908      10.4301/S1807-17752014000100003
910      10.4301/S1807-17752014000100012
912      10.4301/S1807-17752014000100006
915      10.4301

And that makes sense: little more than $3$ thousand distinct DOIs remain unpaired,
though most of them have a Crossref entry:

In [20]:
3175 + 17378 # Not matched + "doi_pid" size == distinct DOIs

20553

$2985$ out of $3175$ are Crossref entries without the PID:

In [21]:
!cat crossref-by-doi/*.json \
  | jq -r '.message | select(has("alternative-id") | not) | .DOI' \
  | wc -l

2985


Several entries don't have a PID since SciELO's DOI prefix is `10.1590`,
and entries with other DOI prefixes aren't submitted by SciELO
but probably by the publisher who owns that prefix.
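
One way to check that hypothesis would be counting the DOI prefixes among the entries without a PID. The sketch below assumes each entry is the `message` part of a Crossref JSON file; `prefixes_without_pid` is a hypothetical helper and the sample entries are made up.

```python
from collections import Counter


def prefixes_without_pid(messages):
    """Count DOI prefixes among entries lacking "alternative-id"."""
    return Counter(
        message["DOI"].split("/", 1)[0]
        for message in messages
        if "alternative-id" not in message
    )


# Hypothetical entries, only shaped like the "message" of a Crossref file
sample = [
    {"DOI": "10.1590/example-1",
     "alternative-id": ["S0000-00002018000100001"]},
    {"DOI": "10.4301/example-2"},
    {"DOI": "10.4301/example-3"},
    {"DOI": "10.1590/example-4"},
]
counts = prefixes_without_pid(sample)
```

If the hypothesis holds, `10.1590` should be rare in the resulting counter for the real data.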

Nevertheless, we can fix the few DOI values that aren't normalized.
Let's patch the dataset with a new "fixed_doi" column:

In [22]:
dataset["fixed_doi"] = dataset["article_doi"] \
                      .apply(doi_regex.search) \
                      .apply(lambda val: val and "10." + val.groups()[0]) \
                      .str.lower()
dataset[["article_doi", "fixed_doi"]] \
    [dataset["article_doi"].str.lower() != dataset["fixed_doi"]] \
    .drop_duplicates()

|       | article_doi | fixed_doi |
|------:|-------------|-----------|
| 33746 | /dx.doi.org/10.1590/0037-8682-0155-2015 | 10.1590/0037-8682-0155-2015 |
| 40790 | Doi: 10.4025/actasciagron.v37i1.17686 | 10.4025/actasciagron.v37i1.17686 |
| 45760 | 1590/TEM-1980-542X2018v240107 | 10.1590/tem-1980-542x2018v240107 |
| 45912 | 1590/1983-1447.2016.01.54105 | 10.1590/1983-1447.2016.01.54105 |
| 63008 |  |  |
| 63917 | S0365-05962013000700001 |  |
| 72612 | 010.1590/0103-6440201801735 | 10.1590/0103-6440201801735 |
| 81931 | /10.1590/1414-431X20122453 | 10.1590/1414-431x20122453 |
| 93953 | 105028/jatm.v5i1.163 | 10.5028/jatm.v5i1.163 |


One can append `or ""` to the lambda
in order to replace the `None` by empty strings.
This `(?:.*?10\.?|^)(\d+(?:.\d+)+/.+)$` regex should be seen
as an overfit to this data:
it assumes there are no suffix errors,
and that a `10`-prefixed value without a dot
had just missed the dot.
However, since it also matches valid DOIs,
it should remain useful for DOI value normalization
even for data obtained from elsewhere.
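
As a standalone sketch, the same normalization can be wrapped in a plain function that doesn't depend on pandas. The name `normalize_doi` is hypothetical; the regex is the one above, and the function returns an empty string when nothing looks like a DOI.

```python
import re

DOI_REGEX = re.compile(r"(?:.*?10\.?|^)(\d+(?:.\d+)+/.+)$")


def normalize_doi(raw):
    """Return a lowercase "10."-prefixed DOI, or "" if none is found."""
    match = DOI_REGEX.search(raw)
    return "10." + match.group(1).lower() if match else ""


# Raw values taken from the dataset rows shown above
fixed = {raw: normalize_doi(raw) for raw in [
    "/dx.doi.org/10.1590/0037-8682-0155-2015",
    "1590/TEM-1980-542X2018v240107",
    "010.1590/0103-6440201801735",
    "105028/jatm.v5i1.163",
    "S0365-05962013000700001",
    "",
]}
```

Note that the unescaped `.` inside the group is what lets `105028/...` and `1590/...` match, so escaping it would change the behavior.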

This procedure improves the matching by a few entries:

In [23]:
dataset[~dataset["fixed_doi"].isin(doi_pid["DOI"])]["fixed_doi"].drop_duplicates().shape[0]

3171

Just for consistency's sake,
that value can also be found as ($\#$ denotes "number of"):

$$
\# not(matched) = \# distinct(article\_doi) - \# Crossref\_files_{total} + \# Crossref\_files_{without\_PID}
$$

In [24]:
20550 - 20364 + 2985

3171
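
That identity can be illustrated with sets. This is a toy sketch with made-up values, assuming that every Crossref file corresponds to one distinct DOI of the dataset and that the entries without PID are a subset of the Crossref files; `count_not_matched` is a hypothetical helper.

```python
def count_not_matched(distinct_dois, crossref_dois, crossref_without_pid):
    """Count the DOIs that have no DOI/PID pair from Crossref."""
    matched = crossref_dois - crossref_without_pid
    return len(distinct_dois - matched)


# Toy data: "e" has no Crossref file, "c" and "d" have files without PID
distinct = {"a", "b", "c", "d", "e"}
crossref = {"a", "b", "c", "d"}
without_pid = {"c", "d"}

direct = count_not_matched(distinct, crossref, without_pid)
by_formula = len(distinct) - len(crossref) + len(without_pid)
```

Both ways of counting agree only while the subset assumptions hold.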

## Comparing raw XML PID with the Crossref PID

Having the `fixed_doi` column makes a matching join quite easy:

In [25]:
merged2 = pd.merge(dataset, doi_pid, how="left", left_on="fixed_doi", right_on="DOI")
merged2.head(3).T

|  | 0 | 1 | 2 |
|--|---|---|---|
| addr_city |  |  |  |
| addr_country | Brazil | Brazil | Brazil |
| addr_country_code | BR | BR | BR |
| addr_postal_code |  |  |  |
| addr_state |  |  |  |
| aff_email | ligiamorimadeira@gmail.com | mksilva@ufrgs.br | bianca.or@gmail.com |
| aff_id | aff1 | aff1 | aff2 |
| aff_text | * É doutora em Sociologia pela Universidade Fe... | * É professor do Departamento de Sociologia da... | ** É professora do Instituto Federal Sul-rio-g... |
| article_doi | 10.1590/0103-335220162102 | 10.1590/0103-335220162106 | 10.1590/0103-335220162106 |
| article_publisher_id |  |  |  |


However, there are way too many rows with distinct PIDs for a single DOI:

In [26]:
merged2[["article_publisher_id", "PID"]] \
    [(merged2["article_publisher_id"] != "") &
    (merged2["article_publisher_id"] != merged2["PID"])] \
    .drop_duplicates()

|     | article_publisher_id | PID |
|----:|----------------------|-----|
| 429 | S2237-26602017005001103 | S2237-26602018000100117 |
| 433 | S2237-26602017005001102 | S2237-26602018000100167 |
| 436 | S2237-26602017005001101 | S2237-26602018000100151 |
| 733 | 1982-451320160109 | S1982-45132016000100131 |
| 738 | 1982-451320160101 | S1982-45132016000100009 |
| 741 | 1982-451320160105 | S1982-45132016000100067 |
| 743 | 1982-451320160103 | S1982-45132016000100039 |
| 747 | 1982-451320160106 | S1982-45132016000100083 |
| 751 | 1982-451320160108 | S1982-45132016000100117 |
| 755 | 1982-451320160107 | S1982-45132016000100095 |


That's most of the data we have:

In [27]:
dataset["article_publisher_id"].unique().size - 1 # -1 to remove the empty

5598

Fewer than $400$ matched:

In [28]:
merged2[["article_publisher_id", "PID"]] \
    [merged2["article_publisher_id"] == merged2["PID"]] \
    .drop_duplicates() \
    .shape[0]

398

Since $5598 - 398 = 5200 < 5218$, the PID relationship isn't $1:1$.
Some of these PID values are from older "ahead of time" registries
of the very same article, but that's not the only reason.

The PID itself seems to require some normalization.
Perhaps the value from Crossref is the last/frozen one,
but it's hard to be confident about that.
Is it missing the leading `S`?
Does it have an extra `.` from some older PID format?
I'm not sure, but reading the raw XML (the ground truth source)
gives us information whose relationships
are hard to establish with confidence.

Trying to normalize it by adding the `S` prefix and removing any dots doesn't help:

In [29]:
pid_regex = re.compile(r"[sS]?(\d{4})-?(\d{3}[0-9xX].*)$")
merged2["alt_pid"] = merged2["article_publisher_id"] \
    .drop_duplicates() \
    .apply(pid_regex.search) \
    .apply(lambda val: val and "S{}-{}".format(*val.groups()) or "") \
    .str.replace(".", "", regex=False) \
    .str.upper()
merged2[["alt_pid", "PID"]] \
    [merged2["alt_pid"].apply(bool) & (merged2["alt_pid"] == merged2["PID"])] \
    .drop_duplicates().shape[0]

398
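
To see why this doesn't help, the normalization can be tried on single values. The sketch below wraps the same regex in a hypothetical `normalize_pid` function, applied to raw values shown earlier in this notebook.

```python
import re

PID_REGEX = re.compile(r"[sS]?(\d{4})-?(\d{3}[0-9xX].*)$")


def normalize_pid(raw):
    """Add the "S" prefix and the dash, remove dots, and uppercase."""
    match = PID_REGEX.search(raw)
    if not match:
        return ""
    return "S{}-{}".format(*match.groups()).replace(".", "").upper()


from_xml = normalize_pid("1982-451320160109")
with_dot = normalize_pid("1807-57622016.0103")
empty = normalize_pid("")
```

The first value becomes `S1982-451320160109`, which is still not the `S1982-45132016000100131` given by Crossref for the same row: the difference goes beyond the prefix and the punctuation.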

## Finding the current PID and the DOI from the older PID

The `1807-57622016.0103` format isn't accepted by SciELO's "XML debug" API:

In [30]:
!curl -I 'http://www.scielo.br/scielo.php?script=sci_arttext&pid=1807-57622016.0103&lng=en&nrm=iso&tlng=pt&debug=xml'

HTTP/1.1 404 Not Found - Archive Empty
Date: Mon, 23 Jul 2018 20:06:12 GMT
Pragma: no-cache
Content-Length: 886
Content-Type: text/html; charset=UTF-8
cache-control: max-age=900
magicmarker: 1
Server: nginx
X-Varnish: 554180725 550549535
Age: 2816
Via: 1.1 varnish-v4
Connection: keep-alive



But using its matching `S1414-32832017000200349`
(found from DOI), we get:

In [31]:
!curl -I 'http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1414-32832017000200349&lng=en&nrm=iso&tlng=pt&debug=xml'

HTTP/1.1 200 OK
Date: Mon, 23 Jul 2018 20:06:46 GMT
Pragma: no-cache
Content-Type: text/xml; charset=utf-8
cache-control: max-age=900
magicmarker: 1
Server: nginx
X-Varnish: 558247970 550518563
Age: 2782
Via: 1.1 varnish-v4
Content-Length: 22847
Connection: keep-alive



Let's get the XML content with `urllib`:

In [32]:
from urllib.error import HTTPError
from urllib.request import urlopen, Request
from urllib.parse import urlencode
from lxml import etree

In [33]:
def get_scielo_php(pid):
    """Return the parsed "debug=xml" output for a PID, or None on HTTP errors."""
    url = "http://www.scielo.br/scielo.php?" + urlencode({
        "script": "sci_arttext",
        "pid": pid,
        "lng": "en",
        "nrm": "iso",
        "tlng": "pt",
        "debug": "xml"
    })
    try:
        resp = urlopen(url)
    except HTTPError:  # urlopen raises on 4xx/5xx (e.g. the 404 above)
        return None
    return etree.parse(resp)

In [34]:
php_xml_output = get_scielo_php("S1414-32832017000200349")
dict(php_xml_output.xpath("//ARTICLE")[0].attrib)

{'TEXTLANG': 'pt',
 'ORIGINALLANG': 'pt',
 'FPAGE': '349',
 'LPAGE': '361',
 'PID': 'S1414-32832017000200349',
 'DOCTOPIC': 'oa',
 'DOCTYPE': 'article',
 'RELATED': '',
 'CITED': '',
 'PROJFAPESP': '',
 'CLINICALTRIALS': ' 0',
 'AREASGEO': '',
 'PROCESSDATE': '20170317',
 'CURR_DATE': '20180723',
 'ahpdate': '20161103',
 'DOI': '10.1590/1807-57622016.0103',
 'oldpid': 'S1414-32832016005024105',
 'PDF': '1'}

That element alone gives us all the information regarding the PID and DOI,
including the previous PID value.
Also, we can query using the old PID value, which returns the same record:

In [35]:
old_pid_php_xml_output = get_scielo_php("S1414-32832016005024105")
dict(old_pid_php_xml_output.xpath("//ARTICLE")[0].attrib)

{'TEXTLANG': 'pt',
 'ORIGINALLANG': 'pt',
 'FPAGE': '349',
 'LPAGE': '361',
 'PID': 'S1414-32832017000200349',
 'DOCTOPIC': 'oa',
 'DOCTYPE': 'article',
 'RELATED': '',
 'CITED': '',
 'PROJFAPESP': '',
 'CLINICALTRIALS': ' 0',
 'AREASGEO': '',
 'PROCESSDATE': '20170317',
 'CURR_DATE': '20180723',
 'ahpdate': '20161103',
 'DOI': '10.1590/1807-57622016.0103',
 'oldpid': 'S1414-32832016005024105',
 'PDF': '1'}
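
Getting just the three identifiers from such an output can be sketched as follows. This is an assumption-laden illustration: it uses the standard library `xml.etree.ElementTree` instead of `lxml`, the `pid_info` helper is hypothetical, and `SAMPLE_XML` is a made-up minimal fragment (the `ROOT` element name is invented) carrying only the attributes shown above.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with the attribute names seen in the real output
SAMPLE_XML = """
<ROOT>
  <ARTICLE PID="S1414-32832017000200349"
           DOI="10.1590/1807-57622016.0103"
           oldpid="S1414-32832016005024105"/>
</ROOT>
"""


def pid_info(xml_text):
    """Return the PID, DOI and old PID of the first ARTICLE element."""
    root = ET.fromstring(xml_text)
    article = next(root.iter("ARTICLE"), None)
    if article is None:
        return None
    return {
        "pid": article.get("PID"),
        "doi": article.get("DOI"),
        "old_pid": article.get("oldpid") or None,  # None if absent or empty
    }


info = pid_info(SAMPLE_XML)
missing = pid_info("<ROOT/>")
```

A table built from these triples would map both the current and the old PID to the same DOI.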