# Cleaning / Normalizing the ISSN

*This is an analysis of the full SciELO network's
`journals.csv` report/spreadsheet/dataset
as it was in its 2018-06-10 release;
future versions will hopefully have pre-normalized ISSN fields.*

Some journals have more than one ISSN,
since every medium (electronic/print/CD/etc.) has its own ISSN.
Moreover, the ISSN recorded for the same journal might differ between collections.
We should find a way to normalize them,
in order to know when two entries refer to the same journal.

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400

In [2]:
journals = pd.read_csv("tabs_network/journals.csv")

There are two columns regarding ISSN:

In [3]:
[col for col in journals.columns if "ISSN" in col.upper()]

['ISSN SciELO', "ISSN's"]

The first, `ISSN SciELO`, holds the ISSN selected
to act as something akin to a primary key,
whereas `ISSN's` holds a list of ISSNs
regarding the same journal content,
written as a single string
where the ISSNs are separated by a `;` (semicolon) symbol.

## Detecting grossly invalid ISSNs

The format of an ISSN is `NNNN-NNNC`,
where `N` is a digit (from $0$ to $9$)
and `C` is a check "digit" (from $0$ to $9$ or $X$).
Is there any ISSN in the `ISSN SciELO` column
that doesn't conform to that?

In [4]:
single_issn_regex = r"^\d{4}-\d{3}[\dX]$"
journals[["ISSN SciELO"]][~journals["ISSN SciELO"].str.contains(single_issn_regex)]

Unnamed: 0,ISSN SciELO
1410,0719-448x


It's not invalid,
but we should always use the same letter case
in order to work with the ISSN as a *matching index* or *primary key*.
A proper normalization would use something like
`journals["ISSN SciELO"].str.upper()`.

How about the `ISSN's` column?

In [5]:
multi_issn_regex = r"^(?:\d{4}-\d{3}[\dX])(?:;\d{4}-\d{3}[\dX])*$"
journals[["ISSN's"]][~journals["ISSN's"].fillna("").str.contains(multi_issn_regex)]

Unnamed: 0,ISSN's
102,
103,
660,ISSN;0252-8584
1410,0719-448x;0718-0446
1696,20030507;1315-6411


Besides the `x` case issue and the two empty `ISSN's` fields,
the date-like `20030507` and the `ISSN` text are invalid ISSN values,
the latter being the only grossly invalid entry found.
The date-like one was caught here because it lacks the `-`,
but it's also invalid due to its last digit,
which should have been `9` in order to get a valid ISSN value,
as discussed in the next section.
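
The table above flags whole rows, not the offending token inside each list.
A minimal sketch of a per-token check in plain Python follows;
`malformed_tokens` is a hypothetical helper (not part of the notebook)
that reuses the single-ISSN pattern on each `;`-separated piece:

```python
import re

# Same shape as the single-ISSN pattern used in the notebook: NNNN-NNNC
SINGLE_ISSN = re.compile(r"\d{4}-\d{3}[\dX]")

def malformed_tokens(cell):
    """Return the ';'-separated tokens that are not shaped like NNNN-NNNC."""
    if not cell:  # empty or missing field: nothing to split
        return []
    return [token for token in cell.split(";")
            if not SINGLE_ISSN.fullmatch(token)]

# The three non-empty cells flagged in the output above
for cell in ["ISSN;0252-8584", "0719-448x;0718-0446", "20030507;1315-6411"]:
    print(cell, "->", malformed_tokens(cell))
```

This only checks the shape of each token; the check digit is a separate matter.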

We can clean these issues
by filling the `NaN` with the `ISSN SciELO` value from the same row,
by converting to uppercase to get rid of the single lowercase `x`,
and by using a mapping to remove the undesired values.
Before normalizing it all,
let's check whether there are other invalid check digits.

## ISSN check digit

### Equation

The check digit comes from a modulo $11$ calculation,
and the equation to get it from the first $7$ ISSN digits is
(where $X$ means this equation yields $10$):

$$
S = \text{ISSN7} \cdot [8, 7, 6, 5, 4, 3, 2] \\
\text{check digit} = 11 \left\lceil \frac{S}{11} \right\rceil - S
$$

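The check digit can be obtained from the remainder of the $^{S}/_{11}$ division:
if it's zero, the check digit is zero, else the check digit is $11 - \text{remainder}$.
Proof (the ceiling and the floor of $^{S}/_{11}$ are equal when the remainder is zero,
and differ by exactly $1$ otherwise):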

$$
\begin{array}{rcl}
 S &=& 11 \times \text{integer quotient} + \text{remainder} \\
{} &=& 11 \left\lfloor \frac{S}{11} \right\rfloor + \text{remainder} \\
{} &=& 11 \left\lceil \frac{S}{11} \right\rceil - \text{check digit} \\
\end{array}
\\
\therefore\quad
\text{check digit} =
11 \left(
  \left\lceil \frac{S}{11} \right\rceil -
  \left\lfloor \frac{S}{11} \right\rfloor
\right) - \text{remainder}
$$
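
The equivalence between the ceiling form and the remainder rule
can also be checked by brute force over every possible $S$.
The sketch below is only an illustration;
the names `check_value_ceil` and `check_value_remainder` are hypothetical,
and a value of $10$ stands for the `X` check digit:

```python
WEIGHTS = [8, 7, 6, 5, 4, 3, 2]

def check_value_ceil(s):
    """Check value from the closed form 11 * ceil(S / 11) - S."""
    ceil_quotient = -(-s // 11)  # integer ceiling, avoids floating point
    return 11 * ceil_quotient - s

def check_value_remainder(s):
    """Check value from the remainder rule: 0 if it divides, else 11 - remainder."""
    remainder = s % 11
    return 0 if remainder == 0 else 11 - remainder

# S is largest when all seven digits are 9
max_s = 9 * sum(WEIGHTS)
assert all(check_value_ceil(s) == check_value_remainder(s)
           for s in range(max_s + 1))
```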

### Example

For example, `0103-6564` (regarding the *Psicologia USP* journal)
is a valid ISSN, since the dot product $S$
between its first $7$ digits and $[8, 7, 6, 5, 4, 3, 2]$ is:

$$
\begin{array}{rccccccccl}
  \text{ISSN:} &  0 &  1 &  0 &  3 & - &  6 &  5 &  6 & (4) \\
        \times &  8 &  7 &  6 &  5 &   &  4 &  3 &  2 \\ \hline
      S = \sum & \{ 0,&  7,&  0,& 15,&   & 24,& 15,& 12 \} & = 73\\
\end{array}
$$

The remainder is $7$ and $11 - 7 = 4$, the check digit:

$$
73 = 11 \cdot 6 + {\underset{\uparrow}{7}} =
     11 \cdot 7 - {\underset{\uparrow}{4}}
$$


### ISSN digit checker function

In [6]:
def issn_digit(issn7):
    issn7_int = map(int, issn7)
    dp_pairs = zip(issn7_int, [8, 7, 6, 5, 4, 3, 2])
    dot_product = sum(a * b for a, b in dp_pairs)
    rem_compl = (-dot_product) % 11
    return "X" if rem_compl == 10 else str(rem_compl)

In [7]:
def check_issn_digit(issn):
    issn_clean = issn.replace("-", "").strip().upper()
    return len(issn_clean) == 8 \
       and issn_clean[-1] == issn_digit(issn_clean[:7])

In [8]:
def issn_full2digit(issn):
    return issn_digit(issn.replace("-", "").strip()[:7])

In [9]:
issn_digit("0103656") # The "ISSN7" input shouldn't include the "-"

'4'

In [10]:
check_issn_digit("0103-6564") # But here "-" is optional

True

In [11]:
issn_full2digit("2003-0507") # And here, for convenience!

'9'

In [12]:
check_issn_digit("20030507") # That's the invalid ISSN previously obtained

False

In [13]:
issn_digit("2003050") # Its digit should have been 9 (as we've already seen)

'9'

### Validating the ISSN digits in the `tabs_network/journals.csv` dataset

The ISSNs with invalid digits from the `ISSN SciELO` column are:

In [14]:
icd_issn_scielo = journals[~journals["ISSN SciELO"].apply(check_issn_digit)]
icd_issn_scielo[["title at SciELO", "ISSN's", "ISSN SciELO"]] \
    .assign(digit=icd_issn_scielo["ISSN SciELO"].apply(issn_full2digit))

Unnamed: 0,title at SciELO,ISSN's,ISSN SciELO,digit
506,Ajayu Órgano de Difusión Científica del Departamento de Psicología UCBSP,2077-2161,2077-2161,5
517,Acta Nova,1683-0789,1683-0789,4
956,Acta Médica Costarricense,0001-6012;0001-6002,0001-6002,4
1285,Revista Diacrítica,0807-8967,0807-8967,3
1647,Revista Uruguaya de Medicina Interna,2993-6797,2993-6797,9
1694,Utopìa y Praxis Latinoamericana,1315-5216,1315-5216,0


The only one we can easily fix is the `0001-6002`,
since its alternative in the `ISSN's` list is valid
and is quite explicit in the
[Acta medica costarricense's web site](http://www.actamedica.medicos.cr),
besides being the only one there.

In [15]:
issn_full2digit("2077-2161")

'5'

In [16]:
check_issn_digit("0001-6012")

True

Fixing the remaining ones might be way more difficult than it seems.
The [Ajayu's web site](http://www.ucb.edu.bo/publicaciones/ajayu)
gives us that very same ISSN: `2077-2161`.
It seems that either the digit checking algorithm
isn't taken into account for every assigned/granted ISSN,
or there's some specific historical issue,
like an assignment happening before that calculation was standardized,
or some human mistake when performing the assignment.
Or that's simply a mistake in the journal home page
that has been copied to the database.
Whatever the reason,
we should keep some inconsistent data as is
for the time being,
at least until someone fixes or confirms that information.

A similar analysis of the entries from the `ISSN's` column:

In [17]:
journals[["title at SciELO", "ISSN's", "ISSN SciELO"]] \
        [journals["ISSN's"].fillna("").str.split(";")
                           .apply(lambda issns: not all(check_issn_digit(issn)
                                                        for issn in issns))]

Unnamed: 0,title at SciELO,ISSN's,ISSN SciELO
102,Revista Brasileira de Engenharia Biomédica,,1517-3151
103,Revista Brasileira de Coloproctologia,,0101-9880
409,SaberEs,1852-4418;1852-4222,1852-4222
500,Salud(i)ciencia,1667-8682;1667-8990,1667-8990
506,Ajayu Órgano de Difusión Científica del Departamento de Psicología UCBSP,2077-2161,2077-2161
517,Acta Nova,1683-0789,1683-0789
660,Economía y Desarrollo,ISSN;0252-8584,0252-8584
956,Acta Médica Costarricense,0001-6012;0001-6002,0001-6002
957,Actualidades en Psicología,0858-6444;2215-3535,2215-3535
1285,Revista Diacrítica,0807-8967,0807-8967


Some ISSNs there are valid:

In [18]:
all(check_issn_digit(issn)
    for issn in ["0252-8584", "1315-6411", "1667-8990", "1729-4827",
                 "1852-4222", "2175-6104", "2215-3535"])

True

### Finding the correct ISSN for these few journals

From [SaberEs's web page](http://saberes.fcecon.unr.edu.ar/index.php/revista),
we find `1852-4418` should have been `1852-4184`.
Likewise, from [Liberabit's web page](http://revistaliberabit.com),
we find `2233-7666` has a typo: it should be `2223-7666`.
A similar typo is `0858-6444`, which should have been `0258-6444`,
as it's written in the
[Actualidades en Psicología's web page](https://revistas.ucr.ac.cr/index.php/actualidades).
The `1667-8682` should have been `1667-8982`, as
[this PDF of a Salud(i)ciencia article](https://www.ris.uu.nl/ws/files/41145926/sic_176_1.pdf)
suggests and [its SJR entry](https://www.scimagojr.com/journalsearch.php?q=4100151617&tip=sid)
seems to confirm.
[Revista Uruguaya de Medicina Interna](http://www.medicinainterna.org.uy/revista-medicina-interna)
on
[No.3/Nov2017](http://www.medicinainterna.org.uy/wp-content/uploads/2016/06/RumiNo3_Nov_2017Ch.pdf)
tells us the ISSN is `2393-6797`, not `2993-6797`.
*Utopia y Praxis Latinoamericana* appears on
[SJR](https://www.scimagojr.com/journalsearch.php?q=5700164382&tip=sid)
with two ISSNs: `1316-5216` and `2477-9555`.
[Acta Nova](https://www.ucbcba.edu.bo/universidad/publicaciones/revistas-2/acta-nova)'s
printed version ISSN is `1683-0768`, not `1683-0789`.
[Revista Diacrítica](http://diacritica.ilch.uminho.pt) on
[26/2-2012](http://ceh.ilch.uminho.pt/publicacoes/Diacritica_26-2.pdf)
wrote `0807-8967` as its ISSN, but that seems like a typo,
as in its page the ISSN is explicitly written as
`0870-8967 (printed version); 2183-9174 (electronic version)`.
There's no information in [A Peste's web page](http://revistas.pucsp.br/apeste)
regarding a printed version ISSN,
but `1775-1851` appears there
in the description of the cover image:
*The Fifth Plague of Egypt*
by *Joseph Mallord William Turner (1775-1851)*;
[his Wikipedia page](https://pt.wikipedia.org/wiki/William_Turner)
states that those are the years of his birth and death, so it's not an ISSN.
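
Several of the typos above are a single wrong digit
or two adjacent digits swapped.
As a hedged sketch (not what was done in this analysis,
where every correction came from the journals' own pages),
one could shortlist candidates by listing the valid ISSNs
that are one such edit away from an invalid one.
The names `is_valid` and `near_valid_issns` are hypothetical:

```python
WEIGHTS = [8, 7, 6, 5, 4, 3, 2]

def is_valid(issn8):
    """Check an 8-character ISSN (no hyphen) against the modulo 11 rule."""
    if len(issn8) != 8 or not issn8[:7].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(issn8, WEIGHTS))
    expected = (-total) % 11
    return issn8[7] == ("X" if expected == 10 else str(expected))

def near_valid_issns(issn):
    """Valid ISSNs one substitution or one adjacent swap away from `issn`."""
    raw = issn.replace("-", "").upper()
    candidates = set()
    for i in range(len(raw)):
        # Substitute one character (X is only allowed in the last position)
        for ch in "0123456789" + ("X" if i == 7 else ""):
            candidates.add(raw[:i] + ch + raw[i + 1:])
        # Swap two adjacent characters
        if i + 1 < len(raw):
            candidates.add(raw[:i] + raw[i + 1] + raw[i] + raw[i + 2:])
    candidates.discard(raw)
    return sorted(c[:4] + "-" + c[4:] for c in candidates if is_valid(c))

print(near_valid_issns("0807-8967"))  # the list includes 0870-8967
```

Such a list is only a set of hints to confirm against a trusted source:
it contains several unrelated candidates,
and corrections that need more than one edit
(like `1852-4418` to `1852-4184`) are out of its reach.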

All these new ISSNs found have a valid check digit:

In [19]:
all(check_issn_digit(issn)
    for issn in ["0258-6444", "0870-8967", "1316-5216", "1667-8982",
                 "1683-0768", "1852-4184", "2183-9174", "2223-7666",
                 "2393-6797", "2477-9555"])

True

Among the remaining entries,
the only invalid ISSN we couldn't fix
was the one belonging to `Ajayu`.
There's no evidence that its ISSN could be different
besides the inconsistency regarding the check digit, and a
[single article](https://www.scribd.com/document/152839301/Ruptura-Amorosa-y-Terapia-Narrativa)
that had written `2011-2161` as the ISSN,
but that alternative would still need to have $5$ as its check digit
(i.e., it's also invalid),
and that's not a trusted source of information.

In [20]:
issn_full2digit("2011-2161")

'5'

A summary of what should be done regarding these selected ISSNs:

In [21]:
issns_fix = { # To replace all entries in ISSN SciELO and ISSN's
    "0001-6002": "0001-6012", # Acta Médica Costarricense
    "0858-6444": "0258-6444", # Actualidades en Psicología
    "1667-8682": "1667-8982", # Salud(i)ciencia
    "1852-4418": "1852-4184", # SaberEs
    "2233-7666": "2223-7666", # Liberabit
    "0807-8967": "0870-8967", # Revista Diacrítica
    "2993-6797": "2393-6797", # Revista Uruguaya de Medicina Interna
    "1315-5216": "1316-5216", # Utopia y Praxis Latinoamericana
    "1683-0789": "1683-0768", # Acta Nova
    "0719-448x": "0719-448X",
}
extra_issns = { # To add as alternative ISSN's
    "0870-8967": "2183-9174", # Revista Diacrítica
    "1316-5216": "2477-9555", # Utopia y Praxis Latinoamericana
}
invalid_issns = [ # To remove from ISSN's
    "ISSN",      # Economía y Desarrollo
    "20030507",  # Revista Venezolana de Economía y Ciencias Sociales
    "1775-1851", # A Peste : Revista de Psicanálise e Sociedade
]

And the `ISSN's` should always include the `ISSN SciELO` value.
Let's do that!

In [22]:
issn_scielo = journals["ISSN SciELO"].str.upper().replace(issns_fix)
issn_scielo.tail() # `ISSN SciELO` solving every issue found so far

1716    1012-2508
1717    0254-0770
1718    1316-0087
1719    1317-5815
1720    0367-4762
Name: ISSN SciELO, dtype: object

In [23]:
digitfix_issns = {k: {v, extra_issns[v]} if v in extra_issns else {v}
                  for k, v in issns_fix.items()}
issns_set = journals["ISSN's"] \
    .fillna(issn_scielo) \
    .str.upper() \
    .str.split(";") \
    .apply(lambda items: set.union(*[digitfix_issns.get(item, {item})
                                     for item in items
                                     if item not in invalid_issns]))
issns_set.tail() # `ISSN's` as a set, solving every issue found so far

1716    {2443-468X, 1012-2508}
1717               {0254-0770}
1718               {1316-0087}
1719               {1317-5815}
1720               {0367-4762}
Name: ISSN's, dtype: object
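
The sets above are built from the `ISSN's` column alone,
with `ISSN SciELO` used only where that column was empty.
If every set has to contain its row's primary ISSN,
a union makes that explicit.
A minimal sketch in plain Python, where `with_primary` is a hypothetical helper:

```python
def with_primary(primary, others):
    """Return the ISSN set for a row, guaranteed to contain the primary ISSN."""
    return set(others) | {primary}

# Values taken from the Psicologia USP example
print(with_primary("0103-6564", {"1678-5177"}))
```

With the notebook's variables, this could be applied row by row,
e.g. `[with_primary(p, s) for p, s in zip(issn_scielo, issns_set)]`.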

## Mixed ISSN in the `ISSN SciELO` field

The `ISSN SciELO` should have a *primary* ISSN,
in the *primary key* sense from databases,
somewhat arbitrary but still required
in order to avoid errors in analysis.
Crossing the data with other tables
should ideally not require any other ISSN,
and that's the main goal:
keep everything simple after this normalization.

There is at most one mixed ISSN for every ISSN list
(that is, there's a single ISSN in the `ISSN's` field
 different from the `ISSN SciELO` of the same row
 that appears in the `ISSN SciELO` field of another row):

In [24]:
other_mixed_issns = (issns_set - issn_scielo.apply(lambda issn: {issn})) \
    .apply(lambda issn_set: {issn for issn in issn_set
                             if issn in issn_scielo.values})
how_many_mixed_issns = other_mixed_issns.apply(len)
how_many_mixed_issns.max()

1

If that number were greater than $1$,
the technique below wouldn't work.
Actually, our goal is just to find a mapping
that would fix the mixed ISSN,
i.e., for a set of ISSN values for a single journal,
the `ISSN SciELO` should always have the same ISSN
in every entry belonging to that same journal.
Below is the mapping of
what appears in both the `ISSN's` and `ISSN SciELO` columns
and a distinct value that appears in the `ISSN SciELO`.

In [25]:
has_mixed_issn = how_many_mixed_issns > 0
mixed_issn_df = pd.DataFrame([
                    other_mixed_issns[has_mixed_issn]
                        .apply(lambda x: set(x).pop())
                        .rename("mixed_issn"),
                    issn_scielo[has_mixed_issn],
                ]).T
mixed_issn_df

Unnamed: 0,mixed_issn,ISSN SciELO
63,1980-5438,0103-5665
83,1518-3319,2237-101X
281,1678-5177,0103-6564
512,2077-3323,1817-7433
957,0258-6444,2215-3535
1437,1668-7027,0325-8203
1455,2175-3598,0104-1282
1459,1980-5438,0103-5665
1651,0797-9789,1688-499X
1661,1688-4094,1688-4221


That small table above is exhaustive.
We can select any of the columns to be the normalized ISSN,
taking care of duplicated entries.
The rows with the issues above are:

In [26]:
journals[["collection", "title at SciELO",
          "title thematic areas", "publisher name"]] \
    .assign(issn_scielo=issn_scielo,
            issns=issns_set) \
    [issn_scielo.isin(mixed_issn_df.values.ravel())]

Unnamed: 0,collection,title at SciELO,title thematic areas,publisher name,issn_scielo,issns
63,scl,Psicologia Clínica,Human Sciences,Departamento de Psicologia da Pontifícia Universidade Católica do Rio de Janeiro,0103-5665,"{0103-5665, 1980-5438}"
83,scl,Topoi (Rio de Janeiro),Human Sciences,Programa de Pós-Graduação em História Social da Universidade Federal do Rio de Janeiro,2237-101X,"{2237-101X, 1518-3319}"
281,scl,Psicologia USP,Human Sciences,Instituto de Psicologia da Universidade de São Paulo,0103-6564,"{1678-5177, 0103-6564}"
380,arg,Interdisciplinaria,Human Sciences,Centro Interamericano de Investigaciones Psicológicas y Ciencias Afines (CIIPCA),1668-7027,{1668-7027}
510,bol,Revista Ciencia y Cultura,"Applied Social Sciences;Human Sciences;Linguistics, Letters and Arts",Universidad Católica Boliviana,2077-3323,{2077-3323}
512,bol,Revista Científica Ciencia Médica,Health Sciences,"Facultad de Medicina, Universidad Mayor de San Simón.",1817-7433,{2077-3323}
957,cri,Actualidades en Psicología,Applied Social Sciences;Health Sciences,"Instituto de Investigaciones Psicológicas, Universidad de Costa Rica",2215-3535,"{0258-6444, 2215-3535}"
1365,psi,Psicologia USP,Human Sciences,Instituto de Psicologia da Universidade de São Paulo,1678-5177,{1678-5177}
1421,psi,Ciencias Psicológicas,Human Sciences,"Facultad de Psicología de la Universidad Católica del Uruguay, Damaso A. Larrañaga",1688-4094,{1688-4094}
1422,psi,Actualidades en psicología,Applied Social Sciences,Universidad de Costa Rica. Facultad de Ciencias Sociales. Instituto de Investigaciones Psicológicas,0258-6444,{0258-6444}


The `1817-7433` entry in the `bol` collection
has an incorrect secondary `2077-3323` ISSN
(the two entries are distinct journals from distinct thematic areas).
That won't give us any trouble
as long as we don't use the `ISSN's` column afterwards,
but for this normalization our goal is to fix that, as well.

The resulting mapping is:

In [27]:
issns_select = {
    "1980-5438": "0103-5665", # psi -> scl/psi
    "2237-101X": "1518-3319", # sss -> scl
    "1678-5177": "0103-6564", # psi -> scl
    "0325-8203": "1668-7027", # psi -> arg
    "2175-3598": "0104-1282", # psi -> psi
    "0797-9789": "1688-499X", # sss -> ury
    "1688-4094": "1688-4221", # psi -> ury
    "0258-6444": "2215-3535", # psi -> cri
}

Full normalization of the `ISSN SciELO` in a single step can be achieved with:

In [28]:
issn_scielo_n = journals["ISSN SciELO"].replace({**issns_fix, **issns_select})

## Summary

In [33]:
from pprint import pprint

### Normalizing the ISSN SciELO

We can apply all the normalization
from the `issns_fix` and `issns_select` dictionaries
by updating the dataframe with:

```python
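# Assign the result back: an in-place replace on a selected column
# is not guaranteed to update the dataframe in recent pandas versions
journals["ISSN SciELO"] = journals["ISSN SciELO"].replace(issn_scielo_fix)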
```

Where `issn_scielo_fix` should be the joined dictionary,
as follows:

In [34]:
pprint({**issns_fix, **issns_select})

{'0001-6002': '0001-6012',
 '0258-6444': '2215-3535',
 '0325-8203': '1668-7027',
 '0719-448x': '0719-448X',
 '0797-9789': '1688-499X',
 '0807-8967': '0870-8967',
 '0858-6444': '0258-6444',
 '1315-5216': '1316-5216',
 '1667-8682': '1667-8982',
 '1678-5177': '0103-6564',
 '1683-0789': '1683-0768',
 '1688-4094': '1688-4221',
 '1852-4418': '1852-4184',
 '1980-5438': '0103-5665',
 '2175-3598': '0104-1282',
 '2233-7666': '2223-7666',
 '2237-101X': '1518-3319',
 '2993-6797': '2393-6797'}
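
The joined dictionary has one chained pair:
`0858-6444` maps to `0258-6444`, which itself maps to `2215-3535`.
In this dataset `0858-6444` was only seen in the `ISSN's` column,
so the chain doesn't affect `ISSN SciELO`.
If the same dictionary is reused elsewhere,
the chain can be made explicit beforehand.
A sketch with a hypothetical `resolve_chains` helper:

```python
def resolve_chains(mapping):
    """Follow each value through the mapping until it is no longer a key."""
    resolved = {}
    for source, target in mapping.items():
        seen = {source}  # guards against accidental cycles
        while target in mapping and target not in seen:
            seen.add(target)
            target = mapping[target]
        resolved[source] = target
    return resolved

sample = {"0858-6444": "0258-6444", "0258-6444": "2215-3535"}
print(resolve_chains(sample))
```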


### Beyond normalization

The goal of this normalization
is to analyze the data from `journals.csv`.
For some contexts,
you can keep the old values of your data,
e.g. by adding a new column
instead of replacing the raw one:

```python
journals["issn"] = journals["ISSN SciELO"].replace(issn_scielo_fix)
```

Or:

```python
# Usually, this syntax is more helpful for using the
# "assign" expression, not as part of an assignment statement
journals = journals.assign(
    issn=journals["ISSN SciELO"].replace(issn_scielo_fix),
)
```

The reason for keeping the raw data
is that some external reference or some user input
might be looking for an invalid/inconsistent entry
that no longer exists because of this normalization.