# Cleaning / Normalizing the ISSN

*This is an analysis of the `journals.csv` report/spreadsheet/dataset
for the full SciELO network,
as it was in its 2018-06-10 release.
Future versions will hopefully have pre-normalized ISSN fields.*

Some journals might have more than one ISSN,
since every medium (electronic/print/CD/etc.) has its own ISSN.
Moreover, the ISSN assigned to a journal might differ between collections.
We should find a way to normalize them,
in order to know when two entries refer to the same journal.

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400

In [2]:
journals = pd.read_csv("tabs_network/journals.csv")

There are two columns regarding ISSN:

In [3]:
[col for col in journals.columns if "ISSN" in col.upper()]

['ISSN SciELO', "ISSN's"]

The first, `ISSN SciELO`, holds the ISSN selected
to act as something akin to a primary key,
whereas `ISSN's` holds a list of ISSNs
regarding the same journal content,
written as a single string
where the ISSNs are separated by a `;` (semicolon) symbol.

## Detecting grossly invalid ISSNs

The format of an ISSN is `NNNN-NNNC`,
where `N` is a digit (from $0$ to $9$)
and `C` is a check "digit" (from $0$ to $9$ or $X$).
Is there any ISSN in the `ISSN SciELO` column
that doesn't conform to that?

In [4]:
single_issn_regex = r"^\d{4}-\d{3}[\dX]$"
journals[["ISSN SciELO"]][~journals["ISSN SciELO"].str.contains(single_issn_regex)]

Unnamed: 0,ISSN SciELO
1410,0719-448x


It's not invalid,
but we should always use the same letter case
in order to work with the ISSN as a *matching index* or *primary key*.
A proper normalization would use something like
`journals["ISSN SciELO"].str.upper()`.
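As a minimal sketch of that normalization, the toy series below is uppercased before the same regular expression is applied. Only `0719-448x` is taken from the output above; the other values are picked for illustration.

```python
import pandas as pd

single_issn_regex = r"^\d{4}-\d{3}[\dX]$"

# Toy data: only "0719-448x" comes from the output above,
# the other two values are chosen for illustration
sample = pd.Series(["0719-448x", "0103-5665", "12345678"])

# Uppercasing first lets the lowercase check character pass the format test
normalized = sample.str.upper()
still_invalid = normalized[~normalized.str.contains(single_issn_regex)]
print(still_invalid.tolist())  # ['12345678']
```

Only the value without the `-` separator remains flagged once the letter case is normalized.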

How about the `ISSN's` column?

In [5]:
multi_issn_regex = r"^(?:\d{4}-\d{3}[\dX])(?:;\d{4}-\d{3}[\dX])*$"
journals[["ISSN's"]][~journals["ISSN's"].fillna("").str.contains(multi_issn_regex)]

Unnamed: 0,ISSN's
102,
103,
660,ISSN;0252-8584
1410,0719-448x;0718-0446
1696,20030507;1315-6411


Besides the `x` case issue and the empty `ISSN's` fields,
the date-like `20030507` and the `ISSN` text are invalid ISSN values,
the latter being the only grossly invalid entry found.
The date-like one was caught here because it lacks the `-`,
but it would be invalid anyway due to its last digit,
which should have been `9` in order to make a valid ISSN.
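The check digit claim can be verified with the standard ISSN rule: the first seven digits are weighted from $8$ down to $2$, and the check character is the value that makes the total a multiple of $11$, with `X` standing for $10$. The sketch below is not part of the original analysis, and the helper names `issn_check_digit` and `is_valid_issn` are hypothetical.

```python
def issn_check_digit(first_seven_digits):
    """Compute the ISSN check character from the first seven digits."""
    # Weights run from 8 (first digit) down to 2 (seventh digit)
    weighted_sum = sum(
        int(digit) * weight
        for digit, weight in zip(first_seven_digits, range(8, 1, -1))
    )
    check_value = (11 - weighted_sum % 11) % 11
    return "X" if check_value == 10 else str(check_value)


def is_valid_issn(issn):
    """Verify the check character of NNNN-NNNC or NNNNNNNC (any letter case)."""
    chars = issn.upper().replace("-", "")
    return (len(chars) == 8
            and chars[:7].isdecimal()
            and issn_check_digit(chars[:7]) == chars[7])


print(issn_check_digit("2003050"))  # 9
print(is_valid_issn("20030507"))    # False
print(is_valid_issn("0719-448x"))   # True
```

For `2003050` the weighted sum is $46$, and $46 \bmod 11 = 2$, so the check digit is $11 - 2 = 9$.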

We can clean these issues
by filling the `NaN` with the `ISSN SciELO` value from the same row,
by converting to uppercase to get rid of the single lowercase `x`,
and by filtering out the undesired values.

In [6]:
issn_scielo = journals["ISSN SciELO"].str.upper()
issns_set = journals["ISSN's"] \
    .fillna(issn_scielo) \
    .str.upper() \
    .str.split(";") \
    .apply(lambda items: {item for item in items
                               if item not in ["ISSN", "20030507"]})
issns_set.tail() # `ISSN's` as a set

1716    {2443-468X, 1012-2508}
1717               {0254-0770}
1718               {1316-0087}
1719               {1317-5815}
1720               {0367-4762}
Name: ISSN's, dtype: object

## Mixed ISSN in the `ISSN SciELO` field

The `ISSN SciELO` should have a *primary* ISSN,
in the *primary key* sense from databases,
somewhat arbitrary but still required
in order to avoid errors in analysis.
Crossing the data with other tables
should ideally not require any other ISSN,
and that's the main goal:
keep everything simple after this normalization.

There is at most one mixed ISSN for every ISSN list
(that is, there's at most a single ISSN in the `ISSN's` field
 different from the `ISSN SciELO` of the same row
 that appears in the `ISSN SciELO` field of another row):

In [7]:
other_mixed_issns = (issns_set - issn_scielo.apply(lambda issn: {issn})) \
    .apply(lambda issn_set: {issn for issn in issn_set
                             if issn in issn_scielo.values})
how_many_mixed_issns = other_mixed_issns.apply(len)
how_many_mixed_issns.max()

1

If that number were greater than $1$,
the technique below wouldn't work.
Actually, our goal is just to find a mapping
that would fix the mixed ISSN,
i.e., for a set of ISSN values for a single journal,
the `ISSN SciELO` should always have the same ISSN
in every entry belonging to that same journal.
Below is the mapping between each mixed ISSN
(a value from `ISSN's` that is also the `ISSN SciELO` of another row)
and the `ISSN SciELO` of the row where it was found.

In [8]:
has_mixed_issn = how_many_mixed_issns > 0
mixed_issn_df = pd.DataFrame([
                    other_mixed_issns[has_mixed_issn]
                        .apply(lambda x: set(x).pop())
                        .rename("mixed_issn"),
                    issn_scielo[has_mixed_issn],
                ]).T
mixed_issn_df

Unnamed: 0,mixed_issn,ISSN SciELO
63,1980-5438,0103-5665
83,1518-3319,2237-101X
281,1678-5177,0103-6564
512,2077-3323,1817-7433
1437,1668-7027,0325-8203
1455,2175-3598,0104-1282
1459,1980-5438,0103-5665
1651,0797-9789,1688-499X
1661,1688-4094,1688-4221


That small table above is exhaustive.
We can select any of the columns to be the normalized ISSN,
taking care of duplicated entries.
The rows with the issues above are:

In [9]:
journals[["collection", "title at SciELO",
          "title thematic areas", "publisher name"]] \
    .assign(issn_scielo=issn_scielo,
            issns=issns_set) \
    [issn_scielo.isin(mixed_issn_df.values.ravel())]

Unnamed: 0,collection,title at SciELO,title thematic areas,publisher name,issn_scielo,issns
63,scl,Psicologia Clínica,Human Sciences,Departamento de Psicologia da Pontifícia Universidade Católica do Rio de Janeiro,0103-5665,"{0103-5665, 1980-5438}"
83,scl,Topoi (Rio de Janeiro),Human Sciences,Programa de Pós-Graduação em História Social da Universidade Federal do Rio de Janeiro,2237-101X,"{2237-101X, 1518-3319}"
281,scl,Psicologia USP,Human Sciences,Instituto de Psicologia da Universidade de São Paulo,0103-6564,"{0103-6564, 1678-5177}"
380,arg,Interdisciplinaria,Human Sciences,Centro Interamericano de Investigaciones Psicológicas y Ciencias Afines (CIIPCA),1668-7027,{1668-7027}
510,bol,Revista Ciencia y Cultura,"Applied Social Sciences;Human Sciences;Linguistics, Letters and Arts",Universidad Católica Boliviana,2077-3323,{2077-3323}
512,bol,Revista Científica Ciencia Médica,Health Sciences,"Facultad de Medicina, Universidad Mayor de San Simón.",1817-7433,{2077-3323}
1365,psi,Psicologia USP,Human Sciences,Instituto de Psicologia da Universidade de São Paulo,1678-5177,{1678-5177}
1421,psi,Ciencias Psicológicas,Human Sciences,"Facultad de Psicología de la Universidad Católica del Uruguay, Damaso A. Larrañaga",1688-4094,{1688-4094}
1436,psi,Psicologia clínica (Rio de Janeiro. Online),Applied Social Sciences,"Pontifícia Universidade Católica do Rio de Janeiro, Departamento de Psicologia",1980-5438,{1980-5438}
1437,psi,Interdisciplinaria,Human Sciences,Centro Interamericano de Investigaciones Psicológicas y Ciencias Afines (CIIPCA),0325-8203,"{1668-7027, 0325-8203}"


The `1817-7433` entry in the `bol` collection
has an incorrect secondary `2077-3323` ISSN
(the entries are from distinct thematic areas).
That won't give us any trouble
as long as we don't use the `ISSN's` column afterwards,
but for this normalization our goal is to fix that as well.

The resulting mapping is:

In [10]:
issn_scielo_fix = {
    "1980-5438": "0103-5665", # psi -> scl/psi
    "2237-101X": "1518-3319", # sss -> scl
    "1678-5177": "0103-6564", # psi -> scl
    "0325-8203": "1668-7027", # psi -> arg
    "2175-3598": "0104-1282", # psi -> psi
    "0797-9789": "1688-499X", # sss -> ury
    "1688-4094": "1688-4221", # psi -> ury
    "0258-6444": "2215-3535", # psi -> cri
    "0719-448x": "0719-448X", # letter case normalization
}
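A hand-written mapping like this one can be sanity-checked before it is applied. The sketch below is an optional check, not part of the original analysis, and the helper name `find_chained_targets` is hypothetical. It flags any ISSN that is both a key and a value, which would mean two entries of the mapping disagree about the normalized ISSN. The comparison is case-sensitive, so the `0719-448x` entry is not flagged.

```python
def find_chained_targets(mapping):
    """Return the ISSNs that appear both as a key and as a value."""
    return set(mapping) & set(mapping.values())


# Two entries taken from the mapping above: no ISSN is on both sides
consistent = {"1980-5438": "0103-5665", "1678-5177": "0103-6564"}

# Deliberately broken example: each ISSN points to the other one
inconsistent = {"1980-5438": "0103-5665", "0103-5665": "1980-5438"}

print(find_chained_targets(consistent))    # set()
print(find_chained_targets(inconsistent))  # both ISSNs, in arbitrary order
```

Applied to the whole `issn_scielo_fix` dictionary, this check returns an empty set.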

Full normalization of the `ISSN SciELO` in a single step can be achieved with:

In [11]:
issn_scielo_n = journals["ISSN SciELO"].replace(issn_scielo_fix)

## Summary

In [12]:
from pprint import pprint

### Normalizing the ISSN SciELO

We can apply all the normalization
from the `issn_scielo_fix` dictionary
by updating the dataframe with:

```python
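# Assigning the result back works regardless of the pandas copy-on-write
# setting, unlike calling "replace" with inplace=True on a selected column
journals["ISSN SciELO"] = journals["ISSN SciELO"].replace(issn_scielo_fix)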
```

Where `issn_scielo_fix` should be the joined dictionary,
as follows:

In [13]:
pprint(issn_scielo_fix)

{'0258-6444': '2215-3535',
 '0325-8203': '1668-7027',
 '0719-448x': '0719-448X',
 '0797-9789': '1688-499X',
 '1678-5177': '0103-6564',
 '1688-4094': '1688-4221',
 '1980-5438': '0103-5665',
 '2175-3598': '0104-1282',
 '2237-101X': '1518-3319'}


### Beyond normalization

The goal of this normalization
is to analyze the data from `journals.csv`.
In some contexts,
you might want to keep the old values of your data,
e.g. by adding a new column
instead of replacing the raw one:

```python
journals["issn"] = journals["ISSN SciELO"].replace(issn_scielo_fix)
```

Or:

```python
# The "assign" method is usually more helpful as an expression
# (e.g. in a method chain) than as part of an assignment statement
journals = journals.assign(
    issn=journals["ISSN SciELO"].replace(issn_scielo_fix),
)
```

The reason for keeping the raw data
is that some external reference or some user input
might be looking for an invalid/inconsistent entry
that no longer exists after this normalization.
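With both columns available, the rows affected by the normalization can be listed, and a raw value can still be traced to its normalized ISSN. This is a minimal sketch on a toy dataframe: `journals_sample`, `issn_fix_sample`, `changed` and `raw_by_issn` are hypothetical names, and `1234-5679` is a made-up ISSN that the mapping leaves untouched.

```python
import pandas as pd

# Toy data: the first three ISSNs appear in this analysis,
# the last one is made up and is not affected by the mapping
journals_sample = pd.DataFrame({
    "ISSN SciELO": ["1980-5438", "0103-5665", "0719-448x", "1234-5679"],
})
issn_fix_sample = {"1980-5438": "0103-5665", "0719-448x": "0719-448X"}

# Keep the raw column and store the normalized values in a new one
journals_sample["issn"] = journals_sample["ISSN SciELO"].replace(issn_fix_sample)

# Rows whose primary ISSN was changed by the normalization
changed = journals_sample[journals_sample["ISSN SciELO"] != journals_sample["issn"]]
print(changed.index.tolist())  # [0, 2]

# Reverse lookup: raw values grouped by their normalized ISSN
raw_by_issn = journals_sample.groupby("issn")["ISSN SciELO"].apply(list).to_dict()
print(raw_by_issn["0103-5665"])  # ['1980-5438', '0103-5665']
```

A lookup for an old value such as `1980-5438` can then be answered from the raw column, while the analysis itself relies on the normalized `issn` column.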