# Cleaning / Normalizing the thematic area

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400

In [2]:
%matplotlib inline

## Loading the dataset

In [3]:
journals = pd.read_csv("tabs_network/journals.csv")
journals.columns

Index(['extraction date', 'study unit', 'collection', 'ISSN SciELO', 'ISSN's',
       'title at SciELO', 'title thematic areas',
       'title is agricultural sciences', 'title is applied social sciences',
       'title is biological sciences', 'title is engineering',
       'title is exact and earth sciences', 'title is health sciences',
       'title is human sciences', 'title is linguistics, letters and arts',
       'title is multidisciplinary', 'title current status',
       'title + subtitle SciELO', 'short title SciELO', 'short title ISO',
       'title PubMed', 'publisher name', 'use license', 'alpha frequency',
       'numeric frequency (in months)', 'inclusion year at SciELO',
       'stopping year at SciELO', 'stopping reason',
       'date of the first document', 'volume of the first document',
       'issue of the first document', 'date of the last document',
       'volume of the last document', 'issue of the last document',
       'total of issues', 'issues at 2018', 'is

The raw column names are awkward to work with:
they contain spaces, punctuation and small details
like the trailing whitespace in the later fields.
The easiest way to deal with them
is to run the normalization function
from the column names simplification notebook.
Applying it is straightforward, and the order of the columns is kept as is.

In [4]:
def normalize_column_title(name):
    import re
    name_unbracketed = re.sub(r".*\((.*)\)", r"\1",
                              name.replace("(in months)", "in_months"))
    words = re.sub("[^a-z0-9+_ ]", "", name_unbracketed.lower()).split()
    ignored_words = ("at", "the", "of", "and", "google", "scholar", "+")
    replacements = {
        "document": "doc",
        "documents": "docs",
        "frequency": "freq",
        "language": "lang",
    }
    return "_".join(replacements.get(word, word)
                    for word in words if word not in ignored_words) \
              .replace("title_is", "is")

In [5]:
journals.rename(columns=normalize_column_title, inplace=True)
journals.columns

Index(['extraction_date', 'study_unit', 'collection', 'issn_scielo', 'issns',
       'title_scielo', 'title_thematic_areas', 'is_agricultural_sciences',
       'is_applied_social_sciences', 'is_biological_sciences',
       'is_engineering', 'is_exact_earth_sciences', 'is_health_sciences',
       'is_human_sciences', 'is_linguistics_letters_arts',
       'is_multidisciplinary', 'title_current_status', 'title_subtitle_scielo',
       'short_title_scielo', 'short_iso', 'title_pubmed', 'publisher_name',
       'use_license', 'alpha_freq', 'numeric_freq_in_months',
       'inclusion_year_scielo', 'stopping_year_scielo', 'stopping_reason',
       'date_first_doc', 'volume_first_doc', 'issue_first_doc',
       'date_last_doc', 'volume_last_doc', 'issue_last_doc', 'total_issues',
       'issues_2018', 'issues_2017', 'issues_2016', 'issues_2015',
       'issues_2014', 'issues_2013', 'total_regular_issues',
       'regular_issues_2018', 'regular_issues_2017', 'regular_issues_2016',
       'regul
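As a minimal sketch of how `rename(columns=...)` behaves when given a callable, the toy example below uses a much simpler, hypothetical normalizer (`to_snake_case`, not the function from this notebook) and a made-up three-column table. It only illustrates that the callable is applied to each column name and that the column order is preserved.

```python
import pandas as pd


def to_snake_case(name):
    # Hypothetical helper for this illustration only:
    # strips outer whitespace, lowercases, joins the words with "_"
    return "_".join(name.strip().lower().split())


# Toy table; the second column name has a trailing whitespace
toy = pd.DataFrame({
    "ISSN SciELO": ["0011-5258"],
    "title current status ": ["current"],
    "collection": ["scl"],
})

# Without inplace=True, rename returns a new DataFrame
renamed = toy.rename(columns=to_snake_case)
print(list(renamed.columns))
# ['issn_scielo', 'title_current_status', 'collection']
```

Leaving out `inplace=True` keeps the original table untouched, which makes it easier to compare the names before and after.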

## Thematic areas

At first, it might seem that there are way too many thematic areas:

In [6]:
journals["title_thematic_areas"].unique()

array(['Applied Social Sciences', 'Health Sciences', 'Human Sciences',
       'Exact and Earth Sciences', 'Biological Sciences',
       'Agricultural Sciences',
       'Biological Sciences;Exact and Earth Sciences',
       'Engineering;Exact and Earth Sciences',
       'Agricultural Sciences;Biological Sciences',
       'Applied Social Sciences;Human Sciences', 'Engineering',
       'Health Sciences;Human Sciences',
       'Agricultural Sciences;Biological Sciences;Exact and Earth Sciences;Health Sciences',
       'Linguistics, Letters and Arts',
       'Biological Sciences;Health Sciences',
       'Agricultural Sciences;Biological Sciences;Health Sciences',
       'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Health Sciences;Human Sciences',
       'Agricultural Sciences;Biological Sciences;Engineering;Exact and Earth Sciences;Human Sciences',
       'Agricultural Sciences;Biological Sciences;Engineering;Health Sciences',
       'Applied Social Scienc

Actually, there are just $8$ of them (plus one invalid value),
and what we're seeing are their many combinations:

In [7]:
set.union(*journals["title_thematic_areas"].str.split(";").dropna().apply(set).values)

{'Agricultural Sciences',
 'Applied Social Sciences',
 'Biological Sciences',
 'Engineering',
 'Exact and Earth Sciences',
 'Health Sciences',
 'Human Sciences',
 'Linguistics, Letters and Arts',
 'Psicanalise'}

`Psicanalise` isn't a thematic area.
It appears in the `psi` collection,
which is independent
(i.e., it's in the SciELO network but it's not maintained by SciELO,
 and its requirements regarding some fields
 aren't the same as those of the other collections).
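The same union of semicolon-separated values can be sketched with `Series.explode`, which some readers may find easier to follow than `set.union`. The data below is a made-up toy series, not the real `title_thematic_areas` column.

```python
import pandas as pd

# Toy stand-in for the thematic areas column
toy_areas = pd.Series([
    "Health Sciences",
    "Biological Sciences;Health Sciences",
    None,  # Missing value, like the empty entries in the real data
    "Engineering;Exact and Earth Sciences",
])

# Split each text on ";", put one area per row, collect the distinct ones
distinct_areas = set(toy_areas.dropna().str.split(";").explode())
print(sorted(distinct_areas))
# ['Biological Sciences', 'Engineering', 'Exact and Earth Sciences', 'Health Sciences']
```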

We don't need to worry much about this column
in this normalization step,
since its information is also split into the several `title is ...` columns,
which have been renamed here to:

In [8]:
areas_map = {
    "Agricultural Sciences": "is_agricultural_sciences",
    "Applied Social Sciences": "is_applied_social_sciences",
    "Biological Sciences": "is_biological_sciences",
    "Engineering": "is_engineering",
    "Exact and Earth Sciences": "is_exact_earth_sciences",
    "Health Sciences": "is_health_sciences",
    "Human Sciences": "is_human_sciences",
    "Linguistics, Letters and Arts": "is_linguistics_letters_arts",
}
areas = list(areas_map.values())

## Multidisciplinary

Actually, `is_multidisciplinary` isn't a thematic area by itself,
but it might be useful, and its meaning can be promptly checked:

In [9]:
(
    (journals[areas].sum(axis=1) >= 3)
    != journals["is_multidisciplinary"].apply(bool)
).sum()

0

We have `is_multidisciplinary == 1` if and only if the journal has at least 3 areas.
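To make the rule concrete, here is a small sketch on a toy table with four hypothetical flag columns (`is_a` to `is_d` are placeholders, not real column names). It rebuilds the multidisciplinary flag from the number of areas in each row.

```python
import pandas as pd

flag_columns = ["is_a", "is_b", "is_c", "is_d"]  # Hypothetical area flags
toy = pd.DataFrame(
    [[1, 0, 0, 0],   # 1 area
     [1, 1, 0, 0],   # 2 areas
     [1, 1, 1, 0],   # 3 areas
     [1, 1, 1, 1]],  # 4 areas
    columns=flag_columns,
)

# A row is multidisciplinary when it has at least 3 areas
toy["is_multidisciplinary"] = (toy[flag_columns].sum(axis=1) >= 3).astype(int)
print(toy["is_multidisciplinary"].tolist())
# [0, 0, 1, 1]
```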

## Consistency between text and flags

Does the `title_thematic_areas` text
match the data in the single-area `is_*` columns?

In [10]:
tta_sets = (
    journals["title_thematic_areas"]
    .fillna("")
    .str.split(";")
    .apply(lambda x: {areas_map[area] for area in x
                                      if area in areas_map})
)
pd.concat([
    # One boolean column per area: does the flag differ from the text?
    journals[area] != tta_sets.apply(lambda x: int(area in x))
    for area in areas
], axis=1).any()

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool

Yes, it does, as long as we're ignoring the already seen `Psicanalise` value.
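An alternative way to run this kind of check, sketched on toy data, is to let `Series.str.get_dummies` build the 0/1 flags from the text and compare them with the stored flags. The table and the two-entry map below are made up for the illustration. Values missing from the map (like `Psicanalise`) would simply be dropped by the `reindex` call.

```python
import pandas as pd

toy = pd.DataFrame({
    "title_thematic_areas": ["Health Sciences",
                             "Engineering;Health Sciences",
                             None],
    "is_engineering": [0, 1, 0],
    "is_health_sciences": [1, 1, 0],
})
toy_map = {"Engineering": "is_engineering",
           "Health Sciences": "is_health_sciences"}

# Build one 0/1 column per area found in the text
dummies = (
    toy["title_thematic_areas"]
    .fillna("")
    .str.get_dummies(sep=";")
    .reindex(columns=list(toy_map), fill_value=0)  # Keep only known areas
    .rename(columns=toy_map)
)

# Number of rows where the text and the flag disagree, for each area
mismatches = (dummies != toy[list(toy_map.values())]).sum()
print(mismatches.to_dict())
```

A count of zero for every area means the text and the flags agree in this toy table.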

## Emptiness

Are there entries without any thematic area?

In [11]:
journals[journals[areas].sum(axis=1) == 0] \
        [["issn_scielo", "collection", "title_scielo", "title_thematic_areas"]]

Unnamed: 0,issn_scielo,collection,title_scielo,title_thematic_areas
1342,0104-3269,psi,Mudanças,
1343,1516-1854,psi,Interação,
1344,1679-074X,psi,Psicanalítica,
1345,1809-8894,psi,Mnemosine,
1346,1413-0556,psi,Psicanálise e Universidade,Psicanalise
1347,1413-4063,psi,Psicologia Revista,
1348,1806-6631,psi,Família e Comunidade,
1349,0102-7182,psi,Psicologia & Sociedade,
1350,1982-5471,psi,Mosaico,
1351,0103-863X,psi,Paidéia (Ribeirão Preto),


That includes every entry with the unnormalized `Psicanalise` value
as the thematic area, which will be regarded here as invalid.

In [12]:
journals[journals["title_thematic_areas"] == "Psicanalise"].shape

(5, 98)

That's consistent, and all empty/invalid entries are from the `psi` collection.
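The claim about the collection can be checked directly. The sketch below does so on a toy table (made-up rows, only two area flags), listing the collections of the rows without any area and comparing that set of rows with the ones whose text is empty or `Psicanalise`.

```python
import pandas as pd

toy = pd.DataFrame({
    "collection": ["scl", "psi", "psi", "col"],
    "title_thematic_areas": ["Health Sciences", None,
                             "Psicanalise", "Engineering"],
    "is_engineering": [0, 0, 0, 1],
    "is_health_sciences": [1, 0, 0, 0],
})
toy_areas = ["is_engineering", "is_health_sciences"]

# Rows without any thematic area flag
no_area = toy[toy_areas].sum(axis=1) == 0
print(toy.loc[no_area, "collection"].unique().tolist())
# ['psi']

# Rows whose text is missing or has the invalid value
invalid_text = (toy["title_thematic_areas"].isna()
                | (toy["title_thematic_areas"] == "Psicanalise"))
same_rows = bool((no_area == invalid_text).all())
print(same_rows)
# True
```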

## Consistency within the ISSN

We'll need the ISSN, so let's normalize it
by applying the snippet from the ISSN normalization notebook:

In [13]:
issn_scielo_fix = {"0001-6002": "0001-6012",
                   "0258-6444": "2215-3535",
                   "0325-8203": "1668-7027",
                   "0719-448x": "0719-448X",
                   "0797-9789": "1688-499X",
                   "0807-8967": "0870-8967",
                   "0858-6444": "0258-6444",
                   "1315-5216": "1316-5216",
                   "1667-8682": "1667-8982",
                   "1678-5177": "0103-6564",
                   "1683-0789": "1683-0768",
                   "1688-4094": "1688-4221",
                   "1852-4418": "1852-4184",
                   "1980-5438": "0103-5665",
                   "2175-3598": "0104-1282",
                   "2233-7666": "2223-7666",
                   "2237-101X": "1518-3319",
                   "2993-6797": "2393-6797"}
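# Assign the result back instead of using inplace=True on the column,
# which might act on a temporary copy in recent pandas versions
journals["issn_scielo"] = journals["issn_scielo"].replace(issn_scielo_fix)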
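As a small illustration of what this replacement does, the sketch below applies two of the pairs from the dictionary above to a toy series. Values that aren't keys of the dictionary are left unchanged.

```python
import pandas as pd

toy_issns = pd.Series(["0719-448x", "0011-5258", "2233-7666"])

# Two pairs taken from the issn_scielo_fix dictionary
toy_fix = {"0719-448x": "0719-448X",
           "2233-7666": "2223-7666"}

fixed = toy_issns.replace(toy_fix)
print(fixed.tolist())
# ['0719-448X', '0011-5258', '2223-7666']
```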

Each journal might have more than one row,
since it might appear in more than one collection,
but there might be some inconsistency going on, as well.
Repeated rows aren't a big issue,
but every inconsistent duplication needs to be fixed.
Which ISSNs are inconsistent?
That is, which ISSNs
are assigned to distinct thematic areas in distinct rows?

In [14]:
areas_inconsistency = journals[journals[areas].sum(axis=1) != 0] \
                              [["issn_scielo"] + areas] \
    .groupby("issn_scielo") \
    .apply(lambda df: df.apply(lambda col: set(col.dropna())).apply(len).max() > 1)
areas_inconsistency_index = areas_inconsistency[areas_inconsistency].index
areas_inconsistency_index

Index(['0011-5258', '0100-512X', '0100-8587', '0101-3300', '0101-9074',
       '0102-6909', '0103-2070', '0103-5665', '0104-026X', '0104-4478',
       '0104-7183', '0104-8333', '0104-9313', '0120-0534', '0254-9247',
       '0717-7194', '0718-6924', '1012-1587', '1413-294X', '1413-8271',
       '1414-3283', '1414-753X', '1414-9893', '1517-4522', '1518-3319',
       '1688-4221', '1688-499X', '1794-9998', '1806-6445', '1806-6976',
       '1981-3821', '2215-3535'],
      dtype='object', name='issn_scielo')
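A more compact way to express the same idea, sketched here on toy data, is to count the distinct values of each flag per ISSN with `nunique`. The ISSN `1234-5678` and all rows below are made up. Unlike the cell above, this sketch doesn't filter out the rows without any area beforehand.

```python
import pandas as pd

toy = pd.DataFrame({
    "issn_scielo": ["0011-5258", "0011-5258", "1234-5678", "1234-5678"],
    "collection": ["scl", "sss", "scl", "col"],
    "is_applied_social_sciences": [0, 1, 0, 0],
    "is_human_sciences": [1, 0, 1, 1],
})
toy_areas = ["is_applied_social_sciences", "is_human_sciences"]

# Number of distinct values of each flag within each ISSN
distinct_counts = toy.groupby("issn_scielo")[toy_areas].nunique()

# An ISSN is inconsistent when any flag has more than one value
inconsistent = distinct_counts.gt(1).any(axis=1)
print(inconsistent[inconsistent].index.tolist())
# ['0011-5258']
```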

In [15]:
pd.DataFrame(
    journals[journals["issn_scielo"].isin(areas_inconsistency_index)]
        .groupby("issn_scielo")
        .apply(lambda df: {k: v for k, v in df[areas].apply(set).to_dict().items()
                                if len(v) > 1})
        .apply(sorted) # Casts from dictionary (keys) to list
        .rename("inconsistency")
)

Unnamed: 0_level_0,inconsistency
issn_scielo,Unnamed: 1_level_1
0011-5258,"[is_applied_social_sciences, is_human_sciences]"
0100-512X,[is_applied_social_sciences]
0100-8587,[is_applied_social_sciences]
0101-3300,[is_applied_social_sciences]
0101-9074,"[is_applied_social_sciences, is_human_sciences]"
0102-6909,"[is_applied_social_sciences, is_human_sciences]"
0103-2070,"[is_applied_social_sciences, is_human_sciences]"
0103-5665,"[is_applied_social_sciences, is_human_sciences]"
0104-026X,[is_applied_social_sciences]
0104-4478,"[is_applied_social_sciences, is_human_sciences]"


There seem to be way too many inconsistencies,
but many of them only come from rows without any thematic area.
Let's remove these empty entries before checking again.

In [16]:
inconsistencies_df = pd.DataFrame(
    journals[journals["issn_scielo"].isin(areas_inconsistency_index) &
             (journals[areas].sum(axis=1) != 0)]  # Skip rows without any area
        .groupby("issn_scielo")
        .apply(lambda df: sorted(k for k, v in df[areas].apply(set).to_dict().items()
                                   if len(v) > 1)
                          or None)
        .dropna()
        .rename("inconsistency")
)
inconsistencies_df

Unnamed: 0_level_0,inconsistency
issn_scielo,Unnamed: 1_level_1
0011-5258,"[is_applied_social_sciences, is_human_sciences]"
0101-9074,"[is_applied_social_sciences, is_human_sciences]"
0102-6909,"[is_applied_social_sciences, is_human_sciences]"
0103-2070,"[is_applied_social_sciences, is_human_sciences]"
0103-5665,"[is_applied_social_sciences, is_human_sciences]"
0104-4478,"[is_applied_social_sciences, is_human_sciences]"
0104-7183,"[is_applied_social_sciences, is_human_sciences]"
0120-0534,"[is_biological_sciences, is_human_sciences]"
0254-9247,"[is_applied_social_sciences, is_human_sciences]"
0718-6924,"[is_applied_social_sciences, is_human_sciences]"


In [17]:
inconsistent_rows = (
    journals
        [journals["issn_scielo"].isin(inconsistencies_df.index)]
        [["issn_scielo", "collection", "title_thematic_areas", "title_current_status"]]
        .sort_values(by=["issn_scielo", "collection"])
)
inconsistent_rows.set_index(["issn_scielo", "collection"])

Unnamed: 0_level_0,Unnamed: 1_level_0,title_thematic_areas,title_current_status
issn_scielo,collection,Unnamed: 2_level_1,Unnamed: 3_level_1
0011-5258,scl,Human Sciences,current
0011-5258,sss,Applied Social Sciences,current
0101-9074,scl,Human Sciences,current
0101-9074,sss,Applied Social Sciences,current
0102-6909,scl,Human Sciences,current
0102-6909,sss,Applied Social Sciences,current
0103-2070,scl,Human Sciences,current
0103-2070,sss,Applied Social Sciences,current
0103-5665,psi,Applied Social Sciences,deceased
0103-5665,psi,Applied Social Sciences,current


In [18]:
inconsistent_rows.groupby("issn_scielo")["collection"].apply(set).value_counts()

{scl, sss}    9
{psi, scl}    3
{psi, col}    2
{psi, per}    1
{psi, chl}    1
{psi, rve}    1
Name: collection, dtype: int64

The above shows that, within each collection,
the thematic area is always consistent in the 2018-09-14 reports.
However, distinct collections sometimes classify the same journal differently.
Most of the affected entries appear in both
the now discontinued `sss` collection (Social Sciences)
and the `scl` collection (Brazil).
In these cases we should stick with the value given by the `scl` collection,
since it's probably the most up-to-date one.
The entries with both `psi` and `scl` have
the journal either `suspended` or `deceased` in `psi`,
so we should also use the value in the `scl` entry.
The same happens in the `col`-`psi` and `chl`-`psi` pairs.
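One possible way to apply this rule is sketched below on toy data: rows from `sss` and `psi` are marked as low priority, and for each ISSN the first row after sorting is kept. The ISSNs and rows are made up, and treating `psi` as low priority for every pair (including ones not discussed above) is an assumption of this sketch, not something established here.

```python
import pandas as pd

# Toy rows with made-up ISSNs
toy = pd.DataFrame({
    "issn_scielo": ["0000-0001", "0000-0001", "0000-0002", "0000-0002"],
    "collection": ["sss", "scl", "psi", "col"],
    "title_thematic_areas": ["Applied Social Sciences", "Human Sciences",
                             "Human Sciences", "Applied Social Sciences"],
})

# Assumption: values from these collections lose to any other collection
low_priority = {"sss", "psi"}
ranked = toy.assign(low_priority=toy["collection"].isin(low_priority))

# False sorts before True, so preferred rows come first within each ISSN
chosen = (
    ranked.sort_values(["issn_scielo", "low_priority"])
          .drop_duplicates("issn_scielo", keep="first")
          .set_index("issn_scielo")["title_thematic_areas"]
)
print(chosen.to_dict())
# {'0000-0001': 'Human Sciences', '0000-0002': 'Applied Social Sciences'}
```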