# CSV data analysis

Last Monday (`2018-06-04`), a CSV table was generated from an XML dataset containing $23821$ files. Of these, $425$ proved unreadable by the `lxml` parser and were discarded. Some of the remaining $23396$ valid XML files feature:

- Unknown or misspelled tag and attribute names
- Missing affiliation
- Nested affiliation
- Affiliations `<aff>` with no `<contrib>` reference
- ...

Some of these issues are solved by [scielo-clea v0.1](https://github.com/scieloorg/clea/releases/tag/v0.1.0) (henceforth named *Clea*), an article front matter metadata reader that gathers information while traversing the XML by means of fuzzy/approximate regexes. Its result for a single XML input is either a Python dictionary or a JSON HTTP response with a tabular structure. A CSV dataset was built from the XML dataset by joining and flattening every Clea result, effectively casting the XML dataset into a table-like dataset.

The CSV dataset has a row for each `<aff>`-`<contrib>` matching pair (*inner join* approach), and the goal is to look for any information in the dataset that can be useful for a data sanitization process such as:

- Filling the empty fields
- Fixing typoed contents
- Joining/grouping distinct values that have the same meaning

CSV file: https://drive.google.com/file/d/1XmBh6YlfPkB5WfYSolAMP1EA5e02jHQO/view?usp=sharing

## Loading the CSV dataset

In [1]:
import pandas as pd

In [2]:
dataset_raw = pd.read_csv("inner_join_2018-06-04.csv", dtype=str, keep_default_na=False)
len(dataset_raw)

94660

Some articles appeared more than once in the dataset, so let's remove the duplicates.

In [3]:
dataset = dataset_raw.drop_duplicates()
len(dataset)

89451

That's the number of `<aff>`-`<contrib>` matching pairs.
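
To make the *inner join* layout concrete, here is a toy sketch of how matching pairs arise. The `affs` and `contribs` tables and every value in them are made up for illustration; only the column names `aff_id`, `aff_text` and `contrib_name` are borrowed from the CSV, and this is not how Clea builds its result.

```python
import pandas as pd

# Toy affiliations: "aff3" is referenced by no contributor
affs = pd.DataFrame({
    "aff_id": ["aff1", "aff2", "aff3"],
    "aff_text": ["University A", "Institute B", "Hospital C"],
})

# Toy contributors: "Davi" has no affiliation reference at all
contribs = pd.DataFrame({
    "contrib_name": ["Ana", "Bruno", "Carla", "Davi"],
    "aff_id": ["aff1", "aff1", "aff2", ""],
})

# Inner join: one row per <aff>-<contrib> matching pair, nothing else
pairs = pd.merge(contribs, affs, on="aff_id", how="inner")
print(pairs)
```

In this sketch both the unreferenced affiliation and the contributor without an affiliation vanish from `pairs`, which is why a row count taken from an inner join only counts matching pairs.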

## General information

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89451 entries, 0 to 94659
Data columns (total 39 columns):
addr_city                        89451 non-null object
addr_country                     89451 non-null object
addr_country_code                89451 non-null object
addr_postal_code                 89451 non-null object
addr_state                       89451 non-null object
aff_email                        89451 non-null object
aff_id                           89451 non-null object
aff_text                         89451 non-null object
article_doi                      89451 non-null object
article_publisher_id             89451 non-null object
article_title                    89451 non-null object
contrib_bio                      89451 non-null object
contrib_degrees                  89451 non-null object
contrib_email                    89451 non-null object
contrib_given_names              89451 non-null object
contrib_name                     89451 non-null object
contrib_orc

In [5]:
dataset.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| addr_city | 89451 | 2729 | | 9529 |
| addr_country | 89451 | 267 | Brazil | 48429 |
| addr_country_code | 89451 | 125 | BR | 56596 |
| addr_postal_code | 89451 | 426 | | 87918 |
| addr_state | 89451 | 672 | | 20906 |
| aff_email | 89451 | 10068 | | 74478 |
| aff_id | 89451 | 60 | aff1 | 38355 |
| aff_text | 89451 | 49959 | 1 Hospital Israelita Albert Einstein Hospital ... | 116 |
| article_doi | 89451 | 20551 | 10.1590/1414-431X20122449 | 56 |
| article_publisher_id | 89451 | 5599 | | 64392 |


Many fields are mostly empty and probably won't help at all. Some fields have only a few possible values, e.g.:

In [6]:
dataset["contrib_prefix"].unique().tolist()

['', 'Filho', 'Junior', 'Pe.', 'Prof. Dr.', 'Dra.', 'Profa. Dra.', 'Dr.']

## Contrib type

In [7]:
dataset.groupby("contrib_type").size()

contrib_type
                     13
author            89403
author; author        2
autor                 2
editor               27
organizer             2
translator            2
dtype: int64

- The `author; author` value is due to a contrib with two affiliations;
- The only typo seems to be *autor* instead of *author*; that field was probably written in Portuguese;
- By far, most contributors are authors;
- The valid contrib types are at least `["author", "editor", "organizer", "translator"]`, but [the tag documentation](http://docs.scielo.org/projects/scielo-publishing-schema/pt_BR/1.8-branch/tagset/elemento-contrib.html) tells us that at least `"compiler"` is missing from this list. Contribs of these other types should include `<author-notes>` and `<fn>`, which would give us more information.
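
One possible cleanup of these values can be sketched with the group sizes shown above: split multi-valued entries on `;`, collapse repeats, and map the Portuguese spelling to the English one. The `ALIASES` table and the `normalize_contrib_type` helper are hypothetical names introduced here, not part of Clea or of this notebook.

```python
from collections import Counter

# Group sizes copied from the groupby output above
raw_counts = {
    "": 13, "author": 89403, "author; author": 2, "autor": 2,
    "editor": 27, "organizer": 2, "translator": 2,
}

# Hypothetical alias table: known misspellings/translations -> canonical type
ALIASES = {"autor": "author"}

def normalize_contrib_type(value):
    """Return the canonical contrib type(s) of a raw value, joined by "; "."""
    canonical = []
    for part in value.split(";"):
        part = part.strip().lower()
        part = ALIASES.get(part, part)
        if part and part not in canonical:  # drop blanks and repeats, keep order
            canonical.append(part)
    return "; ".join(canonical)

clean_counts = Counter()
for value, count in raw_counts.items():
    clean_counts[normalize_contrib_type(value)] += count

print(dict(clean_counts))
```

Note that collapsing `author; author` into `author` discards the hint that the contrib had two affiliations, so this is only reasonable when the contrib type itself is what matters.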

A few rows are missing the contrib type. How can we fill it in? The easiest approach would be to simply say these are *author*, since our "*prior*" leans towards it: $99.95\%$ of the input rows have this contrib type. In other words, given how unbalanced this specific dataset is, they're probably authors, and as a matter of probability the best blind guess is indeed "author". But we can't reasonably say anything about the contrib type of a row unless we at least evaluate that row (i.e., information specific to the document), so we need to find another approach.

In [8]:
print("%.2f%%" % (89403 / 89451 * 100)) # "author" frequency (prior)

99.95%
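
The same prior can be derived from the group sizes instead of hard-coded totals. This is a minimal sketch that copies the counts from the `groupby` output above; `counts`, `priors` and `best_guess` are names chosen for illustration.

```python
# Group sizes copied from the contrib_type groupby output
counts = {
    "": 13, "author": 89403, "author; author": 2, "autor": 2,
    "editor": 27, "organizer": 2, "translator": 2,
}

total = sum(counts.values())
# Relative frequency of each raw value, i.e. the "prior" of a blind guess
priors = {ctype: n / total for ctype, n in counts.items()}
best_guess = max(priors, key=priors.get)

print(best_guess, "%.2f%%" % (priors[best_guess] * 100))
```

The blind guess is the same for every row by construction, which is exactly the limitation discussed above.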


If we're lucky, the same contributor also contributed to another piece of work in the same dataset, so we can at least say what his/her "type of contribution" was elsewhere. That approach might still be wrong: the same person might perform different roles, and there might be homonyms; but it would at least be better than a blind guess.

In [9]:
ct_data = dataset[["contrib_given_names", "contrib_surname", "contrib_type"]]
ct_selector = ct_data["contrib_type"] == ""
ct_data_unfilled = ct_data[ct_selector]
ct_data_unfilled

| | contrib_given_names | contrib_surname | contrib_type |
|---|---|---|---|
| 49700 | Emiko Yoshikawa | Egry | |
| 49701 | Lucia Yasuko Izumi | Nichiata | |
| 86731 | Rafael | Pazinato | |
| 86732 | Vanderlei | Klauck | |
| 86733 | Andréia | Volpato | |
| 86734 | Alexandre | Balzan | |
| 86735 | Julia | Rossett | |
| 86736 | Chrystian Jassanã | Cazarotto | |
| 86737 | Leandro Sâmia | Lopes | |
| 86738 | Julcemar Dias | Kessler | |


These contributor names appear in XML files that are missing the type of contribution. Let's get the "nearest" names (in a Levenshtein distance sense) in the remaining data. But first, we have to "pre-normalize" the names (remove accents and convert to lowercase).
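
As a dependency-free illustration of what this pre-normalization does, here is a sketch based on the standard library's `unicodedata` module. The notebook itself relies on `unidecode` for this step; `pre_normalize` is a hypothetical helper, and the two approaches agree only for characters that decompose into a base letter plus combining marks (unlike `unidecode`, this sketch leaves letters such as `ø` or `ß` untouched).

```python
import unicodedata

def pre_normalize(text):
    """Strip accents (combining marks) and lowercase the text."""
    # NFKD splits "ã" into "a" followed by a combining tilde
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    return without_marks.lower()

for name in ["Andréia", "Chrystian Jassanã", "Leandro Sâmia", "Diego Córdova"]:
    print(name, "->", pre_normalize(name))
```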

In [10]:
from functools import partial
import Levenshtein as lev
from unidecode import unidecode

In [11]:
ct_unknown = ct_data_unfilled.drop(
    columns=["contrib_type"],
).join(
    ct_data_unfilled.drop(columns=["contrib_type"])
                    .applymap(unidecode)
                    .applymap(str.lower),
    rsuffix="_n",
)
ct_unknown.sort_values(by=["contrib_given_names_n", "contrib_surname_n"])

| | contrib_given_names | contrib_surname | contrib_given_names_n | contrib_surname_n |
|---|---|---|---|---|
| 86740 | Aleksandro Schafer Da | Silva | aleksandro schafer da | silva |
| 86734 | Alexandre | Balzan | alexandre | balzan |
| 86741 | Alexandre Alberto | Tonin | alexandre alberto | tonin |
| 86733 | Andréia | Volpato | andreia | volpato |
| 86736 | Chrystian Jassanã | Cazarotto | chrystian jassana | cazarotto |
| 86739 | Diego Córdova | Cucco | diego cordova | cucco |
| 49700 | Emiko Yoshikawa | Egry | emiko yoshikawa | egry |
| 86738 | Julcemar Dias | Kessler | julcemar dias | kessler |
| 86735 | Julia | Rossett | julia | rossett |
| 86737 | Leandro Sâmia | Lopes | leandro samia | lopes |


In [12]:
ct_data_filled = ct_data[~ct_selector]
ct_known = ct_data_filled.join(
    ct_data_filled.applymap(unidecode)
                  .applymap(str.lower),
    rsuffix="_n",
).drop(
    columns=["contrib_type_n"],
)
len(ct_known)

89438

The `ct_known` data frame is quite big. Some of the unknown names appear in it several times:

In [13]:
pd.merge(ct_known, ct_unknown, how="right", on=["contrib_given_names_n", "contrib_surname_n"])

| | contrib_given_names_x | contrib_surname_x | contrib_type | contrib_given_names_n | contrib_surname_n | contrib_given_names_y | contrib_surname_y |
|---|---|---|---|---|---|---|---|
| 0 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 1 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 2 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 3 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 4 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 5 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 6 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 7 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 8 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |
| 9 | Emiko Yoshikawa | Egry | author | emiko yoshikawa | egry | Emiko Yoshikawa | Egry |


Now we have more specific information: most of these contributors have been authors elsewhere, and nothing else. That's way better than just saying "most contributors overall are authors, so my guess is that this I-haven't-read-the-name-yet one should be an author as well". This time, we're looking at the specifics, not assigning a single class to every entry.
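
One way to turn such exact matches into a per-person guess is a simple majority vote over the contribution types found elsewhere. The sketch below is only an illustration under that assumption: `vote_contrib_type` is a hypothetical helper, the first ten rows mimic the matches shown above, and the "joao exemplo" rows are invented to show a contributor with mixed roles.

```python
from collections import Counter, defaultdict

def vote_contrib_type(matches):
    """Map each (given names, surname) key to (most common type, its share)."""
    votes = defaultdict(Counter)
    for given_names, surname, contrib_type in matches:
        votes[(given_names, surname)][contrib_type] += 1
    result = {}
    for key, counter in votes.items():
        contrib_type, count = counter.most_common(1)[0]
        result[key] = (contrib_type, count / sum(counter.values()))
    return result

matches = [("emiko yoshikawa", "egry", "author")] * 10 + [
    # Invented rows: the same normalized name seen with two different roles
    ("joao", "exemplo", "author"),
    ("joao", "exemplo", "editor"),
    ("joao", "exemplo", "author"),
]
guesses = vote_contrib_type(matches)
for key, (contrib_type, share) in guesses.items():
    print(key, contrib_type, round(share, 2))
```

The share gives a rough idea of how consistent the evidence is; homonyms would still be counted as the same person here.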

The last entry (Lucia) doesn't match anyone. Are similar names meaningful?
These are the distinct known names, excluding the exact matches:

In [14]:
ct_known_remain = ct_known[
    ~(ct_known["contrib_given_names_n"].isin(ct_unknown["contrib_given_names_n"]) &
      ct_known["contrib_surname_n"].isin(ct_unknown["contrib_surname_n"]))
].drop_duplicates()
len(ct_known_remain)

69614
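
A caveat about the filter above: the two `isin` tests are evaluated per column, so a known row is dropped whenever its given names match *any* unknown contributor and its surname matches *any* unknown contributor, even if no single unknown contributor has both. A pairwise alternative is an anti-join through `merge(..., indicator=True)`. The toy sketch below compares the two; the `known`/`unknown` frames are invented and only reuse the notebook's column names.

```python
import pandas as pd

keys = ["contrib_given_names_n", "contrib_surname_n"]

# Invented data: "alexandre silva" is NOT one of the unknown contributors
known = pd.DataFrame({
    "contrib_given_names_n": ["alexandre", "alexandre", "maria"],
    "contrib_surname_n": ["balzan", "silva", "souza"],
})
unknown = pd.DataFrame({
    "contrib_given_names_n": ["alexandre", "aleksandro schafer da"],
    "contrib_surname_n": ["balzan", "silva"],
})

# Column-wise filter, same shape as the notebook cell above
columnwise = known[
    ~(known["contrib_given_names_n"].isin(unknown["contrib_given_names_n"]) &
      known["contrib_surname_n"].isin(unknown["contrib_surname_n"]))
]

# Pairwise anti-join: drop only rows whose (given names, surname) pair matches
merged = known.merge(unknown[keys].drop_duplicates(),
                     on=keys, how="left", indicator=True)
pairwise = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(len(columnwise), len(pairwise))
```

In this toy case the column-wise filter also discards "alexandre silva", while the anti-join keeps it. Whether this changes anything on the real dataset has not been checked here.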

Let's find the 3 nearest non-equal names for each of those unknowns.
This way of finding the nearest names might be slow, but it's enough for an experiment:

In [15]:
from functools import lru_cache

In [16]:
cached_lev_dist = lru_cache(None)(lev.distance)

In [17]:
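def ct_series_apply(series):
    """Add the per-field and the combined name distances to a merged row."""
    # The "_x" fields hold one name of the pair, the "_y" fields hold the other
    series["dist_given_names"] = cached_lev_dist(series["contrib_given_names_n_x"],
                                                 series["contrib_given_names_n_y"])
    series["dist_surname"] = cached_lev_dist(series["contrib_surname_n_x"],
                                             series["contrib_surname_n_y"])
    # Each distance is normalized by the length of the "_x" field only
    series["dist"] = series["dist_given_names"] / len(series["contrib_given_names_n_x"]) \
                   + series["dist_surname"] / len(series["contrib_surname_n_x"])
    return series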

The distance measure is governed by these assumptions:

- The given-name and the surname have the same weight;
- The per-field Levenshtein distance should be normalized in order to compare inputs with different lengths;
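- The overall value starts at 0 (equal names) and reaches 2 when every single character has to be modified; since each distance is divided by the length of the `_x` field only, a longer `_y` name can push it above 2.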
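
The measure can be checked by hand with a small, dependency-free sketch. The `levenshtein` function below is a plain dynamic-programming implementation written for illustration (the notebook uses the `Levenshtein` package instead), and `name_distance` is a hypothetical helper that mirrors the formula in `ct_series_apply`.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion from a
                curr[j - 1] + 1,                   # insertion into a
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def name_distance(name_x, name_y):
    """Sum of per-field distances, each divided by the length of the x field."""
    (given_x, surname_x), (given_y, surname_y) = name_x, name_y
    return (levenshtein(given_x, given_y) / len(given_x)
            + levenshtein(surname_x, surname_y) / len(surname_x))

# Same given name, surname 3 edits away: 0/9 + 3/6
print(name_distance(("alexandre", "balzan"), ("alexandre", "palma")))  # 0.5
# A much longer y name: 7/2 + 2/2, above the nominal upper bound of 2
print(name_distance(("al", "li"), ("alexandre", "lima")))  # 4.5
```

The second call shows that, because only the x lengths appear in the denominators, the value is not strictly bounded by 2 when the y name is longer.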

In [18]:
ct_nearest = pd.merge(ct_unknown.assign(_unused=1),
                      ct_known_remain.assign(_unused=1),
                      on="_unused") \
               .drop(columns=["_unused"]) \
               .apply(ct_series_apply, axis=1) \
               .groupby(["contrib_given_names_x", "contrib_surname_x"]) \
               .apply(lambda grp: grp.sort_values("dist")
                                     .loc[:, ["contrib_given_names_y",
                                               "contrib_surname_y",
                                               "contrib_type",
                                               "dist_given_names",
                                               "dist_surname",
                                               "dist"]]
                                     .iloc[:3])
ct_nearest

| contrib_given_names_x | contrib_surname_x | | contrib_given_names_y | contrib_surname_y | contrib_type | dist_given_names | dist_surname | dist |
|---|---|---|---|---|---|---|---|---|
| Aleksandro Schafer Da | Silva | 799338 | Alessandro Salles da | Silva | author | 6 | 0 | 0.285714 |
| Aleksandro Schafer Da | Silva | 798416 | Alexsandro O. da | Silva | author | 8 | 0 | 0.380952 |
| Aleksandro Schafer Da | Silva | 783035 | Alexsandro Oliveira da | Silva | author | 8 | 0 | 0.380952 |
| Alexandre | Balzan | 365704 | Alexandre | Palma | author | 0 | 3 | 0.5 |
| Alexandre | Balzan | 353244 | Alexandre | Uhlein | author | 0 | 4 | 0.666667 |
| Alexandre | Balzan | 405684 | Alexandre | Mazzanti | author | 0 | 4 | 0.666667 |
| Alexandre Alberto | Tonin | 842266 | Rosane Baldiga | Tonin | author | 12 | 0 | 0.705882 |
| Alexandre Alberto | Tonin | 847430 | Alexandre Penna | Torini | author | 6 | 2 | 0.752941 |
| Alexandre Alberto | Tonin | 870064 | Paulo César | Tonin | author | 14 | 0 | 0.823529 |
| Andréia | Volpato | 282623 | Andréa | Falcão | author | 1 | 4 | 0.714286 |


Most of these names refer to distinct people, and none of them looks like a typographical error. Besides typos, it's pretty hard to find a meaningful relationship using this Levenshtein distance approach. On the other hand, we've found a name that seems to match Lucia Yasuko Izumi Nichiata, with Yasuko shortened to "Y.". If she's really an author elsewhere in the dataset, then everyone with missing information seems to be an author. Actually, every contributor that appeared in the examples above is an author.

We can *feel* that `Lucia Y. Izumi` and `Lucia Yasuko Izumi` are similar and probably the same person, but it doesn't seem obvious how to automate that.

Another distance approach would be to use the full names split into words, with different weights for things like:

- number of chars
- leading/trailing char
- set of chars difference
- amount of chars difference
- ...
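
As a very rough starting point for such a word-based comparison, the sketch below treats a one-letter token (an initial, with or without a trailing dot) as compatible with any word starting with that letter. Both helpers are hypothetical, the rule is an assumption made for illustration rather than a tested matching strategy, and it ignores reordered, missing or misspelled words.

```python
def token_matches(token_a, token_b):
    """True if the tokens are equal or one is an initial of the other."""
    token_a, token_b = token_a.rstrip("."), token_b.rstrip(".")
    if token_a == token_b:
        return True
    short, long_ = sorted((token_a, token_b), key=len)
    return len(short) == 1 and long_.startswith(short)

def names_compatible(name_a, name_b):
    """Compare two names word by word, allowing abbreviated words."""
    tokens_a, tokens_b = name_a.lower().split(), name_b.lower().split()
    return len(tokens_a) == len(tokens_b) and all(
        token_matches(a, b) for a, b in zip(tokens_a, tokens_b)
    )

print(names_compatible("Lucia Y. Izumi", "Lucia Yasuko Izumi"))
print(names_compatible("Lucia Y. Izumi", "Lucia Maria Izumi"))
```

A real measure would probably combine a rule like this with the character-level weights listed above, since initials alone make many distinct people look compatible.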