This repository contains analyses and experiments aimed at normalizing/cleaning the SciELO data: finding and fixing unclean/inconsistent values in their raw format, as well as other similar issues, mainly in the fields regarding affiliations.
Contents of this repository ordered by creation date:
| Date | Description | Link |
|------|-------------|------|
| 2018-04-05 | Grabbing article `<aff>` and `<country>` data with BeautifulSoup 4 | Notebook |
| 2018-04-19 | Article XML parsing with ElementTree/libxml2/lxml, using XPath/XSLT | Notebook / XML pack |
| 2018-04-26 | Creating a table with data from `<aff>`-`<contrib>` pairs (front matter) in 25 XML files using lxml | Notebook / CSV |
| 2018-05-03 | Loading/cleaning/analyzing a table of manually normalized data, including a DBSCAN clustering model for the institution name | Notebook / Raw manual CSV / Manual CSV |
| 2018-05-10 | Looking for alternatives to the CSS/XPath/XSLT based XML parsing: xmltodict on article XML and fuzzy regex on custom paths | Notebook |
| 2018-05-17 | Getting tags that look like `<article-id>`, `<aff>` and `<contrib>` using fuzzy regex / Levenshtein distance | Notebook |
| 2018-06-04 | CSV generation with Clea | Notebook / File list / CSV |
| 2018-06-07 | Analysis of the contrib_type field from Clea's CSV output | Notebook |
| 2018-06-14 to 2018-07-05 | Country analysis of Clea's CSV output using graphs (NetworkX), including a substantial analysis of alternative libraries for country normalization/cleaning in Python/R/Ruby, resulting in a taxonomy/classification of techniques (exact match, regex, fuzzy, graphs) | Notebook |
| 2018-07-05 | Analysis of the country in the manual normalization CSV data using graphs | Notebook |
| 2018-07-12 | Creation of a CrossRef fetching script for all articles in an article_doi CSV column, due to the presence of several empty DOI/PID fields | Notebook / Script |
| 2018-07-23 | Matching and normalizing PID/DOI using Crossref data, along with a first experiment based on SciELO's "XML debug" API to get the current article PID from its older PID | Notebook / Script |
| 2018-07-26 | Crunching/crawling data from SciELO's search engine and the XML debug API, looking for a specific DOI / PID | Notebook |
| 2018-08-02 to 2018-08-16 | Normalizing the USP institutions' orgname (faculty name) and orgdiv1 (department name) fields filled in Brazilian Portuguese | Notebook |
| 2018-08-09 | Summarization of the affiliations report from SciELO Analytics | Notebook / Summary |
| 2018-08-23 to 2018-11-14 | Latent Semantic Analysis (LSA) on the CSV data for predicting the country code, using k-Means, k-NN and random forest | Notebook |
| 2018-11-22 to 2019-03-08 | Experiments with word2vec to find the country code from a single string having the merged information of an affiliation-contributor pair | Notebook / Example / Dump Dictionary / Dump W2V 200 / Dump W2V 1000 |
| 2018-12-06 to 2018-12-13 | Looking for articles' PIDs from USP/UNESP/UNICAMP (SciELO Brazil) by analyzing the distinct values that appear as the institution name | Notebook / XLSX |
| 2019-01-10 to 2019-02-21 | Looking for articles from EMBRAPA and public state universities in SP (USP/UNESP/Unicamp) in the entire SciELO Network by analyzing the institution name, country, state and city, as well as the graph of authors and institutions | Notebook / XLSX |
| 2019-05-13 to 2019-06-05 | Analysis of the trained "W2V 200" model using other XML files | Notebook / List of training files / Script requirements / Script / W2V 200 results CSV |
| 2019-08-15 | Number of days until the first access burst | Notebook |
| 2019-08-21 | Analyzing accesses of a single journal with Ratchet and ArticleMeta | Notebook |
| 2019-11-14 onwards | Applying FastText directly on ISIS ISO data | Notebook / ISO files |
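Several of the entries above revolve around extracting `<aff>`/`<contrib>` pairs from the XML front matter. A minimal sketch of that kind of extraction using the standard library's ElementTree (the sample document below is a hypothetical JATS-like fragment, not an actual SciELO article):

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like front matter standing in for a real SciELO article.
XML = """
<article>
  <front>
    <article-meta>
      <aff id="aff1">
        <institution>Universidade de Sao Paulo</institution>
        <country>Brazil</country>
      </aff>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Silva</surname></name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""

root = ET.fromstring(XML)

# Index each <aff> country by the @id that <xref rid="..."> points at.
affs = {aff.get("id"): aff.findtext("country")
        for aff in root.iterfind(".//aff")}

# Pair each author with the country of the affiliation they reference.
pairs = []
for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
    xref = contrib.find("xref[@ref-type='aff']")
    country = affs.get(xref.get("rid")) if xref is not None else None
    pairs.append((contrib.findtext("name/surname"), country))

print(pairs)  # [('Silva', 'Brazil')]
```

The notebooks listed above use lxml (which adds full XPath/XSLT support) and fuzzy matching to cope with malformed tags; this sketch only covers the well-formed case.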
List of files that aren't stored in this repository:
- Dataset of manually normalized data: aff_norm_update.csv (raw), aff_n15.csv (fixed)
- Clea's 2018-06-04 CSV and the XML pack from which it was created: selecao_xml_br.tgz, inner_join_2018-06-04.csv, inner_join_2018-06-04_filenames.txt
- ISIS ISO dump: 2019-11-13_iso200.zip
- Random forest models based on Word2Vec: dictionary_w2v_both.dump, rf_w2v_200.dump, rf_w2v_1000.dump
- Results of applying the rf_w2v_200.dump model: 2019-05_w2v_country.csv
- Country summary CSV based on the reports from SciELO Analytics (2018-06-10): documents_affiliations_country_summary.csv
- XLSX with articles' PIDs based on the reports from SciELO Analytics (2018-12-10): pids_network_2018-12-10_usp_unesp_unicamp_embrapa.xlsx, pids_2018-12-10_usp_unesp_unicamp.xlsx
Packages with old reports from SciELO Analytics on which some experiments were based: