This repository contains analyses and experiments aimed at normalizing/cleaning the SciELO data: finding and fixing unclean/inconsistent values in their raw format, as well as other similar issues, mainly in the fields regarding affiliations.
Contents of this repository ordered by creation date:
| Date | Description | Link |
|------|-------------|------|
| 2018-04-05 | Grabbing article `<aff>` and `<country>` data with BeautifulSoup 4 | Notebook |
| 2018-04-19 | Article XML parsing with ElementTree/libxml2/lxml, using XPath/XSLT | Notebook / XML pack |
| 2018-04-26 | Creating a table with data from `<aff>`-`<contrib>` pairs (front matter) in 25 XML files using lxml | Notebook / CSV |
| 2018-05-03 | Loading/cleaning/analyzing a table of manually normalized data, including a DBSCAN clustering model for the institution name | Notebook / Raw manual CSV / Manual CSV |
| 2018-05-10 | Looking for alternatives to the CSS/XPath/XSLT based XML parsing: xmltodict on article XML and fuzzy regex on custom paths | Notebook |
| 2018-05-17 | Getting tags that look like `<article-id>`, `<aff>` and `<contrib>` using fuzzy regex / Levenshtein distance | Notebook |
| 2018-06-04 | CSV generation with Clea | Notebook / File list / CSV |
| 2018-06-07 | Analysis of the contrib_type field from Clea's CSV output | Notebook |
| 2018-06-14 to 2018-07-05 | Country analysis of Clea's CSV output using graphs (NetworkX), including a substantial analysis of alternative libraries for country normalization/cleaning in Python/R/Ruby, resulting in a taxonomy/classification of techniques (exact match, regex, fuzzy, graphs) | Notebook |
| 2018-07-05 | Analysis of the country in the manual normalization CSV data using graphs | Notebook |
| 2018-07-12 | Creation of a CrossRef fetching script for all articles in an article_doi CSV column, due to the presence of several empty DOI/PID fields | Notebook / Script |
| 2018-07-23 | Matching and normalizing PID/DOI using Crossref data, along with a first experiment based on SciELO's "XML debug" API to get the current article PID from its older PID | Notebook / Script |
| 2018-07-26 | Crunching/crawling data from SciELO's search engine and the XML debug API, looking for a specific DOI / PID | Notebook |
| 2018-08-02 to 2018-08-16 | Normalizing the USP institutions' orgname (faculty name) and orgdiv1 (department name) fields filled in Brazilian Portuguese | Notebook |
| 2018-08-09 | Summarization of the affiliations report from SciELO Analytics | Notebook / Summary |
| 2018-08-23 to 2018-11-14 | Latent Semantic Analysis (LSA) on the CSV data for predicting the country code, using k-Means, k-NN and random forest | Notebook |
| 2018-11-22 to 2019-03-08 | Experiments with word2vec to find the country code from a single string having the merged information of an affiliation-contributor pair | Notebook / Example / Dump Dictionary / Dump W2V 200 / Dump W2V 1000 |
| 2018-12-06 to 2018-12-13 | Looking for articles' PIDs from USP/UNESP/UNICAMP (SciELO Brazil) by analyzing the distinct values that appear as the institution name | Notebook / XLSX |
| 2019-01-10 to 2019-02-21 | Looking for articles from EMBRAPA and public state universities in SP (USP/UNESP/Unicamp) in the entire SciELO Network by analyzing the institution name, country, state and city, as well as the graph of authors and institutions | Notebook / XLSX |
| 2019-05-13 to 2019-06-05 | Analysis of the trained "W2V 200" model using other XML files | Notebook / List of training files / Script requirements / Script / W2V 200 results CSV |
| 2019-08-15 | Number of days until the first access burst | Notebook |
| 2019-08-21 | Analyzing accesses of a single journal with Ratchet and ArticleMeta | Notebook |
| 2019-11-14 onwards | Applying FastText directly on ISIS ISO data | Notebook / ISO files |
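Several of the entries above revolve around extracting `<aff>`/`<contrib>` pairs from the XML front matter. A minimal sketch of that kind of extraction using the standard library's ElementTree (the sample document below is a hypothetical JATS-like fragment, not an actual SciELO article):

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like front matter standing in for a real SciELO article.
XML = """
<article>
  <front>
    <article-meta>
      <aff id="aff1">
        <institution>Universidade de Sao Paulo</institution>
        <country>Brazil</country>
      </aff>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Silva</surname></name>
          <xref ref-type="aff" rid="aff1"/>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""

root = ET.fromstring(XML)

# Index each <aff> country by the @id that <xref rid="..."> points at.
affs = {aff.get("id"): aff.findtext("country")
        for aff in root.iterfind(".//aff")}

# Pair each author with the country of the affiliation they reference.
pairs = []
for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
    xref = contrib.find("xref[@ref-type='aff']")
    country = affs.get(xref.get("rid")) if xref is not None else None
    pairs.append((contrib.findtext("name/surname"), country))

print(pairs)  # [('Silva', 'Brazil')]
```

The notebooks listed above use lxml (which adds full XPath/XSLT support) and fuzzy matching to cope with malformed tags; this sketch only covers the well-formed case.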
List of files that aren't stored in this repository:
- Dataset of manually normalized data: aff_norm_update.csv (raw), aff_n15.csv (fixed)
- Clea's 2018-06-04 CSV and the XML pack from which it was created: selecao_xml_br.tgz, inner_join_2018-06-04.csv, inner_join_2018-06-04_filenames.txt
- ISIS ISO dump: 2019-11-13_iso200.zip
- Random forest models based on Word2Vec: dictionary_w2v_both.dump, rf_w2v_200.dump, rf_w2v_1000.dump
- Results of applying the rf_w2v_200.dump model: 2019-05_w2v_country.csv
- Country summary CSV based on the reports from SciELO Analytics (2018-06-10): documents_affiliations_country_summary.csv
- XLSX with articles' PIDs based on the reports from SciELO Analytics (2018-12-10): pids_network_2018-12-10_usp_unesp_unicamp_embrapa.xlsx, pids_2018-12-10_usp_unesp_unicamp.xlsx
Packages with old reports from SciELO Analytics on which some experiments were based: