scieloorg/normalizations-experiments

Data normalization/cleaning experiments

This repository contains analyses and experiments aimed at normalizing/cleaning the SciELO data: finding and fixing unclean or inconsistent values in the raw data, as well as other similar issues, mainly in the fields related to affiliations.
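As a rough illustration of what "fixing inconsistent values" can mean for an affiliation field, the sketch below normalizes a free-text country name to a two-letter code. It is not taken from any notebook in this repository: the `COUNTRY_CODES` table, the `fold` and `normalize_country` helpers, and the `0.8` cutoff are all assumptions made for the example, using only the Python standard library.

```python
import difflib
import unicodedata

# Hypothetical lookup table; a real one would cover far more spellings.
COUNTRY_CODES = {
    "brazil": "BR",
    "brasil": "BR",
    "chile": "CL",
    "colombia": "CO",
    "mexico": "MX",
}


def fold(text):
    """Strip accents, collapse whitespace and casefold the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def normalize_country(raw, cutoff=0.8):
    """Return a country code for a raw value, or None if nothing is close enough."""
    key = fold(raw)
    if key in COUNTRY_CODES:  # exact match after folding
        return COUNTRY_CODES[key]
    # Fuzzy fallback based on difflib's similarity ratio.
    close = difflib.get_close_matches(key, list(COUNTRY_CODES), n=1, cutoff=cutoff)
    return COUNTRY_CODES[close[0]] if close else None
```

Calling `normalize_country("  BRASIL ")` or `normalize_country("México")` resolves through the exact-match branch after folding, while a misspelling such as `"Brazill"` falls back to the fuzzy branch. Values with no close candidate return `None` so that they can be reviewed manually instead of being guessed.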

Contents of this repository ordered by creation date:

| Date | Description | Link |
|---|---|---|
| 2018-04-05 | Grabbing article `<aff>` and `<country>` data with BeautifulSoup 4 | Notebook |
| 2018-04-19 | Article XML parsing with ElementTree/libxml2/lxml, using XPath/XSLT | Notebook / XML pack |
| 2018-04-26 | Creating a table with data from `<aff>`-`<contrib>` pairs (front matter) in 25 XML files using lxml | Notebook / CSV |
| 2018-05-03 | Loading/cleaning/analyzing a table of manually normalized data, including a DBSCAN clustering model for the institution name | Notebook / Raw manual CSV / Manual CSV |
| 2018-05-10 | Looking for alternatives to the CSS/XPath/XSLT based XML parsing: xmltodict on article XML and fuzzy regex on custom paths | Notebook |
| 2018-05-17 | Getting tags that look like `<article-id>`, `<aff>` and `<contrib>` using fuzzy regex / Levenshtein distance | Notebook |
| 2018-06-04 | CSV generation with Clea | Notebook / File list / CSV |
| 2018-06-07 | Analysis of the `contrib_type` field from Clea's CSV output | Notebook |
| 2018-06-14 to 2018-07-05 | Country analysis of Clea's CSV output using graphs (NetworkX), including a substantial analysis of alternative libraries for country normalization/cleaning in Python/R/Ruby, resulting in a taxonomy/classification of techniques (exact match, regex, fuzzy, graphs) | Notebook |
| 2018-07-05 | Analysis of the country in the manual normalization CSV data using graphs | Notebook |
| 2018-07-12 | Creation of a Crossref fetching script for all articles in an `article_doi` CSV column, motivated by the presence of several empty DOI/PID fields | Notebook / Script |
| 2018-07-23 | Matching and normalizing PID/DOI using Crossref data, plus a first experiment based on SciELO's "XML debug" API to get the current article PID from its older PID | Notebook / Script |
| 2018-07-26 | Crunching/crawling data from SciELO's search engine and the XML debug API, looking for a specific DOI/PID | Notebook |
| 2018-08-02 to 2018-08-16 | Normalizing the `orgname` (faculty name) and `orgdiv1` (department name) fields of USP institutions, filled in Brazilian Portuguese | Notebook |
| 2018-08-09 | Summarization of the affiliations report from SciELO Analytics | Notebook / Summary |
| 2018-08-23 to 2018-11-14 | Latent Semantic Analysis (LSA) on the CSV data for predicting the country code, using k-Means, k-NN and random forest | Notebook |
| 2018-11-22 to 2019-03-08 | Experiments with word2vec to find the country code from a single string having the merged information of an affiliation-contributor pair | Notebook / Example / Dump Dictionary / Dump W2V 200 / Dump W2V 1000 |
| 2018-12-06 to 2018-12-13 | Looking for articles' PIDs from USP/UNESP/UNICAMP (SciELO Brazil) by analyzing the distinct values that appear as the institution name | Notebook / XLSX |
| 2019-01-10 to 2019-02-21 | Looking for articles from EMBRAPA and public state universities in SP (USP/UNESP/UNICAMP) in the entire SciELO Network by analyzing the institution name, country, state and city, as well as the graph of authors and institutions | Notebook / XLSX |
| 2019-05-13 to 2019-06-05 | Analysis of the trained "W2V 200" model using other XML files | Notebook / List of training files / Script requirements / Script / W2V 200 results CSV |
| 2019-08-15 | Number of days until the first access burst | Notebook |
| 2019-08-21 | Analyzing accesses of a single journal with Ratchet and ArticleMeta | Notebook |
| 2019-11-14 onwards | Applying FastText directly on ISIS ISO data | Notebook / ISO files |
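Several of the experiments listed above start by pulling `<aff>` and `<country>` values out of article XML. A minimal sketch of that step, using only the standard library's ElementTree and its limited XPath support, could look like the following. The `SAMPLE_XML` document and the `extract_affiliations` helper are hypothetical and heavily simplified; real SciELO article XML has more structure, and the notebooks may do this differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article front matter.
SAMPLE_XML = """\
<article>
  <front>
    <article-meta>
      <aff id="aff1">
        <institution content-type="orgname">Universidade de São Paulo</institution>
        <country>Brasil</country>
      </aff>
      <aff id="aff2">
        <institution content-type="orgname">Universidad de Chile</institution>
      </aff>
    </article-meta>
  </front>
</article>
"""


def extract_affiliations(xml_text):
    """Return one dict per <aff> element with its id, orgname and country."""
    root = ET.fromstring(xml_text)
    rows = []
    for aff in root.iterfind(".//aff"):
        rows.append({
            "id": aff.get("id"),
            # findtext returns None when the child element is missing.
            "orgname": aff.findtext("institution[@content-type='orgname']"),
            "country": aff.findtext("country"),
        })
    return rows


for row in extract_affiliations(SAMPLE_XML):
    print(row)
```

Missing elements are kept as `None` rather than dropped, since empty or absent country fields are exactly the kind of inconsistency these experiments set out to find.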

List of files that aren't stored in this repository:

Packages with old reports from SciELO Analytics on which some experiment was based:

About

Exploratory experiments on authors' affiliation data.
