name: rmetadata
layout: post
title: Scholarly metadata from R
date: 2012-09-15
author: Scott Chamberlain
tags:
- open access
Metadata! Metadata is very cool. It's super hot right now - everybody is talking about it. Okay, maybe not everyone, but it's an important part of archiving scholarly work.
We are working on a GitHub repo, rmetadata, meant to be a one-stop shop for querying metadata from around the web. Various repos we have started on GitHub (rpmc, rdatacite, rdryad, rpensoft, rhindawi) will, at least in part, be folded into rmetadata.
As a start, we are writing functions to hit any metadata service that uses the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) framework.
OAI-PMH has six methods (or verbs, as they are called) for data harvesting that are the same across different metadata providers:
- GetRecord
- Identify
- ListIdentifiers
- ListMetadataFormats
- ListRecords
- ListSets
The Open Archives Initiative maintains a regularly updated list of registered data providers, which we can easily use to get the base URLs for their data. Then we just use one of the six methods above to query their metadata.
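To make the verb mechanics a bit more concrete: an OAI-PMH request is an HTTP request to a provider's base URL, with a `verb` parameter plus any verb-specific arguments. Here is a minimal sketch in base R that does not depend on rmetadata. The `oai_url` helper and the `example.org` base URL are hypothetical, made up for illustration only.

```r
# The six OAI-PMH verbs
oai_verbs <- c("GetRecord", "Identify", "ListIdentifiers",
               "ListMetadataFormats", "ListRecords", "ListSets")

# Hypothetical helper (not part of rmetadata): build a request URL
# from a base URL, a verb, and optional named arguments.
oai_url <- function(base_url, verb, ...) {
  if (!verb %in% oai_verbs) {
    stop("Unknown OAI-PMH verb: ", verb)
  }
  args <- c(list(verb = verb), list(...))
  # Percent-encode each value so characters such as ":" and "/" are safe
  values <- vapply(
    args,
    function(x) URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(args), "=", values, collapse = "&"))
}

# Placeholder base URL, an assumption for this example
base <- "https://example.org/oai"

oai_url(base, "Identify")
oai_url(base, "GetRecord",
        identifier = "oai:example.org:123",
        metadataPrefix = "oai_dc")
```

Real base URLs would come from the provider list mentioned above. Fetching and parsing the XML response is left out here.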
Install rmetadata first.
# install_github() comes from the devtools package
library(devtools)
install_github('rmetadata', 'ropensci')
library(rmetadata)
The most basic thing you can do with OAI-PMH is identify the data provider, getting their basic information. The Identify verb does this, via the md_identify function.
# one provider
md_identify(provider = "datacite")

# many providers
md_identify(provider = c("datacite","pensoft"))

# no match for one, two matches for other
md_identify(provider = c("harvard", "journal"))

# let's pick one from the second
md_identify(provider = "Hrcak")
There are a variety of metadata formats, depending on the data provider. List them with the ListMetadataFormats verb.
# List metadata formats for a provider
md_listmetadataformats(provider = "dryad")

# List metadata formats for a specific identifier for a provider
md_listmetadataformats(provider = "pensoft", identifier = "10.3897/zookeys.1.10")
The ListRecords verb is used to harvest records from a repository.
head( md_listrecords(provider = "datacite")[][,2:4] )
ListIdentifiers is an abbreviated form of
ListRecords, retrieving only headers rather than records.
# Single provider
md_listidentifiers(provider = "datacite", set = "REFQUALITY")[][1:10]
md_listidentifiers(provider = "dryad", from = "2012-07-15")[][1:10]

# Many providers
library(plyr) # provides llply()
out <- md_listidentifiers(provider = c("datacite","pensoft"), from = "2012-08-21")
llply(out, function(x) x[1:10]) # display just a few of them
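The `from` and `set` arguments used above correspond to the protocol's selective-harvesting parameters (`from`, `until`, `set`), and list requests also carry a `metadataPrefix`. Below is a rough sketch, in base R, of how such a query string might be assembled. `harvest_query` is a hypothetical helper, not part of rmetadata, and nothing here is meant to describe how rmetadata builds its requests internally.

```r
# Hypothetical helper (not part of rmetadata): assemble the query string
# for a selective ListIdentifiers or ListRecords request.
harvest_query <- function(verb = c("ListIdentifiers", "ListRecords"),
                          metadata_prefix = "oai_dc",
                          from = NULL, until = NULL, set = NULL) {
  verb <- match.arg(verb)
  args <- list(verb = verb, metadataPrefix = metadata_prefix,
               from = from, until = until, set = set)
  # Drop the optional arguments that were not supplied
  args <- Filter(Negate(is.null), args)
  # This sketch only accepts day-granularity dates (YYYY-MM-DD)
  for (nm in intersect(c("from", "until"), names(args))) {
    if (!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", args[[nm]])) {
      stop("'", nm, "' must look like YYYY-MM-DD")
    }
  }
  # Values are pasted as-is; a fuller version would percent-encode them
  paste(names(args), unlist(args), sep = "=", collapse = "&")
}

harvest_query("ListIdentifiers", from = "2012-07-15")
harvest_query("ListIdentifiers", set = "REFQUALITY")
```

The resulting string would be appended to a provider's base URL after a `?`. Arguments left as `NULL` are simply omitted from the request.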
With ListSets you can retrieve the set structure of a repository.
# arXiv, returns a data.frame
head( md_listsets(provider = "arXiv")[] )

# many providers, returns a list
md_listsets(provider = c("pensoft","arXiv"))
Retrieve an individual metadata record from a repository using the GetRecord verb.
# Single provider, one identifier
md_getrecord(provider = "pensoft", identifier = "10.3897/zookeys.1.10")

# Single provider, multiple identifiers
md_getrecord(provider = "pensoft", identifier = c("10.3897/zookeys.1.10","10.3897/zookeys.4.57"))
Cool, so I hope people find this post and package useful. Let me know what you think in the comments below, or if you have code-specific comments or additions, go to the GitHub repo for rmetadata. In an upcoming post I will show an example of what you can do with rmetadata in terms of an actual research question.