Data Package in R
Data Package is a standard format for describing metadata for a collection of datasets. The datapkg package provides convenience functions for retrieving and parsing data packages in R. To install in R:
    library(devtools)
    install_github("hadley/readr")
    install_github("ropenscilabs/jsonvalidate")
    install_github("ropenscilabs/datapkg")
The datapkg_read function retrieves and parses data packages from local or remote sources. A few example packages are available from the datasets and testsuite-py repositories. The path must point to a directory on disk, a git remote, or a URL that contains the root of the data package.
    # Load client
    library(datapkg)

    # Clone via git
    cities <- datapkg_read("git://github.com/datasets/world-cities")

    # Same data but download over http
    cities <- datapkg_read("https://raw.githubusercontent.com/datasets/world-cities/master")
The output object contains data and metadata from the data package, with the actual datasets stored in the data element.
    # Package info
    print(cities)

    # Open the first dataset in the RStudio Viewer
    View(cities$data[[1]])
In the case of multiple datasets, each one is either referenced by index or, if available, by name (names are optional in data packages).
    # Package with many datasets
    euribor <- datapkg_read("https://raw.githubusercontent.com/datasets/euribor/master")

    # List datasets in this package
    names(euribor$data)

    # Open the first dataset in the RStudio Viewer
    View(euribor$data[[1]])
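The difference between index and name lookup can be tried without network access. The sketch below uses a plain list of data frames as a hypothetical stand-in for the data element of a parsed package; the names first and second are made up for illustration and do not come from any real data package.

```r
# Hypothetical stand-in for the data element of a parsed data package:
# a named list of data frames.
pkg_data <- list(
  first  = data.frame(x = 1:3),
  second = data.frame(y = c("a", "b"))
)

# By index: works whether or not the datasets are named
by_index <- pkg_data[[2]]

# By name: only possible when the package supplies names
by_name <- pkg_data[["second"]]

# Both lookups return the same data frame
identical(by_index, by_name)

# When names are absent, names() returns NULL and only the index is usable
unnamed <- unname(pkg_data)
is.null(names(unnamed))
```

Code that has to handle arbitrary packages is probably safest using the index, since names are optional.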
The package also has basic functionality to save a data frame into a data package and update the datapackage.json file accordingly.
    # Create new data package
    pkgdir <- tempfile()
    datapkg_write(mtcars, path = pkgdir)
    datapkg_write(iris, path = pkgdir)

    # Read it back
    mypkg <- datapkg_read(pkgdir)
    print(mypkg$data$mtcars)
From here you can modify the datapackage.json file with other metadata.
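One possible way to make such a change from R is with the jsonlite package, which is an assumption here rather than something datapkg prescribes. To keep the sketch self-contained it writes a minimal, hand-made datapackage.json instead of calling datapkg_write, so the file contents and the added title and description values are illustrative only.

```r
library(jsonlite)

# Stand-in for a directory created by datapkg_write(): a minimal,
# hand-written descriptor (a real one will contain more fields).
pkgdir <- tempfile()
dir.create(pkgdir)
json_path <- file.path(pkgdir, "datapackage.json")
writeLines(
  '{"name": "mypkg", "resources": [{"name": "mtcars", "path": "data/mtcars.csv"}]}',
  json_path
)

# Read the descriptor as nested lists so its structure is preserved
meta <- fromJSON(json_path, simplifyVector = FALSE)

# Add or overwrite top-level metadata fields (illustrative values)
meta$title <- "Example package"
meta$description <- "Datasets exported from R for demonstration."

# Write the descriptor back; auto_unbox keeps single values as JSON scalars
writeLines(toJSON(meta, auto_unbox = TRUE, pretty = TRUE), json_path)

# Check the result
updated <- fromJSON(json_path, simplifyVector = FALSE)
str(updated)
```

Reading with simplifyVector = FALSE avoids jsonlite turning the resources array into a data frame, which makes the round trip easier to reason about.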
This package is a work in progress. Current open issues:
- Parsing of 0/1 values for booleans: PR#406
- Support "year only" dates (%Y). Not sure if this actually constitutes a valid date: PR#407
- R and readr require specifying which strings are interpreted as missing values. The defaults are the empty string "" and NA. A similar property needs to be defined in the spec.
- It is unclear what to do with parsing errors, or with cases where the fields in datapackage.json do not match the csv data. Examples: s-and-p-500 and currency-codes
- Writing data packages from data frames.
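The missing-value and boolean issues listed above can be illustrated in base R. This is only a sketch: it uses read.csv and its na.strings argument as a stand-in for readr (whose na argument plays a similar role), and the inline CSV and the extra "-" marker are invented for the example.

```r
# Invented CSV: an empty name, an empty score, a literal NA flag,
# and "-" used as an additional missing-value marker.
csv_text <- "id,name,flag,score
1,a,1,3.5
2,,0,
3,c,NA,-
"

# The caller has to say which strings mean "missing"; nothing in the
# data itself declares that "-" should be treated as NA.
df <- read.csv(
  text = csv_text,
  na.strings = c("", "NA", "-"),
  stringsAsFactors = FALSE
)

# The 0/1 column is read as integer; turning it into a boolean
# is an explicit extra step.
df$flag <- as.logical(df$flag)

str(df)
```

If the list of missing-value strings were part of the package metadata, the na.strings vector could be taken from the descriptor instead of being hard-coded.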