rdataretriever
R interface to the Data Retriever.
The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and loads or stores them in a variety of databases or flat file formats. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.
This package lets you work with the Data Retriever (written in Python) using R, so that the Retriever's data handling can easily be integrated into R workflows.
Table of Contents
- Installation
- Installing Tabular Datasets
- Installing Spatial Datasets
- Using Docker Containers
- Provenance
- Acknowledgements
Installation
The rdataretriever is an R wrapper for the Python package, Data Retriever. This means
that Python and the retriever Python package need to be installed first.
Basic Installation
If you just want to use the Data Retriever from within R, run the following commands in R. This will create a local Python installation that is used only by R and install the needed Python package for you.
install.packages('reticulate') # Install R package for interacting with Python
reticulate::install_miniconda() # Install Python
reticulate::py_install('retriever') # Install the Python retriever package
install.packages('rdataretriever') # Install the R package for running the retriever
rdataretriever::get_updates() # Update the available datasets
After running these commands restart R.
Advanced Installation for Python Users
If you are using Python for other tasks you can use rdataretriever with your
existing Python installation (though the basic installation
above will also work in this case by creating a separate miniconda install and
Python environment).
Install the retriever Python package
Install the retriever Python package into your preferred Python environment
using either conda (64-bit conda is required):
conda install -c conda-forge retriever
or pip:
pip install retriever
Select the Python environment to use in R
rdataretriever will try to find Python environments with retriever installed
(see the reticulate documentation on order of discovery for more details).
Alternatively, you can select a Python environment to use when working with
rdataretriever (and other packages using reticulate).
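To check ahead of time whether a given interpreter has the retriever package, one option is to try the import from the shell. This is a minimal sketch, not part of rdataretriever itself. It assumes the interpreter is either the one named in RETICULATE_PYTHON or the python3 found on PATH.

```shell
# Sketch: test whether an interpreter can import the retriever package.
# Assumption: use RETICULATE_PYTHON if set, otherwise python3 from PATH.
PYTHON_BIN="${RETICULATE_PYTHON:-$(command -v python3 || echo python3)}"

if "$PYTHON_BIN" -c "import retriever" >/dev/null 2>&1; then
  retriever_status="available"
else
  retriever_status="missing"
fi

echo "retriever is $retriever_status in $PYTHON_BIN"
```

If the status is "missing", install retriever into that environment or point R at a different one.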
The most robust way to do this is to set the RETICULATE_PYTHON environment
variable to point to the preferred Python executable:
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")
This command can be run interactively. To make the setting permanent, add the line RETICULATE_PYTHON=/path/to/python to .Renviron in your home directory.
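Entries in .Renviron are plain NAME=value lines rather than R calls. The following is a minimal shell sketch for adding the variable there. It assumes the python3 on PATH is the interpreter that has retriever installed. RENVIRON_FILE is a hypothetical helper variable used only in this example.

```shell
# Sketch: persist RETICULATE_PYTHON for future R sessions.
# Assumption: python3 on PATH is the interpreter with retriever installed;
# otherwise the placeholder path is written and must be edited by hand.
PYTHON_BIN="$(command -v python3 || echo /path/to/python)"
RENVIRON_FILE="${RENVIRON_FILE:-$HOME/.Renviron}"

# .Renviron takes NAME=value lines, not R code.
touch "$RENVIRON_FILE"
if ! grep -q '^RETICULATE_PYTHON=' "$RENVIRON_FILE"; then
  echo "RETICULATE_PYTHON=$PYTHON_BIN" >> "$RENVIRON_FILE"
fi
```

The entry is only appended when no RETICULATE_PYTHON line exists yet, so running the sketch twice does not duplicate it. Restart R afterwards so the new value is read.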
Alternatively you can select the Python environment through the reticulate
package for either conda:
library(reticulate)
use_condaenv('name_of_conda_environment')
or virtualenv:
library(reticulate)
use_virtualenv("path_to_virtualenv_environment")
You can check to see which Python environment is being used with:
py_config()
Install the rdataretriever R package
install.packages("rdataretriever") # latest release from CRAN
devtools::install_github("ropensci/rdataretriever") # development version from GitHub
Installing Tabular Datasets
library(rdataretriever)
# List the datasets available via the Retriever
rdataretriever::datasets()
# Install the portal dataset into CSV files in your working directory
rdataretriever::install_csv('portal')
# Download the raw portal dataset files without any processing to the
# subdirectory named data
rdataretriever::download('portal', './data/')
# Install and load a dataset as a list
portal = rdataretriever::fetch('portal')
names(portal)
head(portal$species)
Installing Spatial Datasets
Set-up and Requirements
Tools
- PostgreSQL with PostGIS, psql (client), raster2pgsql, shp2pgsql, gdal
The rdataretriever supports installation of spatial data into Postgres DBMS.
- Install PostgreSQL and PostGIS
To install PostgreSQL with PostGIS for use with spatial data, please refer to the OSGeo Postgres installation instructions.
We recommend storing your PostgreSQL login information in a .pgpass file to avoid supplying the password every time. See the .pgpass documentation for more details.
After installation, make sure you have the paths to these tools added to your system's PATH. Please consult an operating system expert for help on how to change or add the PATH variables.
For example, this could be a sample of paths exported on Mac:
# ~/.bash_profile file, Postgres PATHS and tools.
export PATH="/Applications/Postgres.app/Contents/MacOS/bin:${PATH}"
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/10/bin"
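After editing PATH, a quick way to confirm the change is to look each tool up from a new shell. This is a minimal sketch. The tool names come from the list above, except gdalinfo, which is used here as a stand-in for a GDAL command-line installation.

```shell
# Sketch: report which of the required command-line tools are on PATH.
# gdalinfo is assumed to be a reasonable indicator that GDAL is installed.
missing=0
for tool in psql raster2pgsql shp2pgsql gdalinfo; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool -> $(command -v "$tool")"
  else
    echo "missing: $tool"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing from PATH"
```

Any tool reported as missing is either not installed or lives in a directory that has not been added to PATH yet.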
- Enable PostGIS extensions
If you have Postgres set up, enable the PostGIS extensions by running the following commands in either the Postgres CLI (psql) or the GUI (PgAdmin).
For psql CLI
psql -d yourdatabase -c "CREATE EXTENSION postgis;"
psql -d yourdatabase -c "CREATE EXTENSION postgis_topology;"
For GUI (PgAdmin)
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;
For more details refer to the PostGIS docs.
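The psql commands above prompt for a password unless a password file is in place. A minimal sketch of creating one follows. Every connection value in it is a placeholder rather than a value taken from this README. The file format is one hostname:port:database:username:password line per connection.

```shell
# Sketch: add a .pgpass entry so psql and related tools stop prompting.
# Host, port, database, user and password below are placeholders.
PGPASS_FILE="${PGPASSFILE:-$HOME/.pgpass}"
PGPASS_ENTRY='localhost:5432:yourdatabase:postgres:your_password'

touch "$PGPASS_FILE"
# Append the entry only if this exact line is not already present.
grep -qxF "$PGPASS_ENTRY" "$PGPASS_FILE" || echo "$PGPASS_ENTRY" >> "$PGPASS_FILE"

# On Unix the file is ignored unless only its owner can read and write it.
chmod 600 "$PGPASS_FILE"
```

The password is stored in plain text, so the restrictive permissions matter. Replace the placeholder values before relying on the file.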
Sample commands
rdataretriever::install_postgres('harvard-forest') # Vector data
rdataretriever::install_postgres('bioclim') # Raster data
# Install only the data of USGS elevation in the given extent
rdataretriever::install_postgres('usgs-elevation', list(-94.98704597353938, 39.027001800158615, -94.3599408119917, 40.69577051867074))
Provenance
rdataretriever allows users to save a dataset in its current state so that it can be used later.
Note: You can save your datasets in a provenance directory by setting the environment variable PROVENANCE_DIR.
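One way to set the variable is in the shell that later starts R, so the R session inherits it. This is a minimal sketch. The directory name is a hypothetical example, and it is an assumption that the variable needs to be set before the package is loaded.

```shell
# Sketch: choose a provenance directory before starting R.
# The directory name below is a hypothetical example.
export PROVENANCE_DIR="$HOME/retriever-provenance"
mkdir -p "$PROVENANCE_DIR"
echo "Provenance directory: $PROVENANCE_DIR"
```

Adding the export line to a shell startup file such as ~/.bash_profile keeps the setting across sessions.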
Commit a dataset
rdataretriever::commit('abalone-age', commit_message='Sample commit', path='/home/user/')
To commit directly to the provenance directory:
rdataretriever::commit('abalone-age', commit_message='Sample commit')
Log of committed dataset in provenance directory
rdataretriever::commit_log('abalone-age')
Install a committed dataset
rdataretriever::install_sqlite('abalone-age-a76e77.zip')
Datasets stored in the provenance directory can be installed directly using the hash value:
rdataretriever::install_sqlite('abalone-age', hash_value='a76e77')
Using Docker Containers
To run the image interactively
docker-compose run --service-ports rdata /bin/bash
To run tests
docker-compose run rdata Rscript load_and_test.R
Release
Make sure you have tests passing on R-oldrelease, current R-release and R-devel.
To check the package
R CMD build # build the package
R CMD check --as-cran --no-manual rdataretriever_[version].tar.gz
To Test
setwd("./rdataretriever") # Set working directory
# install all deps
# install.packages("reticulate")
library(DBI)
library(RPostgreSQL)
library(RSQLite)
library(reticulate)
library(RMariaDB)
install.packages(".", repos = NULL, type="source")
roxygen2::roxygenise()
devtools::test()
To get citation information for the rdataretriever in R use citation(package = 'rdataretriever')
Acknowledgements
A big thanks to Ben Morris for helping to develop the Data Retriever. Thanks to the rOpenSci team with special thanks to Gavin Simpson, Scott Chamberlain, and Karthik Ram who gave helpful advice and fostered the development of this R package. Development of this software was funded by the National Science Foundation as part of a CAREER award to Ethan White.
