Simple Features for R
Supporting authors: Edzer Pebesma, Roger Bivand, Michael Sumner, Robert Hijmans, Virgilio Gómez-Rubio
GDAL is an open source C++ library for reading and writing both raster and vector data with more than 225 drivers (supported file formats, data base connectors, web service interfaces). GDAL is used by practically all open source geospatial projects and by many industry products (including ESRI's ArcGIS, ERDAS, and FME). It provides coordinate transformations (built on top of PROJ.4) and geometric operations (e.g. polygon intersections, unions, buffers and distance). Standards for coordinate transformations change over time; such changes are typically adopted directly in GDAL/PROJ.4 but do not easily find their way into R-only packages such as
Since 2005, CRAN has carried package sp, which provides classes and methods for spatial (point, line, polygon and raster) data. The approach sp takes is similar to how xts and zoo handle the time index of time series data: objects store spatial geometries separately from associated attribute data, matching by order. Package spacetime, on CRAN since 2010, extends both sp and xts to handle data that vary over both space and time.
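The "matching by order" design can be sketched in a few lines of base R. This is only an illustration of the idea, not the actual sp class definitions; the names `pts` and `subset_points` and all values are made up for the example.

```r
# Geometries and attributes live in separate containers and are
# matched purely by position. Illustrative only; not the sp classes.
coords <- matrix(c(0, 0,
                   1, 5,
                   2, 3),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("x", "y")))
attrs <- data.frame(name  = c("a", "b", "c"),
                    value = c(10, 20, 30),
                    stringsAsFactors = FALSE)

pts <- list(coords = coords, data = attrs)

# Row i of the attribute table describes geometry i.
stopifnot(nrow(pts$coords) == nrow(pts$data))

# Any subsetting has to be applied to both parts with the same index,
# otherwise geometries and attributes fall out of step.
subset_points <- function(p, i) {
  list(coords = p$coords[i, , drop = FALSE],
       data   = p$data[i, , drop = FALSE])
}

big <- subset_points(pts, pts$data$value != 20)
```

The cost of this layout is that every operation touching one part must remember to touch the other in the same way.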
Today, 221 CRAN packages depend on, import or link to sp, or 259 when Suggests are included; with recursive dependencies these numbers rise to 376 and 5040. The implementation of sp does not follow simple features, but rather the practice used at the time of release, following how ESRI shapefiles are implemented. The cluster of packages around sp is shown in green in Andrie de Vries' blog on CRAN's network structure.
The off-CRAN package rgdal2 is an interface to GDAL 2.0 that uses raw pointers to interface features but does not import any data into R, using GDAL to handle everything. CRAN package wkb, contributed by Tibco Software, converts between WKB representations of several simple feature classes and corresponding classes in sp, and seems to be needed for Tibco software purposes.
The problems we will solve are:
- R currently cannot represent simple features directly. It can read most simple feature classes in sp classes, but uses its own representation for this, and can only write data back without loss of information if it is furnished with ancillary metadata encoded in a comment attribute to each Polygons object. It does not, for instance, distinguish internally between POLYGON and MULTIPOLYGON, nor deal with several simple feature classes, including GEOMETRYCOLLECTION, nor handle
- The current implementation of lines and vector data in package sp is partly ambiguous (both slot
hole indicate whether a Polygon is a hole, but are superseded by the comment attribute), complicated (to which exterior polygon does a hole belong? This is handled by the comment attribute), and by some considered difficult to work with (S4). The current implementation is hard to maintain because it contains incremental changes from a baseline that predated the industry-standard OGC/ISO Simple Feature Interface Specification.
- The lack of support for simple features makes current interfaces to open source libraries (GDAL/OGR and PROJ.4: rgdal, GEOS: rgeos) difficult to understand and maintain, even though they work to specification.
- The current implementation has no scale model for coordinates.
- It is desirable that other R packages are offered the opportunity to migrate to more up-to-date libraries for coordinate transformations (providing proper support for datum transformation), and to avoid having to make simplifying assumptions (e.g., all spatial data come as longitude/latitude using datum WGS84; all web maps use web Mercator).
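To make the hole-ownership point concrete, here is a minimal base R sketch of how a nested-list layout can carry that information in the structure itself. It assumes the simple features convention that a polygon is a list of rings whose first ring is the exterior; the class names and the `n_holes` helper are hypothetical and do not belong to any existing package.

```r
# A ring is a closed coordinate matrix (first row equals last row).
ring <- function(...) matrix(c(...), ncol = 2, byrow = TRUE)

outer <- ring(0,0, 10,0, 10,10, 0,10, 0,0)
hole  <- ring(2,2, 2,4, 4,4, 4,2, 2,2)
other <- ring(20,0, 30,0, 30,10, 20,10, 20,0)

# POLYGON: a list of rings; the first is the exterior, the rest its holes.
poly <- structure(list(outer, hole),
                  class = c("POLYGON", "geom_sketch"))

# MULTIPOLYGON: a list of polygons, each itself a list of rings,
# so every hole sits next to the exterior ring it belongs to.
mpoly <- structure(list(list(outer, hole), list(other)),
                   class = c("MULTIPOLYGON", "geom_sketch"))

# Hypothetical helper: count holes per polygon via S3 dispatch.
n_holes <- function(x) UseMethod("n_holes")
n_holes.POLYGON <- function(x) length(x) - 1L
n_holes.MULTIPOLYGON <- function(x) {
  vapply(unclass(x), function(p) length(p) - 1L, integer(1))
}

n_holes(poly)
n_holes(mpoly)
```

With this nesting, neither a hole flag nor a side-channel comment attribute is needed to tell which exterior ring a hole belongs to, and the class distinguishes a single polygon from a multi-part one.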
In the longer run it will affect users of all packages currently reusing sp classes, when we manage to migrate sp to exclusively use the simple feature classes for representing vector data. Since the recent 2.0 release of GDAL integrates raster and vector data, having an R package that mirrors its classes makes it possible to implement operations in-database (similar to what
dplyr do), making it possible for R to manipulate spatial data that do not fit in memory.
Big Data analysis with R often proceeds by connecting R to a database that holds the data. All commonly used commercial and open source databases store spatial point, line and polygon data in the form of simple features. Representing simple features in R will simplify big data analysis for spatial data.
We want to solve the problem by carrying out the following steps (M1 refers to month 1):
- develop an R package that implements simple features in R, that is simple yet gives users access to the complete data, and includes an S3 representation that extends
- add to this package a C++ interface to GDAL 2.0, to read and write simple feature data, and to interface other functionality (coordinate transformation, geometry operations) (M3-8)
- develop and prototypically implement a migration path for sp to become compliant with simple features (M7-12)
- write user-oriented tutorial vignettes showing how to use it with files, database connections, web APIs, leaflet, ggmap, dplyr and so on (M7-10)
- write a tutorial vignette for R package writers reusing the package (M10)
- collect and process community feedback (M6-12).
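As a rough illustration of what the first step could look like, the sketch below stores one geometry per row in a list column of an ordinary data frame, using base R only. The constructors `point` and `linestring` and the class names are hypothetical placeholders, not the API of the planned package.

```r
# Hypothetical constructors: a geometry is plain data plus an S3 class.
point <- function(x, y) {
  structure(c(x, y), class = c("POINT", "geom_sketch"))
}
linestring <- function(m) {
  structure(m, class = c("LINESTRING", "geom_sketch"))
}

geoms <- list(
  point(1, 2),
  linestring(matrix(c(0, 0, 1, 1, 2, 0), ncol = 2, byrow = TRUE)),
  point(3, 4)
)

df <- data.frame(id = 1:3, name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
df$geometry <- geoms   # list column: one geometry per row

# Ordinary data frame subsetting keeps each geometry with its row.
sel <- df[df$id != 2, ]
geom_types <- vapply(sel$geometry, function(g) class(g)[1], character(1))
```

Because attributes and geometries share one table, the usual row operations keep them aligned without any order-based bookkeeping.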
Failure modes and recovery plan:
Failure mode: S3 classes are too simple to represent the simple features class hierarchy. Recovery plan: try (i) using a list column with geometry, and nested lists to represent nested structures; (ii) using a WKT character column; (iii) using a
sp breaks downstream packages. Recovery plan: consult Roger Bivand, Barry Rowlingson, Robert Hijmans (raster) and Tim Keitt (rgdal2) on how to proceed; be patient and smooth out problems together with package maintainers.
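Recovery option (ii), a WKT character column, can be sketched as follows. The helpers `ring_to_wkt` and `polygon_to_wkt` are hypothetical and cover only the polygon case; a real implementation would need every geometry type and a parser for the reverse direction.

```r
# Encode one ring (a two-column coordinate matrix) as "(x y, x y, ...)".
ring_to_wkt <- function(m) {
  paste0("(", paste(m[, 1], m[, 2], collapse = ", "), ")")
}

# Encode a polygon given as a list of rings (exterior first, then holes).
polygon_to_wkt <- function(rings) {
  body <- paste(vapply(rings, ring_to_wkt, character(1)), collapse = ", ")
  paste0("POLYGON (", body, ")")
}

outer <- matrix(c(0,0, 4,0, 4,4, 0,4, 0,0), ncol = 2, byrow = TRUE)
hole  <- matrix(c(1,1, 1,2, 2,2, 2,1, 1,1), ncol = 2, byrow = TRUE)

# The geometry becomes an ordinary character column.
df <- data.frame(id = 1L, stringsAsFactors = FALSE)
df$wkt <- polygon_to_wkt(list(outer, hole))
```

A character column works with any data frame tooling, but every geometric operation then has to parse the text first, and converting coordinates to text can round them.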
How can the ISC help
The following table contains the cost items.
| Cost item | Amount |
|---|---|
| Employ a student assistant for one year (10 hrs/week) | € 6500 |
| One-week visit of Roger Bivand to the Inst. for Geoinformatics | € 1000 |
| Present the results at UseR! 2016 | € 1500 |
| Total | € 9000 (9750 USD) |
The visit of Roger is anticipated halfway through the project; further communication will take place over Skype. The project has a planned duration of 12 months.
Development will take place on GitHub; information will be shared, and reactions and contributions invited, through r-sig-geo as well as StackOverflow and GIS StackExchange. The project will use an Apache 2.0 license for maximum dissemination (similar to GDAL, which uses X/MIT). The work will be published in four quarterly blog posts, announced on r-sig-geo (3300 subscribers), and intermediate results will be presented at UseR! 2016. The final result will be published in a paper submitted either to The R Journal or to the Journal of Statistical Software; this paper will be available before publication as a package vignette.
UseR! 2016 slides are found here.