Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st_read and requiring layer name #39

Closed
rundel opened this issue Oct 28, 2016 · 44 comments
Closed

st_read and requiring layer name #39

rundel opened this issue Oct 28, 2016 · 44 comments

Comments

@rundel
Copy link
Contributor

rundel commented Oct 28, 2016

This is a small pet peeve of mine with regard to rgdal - is there a compelling reason to continue to force the user to provide the layer name when reading in data?

There isn't a technical reason within gdal to require this and in my experience most geometry files I've worked with only ever have one layer. It seems to me a more sensible approach is to make the layer name an optional argument and in the case of a single layer to just automatically return that object.

If more than one layer are present then the current error could be thrown, or alternatively throw a warning and return a list containing all layers.

@edzer
Copy link
Member

edzer commented Oct 29, 2016

Nice idea, would improve usability! What would be a good approach to try to derive layer name from a single string: remove the file name extension and possibly path from the file name?

@rundel
Copy link
Contributor Author

rundel commented Oct 29, 2016

I'm not sure, my use cases have almost always been geojson and shapefiles so I don't have direct experience with the more exotic vector types and their usage of layer names.

GDAL peculiarities mean that both of the following work without issue

nc <- st_read(system.file("shape/nc.shp", package="sf"), "nc", crs = 4267)
nc <- st_read(system.file("shape/", package="sf"), "nc", crs = 4267)

so I'm not sure if just looking at the path will be helpful.

The example I have the most issue with has always been with reading geojson where the user is expected to know that the layer is called OGRGeoJSON - that has always been a pain point when my students are first learning this stuff.

@Robinlovelace
Copy link
Contributor

I agree with the changes suggested by @rundel. That will increase usability lots. It was a godsend when I found out about raster::shapefile() and I think st_read() would benefit from working in the same or a similar way.

I think it's easiest to just point it at the file automatically deciding the driver/layer based on the extension. This is what rio does and I think sf should too: https://cran.r-project.org/web/packages/rio/vignettes/rio.html

@edzer
Copy link
Member

edzer commented Oct 30, 2016

So these might be some heuristics then: if st_read is called with a single argument, it must be an existing file, and

  1. if the file name ends in .shp, we assume it is a shapefile (with full path), and call the layer the basename (i.e. extension and path stripped) of the file name
  2. Geopackages end in .gpkg (see req 3, so use the same trick as for shapefiles;
  3. Other file types with fixed extension insert here
  4. other file names, we assume will be GeoJSON, and layer will be called OGRGeoJSON (see here)

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Oct 30, 2016

Some other common extensions that I think would be good for sf to recognise:

.geojson
.osm
.gml
.kml
.kmz

I think raster already does a good job of handling the multiple raster formats.

@edzer
Copy link
Member

edzer commented Oct 30, 2016

Examples of using st_read on .geojson .osm, .gml, .kml and .kmz would be helpful!

@rsbivand
Copy link
Member

Please be careful to separate out st_read for dummies and the real st_read, which shouldn't make assumptions. A typical problem is with the shapefile driver which falls over in a dsn=directory with a given layer=, but where the directory contains a *.dbf without being part of a shapefile. Robert's use of shapefile as a general term for vector data is very unhelpful. The border between sensible simplification and going too far is very real. Something not (yet) in the frame is encoding, which is often very unpleasant (even CP1252 is a problem), and far from everyone is UTF-8 yet. So either do a wrapper for st_read which handles frequent cases, and explain that more often than we like, real data is messy and needs the underlying complications in the bare interface, or just provide the bare interface and enough guidance. Fixing GeoJSON for the current odd OGR naming may change as soon as the driver gets updated - there are reasons for exposing users to the actual complications, because it makes derailments their responsibility.

@Robinlovelace
Copy link
Contributor

Example of a 5 Mb file I created - cycling potential on major roads in London:

u = "https://github.com/npct/pct-data/raw/master/london/rnet.geojson"
download.file(url = u, destfile = "/tmp/rnet.geojson")
rnet = sf::st_read("/tmp/rnet.geojson") # fails
rnet = sf::st_read(dsn = "/tmp/rnet.geojson", layer = "OGRGeoJSON") # works

I take Roger's point about precision and robustness vs ease of use and like the idea of having a separate wrapper function for magically guessing, most of the time correctly, how to read in a file. For that I would propose something with a name like st_import(), akin to rio::import().

Would that satisfy all parties?

rnet <- sf::st_import("/tmp/rnet.geojson") 

Would work for me and think it's a function I can even help writing!

@edzer
Copy link
Member

edzer commented Oct 30, 2016

I will not provide two functions. We can give layer a default name where this makes sense, and an error message when it is missing otherwise (e.g. where dsn is a directory, or an extension is missing). Like rgdal, the function responds with printing which layer is being read from which data source, on success, so that is there. Even ogr2ogr doesn't force you to specify layers if they're obvious.

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Oct 30, 2016

Sounds reasonable and can understand desire to minimise number of functions. So, to clarify, that would mean st_read("file.geojson") and st_read("file.shp") would work? I think that would make life easier for lots of people, myself included, especially if I end up teaching sf which I hope will happen based on the impressive rate of development.

@edzer
Copy link
Member

edzer commented Oct 30, 2016

no, because that doesn't point to a file with an extension that tells us what the layer name should be. I just pushed a commit that accepts

 st_read(system.file("shape/nc.shp", package="sf"))

though -- please test (now supports gpkg, geojson, shp).

To be discussed: what will it return as default?

@Robinlovelace
Copy link
Contributor

  • here's an example xml (.osm) file. Not sure if this can be read with rgdal or sf at present. Btw - fast work Edzer! Will aim to test during the week, maybe Weds.
    doc_txt.osm.zip

@Robinlovelace
Copy link
Contributor

Could you clarify what you mean by "that doesn't point to a file with an extension that tells us what the layer name should be" - not understanding 100%.

@edzer
Copy link
Member

edzer commented Oct 30, 2016

I meant: literally the command you typed. It has no file extension since the "/" comes after the ".". If you mean, either

st_read(file.geojson")

or

st_read("file.shp")

then yes, but please don't imply meaning in github issues -- I can only read what you write.

Which layer in your .osm? ogrinfo reports 5:

edzer@gin-edzer:/tmp$ ogrinfo doc_txt.osm
Had to open data source read-only.
INFO: Open of `doc_txt.osm'
      using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)

@edzer
Copy link
Member

edzer commented Oct 30, 2016

library(sf)
o1 = st_read("doc_txt.osm", "points")
o2 = st_read("doc_txt.osm", "lines")
o3 = st_read("doc_txt.osm", "multilinestrings")
o4 = st_read("doc_txt.osm", "multipolygons")
o5 = st_read("doc_txt.osm", "other_relations")

plot(o1)
plot(o2, add = TRUE, col = 'red')
plot(o3, add = TRUE, col = 'blue')
plot(o4, add = TRUE, col = 'green')
plot(o5, add = TRUE, col = 'orange')

(the last two are empty)

@Robinlovelace
Copy link
Contributor

Many thanks.

@edzer
Copy link
Member

edzer commented Oct 31, 2016

I changed strategy somewhat. In case no layer is specified, st_read now checks with GDAL: if there is just one layer, that layer is read; if there is more than one, the list with layers is returned. This avoids the step of having to guess from extensions etc:

> st_read("PG:dbname=postgis")
Driver: PostgreSQL 
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point

@Robinlovelace
Copy link
Contributor

Very cool - works a treat, as tested on the osm data:

# devtools::install_github("edzer/sfr") # latest version
library(sf)

# user: now I want to read in an osm file...
osm_data = st_read(dsn = "/tmp/doc_txt.osm")
# user: great - now I'll plot it...
# plot(osm_data) # fail
# user: what's in it?
osm_data
## Driver: OSM 
##         layer_name       geometry_type
## 1           points               Point
## 2            lines         Line String
## 3 multilinestrings   Multi Line String
## 4    multipolygons       Multi Polygon
## 5  other_relations Geometry Collection
# user: ah ok, now I'll read-in the points
osm_points = st_read(dsn = "/tmp/doc_txt.osm", layer = "points")
## Reading layer points from data source /tmp/doc_txt.osm using driver "OSM"
## features:       932
## fields:         10
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
plot(osm_points) 

I think there should be a message telling the user that no data has been read in when this happens though, e.g. There are multiple layers in this data source. Set the layer argument in st_read to read the desired layer., in the same way that there is a useful message when the data does get read in.

@Nowosad
Copy link
Contributor

Nowosad commented Oct 31, 2016

Yes, I think there should be a message, either:

  1. No layer (what then?)
  2. One layer (read that layer)
  3. Many layers (print a message - should it be a normal message/warning/error?)

@Robinlovelace
Copy link
Contributor

And the message could also say something like

Available layers are:
## Driver: OSM 
##         layer_name       geometry_type
## 1           points               Point
## 2            lines         Line String
## 3 multilinestrings   Multi Line String
## 4    multipolygons       Multi Polygon
## 5  other_relations Geometry Collection

@edzer
Copy link
Member

edzer commented Oct 31, 2016

OK, I now have:

> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> x = st_read("PG:dbname=empty")   # no layers
No layers are present in data source PG:dbname=empty.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL 
Available layers:
<none>
> x = st_read("x.geojson")         # single layer
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features:       3
fields:         2
proj4string:    +proj=longlat +datum=WGS84 +no_defs 
> x = st_read("PG:dbname=postgis") # multiple layers
Multiple layers are present in data source PG:dbname=postgis.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL 
Available layers:
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point
> x = st_read("PG:dbname=empty", quiet = TRUE)   # no layers, quiet
> 

@Robinlovelace
Copy link
Contributor

Looks like you've cracked it - can't think of any way to make that better. Your work will make loading geographic data into R more accessible for thousands of people when this takes off!

@Robinlovelace
Copy link
Contributor

Although from a grammar perspective I don't think the semi colon works, hence ecea519

(I know, I'm painting bike sheds again but someone's got to do it!)

@Robinlovelace
Copy link
Contributor

I'd say job done and this issue can be closed. This issue has led to an improvement in usability and, although you still need to provide a layer name, it's now clear where to find those names. Solved from your perspective @rundel?

@mdsumner
Copy link
Member

mdsumner commented Nov 1, 2016

I'm late to this, but concerned that you can now get a data set from st_read, or an error, or a list with layer names, which will present strange errors downstream when you try to do stuff as if it were a data set. To me it should be an error, or a valid data set (or NULL, maybe if the data source can execute queries).

Hadley calls this "type-stability": https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ (and I'm a massive offender in my own packages, but it's important I think. )

Is there a function to get the layer names without using st_read, like ogrListLayers?

If you consider multiple NetCDF variables in a single file to be analogous to vector layers: I've always liked raster's behaviour to read the first variable in a NetCDF file as the one to read if the "varname" is not specified, with a message about that decision including a list of the available variable names. That way you get a data set without specifying a variable, but also an indication that a choice for the "varname" argument was made for you.

The behaviour would be to have an implicit default value of 1L for "layer" to act as an index into the list of available layers, and to trigger a message with the details in that case. Then an explicit integer for layer could stand in place of a name - sometimes you want to read everything and push it around somewhere else, without knowing what it is.

I'd meant to suggest this behaviour, since it's such a huge obstacle for newcomers who start by knowing nothing about multiple layers, and then (from what I've seen), turn from readOGR to the more obvious tools in maptools and raster to read their vector data.

(All that said, the current behaviour is a huge user-friendly improvement over readOGR, so I'm very glad to see it.)

@Robinlovelace
Copy link
Contributor

Robinlovelace commented Nov 1, 2016

Interesting points @mdsumner. I like the option of providing a NULL output if there are many layers and print the layer names. Then a separate function, e.g. st_listlayers() could actually provide the output. Can see benefit of this in terms of making function 'modular'.

Certainly agree with the general principle of type stability.

@edzer
Copy link
Member

edzer commented Nov 1, 2016

I like type stability. How about this (in case no layer is specified):

  1. reading an empty dataset generates an error
  2. reading a dataset with a single layer simply reads this layer, as above
  3. reading a dataset with multiple layers reads the first layer, prints all layers, and emits a warning (unless quiet = TRUE)
  4. add an option return_layers that, when TRUE returns the list of layers in cases 1, 2 and 3.

@rsbivand
Copy link
Member

rsbivand commented Nov 1, 2016

This looks at least reasonable, assuming one places more importance on user-friendliness than I would (user-friendliness as the opposite of the very beneficial steep learning curve). Should readOGR move to this structure too?

@Robinlovelace
Copy link
Contributor

Sound like good suggestions @edzer. I think it would do no harm for readOGR to shift to this behavior from @rsbivand, can't see any unintended consequences and would be beneficial for people new to spatial data files.

@rundel
Copy link
Contributor Author

rundel commented Nov 1, 2016

Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases.

My only small quibble is including return_layers as an argument with st_read makes the function a bit monolithic and breaks type stability. Why not include an st_info function with ogrinfo like functionality which can then be used for retrieving layer informatio?

@Nowosad
Copy link
Contributor

Nowosad commented Nov 1, 2016

@edzer @rsbivand - what's are the future plans for readOGR? Will st_read and readOGR be used by users at the same time or readOGR will be replaced by st_read? If the second is true - I think is better to leave readOGR as it is - many scripts/tutorials/books are based on the dsn and layer behaviour of readOGR.

@rsbivand
Copy link
Member

rsbivand commented Nov 1, 2016

They are separate. rgdal will go to maintenance mode for vector, but just as maptools hasn't been deprecated, I guess rgdal vector will not be either. I'd just change a missing layer= argument from:
if (missing(layer)) stop("missing layer") to the suggested approach, and similar for ogrInfo().

@edzer
Copy link
Member

edzer commented Nov 1, 2016

OK, I currently see:

> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> try(x <- st_read("PG:dbname=empty")) # no layers: error
Data source PG:dbname=empty contains no layers
Error in eval(substitute(expr), envir, enclos) : 
  Error: no layers in datasource.

> x = st_read("x.geojson")         # single layer: silent
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features:       3
fields:         2
proj4string:    +proj=longlat +datum=WGS84 +no_defs 
> x = st_read("PG:dbname=postgis") # multiple layers: message + warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features:       155
fields:         12
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs 
Warning message:
In eval(substitute(expr), envir, enclos) :
  automatically selected the first layer in a data source containing more than one
> x = st_read("PG:dbname=postgis", quiet = TRUE) # multiple layers, suppress message
Warning message:
In eval(substitute(expr), envir, enclos) :
  automatically selected the first layer in a data source containing more than one
> suppressWarnings(x <- st_read("PG:dbname=postgis")) # multiple layers, suppress warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features:       155
fields:         12
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs 
> 

and

> st_list("PG:dbname=postgis")
Driver: PostgreSQL 
Available layers:
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point

so that should settle this issue.

@Robinlovelace
Copy link
Contributor

Very comprehensive solution and demonstration of the new functionality, that works for me and it seems @rundel who usefully opened this issue is happy. Job done I'd say!

@mdsumner
Copy link
Member

mdsumner commented Nov 1, 2016

Yes, very good indeed. Thanks all (and especially @edzer for the work!)

@rundel
Copy link
Contributor Author

rundel commented Nov 1, 2016

Looks great to me, thanks for all the hard work @edzer

@timelyportfolio
Copy link

timelyportfolio commented Jan 19, 2017

Perhaps this is a local issue or my own ignorance (likely), but as I start to develop on mapedit, I have had a very difficult time easily converting geojson as a list to sf. Here is an example using a drawn polygon returned from Leaflet.Draw.

drawn <- list(structure(list(type = "Feature", properties = structure(list(
    `_leaflet_id` = 102L, feature_type = "rectangle"), .Names = c("_leaflet_id", 
"feature_type")), geometry = structure(list(type = "Polygon", 
    coordinates = list(list(list(10.0854730652645, 48.8285498294904), 
        list(10.0854730652645, 49.393084372626), list(11.123681073077, 
            49.393084372626), list(11.123681073077, 48.8285498294904), 
        list(10.0854730652645, 48.8285498294904)))), .Names = c("type", 
"coordinates"))), .Names = c("type", "properties", "geometry"
)))

library(leaflet)
# draw it with leaflet
leaflet() %>% addTiles() %>% addGeoJSON(drawn)

What would be the correct way to convert drawn to sf? The only thing I can get to work is

library(sf)

st_polygon(
    list(
        matrix(
            unlist(drawn[[1]]$geometry$coordinates),
            byrow=TRUE,
            ncol=2
        )
    )
)

Thanks so much for sf. It is already amazing!

@edzer
Copy link
Member

edzer commented Jan 19, 2017

Yes, this is how it works. I'd have opted for

do.call(rbind, unlist(drawn[[1]]$geometry$coordinates))

but that really doesn't matter. The extra list nesting however suggests that coordinates may have more rings, in which case you need to do more - you'd need to know whether they are holes or other outer rings. @mdsumner might know this area better?

@timelyportfolio
Copy link

timelyportfolio commented Jan 19, 2017

Ok good to know, thanks! As far as I am aware, Leaflet.Draw does not support rings or holes, so I think I am safe in this case. However, I would always appreciate input from @mdsumner. For further reference, the Shiny gadget proof of concept is here.

@ateucher
Copy link
Contributor

ateucher commented Jan 19, 2017

@timelyportfolio I've implemented sf -> geojson (as a list) in geojsonio: https://github.com/ropensci/geojsonio/blob/master/R/geojson_list.R#L294. That may be of help to go backwards. Also, I'd love to have the same method you're working on in geojsonio... 😉

@mdsumner
Copy link
Member

mdsumner commented Jan 19, 2017

@timelyportfolio it will show holes correctly by virtue of the evenodd rule

drawn <-
  list(structure(
    list(
      type = "Feature",
      properties = structure(
        list(`_leaflet_id` = 102L, feature_type = "rectangle"),
        .Names = c("_leaflet_id",
                   "feature_type")
      ),
      geometry = structure(
        list(type = "Polygon",
             coordinates = list(
               list(
               list(10.0854730652645, 48.8285498294904),
               list(10.0854730652645, 49.393084372626),
               list(11.123681073077,
                    49.393084372626),
               list(11.123681073077, 48.8285498294904),
               list(10.0854730652645, 48.8285498294904)
             ), 
             list(
               list(10.3, 48.9), list(10.3, 49.1), list(10.7, 49.1), list(10.7, 48.9), 
               list(10.3, 48.9)))),
        .Names = c("type",
                   "coordinates")
      )
    ),
    .Names = c("type", "properties", "geometry")
  ))

Note that leaflet the R package will get support for true leaflet-MultiPolygon, but it doesn't have it yet.

I'm confused about why addGeoJSON takes an R list rather than a string though, we really need some clarity in this area, and some semantics for transferring between raw formats (like a string, or binary format), their analogous representation within R, and the form needed to drive another tool. Sometimes there are too many steps in the chain.
(I often encounter the idea that "simple features" is the right central form but it's clearly not able to represent everything we need, TopoJSON for example. I am working on the design of that central form but it's still a little early). We don't necessarily need a universal form all the time, but there is a need for guidance on what chained steps make sense in terms of efficiency and preserving all information with minimal redundancy.

And clearly we need better tools for GeoJSON and TopoJSON, rather than hacking these internal forms, we are half-ish way through that with some tools here - I'm not sure how what we are doing meshes with geojsonio, yet. TopoJSON requires partial indexing between shared tables (or structures), so it's a good bridge for what the broader central scheme I've been working towards can provide.

@edzer
Copy link
Member

edzer commented Jan 20, 2017

Maybe this, and linking to @ateucher helps?

@edzer
Copy link
Member

edzer commented Jan 20, 2017

Sorry, I now realize that my previous comment probably doesn't make sense in this context.

@timelyportfolio
Copy link

Thanks for all the input. I will promote to a new issue for geojson as R list to sf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants