Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run SQL query on OGR layers #834

Closed
barryrowlingson opened this issue Aug 30, 2018 · 24 comments
Closed

Run SQL query on OGR layers #834

barryrowlingson opened this issue Aug 30, 2018 · 24 comments

Comments

@barryrowlingson
Copy link
Contributor

An SO questioner: https://stackoverflow.com/questions/52086882/issue-with-query-in-st-read/52102296#52102296 wanted to filter shapefiles with an SQL query, as you can do in ogr2ogr for example, but the query parameter is only valid for DBIObject reads in st_read.

I answered by showing how to build a VRT file with the SQL query, which works fine but is a bit of a pain.

Could gdal_read.cpp use the ExecuteSQL method on an optional passed SQL query string? I experimented and had success with putting:

poDS->ExecuteSQL("select * from codes_postaux_region where POP2010 > 20000", NULL, NULL);

on line 225, which successfully reduced my read data from 6000 features to 707. I'm not sure I completely understand what the ExecuteSQL method does, or what the returned OGRLayer is (see https://www.gdal.org/classGDALDataset.html#a5b65948b1e15fa63e96c0640eb6c5d7c) or what the spatial filter or dialect parameters can do (I left them NULL) but its a proof-of-concept. Real code would have to pass the query string through, changing the parameters in the C++ and the R and so on, and my Rcpp is inadequate to get this all right, so I just hardcoded one SQL test string.

I can't see any discussion here about implementing this already, apologies if this has already been excluded as a possible enhancement.

B

@edzer
Copy link
Member

edzer commented Aug 30, 2018

Nice idea! PR welcome.

@jsta
Copy link
Contributor

jsta commented Aug 30, 2018

This issue is very similar to hypertidy/vapour#34. Maybe it would be of use here?

@barryrowlingson
Copy link
Contributor Author

There's a more complete working example in a branch of my fork here: https://github.com/barryrowlingson/sf/tree/execute_sql - I've no idea of the quality so I'm not sure I should hit the PR button. I can do if it helps, but I don't really know what I'm doing with C++...

@edzer
Copy link
Member

edzer commented Aug 31, 2018

Looks OK; I wonder whether the select * from nc cannot be constructed (probably on the R side), since we know that the layer is nc. The real query is the where clause, right? @mdsumner

Also, would it be possible to do selections based on geometries? Maybe not, or at least not portably, since GDAL on MacOS X does not compile against GEOS (where dynamic linking not allowed by CRAN, static not made possible by GDAL devs); @rsbivand ? Nevertheless, it might be nice-to-have on platforms that do support it?

@barryrowlingson
Copy link
Contributor Author

You might not want to select all columns from an OGR data source. I was following the ogr2ogr parameter -sql, but maybe st_read could also have a where= parameter?

I also think my patch needs to release the result set from the ExecuteSQL call, according to the documentation. https://www.gdal.org/classGDALDataset.html#ab2c2b105b8f76a279e6a53b9b4a182e0

@mdsumner
Copy link
Member

In my experience it's better to scan the extents and attributes, and short circuit on a limited number of features or on attribute values - ExecuteSQL is less efficient than that (for OGRSQL). I don't want all this loaded in this one mega package, but I'm happy to share lessons from vapour for those of you that do.

Some layer names cause tricky escaping problems, hence efforts to read FID values directly - I just got stuck on 64-bit int issues for a complete treatment.

@mdsumner
Copy link
Member

Here's my suggestion for sf, let the user take responsibility for what they get from:

  • input sql argument triggers ExecuteSQL in st_read and functions reading bbox, geometry or attributes
  • input limit_n argument short circuits feature iteration
  • add st_read_bbox for GDAL sources (API has a method on features)
  • add read_attributes for GDAL sources
  • consider custom filters tested during feature iteration

That covers a lot of desrable customization IMO, and might help trigger greater PR input. All those ideas are implemented in vapour at least to a proof of concept level. I want them as an independent API, so I'm figuring it out independently.

@mdsumner
Copy link
Member

mdsumner commented Sep 1, 2018

Can st_sfc create a vector of bbox? Probably not, is there anything wrong with that? I'll try

@mdsumner
Copy link
Member

mdsumner commented Sep 1, 2018

Nonstarter, raw vectors of bounds much better. NVM, I'll work on the other stuff.

@edzer
Copy link
Member

edzer commented Sep 1, 2018

demo(nc, ask = FALSE)
 st_as_sfc(st_bbox(nc))

@mdsumner
Copy link
Member

Let's go @barryrowlingson ! Any blocker here? I finally got DBI/dbplyr support to work so I'm keen to have this in sf proper

@edzer
Copy link
Member

edzer commented Sep 26, 2018

I'm planning to release a new sf version soon. Could one of you come up with a PR to do this, preferably with one or two tests?

@mdsumner
Copy link
Member

I'm close, didn't want to stomp on @barryrowlingson here.

There's an issue with the geometry column though, because (I think) CPL_read_ogr is getting the geom from the fields, rather than with poFeature->GetGeometryRef. I haven't tracked down if it can be made to work with GPKG, with for example

## gets no 'geom' so no geometry column
st_read("nc.gpkg", query = "SELECT NAME FROM 'nc.gpkg'")  

whereas

## gets 'geometry column
st_read("nc.shp", query = "SELECT NAME FROM nc")  

I guess I'll push the PR so it can be investigated

@barryrowlingson
Copy link
Contributor Author

barryrowlingson commented Sep 26, 2018 via email

@edzer
Copy link
Member

edzer commented Sep 26, 2018

Thanks! @mdsumner It uses GetGeomFieldRef(i) which according to this should be equivalent when i is 0. It just loops over the geometry columns, of which usually there is only 1.

@mdsumner mdsumner mentioned this issue Sep 26, 2018
@edzer
Copy link
Member

edzer commented Sep 26, 2018

Thanks so much for the fast action!!

@mdsumner
Copy link
Member

No worries, I'm definitely worried about this though - did you notice in the PR comment?

gpkg_layer <- "nc.gpkg"
## we specify  geom
nc_gpkg_sql = st_read(system.file("gpkg/nc.gpkg", package = "sf"),
        query = sprintf("SELECT NAME, SID74, FIPS, geom  FROM '%s' WHERE BIR74 > 20000", gpkg_layer))

## no geom, no sf  (shp doesn't need the select geom)
nc_gpkg_no_geom = st_read(system.file("gpkg/nc.gpkg", package = "sf"),
        query = sprintf("SELECT NAME, SID74, FIPS  FROM '%s' WHERE BIR74 > 20000", gpkg_layer))

@edzer
Copy link
Member

edzer commented Sep 26, 2018

Yes. Can we use layer for %s? By default, this is taken to be basename(dsn):

> basename("gpkg/nc.gpkg")
[1] "nc.gpkg"

@mdsumner
Copy link
Member

mdsumner commented Sep 26, 2018

I think that's peculiar to GPKG, usually it's basename without extension, or arbitrary, multiple, and unrelated to the filename. The layer name doesn't change when the actual geopackage file name is changed, so it must be baked in on creation.

@edzer
Copy link
Member

edzer commented Sep 26, 2018

Ah: sometimes (shape) we need no quotes around layer, other times (gpkg) we do. Is that it?

Yes, for gpkg layer name is baked into the gpkg file, not derived from its name.

@mdsumner
Copy link
Member

I though single quotes might be a safe default, but no - double quotes seem to be:

## fails
st_read(system.file("shape/nc.shp", package="sf"),
        query = "SELECT NAME, SID74, FIPS FROM 'nc' WHERE BIR74 > 20000")

# ok
st_read(system.file("shape/nc.shp", package="sf"),
                            query = "SELECT NAME, SID74, FIPS FROM \"nc\" WHERE BIR74 > 20000")

# ok
st_read(system.file("gpkg/nc.gpkg", package="sf"),
        query = "SELECT NAME, SID74, FIPS, geom FROM 'nc.gpkg' WHERE BIR74 > 20000")

That's definitely driver dependent sometimes, i.e. Access, Manifold and SQL Server afaik will take square brackets [layer], but there's also a difference when OGRSQL can deal with the query versus passing it to the engine.

@edzer
Copy link
Member

edzer commented Sep 27, 2018

This works with the PostGIS driver:

st_read("PG:dbname=postgis", "sids", 
      query = "SELECT NAME, BIR74, FIPS, wkb_geometry FROM \"sids\" WHERE BIR74 > 20000")

@edzer
Copy link
Member

edzer commented Sep 27, 2018

So, we settle with double quotes? I guess this has to go into the docs, as users supply a full query?

@mdsumner
Copy link
Member

mdsumner commented Sep 27, 2018

Yes I don't think it's worth wrapping here, there is just too many possibilities, and DBI has facilities for escaping. The geom-as-field is the major problem, because while it's easy enough to discover interactively 'select * from layer limit 1' that won't work for DBI. But, I think it's not for right now because it will need pretty big changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants