
Feature request: helper function for working with GeoPackage schema extension #4

Open
elipousson opened this issue Oct 13, 2022 · 17 comments


@elipousson
Collaborator

I'm not sure if this is already possible using the layer_options and dataset_options for sf::st_write(), but I'm curious if you've explored the possibility of adding support for the GeoPackage schema extension, ideally allowing column labels, or a data frame of definitions/aliases matched to column names, to serve as the source data for the schema. Happy to spend some time figuring this out if you have any pointers on how to approach it!
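
To make that concrete, here's a purely hypothetical sketch of the kind of interface I have in mind (the function name and signature below are invented for illustration and don't exist anywhere yet):

aliases <- data.frame(
  column_name = c("pop", "area_km2"),
  title = c("Population", "Area (km2)"),
  description = c("Total population", "Area in square kilometers")
)
# Hypothetical helper, purely illustrative:
# write_gpkg_schema("file.gpkg", layer = "municipalities", columns = aliases)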

@elipousson
Collaborator Author

It may also be possible to read the schema information from a GeoPackage file using the options parameter for sf::st_read(). See r-spatial/sf#1157 for more details.
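
For example (untested on my end, since I don't have a file that uses the extension), something along these lines should expose the extension tables through GDAL:

# Untested sketch: list non-spatial tables, then query an extension table
sf::st_layers("example.gpkg", options = "LIST_ALL_TABLES=YES")
sf::st_read("example.gpkg", query = "SELECT * FROM gpkg_data_columns")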

@florisvdh
Member

Thanks for this interesting idea! As you note, it will first need some exploration of the different possible approaches. I've no experience with the schema extension and won't be able to look at it soon, but will be glad to do so. I've added it to my todo list, but of course you're welcome to take this further already or add a pull request.

@elipousson
Collaborator Author

elipousson commented Nov 7, 2022

I don't have a GeoPackage file that uses the Schema or Metadata extensions to test this with, but I've got a working prototype for reading the relevant tables using the RSQLite package:

Update: I developed this a little more fully and stuck the code in this gist. Writing the data may be easier than I expected, although it will take some work to incorporate validation for the schema and metadata specifications. I'm assuming there is some way to do this with gdal_utils() and st_write(), which may be preferable, but I saw this package also used RSQLite, so I figured I wasn't too far off track.

read_gpkg_schema <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_schema",
    c("gpkg_data_columns", "gpkg_data_column_constraints"),
    ...
    )
}

read_gpkg_metadata <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_metadata",
    c("gpkg_metadata", "gpkg_metadata_reference"),
    ...
  )
}

read_gpkg_extension <- function(dsn, extension, table_name = NULL, ...) {
  # Open the GeoPackage as a plain SQLite database; close it on exit
  conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
  on.exit(RSQLite::dbDisconnect(conn))

  check_gpkg_extension(conn, extension, table_name)

  # Return a named list with one data frame per extension table
  purrr::map(
    rlang::set_names(table_name),
    ~ read_gpkg_table(conn = conn, table_name = .x)
  )
}

read_gpkg_table <- function(conn = NULL,
                            dsn = NULL,
                            table_name = NULL) {
  # Accept either an open connection or a path, but not both
  rlang::check_exclusive(conn, dsn)

  if (is.null(conn)) {
    conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
    on.exit(RSQLite::dbDisconnect(conn))
  }

  if (is.null(table_name)) {
    cli::cli_abort("{.arg table_name} must be provided.")
  }

  cli::cli_inform("Accessing the {.val {table_name}} table.")
  RSQLite::dbReadTable(conn, table_name)
}

check_gpkg_extension <- function(conn,
                                 extension = NULL,
                                 table_name = NULL) {
  # Look up the tables registered for this extension in gpkg_extensions
  gpkg_extensions <- RSQLite::dbReadTable(conn, "gpkg_extensions")
  extension_tables <-
    gpkg_extensions[gpkg_extensions$extension_name %in% extension, ]

  has_tables <- table_name %in% extension_tables$table_name

  if (all(has_tables)) {
    return(invisible(NULL))
  }

  cli::cli_abort(
    "{.arg table_name} {.val {table_name[!has_tables]}} can't be found for extension {.val {extension}}."
  )
}
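
An untested usage sketch ("example.gpkg" is just a placeholder for a file that actually registers the schema extension):

# Returns a named list of data frames, one per extension table
schema <- read_gpkg_schema("example.gpkg")
schema[["gpkg_data_columns"]]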

@elipousson
Collaborator Author

For really no good reason, I spent a bunch more time on my fork tonight, both expanding on the examples I shared in my comment this afternoon and refactoring the original functions.

If you're OK with expanding the Imports to include rlang, glue, and cli and moving RSQLite from Suggests to Imports, I'd be glad to open a pull request (or open a separate, more general issue for this proposed refactoring). I'd also love to see the verbose parameter used in the timestamp functions replaced with a quiet parameter for consistency with sf.

I haven't fully figured out how to use parameterized SQL queries but, once I do, I think implementing both the registered extensions (which include the metadata and schema extensions) and the broader list of community extensions is totally doable.
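
The direction I'm exploring for the parameterized queries is DBI's params argument; a sketch (not yet what's on the fork):

conn <- RSQLite::dbConnect(RSQLite::SQLite(), "example.gpkg")
# Look up a single extension by name instead of reading the whole table
RSQLite::dbGetQuery(
  conn,
  "SELECT * FROM gpkg_extensions WHERE extension_name = ?",
  params = list("gpkg_schema")
)
RSQLite::dbDisconnect(conn)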

@elipousson
Collaborator Author

Ah – I did just spot the note in the README that rgeopackage is "not tied to any particular package by design." Let me know if you're set on this – if so, I may continue development on the fork but rename it to avoid any confusion – or if there is any flexibility.

@florisvdh
Member

Thanks!

not tied to any particular package by design

This essentially means that it is not necessarily built on top of one specific framework; hence some heterogeneity between functions is to be expected. So no problem there; there's flexibility!

I will look further into your ideas and come back! Regarding adding dependencies (Imports), I think it depends on how frequently functions from these packages are used inside and across the new functions.

I'd also love to see the verbose parameter used in the timestamp functions be replaced with a quiet parameter for consistency with sf

Good point; feel free to change in a PR!

@florisvdh
Member

@elipousson I've read your gist. Such awesome contributions; this rocks!! They would make the currently modest, tiny package (just toying with timestamps, which I needed once) better reflect its name, i.e. providing various useful functions to interact with GeoPackages. Also, I've been learning a bit more about GPKG extensions, thanks for the links 👍.

As a historical note, rgeopackage was created after discussion in rkrug/rGeoPackage#9. (At the time of writing, its Readme refers to an older repo location of the current rgeopackage; I've made a pull request for that.)

I haven't yet studied the new code in your fork, so I'm still lagging behind 🙈. But I've taken a quick peek; it looks impressive and it's clear to me you take this very seriously ❤️.

I must add I'm very busy these days (months, actually) with non-coding work, which makes it challenging to keep pace with all the exciting developments. Also, at the moment rgeopackage isn't my first priority when it comes to maintenance or adding new features. Given this, and your obvious and most welcome ambition to take the package to 'the next level' and perhaps add more ideas in the future, I'd be glad to pass maintainership to you if you're willing, and go into 'assistance mode' where I can. Which doesn't mean I won't look in detail at your upcoming pull requests; on the contrary 😄. Just please don't expect me to handle them very quickly; I hope I can get to it (or to further inspecting your fork, if not yet a PR) in the course of next week.

Looking forward to your ideas on this!

@florisvdh
Member

florisvdh commented Nov 10, 2022

I'm assuming there is some way to do this with gdal_utils() and st_write() which may be preferable

Did you take a further look at those? It appears that GDAL supports the Schema and Metadata extensions; see https://gdal.org/drivers/vector/gpkg.html#level-of-support-of-geopackage-extensions. It's best to check what can and can't be done with sf in this area, and to document overlapping functionality so that users are aware (similar to the case of preset_timestamp()).
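
Since the level of extension support also depends on which GDAL version sf links against, a quick check from R:

# Report the GDAL version sf was built against
sf::sf_extSoftVersion()["GDAL"]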

@elipousson
Collaborator Author

Thanks for the kind words about my code, @florisvdh, and the invitation to share in the development and stewardship of your package.

I'm going to try to get some additional read and write functionality working on my fork and add more tests before I open a pull request so it may be a week or two more before I have something to review. I mainly want to make sure I don't run into any issues that require significant restructuring (e.g. switching to use GDAL via sf rather than using RSQLite) and I think adding the tests per #5 should make code review easier. I've been doing package development for about two years but mostly working with APIs and sf objects so it is fun and different to learn how to access a GeoPackage file like a database!

@florisvdh
Member

OK, I'll see it when you get there. Feel free to add a branch and work here, but indeed the fork is still easier for experimenting with GH Action workflows that depend on the main branch. Oh yeah, master had better be renamed to main. Shall I do that already?

@elipousson
Collaborator Author

@florisvdh Renaming the branch sounds good to me! I haven't done this before, but I'm assuming there is a way to make a pull request to bring my changes back into a development branch? Then bring them back into main one piece at a time so that I can add tests as I go?

@florisvdh
Member

Hi @elipousson, main has now replaced master. In case you still have a master branch locally that tracks origin/master (with origin referring to r-spatial), you can run git fetch origin --prune, rename it to main (while checked out on master) with git branch -m main, and update its tracking relationship with git branch -u origin/main. See e.g. the output of git branch -vva:

$ git branch -vva
* main                                         66ea0fe [origin/main] Readme: update installation instruction
  remotes/origin/2022-11-15_development-branch 9f23f02 Merge pull request #9 from elipousson/master
  remotes/origin/main                          66ea0fe Readme: update installation instruction

@florisvdh
Member

florisvdh commented Nov 16, 2022

Regarding the pull request strategy: since all work (different features) sits in one branch, there are indeed two options:

  • Proceed working in that branch, postponing a PR until all features are ready for review and merging. Even in this scenario, you can already open the PR on GitHub but mark it as a draft, so that merging is not yet possible but communication is, e.g. to address specific aspects during code preparation.
    • If in this scenario you need something in main beforehand (e.g. to get GH actions in place), you'll need to do that in commits that belong to another branch that branches directly off main.
  • Split the work into several pull requests (different topics), so that parts ready for review can be reviewed and merged into main. To keep things simple (i.e. avoiding git rebase to rearrange commits into branches based directly on main, unless you're comfortable with that), this will require:
    • that the commits that go into a PR are consecutive, with main (or one of its parent commits) as the direct parent of the first commit you want in the PR. So the first PR would be the first X commits of the dev branch. To make such a PR, you need to define an extra branch name at the Xth commit; I can post the git commands if needed (a sketch follows below) or define and push such a topic branch if you wish.
    • a second PR to be created only after the first is merged, and so on; otherwise the second will still contain all commits common to PR 1 and PR 2 as long as PR 1 is not merged. PRs simply cover the difference between two branches.
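
For completeness, the topic-branch commands would be along these lines (<sha-of-Xth-commit> stands in for the actual commit, which I haven't looked up):

git checkout -b pr-1 <sha-of-Xth-commit>   # branch at the last commit meant for PR 1
git push -u origin pr-1                    # then open the PR from pr-1 against main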

The first approach will be the easiest one, but I'm open to both, depending on your preference.

@elipousson
Collaborator Author

The first option definitely sounds easier to me as well. I'll see about pulling in the tests through a separate branch from main so we can take advantage of the GitHub Action-driven checks when I have a more full-featured pull request ready.

@elipousson
Collaborator Author

I've been a little slow to dig back into development on this because I've been thinking it may be better to use GDAL directly via sf::gdal_utils(), gdalUtilities::ogr2ogr() (which is a wrapper for sf::gdal_utils()), or some combination of functions from {vapour} (although it is a read-only version of the API), rather than trying to work directly with the GeoPackage tables via {RSQLite}. I wish it were otherwise, because it means I likely won't be able to use much of the code from my fork, but the prospect of fully implementing the validation standards described in the OGC GeoPackage standard is really more than I can handle. Any thoughts on this change of approach?
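
For example (an untested sketch), ogr2ogr-style operations go through util = "vectortranslate" in sf:

# Untested sketch: copy a layer between GeoPackages via GDAL's ogr2ogr
sf::gdal_utils(
  util = "vectortranslate",
  source = "input.gpkg",
  destination = "output.gpkg",
  options = c("-nln", "renamed_layer")
)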

@elipousson
Collaborator Author

elipousson commented Jan 22, 2023

Ah – on reflection, it may not need to be an either-or proposition. I set up a new development branch and pulled in some of the refactoring from my earlier fork/development branch (which uses {RSQLite}), but also set up a function that uses sf::st_read() to read the extension tables. I'm also looking at the sample SQL queries from the GeoPackage specification as a way of handling validation when adding a new extension to a GeoPackage file.

I'm going to focus on read and write functions for the schema and the metadata extensions and will share an update when I have working examples. It probably won't be a quick project but I think I'm starting to get my head around the options.

@florisvdh
Member

Interesting, @elipousson! It's indeed best to use GDAL as the backend in R where it already covers what you want, to avoid duplication of effort and because it will be maintained to support future GPKG versions. Actually, I wasn't yet aware that this was a change of approach; I'll postpone further reflection to the PR review process 🙈.

GDAL support for aspatial tables in a GeoPackage has been implemented in sf, both reading and writing (r-spatial/sf#1396). You may want to make use of it when programming the features you want to add here. Also, SQL queries can be passed to GDAL with sf::st_read() IIRC.
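
A minimal sketch of what that looks like (file and layer names are just placeholders):

# Write a plain data frame as a non-spatial GeoPackage table, then read it back
df <- data.frame(column_name = "pop", title = "Population")
sf::st_write(df, "example.gpkg", layer = "my_aliases")
sf::st_read("example.gpkg", layer = "my_aliases")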
