skim of sf objects #88

Closed
Nowosad opened this issue Jun 7, 2017 · 25 comments

@Nowosad
Contributor

Nowosad commented Jun 7, 2017

The sf package is the R implementation of Simple Features and is becoming a new standard for working with spatial data in R. More information at https://github.com/edzer/sfr and http://robinlovelace.net/geocompr/spatial-class.html.

The most important element of this package is the sf class. It is a simple data.frame with one additional list-column, which stores the geometry of the data.

I think it would be useful to add the ability to create a summary of sf objects. A summary of the geometry column could return some basic information, such as projection, geometry type, etc.

library(sf)
library(skimr)

nc = st_read(system.file("shape/nc.shp", package="sf"))
nc
nc %>% skim()

Error in .f(.x[[i]], ...) : 
  (list) object cannot be coerced to type 'double'
In addition: Warning message:
Skim does not know how to summarize of vector of class: sfc_MULTIPOLYGON. Coercing to numericSkim does not know how to summarize of vector of class: sfc. Coercing to numeric 
@elinw
Collaborator

elinw commented Jun 7, 2017

Can you say what list of statistics would make sense?
If you look in this file you will see examples of such lists.

https://github.com/ropenscilabs/skimr/blob/master/R/functions.R

We haven't done anything generic for list columns yet; see #10.

@elinw
Collaborator

elinw commented Jun 7, 2017

This is a really interesting issue. I opened issue #90 about the error message.

@elinw
Collaborator

elinw commented Jun 7, 2017

This also relates to issue #75, because the geometry object has class "sfc_MULTIPOLYGON" "sfc". If that fix is made it will no longer throw an error, but I think you would rather write a function for "sfc_MULTIPOLYGON" that returns meaningful information instead of generic character or list information.

@tiernanmartin

@elinw One useful summary stat for simple features would be the count of valid geometries (see st_is_valid()).

This stat basically tells the user if the dataset contains any records that need to be omitted or pre-processed with st_make_valid(), so it's roughly analogous to working with missing values in non-spatial data.

@edzer

edzer commented Jun 7, 2017

@tiernanmartin good point; my favorite analogy to missing values in non-spatial data is the empty geometry, found by st_dimension(obj) returning NA.

@tiernanmartin

@edzer That's a useful distinction: empty geometries and invalid geometries can both cause headaches downstream but in different ways.

In both cases, skim() could help alert users to the presence of these special cases.
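
A minimal sketch of the two checks discussed above, using the same nc example data as the original post. It only counts empty and invalid geometries with sf::st_is_empty() and sf::st_is_valid(); the name geometry_health is made up for illustration and is not part of skimr or sf.

```r
library(sf)

# Same example data as in the original post.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
geom <- st_geometry(nc)

# Counts that play a role similar to missing-value counts for ordinary columns.
# geometry_health is a hypothetical name used only in this sketch.
geometry_health <- c(
  n       = length(geom),
  empty   = sum(st_is_empty(geom)),
  invalid = sum(!st_is_valid(geom), na.rm = TRUE)
)
geometry_health
```

Either count being above zero would flag features that may need to be dropped or repaired before further processing.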

@edzer

edzer commented Jun 7, 2017

I agree, the analogy for both is headaches, really! (the default print method for sfc and sf objects prints the number of empty geometries if larger than zero, btw)

@Nowosad
Contributor Author

Nowosad commented Jun 7, 2017

There is also @mdsumner's idea of decomposition stats: number of parts, holes, segments, rook/queen neighbours, and coordinates.

Source: https://twitter.com/mdsumner/status/872276792953917440
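
To make a few of those decomposition stats concrete, here is a rough base R sketch. It assumes a single MULTIPOLYGON is stored as a list of polygons, each polygon a list of rings, and each ring a coordinate matrix whose first ring is the exterior. The helpers square() and decompose() are hypothetical, and the data is hand-built rather than read from an sf object.

```r
# Build a closed square ring as a 5 x 2 coordinate matrix (hypothetical helper).
square <- function(x0, y0, size) {
  rbind(c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
        c(x0, y0 + size), c(x0, y0))
}

# Stand-in for one MULTIPOLYGON: list of polygons -> list of rings -> matrix.
mp <- list(
  list(square(0, 0, 10), square(2, 2, 1)),  # polygon with one hole
  list(square(20, 20, 5))                   # polygon without holes
)

# Count parts, holes and coordinates (hypothetical helper, not an sf function).
decompose <- function(mp) {
  rings <- unlist(mp, recursive = FALSE)    # flatten to a list of ring matrices
  c(
    parts  = length(mp),
    holes  = sum(lengths(mp) - 1L),         # every ring after the first is a hole
    coords = sum(vapply(rings, nrow, integer(1)))
  )
}
decompose(mp)
```

Segment and neighbour counts would need more than this nested-list walk, so they are left out of the sketch.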

@elinw
Collaborator

elinw commented Jun 8, 2017

So the proper name for that class is "sfc_MULTIPOLYGON", right?

@Nowosad
Contributor Author

Nowosad commented Jun 8, 2017

@elinw take a look at http://edzer.github.io/sfr/articles/sf1.html#simple-feature-geometry-types. It could be
sfc_POINT, sfc_LINESTRING, sfc_POLYGON, sfc_MULTIPOINT, sfc_MULTILINESTRING, sfc_MULTIPOLYGON, or sfc_GEOMETRY. [@edzer, please correct me if I'm wrong]

@elinw
Collaborator

elinw commented Jun 8, 2017

So I'm thinking that, to start, it might make sense to do something like this:

sfc_multipolygon_funs<-list(
  missing = n_missing,
  complete = n_complete,
  n = length,
  n_unique = purrr::compose(length, n_unique),
  valid = purrr::compose(sum, sf::st_is_valid)
)

And then, if I'm understanding correctly, you could add support for each of these types.

@edzer

edzer commented Jun 8, 2017

They all derive from sfc, I guess that is what you'd want to program against.
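
As a rough illustration of what programming against the parent class could look like, the base R sketch below walks an object's class vector and picks the first class that has a registered function list. The registry and pick_funs() are hypothetical and are not how skimr is confirmed to do its lookup; fake_geom is a stand-in carrying the same class vector as the nc geometry column.

```r
# Hypothetical registry: class name -> named list of summary functions.
registry <- list(
  sfc     = list(n = length),
  numeric = list(n = length, mean = mean)
)

# Walk the class vector from most to least specific; return the first match.
pick_funs <- function(x, registry) {
  for (cls in class(x)) {
    if (!is.null(registry[[cls]])) return(registry[[cls]])
  }
  NULL
}

# Stand-in object with the class vector c("sfc_MULTIPOLYGON", "sfc").
fake_geom <- structure(list(1, 2, 3), class = c("sfc_MULTIPOLYGON", "sfc"))

funs <- pick_funs(fake_geom, registry)
vapply(funs, function(f) f(fake_geom), numeric(1))
```

With a lookup like this, one sfc entry would cover sfc_POINT, sfc_MULTIPOLYGON and the rest, while a more specific entry could still override it for a single geometry type.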

@mdsumner

mdsumner commented Jun 8, 2017

Wow, this is a really amazing package! I'm still exploring how the classes (like sf) get registered, but I see now how skim_with helps you get started, and @elinw, that list of funs works fine. As @edzer says, it could simply be called sfc_funs, but further details will have to dispatch across all the other types.

I'll be working on this a bit, so please let me know if anyone else starts too.

@elinw
Collaborator

elinw commented Jun 8, 2017

@mdsumner Make sure to look at stats.R also for anything that requires more complex handling. I think you have two options architecturally. If you look in skim.R you'll see that it's handling data frames, but another option is to pass sfc right there and then create a separate sfc_funs list. I haven't really thought that through; I'm just thinking about how many functions are potentially getting pushed into the environment as more specialized data structures get added. I can see from this discussion and from the linked materials that there are going to be a lot.

@mdsumner

mdsumner commented Jun 8, 2017

Thanks, that's helpful! I'm having no problems processing sfc with a custom sfc_funs list that I register like this:

library(skimr)
library(sf)
sfp <- st_read(system.file("shape/nc.shp", package="sf"))
sfc_funs <- list(
  missing = n_missing,
  complete = n_complete,
  n = length,
  n_unique = purrr::compose(length, n_unique),
  valid = purrr::compose(sum, sf::st_is_valid)
)
skim_with(sfc = sfc_funs , append = TRUE)

skim_v(st_geometry(sfp))

I haven't been able to get it to apply to a data frame, though. I thought this minimal example would work:

library(skimr)
adhoc_funs <- list(
  missing = n_missing,
  complete = n_complete,
  n = length,
  n_unique = purrr::compose(length, n_unique),
  funny = function(x) length(x) + 1
)

d <- structure(list(a = 1:4, b = structure(as.list(letters[1:4]), class = "adhoc")), class = "data.frame", row.names = letters[1:4])

skim_with(adhoc = adhoc_funs , append = TRUE)
## no problems, and with much more complex funs too
skim_v(d$b)

## how do we get this to work? 
skim(d)

I think you're telling me what I need to know with "pass sfc right there". Do you mean we need a skim.sfc method? I'm confused about how the custom list-col type gets "registered".

@elinw
Collaborator

elinw commented Jun 8, 2017

Can you try updating to current master? There are at least two issues mentioned above that need to be addressed for this to work. I think one has been merged but the other hasn't.

@mdsumner

mdsumner commented Jun 8, 2017

I have updated now, but it's still the same. More soon (this is great!), I'll keep fleshing out the skim_v function set.

I just realized I'm also not applying the _v functions correctly, as my skim_v is returning a row for every element in the sfc, so that's what I need to get right first.

@elinw
Collaborator

elinw commented Jun 9, 2017

I just pushed up a PR for handling generic lists.
One thing is that you have to constantly update the environment settings when you are creating new functions etc. Use show_skimmers() to double-check which functions are in the environment.

@elinw
Collaborator

elinw commented Jun 9, 2017

This would be my general idea of how to do it:
https://github.com/ropenscilabs/skimr/tree/sf
https://github.com/ropenscilabs/skimr/compare/sf?expand=1
It's not polished, though. @michaelquinn32 made a really great structure.

nc = st_read(system.file("shape/nc.shp", package="sf"))
skim_v(nc$geometry)
snc<-skim(nc)
long<-as.data.frame(snc)
head(long)
tail(long)

@mdsumner

mdsumner commented Jun 10, 2017

Ah thanks, that all makes sense.

@elinw what are your thoughts on importing sf versus another "sk.sf.imr" package, or perhaps sf including these summary funs so they are available from it? sf is a pretty heavy dependency requiring GDAL and GEOS, so I tend to wrap around it to keep related packages lighter and keep GDAL out of my .travis.yml if possible.

I'll have to put this aside for a little while, but I'm hoping to help get it off the ground. (I've used an experimental package, sc, to derive the decomposition metrics for now. sc structures the data in a way that makes that natural, but sf can bust itself into pieces reasonably well: copying out a feature-level ID is one way to do it, another is cunning lapply nrow/length cumulations.)

@elinw
Collaborator

elinw commented Jun 10, 2017 via email

@elinw
Collaborator

elinw commented Jun 18, 2017

I was thinking maybe we could use this in a vignette that explains how to extend to specialized types of data.

@elinw
Collaborator

elinw commented Sep 12, 2017

@mdsumner @Nowosad @edzer @tiernanmartin I added a vignette using the code here as an example. It's in the develop branch if you want to take a look.

@elinw
Collaborator

elinw commented Sep 12, 2017

Closing for now since the vignette is there. Happy to get PRs for improvements.

@elinw elinw closed this as completed Sep 12, 2017
@mdsumner

Thanks @elinw - that vignette is really great!

Just FYI for anyone who might be pursuing this as well - I did some work on decomposing sf (in mdsumner/skimr - with sc) so that counts of vertices, paths, and edges were readily useable for skimr, but that project had to be reworked somewhat - now hypertidy/silicate. The package gibble on CRAN could be used for a skimr-r-for sf to give path and vertex counts, which is a connection I'd missed until now.
