Dplyr compatibility? #42
Comments
Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by
but this is maybe, ehm, not very tidy? @hadley what do you think? |
I think it's just a matter of adding the right dplyr methods to restore the sf class, e.g.:
filter_.sf <- function(.data, ..., .dots) {
  st_as_sf(NextMethod())
} |
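A minimal sketch of that pattern in plain R, without sf or dplyr. The generic keep_rows() and the class toy_sf are made-up stand-ins for a dplyr verb and the sf class (both hypothetical); only UseMethod(), NextMethod() and inherits() are real base R. It assumes the underlying data.frame method returns a plain data.frame, which is the situation described above.

```r
# Hypothetical generic standing in for a dplyr verb such as filter_().
keep_rows <- function(.data, keep) UseMethod("keep_rows")

# The data.frame method returns a plain data.frame, so any subclass is lost.
keep_rows.data.frame <- function(.data, keep) {
  class(.data) <- "data.frame"
  .data[keep, , drop = FALSE]
}

# Hypothetical constructor standing in for st_as_sf(): puts the subclass back.
as_toy_sf <- function(x) {
  class(x) <- c("toy_sf", setdiff(class(x), "toy_sf"))
  x
}

# Method for the subclass: delegate to the next method, then restore the class.
keep_rows.toy_sf <- function(.data, keep) {
  as_toy_sf(NextMethod())
}

d <- as_toy_sf(data.frame(id = 1:3, v = c(2, 5, 9)))
res <- keep_rows(d, d$v > 3)
inherits(res, "toy_sf")  # the subclass survives the verb
```

The subclass method adds no logic of its own: it only wraps the result of the inherited method, which is why one such wrapper per verb is enough for the simple (non-aggregating) verbs.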
Got it, thanks! I see the following functions ending in a
@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries? |
In short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing. In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing: I haven't explored much what should/could happen there. Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any usable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to). |
I'm really going to try to spend some time on this; here is just what I tried recently, without going into detail.
## simple (non aggregation) stuff already works
library(sf)
example(st_read)
library(dplyr)
nc %>% slice(10)
nc %>% filter(PERIMETER > 2.5)
## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))
## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry))
I know this requires a lot of thought, but I think there's value in
I haven't thought deeply about the other dplyr behaviours. |
UPDATE
where |
May I also suggest |
Thanks; I did, but still untested. |
I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!
|
Here's an example computing population (well, birth) densities for aggregated areas:
library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features: 100
## fields: 14
## proj4string: +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
did anyone get lost?
No. You might have discovered by now that I'm brand new to |
Thank you for implementing this in such short order @edzer! When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead. Thanks again and I look forward to following the development of |
A few observations about
You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html |
Signed-off-by: Edzer Pebesma <edzer.pebesma@uni-muenster.de>
Great observations, beautiful tests! Now, they look even much better:
The issue you discovered was that the |
I'm not sure if this deserves its own issue, but a full set of I haven't dug into A proposed combo:
would result in an There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two
It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (tidyverse/dplyr#557), but there might be a solution that isn't too difficult to implement. The user could also generate new geometry columns using any function that takes two
|
@edzer Is it better to keep adding |
Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ? |
@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6,
                          dens = BIR74/area,
                          area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))
It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:
library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom),
                       geom = st_transform(geom, 7314),
                       area = st_area(geom) / 1e6,
                       dens = BIR74 / area,
                       AREA_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(AREA_cl) %>%
  summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1 329962
But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere. |
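The two "compare" totals in the comment above agree because summing dens * area within each group simply adds the births back up, whatever the grouping. A small base-R sketch of that arithmetic, using made-up numbers rather than the nc data and tapply() in place of group_by() %>% summarise() (the vectors area, births and size are assumptions for illustration only):

```r
# Made-up per-county values (illustration only, not the nc dataset).
area   <- c(100, 250, 400, 150, 300)       # area in km^2
births <- c(20, 75, 40, 60, 90)            # plays the role of BIR74
size   <- c(0.05, 0.11, 0.20, 0.11, 0.14)  # plays the role of AREA

dens    <- births / area                   # births per km^2
area_cl <- cut(size, c(0, .1, .12, .15, .25))

# Per-class totals, as in summarise(A = sum(area), pop = sum(dens * area)).
A        <- tapply(area, area_cl, sum)
pop      <- tapply(dens * area, area_cl, sum)
new_dens <- pop / A

# Both totals recover the overall number of births.
total_grouped <- sum(A * new_dens)
total_raw     <- sum(area * dens)
all.equal(total_grouped, total_raw)
```

Because the check holds for any grouping, it verifies the bookkeeping of the grouped summary but says nothing about whether the projection used for st_area() is appropriate.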
And a check for the grouping and area calculations sp/rgeos style:
library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
# (0,0.1] (0.1,0.12] (0.12,0.15] (0.15,0.25]
# 290620.6 182825.3 322369.5 585439.0
out$A
##[1] 290620.6 182825.3 322369.5 585439.0 |
To follow up in relation to #121, I like the way I ran an experiment (it fails after |
I'm noticing some potentially odd behaviour with
|
I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently? |
I don't know what the best approach is, or if this even matters, largely because I'm unclear of how
|
Gives me
> nc_gathered %>%
+ spread(year, births)
Error in .subset2(x, i, exact = exact) :
  attempt to select less than one element in get1index
on your example. Did you do any testing? |
Oops, sorry, forgot a crucial line! This is actually tested and works for me.
Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks! |
Thanks; I replace the original |
Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:
library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index
Appears to be because
mutate_.sf <- function(.data, ..., .dots) {
  class(.data) <- setdiff(class(.data), "sf")
  st_as_sf(NextMethod())
}
Grouped |
Clumsy as it looks, I agree that might be the best thing to do. Thanks! |
Any reason why I'm still getting |
We might have to do it everywhere. Does it solve your problem? With which function is that? |
It does solve it. |
I know sf is midway toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, 0.5.0 (current):
library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf" "data.frame"
packageVersion("dplyr")
#> [1] '0.5.0'
RC for 0.6.0:
library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"
packageVersion("dplyr")
#> [1] '0.5.0.9002'
One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. |
Ah, my comment above may duplicate #304. |
There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the |
Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo. |
We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem. @edzer revdep emails will go out (hopefully) later today. |
Scripts that use dplyr verbs such as I don't mind modifying |
That is not the intent - can you please file a bug on dplyr? |
addresses tidyverse/dplyr#2664 fix #304 addresses #42
I see that working with sf objects with dplyr is listed on the ISC Proposal, but it looks like you haven't gotten around to documenting that yet.
Without going into a vignette-length explanation, could you shed some light on why dplyr functions appear to strip sf objects of their sf class and suggest a way to effectively combine the power of these two tools?
Example:
Many thanks.