New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dplyr compatibility? #42

Closed
tiernanmartin opened this Issue Oct 31, 2016 · 38 comments

Comments

Projects
None yet
9 participants
@tiernanmartin

tiernanmartin commented Oct 31, 2016

I see that working with sf objects with dplyr is listed on the ISC Proposal, but it looks like you haven't gotten around to documenting that yet.

Without going into a vignette-length explanation, could you shed some light on why dplyr functions appear to strip sf objects of their sf class and suggest a way to effectively combine the power of these two tools?

Example:

library(dplyr)
library(sf)

nc <- st_read(system.file("shapes/", package="maptools"), "sids")
## Reading layer sids from data source 
## /Library/Frameworks/R.framework/Versions/3.3/Resources/library/maptools/shapes/ using driver "ESRI Shapefile"
## features:       100
## fields:         14
## converted into: MULTIPOLYGON

class(nc)
## [1] "sf"         "data.frame"

class(nc[nc$AREA > .1,])
## [1] "sf"         "data.frame"

class(dplyr::filter(nc,AREA > .1))
## [1] "data.frame"

Many thanks.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Oct 31, 2016

Member

Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by

nc %>% filter(AREA > .1) %>% st_as_sf() %>% plot()

but this is maybe, ehm, not very tidy? @hadley what do you think?

Member

edzer commented Oct 31, 2016

Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by

nc %>% filter(AREA > .1) %>% st_as_sf() %>% plot()

but this is maybe, ehm, not very tidy? @hadley what do you think?

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Oct 31, 2016

Contributor

I think it's just a matter of adding the right dplyr methods to restore the sf class, e,g.:

filter_.sf <- function(.data, ..., .dots) {
  st_as_sf(NextMethod())
}
Contributor

hadley commented Oct 31, 2016

I think it's just a matter of adding the right dplyr methods to restore the sf class, e,g.:

filter_.sf <- function(.data, ..., .dots) {
  st_as_sf(NextMethod())
}
@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Oct 31, 2016

Member

Got it, thanks! I see the following functions ending in a _ in dplyr:

 [1] "arrange_"        "count_"          "data_frame_"     "distinct_"      
 [5] "do_"             "filter_"         "funs_"           "group_by_"      
 [9] "group_indices_"  "lst_"            "mutate_"         "mutate_each_"   
[13] "rename_"         "rename_vars_"    "select_"         "select_vars_"   
[17] "slice_"          "summarise_"      "summarise_each_" "summarize_"     
[21] "summarize_each_" "translate_sql_"  "transmute_"     

@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries?

Member

edzer commented Oct 31, 2016

Got it, thanks! I see the following functions ending in a _ in dplyr:

 [1] "arrange_"        "count_"          "data_frame_"     "distinct_"      
 [5] "do_"             "filter_"         "funs_"           "group_by_"      
 [9] "group_indices_"  "lst_"            "mutate_"         "mutate_each_"   
[13] "rename_"         "rename_vars_"    "select_"         "select_vars_"   
[17] "slice_"          "summarise_"      "summarise_each_" "summarize_"     
[21] "summarize_each_" "translate_sql_"  "transmute_"     

@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries?

@mdsumner

This comment has been minimized.

Show comment
Hide comment
@mdsumner

mdsumner Nov 1, 2016

Member

I short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing.

In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing I haven't explored much what should/could happen there.

Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any useable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to).

Member

mdsumner commented Nov 1, 2016

I short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing.

In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing I haven't explored much what should/could happen there.

Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any useable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to).

@mdsumner

This comment has been minimized.

Show comment
Hide comment
@mdsumner

mdsumner Nov 1, 2016

Member

I'm really going to try to spend some time on this, here is just what I tried recently, without going into detail.

library(sf)
example(st_read)

## simple (non aggregation) stuff already works
library(sf)
example(st_read)

library(dplyr)
nc %>% slice(10)

nc %>% filter(PERIMETER > 2.5)

## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))

## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA)) 

## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry)) 

I know this requires a lot of thought, but I think there's value in

  • not dropping the sf defs for a mutate into group_by (will require effort on how the grouped_df is handled)
  • a default aggregation behaviour, where any group_by keeps the sf defs and geometry column is unioned, even if it isn't operated on in summarize()

I haven't though deeply about the other dplyr behaviours.

Member

mdsumner commented Nov 1, 2016

I'm really going to try to spend some time on this, here is just what I tried recently, without going into detail.

library(sf)
example(st_read)

## simple (non aggregation) stuff already works
library(sf)
example(st_read)

library(dplyr)
nc %>% slice(10)

nc %>% filter(PERIMETER > 2.5)

## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))

## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA)) 

## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry)) 

I know this requires a lot of thought, but I think there's value in

  • not dropping the sf defs for a mutate into group_by (will require effort on how the grouped_df is handled)
  • a default aggregation behaviour, where any group_by keeps the sf defs and geometry column is unioned, even if it isn't operated on in summarize()

I haven't though deeply about the other dplyr behaviours.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 1, 2016

Member

UPDATE
Please update to c4dde04 which has tested dplyr verb methods for sf objects; an overview:

  +"arrange_"       -"count_"         -"data_frame_"    +"distinct_"
  -"do_"            +"filter_"        -"funs_"          +"group_by_"
  -"group_indices_" -"lst_"           +"mutate_"        -"mutate_each_"
  +"rename_"        -"rename_vars_"   +"select_"        -"select_vars_"
  +"slice_"         +"summarise_"     -"summarise_each_"+"summarize_"
  *"summarize_each_"-"translate_sql_" +"transmute_"

from tidyr:  +gather, +spread

   +: added sf methods
   -: not relevant (?)
   *: will be deprecated?

where summarisze_ automatically merges geometries over groups (when called with a grouped_df).

Member

edzer commented Nov 1, 2016

UPDATE
Please update to c4dde04 which has tested dplyr verb methods for sf objects; an overview:

  +"arrange_"       -"count_"         -"data_frame_"    +"distinct_"
  -"do_"            +"filter_"        -"funs_"          +"group_by_"
  -"group_indices_" -"lst_"           +"mutate_"        -"mutate_each_"
  +"rename_"        -"rename_vars_"   +"select_"        -"select_vars_"
  +"slice_"         +"summarise_"     -"summarise_each_"+"summarize_"
  *"summarize_each_"-"translate_sql_" +"transmute_"

from tidyr:  +gather, +spread

   +: added sf methods
   -: not relevant (?)
   *: will be deprecated?

where summarisze_ automatically merges geometries over groups (when called with a grouped_df).

@tiernanmartin

This comment has been minimized.

Show comment
Hide comment
@tiernanmartin

tiernanmartin Nov 1, 2016

May I also suggesttidyr::gather_() and tidyr::spread_() as worthy candidates for inclusion in the list of tidyverse functions that play well with sf.

tiernanmartin commented Nov 1, 2016

May I also suggesttidyr::gather_() and tidyr::spread_() as worthy candidates for inclusion in the list of tidyverse functions that play well with sf.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 1, 2016

Member

Thanks; I did, but still untested.

Member

edzer commented Nov 1, 2016

Thanks; I did, but still untested.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 2, 2016

Member

I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!

library(sf)
library(dplyr)
nc = st_read(system.file("shape/nc.shp", package="sf"), crs = 4267)
nc %>% filter(AREA > .1) %>% plot()

# plot 10 smallest counties in grey:
nc %>% plot()
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# select: check both when geometry is part of the selection, and when not:
nc %>% select(SID74, SID79) %>% names()
nc %>% select(SID74, SID79, geometry) %>% names()
nc %>% select(SID74, SID79) %>% class()
nc %>% select(SID74, SID79, geometry) %>% class()

# arrange: ten smallest counties
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# group_by:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc %>% group_by(area_cl) %>% class()

# mutate:
nc2 <- nc %>% mutate(area10 = AREA/10)

# transmute:
nc %>% transmute(AREA = AREA/10, geometry = geometry) %>% class()
nc %>% transmute(AREA = AREA/10) %>% class()

# rename:
nc2 <- nc %>% rename(area = AREA)

# distinct:
nc[c(1:100,1:10),] %>% distinct() %>% nrow()

# summarize:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.g <- nc %>% group_by(area_cl)
nc.g %>% summarise(mean(AREA))
nc.g %>% summarize(mean(AREA)) %>% plot(col = 3:6/7)

library(tidyr)

# time-wide to long table, using tidyr::gather
# stack the two SID columns for the July 1, 1974 - June 30, 1978 and July 1, 1979 - June 30, 1984 periods
# (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf)
nc %>% select(SID74, SID79, geometry) %>% gather(VAR, SID, -geometry) %>% summary()

# spread:
nc$row = 1:100
nc.g <- nc %>% select(SID74, SID79, row) %>% gather(VAR, SID, -row)
nc.g %>% tail()
nc.g %>% spread(VAR, SID) %>% head()
nc %>% select(SID74, SID79, geometry, row) %>% gather(VAR, SID, -geometry, -row) %>% spread(VAR, SID) %>% head()
Member

edzer commented Nov 2, 2016

I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!

library(sf)
library(dplyr)
nc = st_read(system.file("shape/nc.shp", package="sf"), crs = 4267)
nc %>% filter(AREA > .1) %>% plot()

# plot 10 smallest counties in grey:
nc %>% plot()
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# select: check both when geometry is part of the selection, and when not:
nc %>% select(SID74, SID79) %>% names()
nc %>% select(SID74, SID79, geometry) %>% names()
nc %>% select(SID74, SID79) %>% class()
nc %>% select(SID74, SID79, geometry) %>% class()

# arrange: ten smallest counties
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# group_by:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc %>% group_by(area_cl) %>% class()

# mutate:
nc2 <- nc %>% mutate(area10 = AREA/10)

# transmute:
nc %>% transmute(AREA = AREA/10, geometry = geometry) %>% class()
nc %>% transmute(AREA = AREA/10) %>% class()

# rename:
nc2 <- nc %>% rename(area = AREA)

# distinct:
nc[c(1:100,1:10),] %>% distinct() %>% nrow()

# summarize:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.g <- nc %>% group_by(area_cl)
nc.g %>% summarise(mean(AREA))
nc.g %>% summarize(mean(AREA)) %>% plot(col = 3:6/7)

library(tidyr)

# time-wide to long table, using tidyr::gather
# stack the two SID columns for the July 1, 1974 - June 30, 1978 and July 1, 1979 - June 30, 1984 periods
# (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf)
nc %>% select(SID74, SID79, geometry) %>% gather(VAR, SID, -geometry) %>% summary()

# spread:
nc$row = 1:100
nc.g <- nc %>% select(SID74, SID79, row) %>% gather(VAR, SID, -row)
nc.g %>% tail()
nc.g %>% spread(VAR, SID) %>% head()
nc %>% select(SID74, SID79, geometry, row) %>% gather(VAR, SID, -geometry, -row) %>% spread(VAR, SID) %>% head()
@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 2, 2016

Member

Here's an example computing population (well, birth) densities for aggregated areas:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features:       100
## fields:         14
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

did anyone get lost?

out %>% summarise(sum(A * new_dens))
##   sum(A * new_dens)
## 1            329962
nc.ea %>% summarise(sum(area * dens))
##   sum(area * dens)
## 1           329962

No. You might have discovered by now that I'm brand new to dplyr - happy to take comments how to tidy this up!

Member

edzer commented Nov 2, 2016

Here's an example computing population (well, birth) densities for aggregated areas:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features:       100
## fields:         14
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

did anyone get lost?

out %>% summarise(sum(A * new_dens))
##   sum(A * new_dens)
## 1            329962
nc.ea %>% summarise(sum(area * dens))
##   sum(area * dens)
## 1           329962

No. You might have discovered by now that I'm brand new to dplyr - happy to take comments how to tidy this up!

@tiernanmartin

This comment has been minimized.

Show comment
Hide comment
@tiernanmartin

tiernanmartin Nov 2, 2016

Thank you for implementing this in such short order @edzer!

When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to sp conventions. But now, thanks to your responsiveness and the feedback from others on this thread, I'm thrilled to start testing the sf x dplyr interface.

Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead.

Thanks again and I look forward to following the development of sf.

tiernanmartin commented Nov 2, 2016

Thank you for implementing this in such short order @edzer!

When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to sp conventions. But now, thanks to your responsiveness and the feedback from others on this thread, I'm thrilled to start testing the sf x dplyr interface.

Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead.

Thanks again and I look forward to following the development of sf.

@tiernanmartin

This comment has been minimized.

Show comment
Hide comment
@tiernanmartin

tiernanmartin Nov 2, 2016

A few observations about summarise()'s interaction with sf objects:

  • the CRS gets dropped
  • the aggregated groups aren't what I expected
  • some geometries may be dropped

You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html

tiernanmartin commented Nov 2, 2016

A few observations about summarise()'s interaction with sf objects:

  • the CRS gets dropped
  • the aggregated groups aren't what I expected
  • some geometries may be dropped

You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html

edzer added a commit that referenced this issue Nov 3, 2016

address #42 (comment)
Signed-off-by: Edzer Pebesma <edzer.pebesma@uni-muenster.de>
@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 3, 2016

Member

Great observations, beautiful tests! Now, they look even much better:

  • CRS now get passed through by summarise (for the other commands it did work)
  • aggregate groups and dropped geometries have now been fixed.

The issue you discoverd was that the indices attribute in objects returned by group_by is 0-based, where I assumed it was 1-based. @hadley if you are ever going to change this, remember telling sf about it!

Member

edzer commented Nov 3, 2016

Great observations, beautiful tests! Now, they look even much better:

  • CRS now get passed through by summarise (for the other commands it did work)
  • aggregate groups and dropped geometries have now been fixed.

The issue you discoverd was that the indices attribute in objects returned by group_by is 0-based, where I assumed it was 1-based. @hadley if you are ever going to change this, remember telling sf about it!

@kendonB

This comment has been minimized.

Show comment
Hide comment
@kendonB

kendonB Nov 3, 2016

I'm not sure if this deserves it's own issue, but a full set of *_join operations would be a good addition here.

I haven't dug into sfr yet, so I don't know if you're allowing sf objects to contain multiple geometries (i.e. two MULTIPOLYGONs or a MULTIPOLYGON and a MULTIPOINT). If you can, (and if you can't, I think you should), then *join operations would result in sf objects with multiple geometries.

A proposed combo:

mp1 # sf with MULTIPOLYGON
mp2 # sf with MULTIPOLYGON
mp3 <- left_join(mp1, mp2, border = FALSE)

would result in an sf with two columns geometry1 and geometry2 and would behave just like left_join in dplyr except a match is defined as two geometries with positive spatial overlap; border = TRUE would include matches with common borders. left_join(x, y) includes all observations in x, regardless of whether they match or not. Multiple matches return as multiple rows.

There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two sfg objects and returns a logical would make the most sense. Does st_intersects, for example, allow sfg objects as inputs? If so, then this could be fully general for any pair of sf types. Usage would be something like:

mp3 <- left_join(mp1, mp2, match_fun = st_touches)

It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (tidyverse/dplyr#557), but there might be a solution that isn't too difficult to implement.

The user could also generate new geometry columns using any function that takes two sfgs and returns an sfg. i.e.:

mp3 <- mp3 %>%
  mutate(geometry3 = purrr::map2(geometry1, geometry2, st_intersection))

kendonB commented Nov 3, 2016

I'm not sure if this deserves it's own issue, but a full set of *_join operations would be a good addition here.

I haven't dug into sfr yet, so I don't know if you're allowing sf objects to contain multiple geometries (i.e. two MULTIPOLYGONs or a MULTIPOLYGON and a MULTIPOINT). If you can, (and if you can't, I think you should), then *join operations would result in sf objects with multiple geometries.

A proposed combo:

mp1 # sf with MULTIPOLYGON
mp2 # sf with MULTIPOLYGON
mp3 <- left_join(mp1, mp2, border = FALSE)

would result in an sf with two columns geometry1 and geometry2 and would behave just like left_join in dplyr except a match is defined as two geometries with positive spatial overlap; border = TRUE would include matches with common borders. left_join(x, y) includes all observations in x, regardless of whether they match or not. Multiple matches return as multiple rows.

There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two sfg objects and returns a logical would make the most sense. Does st_intersects, for example, allow sfg objects as inputs? If so, then this could be fully general for any pair of sf types. Usage would be something like:

mp3 <- left_join(mp1, mp2, match_fun = st_touches)

It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (tidyverse/dplyr#557), but there might be a solution that isn't too difficult to implement.

The user could also generate new geometry columns using any function that takes two sfgs and returns an sfg. i.e.:

mp3 <- mp3 %>%
  mutate(geometry3 = purrr::map2(geometry1, geometry2, st_intersection))
@tiernanmartin

This comment has been minimized.

Show comment
Hide comment
@tiernanmartin

tiernanmartin Nov 4, 2016

@edzer Is it better to keep adding dplyr-related requests to this issue or to open fresh issues for each request?

tiernanmartin commented Nov 4, 2016

@edzer Is it better to keep adding dplyr-related requests to this issue or to open fresh issues for each request?

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Nov 4, 2016

Member

Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ?

Member

edzer commented Nov 4, 2016

Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ?

@mdsumner

This comment has been minimized.

Show comment
Hide comment
@mdsumner

mdsumner Nov 12, 2016

Member

@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:

nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, 
                          dens = BIR74/area, 
                          area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))

It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom), 
              geom = st_transform(geom, 7314), 
              area = st_area(geom) / 1e6, 
              dens = BIR74 / area,
              AREA_cl = cut(AREA, c(0, .1, .12, .15, .25))) 

out <- nc.ea %>%   group_by(AREA_cl) %>% 
  summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1            329962

But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere.

Member

mdsumner commented Nov 12, 2016

@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:

nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, 
                          dens = BIR74/area, 
                          area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))

It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom), 
              geom = st_transform(geom, 7314), 
              area = st_area(geom) / 1e6, 
              dens = BIR74 / area,
              AREA_cl = cut(AREA, c(0, .1, .12, .15, .25))) 

out <- nc.ea %>%   group_by(AREA_cl) %>% 
  summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1            329962

But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere.

@mdsumner

This comment has been minimized.

Show comment
Hide comment
@mdsumner

mdsumner Nov 12, 2016

Member

And a check for the grouping and area calculations sp/rgeos style:

library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
#    (0,0.1]  (0.1,0.12] (0.12,0.15] (0.15,0.25] 
#  290620.6    182825.3    322369.5    585439.0 
out$A
##[1] 290620.6 182825.3 322369.5 585439.0
Member

mdsumner commented Nov 12, 2016

And a check for the grouping and area calculations sp/rgeos style:

library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
#    (0,0.1]  (0.1,0.12] (0.12,0.15] (0.15,0.25] 
#  290620.6    182825.3    322369.5    585439.0 
out$A
##[1] 290620.6 182825.3 322369.5 585439.0
@etiennebr

This comment has been minimized.

Show comment
Hide comment
@etiennebr

etiennebr Dec 22, 2016

Contributor

To follow up in relation to #121, I like the way tibble handles simple features, because they don't drop a class on select() (data.frames would do the same, though tibbles are very well behaved). Because they have no expectations of a spatial column when selecting they can just drop the geometry, which makes it coherent and explicit. It's also easy to add another spatial column.

I ran an experiment (it fails after summarise, I don't know exactly why, maybe related to #49 ?).

Contributor

etiennebr commented Dec 22, 2016

To follow up in relation to #121, I like the way tibble handles simple features, because they don't drop a class on select() (data.frames would do the same, though tibbles are very well behaved). Because they have no expectations of a spatial column when selecting they can just drop the geometry, which makes it coherent and explicit. It's also easy to add another spatial column.

I ran an experiment (it fails after summarise, I don't know exactly why, maybe related to #49 ?).

@mstrimas

This comment has been minimized.

Show comment
Hide comment
@mstrimas

mstrimas Mar 7, 2017

I'm noticing some potentially odd behaviour with spread(). When the geometries of features with the same key are different spread() results in a single row with only the geometry of the first instance of that key. Here's an example:

nc <- st_read(system.file("shape/nc.shp", package="sf"))
# make a long format data set
nc_gathered <- nc %>% 
  select(county = NAME, BIR74, BIR79, -geometry) %>% 
  slice(1:3) %>% 
  gather(year, births, BIR74, BIR79)
# works as expected
nc_gathered %>% 
  spread(year, births)
# change second Alleghany geometry
nc_gathered$geometry[[4]] <- nc_gathered$geometry[[2]]
nc_gathered %>% 
  spread(year, births)
# now there's only one Alleghany entry
# in contrast, if this were a normal data frame there would be 
# two Alleghany entries, one for each of the different geometries

mstrimas commented Mar 7, 2017

I'm noticing some potentially odd behaviour with spread(). When the geometries of features with the same key are different spread() results in a single row with only the geometry of the first instance of that key. Here's an example:

nc <- st_read(system.file("shape/nc.shp", package="sf"))
# make a long format data set
nc_gathered <- nc %>% 
  select(county = NAME, BIR74, BIR79, -geometry) %>% 
  slice(1:3) %>% 
  gather(year, births, BIR74, BIR79)
# works as expected
nc_gathered %>% 
  spread(year, births)
# change second Alleghany geometry
nc_gathered$geometry[[4]] <- nc_gathered$geometry[[2]]
nc_gathered %>% 
  spread(year, births)
# now there's only one Alleghany entry
# in contrast, if this were a normal data frame there would be 
# two Alleghany entries, one for each of the different geometries
@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Mar 7, 2017

Member

I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently?

Member

edzer commented Mar 7, 2017

I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently?

@mstrimas

This comment has been minimized.

Show comment
Hide comment
@mstrimas

mstrimas Mar 8, 2017

I don't know what the best approach is, or if this even matters, largely because I'm unclear of how spread() might be used for spatial data in the wild. Just seemed to me that the behaviour of spread_.sf() is inconsistent with spread() for data frames. Without having done much testing, wouldn't this method work:

spread_.sf <- function (data, key_col, value_col, fill = NA, convert = FALSE, 
    drop = TRUE, sep = NULL) 
{
  st_as_sf(NextMethod())
}

mstrimas commented Mar 8, 2017

I don't know what the best approach is, or if this even matters, largely because I'm unclear of how spread() might be used for spatial data in the wild. Just seemed to me that the behaviour of spread_.sf() is inconsistent with spread() for data frames. Without having done much testing, wouldn't this method work:

spread_.sf <- function (data, key_col, value_col, fill = NA, convert = FALSE, 
    drop = TRUE, sep = NULL) 
{
  st_as_sf(NextMethod())
}
@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Mar 8, 2017

Member

Gives me

> nc_gathered %>% 
+   spread(year, births)
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index

on your example. Did you do any testing?

Member

edzer commented Mar 8, 2017

Gives me

> nc_gathered %>% 
+   spread(year, births)
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index

on your example. Did you do any testing?

@mstrimas

This comment has been minimized.

Show comment
Hide comment
@mstrimas

mstrimas Mar 9, 2017

Oops, sorry, forgot a crucial line! This is actually tested and works for me.

spread_.sf <- function(data, key_col, value_col, fill = NA, 
                       convert = FALSE, drop = TRUE, sep = NULL) {
  data <- as.data.frame(data)
  st_as_sf(NextMethod())
}

Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks!

mstrimas commented Mar 9, 2017

Oops, sorry, forgot a crucial line! This is actually tested and works for me.

spread_.sf <- function(data, key_col, value_col, fill = NA, 
                       convert = FALSE, drop = TRUE, sep = NULL) {
  data <- as.data.frame(data)
  st_as_sf(NextMethod())
}

Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks!

edzer added a commit that referenced this issue Mar 9, 2017

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Mar 9, 2017

Member

Thanks; I replace the original spread_.sf with yours.

Member

edzer commented Mar 9, 2017

Thanks; I replace the original spread_.sf with yours.

@mstrimas

This comment has been minimized.

Show comment
Hide comment
@mstrimas

mstrimas Mar 10, 2017

Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:

library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Appears to be because mutate() drops the st_column attribute, but keeps the sf class def, when run on grouped data frames (though not on normal data frames for some reason). Then st_as_sf.sf() is called instead of st_as_sf.data.frame() which doesn't add the st_column back. I have no idea if this is the best approach, but the only way I was able to fix this is by removing the sf class with:

mutate_.sf <- function(.data, ..., .dots) {
  class(.data) <- setdiff(class(.data), "sf")
  st_as_sf(NextMethod())
}

Grouped filter() does not suffer from this issue.

mstrimas commented Mar 10, 2017

Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:

library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Appears to be because mutate() drops the st_column attribute, but keeps the sf class def, when run on grouped data frames (though not on normal data frames for some reason). Then st_as_sf.sf() is called instead of st_as_sf.data.frame() which doesn't add the st_column back. I have no idea if this is the best approach, but the only way I was able to fix this is by removing the sf class with:

mutate_.sf <- function(.data, ..., .dots) {
  class(.data) <- setdiff(class(.data), "sf")
  st_as_sf(NextMethod())
}

Grouped filter() does not suffer from this issue.

edzer added a commit that referenced this issue Mar 10, 2017

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Mar 10, 2017

Member

Clumsy as it looks, I agree that might be the best thing to do. Thanks!

Member

edzer commented Mar 10, 2017

Clumsy as it looks, I agree that might be the best thing to do. Thanks!

@eivindhammers

This comment has been minimized.

Show comment
Hide comment
@eivindhammers

eivindhammers Mar 24, 2017

Any reason why class(.data) <- setdiff(class(.data), "sf") was only added to group_by() in f51b9a3?

I'm still getting Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index.

eivindhammers commented Mar 24, 2017

Any reason why class(.data) <- setdiff(class(.data), "sf") was only added to group_by() in f51b9a3?

I'm still getting Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Mar 24, 2017

Member

We might have to do it everywhere. Does it solve your problem? With which function is that?

Member

edzer commented Mar 24, 2017

We might have to do it everywhere. Does it solve your problem? With which function is that?

@eivindhammers

This comment has been minimized.

Show comment
Hide comment
@eivindhammers

eivindhammers Mar 24, 2017

It does solve it. mutate()

eivindhammers commented Mar 24, 2017

It does solve it. mutate()

@yutannihilation

This comment has been minimized.

Show comment
Hide comment
@yutannihilation

yutannihilation Apr 14, 2017

Contributor

I know sf is on the mid-way toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, sf object loses its class (sf) with dplyr 0.6.0rc. In case you've not noticed yet, I report the issue here:

0.5.0(current):

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf"         "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0'

RC for 0.6.0:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0.9002'

One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. select_, filter_, ...), do you plan to introduce tidyeval?

Contributor

yutannihilation commented Apr 14, 2017

I know sf is on the mid-way toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, sf object loses its class (sf) with dplyr 0.6.0rc. In case you've not noticed yet, I report the issue here:

0.5.0(current):

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf"         "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0'

RC for 0.6.0:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0.9002'

One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. select_, filter_, ...), do you plan to introduce tidyeval?

@yutannihilation

This comment has been minimized.

Show comment
Hide comment
@yutannihilation

yutannihilation Apr 14, 2017

Contributor

Ah, my comment above may be duplicated with #304.

Contributor

yutannihilation commented Apr 14, 2017

Ah, my comment above may be duplicated with #304.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Apr 14, 2017

Member

There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the verb_ methods will be deprecated but remain around for backward compatibility. This is, as you found out, not yet the case, so I consider your worries a little premature. sf is not mid-way something, dplyr is.

Member

edzer commented Apr 14, 2017

There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the verb_ methods will be deprecated but remain around for backward compatibility. This is, as you found out, not yet the case, so I consider your worries a little premature. sf is not mid-way something, dplyr is.

@yutannihilation

This comment has been minimized.

Show comment
Hide comment
@yutannihilation

yutannihilation Apr 14, 2017

Contributor

Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo.

Contributor

yutannihilation commented Apr 14, 2017

Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo.

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Apr 14, 2017

Contributor

We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem.

@edzer revdep emails will go out (hopefully) later today.

Contributor

hadley commented Apr 14, 2017

We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem.

@edzer revdep emails will go out (hopefully) later today.

@edzer

This comment has been minimized.

Show comment
Hide comment
@edzer

edzer Apr 14, 2017

Member

Scripts that use dplyr verbs such as mutate (see 5 comments above for a reprex) no longer work with packages that dispatch on verb_ (mutate_), which is the only way dplyr 0.5 provides. To be compatible, users now need to change mutate into mutate_ in their scripts, did you have in mind to recommend them doing that?

I don't mind modifying sf, but consider it a backward incompatible change.

Member

edzer commented Apr 14, 2017

Scripts that use dplyr verbs such as mutate (see 5 comments above for a reprex) no longer work with packages that dispatch on verb_ (mutate_), which is the only way dplyr 0.5 provides. To be compatible, users now need to change mutate into mutate_ in their scripts, did you have in mind to recommend them doing that?

I don't mind modifying sf, but consider it a backward incompatible change.

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Apr 14, 2017

Contributor

That is not the intent - can you please file a bug on dplyr?

Contributor

hadley commented Apr 14, 2017

That is not the intent - can you please file a bug on dplyr?

@edzer

This comment has been minimized.

Show comment
Hide comment
Member

edzer commented Apr 14, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment