dplyr::group_map() usage with sf data #969

Closed
mattbk opened this issue Feb 1, 2019 · 8 comments
Labels
feature a feature request or enhancement

Comments

@mattbk

mattbk commented Feb 1, 2019

First posted at tidyverse/dplyr#4143, they suggested I ask over here.

With dplyr 0.8.0, group_map() on sf objects either fails or I'm using it wrong.

The example below uses st_centroid() as a stand-in for a custom function I want to use. That function keeps all rows and adds a new column, where each row's value is calculated using only the rows in its group.

Thanks for any thoughts.

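library(dplyr)  # provides %>%, group_by(), summarize(), group_map()
library(tidyr)  # provides nest() and unnest()
library(sf)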
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
nc <- st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `C:\Users\matt\Documents\R\win-library\sf\shape\nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
# Add grouping column
nc$gp <- sample(1:10, replace=T)
# Example of centroid of each polygon; works
cent <- st_centroid(nc)
#> Warning in st_centroid.sf(nc): st_centroid assumes attributes are constant
#> over geometries of x
#> Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
#> of_largest_polygon): st_centroid does not give correct centroids for
#> longitude/latitude data
# Example of summary; works
nc_gp_area <- nc %>%
    group_by(gp) %>%
    summarize(area_mean = mean(AREA))
# Get centroid of each group of polygons; does not work
nc_gp_cent <- nc %>%
                group_by(gp) %>%
                group_map(st_centroid)
#> Error in UseMethod("st_centroid"):
#>   no applicable method for 'st_centroid' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

# This method is what dplyr::group_map() is supposed to replace; works
# (https://github.com/tidyverse/dplyr/issues/4066#issue-395061423)
nc_gp_cent <- nc %>%
                group_by(gp) %>%
                nest() %>%
                mutate(out = purrr::map(data, ~st_centroid(.x))) %>%
                unnest(out) %>%
                st_as_sf()

Created on 2019-01-31 by the reprex package (v0.2.1)

@karldw
Contributor

karldw commented Feb 1, 2019

For more context, group_map is one of several new generics. dplyr 0.8.0 is scheduled to be released today (Feb 1).

@edzer edzer added the feature a feature request or enhancement label Feb 2, 2019
@obrl-soil

tbh I'm going to need to see some demo code before group_map() and summarise() are properly distinct in my mind. As far as I can tell, you can already derive the centroids you want from nc_gp_area, since the summarise() method for sf objects unions the geometries within each group:

library(sf)
library(dplyr)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc$gp <- sample(1:10, replace=T)
# Example of centroid of each polygon; works
cent <- st_centroid(nc)
nc_gp_area <- nc %>%
  group_by(gp) %>%
  summarize(area_mean = mean(AREA))

grp <- sample(seq(10), 1)
plot(nc_gp_area[grp, 0], axes = T, reset = F)
plot(st_centroid(nc_gp_area[grp, 0]), add = T, pch = 19, col = 'red')

[image: plot of one group's unioned geometry with its centroid marked in red]

yes/no?

@mattbk
Author

mattbk commented Feb 13, 2019

Centroid was an example of an existing function. What I'm actually using is a custom function to find the nearest neighbor within each group and add the distance as a new column. Because this needs to create a unique value for each row in the group, I can't use summarize.

In the past I might have split groups into a list, but this method seems better.
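For readers who want to see the split-into-a-list approach mentioned above, here is a minimal sketch. It is not the custom function from this thread: the helper name add_nn_dist, the column name nn_dist, the seed, and the choice of projection (EPSG:32119, so that distances are planar and in metres) are all assumptions made for illustration.

```r
library(sf)

# Same demo data as in the thread; projected here (an assumed choice) so that
# st_distance() returns planar distances in metres.
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- st_transform(nc, 32119)
set.seed(1)
nc$gp <- sample(1:10, nrow(nc), replace = TRUE)

# Hypothetical helper: for every feature in x, the distance to its nearest
# neighbour among the other features of x.
add_nn_dist <- function(x) {
  # Pairwise distances as a plain numeric matrix (units dropped)
  d <- matrix(as.numeric(st_distance(x)), nrow = nrow(x))
  diag(d) <- Inf                 # ignore each feature's distance to itself
  x$nn_dist <- apply(d, 1, min)  # Inf if the group has a single feature
  x
}

# Split by group, apply the helper to each piece, and bind the pieces back.
out <- do.call(rbind, lapply(split(nc, nc$gp), add_nn_dist))
```

The result keeps every row and remains an sf object, which is what summarize() cannot provide here. Adjacent polygons get a distance of 0, and rows come back ordered by group rather than in their original order.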

edzer added a commit that referenced this issue Feb 18, 2019
@edzer
Member

edzer commented Feb 19, 2019

Would be great if someone could report this works as expected!

group_nest seems to be a whole other problem, as it is currently implemented.

@EhrmannS

EhrmannS commented Feb 26, 2019

Not sure if this is related, but it seems I can't load sf when I load tidyverse first.

> library(tidyverse)
── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.3.0
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ───────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
> library(sf)
Error: package or namespace load failed for ‘sf’:
 .onLoad failed in loadNamespace() for 'sf', details:
  call: get(genname, envir = envir)
  error: object 'group_map' not found
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.3.0   stringr_1.4.0   dplyr_0.7.8     purrr_0.3.0     readr_1.3.1     tidyr_0.8.2     tibble_2.0.1   
[8] ggplot2_3.1.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       cellranger_1.1.0 pillar_1.3.1     compiler_3.5.2   plyr_1.8.4       bindr_0.1.1     
 [7] class_7.3-15     tools_3.5.2      jsonlite_1.6     lubridate_1.7.4  nlme_3.1-137     gtable_0.2.0    
[13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.1      DBI_1.0.0        cli_1.0.1        rstudioapi_0.9.0
[19] yaml_2.2.0       haven_2.0.0      bindrcpp_0.2.2   e1071_1.7-0.1    withr_2.1.2      xml2_1.2.0      
[25] httr_1.4.0       generics_0.0.2   hms_0.4.2        classInt_0.3-1   grid_3.5.2       tidyselect_0.2.5
[31] glue_1.3.0       R6_2.3.0         readxl_1.2.0     modelr_0.1.3     magrittr_1.5     units_0.6-2     
[37] backports_1.1.3  scales_1.0.0     rvest_0.3.2      assertthat_0.2.0 colorspace_1.4-0 stringi_1.3.1   
[43] lazyeval_0.2.1   munsell_0.5.0    broom_0.5.1      crayon_1.3.4 

When I load them the other way round, tidyverse reports an error on the same call for group_map and group_split, but loads nevertheless.

@edzer
Member

edzer commented Feb 26, 2019

You'll have to update dplyr to >= 0.8-0; since sf only Suggests: dplyr, it can't enforce this by installing or loading.
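Since sf cannot enforce the dplyr version itself, a script can check it before loading sf. This is only a sketch; the variable name dplyr_ok and the message text are made up for illustration.

```r
# TRUE only if dplyr is installed and at least version 0.8.0,
# the release that introduced group_map()
dplyr_ok <- requireNamespace("dplyr", quietly = TRUE) &&
  utils::packageVersion("dplyr") >= "0.8.0"

if (!dplyr_ok) {
  message("dplyr >= 0.8.0 is needed; try install.packages(\"dplyr\")")
}
```

Running this at the top of a script turns the load-time failure shown above into a clearer message.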

@EhrmannS

Thanks for the fast response! Having updated dplyr, running the above code seems to work, with a bunch of warnings.

> nc_gp_cent <- nc %>%
+     group_by(gp) %>%
+     group_map(st_centroid)
There were 21 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
2: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
3: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
4: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
5: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
6: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
7: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
8: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
9: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
10: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
11: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
12: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
13: In st_centroid.sf(.x, .y, ...) :
  st_centroid assumes attributes are constant over geometries of x
14: In st_centroid.sfc(st_geometry(x), of_largest_polygon = of_largest_polygon) :
  st_centroid does not give correct centroids for longitude/latitude data
15: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
16: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
17: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
18: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
19: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
20: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
21: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
>

@mattbk
Author

mattbk commented Mar 7, 2019

I can confirm that the current master works, with the same results as @EhrmannS. Thanks!

The "may not preserve their attributes" warning means that the CRS information has been lost. I am working around this by copying the CRS from another variable, e.g., st_crs(nc_gp_cent) <- st_crs(nc).
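A small sketch of that workaround follows. Because the CRS loss depends on the dplyr and sf versions in use, the loss is simulated here by setting the CRS to NA; the object name pts is hypothetical.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Centroids of the counties; warnings about lon/lat data are suppressed
pts <- suppressWarnings(st_centroid(nc))

# Simulate the situation described above: the result has lost its CRS
st_crs(pts) <- NA
is.na(st_crs(pts))

# Workaround: copy the CRS back from the original object
st_crs(pts) <- st_crs(nc)
```

Note that assigning with st_crs<- only labels the coordinates and does not reproject them, so it is appropriate only when the coordinates are known to be in that CRS already, as they are here.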
