[Request] dplyr::bind_rows() for sf #49

Closed
tiernanmartin opened this issue Nov 4, 2016 · 15 comments

@tiernanmartin

commented Nov 4, 2016

Building on the list of dplyr verbs requested in edzer/sfr#42, could you please consider adding dplyr::bind_rows()?

This function makes it easy to combine dataframe-like objects, which would also be useful for sf objects.

Unlike rbind() or spRbind(), the bind_rows() function allows the merger of objects with non-matching columns, filling any unshared columns with NA. I find this convenience feature saves me a lot of time, even if it does lead to the creation of the occasional ugly dataframe.

There are probably some details that would need to be worked out with the geom list-columns, especially in cases where sf objects with differing geometry classes are present. Perhaps you could coerce the geom column to geometry type: GEOMETRY?
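To make the column-filling behaviour concrete, here is a minimal sketch with plain data frames rather than sf objects. The names `df1` and `df2` are invented for illustration only.

```r
library(dplyr)

# df1 and df2 are illustrative: both have column `a`, only df2 has `b`.
df1 <- data.frame(a = 1)
df2 <- data.frame(a = 2, b = 3)

# bind_rows() matches columns by name and fills the gaps with NA.
combined <- bind_rows(df1, df2)
combined

# Base rbind() refuses data frames whose columns do not match,
# so the error message is captured here instead of stopping the script.
rbind_result <- tryCatch(rbind(df1, df2),
                         error = function(e) conditionMessage(e))
rbind_result
```

The first row of `combined` gets `NA` in column `b`, which is the convenience being requested for sf objects.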

@edzer

Member

commented Nov 4, 2016

I don't see how this can be done, since neither bind_rows nor bind_rows_ (which does the work) is a generic:

> methods(bind_rows_)
Error in methods(bind_rows_) : object 'bind_rows_' not found

@hadley maybe I'm overlooking something?

@kendonB

Contributor

commented Nov 4, 2016

bind_cols also likely has a use case. Say I'm doing some distributed operation by column, and I want to bring the pieces back together, for example.

@hadley

Contributor

commented Nov 4, 2016

The problem is that I don't know how to make an efficient generic. I could possibly provide a generic for restoring attributes afterwards. I'm not sure what the best approach is.

@tiernanmartin

Author

commented Nov 5, 2016

I put together a test exploring the way sf objects interact with both rbind() and bind_rows(). The test shows two fairly common situations when working with vector data:

  1. Needing to combine datasets with different geometry types
  2. Needing to combine datasets with non-matching columns

It sounds like the larger question of efficient generics needs to be resolved before the bind_* verbs can be adapted for sf. In the meantime, perhaps a function could be added to make it easier to combine two sfc's with different geometry types into a single sfc with a GEOMETRY type.
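As a rough illustration of what combining two geometry sets of different types looks like, here is a sketch that assumes a reasonably recent sf release, where `c()` on sfc objects is available. The objects `pts` and `lines` are made up for the example, and the resulting class may differ in older versions.

```r
library(sf)

# Two illustrative geometry sets of different types.
pts   <- st_sfc(st_point(c(0, 1)))
lines <- st_sfc(st_linestring(rbind(c(0, 0), c(1, 1))))

# c() concatenates sfc objects; with mixed types the result can no longer
# be a single-type set, so it is reported as a generic GEOMETRY set.
mixed <- c(pts, lines)
class(mixed)
st_geometry_type(mixed)
```

Each element keeps its own type (POINT, LINESTRING); only the container becomes the mixed-type class.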

edzer added a commit that referenced this issue Nov 6, 2016

add [c|r]bind methods for sf objects, address #49 (comment)
Signed-off-by: Edzer Pebesma <edzer.pebesma@uni-muenster.de>

@edzer

Member

commented Nov 6, 2016

Thanks for your test, @tiernanmartin ! cbind and rbind now work the way they do in base, except that a cbind on two sf objects generates a warning that multiple geometries are not allowed and that it is dropping all but the first geometry list-column.

Interestingly, bind_cols(sf_pol2,sf_mpol2) works, although it retains the secondary geometry, which is what st_sf would not do. It will be dropped with a warning when we do st_sf(bind_cols(sf_pol2,sf_mpol2)).

Also, unlike base::cbind and dplyr::bind_cols, sf::cbind renames duplicate variables. Maybe I missed it, but I don't see how duplication of variable names fits in a tidyverse.

> bind_cols(data.frame(a=1:2), data.frame(a=4:5))    
  a a
1 1 4
2 2 5

As for bind_rows: I don't see anything I can do in sf for this. @hadley: is it on purpose that bind_cols retains all attributes (of the object, as well as of its columns) but bind_rows does not?

@hadley

Contributor

commented Nov 6, 2016

The duplicate name issue is definitely a bug. I'm not sure the semantics on bind_rows are well defined, but it should probably preserve the attributes, at least of the first df. Maybe we can preserve the performance of bind_rows but make it generic by creating a method for preserving attributes.

@edzer

Member

commented Nov 6, 2016

As we see in @tiernanmartin 's test above and rbind.sf, the geometry needs to be postprocessed anyway in case two different geometry types are rbind-ed, so for bind_rows we also need a mechanism where sf can provide a method instance for this, and take care of geometry type mixing.
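A small sketch of the geometry type mixing described here, assuming an sf version that includes the rbind method for sf objects. The objects `sf_pt` and `sf_ln` are hypothetical.

```r
library(sf)

# Illustrative objects: same attribute column, different geometry types.
sf_pt <- st_sf(a = 1, geom = st_sfc(st_point(c(0, 1))))
sf_ln <- st_sf(a = 2, geom = st_sfc(st_linestring(rbind(c(0, 0), c(1, 1)))))

# rbind() uses the method that sf provides for sf objects, which also
# has to reconcile the two geometry types in the combined column.
both <- rbind(sf_pt, sf_ln)
both
st_geometry_type(both)
```

A bind_rows() method for sf would need the same kind of post-processing step for the geometry column.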

@hadley

Contributor

commented Nov 6, 2016

Maybe bind_rows() should fall back to a bind_rows() that works with a pair of data frames. You would lose a lot of the efficiency, but at least that's better than not working.

@edzer

Member

commented Apr 23, 2017

With dplyr 0.5.0.9004, this still doesn't work:

library(sf)
a  = st_sf(a=1, geom=st_sfc(st_point(0:1)))
library(dplyr)
b = bind_rows(a, a)
# Warning messages:
# 1: In bind_rows_(x, .id) :
#   Vectorizing 'sfc_POINT' elements may not preserve their attributes
# 2: In bind_rows_(x, .id) :
#   Vectorizing 'sfc_POINT' elements may not preserve their attributes
b
# Error in .subset2(x, i, exact = exact) : 
#   attempt to select less than one element in get1index
attributes(b)
# $names
# [1] "a"    "geom"
# 
# $row.names
# [1] 1 2
# 
# $class
# [1] "sf"         "data.frame"
### -> sf_column and agr are missing

I don't see anything we can do about this on the sf side, and propose to close this issue here.

@hadley

Contributor

commented Apr 23, 2017

Yeah, it needs some pretty deep changes on our side.

@jsta

Contributor

commented Oct 30, 2018

My workaround for this issue is to temporarily remove geometries, bind, and rejoin.

library(sf)
library(dplyr)

a                     <- st_sf(a=1, geom=st_sfc(st_point(0:1)))
a_nogeom              <- a
st_geometry(a_nogeom) <- NULL

b <- bind_rows(a_nogeom, a_nogeom)
b <- dplyr::left_join(b, a, by = "a") 
b
#  a        geom
# 1 1 POINT (0 1)
# 2 1 POINT (0 1)

@Robinlovelace

Contributor

commented Oct 30, 2018

What's wrong with rbind(), out of interest? See here for context: https://geocompr.github.io/geocompkg/articles/tidyverse-pitfalls.html

@jsta

Contributor

commented Oct 30, 2018

Binding more than 2 objects. Looks like we both arrived at roughly the same strategy: https://geocompr.github.io/geocompkg/articles/tidyverse-pitfalls.html#pitfall-binding-rows

@adrfantini

commented Oct 30, 2018

What's wrong with do.call('rbind')?
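For reference, a minimal sketch of the do.call approach for more than two objects. It assumes every object has the same columns, and the list `sf_list` is invented for the example.

```r
library(sf)

# Illustrative list of three sf objects with identical columns.
sf_list <- lapply(1:3, function(i) {
  st_sf(a = i, geom = st_sfc(st_point(c(i, i))))
})

# do.call() passes the list elements as separate arguments to rbind().
all_rows <- do.call(rbind, sf_list)
all_rows
```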

@ramarty

commented May 21, 2019

do.call('rbind') works great when all the columns are the same. When they're not, I use a solution similar to @jsta's:

library(sf)
library(dplyr)

# Row-bind a list of sf objects whose attribute columns may differ.
bind_rows_sf <- function(sf_list){
  # Concatenate every geometry of every object, not only the first row of each
  sfc_all <- do.call(c, lapply(sf_list, st_geometry))
  # Drop the geometries and let bind_rows() fill unshared columns with NA
  df <- lapply(sf_list, function(sf) st_set_geometry(sf, NULL)) %>% bind_rows

  sf_appended <- st_sf(df, geom = sfc_all)

  return(sf_appended)
}

sf_1 <- st_sf(data.frame(a=1, geom=st_sfc(st_point(0:1))))
sf_2 <- st_sf(data.frame(a=2, b=4, geom=st_sfc(st_point(1:2))))
sf_3 <- st_sf(data.frame(a=3, b=5, c=6, geom=st_sfc(st_point())))
sf_list <- list(sf_1, sf_2, sf_3)

sf_123 <- sf_list %>% bind_rows_sf
sf_123