
bind_rows_sf for combining lists of sf objects to single dataframe #798

Closed

sheffe opened this issue Jul 15, 2018 · 23 comments

@sheffe
sheffe commented Jul 15, 2018

Filing this issue after a brief discussion in r-spatial/mapedit#46 and a bit more research.

Background: split/map/combine patterns are very useful for sf objects too, and I couldn't find a fast and officially-recommended way of recombining a list of sf objects to a single data frame.
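For concreteness, a minimal sketch of the pattern in question (the grouping column and the per-element operation are arbitrary placeholders):

library(sf)
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
# split: one sf object per group
list_of_sf <- split(nc, nc$SID74 > 10)
# map: apply some per-element operation (an illustrative transform here)
list_of_sf <- lapply(list_of_sf, function(x) st_transform(x, 4326))
# combine: the step this issue is asking about
combined <- do.call(rbind, list_of_sf)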

The three approaches I've found in other issues/repos were:

  • purrr::reduce(list_of_sf, sf:::rbind.sf) is the slowest (for N elements it creates N-1 intermediate objects that are progressively rbind-ed together);
  • do.call(what = sf:::rbind.sf, args = list_of_sf), which is fast for small lists but slows down past a few thousand elements or moderate data sizes (details below); and
  • mapedit:::combine_list_of_sf(list_of_sf), which scales quite well by using dplyr::bind_rows under the hood and handling geometry and attributes separately.

I profiled the three approaches in this gist, confirming the above.

Performance over 1,000 list elements:
[screenshot: benchmark results]

Performance over 10,000 list elements (dropping the purrr::reduce strategy so it runs in finite time):
[screenshot: benchmark results]

@edzer 's comments in the other issue:

  • iirc, bind_rows cannot be extended for sf objects because its first argument is ... and it is an S3 generic, unlike rbind and cbind which use a different method dispatch mechanism
  • given that bind_rows is easily a factor 20 faster than rbind.sf, I think it makes sense to add a bind_rows_sf to sf.
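For context, a small sketch of the dispatch difference described above (plain base R introspection, not sf API):

library(sf)
# rbind() dispatches on the classes of its arguments, so sf can register
# an S3 method for it:
methods("rbind")   # lists rbind.sf (among others) once sf is loaded
# dplyr::bind_rows() takes ... as its first argument, so there is no single
# argument whose class can drive S3 dispatch, which is why sf cannot hook
# into it the same way.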

As always, thanks for sf!

@adrfantini

Note that bind_rows not working with sf objects causes things like purrr::map_dfr to fail, which can be quite annoying. I'm starting to love purrr::map functions and their parallel application via furrr.

@tim-salabim
Member

For completeness, I have successfully used sf::st_as_sf(data.table::rbindlist(<list_of_sf>)) though I have not benchmarked it. Would be good to see how this approach compares to the ones benchmarked above.

@tim-salabim
Member

Ok, so the data.table approach is something else (as expected I guess):

> nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
> set.seed(1234)
> nc_list_1k <- nc %>%
+   dplyr::sample_n(size = 1000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_1k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_1k),
+                purrr::reduce(.x = nc_list_1k,
+                              .f = sf:::rbind.sf), 
+                sf::st_as_sf(data.table::rbindlist(nc_list_1k)),
+                times = 25L)
Unit: milliseconds
                                               expr         min          lq        mean      median          uq        max neval  cld
           mapedit:::combine_list_of_sf(nc_list_1k) 1175.212699 1190.186592 1219.459763 1201.873386 1234.360720 1440.05542    25   c 
   do.call(what = sf:::rbind.sf, args = nc_list_1k)  894.177444  912.931392  928.037925  925.873504  938.368144  992.35103    25  b  
 purrr::reduce(.x = nc_list_1k, .f = sf:::rbind.sf) 8783.179875 8981.547539 9215.896749 9182.200163 9371.552815 9919.67273    25    d
    sf::st_as_sf(data.table::rbindlist(nc_list_1k))    4.802647    5.438064    5.904802    5.678197    5.819243   10.58275    25 a   

> set.seed(1234)
> nc_list_10k <- nc %>%
+   dplyr::sample_n(size = 10000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_10k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_10k), 
+                sf::st_as_sf(data.table::rbindlist(nc_list_10k)),
+                times = 25L)
Unit: milliseconds
                                              expr        min          lq        mean      median         uq        max neval cld
         mapedit:::combine_list_of_sf(nc_list_10k) 12212.8645 12531.08956 13102.34104 12778.99254 13164.2622 15267.2328    25  b 
 do.call(what = sf:::rbind.sf, args = nc_list_10k) 41520.9927 42003.92886 43894.39777 42703.18971 45595.1675 52469.4150    25   c
  sf::st_as_sf(data.table::rbindlist(nc_list_10k))    61.6442    63.85123    72.41966    66.14747    70.3542   112.2899    25 a  

I think, for me personally, it is pretty clear how I will do such an operation in the future :-).

@jannes-m

Promoting data.table, I like it! I still remember @mattdowle presenting his package at the EARL conference in London a few years ago, slightly confused about all the fuss being made back then about the pipe operator, saying almost exasperatedly, "but we have been doing chaining in data.table for years". Don't get me wrong, I also like dplyr; this is just a nice anecdote I wanted to share.

@sheffe
Author

sheffe commented Jul 16, 2018

That's great! The data.table::rbindlist solution also gives an option for adding the .id column, which I also use all the time with dplyr::bind_rows. (mapedit:::combine_list_of_sf doesn't expose the .id, but then it's not an exported function.)
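For reference, a quick sketch of that option (data.table spells it idcol rather than .id; the column name "source_id" is just illustrative):

# the list names become the values of the id column:
combined <- sf::st_as_sf(
  data.table::rbindlist(nc_list_1k, idcol = "source_id")
)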

On the original mapedit issue there were some good questions on crs handling where lists may have the same/ mixed / no crs. I'm new-ish to r-spatial/sf and don't have clear thoughts on a "right way" to implement that, but wanted to highlight the questions -- I've definitely stubbed my toe on trying to bind lists of different-crs dataframes in the past. ("Past" = "3 times last week"...)

One other thought from the new-ish to sf perspective -- there's a bunch of demo text on working with sf and dplyr verbs for non-spatial columns in vignettes and how-to blogs, but relatively less on split/map/combine and common purrr workflows with sf. (Like @adrfantini I've started using this all the time.) If it's worth another smallish vignette after this direction is established, I'd be happy to take a crack at it.

@tim-salabim as an aside, I've run across several helpful tweets from you on blending sf and data.table to optimize pipelines, and always meant to dig in further. Your benchmark ^ is a nice kick to do that. Consider this an appreciative audience request for an encore, too...

@tim-salabim
Member

I will. I've just revisited a long-intended implementation for building quadtrees using Rcpp and data.table (after a gentle reminder): https://twitter.com/TimSalabim3/status/943194334333227008 Hopefully I can come up with a general function for this in the not-too-distant future... It's such a fun problem!

@shermstats

shermstats commented Jan 29, 2019

On the original mapedit issue there were some good questions on crs handling where lists may have the same / mixed / no crs. [quoted from @sheffe above]

I was recently dealing with State Plane Coordinate Systems (SPCSes) and ran into this issue. I had WGS84-projected data and wanted to get the area, in meters, of each of a few million polygons spread throughout the state of Washington. My options were:

  1. Convert all of the polygons to Washington State Plane North and calculate st_area. This is accurate for the northern half of the state and off for the southern half.
  2. Find polygons for Washington State Plane North and Washington State Plane South, spatially join each polygon to its zone to assign the appropriate CRS, calculate st_area for both CRSes separately, convert both back to WGS84, and combine the results. This would have been the most accurate, but would have involved a lot of reading documentation and writing split/apply/combine code.

I went with 1 due to time constraints. I would have gone with 2 if the split/apply/combine elements were handled automatically in sf.

As for the right way to handle this, my intuition is that we should operate on each data frame separately in a vectorized manner that takes the crs into account. It would work similarly to a dplyr group_by object, but with a different crs for each "group". For combining, I imagine an st_bind_rows function with a crs_transform argument (I'm agnostic about the actual name) that specifies which crs all of the data.frames should be transformed to.

We might also want to allow a named vector argument of c(orig_crs1 = new_crs1, ..., orig_crs_i = orig_crs_j, ...) to reduce the original data.frame collection to a smaller subset, but not necessarily a single data.frame.
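A hypothetical sketch of the simplest version of that interface (st_bind_rows and crs_transform are proposals here, not existing sf functions):

library(sf)
# bind a list of sf objects after transforming everything to one target CRS
st_bind_rows <- function(sf_list, crs_transform) {
  sf_list <- lapply(sf_list, st_transform, crs = crs_transform)
  do.call(rbind, sf_list)
}
# usage sketch: combine mixed-CRS pieces into WGS84
# combined <- st_bind_rows(list_of_sf, crs_transform = 4326)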

@edzer
Member

edzer commented Jan 30, 2019

It would help if you could provide a minimal reproducible data example, along with what you've tried.

@joshbrinks

Ok, so the data.table approach is something else (as expected I guess): [benchmarks quoted in full from @tim-salabim above]

I know this is a necro comment, but I just implemented the data.table 'melting' of a list of sf objects into a single sf object, and while it works and is fast, the newly created object's extent is set to the extent of the first element of the original list, without (to my knowledge) a simple way to reassign it.

This behavior does not occur when using do.call(rbind, ...).
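For anyone hitting the same thing, a sketch of one way to rebuild the extent after the fact (this assumes the geometry column is named "geometry"; recompute_bbox is an argument of st_sfc):

combined <- sf::st_as_sf(data.table::rbindlist(list_of_sf))
# the bbox attribute still reflects only the first list element, so rebuild
# the geometry column and let sf recompute it:
combined$geometry <- sf::st_sfc(combined$geometry, recompute_bbox = TRUE)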

@stragu

stragu commented Dec 2, 2019

If anyone rocks up here wondering if bind_rows() should work on sf objects, it looks like it will be supported in the future 0.9.0 version: tidyverse/dplyr#2457 (comment)

For the sake of completeness, here's an example and the error we currently get (with dplyr 0.8.3).

# dataframe of US states with coordinates
states <- data.frame(name = state.name,
                     x = state.center$x,
                     y = state.center$y,
                     this = "this",
                     that = "that")
# create sf subsets, with different columns
library(sf)
states1 <- st_as_sf(states[1:10,1:4], coords = c("x", "y"))
states2 <- st_as_sf(states[21:30,c(1:3,5)], coords = c("x", "y"))

# with dplyr
library(dplyr)
bind_rows(states1, states2)

Which results in the following error and warnings:

Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index
In addition: Warning messages:
1: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
2: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes

@markbneal

markbneal commented May 11, 2020

Good news: dplyr v1 is coming, and if you install the most recent version from GitHub now...
remotes::install_github("tidyverse/dplyr")
...you'll get a version where a simple bind_rows() of sf objects just works!
NZ_fixed <- bind_rows(NZ_2020, NZ_1995)
If you make a basic mistake, like binding objects with different crs (who would do that :)), you get a neat, informative error:
Error in common_crs(x, y) : coordinate reference systems not equal: use st_transform() first?
Edit: Also solves this issue.
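For completeness, a sketch of the fix that error suggests, reusing the object names from this comment:

# transform one layer to the other's CRS before binding:
NZ_1995 <- sf::st_transform(NZ_1995, sf::st_crs(NZ_2020))
NZ_fixed <- dplyr::bind_rows(NZ_2020, NZ_1995)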

@Robinlovelace
Contributor

Just hit this after my computer's CPU ran all night trying to bind together 100k linestrings broken up into 1000 chunks of 100 rows each.

I re-ran Tim's benchmark to check and yes, rbind() and, interestingly, bind_rows() come out as very slow:

library(sf)  
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
set.seed(1234)
nc_list_10 <- nc %>%
  dplyr::sample_n(size = 10, replace = TRUE) %>%
  mutate(sample_id = paste0("sample_", row_number())) %>%
  split(.$sample_id)

bench::mark(max_iterations = 2, check = FALSE,
  rbind = do.call(what = rbind, args = nc_list_10),
  bind_rows = do.call(what = bind_rows, args = nc_list_10),
  rbindlist = sf::st_as_sf(data.table::rbindlist(nc_list_10)) # incorrect bb
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rbind        3.19ms   3.27ms      306.  185.02KB        0
#> 2 bind_rows    3.98ms   4.05ms      247.  689.34KB        0
#> 3 rbindlist  236.51µs 264.04µs     3787.    3.26MB        0

Created on 2023-02-15 with reprex v2.0.2

Thoughts:

There may be new approaches, e.g. using the geos package? I know @paleolimbot is a fan of benchmarking so wonder if that would provide a good solution to the geometry column...

There may be recent developments on this; tempted to open a new issue on rbind() being slow but want to establish whether it's in scope for the package first.

atumscott added a commit to nptscot/npt that referenced this issue Feb 15, 2023
@edzer
Member

edzer commented Feb 15, 2023

It would be interesting to know whether rbindlist's speed is entirely caused by not updating the bounding box; this could be checked by st_sfc() using recompute_bbox=TRUE. I guess that not checking that CRS's of components are identical is another cause. It would be fairly simple to give rbind an option to not check CRS's; I don't think this can easily be done by bind_rows, as that is not a generic.

@kadyb
Contributor

kadyb commented Feb 15, 2023

@Robinlovelace, could you also include unlist2d from {collapse} (and see if the results are ok)? This will probably be the fastest option (probably even faster than {data.table}).

Like this: st_as_sf(unlist2d(nc_list_10, idcols = FALSE, recursive = FALSE))

Edit: This answer on SO does some magic with base unlist() and works pretty fast, but I do not know whether it will work in this case.

@Robinlovelace
Contributor

Thanks @kadyb will give it a spin!

@Robinlovelace
Contributor

Robinlovelace commented Feb 16, 2023

Update, you're right @kadyb, the collapse function is even faster and returns a data.frame:

library(sf)  
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
set.seed(1234)
nc_list_10 <- nc %>%
  dplyr::sample_n(size = 10, replace = TRUE) %>%
  mutate(sample_id = paste0("sample_", row_number())) %>%
  split(.$sample_id)

bench::mark(max_iterations = 2, check = FALSE,
  rbind = (res_rbind <<- do.call(what = rbind, args = nc_list_10)),
  bind_rows = (res_bind_rows <<- do.call(what = bind_rows, args = nc_list_10)),
  rbindlist = (res_rbindlist <<- sf::st_as_sf(data.table::rbindlist(nc_list_10))), # incorrect bb
  res_unlist2d = (res_unlist2d <<- st_as_sf(collapse::unlist2d(nc_list_10, idcols = FALSE, recursive = FALSE)))
  # ,
  # unlist = {res_unlist2d <<- lapply(1:n, function(x) nc_list_10)
  # list2DF(lapply(setNames(seq_along(.[[1]]), names(.[[1]])), \(i)
  #                unlist(lapply(., `[[`, i), FALSE, FALSE)))}
)
#> # A tibble: 4 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rbind          8.54ms   9.67ms     103.   185.02KB      0  
#> 2 bind_rows     13.84ms  13.84ms      72.3  689.34KB     72.3
#> 3 rbindlist    548.26µs 582.35µs    1717.     3.26MB      0  
#> 4 res_unlist2d 253.51µs 277.65µs    3602.  1016.23KB      0

waldo::compare(res_rbind, res_bind_rows)
#> ✔ No differences
waldo::compare(res_rbind, res_rbindlist) # bbox is wrong
#> `class(old)`: "sf" "tbl_df"     "tbl" "data.frame"
#> `class(new)`: "sf" "data.table"       "data.frame"
#> 
#> `names(old)[12:16]`: "BIR79" "SID79" "NWBIR79" "geometry"  "sample_id"
#> `names(new)[12:16]`: "BIR79" "SID79" "NWBIR79" "sample_id" "geometry" 
#> 
#> `attr(old$geometry, 'bbox')`: "((-82.96275,33.94867),(-75.77316,36.55716))"
#> `attr(new$geometry, 'bbox')`: "((-77.33221,35.81418),(-76.69376,36.24098))"
waldo::compare(res_rbind, res_unlist2d) # bbox is wrong
#> `class(old)`: "sf" "tbl_df" "tbl" "data.frame"
#> `class(new)`: "sf"                "data.frame"
#> 
#> `names(old)[12:16]`: "BIR79" "SID79" "NWBIR79" "geometry"  "sample_id"
#> `names(new)[12:16]`: "BIR79" "SID79" "NWBIR79" "sample_id" "geometry" 
#> 
#> `attr(old$geometry, 'bbox')`: "((-82.96275,33.94867),(-75.77316,36.55716))"
#> `attr(new$geometry, 'bbox')`: "((-77.33221,35.81418),(-76.69376,36.24098))"
waldo::compare(res_rbindlist, res_unlist2d) # bbox is wrong
#> `class(old)`: "sf" "data.table" "data.frame"
#> `class(new)`: "sf"              "data.frame"

Created on 2023-02-16 with reprex v2.0.2

Remaining questions:

  • What's the fastest way to fix the bbox?
  • Is there any way to speed this up in sf, and if not, is there a need to package this kind of code?

One other thought: sfheaders by @dcooley may also help here...

@edzer
Member

edzer commented Feb 16, 2023

What's the fastest way to fix the bbox?

st_sfc(recompute_bbox = TRUE)?

@kadyb
Contributor

kadyb commented Feb 16, 2023

## collapse
f1 = function(x) {
  if (length(x) == 0) stop("Empty list")
  geom_name = attr(x[[1]], "sf_column")
  # fast row-bind of attribute and geometry columns
  x = collapse::unlist2d(x, idcols = FALSE, recursive = FALSE)
  # rebuild the sfc so the bbox covers all elements, not just the first
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  x = st_as_sf(x)
}

## data.table
f2 = function(x) {
  if (length(x) == 0) stop("Empty list")
  geom_name = attr(x[[1]], "sf_column")
  # fast row-bind of attribute and geometry columns
  x = data.table::rbindlist(x, use.names = FALSE)
  # rebuild the sfc so the bbox covers all elements, not just the first
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  x = st_as_sf(x)
}

Will it be okay? Edit: If these functions work correctly, someone could now compare their performance on a real (big) dataset and we will see which is the better choice.
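A quick usage sketch, assuming the nc_list_10k list from the earlier benchmarks:

res1 = f1(nc_list_10k)
res2 = f2(nc_list_10k)
sf::st_bbox(res1)  # should now cover all elements, not just the first
identical(sf::st_bbox(res1), sf::st_bbox(res2))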

@edzer
Member

edzer commented Feb 16, 2023

I'd take the default from x: geom_name = attr(x[[1]], "sf_column") and take care of the case where x is an empty list.

@dcooley
Contributor

dcooley commented Feb 16, 2023

One other thought: sfheaders by @dcooley may also help here...

@Robinlovelace there's no out-of-the-box solution in sfheaders either, although there is sfheaders::sf_box() for calculating the bounding box.

@Robinlovelace
Contributor

These functions are impressively fast. Many thanks Krzysztof and Edzer, I really hope everyone using sf, not just enthusiasts reading rather esoteric issue comments, can benefit from these amazing speed-ups. One possibility: add collapse as a Suggests and then use the f1() option if available?

@paleolimbot
Contributor

Sorry for being late to the party here...it seems like there's a good workaround!

Neither geos nor wk can help here...combining those vectors has all the same safety/speed tradeoffs that exist in sf (e.g., checking CRS equivalence). Neither geos nor wk bother with the precomputed bounding box (since it's very cheap to compute and rarely used compared to operations that invalidate it) but that doesn't seem like it's a consideration here.

@kadyb
Contributor

kadyb commented Feb 17, 2023

So Robin, maybe just for completeness you should also include CRS and column checks, but perhaps leave them as an optional choice because they are quite slow operations.

check = function(x) {
  ref_crs = st_crs(x[[1]])
  ref_colname = colnames(x[[1]])
  # compare every element against the first
  for (i in seq_along(x)) {
    if (isFALSE(st_crs(x[[i]]) == ref_crs)) stop("Different CRS")
    if (isFALSE(all(colnames(x[[i]]) == ref_colname))) stop("Different columns")
  }
}
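A usage sketch combining this with f2() from above (nc_list_10k as before):

check(nc_list_10k)     # stops with an error on mixed CRS or columns
res = f2(nc_list_10k)  # otherwise do the fast bind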

Edit: I posted the prepared function as a gist: fastrbindsf
