Skip to content

Filtering a sf object could be easier, and also prettier #1148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pasipasi123 opened this issue Sep 12, 2019 · 5 comments
Closed

Filtering a sf object could be easier, and also prettier #1148

pasipasi123 opened this issue Sep 12, 2019 · 5 comments

Comments

@pasipasi123
Copy link

I had a similar problem as in this SO question and I started thinking that maybe sf could use a filtering function, eg. st_filter().

Since the default error message for when one or both objects are not sf objects is that the crs's don't match, I've also added a more informative error message.

The default method is st_intersects but one can specify another by using eg, st_filter(x, y, st_crosses).

What do you think?

library(sf)
library(tidyverse)

nc <- st_read(system.file("shape/nc.shp", package="sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  null_geoms <- sapply(geoms_attr, is.null)
  
  if(any(null_geoms)) {
    stop("`st_geometry` should not be `NULL`")
  }
  
  .x[f(.x, .y, sparse = FALSE), ]
}

nc %>% st_filter(mtcars)
#> Error in st_filter(., mtcars): `st_geometry` should not be `NULL`

my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>% 
  st_as_sf(coords = c("x", "y")) %>% 
  summarize() %>% 
  st_cast("LINESTRING") %>% 
  st_set_crs(4267)

ggplot(nc) + 
  geom_sf() + 
  geom_sf(data = my_line)

nc %>% 
  st_filter(my_line) %>% 
  ggplot() + 
  geom_sf()
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar

Created on 2019-09-12 by the reprex package (v0.3.0)

@sheffe
Copy link

sheffe commented Sep 16, 2019

I like this concept a lot. I grepped a few recent projects and found that more than half of my st_join calls were doing this operation via st_join(..., left = FALSE). An inner join has the same row-subsetting consequences as the proposed st_filter, but the downside is that it drags extra columns along. I often use st_join(..., y = dplyr::select(foo, geometry), left = FALSE) or similar.

A minor note: it's counterintuitive to me, but I've found that st_join(..., left = FALSE) is consistently faster for subsetting than using the geometry predicates like st_intersects directly. Here's an example building on the code above:

library(sf)
library(tidyverse)

nc <- st_read(system.file("shape/nc.shp", package="sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  # Leaving the error checks out for a direct comparison in function timings
  # geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  # null_geoms <- sapply(geoms_attr, is.null)
  # if(any(null_geoms)) {
  #   stop("`st_geometry` should not be `NULL`")
  # }
  
  .x[f(.x, .y, sparse = FALSE), ]
}

my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>% 
  st_as_sf(coords = c("x", "y")) %>% 
  summarize() %>% 
  st_cast("LINESTRING") %>% 
  st_set_crs(4267)

suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = nc, y = my_line, 
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = nc, .y = my_line),
    times = 100
  )
}))

Results are pretty close here on the small dataset:

Unit: milliseconds
                                                                     expr      min       lq     mean   median       uq      max neval
 sf::st_join(x = nc, y = my_line, join = sf::st_intersects, left = FALSE) 3.475222 3.555523 3.813325 3.645118 3.894801 8.765784   100
                                         st_filter(.x = nc, .y = my_line) 3.637339 3.689342 3.877169 3.722460 3.899177 9.239481   100

But the timing gap grows considerably over larger datasets -- median time is ~10% better here.

guilford_roads <- tigris::roads(state = "37", county = "081", class = "sf") %>%
  sf::st_transform(4267)

suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = guilford_roads, y = my_line, 
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = guilford_roads, .y = my_line),
    times = 25
  )
}))
Unit: milliseconds
                                                                                      expr      min       lq     mean  median       uq      max neval
 sf::st_join(x = guilford_roads, y = my_line, join = sf::st_intersects,      left = FALSE) 183.3485 187.7024 192.1303 190.482 197.0052 205.0862    25
                                              st_filter(.x = guilford_roads, .y = my_line) 194.5699 200.8530 207.8659 207.605 210.8725 230.3960    25

When subsetting really large sf objects like US Census Block files (~11M detailed shapes), I have found nearly 20% improvements consistently. I can't explain this behavior offhand.

@edzer
Copy link
Member

edzer commented Sep 17, 2019

@pasipasi123 that approach does not hold when my_line (or .y) holds more than one feature:

ml = rbind(my_line,my_line)
nc %>% st_filter(ml)
...
Simple feature collection with 16 features and 14 fields (with 8 geometries empty)
...

Two approaches currently work:

nc[ml,] # which also has an op argument to control which predicate to use

and

nc %>% filter(lengths(st_intersects(., ml)) > 0)

Am I right that your proposal is to make st_filter work like the second, but written as

nc %>% st_filter(ml)

with a parameter to control the predicate, defaulting to st_intersects?

@pasipasi123
Copy link
Author

Am I right that your proposal is to make st_filter work like the second, but written as

nc %>% st_filter(ml)

with a parameter to control the predicate, defaulting to st_intersects?

Yes, that's what I was after.

@edzer
Copy link
Member

edzer commented Sep 17, 2019

OK, that would then be

st_filter = function(.x, .y, .pred = st_intersects) {
    # check that dplyr is loaded, then
    filter(.x, lengths(.pred(.x, .y)) > 0)
}

nc %>% st_filter(ml)

@edzer edzer closed this as completed in d00445a Sep 17, 2019
@edzer
Copy link
Member

edzer commented Sep 17, 2019

@sheffe I think there is a substantial difference: when y has more than one feature, st_join may return more records than x has, filter never would do this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants