Faster `s2_dwithin_matrix()`; add `s2_prepared_dwithin()` #174

paleolimbot · 2022-05-22T01:08:13Z

Fixes #157. Before this PR, the index wasn't efficiently being used for a matrix within distance query. I also added s2_prepred_dwithin() to match the added GEOS function, although the "prepare" step is much more expensive here. Still, it's super useful for the maybe common s2_prepared_dwithin(big_long_points_vector, a_single_polygon_vector).

Before this PR:

library(s2)

points <- s2_point(runif(1e4, -1, 1), runif(1e4, -1, 1), runif(1e4, -1, 1))
geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
countries <- s2_data_countries()

bench::mark(
  s2_dwithin_matrix(countries, geog, 1e6),
  s2_dwithin_matrix(geog, countries, 1e6),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                <bch:tm> <bch:t>     <dbl> <bch:byt>
#> 1 s2_dwithin_matrix(countries, geog, 1e+06)    3.03s   3.03s     0.330   125.7KB
#> 2 s2_dwithin_matrix(geog, countries, 1e+06)    2.93s   2.93s     0.342    80.8KB
#> # … with 1 more variable: `gc/sec` <dbl>

bench::mark(
  s2_dwithin(geog, countries[1], 1e6)
  # s2_prepared_dwithin(geog, countries[1], 1e6)
)
#> # A tibble: 1 × 6
#>   expression                                 min   median `itr/sec` mem_alloc
#>   <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 s2_dwithin(geog, countries[1], 1e+06)   11.3ms   11.7ms      83.9     296KB
#> # … with 1 more variable: `gc/sec` <dbl>

After this PR:

library(s2)

points <- s2_point(runif(1e4, -1, 1), runif(1e4, -1, 1), runif(1e4, -1, 1))
geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
countries <- s2_data_countries()

bench::mark(
  s2_dwithin_matrix(countries, geog, 1e6),
  s2_dwithin_matrix(geog, countries, 1e6),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                <bch:tm> <bch:t>     <dbl> <bch:byt>
#> 1 s2_dwithin_matrix(countries, geog, 1e+06)    165ms   166ms      6.02   137.1KB
#> 2 s2_dwithin_matrix(geog, countries, 1e+06)    428ms   428ms      2.34    80.7KB
#> # … with 1 more variable: `gc/sec` <dbl>

bench::mark(
  s2_dwithin(geog, countries[1], 1e6),
  s2_prepared_dwithin(geog, countries[1], 1e6)
)
#> # A tibble: 2 × 6
#>   expression                                          min   median `itr/sec`
#>   <bch:expr>                                     <bch:tm> <bch:tm>     <dbl>
#> 1 s2_dwithin(geog, countries[1], 1e+06)            8.86ms    9.2ms      108.
#> 2 s2_prepared_dwithin(geog, countries[1], 1e+06)   3.16ms   3.25ms      305.
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

TODO: Add tests for s2_prepared_dwithin() and make sure that the code paths I expect are getting tested for s2_dwithin_matrix() are getting tested.

(@rsbivand sorry for taking so long to get here but hopefully this is what you were after!)

codecov-commenter · 2022-05-22T01:27:25Z

Codecov Report

Merging #174 (dfbece7) into main (b22a4c0) will increase coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head dfbece7 differs from pull request most recent head 9878f81. Consider uploading reports for the commit 9878f81 to get more accurate results

@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   93.93%   94.02%   +0.08%     
==========================================
  Files          49       49              
  Lines        3463     3514      +51     
==========================================
+ Hits         3253     3304      +51     
  Misses        210      210

Impacted Files	Coverage Δ
R/s2-matrix.R	`100.00% <100.00%> (ø)`
R/s2-predicates.R	`100.00% <100.00%> (ø)`
src/s2-matrix.cpp	`97.60% <100.00%> (+0.22%)`	⬆️
src/s2-predicates.cpp	`98.92% <100.00%> (+0.37%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b22a4c0...9878f81. Read the comment docs.

edzer · 2022-05-22T08:35:09Z

Looks like a game changer!!

rsbivand · 2022-05-22T19:38:25Z

Thanks, only tested small data sets yet, but see major improvements. s2::s2_closest_edges() still faster as it does min-bound to max-bound in one pass, while s2::s2_dwithin_matrix() needs two passes for lower and upper bounds. Certainly both of these are a clear improvement on previous results!

paleolimbot · 2022-05-23T01:10:36Z

I should have checked s2_closest_edges()! There's definitely room for improvement, particularly for the case where we know that one or both of the inputs are points. There's no system right now for that kind of dispatch (I'm hoping geoarrow will help, since geoarrow-encoded arrays have to be typed as all points/lines/polygons/etc).

The code I was using to test the complexity is below. Apparently it's still linear, just a lot faster...if you switch geog and countries (i.e., use the index on the huge vector of points) it switches to quadratic complexity, which seemed odd to me but I figured I'd merge the huge improvement and tackle fine-tuning if anybody started complaining.

library(s2)

countries <- s2_data_countries()

bench_points <- function(n) {
  points <- s2_point(runif(n, -1, 1), runif(n, -1, 1), runif(n, -1, 1))
  geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
  tibble::tibble(
    n = n,
    bench::mark(s2_dwithin_matrix(geog, countries, 1e6))[2:3]
  )
}

results <- rbind(
  bench_points(1e2),
  bench_points(3e2),
  bench_points(5e2),
  bench_points(7e2),
  bench_points(9e2),
  bench_points(1e3),
  bench_points(3e3),
  bench_points(5e3),
  bench_points(7e3),
  bench_points(9e3),
  bench_points(1e5),
  bench_points(5e5),
  bench_points(1e6)
)
plot(median ~ n, data = results)

rsbivand · 2022-05-23T07:46:05Z

s2::s2_closest_edges() is still a little better, mostly because it avoids running s2::s2_dwithin_matrix() twice when the minimum distance is > 0; however, s2::s2_closest_edges() needs an arbitrary k=, which s2::s2_dwithin_matrix() does not, so avoids mistakes if k= is set too tightly. Both need post-processing to drop self-neighours, but that is as expected and is specific to spatial neighbours (in most cases zero on the diagonal for (sparse) matrix representations of the graph). Looks very good now - any plans for submitting 1.0-8 ? Would be good to have, spdep::dnearneigh() conditions on packageVersion("s2") > "1.0.7" ! May I add you as a contributor to spdep?

paleolimbot · 2022-05-23T17:17:59Z

I'd be honoured to have contributor status!

In some future interface that's maybe arrow array in -> arrow array out, the output will probably be data.frame(index_x = ..., index_y = ...) instead of list(y_indices1, y_indices2...) (This is how geopandas produces its join keys for that type of operation, and it's faster because it only has to allocate 2 arrays). That will at least make the self-exclusion easier because the diagonal will then be index_x == index_y and might make some other index math more vectorizable.

I'd like to work through the open s2 issue list while I'm "in it", but I'll set a hard deadline of May 29 (Sunday) to start the revdep process since I'm a fantastically bad estimator of what I can get done during my kids' naps.

rsbivand · 2022-05-24T10:00:18Z

Updated docs and a vignette with a subsection describing spherical distance-based neighbours ready for your s2 release. I don't know what distribution fits nap duration, I'm just as poor at guessing for kids and their kids too. I think they help generate useful thoughts, but no idea how.

paleolimbot · 2022-05-28T00:23:39Z

Anywhere from 0 minutes to 3 hours! Right now they both have sore throats and fevers and won't nap unless they are literally clinging to us (although, fingers crossed, they are both currently asleep).

paleolimbot added 2 commits May 21, 2022 15:15

add s2_prepared_dwithin

9a7e69c

efficiently use index for s2_dwithin_matrix

0675dbe

rsbivand mentioned this pull request May 22, 2022

Error building on GitHub Actions + R 4.2 #173

Closed

paleolimbot added 2 commits May 22, 2022 13:42

test new dwithin implementations

9878f81

update NEWS

d5b086b

paleolimbot merged commit d32b372 into main May 22, 2022

rsbivand added a commit to r-spatial/spdep that referenced this pull request May 23, 2022

changes to match r-spatial/s2#174

8416ab5

paleolimbot deleted the dwithin branch May 24, 2022 11:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster `s2_dwithin_matrix()`; add `s2_prepared_dwithin()` #174

Faster `s2_dwithin_matrix()`; add `s2_prepared_dwithin()` #174

paleolimbot commented May 22, 2022

codecov-commenter commented May 22, 2022 •

edited

Loading

edzer commented May 22, 2022

rsbivand commented May 22, 2022

paleolimbot commented May 23, 2022

rsbivand commented May 23, 2022 •

edited

Loading

paleolimbot commented May 23, 2022

rsbivand commented May 24, 2022

paleolimbot commented May 28, 2022

Faster s2_dwithin_matrix(); add s2_prepared_dwithin() #174

Faster s2_dwithin_matrix(); add s2_prepared_dwithin() #174

Conversation

paleolimbot commented May 22, 2022

codecov-commenter commented May 22, 2022 • edited Loading

Codecov Report

edzer commented May 22, 2022

rsbivand commented May 22, 2022

paleolimbot commented May 23, 2022

rsbivand commented May 23, 2022 • edited Loading

paleolimbot commented May 23, 2022

rsbivand commented May 24, 2022

paleolimbot commented May 28, 2022

Faster `s2_dwithin_matrix()`; add `s2_prepared_dwithin()` #174

Faster `s2_dwithin_matrix()`; add `s2_prepared_dwithin()` #174

codecov-commenter commented May 22, 2022 •

edited

Loading

rsbivand commented May 23, 2022 •

edited

Loading