Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster s2_dwithin_matrix(); add s2_prepared_dwithin() #174

Merged
merged 4 commits into from
May 22, 2022
Merged

Conversation

paleolimbot
Copy link
Collaborator

Fixes #157. Before this PR, the index wasn't efficiently being used for a matrix within distance query. I also added s2_prepred_dwithin() to match the added GEOS function, although the "prepare" step is much more expensive here. Still, it's super useful for the maybe common s2_prepared_dwithin(big_long_points_vector, a_single_polygon_vector).

Before this PR:

library(s2)

points <- s2_point(runif(1e4, -1, 1), runif(1e4, -1, 1), runif(1e4, -1, 1))
geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
countries <- s2_data_countries()

bench::mark(
  s2_dwithin_matrix(countries, geog, 1e6),
  s2_dwithin_matrix(geog, countries, 1e6),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                <bch:tm> <bch:t>     <dbl> <bch:byt>
#> 1 s2_dwithin_matrix(countries, geog, 1e+06)    3.03s   3.03s     0.330   125.7KB
#> 2 s2_dwithin_matrix(geog, countries, 1e+06)    2.93s   2.93s     0.342    80.8KB
#> # … with 1 more variable: `gc/sec` <dbl>

bench::mark(
  s2_dwithin(geog, countries[1], 1e6)
  # s2_prepared_dwithin(geog, countries[1], 1e6)
)
#> # A tibble: 1 × 6
#>   expression                                 min   median `itr/sec` mem_alloc
#>   <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 s2_dwithin(geog, countries[1], 1e+06)   11.3ms   11.7ms      83.9     296KB
#> # … with 1 more variable: `gc/sec` <dbl>

After this PR:

library(s2)

points <- s2_point(runif(1e4, -1, 1), runif(1e4, -1, 1), runif(1e4, -1, 1))
geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
countries <- s2_data_countries()

bench::mark(
  s2_dwithin_matrix(countries, geog, 1e6),
  s2_dwithin_matrix(geog, countries, 1e6),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                     min  median `itr/sec` mem_alloc
#>   <bch:expr>                                <bch:tm> <bch:t>     <dbl> <bch:byt>
#> 1 s2_dwithin_matrix(countries, geog, 1e+06)    165ms   166ms      6.02   137.1KB
#> 2 s2_dwithin_matrix(geog, countries, 1e+06)    428ms   428ms      2.34    80.7KB
#> # … with 1 more variable: `gc/sec` <dbl>

bench::mark(
  s2_dwithin(geog, countries[1], 1e6),
  s2_prepared_dwithin(geog, countries[1], 1e6)
)
#> # A tibble: 2 × 6
#>   expression                                          min   median `itr/sec`
#>   <bch:expr>                                     <bch:tm> <bch:tm>     <dbl>
#> 1 s2_dwithin(geog, countries[1], 1e+06)            8.86ms    9.2ms      108.
#> 2 s2_prepared_dwithin(geog, countries[1], 1e+06)   3.16ms   3.25ms      305.
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>

TODO: Add tests for s2_prepared_dwithin() and make sure that the code paths I expect are getting tested for s2_dwithin_matrix() are getting tested.

(@rsbivand sorry for taking so long to get here but hopefully this is what you were after!)

@codecov-commenter
Copy link

codecov-commenter commented May 22, 2022

Codecov Report

Merging #174 (dfbece7) into main (b22a4c0) will increase coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head dfbece7 differs from pull request most recent head 9878f81. Consider uploading reports for the commit 9878f81 to get more accurate results

@@            Coverage Diff             @@
##             main     #174      +/-   ##
==========================================
+ Coverage   93.93%   94.02%   +0.08%     
==========================================
  Files          49       49              
  Lines        3463     3514      +51     
==========================================
+ Hits         3253     3304      +51     
  Misses        210      210              
Impacted Files Coverage Δ
R/s2-matrix.R 100.00% <100.00%> (ø)
R/s2-predicates.R 100.00% <100.00%> (ø)
src/s2-matrix.cpp 97.60% <100.00%> (+0.22%) ⬆️
src/s2-predicates.cpp 98.92% <100.00%> (+0.37%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b22a4c0...9878f81. Read the comment docs.

@edzer
Copy link
Member

edzer commented May 22, 2022

Looks like a game changer!!

@paleolimbot paleolimbot merged commit d32b372 into main May 22, 2022
@rsbivand
Copy link
Member

Thanks, only tested small data sets yet, but see major improvements. s2::s2_closest_edges() still faster as it does min-bound to max-bound in one pass, while s2::s2_dwithin_matrix() needs two passes for lower and upper bounds. Certainly both of these are a clear improvement on previous results!

@paleolimbot
Copy link
Collaborator Author

I should have checked s2_closest_edges()! There's definitely room for improvement, particularly for the case where we know that one or both of the inputs are points. There's no system right now for that kind of dispatch (I'm hoping geoarrow will help, since geoarrow-encoded arrays have to be typed as all points/lines/polygons/etc).

The code I was using to test the complexity is below. Apparently it's still linear, just a lot faster...if you switch geog and countries (i.e., use the index on the huge vector of points) it switches to quadratic complexity, which seemed odd to me but I figured I'd merge the huge improvement and tackle fine-tuning if anybody started complaining.

library(s2)

countries <- s2_data_countries()

bench_points <- function(n) {
  points <- s2_point(runif(n, -1, 1), runif(n, -1, 1), runif(n, -1, 1))
  geog <- s2::as_s2_geography(s2::as_s2_lnglat(points))
  tibble::tibble(
    n = n,
    bench::mark(s2_dwithin_matrix(geog, countries, 1e6))[2:3]
  )
}

results <- rbind(
  bench_points(1e2),
  bench_points(3e2),
  bench_points(5e2),
  bench_points(7e2),
  bench_points(9e2),
  bench_points(1e3),
  bench_points(3e3),
  bench_points(5e3),
  bench_points(7e3),
  bench_points(9e3),
  bench_points(1e5),
  bench_points(5e5),
  bench_points(1e6)
)
plot(median ~ n, data = results)

@rsbivand
Copy link
Member

rsbivand commented May 23, 2022

s2::s2_closest_edges() is still a little better, mostly because it avoids running s2::s2_dwithin_matrix() twice when the minimum distance is > 0; however, s2::s2_closest_edges() needs an arbitrary k=, which s2::s2_dwithin_matrix() does not, so avoids mistakes if k= is set too tightly. Both need post-processing to drop self-neighours, but that is as expected and is specific to spatial neighbours (in most cases zero on the diagonal for (sparse) matrix representations of the graph). Looks very good now - any plans for submitting 1.0-8 ? Would be good to have, spdep::dnearneigh() conditions on packageVersion("s2") > "1.0.7" ! May I add you as a contributor to spdep?

rsbivand added a commit to r-spatial/spdep that referenced this pull request May 23, 2022
@paleolimbot
Copy link
Collaborator Author

I'd be honoured to have contributor status!

In some future interface that's maybe arrow array in -> arrow array out, the output will probably be data.frame(index_x = ..., index_y = ...) instead of list(y_indices1, y_indices2...) (This is how geopandas produces its join keys for that type of operation, and it's faster because it only has to allocate 2 arrays). That will at least make the self-exclusion easier because the diagonal will then be index_x == index_y and might make some other index math more vectorizable.

I'd like to work through the open s2 issue list while I'm "in it", but I'll set a hard deadline of May 29 (Sunday) to start the revdep process since I'm a fantastically bad estimator of what I can get done during my kids' naps.

@rsbivand
Copy link
Member

Updated docs and a vignette with a subsection describing spherical distance-based neighbours ready for your s2 release. I don't know what distribution fits nap duration, I'm just as poor at guessing for kids and their kids too. I think they help generate useful thoughts, but no idea how.

@paleolimbot paleolimbot deleted the dwithin branch May 24, 2022 11:50
@paleolimbot
Copy link
Collaborator Author

Anywhere from 0 minutes to 3 hours! Right now they both have sore throats and fevers and won't nap unless they are literally clinging to us (although, fingers crossed, they are both currently asleep).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

s2_dwithin() and s2_dwithin_matrix() are slow
4 participants