REF: use STRtree.query_bulk in _area_tables_binning #110

martinfleis · 2020-12-23T10:06:26Z

This is the latest version of the single-core refactoring of area_interpolate. It turned out that using the pygeos-backed STRtree and its vectorized query_bulk in geopandas is even faster than the numba version @darribas mentioned in #104.

While pygeos is not strictly necessary and geopandas can use rtree for this operation, pygeos brings a speedup of an order of magnitude, so I am adding it to the requirements to make sure users have it installed.
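To illustrate the bulk-query pattern described here, below is a minimal sketch in pure Python. It is not the tobler or geopandas implementation: axis-aligned boxes stand in for polygons, and a brute-force loop stands in for the tree-backed query. The names `bulk_query`, `intersection_area`, and `area_table` are hypothetical. The point is the shape of the result: one call returns all candidate (source, target) index pairs at once, and the area table is filled only for those pairs.

```python
def bulk_query(sources, targets):
    """Return parallel lists of (source, target) indices whose extents intersect.

    Brute-force stand-in for a tree-backed bulk query (assumption: a real
    spatial index would return the same pairs without the nested loop).
    """
    src_idx, tgt_idx = [], []
    for i, (sx0, sy0, sx1, sy1) in enumerate(sources):
        for j, (tx0, ty0, tx1, ty1) in enumerate(targets):
            if sx0 <= tx1 and tx0 <= sx1 and sy0 <= ty1 and ty0 <= sy1:
                src_idx.append(i)
                tgt_idx.append(j)
    return src_idx, tgt_idx


def intersection_area(a, b):
    """Area shared by two (minx, miny, maxx, maxy) boxes; 0.0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def area_table(sources, targets):
    """Sparse {(source, target): shared area} table built from candidate pairs."""
    table = {}
    for i, j in zip(*bulk_query(sources, targets)):
        area = intersection_area(sources[i], targets[j])
        if area > 0:  # boxes that only touch share no area
            table[(i, j)] = area
    return table


# Two adjacent source boxes, one target straddling both, one target far away.
sources = [(0.0, 0.0, 2.0, 2.0), (2.0, 0.0, 4.0, 2.0)]
targets = [(1.0, 0.0, 3.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
table = area_table(sources, targets)
```

The distant target never enters the table, which is the saving a spatial index provides: exact intersections are computed only for pairs that can overlap.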

codecov-io · 2020-12-23T10:10:47Z

Codecov Report

Merging #110 (c652fb2) into master (c10dbda) will decrease coverage by 6.72%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
- Coverage   41.34%   34.62%   -6.73%     
==========================================
  Files          11       11              
  Lines         549      491      -58     
==========================================
- Hits          227      170      -57     
+ Misses        322      321       -1

Impacted Files	Coverage Δ
tobler/area_weighted/area_interpolate.py	`41.57% <100.00%> (-13.94%)`	⬇️


martinfleis · 2020-12-23T10:20:25Z

Timings:

from tobler.area_weighted import area_interpolate
import geopandas

# Source geometries: San Diego census tracts
p = ("https://geographicdata.science/book/_downloads/"
     "f2341ee89163afe06b42fc5d5ed38060/sandiego_tracts.gpkg")
src = geopandas.read_file(p)

# Target geometries: H3 grid, reprojected to the CRS of the source
p = ("https://geographicdata.science/book/_downloads/"
     "d740a1069144baa1302b9561c3d31afe/sd_h3_grid.gpkg")
tgt = geopandas.read_file(p).to_crs(src.crs)

# %timeit is an IPython magic, so run this in IPython or Jupyter
%timeit estimates = area_interpolate(src, tgt, ['total_pop'])

master - 2.4 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this PR - 394 ms ± 35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
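
For reference, the ratio of the two reported means can be worked out directly. This is only arithmetic on the numbers above and ignores the reported standard deviations.

```python
master_s = 2.4   # mean time per loop on master, in seconds
pr_s = 0.394     # mean time per loop on this PR (394 ms), in seconds

speedup = master_s / pr_s
print(f"{speedup:.1f}x faster")
```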

darribas · 2020-12-23T10:34:11Z

+1 on going ahead with this approach. After several experiments and some thinking, my sense is that the second test case had a different balance of polygons, and its polygons were also distributed differently over space, giving some bins very few polygons to check and others a lot, which makes the binning approach less efficient. Think of the western MSAs in the US: if we bin uniformly, the coastal buckets will hold a lot more polygons than the inland ones... Not sure if that's the case, but it's a hunch.

More generally, what we need for this operation is a spatial index and, given the ecosystem now has very good off-the-shelf options, my vote would be to jump on that wagon and enjoy further improvements down the line.
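
The hunch about uniform bins can be sketched with made-up data. The centroids below are hypothetical (they are not taken from either test case), and `bin_counts` is an illustrative helper, not a function from tobler. With most points packed into a narrow strip, a uniform grid puts nearly all of the work into a single bin.

```python
from collections import Counter


def bin_counts(points, n_bins, extent):
    """Count points per cell of a uniform n_bins x n_bins grid over extent."""
    minx, miny, maxx, maxy = extent
    dx = (maxx - minx) / n_bins
    dy = (maxy - miny) / n_bins
    counts = Counter()
    for x, y in points:
        # Clamp so points on the upper edge fall into the last cell.
        col = min(int((x - minx) / dx), n_bins - 1)
        row = min(int((y - miny) / dy), n_bins - 1)
        counts[(col, row)] += 1
    return counts


# Hypothetical centroids: 90 packed into a "coastal" corner, 9 spread "inland".
coastal = [(0.01 * i, 0.1 * (i % 10)) for i in range(90)]
inland = [(c + 0.5, 5.0) for c in range(1, 10)]

counts = bin_counts(coastal + inland, n_bins=10, extent=(0.0, 0.0, 10.0, 10.0))
busiest = max(counts.values())
occupied = len(counts)
```

Here one of the 100 cells holds 90 of the 99 points, so the pairwise checks inside that cell dominate the cost. A tree-based index adapts its partitions to where the geometries are, which avoids this imbalance.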

martinfleis · 2020-12-23T11:50:37Z

@knaaptime there is a test_dasymetric.py file in the root folder; I guess that is not intentional :).

knaaptime · 2020-12-23T21:49:24Z

@martinfleis yep, not sure how that got there...

In general though this is great. It dramatically simplifies the code too. I'm all for it.

sjsrey · 2020-12-26T20:11:45Z

Excellent improvement!

martinfleis added 2 commits December 23, 2020 09:53

use STRtree.query_bulk

4953b01

require pygeos

c652fb2

martinfleis added 3 commits December 23, 2020 11:27

allow specification of sindex

f3ac01b

add tests

52f22d1

more tests

2cdf108

darribas and others added 3 commits December 23, 2020 11:51

Add documentation for area interpolation based on STRTRee

01b8092

Clarify source of spatial index

a31c1a8

Merge pull request #1 from darribas/strtre

10d15c4

Docs for `STRTree` area interpolation

martinfleis marked this pull request as ready for review December 23, 2020 11:56

martinfleis requested review from sjsrey and knaaptime and removed request for sjsrey December 23, 2020 11:57

darribas requested a review from sjsrey December 23, 2020 11:58

darribas mentioned this pull request Dec 23, 2020

numba/parallel implementation of _area_tables_binning #104

Closed

knaaptime approved these changes Dec 23, 2020

darribas mentioned this pull request Dec 24, 2020

[WIP] STRTree parallel implementation #112

Merged

sjsrey approved these changes Dec 26, 2020

sjsrey merged commit a5d8649 into pysal:master Dec 26, 2020

martinfleis deleted the strtree branch December 26, 2020 20:17

martinfleis restored the strtree branch December 26, 2020 20:17

martinfleis deleted the strtree branch December 26, 2020 20:17
