Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: use STRtree.query_bulk in _area_tables_binning #110

Merged
merged 8 commits into from
Dec 26, 2020
Merged

REF: use STRtree.query_bulk in _area_tables_binning #110

merged 8 commits into from
Dec 26, 2020

Conversation

martinfleis
Copy link
Member

xref #104

This is the latest version of single-core refactoring of area_interpolate. It turned out, that using pygeos-backed STRtree and its vectorized query_bulk in geopandas is even faster than the numba version @darribas mentioned in #104.

While pygeos is not strictly necessary and geopandas can use rtree for this operation, pygeos will bring speedup of the order of magnitude so I am adding it to requirements to make sure user have it installed.

@codecov-io
Copy link

codecov-io commented Dec 23, 2020

Codecov Report

Merging #110 (c652fb2) into master (c10dbda) will decrease coverage by 6.72%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #110      +/-   ##
==========================================
- Coverage   41.34%   34.62%   -6.73%     
==========================================
  Files          11       11              
  Lines         549      491      -58     
==========================================
- Hits          227      170      -57     
+ Misses        322      321       -1     
Impacted Files Coverage Δ
tobler/area_weighted/area_interpolate.py 41.57% <100.00%> (-13.94%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c10dbda...c652fb2. Read the comment docs.

@martinfleis
Copy link
Member Author

Timings:

from tobler.area_weighted import area_interpolate
import geopandas

p = ("https://geographicdata.science/book/_downloads/"\
     "f2341ee89163afe06b42fc5d5ed38060/sandiego_tracts.gpkg")
src = geopandas.read_file(p)

p = ("https://geographicdata.science/book/_downloads/"\
     "d740a1069144baa1302b9561c3d31afe/sd_h3_grid.gpkg")
tgt = geopandas.read_file(p).to_crs(src.crs)

%timeit estimates = area_interpolate(src, tgt, ['total_pop'])

master - 2.4 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
this PR - 394 ms ± 35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@darribas
Copy link
Member

+1 on going ahead with this approach. After several experiments and thinking, my sense with the issue is that the second test case was different balance of polygons, but also polygons were distributed differently over space, giving some bins very few but others a lot of polygons to check, and making the binning approach less efficient. Think of the case of western MSAs in the US, if we bin uniformly, the coastal buckets will have a lot more than the inlands... Not sure if that's the case, but it's a hunch.

More generally, what we need for this operation is a spatial index and, given the ecosystem now has very good off-the-shelf options, my vote would be to jump on that wagon and enjoy further improvements down the line.

@martinfleis
Copy link
Member Author

@knaaptime there is a test_dasymetric.py file in the root folder, I guess that is not an intention :).

@martinfleis martinfleis marked this pull request as ready for review December 23, 2020 11:56
@martinfleis martinfleis requested review from sjsrey and knaaptime and removed request for sjsrey December 23, 2020 11:57
@darribas darribas requested a review from sjsrey December 23, 2020 11:58
@knaaptime
Copy link
Member

@martinfleis yep, not sure how that got there...

In general though this is greatT dramatically simplifies the code too. I'm all for it

@sjsrey
Copy link
Member

sjsrey commented Dec 26, 2020

Excellent improvement!

@sjsrey sjsrey merged commit a5d8649 into pysal:master Dec 26, 2020
@martinfleis martinfleis deleted the strtree branch December 26, 2020 20:17
@martinfleis martinfleis restored the strtree branch December 26, 2020 20:17
@martinfleis martinfleis deleted the strtree branch December 26, 2020 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants