
[WIP] start kriging module #140

Closed · wants to merge 2 commits

Conversation

knaaptime (Member)

this is a first draft at adding a kriging module based on pykrige. Initial explorations were pretty positive, though the quality of the interpolation obviously depends a great deal on the variogram fit
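
For reference, here is a minimal sketch of fitting and inspecting a variogram with pykrige on source polygon centroids. This is only a sketch under assumptions: `source_df` is a GeoDataFrame and `"population"` is a hypothetical column name, and none of this is the module's eventual API.

```python
# Minimal sketch (not the module's API): fit an ordinary-kriging model on
# source polygon centroids with pykrige and inspect the fitted variogram,
# since the variogram fit drives the interpolation quality.
from pykrige.ok import OrdinaryKriging

pts = source_df.geometry.centroid  # source_df: a GeoDataFrame (assumed)

model = OrdinaryKriging(
    pts.x.values,
    pts.y.values,
    source_df["population"].values,  # hypothetical column name
    variogram_model="spherical",     # could also try "linear", "exponential", ...
)

# fitted parameters of the chosen variogram model, useful for judging the fit
print(model.variogram_model_parameters)
```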

knaaptime requested a review from sjsrey on April 27, 2021
codecov-commenter commented on Apr 27, 2021

Codecov Report

Merging #140 (e3a07b6) into master (32c8525) will decrease coverage by 3.45%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##           master     #140      +/-   ##
==========================================
- Coverage   81.25%   77.79%   -3.46%     
==========================================
  Files          17       19       +2     
  Lines         832      869      +37     
==========================================
  Hits          676      676              
- Misses        156      193      +37     
Impacted Files Coverage Δ
tobler/kriging/__init__.py 0.00% <0.00%> (ø)
tobler/kriging/kriging.py 0.00% <0.00%> (ø)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

knaaptime (Member, Author)

Currently this is just to get started exploring the mechanics of the external libraries. The first draft takes a really naive approach, assigning the predicted value at each target_df centroid to the whole polygon. Instead, we should probably generate a geocube raster of the prediction surface, then allow both (a) averaging of pixel values inside each polygon and (b) proper block kriging.
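
A rough sketch of that raster route follows, using pykrige's grid output and rasterstats for the pixel averaging as a stand-in for a full geocube pipeline. The cell size, CRS handling, and the name `krige_zonal_mean` are placeholders, not a proposed design.

```python
# Sketch only: predict a kriged grid, then average pixels inside each
# target polygon. rasterstats stands in for a geocube-based pipeline;
# the resolution handling here is a placeholder.
import numpy as np
from pykrige.ok import OrdinaryKriging
from rasterio.transform import from_origin
from rasterstats import zonal_stats


def krige_zonal_mean(source_df, target_df, variable, cell_size=250):
    pts = source_df.geometry.centroid
    model = OrdinaryKriging(
        pts.x.values,
        pts.y.values,
        source_df[variable].values,
        variogram_model="spherical",
    )

    # build a regular grid over the target extent and predict on it
    minx, miny, maxx, maxy = target_df.total_bounds
    gridx = np.arange(minx, maxx, cell_size)
    gridy = np.arange(miny, maxy, cell_size)
    surface, _ = model.execute("grid", gridx, gridy)

    # pykrige returns rows ordered south-to-north; flip so row 0 is the
    # northern edge, matching the top-left affine transform below
    raster = np.flipud(np.asarray(surface))
    transform = from_origin(minx, maxy, cell_size, cell_size)

    # average the prediction-surface pixels that fall inside each polygon
    stats = zonal_stats(target_df, raster, affine=transform, stats="mean")
    out = target_df.copy()
    out[variable] = [s["mean"] for s in stats]
    return out
```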

sjsrey (Member) left a comment:

I think we need to consider different treatments for the extensive and intensive variables. At first glance, kriging seems more appropriate for the latter.

knaaptime (Member, Author) commented on Jun 27, 2021

agreed on both.

I've also played around a bit further, and there are a few different ways we could go about this (and maybe we provide options for more than one). The question is how we want to shoehorn the very discrete process of human geography into a continuous spatial model (though, as you said, it should work reasonably well for percentages).

  1. The approach in the current PR is the simplest (probably overly so). It estimates the model using polygon centroids from source_df as observations, then uses that model to predict values at the centroids of target_df. The issue is that, especially for extensive variables like counts, we end up way overestimating the volume of the total surface (so that implementation also includes a rescale). We don't have "control" observations in places with 0 population, so the estimated surface doesn't have the variation we need it to have.
  2. Estimate the model using source_df centroids, then predict a continuous raster, then take the average of the pixel values that fall inside each target_df polygon. I think this is closer to the spirit of block kriging, though I'm still looking for the best reference.
  3. Rasterize input_df and estimate the model using that raster, then predict a continuous raster and take the average within target_df polygons. This might help capture some of the "harder" edges between polygons that get overly smoothed in approach (1), but it also kind of inflates the data (estimating raster resolution x polygon area "observations" instead of one per polygon), so it might end up with some oddities for places with lots of heterogeneously sized polygons. It is also really computationally intensive because the training data becomes so large, so a hybrid option of sorts might be to use something like pointpats to drop random points inside each polygon and use those as observations (see the sketch after this list).
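
A sketch of the hybrid idea in (3), with a plain rejection sampler standing in for pointpats; `points_per_poly` and both function names are illustrative only, not a proposed interface.

```python
# Sketch of the "random points per polygon" hybrid from option 3:
# scatter a handful of points inside each source polygon, give each
# point the polygon's value, and krige on those instead of a full raster.
# The sampler below is simple rejection sampling; pointpats (or
# GeoPandas' point sampling) could be swapped in.
import numpy as np
from pykrige.ok import OrdinaryKriging
from shapely.geometry import Point


def sample_in_polygon(poly, n, rng):
    """Draw n uniform random points inside a (multi)polygon by rejection."""
    minx, miny, maxx, maxy = poly.bounds
    pts = []
    while len(pts) < n:
        candidate = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if poly.contains(candidate):
            pts.append(candidate)
    return pts


def point_training_set(source_df, variable, points_per_poly=10, seed=0):
    """Build x, y, z training arrays from random points labeled with polygon values."""
    rng = np.random.default_rng(seed)
    xs, ys, zs = [], [], []
    for geom, value in zip(source_df.geometry, source_df[variable]):
        for p in sample_in_polygon(geom, points_per_poly, rng):
            xs.append(p.x)
            ys.append(p.y)
            zs.append(value)
    return np.array(xs), np.array(ys), np.array(zs)


# usage sketch:
# x, y, z = point_training_set(source_df, "pct_poverty")
# model = OrdinaryKriging(x, y, z, variogram_model="spherical")
```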

knaaptime (Member, Author)

Actually, a 4th option, riffing on 3, would be to include auxiliary data to mask out uninhabited regions of source_df, then randomly drop points in the inhabited areas and assign them values from source_df, then drop random points in the uninhabited areas, assign them all 0, and estimate on that "surface" (a rough sketch follows below).
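
A sketch of that masking step, assuming a hypothetical `inhabited_mask` GeoDataFrame (e.g. derived from land-cover data) and reusing `point_training_set` from the sketch above; the overlay-based split is one possible way to do the masking, not a settled design.

```python
# Sketch of option 4: split each source polygon into inhabited and
# uninhabited parts using a hypothetical mask layer, sample points with
# source values in the inhabited parts and zero-valued points in the
# remainder, then krige on the combined set.
import geopandas
import numpy as np


def masked_training_set(source_df, inhabited_mask, variable, points_per_poly=10):
    inhabited = geopandas.overlay(source_df, inhabited_mask, how="intersection")
    uninhabited = geopandas.overlay(source_df, inhabited_mask, how="difference")

    # point_training_set() comes from the previous sketch
    x1, y1, z1 = point_training_set(inhabited, variable, points_per_poly)
    x0, y0, _ = point_training_set(uninhabited, variable, points_per_poly)

    x = np.concatenate([x1, x0])
    y = np.concatenate([y1, y0])
    z = np.concatenate([z1, np.zeros_like(x0)])  # uninhabited areas pinned to 0
    return x, y, z
```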

knaaptime deleted the branch pysal:master on May 10, 2023
knaaptime closed this on May 10, 2023