-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/vector zonal stats #53
Conversation
butchtm
commented
Jun 23, 2022
- Implement vector zonal stats
- Add tests
- Add tutorial
- Add reference documentation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
geowrangler/vector_zonal_stats.py
Outdated
# Internal Cell | ||
import warnings | ||
|
||
warnings.filterwarnings(action="ignore", category=UserWarning, module="geopandas") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppression of warnings should be done by the user and not on the library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Will remove (accidental add)
geowrangler/vector_zonal_stats.py
Outdated
results = aoi | ||
|
||
# apply aggregations one column at a time | ||
for agg in aggregations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we have to run _fix_agg
? Cant we just ask the users to pass in the parameters per agg per col.
My expectation here is that each aggregation is unique even if we are running the multiple aggregation or multiple columns. Something like
create_zonal_stats(aoi, data,[
{"func":"count", "col": "index",}
{"func":"min", "col": "pop",}
{"func":",max", "col": "pop",}
{"func":"min", "col": "wealth",}
])
as of know i don't see the benefit of fusing the aggregates per col which i find more confusing to the user
create_zonal_stats(aoi, data,
[
{"func":"count", "col": "index",},
{
"func": ["min", "max"],
"column": "pop",
},
{"func":"min", "col": "wealth",}
]
)
We can then expose optional parameters such
{"func":"count", "col": "index",} => {"func":"count", "col": "index", "fillna":False, "output": "total_count"}
This would simplify things in the code
for agg in aggregations:
results = _aggregate_single_stat(results, groups, agg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it primarily for conciseness,
but it can support both formats now, meaning this:
create_zonal_stats(aoi, data,
[
{"func":"count", "col": "index"},
{"func":"min", "col": "pop"},
{"func":",max", "col": "pop"},
{"func":"min", "col": "wealth"}
])
is equivalent to
create_zonal_stats(aoi, data,
[
{"func":"count", "col": "index"},
{"func":["min","max"], "col": "pop"},
{"func":"min", "col": "wealth"}
])
For the times when we are doing EDA, it sometimes make it more concise to specify different aggregation functions and computing them simultaneously is something that the pandas agg
function already supports.
I don't think its necessarily confusing if our documentation provides usage examples.
As for exposing optional parameters like output
and fillna
, single values are also already supported (I need to fix it for fillna
though).
Lastly, the rationale for the _fix_agg
method is that its more sophisticated optional arguments filler but it also validates the agg_spec
. It doesn't really add that much complexity. These functions can be separated, but I don't think this will add much to improving clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright with giving users the option to do
{"func":["min","max"], "col": "pop"},
But internal implementation should convert them to a consistent
{"func":"max", "col": "pop"},
{"func":"min", "col": "pop"},
geowrangler/vector_zonal_stats.py
Outdated
data = data.to_crs(aoi.crs) | ||
|
||
# prep for spatial join | ||
aoi, aoi_index_data, aoi_col_data = _prep_aoi(aoi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to just name the index column something like
__geowrangleer_aoi_ndex
and raise an error if for some reason that column exists at this stage rather than keeping track of everything. we can then drop __geowrangleer_aoi_ndex
after all the aggregations.
A possible issue with replacing columns is if for some reason the index is being used as an aggregation column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its just that right now, this is flexible enough that it doesn't matter what the aoi column names are.
If we were to implement your suggestion we would be reducing the flexibility of the function for the user to have to not think about aoi's column names.
As for the issue of replacing columns, it actually wouldn't be possible to use index as the aggregation column for the aoi side since the only columns used in the spatial join from the aoi side are aoi_index
and the geometry. The merge of the grouped columns (left join) are only keyed on the aoi_index
. The function guarantees that the number of rows of the aoi are always preserved in the results. I would also have to add handling the output column name.
If you feel strongly that the code simplification is worth the reduction in flexibility, then I'll implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaning toward keeping things simple and less flexible. If users request more flexibility, we can address it then. Personally don't want to add complexity unless there are concrete use cases for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
# make data crs == aoi crs | ||
if not data.crs.equals(aoi.crs): | ||
data = data.to_crs(aoi.crs) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to have: validate aggregations
is valid config before running any of the computationally expensive stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
* Remove warnings filter in module * Separate validation and fixing agg specs * Expand agg specs to simpler format * Remove handling of columns with same name as GEO_INDEX_NAME * Refactored use of GEO_INDEX_NAME across modules and tests * Removed loop and optimized application of agg functions into 1 step * Fix handling of fillna for a single func
5e52117
to
4fce7d4
Compare
Visit the preview URL for this PR (updated for commit 3da5da4): https://geowrangler--pr53-feature-vector-zonal-h5i5f7kw.web.app (expires Tue, 05 Jul 2022 07:34:05 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 |