Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/vector zonal stats #53

Merged
merged 33 commits into from
Jun 28, 2022
Merged

Feature/vector zonal stats #53

merged 33 commits into from
Jun 28, 2022

Conversation

butchtm
Copy link
Collaborator

@butchtm butchtm commented Jun 23, 2022

  • Implement vector zonal stats
  • Add tests
  • Add tutorial
  • Add reference documentation

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@butchtm butchtm requested a review from jtmiclat June 23, 2022 10:21
# Internal Cell
import warnings

warnings.filterwarnings(action="ignore", category=UserWarning, module="geopandas")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppression of warnings should be done by the user and not on the library

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Will remove (accidental add)

results = aoi

# apply aggregations one column at a time
for agg in aggregations:
Copy link
Contributor

@jtmiclat jtmiclat Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we have to run _fix_agg? Cant we just ask the users to pass in the parameters per agg per col.

My expectation here is that each aggregation is unique even if we are running the multiple aggregation or multiple columns. Something like

create_zonal_stats(aoi, data,[
{"func":"count", "col": "index",}
{"func":"min", "col": "pop",}
{"func":",max", "col": "pop",}
{"func":"min", "col": "wealth",}
])

as of know i don't see the benefit of fusing the aggregates per col which i find more confusing to the user

create_zonal_stats(aoi, data,
        [
        {"func":"count", "col": "index",},
        {
            "func": ["min", "max"],
            "column": "pop",
        },
        {"func":"min", "col": "wealth",}
        ]
)

We can then expose optional parameters such

{"func":"count", "col": "index",} => {"func":"count", "col": "index", "fillna":False, "output": "total_count"}

This would simplify things in the code

for agg in aggregations:
    results = _aggregate_single_stat(results, groups, agg)

Copy link
Collaborator Author

@butchtm butchtm Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it primarily for conciseness,
but it can support both formats now, meaning this:

create_zonal_stats(aoi, data,
[ 
 {"func":"count", "col": "index"}, 
 {"func":"min", "col": "pop"},
 {"func":",max", "col": "pop"},
 {"func":"min", "col": "wealth"}
 ])

is equivalent to

create_zonal_stats(aoi, data,
[ 
 {"func":"count", "col": "index"}, 
 {"func":["min","max"], "col": "pop"},
 {"func":"min", "col": "wealth"}
 ])

For the times when we are doing EDA, it sometimes make it more concise to specify different aggregation functions and computing them simultaneously is something that the pandas agg function already supports.

I don't think its necessarily confusing if our documentation provides usage examples.

As for exposing optional parameters like output and fillna, single values are also already supported (I need to fix it for fillna though).

Lastly, the rationale for the _fix_agg method is that its more sophisticated optional arguments filler but it also validates the agg_spec. It doesn't really add that much complexity. These functions can be separated, but I don't think this will add much to improving clarity.

Copy link
Contributor

@jtmiclat jtmiclat Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright with giving users the option to do

 {"func":["min","max"], "col": "pop"},

But internal implementation should convert them to a consistent

 {"func":"max", "col": "pop"},
 {"func":"min", "col": "pop"},

data = data.to_crs(aoi.crs)

# prep for spatial join
aoi, aoi_index_data, aoi_col_data = _prep_aoi(aoi)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to just name the index column something like

__geowrangleer_aoi_ndex

and raise an error if for some reason that column exists at this stage rather than keeping track of everything. we can then drop __geowrangleer_aoi_ndex after all the aggregations.

A possible issue with replacing columns is if for some reason the index is being used as an aggregation column.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its just that right now, this is flexible enough that it doesn't matter what the aoi column names are.

If we were to implement your suggestion we would be reducing the flexibility of the function for the user to have to not think about aoi's column names.

As for the issue of replacing columns, it actually wouldn't be possible to use index as the aggregation column for the aoi side since the only columns used in the spatial join from the aoi side are aoi_index and the geometry. The merge of the grouped columns (left join) are only keyed on the aoi_index . The function guarantees that the number of rows of the aoi are always preserved in the results. I would also have to add handling the output column name.

If you feel strongly that the code simplification is worth the reduction in flexibility, then I'll implement it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaning toward keeping things simple and less flexible. If users request more flexibility, we can address it then. Personally don't want to add complexity unless there are concrete use cases for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

# make data crs == aoi crs
if not data.crs.equals(aoi.crs):
data = data.to_crs(aoi.crs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have: validate aggregations is valid config before running any of the computationally expensive stuff

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@tm-kah-alforja tm-kah-alforja mentioned this pull request Jun 28, 2022
12 tasks
* Remove warnings filter in module
* Separate validation and fixing agg specs
* Expand agg specs to simpler format
* Remove handling of columns with same name as GEO_INDEX_NAME
* Refactored use of GEO_INDEX_NAME across modules and tests
* Removed loop and optimized application of agg functions into 1 step
* Fix handling of fillna for a single func
@butchtm butchtm force-pushed the feature/vector-zonal-stats branch from 5e52117 to 4fce7d4 Compare June 28, 2022 06:43
@github-actions
Copy link
Contributor

github-actions bot commented Jun 28, 2022

Visit the preview URL for this PR (updated for commit 3da5da4):

https://geowrangler--pr53-feature-vector-zonal-h5i5f7kw.web.app

(expires Tue, 05 Jul 2022 07:34:05 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

@butchtm butchtm requested a review from jtmiclat June 28, 2022 07:34
@jtmiclat jtmiclat merged commit 5617251 into master Jun 28, 2022
@jtmiclat jtmiclat deleted the feature/vector-zonal-stats branch June 28, 2022 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

2 participants