add hexify function #111
Conversation
Codecov Report
@@ Coverage Diff @@
## master #111 +/- ##
==========================================
+ Coverage 37.20% 41.74% +4.53%
==========================================
Files 11 12 +1
Lines 508 551 +43
==========================================
+ Hits 189 230 +41
- Misses 319 321 +2
Continue to review full report at Codecov.
I LOVE this idea, I think it's great. I've left a few comments as suggestions; some are more for consideration than anything else.
Thinking through it, one of the values of putting this in tobler is not stopping at generating the hex geometries, but having methods to take your data into H3 (or other indexing systems like S2 down the line, as I suggest in the review). You have a bunch of data in different shapefiles with different geometries for the same region, and tobler helps you "align" them on a standard system like H3. In this model, you could choose how each column/geometry is transferred to H3, which would give us an excuse to integrate "entry-point" methods that dispatch to each of the techniques that tobler covers (area weighted, dasymetric, model-based...). A little bit like the meta-API we've discussed for model in PySAL. Thinking about this, it could be done for any two arbitrary geographies, and then if you only specify the input one, the output could go into H3. Some food for thought (and discussion).
Otherwise, the code itself looks good to me!
tobler/util/util.py
Outdated
@@ -102,14 +106,76 @@ def project_gdf(gdf, to_crs=None, to_latlong=False):

    # calculate the centroid of the union of all the geometries in the
    # GeoDataFrame
-    avg_longitude = gdf['geometry'].unary_union.centroid.x
+    avg_longitude = gdf["geometry"].unary_union.centroid.x
@martinfleis will know better, but I think gdf.geometry is a safer, more general approach to pick the geom column in a GeoDataFrame (the active geometry could be set to a column called "geom", but gdf.geometry will always pick it up correctly).
Yes. Unless you create the "geometry" column yourself, you cannot assume that it is there and called as such. gdf.geometry automatically picks the active geometry, no matter its name.
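The difference is easy to demonstrate with a minimal sketch (assuming geopandas and shapely are available; the column name "geom" is arbitrary):

```python
import geopandas as gpd
from shapely.geometry import Point

# a GeoDataFrame whose active geometry column is named "geom", not "geometry"
gdf = gpd.GeoDataFrame({"geom": [Point(0, 0), Point(1, 1)]}, geometry="geom")

print(gdf.geometry.name)  # "geom" -- gdf.geometry resolves the active column
# gdf["geometry"] would raise a KeyError here, since no column has that name
print(gdf.geometry.unary_union.centroid.x)  # 0.5
```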
yep, as i mentioned when @martinfleis raised this a bit ago, i'm confident this pattern appears elsewhere too, so we still need to find and resolve all instances.
Further, what i'd prefer to do moving forward is pin to a version of geopandas that implements the native version of this function and remove this implementation altogether.
tobler/util/util.py
Outdated
    # project the GeoDataFrame to the UTM CRS
    projected_gdf = gdf.to_crs(utm_crs)

    return projected_gdf


def hexify(source, resolution=6, clip=False):
I'd suggest renaming to h3fy. It's not tied to any generic or random hexagonal grid; it's the H3 implementation. This will also open the door for sister methods in the future, like s2fy or OSfy (for the Ordnance Survey standard grid, for example).
cool. figured there'd be some suggestions on the name
tobler/util/util.py
Outdated
source : geopandas.GeoDataFrame
    GeoDataFrame to transform into a hexagonal grid
resolution : int, optional
    resolution of output h3 hexgrid.
Document default value
cool. also wasn't sure if 6 was the ideal default resolution, but will go ahead and leave as-is
tobler/util/util.py
Outdated
----------
source : geopandas.GeoDataFrame
    GeoDataFrame to transform into a hexagonal grid
resolution : int, optional
I think this is fine for now, so not a comment to implement before approval. Going forward, it'd be great to also allow str options, something like:
- auto: some "magic" that gives the "correct" one, if we can come up with some heuristics
- balanced (or some alternative name): for the resolution that'll give you the closest number of hexagons to those passed in source
- compacted: one that allocates to an H3 Polyfill compacted version
- ...
yeah these are great ideas. I'd like to merge this version first, then raise an issue for these options as an enhancement
one option might be to try to roughly match the number of input geoms; another option might be to try to match the size, e.g. by taking the best match between the mean area of the input geoms and the hex areas listed at https://h3geo.org/docs/core-library/restable
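The size-matching heuristic could be sketched in pure Python. This is hypothetical (the helper name suggest_resolution is made up, and the area figures are rounded approximations of the H3 docs table at https://h3geo.org/docs/core-library/restable):

```python
import math

# approximate average hexagon areas (km^2) per H3 resolution,
# rounded from the table in the H3 docs
H3_AVG_AREA_KM2 = {
    0: 4_250_547, 1: 607_221, 2: 86_746, 3: 12_392, 4: 1_770,
    5: 252.9, 6: 36.13, 7: 5.161, 8: 0.7373, 9: 0.1053,
    10: 0.01505, 11: 0.00215, 12: 0.000307,
}

def suggest_resolution(mean_area_km2):
    """Pick the H3 resolution whose average cell area is closest to the
    mean area of the input geometries. Compare on a log scale, since the
    areas span many orders of magnitude."""
    return min(
        H3_AVG_AREA_KM2,
        key=lambda r: abs(math.log(H3_AVG_AREA_KM2[r]) - math.log(mean_area_km2)),
    )

# e.g. input polygons averaging ~5 km^2 would map to resolution 7
print(suggest_resolution(5))  # -> 7
```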
clip : bool, optional
    if True, hexagons are clipped to the precise boundary of the source gdf. Otherwise,
    hexagons along the boundary will be left intact.
I'd add an option for return_geoms or similar that controls whether the actual polygons are returned. I'm not sure; we probably have to create them to do the apportioning. But given that tools like Kepler.gl can ingest H3 ids and render them, and that some folks might use this for fusing data, if we can find ways of apportioning without creating the geometries, we might be able to be more efficient and not have to create/return a bunch of hex polygons that'll be discarded as soon as the table is created. Food for thought.
        name="hex_id",
    )

    polys = hexids.apply(
This could potentially be processed in parallel for faster execution on modern hardware. Chunk hexids, send each chunk to a core (similar to how we're thinking about it in #112), and collect the results later. It should scale almost linearly.
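The chunk-and-collect pattern could be sketched with the standard library alone. This is only an illustration of the structure (chunked_apply and the squaring transform are hypothetical stand-ins; real CPU-bound shapely work would want a process pool or dask rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_apply(items, func, n_chunks=4):
    """Split items into roughly equal chunks, map func over each chunk
    in a worker, and concatenate the results in the original order."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map yields chunk results in submission order
        results = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
    return [y for chunk in results for y in chunk]

# stand-in transform; the real version would build a polygon per hex id
print(chunked_apply(list(range(10)), lambda x: x * x, n_chunks=3))
# -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```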
good idea. I wonder if we could also outsource some of that logic to something like modin?
> I wonder if we could also outsource some of that logic to something

That should ideally be dask-geopandas, which is under development.
+1, though in this case the operation is just an apply, which shouldn't need dask-geopandas (and as far as i understand, dask doesn't do apply?)
so in a perfect world, maybe dask-geopandas could end up as an additional backend for modin or something, depending on the operation?
Dask does row-wise apply very efficiently. It cannot do column-wise apply, but that is not an issue here.
It is true that we can have hexids as a dask Series, and then the apply remains the same. Then below we'll just call compute() before creating the GeoDataFrame. That should be relatively straightforward to implement with a vanilla dask.DataFrame even now.
sweet. would the idea be to check whether dask is installed and use multicore handling if so, with a fallback if not, or to make dask a full dep?
I would add a keyword to control that, similar to this implementation we have in momepy: https://github.com/martinfleis/momepy/blob/19e3a1a6eb9577a2c160710bd139034a459a4f97/momepy/elements.py#L433.
There may be cases when single-core is faster than dask (typically small areas) due to the scheduler overhead.
cool, thanks for the tip. i'm gonna go ahead and merge this implementation so we all have it available, then we can hack on the multicore enhancement
sweet
definitely agree, and had the same idea, particularly when toying around with this notebook in binder. I think the way rasters are handled is probably a lot faster than the way we're doing it now, so happy to keep exploring this thread
Minor code suggestion re CRS handling. Requires newer pyproj but I think that the minimal version geopandas requires is enough.
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
anybody got any idea why the tests aren't passing? I can see h3 is in the environment but still getting an import error..?
got it, on conda,
Let's move this to a separate issue to figure out ways forward. That notebook seems to use rasters as
following the example of some similar code from the gdsbook, this adds a function to generate a geodataframe of h3 hexagons in the footprint of a given source gdf. I think it's useful to have around, and i imagine it would be useful as a target for interpolation