add hexify function #111

Merged · 13 commits merged into pysal:master on Dec 27, 2020
Conversation

@knaaptime (Member) commented on Dec 23, 2020:

Following the example of some similar code from the gdsbook, this adds a function to generate a GeoDataFrame of H3 hexagons covering the footprint of a given source GeoDataFrame. I think it's useful to have around, and I imagine it would be handy for generating a target geometry for interpolation.
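A minimal sketch of the idea (not the exact code in this PR), assuming the h3-py 3.x API (h3.polyfill and h3_to_geo_boundary), a source GeoDataFrame whose unioned footprint is a single polygon, and a hypothetical helper name hexify_sketch:

```python
import geopandas as gpd
import h3
from shapely.geometry import Polygon

def hexify_sketch(source, resolution=6):
    """Return a GeoDataFrame of H3 hexagons covering the source footprint."""
    # union of all source geometries, as a GeoJSON-like dict in lon/lat
    footprint = source.to_crs(4326).unary_union.__geo_interface__
    # all H3 cells whose centroids fall inside the footprint
    hexids = list(h3.polyfill(footprint, resolution, geo_json_conformant=True))
    # turn each cell id back into a shapely polygon
    polys = [Polygon(h3.h3_to_geo_boundary(h, geo_json=True)) for h in hexids]
    return gpd.GeoDataFrame({"hex_id": hexids}, geometry=polys, crs=4326)
```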

@knaaptime changed the title from "add hexify" to "add hexify function" on Dec 23, 2020
@codecov-io commented on Dec 23, 2020:

Codecov Report

Merging #111 (ea73d91) into master (a5d8649) will increase coverage by 4.53%.
The diff coverage is 84.00%.


@@            Coverage Diff             @@
##           master     #111      +/-   ##
==========================================
+ Coverage   37.20%   41.74%   +4.53%     
==========================================
  Files          11       12       +1     
  Lines         508      551      +43     
==========================================
+ Hits          189      230      +41     
- Misses        319      321       +2     
Impacted Files | Coverage Δ
tobler/util/util.py | 68.25% <71.42%> (+11.11%) ⬆️
tobler/tests/test_utils.py | 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5d8649...ea73d91.

@darribas (Member) left a comment:

I LOVE this idea, I think it's great. I've left a few comments in terms of suggestions, some are more for consideration than anything else.

Thinking through it, one of the values of putting this in tobler is not stopping at generating the hex geometries but having methods to take your data into H3 (or other indexing systems like S2 down the line, as I suggest in the review). You have a bunch of data on different shapefiles with different geometries for the same region, and tobler helps you "align" them on a standard system like H3. In this model, you could choose how each column/geometry is transferred to H3, which would give us an excuse to integrate "entry-point" methods that dispatch to each of the techniques that tobler covers (area weighted, dasymetric, model-based...). A little bit like the meta-API we've discussed for model in PySAL. Thinking about this, it could be done for any two arbitrary geographies, and then if you only specify the input one, the output could go into H3. Some food for thought (and discussion).

Otherwise, the code itself looks good to me!
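To make the "entry-point" idea above concrete, a purely hypothetical sketch of such a meta method (every name and signature here is illustrative, not tobler's actual API; hexify is the function added in this PR):

```python
from tobler.util.util import hexify  # the function added in this PR

def transfer(source, target=None, columns=None, method="area_weighted",
             resolution=6, **kwargs):
    """Hypothetical meta entry point: move `columns` from `source` onto `target`.

    If no target is given, build an H3 grid over the source footprint so data
    from different shapefiles ends up aligned on a common index.
    """
    if target is None:
        target = hexify(source, resolution=resolution)
    # dispatch to the relevant tobler technique by name
    if method == "area_weighted":
        ...  # area-weighted interpolation of columns onto target
    elif method == "dasymetric":
        ...  # dasymetric / masked interpolation
    elif method == "model":
        ...  # model-based interpolation
    return target
```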

@@ -102,14 +106,76 @@ def project_gdf(gdf, to_crs=None, to_latlong=False):

# calculate the centroid of the union of all the geometries in the
# GeoDataFrame
- avg_longitude = gdf['geometry'].unary_union.centroid.x
+ avg_longitude = gdf["geometry"].unary_union.centroid.x
Member:

@martinfleis will know better, but I think gdf.geometry is a safer, more general approach for picking the geometry column in a GeoDataFrame (the active geometry could be a column called "geom", but gdf.geometry will always pick it up correctly).

Member:

Yes. Unless you create "geometry" column by yourself, you cannot assume that it is there and called as such. gdf.geometry automatically picks the active geometry no matter its name.
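For illustration, a small example of that behavior (the column name "geom" here is just hypothetical):

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame({"geom": [Point(0, 0), Point(1, 1)]}, geometry="geom")

# gdf["geometry"]  # would raise a KeyError: no column literally named "geometry"
gdf.geometry       # returns the active geometry column, whatever its name
print(gdf.geometry.unary_union.centroid.x)
```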

@knaaptime (Member, Author):

Yep, as I mentioned when @martinfleis raised this a bit ago, I'm confident this pattern appears elsewhere too, so I still need to find and resolve all the instances.

Moving forward, what I'd actually prefer is to pin to a version of geopandas that implements this function natively and remove this implementation altogether.


# project the GeoDataFrame to the UTM CRS
projected_gdf = gdf.to_crs(utm_crs)

return projected_gdf


def hexify(source, resolution=6, clip=False):
Member:

I'd suggest renaming it to h3fy. It isn't just any (or a random) hexagonal grid, it's the H3 implementation. This would also open the door for sister methods in the future, like s2fy or OSfy (for the Ordnance Survey standard grid, for example).

@knaaptime (Member, Author):

Cool, I figured there'd be some suggestions on the name.

source : geopandas.GeoDataFrame
    GeoDataFrame to transform into a hexagonal grid
resolution : int, optional
    resolution of output h3 hexgrid.
Member:

Document default value

@knaaptime (Member, Author):

Cool. I also wasn't sure whether 6 was the ideal resolution, but I'll go ahead and leave it as-is.

----------
source : geopandas.GeoDataFrame
    GeoDataFrame to transform into a hexagonal grid
resolution : int, optional
@darribas (Member) commented on Dec 26, 2020:

I think this is fine for now, so not a comment to implement before approval. Going forward, it'd be great to also allow str options, something like:

  • auto: some "magic" that gives the "correct" one, if we can come up with some heuristics
  • balanced (or some alternative name): for the resolution that'll give you the closest number of hexagons to the number of geometries passed in source
  • compacted: one that allocates to an H3 Polyfill compacted version
  • ...

@knaaptime (Member, Author):

Yeah, these are great ideas. I'd like to merge this version first, then raise an issue for these options as an enhancement.

@knaaptime (Member, Author):

One option might be to roughly match the number of input geoms; another might be to match their size, e.g. by taking the best match between the mean area of the input geoms and the hexagon sizes listed at https://h3geo.org/docs/core-library/restable
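A rough sketch of that size-matching idea (hypothetical, not code from this PR), assuming the h3-py 3.x h3.hex_area helper and a geopandas version that provides estimate_utm_crs:

```python
import h3

def match_resolution_by_area(source):
    """Pick the H3 resolution whose average cell area is closest to the
    mean area of the source geometries (a hypothetical heuristic)."""
    # measure source areas in km^2 using a projected (UTM) CRS
    mean_area_km2 = source.to_crs(source.estimate_utm_crs()).area.mean() / 1e6
    # H3 resolutions run from 0 (coarsest) to 15 (finest)
    return min(
        range(16),
        key=lambda res: abs(h3.hex_area(res, unit="km^2") - mean_area_km2),
    )
```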

clip : bool, optional
    if True, hexagons are clipped to the precise boundary of the source gdf.
    Otherwise, hexagons along the boundary will be left intact.
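For illustration, the clipping step could boil down to a single geopandas call; a hypothetical sketch, assuming the hexagon and source frames share a CRS:

```python
import geopandas as gpd

def clip_to_source(hex_gdf, source):
    # trim hexagons along the edge so the grid matches the source footprint exactly
    return gpd.clip(hex_gdf, source)
```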

Member:

I'd add an option like return_geoms or similar that controls whether the actual polygons are returned. I'm not sure; we probably have to create them to do the apportioning. But given that tools like Kepler.gl can ingest H3 ids and render them, and that some folks might use this for fusing data, if we can find ways of apportioning without creating the geometries we might be more efficient, and avoid creating/returning a bunch of hex polygons that'll be discarded as soon as the table is created. Food for thought.

name="hex_id",
)

polys = hexids.apply(
Member:

This could potentially be processed in parallel for faster execution on modern hardware: chunk hexids, send a chunk to each core (similar to how we're thinking about it in #112), and collect the results later. It should scale almost linearly.

@knaaptime (Member, Author):

good idea. I wonder if we could also outsource some of that logic to something like modin?

Member:

> I wonder if we could also outsource some of that logic to something

That should ideally be dask-geopandas, which is under development.

@knaaptime (Member, Author):

+1, though in this case the operation is just an apply, which shouldn't need dask-geopandas (and as far as I understand, dask doesn't do apply?)

So in a perfect world, maybe dask-geopandas could end up as an additional backend for modin or something, depending on the operation?

Member:

Dask does row-wise apply very efficiently. It cannot do column-wise apply, but that is not an issue here.

It is true that we can have hexids as a dask Series, and then the apply remains the same; below we'd just call compute() before creating the GeoDataFrame. That should be relatively straightforward to implement with a vanilla dask.DataFrame even now.
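A hypothetical sketch of that pattern (hexids, hex_to_polygon, and the partition count are illustrative names, not code from this PR):

```python
import pandas as pd
import dask.dataframe as dd

def parallel_hex_to_polys(hexids: pd.Series, hex_to_polygon, npartitions=4):
    # split the pandas Series of H3 ids into partitions, one per worker
    dask_hexids = dd.from_pandas(hexids, npartitions=npartitions)
    # same row-wise apply as before; `meta` tells dask what the output looks like
    return dask_hexids.apply(hex_to_polygon, meta=("hex_id", object)).compute()
```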

@knaaptime (Member, Author):

Sweet. Would the idea be to check whether dask is installed and use the multicore handling with a fallback if not, or to make dask a full dependency?

Member:

I would add a keyword to control that, similar to this implementation we have in momepy: https://github.com/martinfleis/momepy/blob/19e3a1a6eb9577a2c160710bd139034a459a4f97/momepy/elements.py#L433.

There may be cases when single-core is faster than dask (typically small areas) due to the scheduler overhead.

@knaaptime (Member, Author):

Cool, thanks for the tip. I'm going to go ahead and merge this implementation so we all have it available, then we can hack on the multicore enhancement.

@knaaptime (Member, Author) commented:
Sweet.

> Thinking through it, one of the values of putting this in tobler is not stopping at generating the hex geometries but having methods to take your data into H3 (or other indexing systems like S2 down the line, as I suggest in the review). You have a bunch of data on different shapefiles with different geometries for the same region, and tobler helps you "align" them on a standard system like H3.

Definitely agree, and I had the same idea, particularly when toying around with this notebook in binder. I think the way rasters are handled is probably a lot faster than the way we're doing it now, so happy to keep exploring this thread.

@martinfleis (Member) left a comment:

Minor code suggestion re CRS handling. Requires newer pyproj but I think that the minimal version geopandas requires is enough.

Two review threads on tobler/util/util.py (outdated, resolved).
@knaaptime (Member, Author) commented:
Anybody got any idea why the tests aren't passing? I can see h3 is in the environment, but I'm still getting an import error?

@knaaptime (Member, Author) commented:
Got it: on conda, h3 is the C library and h3-py is the package of Python bindings (whereas the PyPI package is just h3).

@knaaptime merged commit 88ca482 into pysal:master on Dec 27, 2020
@darribas (Member) commented:
> Sweet.
>
> > Thinking through it, one of the values of putting this in tobler is not stopping at generating the hex geometries but having methods to take your data into H3 (or other indexing systems like S2 down the line, as I suggest in the review). You have a bunch of data on different shapefiles with different geometries for the same region, and tobler helps you "align" them on a standard system like H3.
>
> Definitely agree, and I had the same idea, particularly when toying around with this notebook in binder. I think the way rasters are handled is probably a lot faster than the way we're doing it now, so happy to keep exploring this thread.

Let's move this to a separate issue to figure out ways forward. That notebook seems to use rasters as DataFrame objects? In any case, I think @martinfleis and I may have ideas. I'll leave it to you to open the issue, @knaaptime, as I think you'll have a better sense of what level would be best to discuss it at ("transferring" data to H3? a meta method that "transfers" data from one geography to another and defaults to H3 and other indexing systems? Something else?).

Labels: none · Projects: none · Linked issues: none · 4 participants