Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing grid tile generation #35

Closed
Tracked by #40
joshuacortez opened this issue Jun 13, 2022 · 7 comments · Fixed by #55
Closed
Tracked by #40

Optimizing grid tile generation #35

joshuacortez opened this issue Jun 13, 2022 · 7 comments · Fixed by #55
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@joshuacortez
Copy link
Contributor

joshuacortez commented Jun 13, 2022

I think there’s room for further optimization especially for generate_grids in the GridGenerator class. Right now the grid tiles are first generated across the entire span of xrange and yrange and then filtered out after. While this isn’t an issue for very coarse grids, this can easily run into runtime and memory issues for fine grids.

Instead of generating all the tiles and then filtering after, we can generate only the grid tiles we need.

  1. This drastically saves on memory since our gdf won’t need to store tiles it doesn’t need.
  2. This saves on runtime since we don’t need to generate unnecessary tiles, and don’t need to reproject unnecessary tiles.

To determine which grid tiles to generate in the first place, we can use the cheapest possible geometric operations.

  1. Looping through xrange and yrange and intersecting tiles with the gdf’s unary_union can be expensive since the unary_union is a single geometry most likely has a large number of points.
  2. We can instead generate tiles per polygon within the gdf. (If there are multipolygons we can explode to polygons). Per polygon, we can:
    1. Find the bounding box of the polygon. (should be cheap)
    2. Determine which tiles are within the polygon’s bounding box. (should be cheap since you don’t need to use shapely operations)
    3. Check which tiles are within the polygon itself by doing an intersection. (this is the most computationally expensive part but mitigated since we have less tiles to compare with, and a single polygon should have less points than the unary union)
    4. Avoid generating tiles that have already been generated. (can be a quick Python set lookup) This is relevant in cases where two polygons share the same tiles.

Can make a PR for this too!

@butchtm
Copy link
Collaborator

butchtm commented Jun 13, 2022

@joshuacortez
suggestion: this should be done in the context of the implementation of #33 as this will affect what tiles will be generated as well.

@jtmiclat
Copy link
Contributor

@joshuacortez For the workflow you mention, I have a few questions.

  1. How would we align the polygons that have been generated already with newly generated ones?
  2. How would we handle creating x_idx/y_idx columns which is useful if either are near each other?

@joshuacortez
Copy link
Contributor Author

joshuacortez commented Jun 13, 2022

  1. How would we align the polygons that have been generated already with newly generated ones?

Ahh are you referring to the risk of having square cells whose corners don't align with each other? Good point. This should be addressed by having a consistently defined overall bounding box + resolution (e.g. bounding box of the entire country with x-meter resolution cells).

With an x,y coordinate system being based on one overall bounding box + resolution, each polygon in the for-loop in the original post should be referring to the same set of cells (i.e. they should be aligned by default).

  1. How would we handle creating x_idx/y_idx columns which is useful if either are near each other?

Oh what did you mean by this?

I imagine the x_idx/y_idx of each cell is based on the overall bounding box + resolution. I.e. cell (0,0) is in one corner of the country bounding box. Let me know if I understood it correctly.

The questions are very related to #33 since the overall bounding box can optionally be set for coordinate system consistency across different AOIs.

@jtmiclat
Copy link
Contributor

jtmiclat commented Jun 13, 2022

Ah, x_idx and y_idx are the columns and row numbers in reference to the origin. It is unclear to me how we can derive these based on the bounding box without generating all the grids of a country.

@jtmiclat jtmiclat added the enhancement New feature or request label Jun 14, 2022
@tm-kah-alforja tm-kah-alforja mentioned this issue Jun 14, 2022
9 tasks
@joshuacortez
Copy link
Contributor Author

joshuacortez commented Jun 14, 2022

@jtmiclat yep! There might be different ways of implementing it.

One way is to have a lookup table for the x-axis and a lookup table for the y-axis. Each lookup table can be a pandas series. The indices are the column numbers and the row numbers, and the corresponding values are the reprojected coordinates (meter-based). The lookup tables are a "compressed" representation of representing the SquareGrid since you don't have to take the cartesian product of the two axes.

E.g. IDN x-axis lookup table in reference to country bounding box using epsg:23845 with 100m spacing

x_idx x_coord
0 -5302629
1 -5302529
2 -5302429

With these lookup tables, it's easy to get cells with respect to a bounding box since it only relies on checking which coordinates are within the xmin, xmax, ymin, ymax of the bounding box. Might also be a good idea to pad the borders for good measure. (i.e. xmax = xmax + spacing, xmin = xmin - spacing, etc.)

Just not sure how this would look like for H3Grid or S2Grid though. I also acknowledge we're introducing new objects like the 2 lookup tables here but might be a good tradeoff in exchange for a lot of efficiency. Let me know what you think!

@jtmiclat
Copy link
Contributor

jtmiclat commented Jun 15, 2022

I see, I kinda get what #33 is. We have a Class or Datatype we can pass that would make it easier to reference

Some pseudo-code

grid_bounds = SquareGridBound(x_min, y_min, x_max, y_max)
# gets the the x_range and y_range of aoi based in reference of grid_bounds 
x_idx_start, x_range, y_idx_start,  y_range = grid_bounds.get_ranges(aoi.bounds)
cells = []
for x_idx, x_coord  in enumerate(x_range):
    for y_idx, y_coord  in enumerate(y_range):
       cells.append({'x_idx': x_idx+x_idx_start, 'y_idx': y_idx+y_idx_start, ....})

might replace

for x_idx, x in enumerate(xrange):
for y_idx, y in enumerate(yrange):
polygons.append(
{"x": x_idx, "y": y_idx, "geometry": self.create_grid(x, y)}
)

And a possible usage within SquareGridGenerator

grid_bounds = GridBound(x_min, y_min, x_max, y_max)
aoi1_grid = SquareGridGenerator(aoi_1, grid_bounds).generate_grid()
aoi2_grid = SquareGridGenerator(aoi_2, grid_bounds).generate_grid()

Regarding H3Grid and S2Grid, we don't need to worry about them as they are designed to have a unique index in the context of the entire world. We should only be facing this issue when using a custom index or grid.

@jtmiclat jtmiclat self-assigned this Jun 16, 2022
@joshuacortez
Copy link
Contributor Author

joshuacortez commented Jun 16, 2022

I see, the SquareGridBound class sounds good!

I was just thinking if SquareGridGenerator takes in not just 1 AOI, but a list of AOIs? (I'm assuming an AOI is just a single polygon and not a multipolygon but not sure it's the same in geowrangler).

In my head, the for-loop pseudo code above could look like the for-loop here for multiple polygons

all_cells = set()
for aoi in aois:
  x_idx_start, x_range, y_idx_start,  y_range = grid_bounds.get_ranges(aoi.bounds)
  cells = []
  for x_idx, x_coord  in enumerate(x_range):
    for y_idx, y_coord  in enumerate(y_range):
        idx_pair = (x_idx + x_idx_start, y_idx + y_idx_start)
        # don't include cells that are already included from other AOIs
        if idx_pair not in all_cells:
            cells.append(idx_pair)
  all_cells.update(cells)

This version doesn't save the actual coordinates though (just the indices) but yeah

@tm-kah-alforja tm-kah-alforja added documentation Improvements or additions to documentation and removed enhancement New feature or request labels Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants