Optimizing grid tile generation #35

joshuacortez · 2022-06-13T07:05:13Z

I think there’s room for further optimization especially for generate_grids in the GridGenerator class. Right now the grid tiles are first generated across the entire span of xrange and yrange and then filtered out after. While this isn’t an issue for very coarse grids, this can easily run into runtime and memory issues for fine grids.

Instead of generating all the tiles and then filtering after, we can generate only the grid tiles we need.

This drastically saves on memory since our gdf won’t need to store tiles it doesn’t need.
This saves on runtime since we don’t need to generate unnecessary tiles, and don’t need to reproject unnecessary tiles.

To determine which grid tiles to generate in the first place, we can use the cheapest possible geometric operations.

Looping through xrange and yrange and intersecting tiles with the gdf’s unary_union can be expensive since the unary_union is a single geometry most likely has a large number of points.
We can instead generate tiles per polygon within the gdf. (If there are multipolygons we can explode to polygons). Per polygon, we can:
1. Find the bounding box of the polygon. (should be cheap)
2. Determine which tiles are within the polygon’s bounding box. (should be cheap since you don’t need to use shapely operations)
3. Check which tiles are within the polygon itself by doing an intersection. (this is the most computationally expensive part but mitigated since we have less tiles to compare with, and a single polygon should have less points than the unary union)
4. Avoid generating tiles that have already been generated. (can be a quick Python set lookup) This is relevant in cases where two polygons share the same tiles.

Can make a PR for this too!

The text was updated successfully, but these errors were encountered:

butchtm · 2022-06-13T08:01:07Z

@joshuacortez
suggestion: this should be done in the context of the implementation of #33 as this will affect what tiles will be generated as well.

jtmiclat · 2022-06-13T09:14:06Z

@joshuacortez For the workflow you mention, I have a few questions.

How would we align the polygons that have been generated already with newly generated ones?
How would we handle creating x_idx/y_idx columns which is useful if either are near each other?

joshuacortez · 2022-06-13T15:51:25Z

How would we align the polygons that have been generated already with newly generated ones?

Ahh are you referring to the risk of having square cells whose corners don't align with each other? Good point. This should be addressed by having a consistently defined overall bounding box + resolution (e.g. bounding box of the entire country with x-meter resolution cells).

With an x,y coordinate system being based on one overall bounding box + resolution, each polygon in the for-loop in the original post should be referring to the same set of cells (i.e. they should be aligned by default).

How would we handle creating x_idx/y_idx columns which is useful if either are near each other?

Oh what did you mean by this?

I imagine the x_idx/y_idx of each cell is based on the overall bounding box + resolution. I.e. cell (0,0) is in one corner of the country bounding box. Let me know if I understood it correctly.

The questions are very related to #33 since the overall bounding box can optionally be set for coordinate system consistency across different AOIs.

jtmiclat · 2022-06-13T23:50:54Z

Ah, x_idx and y_idx are the columns and row numbers in reference to the origin. It is unclear to me how we can derive these based on the bounding box without generating all the grids of a country.

joshuacortez · 2022-06-14T14:44:26Z

@jtmiclat yep! There might be different ways of implementing it.

One way is to have a lookup table for the x-axis and a lookup table for the y-axis. Each lookup table can be a pandas series. The indices are the column numbers and the row numbers, and the corresponding values are the reprojected coordinates (meter-based). The lookup tables are a "compressed" representation of representing the SquareGrid since you don't have to take the cartesian product of the two axes.

E.g. IDN x-axis lookup table in reference to country bounding box using epsg:23845 with 100m spacing

x_idx	x_coord
0	-5302629
1	-5302529
2	-5302429

With these lookup tables, it's easy to get cells with respect to a bounding box since it only relies on checking which coordinates are within the xmin, xmax, ymin, ymax of the bounding box. Might also be a good idea to pad the borders for good measure. (i.e. xmax = xmax + spacing, xmin = xmin - spacing, etc.)

Just not sure how this would look like for H3Grid or S2Grid though. I also acknowledge we're introducing new objects like the 2 lookup tables here but might be a good tradeoff in exchange for a lot of efficiency. Let me know what you think!

jtmiclat · 2022-06-15T13:30:28Z

I see, I kinda get what #33 is. We have a Class or Datatype we can pass that would make it easier to reference

Some pseudo-code

grid_bounds = SquareGridBound(x_min, y_min, x_max, y_max)
# gets the the x_range and y_range of aoi based in reference of grid_bounds 
x_idx_start, x_range, y_idx_start,  y_range = grid_bounds.get_ranges(aoi.bounds)
cells = []
for x_idx, x_coord  in enumerate(x_range):
    for y_idx, y_coord  in enumerate(y_range):
       cells.append({'x_idx': x_idx+x_idx_start, 'y_idx': y_idx+y_idx_start, ....})

might replace

geowrangler/geowrangler/grids.py

Lines 70 to 74 in 86561e6

    
           for x_idx, x in enumerate(xrange): 
        
               for y_idx, y in enumerate(yrange): 
        
                   polygons.append( 
        
                       {"x": x_idx, "y": y_idx, "geometry": self.create_grid(x, y)} 
        
                   )

And a possible usage within SquareGridGenerator

grid_bounds = GridBound(x_min, y_min, x_max, y_max)
aoi1_grid = SquareGridGenerator(aoi_1, grid_bounds).generate_grid()
aoi2_grid = SquareGridGenerator(aoi_2, grid_bounds).generate_grid()

Regarding H3Grid and S2Grid, we don't need to worry about them as they are designed to have a unique index in the context of the entire world. We should only be facing this issue when using a custom index or grid.

joshuacortez · 2022-06-16T01:48:46Z

I see, the SquareGridBound class sounds good!

I was just thinking if SquareGridGenerator takes in not just 1 AOI, but a list of AOIs? (I'm assuming an AOI is just a single polygon and not a multipolygon but not sure it's the same in geowrangler).

In my head, the for-loop pseudo code above could look like the for-loop here for multiple polygons

all_cells = set()
for aoi in aois:
  x_idx_start, x_range, y_idx_start,  y_range = grid_bounds.get_ranges(aoi.bounds)
  cells = []
  for x_idx, x_coord  in enumerate(x_range):
    for y_idx, y_coord  in enumerate(y_range):
        idx_pair = (x_idx + x_idx_start, y_idx + y_idx_start)
        # don't include cells that are already included from other AOIs
        if idx_pair not in all_cells:
            cells.append(idx_pair)
  all_cells.update(cells)

This version doesn't save the actual coordinates though (just the indices) but yeah

joshuacortez mentioned this issue Jun 13, 2022

Decoupling the bounding box from the AOIs #33

Closed

jtmiclat added the enhancement New feature or request label Jun 14, 2022

tm-kah-alforja mentioned this issue Jun 14, 2022

Grid Tile Generation #40

Closed

9 tasks

jtmiclat self-assigned this Jun 16, 2022

tm-kah-alforja added documentation Improvements or additions to documentation and removed enhancement New feature or request labels Jun 23, 2022

jtmiclat mentioned this issue Jun 27, 2022

Add documentation for distant AOIs for grid generation #55

Merged

butchtm closed this as completed in #55 Jul 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizing grid tile generation #35

Optimizing grid tile generation #35

joshuacortez commented Jun 13, 2022 •

edited

butchtm commented Jun 13, 2022

jtmiclat commented Jun 13, 2022

joshuacortez commented Jun 13, 2022 •

edited

jtmiclat commented Jun 13, 2022 •

edited

joshuacortez commented Jun 14, 2022 •

edited

jtmiclat commented Jun 15, 2022 •

edited

joshuacortez commented Jun 16, 2022 •

edited

Optimizing grid tile generation #35

Optimizing grid tile generation #35

Comments

joshuacortez commented Jun 13, 2022 • edited

butchtm commented Jun 13, 2022

jtmiclat commented Jun 13, 2022

joshuacortez commented Jun 13, 2022 • edited

jtmiclat commented Jun 13, 2022 • edited

joshuacortez commented Jun 14, 2022 • edited

jtmiclat commented Jun 15, 2022 • edited

joshuacortez commented Jun 16, 2022 • edited

joshuacortez commented Jun 13, 2022 •

edited

joshuacortez commented Jun 13, 2022 •

edited

jtmiclat commented Jun 13, 2022 •

edited

joshuacortez commented Jun 14, 2022 •

edited

jtmiclat commented Jun 15, 2022 •

edited

joshuacortez commented Jun 16, 2022 •

edited