# Geospatial Matching Optimization

This project demonstrates how to speed up a polygon-matching procedure. Two main areas could benefit from a "geospatial-native" approach:

1. Data encoding, fetching, and storage
2. Matching

We will generate synthetic polygon data, run both a non-geospatial-native and a geospatial-native approach on it, and compare their execution times.

Note: Make sure that you have PostgreSQL running.

## Imports

In [847]:
import geopandas as gpd
import pandas as pd
import shapely
import h3
import time
import helpers as h
import matplotlib.pyplot as plt
from shapely.wkt import dumps, loads
from pathlib import Path
import multiprocessing as mp

# ensure helpers is loaded correctly
import importlib
importlib.reload(h)

<module 'helpers' from '/Users/sra/files/projects/matching_optimization/helpers.py'>

## Retrieve Generated Test Data

In [None]:
prev_month_blobs = gpd.read

In [848]:
# Create two blobs of data
prev_month_blobs = h.generate_random_polygons(n=10000)
curr_month_blobs = h.generate_random_polygons(n=10000)

Function `generate_random_polygons` executed in 7.7752 sec, CPU: 8.20%, Memory: 1158.61MB
Function `generate_random_polygons` executed in 7.8338 sec, CPU: 26.80%, Memory: 44.73MB
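
The `Function ... executed in ...` lines above report per-call timing. The code that emits them lives in `helpers.py` and is not shown here, so the following is only a sketch of how such logging could be done with a decorator. It covers wall-clock time only; the CPU and memory figures would need extra tooling. `timed` and `slow_sum` are hypothetical names.

```python
import functools
import time


def timed(func):
    """Print the wall-clock time of each call to the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"Function `{func.__name__}` executed in {elapsed:.4f} sec")
        return result
    return wrapper


@timed
def slow_sum(n):
    # Hypothetical workload, used only to exercise the decorator.
    return sum(range(n))


total = slow_sum(1_000_000)
```

Because the decorator returns the wrapped function's result unchanged, it can be added to or removed from a helper without touching the calling code.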


In [849]:
prev_month_blobs.head()

Unnamed: 0,geometry,id,geohash
1481,"POLYGON ((-98.33318 26.85044, -98.33318 26.883...",fddab117-f6ab-4881-8127-128ac4ae58b9,9uf0g4
7075,"POLYGON ((-98.40704 26.96788, -98.40704 27.030...",bde1cd5c-8ac8-475c-b84b-c5cb93266430,9uf195
3128,"POLYGON ((-98.53842 27.13548, -98.53842 27.222...",6853fb1d-7c65-420e-be8c-16c7a8ecf600,9ucfwe
2845,"POLYGON ((-98.37208 27.25487, -98.37208 27.304...",80103b0a-b02f-4601-85f1-869910b993ac,9uf51y
7152,"POLYGON ((-98.61008 27.28281, -98.61008 27.293...",23a91e74-7db6-4115-abee-86ca0141cea3,9ucghr


In [850]:
curr_month_blobs.head()

Unnamed: 0,geometry,id,geohash
9272,"POLYGON ((-81.14114 27.05090, -81.14114 27.140...",473effa9-7024-4216-9b36-1e5b2d161626,dhy64s
8705,"POLYGON ((-81.16978 27.28284, -81.16978 27.327...",ac7c1af1-6313-4b1d-89a0-c2511dc92620,dhy734
9306,"POLYGON ((-81.45964 27.40301, -81.45964 27.498...",12a411d1-fb8a-482a-964d-6044de891a2b,dhyh4t
5612,"POLYGON ((-81.07037 27.58944, -81.07037 27.599...",d21a8f9e-2ef9-4a76-a87b-696b2981b7fd,dhykgz
8262,"POLYGON ((-81.47784 27.73694, -81.47784 27.824...",6cbd5348-7fb7-48b5-b1e3-24c04f942f1b,dhyn41
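
The `geohash` column holds a six-character geohash for each polygon. `helpers.py` is not shown, so what follows is a sketch of the standard geohash encoding, not the project's implementation; `geohash_encode` is a hypothetical name. Hashing the centre of the first `prev_month_blobs` rectangle (corner coordinates taken from the full WKT printed later in this notebook) reproduces the `9uf0g4` value stored in that row, which suggests, but does not confirm, that the helper hashes each polygon's centroid.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# Corners of the first prev_month_blobs rectangle.
min_lon, max_lon = -98.333176, -98.275705
min_lat, max_lat = 26.850443, 26.883512
center_hash = geohash_encode((min_lat + max_lat) / 2, (min_lon + max_lon) / 2)
print(center_hash)  # 9uf0g4
```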


In [851]:
# make data folder
Path('data/').mkdir(exist_ok=True)

# save as geoparquet
prev_month_blobs.to_parquet('data/prev_month_blobs.parquet')
curr_month_blobs.to_parquet('data/curr_month_blobs.parquet')

In [852]:
# range_ = 10000

# fig, ax = plt.subplots(figsize=(5, 5))
# prev_month_blobs.iloc[0:range_].plot(ax=ax)
# ax.set_title(f'Viewing the first {range_} blobs')
# plt.tight_layout()

# plt.show()


## Non-optimized process

### 1. Data encoding, storage, and fetching

We will use PostgreSQL and string storage for geographic information.

#### Make copy of polygon layer and convert to non-geospatial-native string (WKT) datatype

In [854]:
# copy the gdfs
prev_month_blobs_wkt = prev_month_blobs.copy()
curr_month_blobs_wkt = curr_month_blobs.copy()

In [855]:
# Convert WKT versions to strings
dfs_to_convert = [prev_month_blobs_wkt, curr_month_blobs_wkt]
prev_month_blobs_wkt, curr_month_blobs_wkt = [h.convert_col_to_string(df) for df in dfs_to_convert]

# Check result
print(prev_month_blobs_wkt.head())

Function `convert_col_to_string` executed in 0.0498 sec, CPU: 42.90%, Memory: 3.64MB
Function `convert_col_to_string` executed in 0.0481 sec, CPU: 0.00%, Memory: 0.06MB
                                               geometry  \
1481  POLYGON ((-98.333176 26.850443, -98.333176 26....   
7075  POLYGON ((-98.407035 26.967882, -98.407035 27....   
3128  POLYGON ((-98.538421 27.135484, -98.538421 27....   
2845  POLYGON ((-98.372078 27.25487, -98.372078 27.3...   
7152  POLYGON ((-98.610077 27.282811, -98.610077 27....   

                                        id geohash  
1481  fddab117-f6ab-4881-8127-128ac4ae58b9  9uf0g4  
7075  bde1cd5c-8ac8-475c-b84b-c5cb93266430  9uf195  
3128  6853fb1d-7c65-420e-be8c-16c7a8ecf600  9ucfwe  
2845  80103b0a-b02f-4601-85f1-869910b993ac  9uf51y  
7152  23a91e74-7db6-4115-abee-86ca0141cea3  9ucghr  




#### Save in PostgreSQL database

This is a simplified version of what a "non-geospatial-native" data ingestion pipeline may look like. Its defining trait is that the polygons are stored as strings rather than in a spatial datatype.

In [856]:
# Convert to tuples
prev_month_blobs_wkt = h.df_itertuple(prev_month_blobs_wkt)
curr_month_blobs_wkt = h.df_itertuple(curr_month_blobs_wkt)

Function `df_itertuple` executed in 0.0033 sec, CPU: 0.00%, Memory: 0.00MB
Function `df_itertuple` executed in 0.0027 sec, CPU: 0.00%, Memory: 0.00MB


Create the PostgreSQL database if it doesn't already exist. We use the default settings; to adjust them, refer to [`helpers.py`](helpers.py).

In [857]:
h.create_pg_db()

Database blob_matching already exists.
Function `create_pg_db` executed in 0.0263 sec, CPU: 0.00%, Memory: 0.00MB


Create and insert into tables:

In [858]:
prev_month_blobs.head(1)

Unnamed: 0,geometry,id,geohash
1481,"POLYGON ((-98.33318 26.85044, -98.33318 26.883...",fddab117-f6ab-4881-8127-128ac4ae58b9,9uf0g4


In [859]:
prev_month_blobs_wkt[0]

('POLYGON ((-98.333176 26.850443, -98.333176 26.883512, -98.275705 26.883512, -98.275705 26.850443, -98.333176 26.850443))',
 'fddab117-f6ab-4881-8127-128ac4ae58b9',
 '9uf0g4')

In [860]:
h.create_pg_table(table_name='prev_blobs_wkt', data=prev_month_blobs_wkt, truncate=True)
h.create_pg_table(table_name='curr_blobs_wkt', data=curr_month_blobs_wkt, truncate=True)

Table prev_blobs_wkt truncated.
Inserted 10000 records into prev_blobs_wkt.
Function `create_pg_table` executed in 0.4013 sec, CPU: 0.00%, Memory: 0.00MB
Table curr_blobs_wkt truncated.
Inserted 10000 records into curr_blobs_wkt.
Function `create_pg_table` executed in 0.3766 sec, CPU: 29.00%, Memory: 39.94MB


Retrieve data as GeoDataFrames to confirm that it worked:

In [861]:
df_prev = h.retrieve_pg_table(table_name='prev_blobs_wkt')
df_curr = h.retrieve_pg_table(table_name='curr_blobs_wkt')

Retrieved 10000 records from prev_blobs_wkt.
Function `retrieve_pg_table` executed in 0.0460 sec, CPU: 3.90%, Memory: 1.58MB
Retrieved 10000 records from curr_blobs_wkt.
Function `retrieve_pg_table` executed in 0.0436 sec, CPU: 0.00%, Memory: 0.00MB


Compare the tables before and after for a sanity check:

In [862]:
df_prev.head()

Unnamed: 0,geometry,id,geohash
0,"POLYGON ((-98.333176 26.850443, -98.333176 26....",fddab117-f6ab-4881-8127-128ac4ae58b9,9uf0g4
1,"POLYGON ((-98.407035 26.967882, -98.407035 27....",bde1cd5c-8ac8-475c-b84b-c5cb93266430,9uf195
2,"POLYGON ((-98.538421 27.135484, -98.538421 27....",6853fb1d-7c65-420e-be8c-16c7a8ecf600,9ucfwe
3,"POLYGON ((-98.372078 27.25487, -98.372078 27.3...",80103b0a-b02f-4601-85f1-869910b993ac,9uf51y
4,"POLYGON ((-98.610077 27.282811, -98.610077 27....",23a91e74-7db6-4115-abee-86ca0141cea3,9ucghr


In [863]:
prev_month_blobs.head()

Unnamed: 0,geometry,id,geohash
1481,"POLYGON ((-98.33318 26.85044, -98.33318 26.883...",fddab117-f6ab-4881-8127-128ac4ae58b9,9uf0g4
7075,"POLYGON ((-98.40704 26.96788, -98.40704 27.030...",bde1cd5c-8ac8-475c-b84b-c5cb93266430,9uf195
3128,"POLYGON ((-98.53842 27.13548, -98.53842 27.222...",6853fb1d-7c65-420e-be8c-16c7a8ecf600,9ucfwe
2845,"POLYGON ((-98.37208 27.25487, -98.37208 27.304...",80103b0a-b02f-4601-85f1-869910b993ac,9uf51y
7152,"POLYGON ((-98.61008 27.28281, -98.61008 27.293...",23a91e74-7db6-4115-abee-86ca0141cea3,9ucghr


In [864]:
def round_geometry(geom, precision=6):
    """Round all coordinates of a geometry to a given precision."""
    return shapely.wkt.loads(shapely.wkt.dumps(geom, rounding_precision=precision))

# Convert both to sets of rounded WKT strings
set_prev_month_blobs = set(prev_month_blobs['geometry'].apply(lambda g: round_geometry(g, precision=6).wkt))
set_df_prev = set(df_prev['geometry'].apply(lambda g: round_geometry(g, precision=6).wkt))

# Find common, missing, and extra geometries
common_geometries = set_prev_month_blobs & set_df_prev
missing_from_retrieved = set_prev_month_blobs - set_df_prev
extra_in_retrieved = set_df_prev - set_prev_month_blobs

# Print summary
print(f"Number of matching geometries: {len(common_geometries)}")
print(f"Missing geometries in retrieved table: {len(missing_from_retrieved)}")
print(f"Extra geometries in retrieved table: {len(extra_in_retrieved)}")

# Show an example missing/extra geometry for debugging
if missing_from_retrieved:
    print("Example missing record:", next(iter(missing_from_retrieved)))

if extra_in_retrieved:
    print("Example extra record:", next(iter(extra_in_retrieved)))

Number of matching geometries: 10000
Missing geometries in retrieved table: 0
Extra geometries in retrieved table: 0


It worked!

### 2. Matching

We will match the polygons using GeoPandas.

In [865]:
postgresql_details = h.pg_details()
h.run_parallel_matching(table_prev='prev_blobs_wkt', 
                        table_curr='curr_blobs_wkt', 
                        output_table='matched_results', 
                        postgresql_details=postgresql_details, 
                        db_name='blob_matching', 
                        num_workers=4, 
                        batch_size=100)

Table matched_results created successfully.
Retrieved 10000 records from prev_blobs_wkt.
Function `match_geometries` executed in 0.0045 sec, CPU: 13.40%, Memory: 0.72MB
Function `match_geometries` executed in 0.0067 sec, CPU: 29.10%, Memory: 0.72MB
Function `match_geometries` executed in 0.0043 sec, CPU: 79.90%, Memory: 0.12MB
Function `match_geometries` executed in 0.0054 sec, CPU: 18.20%, Memory: 0.11MB
Function `match_geometries` executed in 0.0060 sec, CPU: 60.90%, Memory: 4.98MB
Function `run_parallel_matching` executed in 42.3933 sec, CPU: 64.80%, Memory: 317.17MB


In [None]:
# testing logging for matching_geometries
df_prev = h._retrieve_pg_table(postgresql_details, 'blob_matching', "prev_blobs_wkt", log_enabled=True)
df_curr = h._retrieve_pg_table(postgresql_details, 'blob_matching', "curr_blobs_wkt", log_enabled=True)

# Use a small subset for testing
df_prev_sample = df_prev.sample(10, random_state=42)
df_curr_sample = df_curr.sample(10, random_state=42)

h.match_geometries(df_prev_sample, df_curr_sample)  # Check if this logs matches

Retrieved 10000 records from prev_blobs_wkt.
Retrieved 10000 records from curr_blobs_wkt.
Function `match_geometries` executed in 0.0052 sec, CPU: 21.40%, Memory: 0.47MB


[]

In [866]:
prev_month_blobs.to_file('data/prev_month_blobs.geojson')
curr_month_blobs.to_file('data/curr_month_blobs.geojson')

- Geohash each polygon into a larger region.
- Match blobs using multiprocessing (`mp.Process`) and GeoPandas.
- Find blobs from the previous month that do not have a match.
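
For the first note above, one property of geohashes helps: truncating a geohash gives the larger cell that contains it, so polygons can be bucketed into coarser regions by prefix. A minimal sketch using the five `prev_month_blobs` geohashes shown earlier; the function name and the prefix length of 3 are illustrative assumptions, not values taken from the project.

```python
from collections import defaultdict


def group_by_geohash_prefix(geohashes, prefix_len=3):
    """Bucket geohash strings by their leading characters (a coarser cell)."""
    groups = defaultdict(list)
    for gh in geohashes:
        groups[gh[:prefix_len]].append(gh)
    return dict(groups)


sample = ["9uf0g4", "9uf195", "9ucfwe", "9uf51y", "9ucghr"]
regions = group_by_geohash_prefix(sample)
print(regions)
# {'9uf': ['9uf0g4', '9uf195', '9uf51y'], '9uc': ['9ucfwe', '9ucghr']}
```

Each bucket could then be handed to a separate worker, since blobs in different coarse cells are unlikely to overlap except near cell borders.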

# 1 Imports and setup

```python
import os, sys, traceback
import argparse
import cv2
import time
import numpy as np
import pandas as pd
import geopandas as gpd
from skimage import io, color, measure
from tqdm import tqdm
from shapely import wkt
from shapely.geometry import Point, Polygon
from shapely.validation import make_valid
import PIL
from PIL import Image, ImageDraw, ImageEnhance
import uuid
import multiprocessing as mp
from datetime import datetime
from sqlalchemy import or_
from sqlmodel import SQLModel, Session, create_engine, select
from pathlib import Path
```

## Key takeaways
- Uses Pandas and GeoPandas for data processing.
- Uses Shapely for geospatial geometry operations.
- Uses multiprocessing (mp) to parallelize the blob-matching process.
- Uses SQLAlchemy and SQLModel for database operations.

# 2 High-level script overview
The script matches blobs (spatial objects) between two months and classifies them into business categories.

## General workflow
1. Fetch blobs from the previous and current months.
1. Convert the region of interest (state, city, county, or geohash) into a list of geohashes.
1. Run the parallelized matching process:
   - Process multiple geohashes at once using multiprocessing.
   - Find corresponding blobs from previous months for each geohash.
1. Classify blobs based on construction stage progression.
1. Identify blobs that are missing in the current month and "impute" them.
1. Save results into the database.

# 3 Blob classification logic

# 4 Blob matching

- Matches blobs between months using polygon intersections.
- Uses Shapely to validate geometries and check for overlaps.
- Returns matched blob IDs.
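
The matching function itself is not included here. As an illustration only: the synthetic polygons in the notebook above are axis-aligned rectangles, so an overlap test on their bounds is enough to decide whether two of them share area. The sketch below is pure Python; real, irregular geometries would need Shapely's validity and intersection checks instead. `Blob`, `overlaps`, and `match_blobs` are hypothetical names.

```python
from typing import NamedTuple


class Blob(NamedTuple):
    id: str
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def overlaps(a: Blob, b: Blob) -> bool:
    """True when two axis-aligned rectangles share a region of positive area."""
    return (a.min_x < b.max_x and b.min_x < a.max_x
            and a.min_y < b.max_y and b.min_y < a.max_y)


def match_blobs(prev, curr):
    """Return (prev_id, curr_id) for every overlapping pair (brute force)."""
    return [(p.id, c.id) for p in prev for c in curr if overlaps(p, c)]


prev = [Blob("p1", 0, 0, 2, 2), Blob("p2", 10, 10, 11, 11)]
curr = [Blob("c1", 1, 1, 3, 3), Blob("c2", 20, 20, 21, 21)]
matches = match_blobs(prev, curr)
print(matches)  # [('p1', 'c1')]
```

The brute-force loop compares every previous blob with every current blob, which is the cost that geohash bucketing or a spatial index is meant to reduce.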

## Multiprocessing Optimization
The script parallelizes blob matching as follows:
- Multiprocessing (`mp.Process`) is used to divide the dataset into smaller batches.
- Each batch of blobs is processed in parallel.
- This reduces runtime compared to processing everything in a single process.
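
A minimal sketch of the batching idea. It uses `multiprocessing.Pool` rather than raw `mp.Process` objects, which is a substitution made for brevity, and it matches 1-D intervals as a stand-in for polygons. `make_batches`, `match_batch`, and `run_parallel` are hypothetical names.

```python
import multiprocessing as mp


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def match_batch(task):
    """Match one batch of previous intervals against all current intervals."""
    prev_batch, curr = task
    return [(p_id, c_id)
            for p_id, p_lo, p_hi in prev_batch
            for c_id, c_lo, c_hi in curr
            if p_lo < c_hi and c_lo < p_hi]


def run_parallel(prev, curr, batch_size=2, num_workers=2):
    """Fan batches out to worker processes and flatten their results."""
    tasks = [(batch, curr) for batch in make_batches(prev, batch_size)]
    with mp.Pool(processes=num_workers) as pool:
        results = pool.map(match_batch, tasks)  # preserves task order
    return [pair for batch_result in results for pair in batch_result]


# Toy data: (id, low, high) intervals standing in for blobs.
prev = [("p1", 0, 2), ("p2", 5, 6), ("p3", 8, 9)]
curr = [("c1", 1, 3), ("c2", 8.5, 10)]
batches = make_batches(prev, batch_size=2)
print(len(batches))  # 2
```

`run_parallel(prev, curr)` should be called from a script's `if __name__ == "__main__":` block, because platforms that spawn rather than fork worker processes re-import the main module.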

# 5 Handling unmatched blobs
- Finds blobs from the previous month that do not have a match in the current month.
- These blobs are "imputed", meaning they are carried over into the new month.
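
The carry-over step can be sketched with plain Python data structures. The record layout, including the `month` and `imputed` fields, and both function names are assumptions made for illustration; the script's actual schema is not shown here.

```python
def find_unmatched(prev_ids, matches):
    """Return previous-month IDs that never appear on the left of a match."""
    matched_prev = {prev_id for prev_id, _ in matches}
    return [pid for pid in prev_ids if pid not in matched_prev]


def impute(prev_records, unmatched_ids, month):
    """Copy unmatched previous-month records into the new month, flagged as imputed."""
    wanted = set(unmatched_ids)
    return [{**rec, "month": month, "imputed": True}
            for rec in prev_records if rec["id"] in wanted]


prev_records = [
    {"id": "p1", "month": "2024-01"},
    {"id": "p2", "month": "2024-01"},
    {"id": "p3", "month": "2024-01"},
]
matches = [("p1", "c1"), ("p3", "c2")]

unmatched = find_unmatched([r["id"] for r in prev_records], matches)
carried_over = impute(prev_records, unmatched, month="2024-02")
print(unmatched)     # ['p2']
print(carried_over)  # [{'id': 'p2', 'month': '2024-02', 'imputed': True}]
```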

# 6 Main class to orchestrate process
- Loads previous and current month blobs.
- Converts input regions (state, city, county) into geohashes.

# Summary
- Parallel processing is used to match blobs.
- Blobs are classified based on construction stages.
- Unmatched blobs are imputed for continuity.
- The script writes results to a database.