# Spatial Indices in GeoParquet

This notebook explains the different spatial optimization options:
- Bbox columns (required for performance)
- Hilbert sorting (recommended)
- H3, Quadkey, and KD-tree columns (for analysis)

**Important**: GeoParquet doesn't have a traditional spatial index such as an R-tree. Performance comes from Parquet row group statistics (per-column min/max values) combined with spatial sorting.

In [None]:
import geoparquet_io as gpio

## The Bbox Column: Foundation for Performance

The bbox column stores each feature's bounding box as a struct with `xmin`, `ymin`, `xmax`, and `ymax` fields. Combined with GeoParquet 1.1's `covering` metadata, this lets query engines skip row groups whose bounding boxes don't intersect the query area.

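**Critical**: Bbox columns only provide performance benefits when data is spatially sorted. In an unsorted file, each row group mixes features from all over the dataset, so its min/max range covers almost everything and nothing can be skipped.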
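
To make the point above concrete, here is a minimal pure-Python sketch of row group skipping. It does not use `geoparquet_io` or read any file: `BBox`, `row_group_extent`, and `groups_to_read` are hypothetical helpers that only mimic what a reader can infer from per-row-group min/max statistics on the bbox fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        return (
            self.xmin <= other.xmax and self.xmax >= other.xmin
            and self.ymin <= other.ymax and self.ymax >= other.ymin
        )


def row_group_extent(boxes):
    # Parquet keeps min/max statistics per column and row group, so the most
    # a reader can infer is the union of all feature bboxes in the group.
    return BBox(
        min(b.xmin for b in boxes),
        min(b.ymin for b in boxes),
        max(b.xmax for b in boxes),
        max(b.ymax for b in boxes),
    )


def groups_to_read(row_groups, query):
    """Indices of row groups whose extent intersects the query box."""
    return [
        i for i, boxes in enumerate(row_groups)
        if row_group_extent(boxes).intersects(query)
    ]


# Two features in the west, two in the east (made-up coordinates).
west = [BBox(0, 0, 1, 1), BBox(1, 1, 2, 2)]
east = [BBox(100, 0, 101, 1), BBox(101, 1, 102, 2)]
query = BBox(0.5, 0.5, 1.5, 1.5)

sorted_groups = [west, east]                             # spatially sorted
mixed_groups = [[west[0], east[0]], [west[1], east[1]]]  # interleaved

print(groups_to_read(sorted_groups, query))
print(groups_to_read(mixed_groups, query))
```

With the sorted layout only row group 0 has to be read; with the interleaved layout both groups span west to east, so both must be read even though the query touches only western features.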

In [None]:
# Add bbox column
table = gpio.read("data/sample.parquet")
with_bbox = table.add_bbox()

print(f"Columns before: {table.column_names}")
print(f"Columns after: {with_bbox.column_names}")

## Hilbert Sorting: Making Bbox Useful

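Hilbert sorting reorders rows along a Hilbert space-filling curve so that features close together on the map are stored close together in the file. Each row group then covers a compact area, which makes its bbox statistics meaningful for spatial queries.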
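
As an illustration of the idea, the sketch below computes a Hilbert curve position for a longitude/latitude pair and sorts a few points by it. This is the textbook grid-based algorithm, not necessarily what `sort_hilbert()` does internally; `hilbert_index`, `hilbert_key`, and the `order` parameter are names assumed for this example.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(lon: float, lat: float, order: int = 16) -> int:
    """Map lon/lat onto a global grid, then onto the curve (assumed scaling)."""
    n = 1 << order
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return hilbert_index(x, y, order)


# Interleaved input: San Francisco area and Paris area alternate.
points = [(-122.4, 37.8), (2.35, 48.85), (-122.3, 37.9), (2.29, 48.86)]
ordered = sorted(points, key=lambda p: hilbert_key(*p))
print(ordered)
```

After sorting, the two San Francisco points sit next to each other, followed by the two Paris points, which is the clustering that row group statistics rely on.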

In [None]:
# The recommended combination: bbox + Hilbert sort
optimized = gpio.read("data/sample.parquet").add_bbox().sort_hilbert()

optimized.info()

## H3 Hexagonal Cells

H3 divides the world into hexagonal cells at multiple resolutions. Use for:
- Aggregating data to a hexagonal grid
- Joining datasets with a common H3 key
- Uniform global coverage analysis

In [None]:
# Add H3 column
with_h3 = table.add_h3(resolution=9)  # ~174 m average edge length
print(f"Columns: {with_h3.column_names}")

In [None]:
# H3 resolution guide (approximate average edge length):
# Resolution 0: ~1,107 km edge (continental scale)
# Resolution 5: ~8.5 km edge (city scale)
# Resolution 9: ~174 m edge (neighborhood scale)
# Resolution 12: ~9 m edge (building scale)
# Resolution 15: ~0.5 m edge (sub-meter)

# Use a custom column name to keep several resolutions apart
coarse_h3 = table.add_h3(resolution=6, column_name="h3_coarse")
fine_h3 = table.add_h3(resolution=11, column_name="h3_fine")

## Quadkey Tiles

Quadkeys identify web map tiles at specific zoom levels. Use for:
- Web mapping integration
- Tile-based workflows
- Zoom-level aligned analysis
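
For readers unfamiliar with the encoding, the following sketch derives a quadkey from a longitude/latitude pair using the standard Web Mercator tile scheme. It is independent of `geoparquet_io`; `lonlat_to_tile`, `tile_to_quadkey`, and the `zoom` parameter are names chosen for this example.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Web Mercator tile (x, y) containing the point at the given zoom."""
    lat = max(min(lat, 85.05112878), -85.05112878)  # Mercator latitude limit
    n = 1 << zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return min(x, n - 1), min(y, n - 1)


def tile_to_quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of x and y into one base-4 digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def quadkey(lon: float, lat: float, zoom: int) -> str:
    return tile_to_quadkey(*lonlat_to_tile(lon, lat, zoom), zoom)


print(tile_to_quadkey(3, 5, 3))
print(quadkey(-122.4, 37.8, 6), quadkey(-122.4, 37.8, 12))
```

Tile (3, 5) at zoom 3 yields `213`. Because each extra zoom level appends one digit, the zoom 12 key of a point starts with its zoom 6 key, which is what makes quadkeys convenient for zoom-aligned grouping.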

In [None]:
# Add quadkey column (resolution corresponds to the tile zoom level)
with_quadkey = table.add_quadkey(resolution=12)
print(f"Columns: {with_quadkey.column_names}")

## KD-tree Cells

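KD-tree cells are a data-adaptive partitioning: space is split recursively according to where the data actually lies, so each cell ends up with a similar number of features. Use for:
- Highly clustered data
- Creating balanced partition sizes
- Cases where H3 or quadkey cells produce very uneven partitions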
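
The balancing effect can be sketched with a recursive median split. This is a generic KD-tree style partitioning written for illustration, not the library's implementation; `kdtree_labels` is a hypothetical helper, and the points are synthetic.

```python
import random
from collections import Counter


def kdtree_labels(points, iterations):
    """Label each point with a binary string built from median splits."""
    labels = [""] * len(points)

    def split(indices, depth, prefix):
        if depth == iterations:
            for i in indices:
                labels[i] = prefix
            return
        axis = depth % 2  # alternate between x and y
        ordered = sorted(indices, key=lambda i: points[i][axis])
        mid = len(ordered) // 2
        split(ordered[:mid], depth + 1, prefix + "0")
        split(ordered[mid:], depth + 1, prefix + "1")

    split(list(range(len(points))), 0, "")
    return labels


# Synthetic, heavily clustered data: 56 points in one spot, 8 scattered.
rng = random.Random(0)
dense = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(56)]
sparse = [(rng.uniform(-50, 50), rng.uniform(-50, 50)) for _ in range(8)]
points = dense + sparse

labels = kdtree_labels(points, iterations=3)
sizes = Counter(labels)
print(len(sizes), set(sizes.values()))
```

Three iterations give 2**3 = 8 partitions with exactly 8 points each, despite the clustering. A fixed grid over the same data would put most points into a single cell.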

In [None]:
# Add KD-tree column
with_kdtree = table.add_kdtree(iterations=6)  # 2**6 = 64 partitions
print(f"Columns: {with_kdtree.column_names}")

## Combining Multiple Indices

Index columns are independent of one another, so a single file can carry several of them for different purposes.

In [None]:
# Add multiple indices for analysis workloads
enriched = (
    gpio.read("data/sample.parquet")
    .add_bbox()
    .add_h3(resolution=9)
    .add_quadkey(resolution=12)
    .sort_hilbert()
)

print(f"Columns: {enriched.column_names}")

## Understanding What Each Provides

| Component | Purpose | Improves Query Performance? |
|-----------|---------|-----------------------------|
| **Bbox column** | Row group filtering | Yes (with sorting) |
| **Hilbert sort** | Cluster nearby features | Yes (makes bbox useful) |
| **H3 column** | Aggregation/joins | No (for analysis) |
| **Quadkey column** | Tile workflows | No (for analysis) |
| **KD-tree column** | Balanced partitioning | No (for partitioning) |

## Recommended Workflow

For good query performance, the minimum is a bbox column plus a Hilbert sort:

In [None]:
# Minimum for good performance
gpio.read("data/sample.parquet").add_bbox().sort_hilbert().write("/tmp/optimized.parquet")

print("Optimized file written!")

## Next Steps

- [04_partitioning.ipynb](04_partitioning.ipynb) - Splitting data into multiple files
- [Spatial Performance Guide](https://geoparquet.io/concepts/spatial-indices/) - Detailed explanation