# Partitioning GeoParquet Files

This notebook covers partitioning large datasets:
- When to partition vs. sort
- Partitioning by H3 cells
- Partitioning by quadkey
- Reading partitioned datasets

**Note**: Partitioning is only effective when each partition holds at least 100 rows. The examples below show the API patterns; use them with your own larger datasets.

In [None]:
import geoparquet_io as gpio

## When to Partition vs. Sort

| Approach | File Structure | Best For |
|----------|---------------|----------|
| **Sorting only** | Single file | Datasets under 10GB, most queries |
| **Partitioning** | Many files | Very large datasets (10GB+), known query patterns |

Partitioning adds overhead (more files, metadata per file) but enables:
- Query engines to skip entire files
- Parallel reads from multiple files
- Easier updates to specific regions
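The size guideline in the table above can be sketched as a small helper. This is an illustration only: `recommend_layout` and `recommend_layout_for_file` are hypothetical names, not part of `geoparquet_io`, and the 10 GB cutoff is a rule of thumb rather than a hard limit.

```python
from pathlib import Path

# Rule-of-thumb cutoff from the table above (assumed here to mean 10 GiB).
PARTITION_THRESHOLD_BYTES = 10 * 1024**3


def recommend_layout(size_bytes: int, threshold: int = PARTITION_THRESHOLD_BYTES) -> str:
    """Return 'partition' for datasets at or above the threshold, else 'sort'."""
    return "partition" if size_bytes >= threshold else "sort"


def recommend_layout_for_file(path: str) -> str:
    """Apply the same rule to a file on disk, using its size in bytes."""
    return recommend_layout(Path(path).stat().st_size)


print(recommend_layout(2 * 1024**3))   # 2 GiB dataset  -> sort
print(recommend_layout(40 * 1024**3))  # 40 GiB dataset -> partition
```

File size is only one input. Known query patterns, such as frequent per-region reads, can justify partitioning a smaller dataset.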

In [None]:
# Load sample data
table = gpio.read("data/sample.parquet")
table.info()

## Partitioning by H3 Cells

In [None]:
# Partition by H3 (requires larger datasets)
# Uncomment to run with your data:

# from pathlib import Path
# import shutil
#
# output_dir = Path('/tmp/h3_partitions')
# if output_dir.exists():
#     shutil.rmtree(output_dir)
#
# stats = gpio.read('your_large_file.parquet') \
#     .add_h3(resolution=9) \
#     .partition_by_h3(str(output_dir), resolution=4)
#
# print(f"Created {stats['file_count']} partition files")

In [None]:
# See what was created (after running partition above)
# for f in sorted(output_dir.rglob('*.parquet'))[:5]:
#     print(f.relative_to(output_dir))

## Partitioning by Quadkey

In [None]:
# Partition by quadkey (requires larger datasets)
# Uncomment to run with your data:

# from pathlib import Path
# import shutil
#
# output_dir = Path('/tmp/quadkey_partitions')
# if output_dir.exists():
#     shutil.rmtree(output_dir)
#
# stats = gpio.read('your_large_file.parquet') \
#     .add_quadkey(resolution=12) \
#     .partition_by_quadkey(str(output_dir), partition_resolution=4)
#
# print(f"Created {stats['file_count']} partition files")

In [None]:
# See what was created (after running partition above)
# for f in sorted(output_dir.rglob('*.parquet'))[:5]:
#     print(f.relative_to(output_dir))

## Reading Partitioned Datasets

In [None]:
# Read all partitions back (after running partition above)
# combined = gpio.read_partition('/tmp/h3_partitions/')
# combined.info()

In [None]:
# Read with glob pattern
# combined = gpio.read_partition('/tmp/h3_partitions/h3_cell=*/*.parquet')
# print(f"Read {combined.num_rows} rows")
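To inspect a partitioned directory before reading it, the files can be grouped by partition value using only the standard library. This sketch assumes the Hive-style `h3_cell=<value>/` directory layout implied by the glob pattern above. `list_hive_partitions` is a hypothetical helper, not a `geoparquet_io` function.

```python
from collections import defaultdict
from pathlib import Path


def list_hive_partitions(root, key="h3_cell"):
    """Map each partition value to the Parquet files found under root/key=<value>/."""
    prefix = f"{key}="
    partitions = defaultdict(list)
    for path in sorted(Path(root).glob(f"{key}=*/*.parquet")):
        # The parent directory is named like 'h3_cell=<value>'; strip the prefix.
        value = path.parent.name[len(prefix):]
        partitions[value].append(path)
    return dict(partitions)


# Usage (after running the H3 partition cell above):
# partitions = list_hive_partitions('/tmp/h3_partitions')
# for value, files in partitions.items():
#     print(value, len(files))
```

Knowing which partition values exist makes it easier to write a narrower glob and read only the cells a query needs.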

## Best Practices

1. **Sort before partitioning**: Add a bbox column and sort the data spatially (for example with a Hilbert sort) so rows within each partition file stay spatially clustered
2. **Choose appropriate resolution**: Aim for partition files of 50-200MB each
3. **Use coarser resolution for partitioning**: The partition column can be at a different resolution than the index column
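The 50-200MB target translates into a rough range for the number of partition files. The sketch below is plain arithmetic on the total dataset size. `partition_count_range` is a hypothetical helper, and it assumes data is spread evenly across partitions, which real spatial data rarely is.

```python
import math

MB = 1024**2


def partition_count_range(total_bytes, min_file_mb=50, max_file_mb=200):
    """Return (fewest, most) file counts that keep the average file size in range."""
    # Fewest files: every file is as large as allowed.
    fewest = max(1, math.ceil(total_bytes / (max_file_mb * MB)))
    # Most files: every file is as small as allowed.
    most = max(fewest, total_bytes // (min_file_mb * MB))
    return fewest, most


# A 20 GiB dataset fits the target with roughly 103 to 409 files.
print(partition_count_range(20 * 1024**3))  # (103, 409)
```

If the chosen partition resolution produces far more files than the upper bound, a coarser resolution is likely a better fit. Far fewer files than the lower bound suggests a finer one.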

In [None]:
# Recommended workflow for large datasets
# Uncomment to run with your data:

# from pathlib import Path
# import shutil
#
# output_dir = Path('/tmp/optimized_partitions')
# if output_dir.exists():
#     shutil.rmtree(output_dir)
#
# # Add bbox and a fine-grained H3 index, sort by Hilbert, then partition at a coarse level
# stats = gpio.read('your_large_file.parquet') \
#     .add_bbox() \
#     .add_h3(resolution=9) \
#     .sort_hilbert() \
#     .partition_by_h3(str(output_dir), resolution=4)
#
# print(f"Created {stats['file_count']} optimized partition files")

## Cleanup

In [None]:
# Clean up temporary directories (after running partitions)
# from pathlib import Path
# import shutil
#
# for d in ['/tmp/h3_partitions', '/tmp/quadkey_partitions', '/tmp/optimized_partitions']:
#     p = Path(d)
#     if p.exists():
#         shutil.rmtree(p)
#         print(f"Removed {d}")

## Next Steps

- [05_cloud_workflows.ipynb](05_cloud_workflows.ipynb) - Cloud storage integration
- [Partition Guide](https://geoparquet.io/guide/partition/) - CLI partitioning options