# 01 — Explore Boundary Data

Use this notebook to load and inspect the enumeration area boundaries and control grid files from Google Drive.

**What this notebook covers:**
- Loading project configuration
- Reading boundary files (GeoJSON, GeoPackage) and the sampled control list (CSV) from Google Drive
- Inspecting schema, CRS, and geometry types
- Quick visualization of boundaries

## Setup

Add the project root to the Python path so we can import our `src` modules.

In [None]:
import sys
from pathlib import Path

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# Add project root to path
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

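# Google Drive data root
# NOTE: machine-specific path; the config-derived `data_dir` loaded below is the portable alternative
DATA_DIR = Path(r"G:\Shared drives\TZ-CCT_RUBEV-0825\Data\Data\0.4_listing_geospatial")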
print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir exists: {DATA_DIR.exists()}")

## Load Configuration

The config loader reads `config/config.yaml` and merges any local overrides from `config/config.local.yaml`. This way your local Google Drive path stays out of version control.
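
The source of `load_config` is not shown in this notebook, so the following is only a minimal sketch of how a base-plus-local-override merge could work. The helper `deep_merge` and the example keys (`paths`, `map`, `dpi`) are assumptions made for illustration, and plain dictionaries stand in for the parsed YAML files.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` take precedence over `base`.

    Nested dicts are merged recursively; any other value is replaced outright.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of config/config.yaml (shared, version-controlled)
base_config = {
    "paths": {"data_dir": "data", "output_dir": "outputs"},
    "map": {"dpi": 300},
}
# Hypothetical contents of config/config.local.yaml (machine-specific, git-ignored)
local_config = {"paths": {"data_dir": "/path/to/local/drive"}}

config_example = deep_merge(base_config, local_config)
print(config_example["paths"])
```

Only the overridden key changes; every other setting from the shared file is kept, which is why the local file can stay small.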

In [None]:
from src.utils.config_loader import load_config, get_data_dir, get_output_dir

config = load_config()
data_dir = get_data_dir(config)
output_dir = get_output_dir(config)

print(f"Data directory:   {data_dir}")
print(f"Output directory: {output_dir}")
print(f"Data dir exists:  {data_dir.exists()}")

## Explore the 5 km Study Area Grid

Load `Area_study_5km_grid.gpkg` and inspect its contents: columns, CRS, geometry types, and the first few rows. The grid is plotted further down, once the sample status has been attached.

In [None]:
# Load the 5 km study area grid
grid_5km_path = DATA_DIR / "01_input_data" / "boundaries" / "Area_study_5km_grid.gpkg"
grid_5km = gpd.read_file(grid_5km_path)

print(f"Shape:          {grid_5km.shape}")
print(f"CRS:            {grid_5km.crs}")
print(f"Geometry types: {grid_5km.geom_type.unique()}")
print(f"\nColumns: {list(grid_5km.columns)}")
print(f"\nDtypes:\n{grid_5km.dtypes}")
grid_5km.head(10)


In [None]:
# Load the list of sampled control areas
control_sampled_path = DATA_DIR / "01_input_data" / "boundaries" / "Rubeho_control_areas_sampled.csv"
control_sampled = pd.read_csv(control_sampled_path)
print(f"Shape: {control_sampled.shape}")

# Mid-cell expressions are not rendered, so display/print them explicitly
display(control_sampled.head())
print(control_sampled["control_pixel_sampled"].value_counts(dropna=False))

print("grid_5km columns:", list(grid_5km.columns))
print("sampled columns:", list(control_sampled.columns))

# Flag each grid cell with its sample status.
# Keep only control areas: control_pixel_sampled == 1 (sampled) or == 0 (replacement)
control_ids = control_sampled[control_sampled["control_pixel_sampled"].isin([0, 1])]
id_to_status = control_ids.set_index("grid_id")["control_pixel_sampled"].map({1: "sampled", 0: "replacement"})

# Grid cells with no matching grid_id in the CSV get NaN
grid_5km["sample_status"] = grid_5km["id"].map(id_to_status)

# Verify counts match source data
assert (grid_5km["sample_status"] == "sampled").sum() == (control_sampled["control_pixel_sampled"] == 1).sum(), "Sampled count mismatch"
assert (grid_5km["sample_status"] == "replacement").sum() == (control_sampled["control_pixel_sampled"] == 0).sum(), "Replacement count mismatch"

print("Counts match")
print(grid_5km["sample_status"].value_counts(dropna=False))



grid_5km
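
If one of the count assertions above fails, it helps to know which IDs did not match rather than only that the totals differ. A common cause is a dtype mismatch between the two ID columns, for example integers in one and strings in the other. The helper below, `find_unmatched_ids`, is a hypothetical diagnostic and not part of the project code. The example values are illustrative only.

```python
def find_unmatched_ids(grid_ids, sampled_ids):
    """Return sampled IDs that have no counterpart among the grid IDs.

    IDs are compared as stripped strings so that 12 and "12" count as a match.
    """
    grid_keys = {str(i).strip() for i in grid_ids}
    sampled_keys = {str(i).strip() for i in sampled_ids}
    return sorted(sampled_keys - grid_keys)


# Illustrative values only, not taken from the project data
example_grid_ids = [101, 102, 103, 104]
example_sampled_ids = ["101", "104", "999"]
print(find_unmatched_ids(example_grid_ids, example_sampled_ids))
```

In this notebook it could be called as `find_unmatched_ids(grid_5km["id"], control_ids["grid_id"])`. Float-typed IDs such as `101.0` would still fail to match `"101"` under this string comparison, so cast such columns to integers first.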

In [None]:
# Check for duplicate grid IDs (every count should be 1)
print(grid_5km["id"].value_counts(dropna=False))


In [None]:
import contextily as cx
from matplotlib.patches import Patch

# Reproject to Web Mercator (EPSG:3857) to match the basemap tiles
grid_5km_wm = grid_5km.to_crs(epsg=3857)

fig, ax = plt.subplots(1, 1, figsize=(12, 10))

# All grid cells as background
grid_5km_wm.plot(ax=ax, facecolor="none", edgecolor="#CCCCCC", linewidth=0.5)

# Sampled controls
grid_5km_wm[grid_5km_wm["sample_status"] == "sampled"].plot(
    ax=ax, facecolor="blue", edgecolor="blue", linewidth=0.8, alpha=0.4
)

# Replacements
grid_5km_wm[grid_5km_wm["sample_status"] == "replacement"].plot(
    ax=ax, facecolor="orange", edgecolor="orange", linewidth=0.8, alpha=0.4
)

cx.add_basemap(ax, source=cx.providers.OpenStreetMap.Mapnik)

# GeoPandas polygon plots do not create legend entries, so build them by hand
ax.legend(handles=[
    Patch(facecolor="blue", edgecolor="blue", alpha=0.4, label="Sampled"),
    Patch(facecolor="orange", edgecolor="orange", alpha=0.4, label="Replacement"),
])

ax.set_title("5 km Grid: Sampled vs Replacement Controls")
ax.set_axis_off()
plt.tight_layout()
plt.show()


## Load Enumeration Area Boundaries

The enumeration areas file should be at:
```
<data_dir>/01_input_data/boundaries/enumeration_areas.geojson
```
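
Before calling the loader, it can be useful to confirm that the file really sits at the expected location, since a Google Drive that is not mounted or synced tends to produce unhelpful read errors. The helper below, `resolve_boundary_file`, is a hypothetical sketch and not part of `src`. It only assumes the directory layout shown above, and the demonstration runs against a temporary directory rather than the real data.

```python
import tempfile
from pathlib import Path


def resolve_boundary_file(data_dir, filename):
    """Build <data_dir>/01_input_data/boundaries/<filename> and confirm it exists."""
    path = Path(data_dir) / "01_input_data" / "boundaries" / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected boundary file not found: {path}")
    return path


# Demonstration against a throwaway directory that mimics the layout
with tempfile.TemporaryDirectory() as tmp:
    boundaries = Path(tmp) / "01_input_data" / "boundaries"
    boundaries.mkdir(parents=True)
    (boundaries / "enumeration_areas.geojson").write_text("{}")

    found = resolve_boundary_file(tmp, "enumeration_areas.geojson")
    print(found.name)

    try:
        resolve_boundary_file(tmp, "missing.geojson")
    except FileNotFoundError as err:
        missing_error = str(err)
        print("Missing file reported")
```

With the real data, the call would be `resolve_boundary_file(data_dir, "enumeration_areas.geojson")`.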

In [None]:
from src.data_processing.load_boundaries import (
    load_enumeration_areas,
    load_control_grid,
    load_layer,
    validate_crs,
)

# Load enumeration areas
ea = load_enumeration_areas(data_dir)
ea = validate_crs(ea)

print(f"\nShape: {ea.shape}")
print(f"CRS:   {ea.crs}")
print(f"\nColumns: {list(ea.columns)}")
ea.head()

## Load Control Grid

The control grid is the set of cells that each enumerator will use for navigation.

In [None]:
grid = load_control_grid(data_dir)
grid = validate_crs(grid)

print(f"\nShape: {grid.shape}")
print(f"CRS:   {grid.crs}")
print(f"\nColumns: {list(grid.columns)}")
grid.head()

## Quick Visualization

Plot enumeration areas and control grid cells together to verify the data looks correct.

In [None]:
from matplotlib.patches import Patch

fig, ax = plt.subplots(1, 1, figsize=(12, 10))

# Plot enumeration areas in light blue
ea.plot(ax=ax, facecolor="lightblue", edgecolor="steelblue", linewidth=0.5, alpha=0.5)

# Overlay control grid in red outlines
grid.plot(ax=ax, facecolor="none", edgecolor="red", linewidth=0.8)

# GeoPandas polygon plots do not create legend entries, so build them by hand
ax.legend(handles=[
    Patch(facecolor="lightblue", edgecolor="steelblue", alpha=0.5, label="Enumeration Areas"),
    Patch(facecolor="none", edgecolor="red", label="Control Grid"),
])

ax.set_title("Enumeration Areas & Control Grid")
plt.tight_layout()
plt.show()

## Load Optional Layers

Roads and buildings provide context on the final maps. These are optional — if the files don't exist, `load_layer` returns `None`.
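
The implementation of `load_layer` is not shown in this notebook. As a rough illustration of the pattern it describes (return `None` for a missing file instead of raising), here is a sketch with a hypothetical helper `load_optional`. A plain-text reader stands in for `gpd.read_file` so the example runs without any geospatial data.

```python
import tempfile
from pathlib import Path


def load_optional(path, reader):
    """Return reader(path) if the file exists, otherwise None."""
    path = Path(path)
    if not path.is_file():
        return None
    return reader(path)


def read_lines(path):
    """Stand-in reader: return the file's lines as a list of strings."""
    return path.read_text().splitlines()


# Demonstration: one layer file is present, the other is missing
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "roads.txt"
    present.write_text("road-1\nroad-2\n")

    roads_example = load_optional(present, read_lines)
    buildings_example = load_optional(Path(tmp) / "buildings.txt", read_lines)

print(roads_example)
print(buildings_example)
```

Callers then need an explicit `is not None` check before using the result, as the cell below does for `roads` and `buildings`.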

In [None]:
roads = load_layer(data_dir, "roads")
buildings = load_layer(data_dir, "buildings")

if roads is not None:
    print(f"Roads:     {len(roads)} features")
    display(roads.head())

if buildings is not None:
    print(f"Buildings: {len(buildings)} features")
    display(buildings.head())

## Next Steps

- Open **02_test_map_generation.ipynb** to generate a test map for a single grid cell
- Open **03_batch_processing.ipynb** to generate maps for all grid cells
- Run `python scripts/test_setup.py` to verify the full environment