# 🛠️ Extracting Google Buildings V3 Data for Your Region of Interest

## 📋 Overview

In this notebook, you will:

- Load the `.geojson.gz` building tiles downloaded in notebook 1
- Load your region of interest (ROI) polygons (e.g., disjoint grids, urban clusters)
- Process each tile in chunks (to avoid running out of memory)
- Keep only the buildings that intersect the ROI
- Save only the filtered buildings to a `.gpkg` file

## 📦 Required Libraries

In [None]:
import os
import gzip
import pandas as pd
import geopandas as gpd
from shapely import wkt
from shapely.geometry import MultiPolygon, Polygon
from tqdm import tqdm

## 🔹 Step 1: Define Paths for Input, ROI, and Output

In [None]:
# Input folder where the tiles were downloaded in notebook 1
input_dir = r"/home/saiganeshv/Desktop/Google V3 Download/Outputs/sample_ROI_tiles"

# Path to the ROI polygons (e.g., country-specific urban clusters, grid cells, slum polygons)
segments_path = r"/home/saiganeshv/Desktop/Google V3 Download/sample_ROI.gpkg"

# Output path to store only filtered buildings
output_path = r"/home/saiganeshv/Desktop/Google V3 Download/Outputs/sample_ROI_filtered_buildings.gpkg"


## 🔹 Step 2: Load the Region of Interest (ROI)

In [None]:
# Load ROI polygons and ensure CRS matches building data
roi = gpd.read_file(segments_path).to_crs("EPSG:4326")

# Create a single unioned geometry so each building needs only one intersection test
# (on GeoPandas 1.0+, roi.union_all() is the non-deprecated equivalent)
roi_union = roi.unary_union
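
As a rough illustration of what the union buys you, the sketch below merges two made-up squares standing in for a disjoint ROI and tests two stand-in buildings against the single merged geometry. The coordinates and variable names are hypothetical, and only `shapely` is assumed.

```python
from shapely.geometry import Point, box
from shapely.ops import unary_union

# Two hypothetical, disjoint ROI squares (coordinates are made up)
roi_parts = [box(0, 0, 1, 1), box(5, 5, 6, 6)]

# Merging them yields one MultiPolygon, so each building needs only one test
demo_union = unary_union(roi_parts)

inside = Point(0.5, 0.5).buffer(0.1)  # stand-in for a building in the first square
outside = Point(3, 3).buffer(0.1)     # stand-in for a building between the squares

hits = [demo_union.intersects(g) for g in (inside, outside)]
print(demo_union.geom_type, hits)
```

Without the union, every building would have to be compared against each ROI polygon separately.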

## 🔹 Step 3: Function to Convert All Geometries to MultiPolygon

In [None]:
# Convert single Polygon to MultiPolygon to avoid GeoPackage type warnings
def ensure_multipolygon(geom):
    if isinstance(geom, Polygon):
        return MultiPolygon([geom])
    return geom
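
A quick check of the helper on made-up geometries: a `Polygon` is wrapped, while a geometry that is already a `MultiPolygon` passes through unchanged. The function is repeated so the snippet runs on its own, and the sample shapes are hypothetical.

```python
from shapely.geometry import MultiPolygon, Polygon, box


def ensure_multipolygon(geom):
    # Same logic as the helper above, repeated so this snippet is standalone
    if isinstance(geom, Polygon):
        return MultiPolygon([geom])
    return geom


single = box(0, 0, 1, 1)                                  # a plain Polygon
multi = MultiPolygon([box(0, 0, 1, 1), box(2, 2, 3, 3)])  # already a MultiPolygon

wrapped = ensure_multipolygon(single)
untouched = ensure_multipolygon(multi)

print(wrapped.geom_type, len(wrapped.geoms))  # the wrapped polygon keeps its single part
print(untouched is multi)                     # existing MultiPolygons are returned as-is
```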


## 🔹 Step 4: Process Each Tile in Chunks and Filter by ROI

**⚠️ Important Note:** Google V3 `.geojson.gz` tiles are actually **CSV files**, **not valid GeoJSON**.

Although the downloaded files have a `.geojson.gz` extension, they are **not GeoJSON FeatureCollections**. Instead, each file is a **CSV** (comma-separated text) that:

- Contains one row per building
- Stores the building geometry as a **WKT (Well-Known Text)** string in the `geometry` column
- Includes other fields like `confidence`, `area_in_meters`, and **location info**

Therefore, we **do not use** `gpd.read_file()` to load them.

Instead, we:

- Open the file using `gzip`
- Load it using `pandas.read_csv(...)`
- Convert the WKT strings to shapely geometries using `wkt.loads(...)`
- Convert the DataFrame into a proper `GeoDataFrame`

This step is necessary to correctly parse and spatially filter the data.
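
The loading steps above, up to WKT parsing, can be tried on a tiny in-memory sample before touching real tiles. This is only a sketch: the column layout and all values in `sample_csv` are assumptions made for illustration, so check the header of your own tiles. The notebook itself relies only on the `geometry` column.

```python
import gzip
import io

import pandas as pd
from shapely import wkt

# Hypothetical two-row sample mimicking a tile (column layout and values are assumed).
# WKT contains commas, so the geometry field must be quoted in the CSV.
sample_csv = (
    "latitude,longitude,area_in_meters,confidence,geometry\n"
    '0.5,0.5,100.0,0.9,"POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))"\n'
    '5.5,5.5,50.0,0.8,"POLYGON((5 5, 6 5, 6 6, 5 6, 5 5))"\n'
)
compressed = gzip.compress(sample_csv.encode("utf-8"))

# Open the gzip stream in text mode and parse it as CSV, not as GeoJSON
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    df = pd.read_csv(f)

# Turn the WKT strings into shapely geometries
df["geometry"] = df["geometry"].apply(wkt.loads)
geom_types = df["geometry"].apply(lambda g: g.geom_type).tolist()
print(len(df), geom_types)
```

From here, wrapping `df` in a `GeoDataFrame` with `crs="EPSG:4326"` gives an object that supports spatial filtering.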

In [None]:
# chunk_size: number of rows read per chunk. Larger values speed up processing
# but need more RAM, so tune this to your machine.
def process_tile(fname, chunk_size=7000):
    if not fname.endswith(".geojson.gz"):
        return None

    tile_id = fname.replace(".geojson.gz", "")
    in_path = os.path.join(input_dir, fname)

    try:
        n_saved = 0  # number of buildings written from this tile
        with gzip.open(in_path, "rt", encoding="utf-8") as f:
            reader = pd.read_csv(f, chunksize=chunk_size)

            for chunk in reader:
                if "geometry" not in chunk.columns:
                    return f"❌ No geometry column in {tile_id}"

                # Convert WKT to shapely geometry and enforce MultiPolygon
                chunk["geometry"] = chunk["geometry"].apply(wkt.loads).apply(ensure_multipolygon)
                gdf = gpd.GeoDataFrame(chunk, geometry="geometry", crs="EPSG:4326")

                # Spatial filter: keep only buildings intersecting ROI
                filtered = gdf[gdf.intersects(roi_union)]

                if not filtered.empty:
                    # Save filtered buildings to GPKG (append mode, so chunks accumulate)
                    filtered.to_file(output_path, driver="GPKG", mode="a", index=False)
                    n_saved += len(filtered)

        return f"✅ {tile_id}: saved {n_saved} buildings intersecting the ROI"

    except Exception as e:
        return f"❌ Failed: {tile_id} — {type(e).__name__}: {e}"
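
Because the filtered buildings are written in append mode, rerunning the notebook against an existing output file would add the same buildings a second time. One possible guard is to delete the old file before the loop in Step 5. The sketch below uses a hypothetical helper name, `reset_output`, and a temporary file standing in for the real `output_path`.

```python
import os
import tempfile


def reset_output(path):
    """Delete a previous output file, if any, and report whether one was removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


# Demonstration on a throwaway file rather than the real output_path
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_filtered_buildings.gpkg")
with open(demo_path, "w") as f:
    f.write("stale data")

removed_first = reset_output(demo_path)   # the file existed, so it is removed
removed_second = reset_output(demo_path)  # nothing is left to remove
print(removed_first, removed_second)
```

In the notebook itself this would be a single call, `reset_output(output_path)`, placed before `process_tile` is first run.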


## 🔹 Step 5: Loop Through All Tiles

In [None]:
file_list = os.listdir(input_dir)
results = [process_tile(f) for f in tqdm(file_list)]

## 📋 Summary of Results

In [None]:
# Print outcome for each tile
for r in results:
    if r:
        print(r)
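
With many tiles, a count of outcomes is easier to scan than the full list. A minimal sketch, assuming the status strings keep the ✅/❌ prefixes returned by `process_tile`; the `demo_results` list and the `classify` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical results; None marks files skipped for not ending in .geojson.gz
demo_results = [
    "✅ Saved filtered buildings from: tile_a",
    "❌ No geometry column in tile_b",
    None,
    "✅ Saved filtered buildings from: tile_c",
]


def classify(result):
    # Map each status string (or None) to a coarse outcome label
    if result is None:
        return "skipped"
    return "ok" if result.startswith("✅") else "failed"


summary = Counter(classify(r) for r in demo_results)
print(dict(summary))
```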


## ✅ Notebook Summary

By the end of this notebook, you will have:

- A GeoPackage file (.gpkg) with only the buildings that intersect your area of interest
- No buildings stored from outside your target zones
- A chunked pipeline that keeps memory use bounded, even for tiles covering large cities or countries

**To get a Telegram notification when a download or a notebook run finishes, see this blog post:** https://sola.kau.se/deprimap/2025/07/16/telegram-bot/