# GeoAI-VLM v0.2.0 — New Features Demo

This notebook demonstrates the **new features** added in v0.2.0:

1. **Multimodal Embedding** — Generate semantically rich vectors from text and images using Qwen3-VL-Embedding  
2. **Vector Search** — Build and query searchable indexes with ChromaDB or FAISS  
3. **Semantic Clustering** — K-Means clustering of VLM descriptions with TF-IDF keywords  
4. **Spatial Autocorrelation** — Global and Local Moran's I analysis  
5. **Visualization** — Elbow curves, cluster maps, LISA maps, category distributions  
6. **Data Preparation** — Parse, merge, and construct embedding text from pipeline output  

> **Prerequisites**: Install GeoAI-VLM v0.2.0 (`pip install -e .` from the repo root).  
> Sections 0–5 below work **offline** with synthetic data; no GPU or API key is needed.  
> Section 6 runs the full pipeline, which requires a Mapillary API key and a GPU.

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Create a reusable output directory
OUTPUT_DIR = Path("./demo_v020_output")
OUTPUT_DIR.mkdir(exist_ok=True)

print("GeoAI-VLM v0.2.0 demo ready!")

---
## 0. Synthetic Dataset

We create a small synthetic GeoDataFrame that mimics real GeoAI-VLM pipeline output.
This lets us exercise every new module without needing a GPU or API key.

In [None]:
rng = np.random.RandomState(42)
N = 60

land_uses = ["residential", "commercial", "mixed", "institutional", "green_space"]
street_types = ["arterial", "collector", "local", "pedestrian", "alley"]
characters = ["busy", "quiet", "touristic", "residential", "industrial"]

narratives = [
    "A bustling commercial street lined with shops and cafés under colorful awnings.",
    "A quiet residential alley with potted plants and laundry hanging between buildings.",
    "A wide arterial road with modern office buildings and a tram line in the median.",
    "A narrow pedestrian lane with cobblestones, leading to a historic mosque.",
    "A tree-lined boulevard next to a public park with benches and a fountain.",
    "An industrial zone with warehouses, trucks, and a railway crossing.",
    "A touristic waterfront promenade overlooking the Bosphorus with ferries docked.",
    "A mixed-use neighborhood with ground-floor shops and apartments above.",
    "A collector road connecting suburban housing blocks to the main highway.",
    "A green institutional campus with university buildings surrounded by gardens.",
]

tag_pool = ["urban", "green", "historic", "modern", "waterfront", "dense",
            "transit", "cultural", "quiet", "noisy", "pedestrian", "car-oriented"]

data = {
    "image_id": [f"img_{i:04d}" for i in range(N)],
    "scene_narrative": [narratives[i % len(narratives)] for i in range(N)],
    "semantic_tags": [
        ", ".join(rng.choice(tag_pool, size=4, replace=False).tolist())
        for _ in range(N)
    ],
    "land_use_primary": rng.choice(land_uses, N).tolist(),
    "street_type": rng.choice(street_types, N).tolist(),
    "place_character": rng.choice(characters, N).tolist(),
    "lat": (41.005 + rng.rand(N) * 0.01).tolist(),
    "lon": (28.975 + rng.rand(N) * 0.01).tolist(),
}

df = pd.DataFrame(data)
geometry = [Point(lon, lat) for lon, lat in zip(df["lon"], df["lat"])]
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")

print(f"Synthetic dataset: {len(gdf)} rows, {len(gdf.columns)} columns")
gdf.head()

---
## 1. Data Preparation

The `preparation` module helps parse VLM JSON output, merge data sources, and
build composite embedding text.
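
As a rough picture of what parsing and text construction involve, the sketch below uses only the standard library. `parse_field` and `join_fields` are hypothetical helpers written for illustration; they are not the GeoAI-VLM implementations, which operate on DataFrames and may handle edge cases differently.

```python
import json


def parse_field(raw_json, field, default=""):
    """Return one field from a JSON-encoded description, or default on bad input."""
    try:
        return json.loads(raw_json).get(field, default)
    except (json.JSONDecodeError, TypeError, AttributeError):
        # Malformed JSON, a non-string input, or a JSON value that is not an object
        return default


def join_fields(record, columns, separator=" | "):
    """Join the non-empty values of the selected columns into one string."""
    parts = [str(record[c]).strip() for c in columns if record.get(c)]
    return separator.join(parts)


raw = json.dumps({"scene_narrative": "A quiet alley.", "semantic_tags": "quiet, historic"})
record = {
    "scene_narrative": parse_field(raw, "scene_narrative"),
    "semantic_tags": parse_field(raw, "semantic_tags"),
    "place_character": "",  # empty values are skipped when joining
}
print(join_fields(record, ["scene_narrative", "semantic_tags", "place_character"]))
```

Skipping empty fields keeps stray separators out of the text that is later embedded.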

In [None]:
import json
from geoai_vlm.preparation import (
    parse_vlm_descriptions,
    merge_data_sources,
    extract_image_id,
    build_embedding_text,
)

# --- parse_vlm_descriptions ---
# Simulate a DataFrame with a JSON-encoded description column
raw_descriptions = gdf[["image_id"]].copy()
raw_descriptions["vlm_description"] = gdf.apply(
    lambda r: json.dumps({
        "scene_narrative": r["scene_narrative"],
        "semantic_tags": r["semantic_tags"],
        "land_use_primary": r["land_use_primary"],
    }),
    axis=1,
)

parsed = parse_vlm_descriptions(
    raw_descriptions,
    description_column="vlm_description",
    target_field="scene_narrative",
)
print("Parsed descriptions:")
parsed[["image_id", "parsed_description"]].head(3)

In [None]:
# --- extract_image_id ---
# Pull numeric IDs from file-path strings
paths = pd.Series([
    "/data/mapillary/1234567890.jpg",
    "/data/mapillary/9876543210.jpg",
    "/data/mapillary/1111111111.jpg",
])
ids = extract_image_id(paths)
print("Extracted IDs:")
print(ids.tolist())

In [None]:
# --- merge_data_sources ---
metadata = gdf[["image_id", "lat", "lon"]].copy()
descriptions = gdf[["image_id", "scene_narrative", "semantic_tags"]].copy()
classifiers = gdf[["image_id", "land_use_primary", "street_type"]].copy()

unified = merge_data_sources(
    metadata,
    descriptions,
    classifiers=classifiers,
    merge_on="image_id",
)
print(f"Merged columns: {list(unified.columns)}")
unified.head(3)

In [None]:
# --- build_embedding_text ---
# Concatenate multiple columns into a single text for embedding
gdf["embedding_text"] = build_embedding_text(
    gdf,
    columns=["scene_narrative", "semantic_tags", "place_character"],
    separator=" | ",
)
print("Embedding text sample:")
print(gdf["embedding_text"].iloc[0])

---
## 2. Semantic Clustering

The `SemanticClusterer` generates embeddings, runs K-Means, extracts TF-IDF
keywords per cluster, and profiles GeoAI categories.

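> Below we use a lightweight **mock embedder** to avoid downloading a model.
> Its vectors are random, so the resulting clusters exercise the API but carry no semantic meaning.
> In production, replace it with `ImageEmbedder()` (see Section 6).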
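
To make the "TF-IDF keywords per cluster" step concrete, here is a minimal standard-library sketch. It treats each cluster as one document and uses a smoothed inverse document frequency; both choices are assumptions made for illustration, and `cluster_keywords` is a hypothetical helper, not necessarily how `SemanticClusterer.extract_keywords` computes its result.

```python
import math
import re
from collections import Counter


def cluster_keywords(texts, labels, top_n=3):
    """Rank terms per cluster by TF-IDF, treating each cluster as one document."""
    docs = {}
    for text, label in zip(texts, labels):
        docs.setdefault(label, Counter()).update(re.findall(r"[a-z]+", text.lower()))

    n_docs = len(docs)
    # Number of clusters in which each term appears at least once
    doc_freq = Counter(term for counts in docs.values() for term in counts)

    keywords = {}
    for label, counts in docs.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        keywords[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return keywords


texts = ["busy shops and cafes", "busy shops and trams", "quiet park and benches"]
labels = [0, 0, 1]
print(cluster_keywords(texts, labels, top_n=2))
```

Terms that occur in every cluster (here "and") score zero, which is why TF-IDF tends to surface words that distinguish one cluster from the others.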

In [None]:
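import zlib

from geoai_vlm.clustering import SemanticClusterer, ClusterConfig


# --- Lightweight mock embedder (no GPU needed) ---
class DemoEmbedder:
    """Returns deterministic pseudo-random unit vectors, seeded per text.

    Identical texts map to identical vectors; there is no semantic meaning.
    """
    instruction = "demo"

    def embed_texts(self, texts, **kw):
        emb = np.empty((len(texts), 128), dtype=np.float32)
        for i, text in enumerate(texts):
            # Stable hash, so a text gets the same vector regardless of batch size
            seed = zlib.crc32(str(text).encode("utf-8"))
            emb[i] = np.random.RandomState(seed).randn(128)
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        return emb / np.where(norms == 0, 1, norms)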


embedder = DemoEmbedder()

config = ClusterConfig(
    n_clusters=5,
    embedding_columns=["scene_narrative", "semantic_tags", "place_character"],
    n_keywords=8,
)
clusterer = SemanticClusterer(embedder=embedder, config=config)

print(f"Config: {config}")

In [None]:
# --- Find optimal k with the elbow method ---
k_values, inertias = clusterer.find_optimal_k(gdf, k_range=(2, 10))

print("k  | Inertia")
for k, inertia in zip(k_values, inertias):
    print(f"{k:>2} | {inertia:,.1f}")

In [None]:
# --- Run clustering ---
gdf_clustered = clusterer.cluster(gdf, n_clusters=5)

print("\nCluster distribution:")
print(gdf_clustered["cluster"].value_counts().sort_index())

In [None]:
# --- Extract keywords per cluster ---
gdf_clustered["embedding_text"] = clusterer.build_embedding_text(gdf_clustered)
keywords = clusterer.extract_keywords(gdf_clustered)

print("Top keywords per cluster:")
for cid, kw_list in keywords.items():
    print(f"  Cluster {cid}: {', '.join(kw_list[:5])}")

In [None]:
# --- Category analysis ---
profiles = clusterer.analyze_categories(
    gdf_clustered,
    category_columns=["land_use_primary", "street_type", "place_character"],
)

for cid, profile in profiles.items():
    print(f"Cluster {cid} (n={profile['size']}):")
    for key, val in profile.items():
        if key != "size":
            print(f"    {key}: {val}")

---
## 3. Visualization

Built-in plotting functions for quick exploration.
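
The elbow plot below receives `optimal_k=5` by hand. K-Means inertia is the sum of squared distances from each point to its nearest centroid, and it always decreases as k grows, so the "elbow" is the k where the decrease slows most sharply. One rough heuristic for suggesting that k is the largest second difference of the inertia curve. The sketch assumes evenly spaced k values; `suggest_elbow` is a hypothetical helper, not a GeoAI-VLM function.

```python
def suggest_elbow(k_values, inertias):
    """Return the k where the inertia curve bends most (largest second difference)."""
    if len(k_values) != len(inertias) or len(k_values) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")

    best_k, best_bend = None, float("-inf")
    # Interior points only: the second difference needs a neighbour on each side
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


demo_k = [2, 3, 4, 5, 6]
demo_inertias = [100.0, 60.0, 30.0, 27.0, 25.0]
print(suggest_elbow(demo_k, demo_inertias))
```

With random mock embeddings the curve is usually smooth and the suggestion is not meaningful; the heuristic is more useful on real embeddings, and even then it is worth checking against the plot.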

In [None]:
import matplotlib
matplotlib.use("Agg")  # use 'TkAgg' or remove this line for interactive display

from geoai_vlm.visualization import (
    plot_elbow_curve,
    plot_cluster_map,
    plot_category_distribution,
    generate_report,
)

In [None]:
# --- Elbow curve ---
fig_elbow = plot_elbow_curve(inertias, k_range=k_values, optimal_k=5)
fig_elbow.savefig(OUTPUT_DIR / "elbow_curve.png", dpi=150, bbox_inches="tight")
print("Saved elbow_curve.png")
fig_elbow

In [None]:
# --- Cluster map ---
fig_map = plot_cluster_map(gdf_clustered)
fig_map.savefig(OUTPUT_DIR / "cluster_map.png", dpi=150, bbox_inches="tight")
print("Saved cluster_map.png")
fig_map

In [None]:
# --- Category distribution ---
fig_cats = plot_category_distribution(
    gdf_clustered,
    category_columns=["land_use_primary", "street_type"],
)
fig_cats.savefig(OUTPUT_DIR / "category_distribution.png", dpi=150, bbox_inches="tight")
print("Saved category_distribution.png")
fig_cats

In [None]:
# --- Markdown report ---
report = generate_report(gdf_clustered, keywords)
print(report[:600])
print("...")

---
## 4. Spatial Autocorrelation

Global and Local Moran's I reveal whether clusters are spatially concentrated
or randomly distributed.
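
For reference, global Moran's I is I = (n / S0) · Σᵢⱼ wᵢⱼ zᵢ zⱼ / Σᵢ zᵢ², where zᵢ is the deviation of observation i from the mean, wᵢⱼ are spatial weights, and S0 is their sum. Under the null hypothesis of no spatial autocorrelation its expected value is −1/(n − 1). The sketch below computes it with row-standardised k-nearest-neighbour weights in plain Python. The helper names are hypothetical, and using a 0/1 cluster-membership indicator as the variable is only one plausible reading of the per-cluster results printed below; `SpatialAnalyzer` may differ in its weights and inference.

```python
import math


def knn_weights(points, k):
    """Row-standardised k-nearest-neighbour weights as {i: {j: w_ij}}."""
    weights = {}
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        weights[i] = {j: 1.0 / k for _, j in dists[:k]}
    return weights


def morans_i(values, weights):
    """Global Moran's I for one variable under the given weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    den = sum(zi * zi for zi in z)
    if den == 0:
        raise ValueError("Moran's I is undefined when all values are equal")
    s0 = sum(w for row in weights.values() for w in row.values())
    num = sum(w * z[i] * z[j] for i, row in weights.items() for j, w in row.items())
    return (n / s0) * num / den


# Six points on a line; membership indicator for one cluster
pts = [(float(i), 0.0) for i in range(6)]
w = knn_weights(pts, k=1)
clustered = morans_i([1, 1, 1, 0, 0, 0], w)    # members sit next to each other
alternating = morans_i([1, 0, 1, 0, 1, 0], w)  # members and non-members interleave
print(clustered, alternating)
```

Positive values indicate that similar values sit near each other, and negative values indicate a checkerboard-like pattern.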

In [None]:
from geoai_vlm.spatial import SpatialAnalyzer, MoranResult

sa = SpatialAnalyzer(k_neighbors=8)

# --- Global Moran's I ---
global_results = sa.moran_global(gdf_clustered)

print("Global Moran's I per cluster:")
print(f"{'Cluster':>8}  {'I':>8}  {'E[I]':>8}  {'z':>8}  {'p':>8}")
for cid, mr in global_results.items():
    sig = "*" if mr.p_value < 0.05 else " "
    print(f"{cid:>8}  {mr.I:>8.4f}  {mr.expected_I:>8.4f}  {mr.z_score:>8.3f}  {mr.p_value:>8.4f} {sig}")

In [None]:
# --- Local Moran's I (LISA) ---
gdf_lisa = sa.moran_local(gdf_clustered)

print(f"Significant LISA observations: {gdf_lisa['lisa_significant'].sum()} / {len(gdf_lisa)}")
print("\nLISA cluster distribution:")
print(gdf_lisa["lisa_cluster"].value_counts().sort_index())

In [None]:
# --- LISA map ---
from geoai_vlm.visualization import plot_lisa_map

fig_lisa = plot_lisa_map(gdf_lisa)
fig_lisa.savefig(OUTPUT_DIR / "lisa_map.png", dpi=150, bbox_inches="tight")
print("Saved lisa_map.png")
fig_lisa

---
## 5. Vector Search

Build a searchable vector database from VLM descriptions using ChromaDB or FAISS.
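
Conceptually, a vector store query ranks stored vectors by similarity to a query vector and returns the best matches. The brute-force sketch below shows that idea with cosine similarity in plain Python. `top_k_cosine` is a hypothetical helper for illustration; ChromaDB and FAISS use their own index structures and may rank by a different metric.

```python
import math


def top_k_cosine(query, vectors, ids, k=5):
    """Return the k (id, similarity) pairs most similar to query, best first."""
    def norm(v):
        # Fall back to 1.0 so a zero vector yields similarity 0 instead of an error
        return math.sqrt(sum(x * x for x in v)) or 1.0

    qn = norm(query)
    scored = [
        (vid, sum(q * x for q, x in zip(query, vec)) / (qn * norm(vec)))
        for vid, vec in zip(ids, vectors)
    ]
    return sorted(scored, key=lambda pair: -pair[1])[:k]


demo_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
demo_ids = ["a", "b", "c"]
hits_demo = top_k_cosine([1.0, 0.1], demo_vectors, demo_ids, k=2)
print(hits_demo)
```

Brute force costs one comparison per stored vector per query, which is fine for the 60 rows used here; dedicated indexes matter once collections grow large.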

In [None]:
from geoai_vlm.vectorstore import (
    ChromaVectorStore,
    FAISSVectorStore,
    VectorDB,
)

### 5a. Low-Level — ChromaDB

In [None]:
# Generate synthetic embeddings for all rows
embeddings = embedder.embed_texts(gdf["scene_narrative"].tolist())
ids = gdf["image_id"].tolist()
metadatas = [
    {"land_use": row["land_use_primary"], "street_type": row["street_type"]}
    for _, row in gdf.iterrows()
]

# Create a persistent ChromaDB store
chroma_store = ChromaVectorStore(
    persist_directory=str(OUTPUT_DIR / "chroma_db"),
    collection_name="demo_places",
)
chroma_store.add(embeddings=embeddings, ids=ids, metadatas=metadatas)
print(f"ChromaDB: {chroma_store.count()} vectors stored")

# Query: find 5 most similar to the first image
results = chroma_store.query(query_embedding=embeddings[0], n_results=5)
print(f"\nTop-5 similar to {ids[0]}:")
for i, (rid, dist) in enumerate(zip(results["ids"], results["distances"])):
    meta = results["metadatas"][i] if results["metadatas"] else {}
    print(f"  {i+1}. {rid} (dist={dist:.4f}) — {meta}")

### 5b. Low-Level — FAISS
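
The `dist` values printed for ChromaDB and FAISS are directly comparable only if both stores use the same metric, and this notebook does not state which metric each wrapper uses. For unit-length vectors (the mock embedder normalises its output), squared Euclidean distance and cosine similarity carry the same ranking information, because ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks that identity numerically; the helper names are hypothetical.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two unit vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = unit([3.0, 4.0, 0.0]), unit([1.0, 2.0, 2.0])
# For unit vectors the two quantities below agree up to rounding error
print(squared_l2(a, b), 2 - 2 * cosine(a, b))
```

A smaller squared distance therefore corresponds to a larger cosine similarity, so the two metrics rank neighbours identically on normalised embeddings.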

In [None]:
# In-memory FAISS index
faiss_store = FAISSVectorStore(dimension=embeddings.shape[1])
faiss_store.add(embeddings=embeddings, ids=ids, metadatas=metadatas)
print(f"FAISS: {faiss_store.count()} vectors stored")

# Query
results_f = faiss_store.query(query_embedding=embeddings[0], n_results=5)
print("\nTop-5 similar (FAISS):")
for i, rid in enumerate(results_f["ids"]):
    print(f"  {i+1}. {rid} (dist={results_f['distances'][i]:.4f})")

# Persist to disk and reload
faiss_path = str(OUTPUT_DIR / "faiss_index")
faiss_store.persist(faiss_path)

faiss_reloaded = FAISSVectorStore.load(faiss_path)
print(f"\nReloaded FAISS index: {faiss_reloaded.count()} vectors")

### 5c. High-Level — VectorDB Orchestrator

In [None]:
# The VectorDB class wraps embedding + store in one object.
# Here we use the mock embedder; in production use ImageEmbedder().

vdb = VectorDB(
    embedder=embedder,
    store_backend="faiss",
    dimension=128,
)

# Build index from GeoDataFrame
vdb.build(
    gdf,
    text_column="scene_narrative",
    metadata_columns=["land_use_primary", "street_type"],
    id_column="image_id",
)

print(f"VectorDB built: {vdb.store.count()} vectors")

# Search by text
query = "busy commercial street with shops"
search_results = vdb.search(query_text=query)
print(f"\nSearch results for {query!r}:")
search_results.head()

---
## 6. Full Pipeline (Requires GPU + API Key)

The convenience functions in `geoai_vlm.pipeline` chain everything together.
These cells require:
- A valid **Mapillary API key**
- A **GPU** that can load the Qwen3-VL model weights

The cells below check only for the API key, so skip this section manually if you are running on CPU only.
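
The cells in this section are gated on a `HAS_KEY` flag alone. A combined check for both prerequisites might look like the sketch below. It assumes the models run on PyTorch with CUDA, which this notebook does not state, and `pipeline_ready` is a hypothetical helper that is not part of GeoAI-VLM.

```python
import importlib.util
import os


def pipeline_ready(env_var="MLY_API_KEY"):
    """Return (ok, reasons); ok is True only if a key is set and CUDA is visible."""
    reasons = []
    if not os.environ.get(env_var, "").strip():
        reasons.append(f"{env_var} is not set")

    # Assumption: the embedding model is served through PyTorch on a CUDA device
    if importlib.util.find_spec("torch") is None:
        reasons.append("torch is not installed")
    else:
        import torch

        if not torch.cuda.is_available():
            reasons.append("no CUDA device visible to torch")
    return (not reasons, reasons)


ok, reasons = pipeline_ready()
print("ready" if ok else "not ready: " + "; ".join(reasons))
```

Returning the reasons alongside the flag makes the skip message say which prerequisite is missing.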

In [None]:
import os

MLY_API_KEY = os.environ.get("MLY_API_KEY", "YOUR_MAPILLARY_API_KEY")
# Treat an empty value the same as the unedited placeholder
HAS_KEY = MLY_API_KEY not in ("", "YOUR_MAPILLARY_API_KEY")

if HAS_KEY:
    print("Mapillary API key detected — full pipeline cells will run.")
else:
    print("No API key set. Set MLY_API_KEY to run the cells below.")

In [None]:
# --- embed_place: download, describe, AND embed in one call ---
if HAS_KEY:
    from geoai_vlm import embed_place

    gdf_embedded = embed_place(
        place_name="Sultanahmet, Istanbul",
        mly_api_key=MLY_API_KEY,
        buffer_m=100,
        max_images=20,
        model_name="Qwen/Qwen3-VL-Embedding-2B",
        output_dir=OUTPUT_DIR / "embed_place",
    )
    print(f"Embedded {len(gdf_embedded)} images")
    print(f"Embedding dim: {len(gdf_embedded['embedding'].iloc[0])}")
else:
    print("Skipped — no API key.")

In [None]:
# --- cluster_descriptions: cluster a GeoDataFrame or parquet file ---
if HAS_KEY:
    from geoai_vlm import cluster_descriptions

    gdf_clust = cluster_descriptions(
        gdf_embedded,
        n_clusters=5,
        embedding_columns=["scene_narrative", "semantic_tags", "place_character"],
    )
    print(gdf_clust["cluster"].value_counts().sort_index())
else:
    print("Skipped — no API key.")

In [None]:
# --- analyze_spatial: run Moran's I on the clustered data ---
if HAS_KEY:
    from geoai_vlm import analyze_spatial

    spatial = analyze_spatial(gdf_clust, k_neighbors=8)

    print("Global Moran's I:")
    for cid, mr in spatial["global"].items():
        print(f"  Cluster {cid}: I={mr.I:.4f}, p={mr.p_value:.4f}")
else:
    print("Skipped — no API key.")

In [None]:
# --- build_search_index + search_similar ---
if HAS_KEY:
    from geoai_vlm import build_search_index, search_similar

    vdb = build_search_index(
        gdf_embedded,
        store_backend="faiss",
        text_column="scene_narrative",
        metadata_columns=["land_use_primary", "street_type"],
    )

    hits = search_similar(vdb, query_text="historic mosque with minarets", n_results=5)
    print("Search results:")
    display(hits)
else:
    print("Skipped — no API key.")

---
## Summary

| Module | Key Class / Function | What it Does |
|---|---|---|
| `preparation` | `parse_vlm_descriptions`, `merge_data_sources`, `build_embedding_text` | Parse, merge, and prepare data |
| `embedding` | `ImageEmbedder` | Multimodal embedding (Qwen3-VL-Embedding) |
| `clustering` | `SemanticClusterer`, `ClusterConfig` | K-Means + TF-IDF keywords |
| `spatial` | `SpatialAnalyzer`, `MoranResult` | Moran's I (global + local LISA) |
| `vectorstore` | `VectorDB`, `ChromaVectorStore`, `FAISSVectorStore` | Vector search |
| `visualization` | `plot_cluster_map`, `plot_lisa_map`, `generate_report` | Charts & reports |
| `pipeline` | `embed_place`, `cluster_descriptions`, `analyze_spatial`, `build_search_index`, `search_similar` | One-liner convenience functions |

For the core image download and VLM description pipeline, see **`demo_geoai_vlm.ipynb`**.