# Baltimore Analytics — Spatial Enrichment Pipeline
**Project:** `baltimore-analytics`  
**Input:** `raw_data` dataset  
**Output:** `analytics` dataset  
**Author:** Spencer  

---
### What This Notebook Does
Spatially joins all point tables to `neighborhood_boundaries` polygons using BigQuery GIS (`ST_WITHIN`).  
Adds a standardized `neighborhood` column to each table as the universal join key.

```
raw_data.<table>  +  raw_data.neighborhood_boundaries  →  analytics.<table>
```

### Tables Enriched
| Table | Date Col | Rows (approx) | Geo Coverage |
|---|---|---|---|
| crime_incidents_legacy | crimedatetime | 644,737 | 99.8% |
| crime_incidents_current | crimedatetime | 239,435 | 99.9% |
| bpd_arrests | arrestdatetime | 393,475 | 58.8% |
| vacant_building_notices | datenotice | 11,990 | 100.0% |
| vacant_building_rehabs | dateissued | 12,123 | 100.0% |
| building_permits | issueddate | 430,169 | 100.0% |
| service_requests_311 | createddate | 9,008,438 | 86.8% |

**Note:** Records without coordinates will have `neighborhood = NULL` after the join.  
This is expected: the Geo Coverage column above is the share of rows that carry coordinates.
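
As a rough illustration of what the coverage figures imply, the sketch below turns them into an approximate ceiling on matched rows per table. The row counts and percentages are copied from the table above; the helper name `expected_rows_with_coordinates` is hypothetical, and the calculation assumes Geo Coverage means the share of rows that have coordinates.

```python
# Row counts and geo coverage (%) copied from the "Tables Enriched" table.
TABLE_STATS = {
    "crime_incidents_legacy": (644_737, 99.8),
    "crime_incidents_current": (239_435, 99.9),
    "bpd_arrests": (393_475, 58.8),
    "vacant_building_notices": (11_990, 100.0),
    "vacant_building_rehabs": (12_123, 100.0),
    "building_permits": (430_169, 100.0),
    "service_requests_311": (9_008_438, 86.8),
}


def expected_rows_with_coordinates(rows: int, coverage_pct: float) -> int:
    """Approximate number of rows that can receive a neighborhood at all."""
    return round(rows * coverage_pct / 100)


for name, (rows, pct) in TABLE_STATS.items():
    with_coords = expected_rows_with_coordinates(rows, pct)
    print(f"{name}: ~{with_coords:,} with coordinates, ~{rows - with_coords:,} expected NULL")
```

A match rate well below these ceilings would point to a join problem rather than missing coordinates.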

---

## 0. Configuration

In [2]:
from google.cloud import bigquery
from google.oauth2 import service_account
from datetime import datetime
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s — %(levelname)s — %(message)s')
log = logging.getLogger(__name__)

GCP_PROJECT      = "baltimore-analytics"
RAW_DATASET      = "raw_data"
ANALYTICS_DATASET = "analytics"
GCP_REGION       = "us-east1"
CREDENTIALS_PATH = "service_account.json"

# Point tables to enrich (the same table name is used in raw_data and analytics)
POINT_TABLES = [
    "crime_incidents_legacy",
    "crime_incidents_current",
    "bpd_arrests",
    "vacant_building_notices",
    "vacant_building_rehabs",
    "building_permits",
    "service_requests_311",
]

# Reference tables — copied as-is to analytics (no spatial join needed)
REFERENCE_TABLES = [
    "real_property",
    "neighborhood_boundaries",
]

print(f"✓ Config loaded. {len(POINT_TABLES)} point tables to enrich, {len(REFERENCE_TABLES)} reference tables to copy.")

✓ Config loaded. 7 point tables to enrich, 2 reference tables to copy.


## 1. Initialize Client

In [3]:
credentials = service_account.Credentials.from_service_account_file(
    CREDENTIALS_PATH,
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
bq = bigquery.Client(project=GCP_PROJECT, credentials=credentials)

# Ensure analytics dataset exists
ds = bigquery.Dataset(f"{GCP_PROJECT}.{ANALYTICS_DATASET}")
ds.location = GCP_REGION
bq.create_dataset(ds, exists_ok=True)
log.info(f"✓ Dataset '{GCP_PROJECT}.{ANALYTICS_DATASET}' ready.")

2026-02-26 16:37:55,463 — INFO — ✓ Dataset 'baltimore-analytics.analytics' ready.


## 2. Pre-Flight Checks
Verify all source tables exist and `neighborhood_boundaries` has polygon geometry before running enrichment.

In [4]:
print("Checking source tables...\n")

all_ok = True
for table_name in POINT_TABLES + REFERENCE_TABLES:
    try:
        t = bq.get_table(f"{GCP_PROJECT}.{RAW_DATASET}.{table_name}")
        print(f"✅ {table_name}: {t.num_rows:,} rows")
    except Exception as e:
        print(f"❌ {table_name}: {e}")
        all_ok = False

# Verify neighborhood_boundaries has polygon geometry
print("\nChecking neighborhood_boundaries polygon geometry...")
try:
    result = bq.query("""
        SELECT 
            COUNT(*) as total,
            COUNTIF(geo_polygon_wkt IS NOT NULL) as with_polygon
        FROM `baltimore-analytics.raw_data.neighborhood_boundaries`
    """).to_dataframe()
    row = result.iloc[0]
    if row['with_polygon'] == row['total']:
        print(f"✅ All {int(row['total'])} neighborhoods have polygon geometry.")
    else:
        print(f"⚠ Only {int(row['with_polygon'])} / {int(row['total'])} neighborhoods have polygon geometry.")
        all_ok = False
except Exception as e:
    print(f"❌ Could not verify polygon geometry: {e}")
    all_ok = False

print(f"\n{'✓ All checks passed. Ready to run enrichment.' if all_ok else '✗ Fix issues above before proceeding.'}")

Checking source tables...

✅ crime_incidents_legacy: 644,737 rows
✅ crime_incidents_current: 239,435 rows
✅ bpd_arrests: 393,475 rows
✅ vacant_building_notices: 11,990 rows
✅ vacant_building_rehabs: 12,123 rows
✅ building_permits: 430,169 rows
✅ service_requests_311: 9,008,438 rows
✅ real_property: 238,496 rows
✅ neighborhood_boundaries: 279 rows

Checking neighborhood_boundaries polygon geometry...
✅ All 279 neighborhoods have polygon geometry.

✓ All checks passed. Ready to run enrichment.


## 3. Spatial Enrichment
For each point table: LEFT JOIN to `neighborhood_boundaries` using `ST_WITHIN`,  
then write the result to the `analytics` dataset. Uses `WRITE_TRUNCATE`, so it is safe to re-run.
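
Because a LEFT JOIN keeps every source row, the enriched table should have exactly as many rows as its source. More rows would suggest that some points matched more than one polygon (for example, if boundary polygons overlap). The sketch below shows one way to classify that comparison; `check_fanout` is a hypothetical helper, not part of this notebook.

```python
def check_fanout(source_rows: int, enriched_rows: int) -> str:
    """Classify the row-count relationship between a source table and its enriched copy."""
    if enriched_rows == source_rows:
        return "ok"        # one output row per input row, as a LEFT JOIN should give
    if enriched_rows > source_rows:
        return "fanout"    # some points matched more than one polygon
    return "row_loss"      # unexpected for a LEFT JOIN; investigate the query


# Source count taken from the pre-flight output; the enriched counts are illustrative.
print(check_fanout(644_737, 644_737))
print(check_fanout(644_737, 644_900))
```

In the notebook, the two counts could come from `bq.get_table(...).num_rows` on the raw and analytics tables.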

In [None]:
enrichment_log = []

for table_name in POINT_TABLES:
    log.info(f"\n{'='*60}")
    log.info(f"Enriching: {table_name}")

    result = {"table": table_name, "status": None, "rows_total": None,
              "rows_with_neighborhood": None, "match_rate": None, "error": None}

    try:
        src  = f"`{GCP_PROJECT}.{RAW_DATASET}.{table_name}`"
        nsa  = f"`{GCP_PROJECT}.{RAW_DATASET}.neighborhood_boundaries`"
        dest = f"{GCP_PROJECT}.{ANALYTICS_DATASET}.{table_name}"

        # Dynamically build EXCEPT list based on columns that conflict with the join output
        src_table = bq.get_table(f"{GCP_PROJECT}.{RAW_DATASET}.{table_name}")
        src_cols = [f.name for f in src_table.schema]
        except_cols = [c for c in ["objectid", "neighborhood", "nsa_population", "nsa_objectid", "_ingested_at", "_source_url"] if c in src_cols]
        except_clause = f"EXCEPT ({', '.join(except_cols)})" if except_cols else ""

        sql = f"""
            SELECT
                p.* {except_clause},
                n.name        AS neighborhood,
                n.population  AS nsa_population,
                n.objectid    AS nsa_objectid,
                -- Timestamp is formatted in SQL because pandas is not imported until section 5
                FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S UTC', CURRENT_TIMESTAMP()) AS _ingested_at,
                p._source_url
            FROM {src} p
            LEFT JOIN {nsa} n
            ON ST_WITHIN(p.geo_point_wkt, n.geo_polygon_wkt)
        """

        job_config = bigquery.QueryJobConfig(
            destination=dest,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        )

        job = bq.query(sql, job_config=job_config)
        job.result()  # Wait for completion

        # Get stats
        t = bq.get_table(dest)
        stats = bq.query(f"""
            SELECT
                COUNT(*) as total,
                COUNTIF(neighborhood IS NOT NULL) as with_neighborhood
            FROM `{dest}`
        """).to_dataframe().iloc[0]

        result["rows_total"]             = int(stats["total"])
        result["rows_with_neighborhood"] = int(stats["with_neighborhood"])
        result["match_rate"]             = round(stats["with_neighborhood"] / stats["total"] * 100, 1)
        result["status"]                 = "SUCCESS"

        log.info(f"  ✓ {result['rows_with_neighborhood']:,} / {result['rows_total']:,} rows matched ({result['match_rate']}%)")

    except Exception as e:
        log.error(f"  ✗ Failed: {table_name} — {e}")
        result["status"] = "FAILED"
        result["error"]  = str(e)

    enrichment_log.append(result)

log.info("\n✓ Spatial enrichment complete.")

2026-02-26 15:39:30,784 — INFO — 
2026-02-26 15:39:30,785 — INFO — Enriching: crime_incidents_legacy
2026-02-26 15:39:31,330 — ERROR —   ✗ Failed: crime_incidents_legacy — 400 Column objectid in SELECT * EXCEPT list does not exist at [3:25]; reason: invalidQuery, location: query, message: Column objectid in SELECT * EXCEPT list does not exist at [3:25]

Location: us-east1
Job ID: 4b4a8c05-09d6-4c3b-89f1-92e4f8613dec

2026-02-26 15:39:31,331 — INFO — 
2026-02-26 15:39:31,331 — INFO — Enriching: crime_incidents_current
2026-02-26 15:39:31,761 — ERROR —   ✗ Failed: crime_incidents_current — 400 Column objectid in SELECT * EXCEPT list does not exist at [3:25]; reason: invalidQuery, location: query, message: Column objectid in SELECT * EXCEPT list does not exist at [3:25]

Location: us-east1
Job ID: c00effd1-957a-4a70-a591-1dfbe680519c

2026-02-26 15:39:31,762 — INFO — 
2026-02-26 15:39:31,762 — INFO — Enriching: bpd_arrests
2026-02-26 15:39:32,140 — ERROR —   ✗ Failed: bpd_arrests — 400 Co

In [8]:
for table_name in ["crime_incidents_legacy", "vacant_building_notices"]:
    t = bq.get_table(f"baltimore-analytics.raw_data.{table_name}")
    cols = [f.name for f in t.schema]
    conflicts = [c for c in cols if c in ["objectid", "neighborhood", "_ingested_at", "_source_url"]]
    print(f"{table_name}: {conflicts}")

crime_incidents_legacy: ['neighborhood', '_ingested_at', '_source_url']
vacant_building_notices: ['objectid', 'neighborhood', '_ingested_at', '_source_url']
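
The output above shows why the `EXCEPT` list has to be built per table: `objectid` exists in `vacant_building_notices` but not in `crime_incidents_legacy`, and BigQuery rejects an `EXCEPT` entry that names a missing column. A minimal sketch of that logic as a standalone function follows; the name `build_except_clause` is hypothetical, and the column lists are illustrative subsets rather than full schemas.

```python
# Columns the enrichment query re-creates, copied from the notebook's EXCEPT candidates.
CONFLICT_COLS = [
    "objectid", "neighborhood", "nsa_population",
    "nsa_objectid", "_ingested_at", "_source_url",
]


def build_except_clause(src_cols, conflict_cols=CONFLICT_COLS):
    """Return an EXCEPT (...) clause naming only the conflict columns present in src_cols."""
    present = [c for c in conflict_cols if c in src_cols]
    return f"EXCEPT ({', '.join(present)})" if present else ""


# Illustrative column subsets based on the diagnostic output above.
legacy_cols = ["crimedatetime", "neighborhood", "_ingested_at", "_source_url"]
notices_cols = ["objectid", "datenotice", "neighborhood", "_ingested_at", "_source_url"]

print(build_except_clause(legacy_cols))
print(build_except_clause(notices_cols))
```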


## 4. Copy Reference Tables
Copy `real_property` and `neighborhood_boundaries` to `analytics` as-is — no spatial join needed.

In [5]:
for table_name in REFERENCE_TABLES:
    log.info(f"Copying reference table: {table_name}")
    try:
        sql = f"SELECT * FROM `{GCP_PROJECT}.{RAW_DATASET}.{table_name}`"
        dest = f"{GCP_PROJECT}.{ANALYTICS_DATASET}.{table_name}"

        job_config = bigquery.QueryJobConfig(
            destination=dest,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        )

        bq.query(sql, job_config=job_config).result()
        t = bq.get_table(dest)
        log.info(f"  ✓ {t.num_rows:,} rows → {dest}")
    except Exception as e:
        log.error(f"  ✗ Failed: {table_name} — {e}")

print("\n✓ Reference tables copied.")

2026-02-26 15:36:03,987 — INFO — Copying reference table: real_property
2026-02-26 15:36:16,453 — INFO —   ✓ 238,496 rows → baltimore-analytics.analytics.real_property
2026-02-26 15:36:16,454 — INFO — Copying reference table: neighborhood_boundaries
2026-02-26 15:36:18,332 — INFO —   ✓ 279 rows → baltimore-analytics.analytics.neighborhood_boundaries



✓ Reference tables copied.


## 5. Enrichment Audit Report

In [6]:
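import pandas as pd
from datetime import datetime, timezone

audit_df = pd.DataFrame(enrichment_log)
pd.set_option("display.max_colwidth", 60)

print("\n" + "="*70)
print("SPATIAL ENRICHMENT AUDIT REPORT")
# datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp instead
print(f"Run at: {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')} UTC")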
print("="*70)
display(audit_df[["table", "status", "rows_total", "rows_with_neighborhood", "match_rate", "error"]])

failed = audit_df[audit_df["status"] == "FAILED"]
if len(failed) > 0:
    print(f"\n⚠ {len(failed)} table(s) failed — fix before proceeding to notebook 03.")
else:
    print(f"\n✓ All {len(audit_df)} tables enriched successfully.")


SPATIAL ENRICHMENT AUDIT REPORT
Run at: 2026-02-26 20:36:28 UTC



| | table | status | rows_total | rows_with_neighborhood | match_rate | error |
|---|---|---|---|---|---|---|
| 0 | crime_incidents_legacy | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 1 | crime_incidents_current | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 2 | bpd_arrests | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 3 | vacant_building_notices | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 4 | vacant_building_rehabs | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 5 | building_permits | FAILED | | | | 400 Duplicate column names in the result are not support... |
| 6 | service_requests_311 | FAILED | | | | 400 Duplicate column names in the result are not support... |



⚠ 7 table(s) failed — fix before proceeding to notebook 03.


## 6. Verify analytics Dataset
Confirm all 9 tables landed in `analytics` with expected row counts.

In [None]:
print("analytics dataset contents:\n")
all_tables = POINT_TABLES + REFERENCE_TABLES
for table_name in all_tables:
    try:
        t = bq.get_table(f"{GCP_PROJECT}.{ANALYTICS_DATASET}.{table_name}")
        print(f"✅ {table_name}: {t.num_rows:,} rows")
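    except Exception as e:
        # Include the error so permission or network failures are not mistaken for a missing table
        print(f"❌ {table_name}: missing or unreadable ({e})")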
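
The cell above prints row counts but leaves the comparison to the reader. One way to automate it is sketched below, under the assumption that every table should keep its raw row count in `analytics`. `compare_row_counts` is a hypothetical helper that works on plain dictionaries; in the notebook these could be filled from `bq.get_table(...).num_rows`.

```python
def compare_row_counts(raw_counts: dict, analytics_counts: dict) -> dict:
    """Return {table: problem} for tables that are missing or whose row count differs."""
    problems = {}
    for table, expected in raw_counts.items():
        actual = analytics_counts.get(table)
        if actual is None:
            problems[table] = "missing"
        elif actual != expected:
            problems[table] = f"expected {expected:,} rows, found {actual:,}"
    return problems


# Raw counts taken from the pre-flight output; the analytics side is illustrative.
raw = {"real_property": 238_496, "neighborhood_boundaries": 279, "bpd_arrests": 393_475}
analytics = {"real_property": 238_496, "neighborhood_boundaries": 279}

print(compare_row_counts(raw, analytics))
```

An empty result means every table landed with the expected number of rows.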

In [7]:
bq.query("""
    SELECT * FROM `baltimore-analytics.raw_data.real_property`
""", job_config=bigquery.QueryJobConfig(
    destination="baltimore-analytics.analytics.real_property",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED
)).result()
print("✓ real_property refreshed in analytics")

✓ real_property refreshed in analytics


## 7. Spot Check — Neighborhood Match Quality
Sample neighborhood distribution for a few tables to sanity check the spatial join results.
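
The spot-check queries in this section share one shape and differ only in table name and count alias. A minimal sketch of a builder for that query follows; `top_neighborhoods_sql` is a hypothetical helper, and it interpolates identifiers directly, so it should only be called with trusted, hard-coded names.

```python
def top_neighborhoods_sql(
    table: str,
    count_alias: str,
    limit: int = 10,
    dataset: str = "baltimore-analytics.analytics",
) -> str:
    """Build a top-N neighborhoods query for one enriched table."""
    return f"""
    SELECT neighborhood, COUNT(*) AS {count_alias}
    FROM `{dataset}.{table}`
    WHERE neighborhood IS NOT NULL
    GROUP BY neighborhood
    ORDER BY {count_alias} DESC
    LIMIT {limit}
    """


sql = top_neighborhoods_sql("crime_incidents_legacy", "incidents")
print(sql)
```

The resulting string can be passed to `bq.query(sql).to_dataframe()` in the same way as the hand-written queries.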

In [None]:
# Top 10 neighborhoods by crime (legacy)
print("Top 10 neighborhoods by crime incidents (legacy):")
display(bq.query("""
    SELECT neighborhood, COUNT(*) as incidents
    FROM `baltimore-analytics.analytics.crime_incidents_legacy`
    WHERE neighborhood IS NOT NULL
    GROUP BY neighborhood
    ORDER BY incidents DESC
    LIMIT 10
""").to_dataframe())

# Top 10 neighborhoods by 311 requests
print("\nTop 10 neighborhoods by 311 service requests:")
display(bq.query("""
    SELECT neighborhood, COUNT(*) as requests
    FROM `baltimore-analytics.analytics.service_requests_311`
    WHERE neighborhood IS NOT NULL
    GROUP BY neighborhood
    ORDER BY requests DESC
    LIMIT 10
""").to_dataframe())

# Top 10 neighborhoods by vacant building notices
print("\nTop 10 neighborhoods by vacant building notices:")
display(bq.query("""
    SELECT neighborhood, COUNT(*) as notices
    FROM `baltimore-analytics.analytics.vacant_building_notices`
    WHERE neighborhood IS NOT NULL
    GROUP BY neighborhood
    ORDER BY notices DESC
    LIMIT 10
""").to_dataframe())

---
## Next Steps
Once all tables show `SUCCESS` and spot checks look reasonable:

1. **`03_feature_engineering.ipynb`** — Aggregate all tables to NSA level, build Neighborhood Vitality Index feature matrix
2. **`04_clustering.ipynb`** — K-means segmentation of neighborhoods into cohorts
3. **`05_looker_studio.md`** — Dashboard connection guide

**If match rates are lower than expected:**  
Check whether points near the city boundary fall slightly outside the polygon edges. If they do, `ST_DWITHIN` with a small buffer distance can serve as a fallback match.
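
A fallback along these lines could be sketched as follows. It only builds the SQL string and rests on several assumptions: `geo_point_wkt` and `geo_polygon_wkt` are `GEOGRAPHY` values, the enriched table has a unique row identifier (the `row_id_col` parameter is hypothetical), and the 25 m default buffer is an arbitrary illustration. When several polygons lie within the buffer, the nearest one is kept.

```python
def nearest_neighborhood_sql(
    points_table: str,
    boundaries_table: str,
    row_id_col: str,
    buffer_m: float = 25.0,
) -> str:
    """Build a query that assigns unmatched points to the nearest polygon within buffer_m meters."""
    return f"""
    SELECT p.{row_id_col}, n.name AS neighborhood
    FROM `{points_table}` p
    JOIN `{boundaries_table}` n
      ON ST_DWITHIN(p.geo_point_wkt, n.geo_polygon_wkt, {buffer_m})
    WHERE p.neighborhood IS NULL
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY p.{row_id_col}
        ORDER BY ST_DISTANCE(p.geo_point_wkt, n.geo_polygon_wkt)
    ) = 1
    """


fallback_sql = nearest_neighborhood_sql(
    "baltimore-analytics.analytics.crime_incidents_legacy",
    "baltimore-analytics.raw_data.neighborhood_boundaries",
    row_id_col="row_id",  # hypothetical unique key; substitute the table's real identifier
)
print(fallback_sql)
```

Rows without coordinates still stay unmatched, since a NULL point never satisfies the join condition.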

**Re-running this notebook:**  
All queries use `WRITE_TRUNCATE` — safe to re-run at any time without duplicating data.