# Forecast → Hotspots → Sensor Selection (SciDX) Workflow

This notebook walks through the project's **end-to-end workflow**, using the same scripts that ship in this repo:

1. **(Optional) Download CAMS inputs** → `download_cams.py`  
2. **Run Aurora** to generate regional forecast CSVs → `run_aurora.py` / `aurora_runner.py`  
3. **Detect hotspot events** from forecast CSVs → `event_detection.py`  
4. **Register hotspot CSV** into SciDX (with `updated_at` in `extras`) → `register_hotspots_csv_scidx.py`  
5. **Register sensor sources** (e.g., Synoptic WebSockets) into SciDX → `register_synoptic_websockets.py`  
6. **Orchestrate:** read hotspots + find nearest sensors → `orchestrate_hotspots_to_sensors.py`

> Goal of the notebook: make it easy to see **what runs when**, **what files are produced**, and **how hotspot → nearby sensor selection works**.


## 0) Setup: environment + config

This demo needs:

- `API_URL`, `TOKEN` in your environment (for SciDX / NDP EP)
- optional `SERVER` (defaults to `local`)
- `config.yaml` at the project root with:
  - `region.name` (e.g., `utah`)
  - `hotspot.url` (where the hotspot CSV is hosted)


In [4]:
import os
from pathlib import Path
import yaml
from dotenv import load_dotenv

load_dotenv(override=True)

API_URL = os.environ.get("API_URL", "").strip()
TOKEN  = os.environ.get("TOKEN", "").strip()
SERVER = (os.environ.get("SERVER") or "local").strip()

print("API_URL:", API_URL)
print("SERVER :", SERVER)
print("TOKEN  :", "set" if TOKEN else "MISSING")

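# Load config.yaml from the project root. This assumes the notebook runs from a
# subdirectory, so the project root is the parent of the working directory.
project_root = Path.cwd().parent
cfg_path = project_root / "config.yaml"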

with open(cfg_path, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

region = str(cfg.get("region", {}).get("name", "")).strip()
hotspot_url = str(cfg.get("hotspot", {}).get("url", "")).strip()

print("region     :", region)
print("hotspot.url:", hotspot_url)


API_URL: 10.244.2.206:8003
SERVER : local
TOKEN  : set
region     : utah
hotspot.url: https://drive.google.com/uc?export=download&id=1Hx1RIthTSPyw8qupKudDsiPZaabMvGa7
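
The requirements listed in this section can also be checked up front, so that a missing variable fails early instead of surfacing later as a connection or parsing error. The following is a minimal sketch, not part of the repo: `validate_settings` is a hypothetical helper, and the example URL and token are placeholders.

```python
from urllib.parse import urlparse

def validate_settings(api_url: str, token: str, cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not api_url:
        problems.append("API_URL is not set")
    if not token:
        problems.append("TOKEN is not set")
    # Tolerate missing or empty sections in config.yaml.
    region_name = str((cfg.get("region") or {}).get("name") or "").strip()
    if not region_name:
        problems.append("config.yaml: region.name is missing")
    url = str((cfg.get("hotspot") or {}).get("url") or "").strip()
    if urlparse(url).scheme not in ("http", "https"):
        problems.append("config.yaml: hotspot.url must be an http(s) URL")
    return problems

# Placeholder values for illustration only.
example_cfg = {
    "region": {"name": "utah"},
    "hotspot": {"url": "https://example.org/hotspots.csv"},
}
ok_problems = validate_settings("http://localhost:8003", "secret", example_cfg)
bad_problems = validate_settings("", "", {})
print(ok_problems)
print(bad_problems)
```

In the notebook, the call would be `validate_settings(API_URL, TOKEN, cfg)`, followed by raising an error if the returned list is non-empty.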


## 1) Connect to SciDX (catalog/control plane)

We use:
- `ndp_ep.APIClient` for dataset search/metadata
- `scidx_streaming.StreamingClient` for streaming-related helpers and registration


In [5]:
from ndp_ep import APIClient
from scidx_streaming import StreamingClient

client = APIClient(base_url=API_URL, token=TOKEN)
streaming = StreamingClient(client)

print("Streaming user_id:", streaming.user_id)


Streaming user_id: 987104e7-e6d3-47f2-82a0-0d3f620aea70


## 2) (Optional) Download CAMS inputs

If you already have the CAMS NetCDF files locally, you can skip this section.

Otherwise, run the helper script, which pulls the CAMS data that Aurora needs.


In [None]:
# Run the script as-is (recommended if you're following the repo workflow)
!python download_cams.py

# Note: this line prints even if the script failed; check the output above.
print("download_cams.py finished.")


## 3) Run Aurora → produce regional forecast artifact(s)

This stage turns CAMS inputs into a **regional forecast CSV** for your configured region.


In [11]:
# Run Aurora using the repo script (preferred)
!python run_aurora.py


print("A forecast CSV artifact created.")


Running Aurora for 2025-10-20 → /uufs/chpc.utah.edu/common/home/u1494915/stream-simulation-ebus/data/processed/predictions/2025-10-20_0000-1200_12h_utah.csv
Loading static variables...
Loading CAMS datasets...
Loading Aurora model...
Surface steps: 13 Atmos steps: 5
Processing hour 1/12
Processing hour 2/12
Processing hour 3/12
Processing hour 4/12
Processing hour 5/12
Processing hour 6/12
Processing hour 7/12
Processing hour 8/12
Processing hour 9/12
Processing hour 10/12
Processing hour 11/12
Processing hour 12/12
Processing hour 1/12
Processing hour 2/12
Processing hour 3/12
Processing hour 4/12
Processing hour 5/12
Processing hour 6/12
Processing hour 7/12
Processing hour 8/12
Processing hour 9/12
Processing hour 10/12
Processing hour 11/12
Processing hour 12/12

Saved hourly predictions with all surface variables to /uufs/chpc.utah.edu/common/home/u1494915/stream-simulation-ebus/data/processed/predictions/2025-10-20_0000-1200_12h_utah.csv
A forecast CSV artifact created.


## 4) Event detection on forecast output → generate hotspot CSV

This stage reads the forecast artifact and produces a **hotspot events CSV** with columns like:
- `timestamp`
- `lat_min`, `lat_max`, `lon_min`, `lon_max`
- optionally `pm25_max`, etc.

Run the repo script:


In [26]:
# Preferred: run the repo script
!python event_detection.py

print("After running, the hotspot CSV should exist, you need to upload the csv to your desired location and copy the link.")


[event_detection] Processing /uufs/chpc.utah.edu/common/home/u1494915/stream-simulation-ebus/data/processed/predictions/2025-10-20_0000-1200_12h_utah.csv
[event_detection] Found 65 persistent hotspots → /uufs/chpc.utah.edu/common/home/u1494915/stream-simulation-ebus/data/processed/hotspots/2025-10-20_utah_hotspots.csv
After running, the hotspot CSV should exist, you need to upload the csv to your desired location and copy the link.
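
The log line above reports "persistent hotspots", but `event_detection.py` itself is not shown in this notebook. As an illustration only, one plausible reading is that a grid cell counts as a hotspot when PM2.5 stays at or above a threshold for several consecutive forecast hours. The sketch below implements that reading; the function name, the `threshold` and `min_hours` values, and the row layout are all assumptions, not the repo's actual algorithm.

```python
from collections import defaultdict

def persistent_hotspots(rows, threshold=35.0, min_hours=3):
    """rows: iterable of (hour_index, lat, lon, pm25) tuples.

    Returns {(lat, lon): peak_pm25} for cells whose PM2.5 is >= threshold
    for at least min_hours consecutive hours. peak_pm25 is the highest
    above-threshold value seen for that cell.
    """
    by_cell = defaultdict(dict)
    for hour, lat, lon, pm25 in rows:
        by_cell[(lat, lon)][hour] = pm25

    hotspots = {}
    for cell, series in by_cell.items():
        run, best_run, peak, prev_hour = 0, 0, None, None
        for hour in sorted(series):
            value = series[hour]
            consecutive = prev_hour is not None and hour == prev_hour + 1
            if value >= threshold:
                # Extend the current run only if the previous hour was adjacent
                # and also above the threshold; otherwise start a new run.
                run = run + 1 if (consecutive and run > 0) else 1
                best_run = max(best_run, run)
                peak = value if peak is None else max(peak, value)
            else:
                run = 0
            prev_hour = hour
        if best_run >= min_hours:
            hotspots[cell] = peak
    return hotspots

# Toy data: the first cell stays high for 3 hours, the second only spikes.
rows = [
    (0, 38.8, 248.0, 40.0), (1, 38.8, 248.0, 42.0), (2, 38.8, 248.0, 39.0),
    (0, 39.2, 248.4, 50.0), (1, 39.2, 248.4, 10.0), (2, 39.2, 248.4, 55.0),
]
result = persistent_hotspots(rows)
print(result)
```

Requiring consecutive hours filters out single-hour spikes, which is why the second cell in the toy data is dropped even though its peak is higher.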


## 5) Register hotspot CSV as a SciDX dataset (adds `updated_at` in extras)

This makes the hotspot artifact **discoverable** via SciDX search:
- `extras.dataset_kind = hotspot` (or `hotspots`, depending on your convention)
- `extras.region = utah`
- `extras.updated_at = <UTC ISO timestamp>`  ✅ used for picking the latest run

Run:


In [21]:
# Preferred: run repo script
!python register_hotspots_csv_scidx.py

print("After running, 'hotspots_<region>' should be created/updated in SciDX.")


INFO:scidx_streaming.client.init_client:Extracted user ID: 987104e7-e6d3-47f2-82a0-0d3f620aea70
INFO:scidx_streaming.client.init_client:Kafka details set: HOST=localhost, PORT=9092, PREFIX=data_stream_, MAX_STREAMS=10
INFO:register_hotspots_csv:Streaming client initialized. user_id=987104e7-e6d3-47f2-82a0-0d3f620aea70
INFO:scidx_streaming.client.registration:Dataset 'hotspots_utah' created successfully with 1 resource(s).
INFO:register_hotspots_csv:Registered hotspot CSV dataset 'hotspots_utah' from https://drive.google.com/uc?export=download&id=1hUfbphCjCPkdgS2n9SE1hjugkNxq22EZ
After running, 'hotspots_<region>' should be created/updated in SciDX.
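
The `extras` fields listed above can be assembled with the standard library alone. The sketch below shows one way to build them, including a timezone-aware UTC timestamp in ISO 8601 form. `build_hotspot_extras` is a hypothetical helper; how `register_hotspots_csv_scidx.py` actually builds and submits its metadata is not shown in this notebook.

```python
from datetime import datetime, timezone

def build_hotspot_extras(region: str, dataset_kind: str = "hotspot") -> dict:
    """Build the extras metadata used to find the latest hotspot run."""
    return {
        "dataset_kind": dataset_kind,
        "region": region,
        # Timezone-aware UTC timestamp; isoformat() ends with "+00:00".
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }

extras = build_hotspot_extras("utah")
print(extras)
```

Keeping the UTC offset in the string means the value round-trips through `datetime.fromisoformat` and sorts correctly against timestamps from other runs.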


## 6) Register sensor sources (e.g., Synoptic WebSockets) in SciDX

This step registers sensor *methods* (WebSocket URLs + station metadata).
Then you can discover them using:

`streaming.search_consumption_methods(terms=["sensor", "utah"])`

Run:


In [22]:
# Preferred: run repo script
!python register_synoptic_websockets.py

# Quick check (should return multiple sensors if already registered):
methods = streaming.search_consumption_methods(terms=["sensor", region])
print("Found methods:", len(methods))
if methods:
    print("Example:", methods[0]["name"])


INFO:scidx_streaming.client.init_client:Extracted user ID: 987104e7-e6d3-47f2-82a0-0d3f620aea70
INFO:scidx_streaming.client.init_client:Kafka details set: HOST=localhost, PORT=9092, PREFIX=data_stream_, MAX_STREAMS=10
INFO:register_synoptic_websockets:Streaming Client initialized. User ID: 987104e7-e6d3-47f2-82a0-0d3f620aea70
INFO:scidx_streaming.client.registration:Dataset 'synoptic_push_qcv_aq' created successfully with 1 resource(s).
INFO:register_synoptic_websockets:Registered dataset=synoptic_push_qcv_aq (station=QCV)
INFO:scidx_streaming.client.registration:Dataset 'synoptic_push_qnr_aq' created successfully with 1 resource(s).
INFO:register_synoptic_websockets:Registered dataset=synoptic_push_qnr_aq (station=QNR)
INFO:scidx_streaming.client.registration:Dataset 'synoptic_push_qhw_aq' created successfully with 1 resource(s).
INFO:register_synoptic_websockets:Registered dataset=synoptic_push_qhw_aq (station=QHW)
INFO:scidx_streaming.client.registration:Dataset 'synoptic_push_quttc

## 7) Pick the *latest* hotspot dataset by `extras.updated_at`

When multiple hotspot datasets exist (e.g., repeated registrations), pick the most recent one.

This uses `client.search_datasets(...)` because `updated_at` is stored in dataset **extras**.


In [23]:
from datetime import datetime

def parse_iso(dt: str):
    if not dt:
        return None
    return datetime.fromisoformat(str(dt).replace("Z", "+00:00"))

# Try the plural dataset_kind ("hotspots") first.
hotspot_candidates = client.search_datasets(
    terms=["hotspots", region],
    keys=["extras_dataset_kind", "extras_region"],
    server=SERVER,
)

# If your dataset_kind is singular ("hotspot"), fall back:
if not hotspot_candidates:
    hotspot_candidates = client.search_datasets(
        terms=["hotspot", region],
        keys=["extras_dataset_kind", "extras_region"],
        server=SERVER,
    )

print("Hotspot datasets found:", len(hotspot_candidates))

# Keep only those that have extras.updated_at
hotspot_with_time = [
    d for d in hotspot_candidates
    if isinstance(d.get("extras"), dict) and d["extras"].get("updated_at")
]

hotspot_with_time.sort(
    key=lambda d: parse_iso(d["extras"]["updated_at"]),
    reverse=True,
)

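# Fail with a clear message instead of an IndexError when nothing matched.
if not hotspot_with_time:
    raise RuntimeError("No hotspot dataset with extras.updated_at found; run step 5 first.")
latest_hotspot_ds = hotspot_with_time[0]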
print("Latest hotspot dataset:", latest_hotspot_ds["name"])
print("updated_at:", latest_hotspot_ds["extras"]["updated_at"])
print("resource url:", latest_hotspot_ds["resources"][0]["url"] if latest_hotspot_ds.get("resources") else None)


Hotspot datasets found: 1
Latest hotspot dataset: hotspots_utah
updated_at: 2026-01-12T05:49:36.253413+00:00
resource url: https://drive.google.com/uc?export=download&id=1hUfbphCjCPkdgS2n9SE1hjugkNxq22EZ


## 8) Load hotspot CSV correctly (Google Drive note)

If your hotspot CSV is hosted on Google Drive, the `.../view?usp=sharing` URL is **not** a raw CSV.
Pandas will read HTML and throw parsing errors.

The helper below converts a Drive *view* link into a *direct download* link.
If you get a 404, the usual causes are:
- the file is not shared publicly / accessible to your environment, or
- the ID is wrong, or
- your environment can't reach Google Drive.

In that case: host the CSV on a plain HTTP endpoint (or S3/GitHub raw) for easiest ingestion.


In [24]:
import re
from io import StringIO

import pandas as pd
import requests

def drive_view_to_direct(url: str) -> str:
    """Convert a Drive 'view' link to a direct-download link; pass other URLs through."""
    # From: https://drive.google.com/file/d/<ID>/view?usp=sharing
    m = re.search(r"/file/d/([^/]+)/", url)
    if not m:
        return url
    file_id = m.group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"

def read_csv_safely(url: str) -> pd.DataFrame:
    """Download a CSV over HTTP and fail early if an HTML page comes back instead."""
    try_url = drive_view_to_direct(url)
    r = requests.get(try_url, timeout=30)
    r.raise_for_status()
    # If an HTML page is returned, pandas will likely fail; detect it quickly.
    ctype = r.headers.get("Content-Type", "")
    if "text/html" in ctype.lower():
        raise RuntimeError(f"Got HTML instead of CSV from {try_url}. Check sharing/hosting.")
    return pd.read_csv(StringIO(r.text))

# Use resource URL from SciDX if available; fallback to config hotspot_url
hotspot_resource_url = None
if latest_hotspot_ds.get("resources"):
    hotspot_resource_url = latest_hotspot_ds["resources"][0].get("url")

chosen_url = hotspot_resource_url or hotspot_url
print("Loading hotspot CSV from:", chosen_url)

hotspots_df = read_csv_safely(chosen_url)
print("Rows:", len(hotspots_df), "Cols:", len(hotspots_df.columns))
hotspots_df.head()


Loading hotspot CSV from: https://drive.google.com/uc?export=download&id=1hUfbphCjCPkdgS2n9SE1hjugkNxq22EZ
Rows: 126 Cols: 4


|   | timestamp | lat | lon | pm25 |
|---|---|---|---|---|
| 0 | 2025-10-20T03:00:00+00:00 | 38.8 | 248.0 | 10.482469 |
| 1 | 2025-10-20T15:00:00+00:00 | 38.8 | 248.0 | 10.46417 |
| 2 | 2025-10-20T03:00:00+00:00 | 38.8 | 248.4 | 10.732413 |
| 3 | 2025-10-20T15:00:00+00:00 | 38.8 | 248.4 | 10.619511 |
| 4 | 2025-10-20T03:00:00+00:00 | 38.8 | 248.8 | 10.234383 |


## 9) Discover sensors (from SciDX) + compute nearest sensor per hotspot

This mirrors your core research goal:
- take a hotspot bounding box
- compute its centroid
- find the nearest registered sensor (by haversine distance)


In [25]:
import math

def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0
    p1 = math.radians(lat1)
    p2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

sensor_methods = streaming.search_consumption_methods(terms=["sensor", region])
print("Sensor methods found:", len(sensor_methods))

# Build a simple list of sensors with coordinates.
# The resource config is named res_cfg so it does not overwrite the
# config.yaml dict (cfg) loaded in the setup cell.
sensors = []
for ds in sensor_methods:
    res = (ds.get("resources") or [{}])[0]
    res_cfg = (res.get("config") or {})
    lat = res_cfg.get("latitude")
    lon = res_cfg.get("longitude")
    if lat is None or lon is None:
        continue
    sensors.append({
        "name": ds.get("name"),
        "station_id": res_cfg.get("station_id"),
        "lat": float(lat),
        "lon": float(lon),
    })

assert sensors, "No sensors with coordinates found."

required_cols = {"lat_min","lat_max","lon_min","lon_max"}
missing = required_cols - set(hotspots_df.columns)
assert not missing, f"Hotspot CSV missing columns: {missing}"

has_ts = "timestamp" in hotspots_df.columns
has_pm25 = "pm25_max" in hotspots_df.columns

for _, row in hotspots_df.iterrows():
    center_lat = (float(row["lat_min"]) + float(row["lat_max"])) / 2.0
    center_lon = (float(row["lon_min"]) + float(row["lon_max"])) / 2.0

    nearest = min(sensors, key=lambda s: haversine_km(center_lat, center_lon, s["lat"], s["lon"]))
    d_km = haversine_km(center_lat, center_lon, nearest["lat"], nearest["lon"])

    ts_part = f"{row['timestamp']} | " if has_ts else ""
    pm_part = f"PM2.5 max={row['pm25_max']} → " if has_pm25 else "→ "

    print(f"[HOTSPOT] {ts_part}{pm_part}{nearest['station_id']} ({d_km:.1f} km)")


Sensor methods found: 5


AssertionError: Hotspot CSV missing columns: {'lon_max', 'lat_min', 'lat_max', 'lon_min'}
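
The assertion fails because the CSV loaded in step 8 has point columns (`lat`, `lon`, `pm25`) rather than the bounding-box columns this cell expects. One way to make the lookup tolerant of both layouts is sketched below. `hotspot_center` is a hypothetical helper, and the longitude wrap assumes the forecast grid uses the 0 to 360 degree convention, which the `248.0` values in the loaded CSV suggest but the notebook does not state.

```python
def hotspot_center(row: dict) -> tuple[float, float]:
    """Return (lat, lon) in degrees, with longitude wrapped to [-180, 180).

    Accepts either a bounding box (lat_min, lat_max, lon_min, lon_max)
    or a single point (lat, lon).
    """
    if all(k in row for k in ("lat_min", "lat_max", "lon_min", "lon_max")):
        lat = (float(row["lat_min"]) + float(row["lat_max"])) / 2.0
        lon = (float(row["lon_min"]) + float(row["lon_max"])) / 2.0
    elif "lat" in row and "lon" in row:
        lat, lon = float(row["lat"]), float(row["lon"])
    else:
        raise KeyError("row has neither bounding-box nor lat/lon columns")
    # Assumption: 0..360 longitudes are mapped to -180..180 so they are
    # directly comparable with typical station metadata.
    lon = ((lon + 180.0) % 360.0) - 180.0
    return lat, lon

point_center = hotspot_center({"lat": 38.8, "lon": 248.0})
box_center = hotspot_center(
    {"lat_min": 38.0, "lat_max": 39.0, "lon_min": -112.5, "lon_max": -111.5}
)
print(point_center)  # (38.8, -112.0)
print(box_center)    # (38.5, -112.0)
```

The haversine formula is periodic in longitude, so the wrap does not change distances; it mainly keeps printed coordinates consistent. In the loop above, `center_lat, center_lon = hotspot_center(row)` would replace the two centroid lines, with the PM2.5 column name (`pm25` or `pm25_max`) chosen the same way.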

## 10) Next step: schedule a job on Sage for the selected sensor(s)

At this point you have, for each hotspot:
- hotspot time + bounding box
- nearest sensor station_id (+ coordinates)

Your **next integration point** is to take `station_id` and call the scheduler / Sage job launcher, e.g.:

- create a job payload that includes: station_id, time window, variables, output location
- submit the job
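
The payload described above might be assembled as follows. This is a sketch under stated assumptions: the field names, the one-hour window, the `pm25` variable, and the output URI are all hypothetical, and the actual submission call is left out because the Sage scheduler interface is not covered in this notebook.

```python
import json
from datetime import datetime, timedelta, timezone

def build_job_payload(
    station_id: str,
    hotspot_time: datetime,
    window_hours: int = 1,
    variables: tuple = ("pm25",),
    output_uri: str = "file:///tmp/hotspot_jobs",  # placeholder location
) -> dict:
    """Build a JSON-serializable job description centered on the hotspot time."""
    start = hotspot_time - timedelta(hours=window_hours)
    end = hotspot_time + timedelta(hours=window_hours)
    return {
        "station_id": station_id,
        "time_window": {"start": start.isoformat(), "end": end.isoformat()},
        "variables": list(variables),
        "output_location": output_uri,
    }

payload = build_job_payload(
    "QCV", datetime(2025, 10, 20, 3, 0, tzinfo=timezone.utc)
)
print(json.dumps(payload, indent=2))
```

Building the payload as a plain dict keeps it easy to log and to adapt once the scheduler's real schema is known.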
