# Inspiration & Description
This notebook demonstrates a tiny end-to-end pipeline that uses GenAI to automate building a scenario taxonomy from the testing split of the [Waymo Motion Dataset](https://waymo.com/open/).

The goal is to replace the manual, time-consuming process of curating corner cases with an automated pipeline that parses multi-agent motion data, extracts interpretable features, and generates structured, human-readable scenario summaries. For systems and test engineering at Waymo, this matters because the space of possible driving situations is enormous (and growing!). Humans cannot hand-label or exhaustively enumerate them.  

By introducing GenAI at the summarization step, we turn raw kinematics into labeled scenarios at scale, in a form that is easy to cluster and incorporate into test design.

Inspiration: I started this project after reading more about Waymo's [safety framework](https://waymo.com/blog/2020/10/sharing-our-safety-framework) and thinking about how scenario taxonomy might use GenAI as Waymo rapidly scales its worldwide footprint (and thus testing).

## Setup & Dependencies

- **Waymo proto parsing + TensorFlow** for reading `.tfrecord` scenario files.
- **Sentence Transformers + scikit-learn** for embedding summaries and clustering them into a taxonomy.
- **Gemini API** for structured JSON event summarization.
- Standard Python utilities (`json`, `numpy`, `collections`, etc.).

In [10]:
# waymo open dataset import
!pip install --no-deps "waymo-open-dataset-tf-2-12-0==1.6.7"
!pip install -q -U google-generativeai
!pip -q install pydantic
from waymo_open_dataset.protos import scenario_pb2

# general imports
from google.colab import auth
import numpy as np
import os, json, time, re
import requests
import math
import collections
from collections import Counter
import matplotlib.pyplot as plt
from pathlib import Path

# ml imports
from sklearn.cluster import KMeans
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from sklearn.metrics import pairwise_distances_argmin_min

# gemini import
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

Collecting waymo-open-dataset-tf-2-12-0==1.6.7
  Downloading waymo_open_dataset_tf_2_12_0-1.6.7-py3-none-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (1.8 kB)
Downloading waymo_open_dataset_tf_2_12_0-1.6.7-py3-none-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (3.8 MB)
Installing collected packages: waymo-open-dataset-tf-2-12-0
Successfully installed waymo-open-dataset-tf-2-12-0-1.6.7


In [None]:
# authentication for google user
auth.authenticate_user()

# GCS glob pattern for the testing-split scenario shards
PATTERN = "gs://waymo_open_dataset_motion_v_1_3_0/uncompressed/scenario/testing/testing.tfrecord-*"
files = tf.io.gfile.glob(PATTERN)
print("files:", len(files), "e.g.", files[:3])

files: 150 e.g. ['gs://waymo_open_dataset_motion_v_1_3_0/uncompressed/scenario/testing/testing.tfrecord-00000-of-00150', 'gs://waymo_open_dataset_motion_v_1_3_0/uncompressed/scenario/testing/testing.tfrecord-00001-of-00150', 'gs://waymo_open_dataset_motion_v_1_3_0/uncompressed/scenario/testing/testing.tfrecord-00002-of-00150']


In [None]:
# Stream all shards
ds = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)

## Helpers
These utility functions handle the raw Waymo Motion Dataset `.tfrecord` files and convert them into a structured format.

- **`iter_scenarios`**: Iterates over serialized `Scenario` protobufs from the dataset and yields them one by one.
- **`tracks_to_arrays`**: Converts each track into arrays of position, velocity, yaw, and timestamps, while surfacing object types (vehicle, pedestrian, cyclist). Invalid states are skipped.  
- **`build_snapshots`**: Constructs interpretable event snapshots between the self-driving car and its objects of interest. For each pair, it computes:  
  - Closest approach (`min_dist_m` and `step_min`)  
  - Relative gap and orientation:  
    - `gap_m`: distance along the self-driving car’s direction of travel at closest approach  
      - Positive → other object was ahead  
      - Negative → other object was behind  
    - `aligned_frac`: fraction of timesteps with the two moving in roughly the same direction (parallel flow)  
    - `oncoming_frac`: fraction of timesteps with the two moving in roughly opposite directions (head-on)  
    - `ahead_frac`: fraction of timesteps the other object is ahead of the self-driving car  
  - Self-driving car and object speeds at closest approach  

These helpers synthesize data into explainable features that form the foundation for event labeling and AI summarization.
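
To make these features concrete, here is a minimal pure-Python sketch of the per-timestep geometry on a made-up pair of positions and velocities. The function name `step_features` and the toy numbers are illustrative assumptions; the notebook's own NumPy implementation is in the next cell.

```python
import math

def unit(v):
    """Normalize a 2D vector; a zero vector stays (approximately) zero."""
    n = math.hypot(v[0], v[1])
    return (v[0] / (n + 1e-9), v[1] / (n + 1e-9))

def step_features(sdc_pos, sdc_vel, oth_pos, oth_vel):
    """Return (alignment cosine, signed gap along SDC motion, distance) for one timestep."""
    f_s = unit(sdc_vel)
    f_o = unit(oth_vel)
    rel = (oth_pos[0] - sdc_pos[0], oth_pos[1] - sdc_pos[1])
    align = f_s[0] * f_o[0] + f_s[1] * f_o[1]   # > 0.5 counts as aligned, < -0.5 as oncoming
    gap = rel[0] * f_s[0] + rel[1] * f_s[1]     # > 0 ahead of the SDC, < 0 behind it
    dist = math.hypot(rel[0], rel[1])
    return align, gap, dist

# Toy case: SDC drives along +x; the other vehicle is 10 m ahead, 2 m to the side, same direction.
align, gap, dist = step_features((0.0, 0.0), (5.0, 0.0), (10.0, 2.0), (4.0, 0.0))
print(round(align, 3), round(gap, 3), round(dist, 3))

# Second toy case: the other vehicle is behind the SDC and moving the opposite way.
align2, gap2, _ = step_features((0.0, 0.0), (5.0, 0.0), (-8.0, 0.0), (-3.0, 0.0))
print(round(align2, 3), round(gap2, 3))
```

Because the forward direction is taken from velocity, a stationary object yields a near-zero alignment value and so counts as neither aligned nor oncoming.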

In [None]:
# Helpers

# yield Scenario messages from a TFRecordDataset.
def iter_scenarios(dataset, limit=None):
    count = 0
    for raw in dataset:
        sc = scenario_pb2.Scenario(); sc.ParseFromString(raw.numpy())
        yield sc
        count += 1
        if limit is not None and count >= limit:
            break

# convert scene tracks to arrays
def tracks_to_arrays(sc):
    out=[]
    for tr in sc.tracks:
        xs,ys,vxs,vys,yaws,ts=[],[],[],[],[],[]
        for st in tr.states:
            if hasattr(st,"valid") and not st.valid: continue
            xs.append(getattr(st,"center_x",getattr(st,"x",0.0)))
            ys.append(getattr(st,"center_y",getattr(st,"y",0.0)))
            vxs.append(getattr(st,"velocity_x",0.0))
            vys.append(getattr(st,"velocity_y",0.0))
            yaws.append(getattr(st,"heading",0.0))
            ts.append(getattr(st,"time_point",getattr(st,"timestamp_micros",0)))
        if xs:
            out.append({"id":tr.id,"type_num":tr.object_type,
                        "x":np.array(xs,np.float32),"y":np.array(ys,np.float32),
                        "vx":np.array(vxs,np.float32),"vy":np.array(vys,np.float32),
                        "yaw":np.array(yaws,np.float32),"t":np.array(ts,np.int64)})
    return out

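# small helpers used by build_snapshots
def unit(v):
    # normalize a vector; the epsilon keeps a zero velocity at ~(0, 0) instead of dividing by zero
    n=np.linalg.norm(v); return v/(n+1e-9)

# map the ObjectType enum value to a short name, e.g. TYPE_VEHICLE -> "vehicle"
def type_name(num: int) -> str:
    return scenario_pb2.Track.ObjectType.Name(num).replace("TYPE_", "").lower()

# build scene snapshots with defined parameters (scene_id, speed, distance, etc)
def build_snapshots(sc, tracks):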
    sdc_id = sc.tracks[sc.sdc_track_index].id
    important_ids = list(sc.objects_of_interest)
    id2 = {t["id"]: t for t in tracks}
    sdc = id2.get(sdc_id); snaps=[]
    if sdc is None: return snaps
    sdc_type = type_name(sdc["type_num"])

    for oid in important_ids:
        if oid==sdc_id or oid not in id2:
            continue
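        oth=id2[oid]; T=min(len(sdc["x"]),len(oth["x"]))
        # NOTE: tracks_to_arrays drops invalid states per track, so index k may not
        # refer to the same timestep in both tracks when either one has gaps.
        if T<2: continue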

        align=[]; ahead=[]; dist=[]
        for k in range(T):
            f_s=unit(np.array([sdc["vx"][k],sdc["vy"][k]],np.float64))
            f_o=unit(np.array([oth["vx"][k],oth["vy"][k]],np.float64))
            rel=np.array([oth["x"][k]-sdc["x"][k],oth["y"][k]-sdc["y"][k]],np.float64)
            align.append(float(f_s@f_o))
            ahead.append(float(rel@f_s))         # >0 ahead, <0 behind
            dist.append(float(np.linalg.norm(rel)))

        kmin=int(np.argmin(dist))
        snaps.append({
          "scene_id": sc.scenario_id,
          "self_id": int(sdc_id),
          "self_type": sdc_type,
          "other_id": int(oid),
          "other_type": type_name(oth["type_num"]),
          "is_object_of_interest": True,
          "min_dist_m": dist[kmin],
          "step_min": kmin,
          "gap_m": ahead[kmin],
          "aligned_frac": sum(a>0.5 for a in align)/T,
          "oncoming_frac": sum(a<-0.5 for a in align)/T,
          "ahead_frac": sum(g>0 for g in ahead)/T,
          "self_speed_mps": float(np.hypot(sdc["vx"][kmin],sdc["vy"][kmin])),
          "other_speed_mps": float(np.hypot(oth["vx"][kmin],oth["vy"][kmin])),
        })
    return snaps

In [None]:
def label_snapshot(s):
    # thresholds
    CLOSE_PASS_M      = 5.0
    FOLLOW_HW_S       = 1.0
    FOLLOW_GAP_M      = 15.0
    ONCOMING_TTC_S    = 2.0
    PED_CYC_NEAR_M    = 12.0
    CROSSING_DIST_M   = 20.0
    CROSSING_ALIGN_MAX= 0.3

    # 1) close pass
    if s["min_dist_m"] <= CLOSE_PASS_M:
        return "close_pass"

    # 2) following too close (same-ish direction, other mostly ahead)
    if s["aligned_frac"] > 0.5 and s["ahead_frac"] > 0.5 and s["gap_m"] > 0:
        hw = s["gap_m"] / max(0.1, s["self_speed_mps"])
        if hw < FOLLOW_HW_S and s["gap_m"] < FOLLOW_GAP_M:
            return "following_too_close"

    # 3) oncoming interaction
    if s["oncoming_frac"] > 0.5:
        ttc = s["min_dist_m"] / max(0.1, s["self_speed_mps"] + s["other_speed_mps"])
        if ttc < ONCOMING_TTC_S:
            return "oncoming_interaction"

    # 4) VRU nearby (only if none of the above fired)
    if s["other_type"] in {"pedestrian","cyclist"} and s["min_dist_m"] <= PED_CYC_NEAR_M:
        return "ped_or_cyclist_nearby"

    # 5) crossing paths (only if none of the above)
    if s["aligned_frac"] <= CROSSING_ALIGN_MAX and s["min_dist_m"] <= CROSSING_DIST_M:
        return "crossing_paths"

    # 6) default
    return "none"

# textual summaries to provide template / exemplar for Gemini
def summarize_from_label(s):
    if s["label"] == "ped_or_cyclist_nearby":
        return f"{s['self_type']} near a {s['other_type']} (min distance ≈ {s['min_dist_m']:.1f} m)."
    if s["label"] == "crossing_paths":
        return f"{s['self_type']} and {s['other_type']} move on crossing paths; min distance ≈ {s['min_dist_m']:.1f} m."
    if s["label"] == "close_pass":
        return f"{s['self_type']} and {s['other_type']} pass within {s['min_dist_m']:.1f} m at step {s['step_min']}."
    if s["label"] == "following_too_close":
        hw = s["gap_m"] / max(0.1, s["self_speed_mps"])
        return f"{s['self_type']} follows a {s['other_type']} with headway ≈ {hw:.1f} s (gap ≈ {s['gap_m']:.1f} m)."
    if s["label"] == "oncoming_interaction":
        ttc = s["min_dist_m"] / max(0.1, s["self_speed_mps"] + s["other_speed_mps"])
        return f"{s['self_type']} and {s['other_type']} approach head-on; TTC ≈ {ttc:.1f} s (min distance ≈ {s['min_dist_m']:.1f} m)."
    return f"No notable event between {s['self_type']} and {s['other_type']} (min distance {s['min_dist_m']:.1f} m)."


## Event Writer
This step ties the helpers together into a single pass over the dataset.  
For each scenario:

- Extract tracks and build interaction snapshots.  
- Apply the **heuristic event labeler** (`label_snapshot`) to categorize the interaction.  
- Generate a **deterministic summary** (`summarize_from_label`) to provide an exemplar / template to Gemini.  
- Write everything into a JSON Lines file (`scenario_events.jsonl`) for later use.  

The output includes both structured data (distances, fractions) and a label/summary.  
We also keep a running counter of label frequencies to get a quick sense of dataset coverage.
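
The labeler applied in this pass thresholds on two derived quantities: a headway (gap divided by self-driving car speed) and a time-to-collision value (minimum distance divided by the sum of both speeds). Below is a minimal sketch of both, with illustrative function names; the 0.1 m/s speed floor mirrors the one in `label_snapshot`. The TTC value should be read as a rough proxy, since it combines closest-approach distance with scalar speeds rather than modeling relative motion.

```python
def headway_s(gap_m, self_speed_mps, floor=0.1):
    """Time for the SDC to cover the longitudinal gap at its current speed."""
    return gap_m / max(floor, self_speed_mps)

def ttc_proxy_s(min_dist_m, self_speed_mps, other_speed_mps, floor=0.1):
    """Closest-approach distance over combined speed (assumes head-on closing)."""
    return min_dist_m / max(floor, self_speed_mps + other_speed_mps)

# Values taken from the first record of scenario_events.jsonl shown below.
hw = headway_s(22.937731813339134, 4.325139045715332)
print(f"headway = {hw:.2f} s")

# Values taken from the second record (oncoming_frac = 1.0).
ttc = ttc_proxy_s(33.02607636391494, 4.325139045715332, 1.0597200393676758)
print(f"ttc proxy = {ttc:.2f} s")

# Near-zero speeds hit the floor and inflate the headway (made-up numbers).
creeping = headway_s(20.0, 0.05)
print(f"creeping headway = {creeping:.0f} s")
```

The second record's proxy comes out well above the 2 s `ONCOMING_TTC_S` threshold, which is consistent with it being labeled `none` despite a fully oncoming direction.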

In [None]:
OUTP   = "/content/scenario_events.jsonl"
SCENES = 200  # or None

counts = collections.Counter()
wrote = 0

with tf.io.gfile.GFile(OUTP, "w") as fout:
    for sc in iter_scenarios(ds, limit=SCENES):
        tracks = tracks_to_arrays(sc)

        for s in build_snapshots(sc, tracks):
            s["label"]   = label_snapshot(s)
            s["summary"] = summarize_from_label(s)
            fout.write(json.dumps(s) + "\n")
            counts[s["label"]] += 1
            wrote += 1

print(f"events written: {wrote} → {OUTP}")
print("label counts:", dict(counts))


events written: 103 → /content/scenario_events.jsonl
label counts: {'none': 87, 'close_pass': 3, 'crossing_paths': 9, 'following_too_close': 1, 'ped_or_cyclist_nearby': 3}


Most interactions in the sample look uneventful (wide gaps, parallel motion, nothing risky), which is why the majority get the `none` label. A few rarer cases (close passes, crossing paths, bikes/peds) show up in smaller numbers, hinting at the edge cases this pipeline could surface.

In [None]:
!head -n 5 /content/scenario_events.jsonl || true

{"scene_id": "81cb7891dfe69679", "self_id": 2545, "self_type": "vehicle", "other_id": 1671, "other_type": "vehicle", "is_object_of_interest": true, "min_dist_m": 24.931882360581174, "step_min": 10, "gap_m": 22.937731813339134, "aligned_frac": 0.0, "oncoming_frac": 0.0, "ahead_frac": 1.0, "self_speed_mps": 4.325139045715332, "other_speed_mps": 1.1509385108947754, "label": "none", "summary": "No notable event between vehicle and vehicle (min distance 24.9 m)."}
{"scene_id": "81cb7891dfe69679", "self_id": 2545, "self_type": "vehicle", "other_id": 1666, "other_type": "vehicle", "is_object_of_interest": true, "min_dist_m": 33.02607636391494, "step_min": 10, "gap_m": 32.7131291964221, "aligned_frac": 0.0, "oncoming_frac": 1.0, "ahead_frac": 1.0, "self_speed_mps": 4.325139045715332, "other_speed_mps": 1.0597200393676758, "label": "none", "summary": "No notable event between vehicle and vehicle (min distance 33.0 m)."}
{"scene_id": "3e4bd1a8eab0cab7", "self_id": 5718, "self_type": "vehicle", "

Here we define a JSON schema for GenAI outputs and initialize the Gemini model with low temperature (for rigid outputs), extra token budget, and forced JSON mode. This ensures each scenario is summarized consistently with event labels and relevant data in a structured, machine-usable format.
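
Even with forced JSON mode, it can be useful to sanity-check each returned object before using it downstream. The following is a hand-rolled sketch, not the Gemini SDK's own validation; `check_output` is a hypothetical helper, and it covers only required keys, the label enum, and one numeric field.

```python
import json

EVENT_LABELS = {"close_pass", "following_too_close", "oncoming_interaction",
                "ped_or_cyclist_nearby", "crossing_paths", "none"}
REQUIRED = ["scene_id", "self_type", "other_type", "event_label",
            "key_numbers", "summary", "confidence"]

def check_output(obj):
    """Return a list of problems found in one model output (empty list = looks OK)."""
    problems = [f"missing key: {k}" for k in REQUIRED if k not in obj]
    if obj.get("event_label") not in EVENT_LABELS:
        problems.append(f"unexpected event_label: {obj.get('event_label')!r}")
    key_numbers = obj.get("key_numbers") or {}
    if not isinstance(key_numbers.get("min_distance_m"), (int, float)):
        problems.append("key_numbers.min_distance_m is not a number")
    return problems

# Made-up objects: one well-formed, one missing fields and using an unknown label.
good = json.loads('{"scene_id": "abc", "self_type": "vehicle", "other_type": "cyclist", '
                  '"event_label": "none", "key_numbers": {"min_distance_m": 9.1, '
                  '"headway_s": null, "ttc_s": null}, "summary": "ok", "confidence": 0.9}')
bad = {"scene_id": "abc", "event_label": "near_miss"}

print(check_output(good))
print(check_output(bad))
```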


In [None]:
EVENT_LABELS = [
  "close_pass","following_too_close","oncoming_interaction",
  "ped_or_cyclist_nearby","crossing_paths","none"
]

RESPONSE_SCHEMA = {
  "type": "object",
  "properties": {
    "scene_id":   {"type": "string"},
    "self_type":  {"type": "string", "enum": ["vehicle","pedestrian","cyclist"]},
    "other_type": {"type": "string", "enum": ["vehicle","pedestrian","cyclist"]},
    "event_label":{"type": "string", "enum": EVENT_LABELS},
    "risk_factors":{"type": "array", "items": {"type": "string"}},
    "key_numbers": {
      "type": "object",
      "properties": {
        "min_distance_m": {"type": "number"},
        "headway_s": {"type": "number", "nullable": True},
        "ttc_s":     {"type": "number", "nullable": True}
      },
      "required": ["min_distance_m","headway_s","ttc_s"]
    },
    "summary":    {"type": "string"},
    "confidence": {"type": "number"}
  },
  "required": ["scene_id","self_type","other_type","event_label",
               "key_numbers","summary","confidence"]
}

In [None]:
model = genai.GenerativeModel(
    model_name="gemini-2.5-flash",
    generation_config=genai.GenerationConfig(
        temperature=0.2,
        max_output_tokens=1024,                 # extra headroom avoids empty parts
        response_mime_type="application/json",  # force JSON output
        response_schema=RESPONSE_SCHEMA,        # supported fields only
    ),
)

Building the prompt to pass into the model.

In [None]:
def make_prompt(evt: dict) -> str:
    headway = (evt["gap_m"]/max(0.1, evt["self_speed_mps"])) if evt["gap_m"]>0 else None
    ttc = (evt["min_dist_m"]/max(0.1, evt["self_speed_mps"]+evt["other_speed_mps"])) if evt["oncoming_frac"]>0.5 else None
    return (
        "Summarize this driving interaction in one sentence. "
        "Pick an event_label from the allowed set; use 'none' if nothing notable.\n\n"
        f"scene_id={evt['scene_id']}\n"
        f"self_type={evt['self_type']}  other_type={evt['other_type']}\n"
        f"min_distance_m={evt['min_dist_m']:.2f}  step_min={evt['step_min']}\n"
        f"aligned_frac={evt['aligned_frac']:.2f}  oncoming_frac={evt['oncoming_frac']:.2f}  ahead_frac={evt['ahead_frac']:.2f}\n"
        f"gap_m={evt['gap_m']:.2f}  self_speed_mps={evt['self_speed_mps']:.2f}  other_speed_mps={evt['other_speed_mps']:.2f}\n"
        f"computed_headway_s={headway if headway is not None else 'null'}  "
        f"computed_ttc_s={ttc if ttc is not None else 'null'}\n"
        "Return only the JSON object per the schema."
    )


We stream each snapshot through Gemini. Any malformed or empty generations are logged and skipped.
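
The loop in the next cell prints each result to stdout, while later cells read them back from `/content/outputs.txt`. One possible way to persist results directly, sketched here with a hypothetical `write_jsonl_line` helper, stand-in records, and a temporary path, is to write one JSON line per input event, including the skip-status records.

```python
import json
import os
import tempfile

def write_jsonl_line(fh, obj):
    """Serialize one result (or skip-status record) as a single JSON line."""
    fh.write(json.dumps(obj, ensure_ascii=False, default=str) + "\n")

# Stand-ins for one successful generation and one skipped one.
records = [
    {"scene_id": "abc", "event_label": "none", "summary": "Nothing notable."},
    {"scene_id": "abc", "other_id": 7, "status": "skipped"},
]

out_path = os.path.join(tempfile.mkdtemp(), "outputs.txt")  # hypothetical location
with open(out_path, "w", encoding="utf-8") as fh:
    for rec in records:
        write_jsonl_line(fh, rec)

# Reading back: one line per input event, so line order still matches the input file.
with open(out_path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded), ["event_label" in r for r in loaded])
```

Keeping exactly one line per input event matters later: the clustering step uses line positions in this file as indices back into `scenario_events.jsonl`.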


In [None]:
INP = "/content/scenario_events.jsonl"

def _parse_json_part(text: str):
    if not text:
        return None
    t = text.strip()
    # strip ```json ... ``` fences if present
    if t.startswith("```"):
        t = re.sub(r"^```(?:json)?\n?", "", t)
        t = re.sub(r"\n?```$", "", t)
    # take substring between first { and last }
    i, j = t.find("{"), t.rfind("}")
    if i != -1 and j != -1 and j > i:
        t = t[i:j+1]
    # normalize smart quotes
    t = (t.replace("\u201c", '"').replace("\u201d", '"')
           .replace("\u2018", "'").replace("\u2019", "'"))
    try:
        return json.loads(t)
    except json.JSONDecodeError:
        return None

printed = 0
skipped = 0

with open(INP) as fin:
    for line in fin:
        evt = json.loads(line)

        resp = model.generate_content(make_prompt(evt))
        cand = resp.candidates[0] if resp.candidates else None
        parts = cand.content.parts if (cand and cand.content) else []

        if not parts:
            skipped += 1
            print(json.dumps({
                "scene_id": evt.get("scene_id"),
                "other_id": evt.get("other_id"),
                "status": "skipped",
                "finish_reason": getattr(cand, "finish_reason", None),
                "prompt_feedback": getattr(resp, "prompt_feedback", None),
            }, default=str))
            continue

        obj = _parse_json_part(parts[0].text)
        if obj is None:
            skipped += 1
            print(json.dumps({
                "scene_id": evt.get("scene_id"),
                "other_id": evt.get("other_id"),
                "status": "skipped_parse_error"
            }))
            continue


        obj["raw_features"] = {
            k: evt[k] for k in [
                "min_dist_m","gap_m","aligned_frac","oncoming_frac",
                "ahead_frac","self_speed_mps","other_speed_mps","step_min"
            ] if k in evt
        }
        print(json.dumps(obj, ensure_ascii=False))
        printed += 1

print(f"printed={printed} skipped={skipped}")

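A peek into `outputs.txt`, the file the next cells read. It holds the JSON lines printed by the loop above (the loop itself only prints, so they have to be saved to the file separately). Let's surface some interesting outputs.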

In [24]:
!head -n 60 /content/outputs.txt || true

{"scene_id": "81cb7891dfe69679", "self_type": "vehicle", "other_type": "vehicle", "event_label": "none", "key_numbers": {"min_distance_m": 24.93, "headway_s": 5.303351307528999, "ttc_s": null}, "summary": "The self vehicle followed another vehicle at a minimum distance of 24.93 meters with a headway of 5.30 seconds.", "confidence": 0.9, "risk_factors": [], "raw_features": {"min_dist_m": 24.931882360581174, "gap_m": 22.937731813339134, "aligned_frac": 0.0, "oncoming_frac": 0.0, "ahead_frac": 1.0, "self_speed_mps": 4.325139045715332, "other_speed_mps": 1.1509385108947754, "step_min": 10}}
{"scene_id": "81cb7891dfe69679", "other_id": 1666, "status": "skipped", "finish_reason": 2, "prompt_feedback": ""}
{"scene_id": "3e4bd1a8eab0cab7", "self_type": "vehicle", "other_type": "vehicle", "event_label": "none", "key_numbers": {"min_distance_m": 24.83, "headway_s": 2.5293853698379256, "ttc_s": null}, "summary": "The self vehicle followed another vehicle at a safe distance and headway.", "confide

### Interesting Outputs/Cases

The model surfaces mostly benign interactions, but some are clearly risky or unusual.  

- **Clear risk spike:** `following_too_close` with **headway ≈ 0.22s** at higher self-driving car speed — a good example of the kind of risk we want to surface:  
  > *“The self vehicle, traveling at 13.43 m/s, followed another vehicle, traveling at 2.66 m/s, with a dangerously low headway of 0.22 seconds, indicating a following too close event.”*  
  **Min dist:** 9.43 m · **Headway:** 0.22 s · **Ahead frac:** 1.00  

- **VRU (Vulnerable Road User) edge case:** pedestrian crossing paths at **min_dist ≈ 3.05m** and **TTC ≈ 1.79s** — low margin, high review priority:  
  > *“A stationary vehicle had a close interaction with a pedestrian who was moving and potentially crossing its path, with a minimum distance of 3.05 meters and a time-to-collision of 1.79 seconds.”*  
  **Oncoming frac:** 0.55  

- **Oncoming but distant:** several `oncoming_interaction` summaries report **TTC ≈ 3–5 s** even though **min_dist** ranges from about 15 m to over 50 m:  
  - Scene *79ddf3…*:  
    > *“The self vehicle had an oncoming interaction with another vehicle, maintaining a minimum distance of 54.23 meters and a time to collision of 3.33 seconds.”*  
  - Scene *edda5a…*:  
    > *“The self vehicle and another oncoming vehicle had a safe interaction with a minimum distance of 15.98 meters, a headway of 5.59 seconds, and a time-to-collision of 4.99 seconds.”*  

- **Cyclist nearby:** scene *b23421d3…* highlights VRU proximity worth flagging:  
  > *“A vehicle interacted with a cyclist, maintaining a minimum distance of 9.12 meters.”*  
  **Headway:** 0.086 s · **Oncoming frac:** 0.45  

- **Stationary artifacts to watch:** some scenes show **headways of 100–300s** with near-zero speeds (e.g., *d1575d…*, *800215…*), which are likely parked/creeping cases that need special handling rather than being treated as risks.

---

These examples show why structured AI outputs could be helpful. They surface interesting scenarios and flag edge cases that engineers might miss.


### Risk tagging  
We reload the AI outputs and apply simple heuristic gates (distance < 10 m, headway < 1.5 s, TTC < 2 s, or risky labels like *following_too_close* / *crossing_paths*).  

This produces a quick binary `_risk` flag for each event, letting us separate ordinary cases from potential safety-critical ones.
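
As a quick worked check, the gates described above can be written as a single predicate and applied to a few of the numbers quoted in the "Interesting Outputs/Cases" section. The function name `is_risky` is an illustrative assumption; the notebook's own version is inlined in the next cell.

```python
def is_risky(event_label, min_distance_m=None, headway_s=None, ttc_s=None):
    """Apply the four gates; None means the value was not reported."""
    return bool(
        (min_distance_m is not None and min_distance_m < 10.0)
        or (headway_s is not None and headway_s < 1.5)
        or (ttc_s is not None and ttc_s < 2.0)
        or event_label in {"following_too_close", "crossing_paths"}
    )

# Tailgating case: min dist 9.43 m, headway 0.22 s.
close_follow = is_risky("following_too_close", min_distance_m=9.43, headway_s=0.22)
# Distant oncoming case: min dist 54.23 m, TTC 3.33 s.
far_oncoming = is_risky("oncoming_interaction", min_distance_m=54.23, ttc_s=3.33)
# Benign following case: min dist 24.93 m, headway 5.30 s.
benign = is_risky("none", min_distance_m=24.93, headway_s=5.30)

print(close_follow, far_oncoming, benign)  # True False False
```

Since the gates are OR-ed together, a single tripped threshold is enough to flag an event, which favors recall over precision.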

In [11]:
INP = "/content/outputs.txt"  # saved output of the Gemini loop
events = []
with open(INP) as fh:
    for line in fh:
        try:
            obj = json.loads(line)
        except Exception:
            continue
        if "event_label" not in obj:  # skip status lines
            continue
        # Normalize numbers
        kn = obj.get("key_numbers", {}) or {}
        md  = kn.get("min_distance_m")
        hw  = kn.get("headway_s")
        ttc = kn.get("ttc_s")
        # Simple risk gates
        risky = (
            (md is not None and md < 10.0) or
            (hw is not None and hw < 1.5) or
            (ttc is not None and ttc < 2.0) or
            (obj.get("event_label") in {"following_too_close","crossing_paths"})
        )
        obj["_risk"] = bool(risky)
        events.append(obj)

len_all = len(events)
len_risky = sum(e["_risk"] for e in events)
print(f"loaded={len_all} | risky={len_risky}")

loaded=60 | risky=11


### Preparing text for clustering  
We assemble short text snippets to represent each event.  

This gives us a clean list of `(index, text)` pairs that can be embedded and clustered in the next step.


In [12]:
LLM_PATH = "/content/outputs.txt"
RAW_PATH = "/content/scenario_events.jsonl"

events_raw = [json.loads(x) for x in open(RAW_PATH)]

def texts_and_indices():
    if os.path.exists(LLM_PATH):

        out = []
        for i, line in enumerate(open(LLM_PATH)):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a trailing "printed=... skipped=..." line

            if "summary" in obj:
                label = obj.get("event_label", "")
                out.append((i, f"{obj['summary']} [label={label}]"))
        return out
    else:

        out = []
        for i, e in enumerate(events_raw):
            txt = (f"{e['self_type']} vs {e['other_type']} | "
                   f"min_dist={e['min_dist_m']:.1f}m, gap={e['gap_m']:.1f}m | "
                   f"ahead={e['ahead_frac']:.2f}, oncoming={e['oncoming_frac']:.2f}, aligned={e['aligned_frac']:.2f}")
            out.append((i, txt))
        return out

pairs = texts_and_indices()
idxs  = [p[0] for p in pairs]
texts = [p[1] for p in pairs]
print("Texts ready:", len(texts), "| sample:", texts[:2])


Texts ready: 60 | sample: ['The self vehicle followed another vehicle at a minimum distance of 24.93 meters with a headway of 5.30 seconds. [label=none]', 'The self vehicle followed another vehicle at a safe distance and headway. [label=none]']


### Embedding → Clustering → Taxonomy
We turn each event summary into a vector, cluster them with a small k-means, and pick a representative example per cluster.

This produces a lightweight taxonomy (`taxonomy.jsonl`) that names each cluster.
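
The "representative example" idea can be illustrated without any ML libraries: compute each cluster's centroid and pick the member closest to it. The sketch below uses toy 2D vectors and pre-assigned labels, and restricts the search to each cluster's own members; the notebook itself relies on scikit-learn's `KMeans` and `pairwise_distances_argmin_min` over all embeddings. The function name `representatives` is an assumption for illustration.

```python
import math
from collections import defaultdict

def representatives(vectors, labels):
    """For each cluster label, return the index of the member closest to the centroid."""
    by_cluster = defaultdict(list)
    for i, lbl in enumerate(labels):
        by_cluster[lbl].append(i)
    reps = {}
    for lbl, members in by_cluster.items():
        dim = len(vectors[members[0]])
        centroid = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
        reps[lbl] = min(members, key=lambda i: math.dist(vectors[i], centroid))
    return reps

# Toy 2D "embeddings": cluster 0 near the origin, cluster 1 near (10, 10).
vectors = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1), (10.0, 10.0), (11.0, 10.0), (10.4, 10.2)]
labels = [0, 0, 0, 1, 1, 1]
reps = representatives(vectors, labels)
print(reps)
```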

In [13]:

# named `embedder` so it does not overwrite the Gemini `model` defined earlier
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, fast
X = embedder.encode(texts, normalize_embeddings=True, convert_to_numpy=True, batch_size=64, show_progress_bar=True)
print("Embedding shape:", X.shape)

# Choose k and cluster
n = len(texts)
if n < 2:
    raise ValueError("Need at least 2 items to cluster.")
k = min(6, max(2, n))  # small, readable taxonomy
km = KMeans(n_clusters=k, n_init="auto", random_state=0)
labels = km.fit_predict(X)

# Pick a representative item per cluster (closest to centroid)
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
rep_idx = rep_idx.tolist()

# Build a simple taxonomy object
OUT_TAX = "/content/taxonomy.jsonl"
taxonomy = []
by_cluster = collections.defaultdict(list)
for i, lbl in enumerate(labels):
    by_cluster[int(lbl)].append(i)

for c in range(k):
    members = by_cluster[c]
    rep_i   = rep_idx[c]
    rep_txt = texts[rep_i]

    label_hint = ""
    if "[label=" in rep_txt:
        try:
            label_hint = rep_txt.split("[label=",1)[1].split("]",1)[0]
        except Exception:
            label_hint = ""
    cluster_name = label_hint or f"cluster_{c}"

    entry = {
        "cluster_id": int(c),
        "name": cluster_name,
        "size": len(members),
        "representative_text": rep_txt,
        "member_indices": [int(idxs[i]) for i in members],  # indices back into your source events
    }
    taxonomy.append(entry)

with open(OUT_TAX, "w") as f:
    for row in taxonomy:
        f.write(json.dumps(row) + "\n")

print(f"Clusters: {k}  |  Saved taxonomy → {OUT_TAX}")
for row in taxonomy:
    print(f"[{row['cluster_id']}] {row['name']}  (n={row['size']})")
    print("   rep:", row["representative_text"][:140], "…")


Embedding shape: (60, 384)
Clusters: 6  |  Saved taxonomy → /content/taxonomy.jsonl
[0] none  (n=12)
   rep: The self vehicle and another vehicle maintained a safe distance, with the other vehicle trailing behind at a slightly slower speed. [label=n …
[1] none  (n=15)
   rep: The self vehicle followed another vehicle at a safe distance and headway. [label=none] …
[2] none  (n=13)
   rep: The self vehicle was stationary while another vehicle passed at a minimum distance of 26.17 meters. [label=none] …
[3] ped_or_cyclist_nearby  (n=7)
   rep: A vehicle maintained a safe distance from a pedestrian ahead, with a minimum distance of 14.11 meters and a headway of 9.42 seconds. [label= …
[4] oncoming_interaction  (n=7)
   rep: The self vehicle had an oncoming interaction with another vehicle, maintaining a minimum distance of 54.23 meters and a time to collision of …
[5] none  (n=6)
   rep: The two vehicles maintained a safe distance from each other, with the ego vehicle moving faster than th

### Cluster summaries (interpretable stats)
For each cluster, we compute summary stats: which object type the self-driving car mostly interacts with, mean and minimum closest-approach distances, and how often motion was oncoming, ahead, or aligned.

In [14]:
# Inputs from earlier steps
TAX_PATH = "/content/taxonomy.jsonl"                     # from the clustering step
RAW_PATH = "/content/scenario_events.jsonl"      # your event snapshots

import json, collections, math

# Load
taxonomy = [json.loads(x) for x in open(TAX_PATH)]
events   = [json.loads(x) for x in open(RAW_PATH)]

def cluster_stats(entry):
    idxs = entry["member_indices"]
    members = [events[i] for i in idxs]

    n = len(members)
    other_types = [m.get("other_type","?") for m in members]
    type_counts = collections.Counter(other_types)
    dom_other, dom_other_cnt = (type_counts.most_common(1)[0] if type_counts else ("?",0))

    min_dists = [m.get("min_dist_m", float("inf")) for m in members]
    oncoming  = [m.get("oncoming_frac", 0.0) for m in members]
    ahead     = [m.get("ahead_frac", 0.0) for m in members]
    aligned   = [m.get("aligned_frac", 0.0) for m in members]

    def avg(xs):
        xs = [x for x in xs if x is not None and not math.isinf(x)]
        return (sum(xs)/len(xs)) if xs else None

    stats = {
        "n": n,
        "dominant_other": dom_other,
        "dominant_other_share": round(dom_other_cnt/max(n,1), 2),
        "min_dist_mean": round(avg(min_dists), 2) if avg(min_dists) is not None else None,
        "min_dist_min": round(min(min_dists), 2) if min_dists else None,
        "oncoming_mean": round(avg(oncoming), 2) if avg(oncoming) is not None else None,
        "ahead_mean": round(avg(ahead), 2) if avg(ahead) is not None else None,
        "aligned_mean": round(avg(aligned), 2) if avg(aligned) is not None else None,
        "pct_close_under_10m": round(sum(d<10 for d in min_dists)/max(n,1), 2),
    }
    return stats

cluster_summaries = []
for t in taxonomy:
    s = cluster_stats(t)
    t["_stats"] = s
    cluster_summaries.append({"cluster_id": t["cluster_id"], **s})

print("Clusters summarized:", len(cluster_summaries))
for s in cluster_summaries:
    print(f"[{s['cluster_id']}] n={s['n']} | dom={s['dominant_other']}({s['dominant_other_share']}) "
          f"| d_mean={s['min_dist_mean']}m (min {s['min_dist_min']}m) | oncoming={s['oncoming_mean']} "
          f"| ahead={s['ahead_mean']} | aligned={s['aligned_mean']} | <10m={s['pct_close_under_10m']}")


Clusters summarized: 6
[0] n=12 | dom=vehicle(1.0) | d_mean=16.88m (min 4.03m) | oncoming=0.11 | ahead=0.25 | aligned=0.77 | <10m=0.33
[1] n=15 | dom=vehicle(1.0) | d_mean=25.91m (min 7.59m) | oncoming=0.03 | ahead=1.0 | aligned=0.54 | <10m=0.2
[2] n=13 | dom=vehicle(0.92) | d_mean=22.36m (min 7.4m) | oncoming=0.21 | ahead=0.34 | aligned=0.48 | <10m=0.15
[3] n=7 | dom=cyclist(0.43) | d_mean=22.06m (min 9.12m) | oncoming=0.12 | ahead=0.81 | aligned=0.58 | <10m=0.14
[4] n=7 | dom=vehicle(0.71) | d_mean=28.58m (min 3.05m) | oncoming=0.65 | ahead=0.65 | aligned=0.34 | <10m=0.14
[5] n=6 | dom=vehicle(1.0) | d_mean=23.8m (min 17.3m) | oncoming=0.21 | ahead=0.47 | aligned=0.7 | <10m=0.0


### Naming clusters & assigning risk
We turn each cluster’s summary stats into a human-readable name (e.g., *following_vehicle*, *oncoming_interaction_pedestrian*) and a risk tier (high/medium/low) using rules on distance, directionality, and VRU presence. The result is `taxonomy_named.jsonl`.

In [17]:
# Heuristic namer
def name_and_risk(stats):
    dom = stats["dominant_other"]
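    # explicit None check so a true minimum distance of 0.0 is not replaced by the sentinel
    dmin = stats["min_dist_min"] if stats["min_dist_min"] is not None else 1e9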
    ahead = stats["ahead_mean"] or 0.0
    oncoming = stats["oncoming_mean"] or 0.0
    aligned = stats["aligned_mean"] or 0.0
    close_rate = stats["pct_close_under_10m"] or 0.0

    if dom in ("pedestrian","cyclist"):
        base = "vru_interaction"  # vulnerable road user
    elif oncoming > 0.5:
        base = "oncoming_interaction"
    elif aligned > 0.6 and ahead > 0.5:
        base = "following"
    elif dmin < 6 or close_rate > 0.3:
        base = "close_pass"
    else:
        base = "general_vehicle_interaction"

    # Simple risk rubric
    if (dom in ("pedestrian","cyclist") and dmin < 12) or dmin < 6:
        risk = "high"
    elif dmin < 15 or base in ("following","oncoming_interaction"):
        risk = "medium"
    else:
        risk = "low"
    return f"{base}_{dom}", risk

# Apply to taxonomy and save
OUT_TAX_NAMED = "/content/taxonomy_named.jsonl"
with open(OUT_TAX_NAMED, "w") as f:
    for t in taxonomy:
        nm, rk = name_and_risk(t["_stats"])
        t["name"] = nm
        t["risk_level"] = rk
        f.write(json.dumps({
            "cluster_id": t["cluster_id"],
            "name": nm,
            "risk_level": rk,
            "size": t["_stats"]["n"],
            "stats": t["_stats"],
            "representative_text": t["representative_text"],
            "member_indices": t["member_indices"],
        }) + "\n")

print("Saved:", OUT_TAX_NAMED)
# Quick preview
with open(OUT_TAX_NAMED) as f:
    for line in f:
        obj = json.loads(line)
        print(f"[{obj['cluster_id']}] {obj['name']} | risk={obj['risk_level']} | n={obj['size']}")

Saved: /content/taxonomy_named.jsonl
[0] close_pass_vehicle | risk=high | n=12
[1] general_vehicle_interaction_vehicle | risk=medium | n=15
[2] general_vehicle_interaction_vehicle | risk=medium | n=13
[3] vru_interaction_cyclist | risk=high | n=7
[4] oncoming_interaction_vehicle | risk=high | n=7
[5] general_vehicle_interaction_vehicle | risk=low | n=6
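
To see which naming rule fires for each cluster, the sketch below replays the rule order on the six rows of summary stats printed earlier. `fired_rule` is a hypothetical helper written for illustration (it is not part of the notebook's pipeline) and assumes the same thresholds as `name_and_risk`.

```python
# Cluster stats transcribed from the summary printout:
# (cluster_id, dominant_other, min_dist_min, oncoming_mean, ahead_mean,
#  aligned_mean, pct_close_under_10m)
CLUSTERS = [
    (0, "vehicle", 4.03, 0.11, 0.25, 0.77, 0.33),
    (1, "vehicle", 7.59, 0.03, 1.00, 0.54, 0.20),
    (2, "vehicle", 7.40, 0.21, 0.34, 0.48, 0.15),
    (3, "cyclist", 9.12, 0.12, 0.81, 0.58, 0.14),
    (4, "vehicle", 3.05, 0.65, 0.65, 0.34, 0.14),
    (5, "vehicle", 17.30, 0.21, 0.47, 0.70, 0.00),
]

def fired_rule(dom, dmin, oncoming, ahead, aligned, close_rate):
    """Hypothetical helper: return (base_name, reason) for the first matching rule."""
    if dom in ("pedestrian", "cyclist"):
        return "vru_interaction", f"dominant agent is a {dom}"
    if oncoming > 0.5:
        return "oncoming_interaction", f"oncoming={oncoming} > 0.5"
    if aligned > 0.6 and ahead > 0.5:
        return "following", f"aligned={aligned} > 0.6 and ahead={ahead} > 0.5"
    if dmin < 6 or close_rate > 0.3:
        return "close_pass", f"dmin={dmin} < 6 or close_rate={close_rate} > 0.3"
    return "general_vehicle_interaction", "no earlier rule matched"

# Map each cluster id to the rule that named it
trace = {cid: fired_rule(*rest) for cid, *rest in CLUSTERS}
for cid, (base, why) in trace.items():
    print(f"[{cid}] {base}: {why}")
```

Because the rules are checked in order, a pedestrian- or cyclist-dominated cluster is always named `vru_interaction_*`, even when most of its interactions are oncoming. Cluster 1 also shows why `ahead=1.0` alone is not enough for `following`: its `aligned` share of 0.54 falls below the 0.6 cutoff.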


### Scenario book (human-readable output)  
Finally, we export each cluster into a scenario book.

For every cluster, we log its name, risk tier, a representative example, and the top-K closest interactions (by min distance).  

The result is a Markdown file where engineers can quickly browse scenarios and clusters, spotting risky or unusual cases without parsing JSON.
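
As a minimal sketch of the top-K step, the snippet below ranks a few made-up events by `min_dist_m`. The `toy_events` list and its scene ids are hypothetical, and `heapq.nsmallest` is used here as an equivalent alternative to sorting the full list and slicing.

```python
import heapq

# Hypothetical toy events; only `min_dist_m` matters for the ranking.
toy_events = [
    {"scene_id": "a", "min_dist_m": 6.2},
    {"scene_id": "b", "min_dist_m": 4.0},
    {"scene_id": "c"},                      # missing distance ranks last
    {"scene_id": "d", "min_dist_m": 5.5},
    {"scene_id": "e", "min_dist_m": 12.7},
]

k = 3
# Events without a distance are treated as infinitely far away.
closest = heapq.nsmallest(
    k, toy_events, key=lambda e: e.get("min_dist_m", float("inf"))
)
closest_ids = [e["scene_id"] for e in closest]
print(closest_ids)  # → ['b', 'd', 'a']
```

Defaulting a missing distance to infinity keeps incomplete events out of the top-K unless a cluster has fewer than K complete ones.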

In [20]:
K = 3
OUT_BOOK = "/content/scenario_book.md"

def short_line(e):
    return (f"- scene={e['scene_id']} | other={e['other_type']} | "
            f"min_dist={e['min_dist_m']:.1f}m | "
            f"ahead={e['ahead_frac']:.2f} oncoming={e['oncoming_frac']:.2f} aligned={e['aligned_frac']:.2f}")

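# Load the named taxonomy first so the input file is closed before writing
with open(OUT_TAX_NAMED) as f:
    tax_rows = [json.loads(x) for x in f]

with open(OUT_BOOK, "w") as md:
    for row in tax_rows:
        cid, name, n = row["cluster_id"], row["name"], row["size"]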
        idxs = row["member_indices"]
        members = [events[i] for i in idxs]
        members_sorted = sorted(members, key=lambda e: e.get("min_dist_m", 1e9))[:K]

        md.write(f"## Cluster {cid}: {name} (n={n})\n")
        md.write(f"_Risk: {row['risk_level']}_  |  rep: {row['representative_text']}\n\n")

        for j, e in enumerate(members_sorted, 1):
            summary = e.get("summary") or short_line(e)
            md.write(f"{j}. {summary}\n\n")   # numbered and double newline

print("Wrote:", OUT_BOOK)

# Show first ~30 lines inline
with open(OUT_BOOK) as f:
    for i, line in zip(range(30), f):
        print(line, end="")

Wrote: /content/scenario_book.md
## Cluster 0: close_pass_vehicle (n=12)
_Risk: high_  |  rep: The self vehicle and another vehicle maintained a safe distance, with the other vehicle trailing behind at a slightly slower speed. [label=none]

1. vehicle and vehicle pass within 4.0 m at step 6.

2. No notable event between vehicle and vehicle (min distance 5.5 m).

3. No notable event between vehicle and vehicle (min distance 6.2 m).

## Cluster 1: general_vehicle_interaction_vehicle (n=15)
_Risk: medium_  |  rep: The self vehicle followed another vehicle at a safe distance and headway. [label=none]

1. vehicle follows a vehicle with headway ≈ 0.4 s (gap ≈ 0.9 m).

2. No notable event between vehicle and vehicle (min distance 7.8 m).

3. vehicle and vehicle move on crossing paths; min distance ≈ 9.4 m.

## Cluster 2: general_vehicle_interaction_vehicle (n=13)
_Risk: medium_  |  rep: The self vehicle was stationary while another vehicle passed at a minimum distance of 26.17 meters. [label=

# Snapshot of the Scenario Book

Clusters 0, 3, and 4 are clear-cut groupings: vehicles passing too close (cluster 0), a VRU interaction (cluster 3), and an oncoming vehicle (cluster 4).

Clusters 1, 2, and 5 are all generic vehicle interactions but differ in risk: clusters 1 and 2 are medium risk, while cluster 5 is low risk. The summary stats explain the split. The closest approach in cluster 5 is 17.3 m, above the rubric's 15 m threshold, whereas the closest approaches in clusters 1 and 2 are 7.59 m and 7.4 m.
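
The difference in risk tiers can be reproduced by applying the risk rubric to each cluster's closest approach. This is a minimal sketch: `risk_tier` is a hypothetical standalone copy of the rubric inside `name_and_risk`, and the rows are transcribed from the printed cluster stats.

```python
# (cluster_id, dominant_other, base_name, min_dist_min in metres)
ROWS = [
    (1, "vehicle", "general_vehicle_interaction", 7.59),
    (2, "vehicle", "general_vehicle_interaction", 7.40),
    (5, "vehicle", "general_vehicle_interaction", 17.30),
]

def risk_tier(dom, base, dmin):
    """Hypothetical copy of the rubric: 12 m for VRUs, 6 m and 15 m otherwise."""
    if (dom in ("pedestrian", "cyclist") and dmin < 12) or dmin < 6:
        return "high"
    if dmin < 15 or base in ("following", "oncoming_interaction"):
        return "medium"
    return "low"

tiers = {cid: risk_tier(dom, base, dmin) for cid, dom, base, dmin in ROWS}
print(tiers)  # → {1: 'medium', 2: 'medium', 5: 'low'}
```

All three clusters share the same base name, so the 15 m threshold on the closest approach is the only thing separating them.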

In [21]:
from IPython.display import Markdown, display

def printmd(string):
    display(Markdown(string))

with open(OUT_BOOK, 'r') as f:  # same path the book was written to
    markdown_content = f.read()

# Display the Markdown content
printmd(markdown_content)

## Cluster 0: close_pass_vehicle (n=12)
_Risk: high_  |  rep: The self vehicle and another vehicle maintained a safe distance, with the other vehicle trailing behind at a slightly slower speed. [label=none]

1. vehicle and vehicle pass within 4.0 m at step 6.

2. No notable event between vehicle and vehicle (min distance 5.5 m).

3. No notable event between vehicle and vehicle (min distance 6.2 m).

## Cluster 1: general_vehicle_interaction_vehicle (n=15)
_Risk: medium_  |  rep: The self vehicle followed another vehicle at a safe distance and headway. [label=none]

1. vehicle follows a vehicle with headway ≈ 0.4 s (gap ≈ 0.9 m).

2. No notable event between vehicle and vehicle (min distance 7.8 m).

3. vehicle and vehicle move on crossing paths; min distance ≈ 9.4 m.

## Cluster 2: general_vehicle_interaction_vehicle (n=13)
_Risk: medium_  |  rep: The self vehicle was stationary while another vehicle passed at a minimum distance of 26.17 meters. [label=none]

1. No notable event between vehicle and vehicle (min distance 7.4 m).

2. No notable event between vehicle and vehicle (min distance 8.1 m).

3. No notable event between vehicle and vehicle (min distance 11.1 m).

## Cluster 3: vru_interaction_cyclist (n=7)
_Risk: high_  |  rep: A vehicle maintained a safe distance from a pedestrian ahead, with a minimum distance of 14.11 meters and a headway of 9.42 seconds. [label=ped_or_cyclist_nearby]

1. vehicle near a cyclist (min distance ≈ 9.1 m).

2. No notable event between vehicle and pedestrian (min distance 14.1 m).

3. No notable event between vehicle and cyclist (min distance 19.5 m).

## Cluster 4: oncoming_interaction_vehicle (n=7)
_Risk: high_  |  rep: The self vehicle had an oncoming interaction with another vehicle, maintaining a minimum distance of 54.23 meters and a time to collision of 3.33 seconds. [label=oncoming_interaction]

1. vehicle and pedestrian pass within 3.0 m at step 0.

2. vehicle and vehicle move on crossing paths; min distance ≈ 16.0 m.

3. vehicle and pedestrian move on crossing paths; min distance ≈ 18.4 m.

## Cluster 5: general_vehicle_interaction_vehicle (n=6)
_Risk: low_  |  rep: The two vehicles maintained a safe distance from each other, with the ego vehicle moving faster than the other. [label=none]

1. No notable event between vehicle and vehicle (min distance 17.3 m).

2. No notable event between vehicle and vehicle (min distance 18.2 m).

3. No notable event between vehicle and vehicle (min distance 21.9 m).



### Wrap-up  
The pipeline in this notebook can be summarized as follows:

```Waymo Motion data → distilled features → GenAI summaries → clustering into a taxonomy → a browsable scenario book.  ```

The **result** is a repeatable way to surface risky or unusual driving situations and edge cases and group them into testable buckets, illustrating how GenAI can support Systems/Test Engineering at scale and reduce dependency on manual event labeling and summarization.
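
As one illustration of turning the taxonomy into testable buckets, the sketch below groups clusters by risk tier. It assumes the `taxonomy_named.jsonl` schema shown earlier and, so that it runs on its own, first writes a trimmed copy of the six preview rows to a temporary file; the `buckets` structure is a hypothetical choice, not part of the pipeline.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Rows mirroring the preview printed after the naming step (fields trimmed).
rows = [
    {"cluster_id": 0, "name": "close_pass_vehicle", "risk_level": "high", "size": 12},
    {"cluster_id": 1, "name": "general_vehicle_interaction_vehicle", "risk_level": "medium", "size": 15},
    {"cluster_id": 2, "name": "general_vehicle_interaction_vehicle", "risk_level": "medium", "size": 13},
    {"cluster_id": 3, "name": "vru_interaction_cyclist", "risk_level": "high", "size": 7},
    {"cluster_id": 4, "name": "oncoming_interaction_vehicle", "risk_level": "high", "size": 7},
    {"cluster_id": 5, "name": "general_vehicle_interaction_vehicle", "risk_level": "low", "size": 6},
]

# Write a temporary JSONL file so the sketch does not depend on notebook state.
path = Path(tempfile.mkdtemp()) / "taxonomy_named.jsonl"
path.write_text("".join(json.dumps(r) + "\n" for r in rows))

# Group clusters by risk tier.
buckets = defaultdict(list)
with path.open() as f:
    for line in f:
        row = json.loads(line)
        buckets[row["risk_level"]].append((row["cluster_id"], row["name"], row["size"]))

for tier in ("high", "medium", "low"):
    n_events = sum(size for _, _, size in buckets[tier])
    print(f"{tier}: clusters={len(buckets[tier])}, events={n_events}")
```

On the six clusters in this notebook, the high tier holds 3 clusters covering 26 events, the medium tier 2 clusters covering 28, and the low tier 1 cluster covering 6.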