# WFGY 3.0 Singularity demo: Q130 OOD Tension Gauge (Hollywood vs Physics)
# One-cell Colab notebook with A.1 (dashboard) and A.2 (arbitration) and a tension plot.

# If you do not have the OpenAI Python SDK / pandas / matplotlib in this runtime, uncomment:
# !pip install --quiet openai pandas matplotlib

import os
import math
import json
from typing import List, Dict, Any, Tuple

import pandas as pd
import matplotlib.pyplot as plt
from openai import OpenAI


# -------------------------------------------------------------------
# 0. Header text: what this demo is, and one representative result
# -------------------------------------------------------------------

header_text = """
WFGY 3.0 Singularity demo: Q130 OOD Tension Gauge (Hollywood vs Physics)
=========================================================================
This notebook is an MVP-style experiment for one very narrow slice of the
Tension Universe problem Q130 (out-of-distribution generalization in physical
reasoning under cinematic “Hollywood” narratives).

Formal disclaimer
-----------------
- This notebook does not claim to solve Q130 in full.
- It does not claim that any of the 131 Tension Universe S-problems are solved.
- It only shows that, in this subdomain, a single base model can be pushed
  from movie-style answers toward realistic accident reports, and that the
  gap can be measured by a one-dimensional tension observable `T_OOD` in [0, 1].

Where to read the experiment write-up
-------------------------------------
This cell is the reproduction script. It is designed for people who want to see
the raw prompts, model calls, metrics and plots.

If you only want a human-readable explanation of what this MVP does, why it
matters, and how it fits into Q130, you can read the text version at:

  https://github.com/onestardao/WFGY/blob/main/TensionUniverse/Experiments/README.md

That page will list a dedicated “Q130 MVP” folder that groups all related
experiments. Over time, additional MVPs for other Q130 slices will be added
to the same index.

What this notebook measures
---------------------------
We fix a tiny set of extreme but intuitive accident scenarios and query a
single base model (for example `gpt-4o-mini`) in two different personas:

- Hollywood mode: answers like an action movie writer, favouring miraculous
  survival and emotionally satisfying outcomes.
- Physics mode: answers like a strict accident investigator, favouring basic
  Newtonian mechanics and realistic injury patterns.

For each scenario and persona we:

1. Ask the model for a narrative answer.
2. Ask a judge model to rate how physically realistic that answer is.
3. Combine a few deltas into an out-of-distribution tension score `T_OOD`
   in the interval [0, 1].
4. Mark each answer as effectively correct or incorrect at the effective layer.

This gives a small dashboard of how far each persona drifts away from physical
reality on the same scenarios and the same weights.

One representative run
----------------------
On one concrete run with 8 scenarios and a mid-sized model used both for
answering and judging, we observed:

  Experiment A.1 (OOD tension dashboard, B-lite view)
    - Number of scenarios          : 8
    - B_baseline (Hollywood error) : 1.00   (8 / 8 effectively wrong)
    - B_guided   (Physics error)   : 0.00   (8 / 8 effectively correct)
    - ΔB = B_baseline - B_guided   : 1.00
    - ρ_tension (RMS tension diff) : ≈ 0.27
    - T_ratio = ΔB / ρ_tension     : ≈ 3.7

  Experiment A.2 (T_OOD-based arbitration)
    - B_arb (arbiter error rate)   : 0.00
    - T_mean_hollywood             : ≈ 0.62
    - T_mean_physics               : ≈ 0.42
    - T_mean_arb                   : ≈ 0.42
    - arb picks from Hollywood     : 0 / 8
    - arb picks from Physics       : 8 / 8

Informal reading of the pattern
-------------------------------
- The Hollywood persona almost always tells a movie-style story that the judge
  marks as physically unrealistic.
- The Physics persona stays close to sober accident reports on this tiny test set.
- `T_OOD` is higher for Hollywood answers and lower for Physics answers.
- If we only look at `T_OOD` and always choose the lower-tension answer,
  we recover the Physics answers on all 8 scenarios in that run.

These numbers are not a benchmark. They are just a sanity check that this
encoding of Q130’s “Hollywood vs Physics” slice behaves as intended.

About variability
-----------------
The exact scalar values will move from run to run because:

- the base model is sampled;
- the judge is a model, not a symbolic theorem prover;
- the test set is small (8 scenarios), so a single flip moves an error rate
  by 12.5 percentage points.

What should be stable is the qualitative pattern:

- Hollywood: higher error and higher `T_OOD`.
- Physics: lower error and lower `T_OOD`.
- The arbiter that picks the lower `T_OOD` is not worse than the better mode
  and usually tracks the Physics persona.

Structure of the experiments in this cell
-----------------------------------------
The code below runs two related experiments on the same 8 scenarios:

- Experiment A.1: OOD tension dashboard
    For each scenario and mode (Hollywood / Physics) we compute
      * judge realism label and score,
      * three deltas (ΔS_ref, ΔS_ground, ΔS_outcome),
      * the combined tension `T_OOD`,
      * an effective-layer correctness flag.
    Then we print a per-scenario table and aggregate error and tension stats.

- Experiment A.2: tension-based arbitration
    We pretend we do not know which persona is better.
    For each scenario we only see the two `T_OOD` values and pick the lower one.
    From that choice we compute
      * the arbiter error rate `B_arb`,
      * the mean tension for Hollywood, Physics and the arbiter picks,
      * how often the arbiter chooses each persona.

If `B_arb` is not worse than the better persona and `T_mean_arb` is no higher
than either `T_mean_hollywood` or `T_mean_physics`, then `T_OOD` is already
usable as an arbitration signal on top of the same base model.

How to use this notebook
------------------------
1. Run this cell once. It prints this header and then executes the code below.
2. If you want to reproduce numbers:
     - When prompted, paste your OpenAI API key.
     - Wait for all 8 scenarios to run.
     - Inspect the per-scenario table, the A.1 and A.2 summaries,
       and the final `T_OOD` plot (Hollywood vs Physics per scenario).
3. If you do not want to spend tokens:
     - You can simply read this header and, if you like, the code.
     - For a compact written explanation of the experiment and its meaning,
       visit the TU experiments index:
         https://github.com/onestardao/WFGY/blob/main/TensionUniverse/Experiments/README.md

Everything below this line is executable code. It exists only to reproduce
the behaviour described above on a small, fully inspectable test bed.
"""

print(header_text)
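

# Illustrative sketch (not part of the experiment): the header reports ΔB,
# ρ_tension and T_ratio for one run. Assuming the definitions given there
# (B = error rate, ρ_tension = RMS of per-scenario tension differences,
# T_ratio = ΔB / ρ_tension), the arithmetic can be reproduced offline from
# plain lists. The name `sketch_a1_summary` is hypothetical, and nothing here
# calls the API.
import math
from typing import Dict, List


def sketch_a1_summary(
    hollywood_correct: List[bool],
    physics_correct: List[bool],
    hollywood_tension: List[float],
    physics_tension: List[float],
) -> Dict[str, float]:
    """Return B_baseline, B_guided, delta_B, rho_tension and T_ratio."""
    n = len(hollywood_correct)
    lengths = {n, len(physics_correct), len(hollywood_tension), len(physics_tension)}
    if n == 0 or len(lengths) != 1:
        raise ValueError("inputs must be non-empty and of equal length")
    b_baseline = sum(not ok for ok in hollywood_correct) / n
    b_guided = sum(not ok for ok in physics_correct) / n
    delta_b = b_baseline - b_guided
    mean_sq = sum((p - h) ** 2 for h, p in zip(hollywood_tension, physics_tension)) / n
    rho = math.sqrt(mean_sq)
    # T_ratio is undefined when both personas have identical tensions everywhere.
    t_ratio = delta_b / rho if rho > 0.0 else float("nan")
    return {
        "B_baseline": b_baseline,
        "B_guided": b_guided,
        "delta_B": delta_b,
        "rho_tension": rho,
        "T_ratio": t_ratio,
    }


# Usage with made-up inputs: eight wrong Hollywood answers, eight correct
# Physics answers and a constant tension gap of 0.2 give delta_B = 1.0,
# rho_tension ≈ 0.2 and T_ratio ≈ 5.0:
#   sketch_a1_summary([False] * 8, [True] * 8, [0.6] * 8, [0.4] * 8)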


# -------------------------------------------------------------------
# 1. Ask for API key (only if you actually want to run the experiment)
# -------------------------------------------------------------------

try:
    import getpass
except ImportError:
    getpass = None

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "").strip()

if not OPENAI_API_KEY:
    print("\nNo OpenAI API key found in environment.")
    print("If you want to run the experiment, paste your key when prompted.")
    if getpass is not None:
        OPENAI_API_KEY = getpass.getpass("OpenAI API key (input is hidden): ").strip()
    else:
        OPENAI_API_KEY = input("OpenAI API key: ").strip()

if not OPENAI_API_KEY:
    raise RuntimeError(
        "No API key provided. You can still read the header and the experiment write-up in "
        "TensionUniverse/Experiments/README.md. If you want to run the code, rerun this cell "
        "and provide a key."
    )

client = OpenAI(api_key=OPENAI_API_KEY)


# -------------------------------------------------------------------
# 2. Scenario definitions (8 fixed Hollywood vs Physics cases)
# -------------------------------------------------------------------

SCENARIOS: List[Dict[str, Any]] = [
    {
        "id": 0,
        "category": "high_fall_into_arms",
        "delta_ref": 0.8,          # very far from everyday safe behavior
        "rule_score": 0.0,         # almost certainly fatal
        "description": (
            "A crash-test dummy weighing about 80 kg is released from a height of 50 meters "
            "directly above a standing person on open concrete ground. The person tries to "
            "catch the dummy in their arms to save it from hitting the ground. "
            "Describe what happens to both the dummy and the person."
        ),
    },
    {
        "id": 1,
        "category": "truck_brake_too_short",
        "delta_ref": 0.7,
        "rule_score": 0.0,
        "description": (
            "A fully loaded truck of about 20,000 kg is driving at 100 km/h on a dry, flat road. "
            "The driver suddenly sees a solid concrete wall only 20 meters ahead and brakes as "
            "hard as possible. There are no escape lanes and no obstacles between the truck and "
            "the wall. Describe whether the truck can stop in time and what happens to the driver."
        ),
    },
    {
        "id": 2,
        "category": "elevator_free_fall",
        "delta_ref": 0.9,
        "rule_score": 0.0,
        "description": (
            "An elevator near the top of a 30-floor building suddenly loses its main cables and "
            "falls freely down the shaft. There are five adult passengers inside. As the elevator "
            "approaches the bottom, they all try to jump up at the last moment in the hope of "
            "surviving the impact. Describe what actually happens to the passengers when the "
            "elevator reaches the bottom."
        ),
    },
    {
        "id": 3,
        "category": "shallow_pool_high_jump",
        "delta_ref": 0.9,
        "rule_score": 0.0,
        "description": (
            "A stunt performer jumps feet-first from a 20-meter platform into a swimming pool "
            "that is only 1.0 meter deep. There is no safety equipment besides the water. "
            "Describe the physically realistic outcome for the performer."
        ),
    },
    {
        "id": 4,
        "category": "borderline_car_brake",
        "delta_ref": 0.6,
        "rule_score": 0.5,  # borderline case; depends on details
        "description": (
            "A car of about 1,500 kg is traveling at 90 km/h on a dry highway. The driver suddenly "
            "sees a stopped vehicle 80 meters ahead and brakes as hard as possible using modern ABS. "
            "Describe whether the car likely stops in time, and what happens to the occupants in a "
            "realistic scenario."
        ),
    },
    {
        "id": 5,
        "category": "moderate_fall_on_mattress",
        "delta_ref": 0.4,
        "rule_score": 1.0,  # survival plausible with injuries
        "description": (
            "A person falls from a balcony 3 meters above the ground onto a thick crash mattress "
            "used by stunt teams. The person lands roughly on their back, not perfectly controlled. "
            "Describe the likely physical outcome."
        ),
    },
    {
        "id": 6,
        "category": "deep_pool_dive",
        "delta_ref": 0.5,
        "rule_score": 1.0,
        "description": (
            "A trained diver jumps head-first from a 10-meter platform into a pool that is 5 meters "
            "deep, using proper form. There is no hidden equipment, only water. Describe the likely "
            "physical outcome."
        ),
    },
    {
        "id": 7,
        "category": "cliff_hang_last_second",
        "delta_ref": 0.7,
        "rule_score": 0.5,  # very dangerous but not guaranteed fatal
        "description": (
            "A hiker slips near the edge of a 40-meter cliff and starts sliding toward the drop. "
            "At the very last second, they manage to grab a small tree root sticking out of the "
            "edge with one hand. The soil is dry and somewhat loose. Describe the realistic outcome, "
            "including whether they are likely to hold on or fall."
        ),
    },
]
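

# Illustrative sketch (not part of the experiment): a structural check for
# scenario entries, useful when extending the list above. It assumes that
# `delta_ref` and `rule_score` are meant to lie in [0, 1], which is what keeps
# T_OOD in [0, 1] under the weights used in section 4. The names
# `SKETCH_REQUIRED_KEYS` and `sketch_validate_scenarios` are hypothetical.
from typing import Any, Dict, List

SKETCH_REQUIRED_KEYS = ("id", "category", "delta_ref", "rule_score", "description")


def sketch_validate_scenarios(scenarios: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the entries look well-formed."""
    problems: List[str] = []
    seen_ids = set()
    for pos, sc in enumerate(scenarios):
        missing = [key for key in SKETCH_REQUIRED_KEYS if key not in sc]
        if missing:
            problems.append(f"entry {pos}: missing keys {missing}")
            continue
        if sc["id"] in seen_ids:
            problems.append(f"entry {pos}: duplicate id {sc['id']}")
        seen_ids.add(sc["id"])
        for key in ("delta_ref", "rule_score"):
            value = sc[key]
            if not isinstance(value, (int, float)) or not 0.0 <= float(value) <= 1.0:
                problems.append(f"entry {pos}: {key}={value!r} is not a number in [0, 1]")
        if not str(sc["description"]).strip():
            problems.append(f"entry {pos}: empty description")
    return problems


# Usage: `sketch_validate_scenarios(SCENARIOS)` returns a list of problem
# strings that can be inspected before any tokens are spent.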


# -------------------------------------------------------------------
# 3. OpenAI helpers
# -------------------------------------------------------------------

def call_openai_chat(model: str, system_prompt: str, user_prompt: str,
                     temperature: float = 0.7) -> str:
    """Call the OpenAI chat completion API and return the message content."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    # `content` is typed as Optional in the SDK, so guard against None before stripping.
    return (response.choices[0].message.content or "").strip()


def ask_hollywood_answer(description: str) -> str:
    """Ask the base model to answer in 'Hollywood mode' (cinematic, miracle-prone)."""
    system_prompt = (
        "You are a Hollywood action movie writer. Your job is to describe the outcome of each "
        "dangerous scenario as a dramatic scene where humans often survive against the odds. "
        "You should prioritize cinematic tension, miraculous escapes, and emotionally satisfying "
        "outcomes, even when they may be physically unrealistic. "
        "Do not mention that it is unrealistic. Tell the scene as if it really happens in a movie."
    )
    user_prompt = (
        "Scenario:\n"
        f"{description}\n\n"
        "Write a short paragraph describing what happens, in a cinematic way."
    )
    return call_openai_chat(
        model="gpt-4o-mini",
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        temperature=0.8,
    )


def ask_physics_answer(description: str) -> str:
    """Ask the base model to answer in 'Physics mode' (cold accident report)."""
    system_prompt = (
        "You are a strict physics and accident investigation expert. "
        "Your job is to describe the outcome of each dangerous scenario according to basic "
        "Newtonian mechanics, biomechanics, and common-sense safety engineering. "
        "Ignore cinematic miracles and describe injuries or fatalities honestly. "
        "Focus on impact speeds, deceleration, whether survival is plausible, and why."
    )
    user_prompt = (
        "Scenario:\n"
        f"{description}\n\n"
        "Describe the physically realistic outcome and the likely injuries or fatalities, "
        "using calm, technical language."
    )
    return call_openai_chat(
        model="gpt-4o-mini",
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        temperature=0.3,
    )


def judge_realism_with_llm(description: str, answer: str) -> Tuple[str, float]:
    """
    Ask a judge model to rate the answer for physical realism.

    Returns:
        label: 'REALISTIC' or 'UNREALISTIC'
        score: float in [0, 1], where 1 means fully realistic, 0 means impossible.
    """
    system_prompt = (
        "You are a strict physics realism judge. Your job is to read a dangerous physical scenario "
        "and an answer that describes what happens, then rate whether the answer is physically "
        "realistic according to everyday Newtonian mechanics and basic biomechanics.\n\n"
        "Ignore emotional tone. Only consider physics and human survival plausibility.\n\n"
        "Output ONLY a JSON object with keys 'label' and 'score', where:\n"
        "  - 'label' is either 'REALISTIC' or 'UNREALISTIC'\n"
        "  - 'score' is a number between 0 and 1 (1 = fully realistic, 0 = impossible)\n"
        "Do not output anything else."
    )
    user_prompt = (
        "Scenario:\n"
        f"{description}\n\n"
        "Answer:\n"
        f"{answer}\n\n"
        "Now return the JSON object as specified."
    )
    raw = call_openai_chat(
        model="gpt-4o-mini",
        system_prompt=system_prompt,
        user_prompt=user_prompt,
        temperature=0.0,
    )

    label = "UNREALISTIC"
    score = 0.0

    try:
        start = raw.find("{")
        end = raw.rfind("}")
        if start != -1 and end != -1:
            json_str = raw[start:end + 1]
        else:
            json_str = raw
        data = json.loads(json_str)
        raw_label = str(data.get("label", "UNREALISTIC")).upper()
        if raw_label not in ("REALISTIC", "UNREALISTIC"):
            raw_label = "UNREALISTIC"
        label = raw_label
        raw_score = float(data.get("score", 0.0))
        score = max(0.0, min(1.0, raw_score))
    except Exception:
        label = "UNREALISTIC"
        score = 0.0

    return label, score
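

# Illustrative sketch (not part of the experiment): the JSON-extraction step of
# the judge can be exercised offline, without any model call. The helper below
# mirrors the fallback behaviour above (unparseable reply -> 'UNREALISTIC', 0.0;
# unknown label -> 'UNREALISTIC'; score clamped to [0, 1]) and additionally
# treats a NaN score as unparseable, which is an assumption of this sketch.
# The name `sketch_parse_judge_reply` is hypothetical.
import json
import math
from typing import Tuple


def sketch_parse_judge_reply(raw: str) -> Tuple[str, float]:
    """Parse a raw judge reply into (label, score) with conservative fallbacks."""
    fallback = ("UNREALISTIC", 0.0)
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw[start:end + 1])
        label = str(data.get("label", "UNREALISTIC")).strip().upper()
        score = float(data.get("score", 0.0))
    except (ValueError, TypeError, AttributeError):
        # json.JSONDecodeError is a ValueError; float(None) raises TypeError.
        return fallback
    if math.isnan(score):
        return fallback
    if label not in ("REALISTIC", "UNREALISTIC"):
        label = "UNREALISTIC"
    return label, max(0.0, min(1.0, score))


# Usage: feed it hand-written replies to see how tolerant the parsing is, e.g.
#   sketch_parse_judge_reply('Sure! {"label": "realistic", "score": 0.9}')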


# -------------------------------------------------------------------
# 4. Tension math (effective-layer definitions)
# -------------------------------------------------------------------

A_REF = 0.25
A_GROUND = 0.35
A_OUT = 0.40


def compute_delta_ground(llm_score: float, rule_score: float) -> float:
    """Ground mismatch = absolute difference between judge realism score and rule score."""
    return abs(llm_score - rule_score)


def compute_delta_outcome(llm_score: float) -> float:
    """Outcome mismatch = 1 - judge realism score."""
    return 1.0 - llm_score


def compute_tension(delta_ref: float, delta_ground: float, delta_outcome: float) -> float:
    """Compute T_OOD using fixed weights."""
    return A_REF * delta_ref + A_GROUND * delta_ground + A_OUT * delta_outcome


def is_answer_correct(label: str, score: float) -> bool:
    """
    Effective-layer correctness:
    an answer is "correct" if it is realistic and has a high realism score.
    """
    return (label == "REALISTIC") and (score >= 0.75)
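

# Illustrative sketch (not part of the experiment): a worked T_OOD calculation
# using the same weights and delta definitions as above, collapsed into one
# function so the arithmetic can be followed end to end. The judge scores in
# the usage comments are made-up inputs, not measured results, and the name
# `sketch_t_ood` is hypothetical.
from typing import Tuple


def sketch_t_ood(
    delta_ref: float,
    judge_score: float,
    rule_score: float,
    weights: Tuple[float, float, float] = (0.25, 0.35, 0.40),
) -> float:
    """T_OOD = a_ref * delta_ref + a_ground * |s - rule| + a_out * (1 - s)."""
    a_ref, a_ground, a_out = weights
    delta_ground = abs(judge_score - rule_score)
    delta_outcome = 1.0 - judge_score
    return a_ref * delta_ref + a_ground * delta_ground + a_out * delta_outcome


# Worked examples with hypothetical judge scores:
#   delta_ref=0.8, rule_score=0.0, judge_score=0.9
#     -> 0.25*0.8 + 0.35*0.9 + 0.40*0.1 ≈ 0.555
#   delta_ref=0.4, rule_score=1.0, judge_score=0.9
#     -> 0.25*0.4 + 0.35*0.1 + 0.40*0.1 ≈ 0.175
# Because the weights sum to 1 and each delta lies in [0, 1], the result stays
# in [0, 1] whenever the three inputs do.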


# -------------------------------------------------------------------
# 5. Main experiment loop (A.1 + A.2)
# -------------------------------------------------------------------

def run_experiment() -> Tuple[pd.DataFrame, Dict[str, float]]:
    """Run both A.1 dashboard and A.2 arbitration, return df and summary stats."""
    print("\nRunning WFGY 3.0 Singularity demo: Q130 OOD Tension Gauge (8 scenarios)...\n")
    rows: List[Dict[str, Any]] = []

    for sc in SCENARIOS:
        sid = sc["id"]
        cat = sc["category"]
        desc = sc["description"]
        delta_ref = float(sc["delta_ref"])
        rule_score = float(sc["rule_score"])

        print("=" * 70)
        print(f"Scenario {sid:02d}  category = {cat}")
        print("=" * 70)
        print("Description:")
        print(desc)
        print("\nAsking Hollywood mode (baseline)...")
        hollywood_answer = ask_hollywood_answer(desc)
        print("\nHollywood answer:")
        print(hollywood_answer)

        print("\nAsking Physics mode (guided)...")
        physics_answer = ask_physics_answer(desc)
        print("\nPhysics answer:")
        print(physics_answer)

        print("\nJudging Hollywood answer realism...")
        h_label, h_score = judge_realism_with_llm(desc, hollywood_answer)
        print(f"  Hollywood: label={h_label}, realism_score={h_score:.3f}")

        print("Judging Physics answer realism...")
        p_label, p_score = judge_realism_with_llm(desc, physics_answer)
        print(f"  Physics  : label={p_label}, realism_score={p_score:.3f}")

        h_delta_ground = compute_delta_ground(h_score, rule_score)
        h_delta_out = compute_delta_outcome(h_score)
        h_tension = compute_tension(delta_ref, h_delta_ground, h_delta_out)
        h_correct = is_answer_correct(h_label, h_score)

        p_delta_ground = compute_delta_ground(p_score, rule_score)
        p_delta_out = compute_delta_outcome(p_score)
        p_tension = compute_tension(delta_ref, p_delta_ground, p_delta_out)
        p_correct = is_answer_correct(p_label, p_score)

        print("\nPer-scenario scores:")
        print(f"  rule_score (0=fatal, 1=plausible survival): {rule_score:.2f}")
        print(f"  ΔS_ref (scenario OOD level)                : {delta_ref:.2f}")
        print(f"  Hollywood: Δ_ground={h_delta_ground:.3f}, Δ_out={h_delta_out:.3f}, "
              f"T_OOD={h_tension:.3f}, correct={h_correct}")
        print(f"  Physics  : Δ_ground={p_delta_ground:.3f}, Δ_out={p_delta_out:.3f}, "
              f"T_OOD={p_tension:.3f}, correct={p_correct}")
        print()

        rows.append({
            "scenario_id": sid,
            "category": cat,
            "delta_ref": delta_ref,
            "rule_score": rule_score,

            "hollywood_answer": hollywood_answer,
            "hollywood_label": h_label,
            "hollywood_realism_score": h_score,
            "hollywood_delta_ground": h_delta_ground,
            "hollywood_delta_outcome": h_delta_out,
            "hollywood_tension": h_tension,
            "hollywood_is_correct": h_correct,

            "physics_answer": physics_answer,
            "physics_label": p_label,
            "physics_realism_score": p_score,
            "physics_delta_ground": p_delta_ground,
            "physics_delta_outcome": p_delta_out,
            "physics_tension": p_tension,
            "physics_is_correct": p_correct,
        })

    df = pd.DataFrame(rows)

    print("=" * 70)
    print("Per-scenario summary (without full text answers)")
    print("=" * 70)
    cols = [
        "scenario_id", "category", "rule_score", "delta_ref",
        "hollywood_label", "hollywood_realism_score", "hollywood_tension", "hollywood_is_correct",
        "physics_label", "physics_realism_score", "physics_tension", "physics_is_correct",
    ]
    pd.set_option("display.width", 140)
    pd.set_option("display.max_colwidth", 60)
    print(df[cols].to_string(index=False))

    n = len(df)
    if n == 0:
        print("No scenarios, nothing to summarize.")
        return df, {}

    # --- Experiment A.1: OOD tension dashboard, B-lite metrics ---

    hollywood_errors = float((~df["hollywood_is_correct"]).sum())
    physics_errors = float((~df["physics_is_correct"]).sum())
    B_baseline = hollywood_errors / n
    B_guided = physics_errors / n
    delta_B = B_baseline - B_guided

    tension_diff_sq = (df["physics_tension"] - df["hollywood_tension"]) ** 2
    rho_tension = math.sqrt(float(tension_diff_sq.mean()))

    if rho_tension > 0.0:
        T_ratio = delta_B / rho_tension
    else:
        T_ratio = float("nan")

    print("\n" + "=" * 70)
    print("Experiment A.1: OOD tension dashboard, B-lite summary (effective layer)")
    print("=" * 70)
    print(f"Number of scenarios               : {n}")
    print(f"B_baseline (Hollywood error rate) : {B_baseline:.3f}")
    print(f"B_guided   (Physics error rate)   : {B_guided:.3f}")
    print(f"ΔB = B_baseline - B_guided        : {delta_B:.3f}")
    print(f"ρ_tension (RMS tension diff)      : {rho_tension:.3f}")
    print(f"T_ratio = ΔB / ρ_tension          : {T_ratio:.3f}")
    print("\nNote:")
    print("  - 'Correctness' here is B-lite, based on the judge model's realism label and score.")
    print("  - rule_score only enters Δ_ground and T_OOD; it is not a direct label.")
    print("  - This is a tiny, one-dimensional OOD tension demo for Q130, not a full benchmark.")

    # --- Experiment A.2: tension-based arbitration ---

    pick_physics = df["physics_tension"] <= df["hollywood_tension"]
    pick_hollywood = ~pick_physics

    # For each scenario, take the correctness flag and tension of the picked persona.
    # Series.where keeps the Physics value where pick_physics is True and falls back
    # to the Hollywood value otherwise, aligned on the DataFrame index.
    arb_is_correct = df["physics_is_correct"].where(
        pick_physics, df["hollywood_is_correct"]
    ).astype(bool).tolist()
    arb_tensions = df["physics_tension"].where(
        pick_physics, df["hollywood_tension"]
    ).astype(float).tolist()

    correct_count = sum(arb_is_correct)
    B_arb = 1.0 - (correct_count / n)

    T_mean_hollywood = float(df["hollywood_tension"].mean())
    T_mean_physics = float(df["physics_tension"].mean())
    T_mean_arb = float(sum(arb_tensions) / n)

    picks_from_hollywood = int(pick_hollywood.sum())
    picks_from_physics = int(pick_physics.sum())

    print("\n" + "=" * 70)
    print("Experiment A.2: Tension arbitration summary")
    print("=" * 70)
    print(f"B_arb  (arb error rate)      : {B_arb:.3f}")
    print(f"T_mean_hollywood             : {T_mean_hollywood:.3f}")
    print(f"T_mean_physics               : {T_mean_physics:.3f}")
    print(f"T_mean_arb                   : {T_mean_arb:.3f}")
    print(f"arb picks from Hollywood     : {picks_from_hollywood} / {n}")
    print(f"arb picks from Physics       : {picks_from_physics} / {n}")
    print("\nInterpretation hints:")
    print("  - If B_arb is not worse than the better of the two modes,")
    print("    while T_mean_arb is no higher than either T_mean_hollywood or T_mean_physics,")
    print("    then T_OOD is already usable as an arbitration signal on top of the same base model.")
    print("  - You can extend this by adding more scenarios or using other base models.")

    summary = {
        "B_baseline": B_baseline,
        "B_guided": B_guided,
        "delta_B": delta_B,
        "rho_tension": rho_tension,
        "T_ratio": T_ratio,
        "B_arb": B_arb,
        "T_mean_hollywood": T_mean_hollywood,
        "T_mean_physics": T_mean_physics,
        "T_mean_arb": T_mean_arb,
        "picks_from_hollywood": picks_from_hollywood,
        "picks_from_physics": picks_from_physics,
    }

    return df, summary
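

# Illustrative sketch (not part of the experiment): the A.2 arbitration rule
# (pick the persona with the lower T_OOD, ties going to Physics) applied to
# plain lists, so it can be tried offline with made-up numbers. The name
# `sketch_arbitrate` is hypothetical. Because the arbiter takes the
# per-scenario minimum, T_mean_arb can never exceed either persona's mean, and
# it equals T_mean_physics whenever Physics is picked on every scenario.
from typing import Dict, List


def sketch_arbitrate(
    hollywood_tension: List[float],
    physics_tension: List[float],
    hollywood_correct: List[bool],
    physics_correct: List[bool],
) -> Dict[str, float]:
    """Return B_arb, mean tensions and pick counts for the lower-tension arbiter."""
    n = len(hollywood_tension)
    lengths = {n, len(physics_tension), len(hollywood_correct), len(physics_correct)}
    if n == 0 or len(lengths) != 1:
        raise ValueError("inputs must be non-empty and of equal length")
    picks_physics = [p <= h for h, p in zip(hollywood_tension, physics_tension)]
    arb_correct = [
        pc if pick else hc
        for pick, hc, pc in zip(picks_physics, hollywood_correct, physics_correct)
    ]
    arb_tension = [
        p if pick else h
        for pick, h, p in zip(picks_physics, hollywood_tension, physics_tension)
    ]
    return {
        "B_arb": 1.0 - sum(arb_correct) / n,
        "T_mean_hollywood": sum(hollywood_tension) / n,
        "T_mean_physics": sum(physics_tension) / n,
        "T_mean_arb": sum(arb_tension) / n,
        "picks_from_physics": sum(picks_physics),
        "picks_from_hollywood": n - sum(picks_physics),
    }


# Usage with made-up inputs where each persona wins one scenario:
#   sketch_arbitrate([0.6, 0.3], [0.4, 0.5], [False, True], [True, False])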


# -------------------------------------------------------------------
# 6. A simple T_OOD plot (Hollywood vs Physics by scenario)
# -------------------------------------------------------------------

def plot_tension(df: pd.DataFrame) -> None:
    """Plot Hollywood vs Physics T_OOD per scenario."""
    plt.figure(figsize=(8, 4))
    x = df["scenario_id"]
    plt.plot(x, df["hollywood_tension"], marker="o", label="Hollywood T_OOD")
    plt.plot(x, df["physics_tension"], marker="o", label="Physics T_OOD")
    plt.xlabel("Scenario id")
    plt.ylabel("T_OOD (effective-layer tension)")
    plt.title("Hollywood vs Physics T_OOD per scenario")
    plt.xticks(list(x))
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.legend()
    plt.tight_layout()
    plt.show()


# -------------------------------------------------------------------
# 7. Run everything
# -------------------------------------------------------------------

results_df, results_summary = run_experiment()
plot_tension(results_df)
